Re: [CLucene-dev] Problem tokenizing dash-prefixed words - GIT 2.3.2

Itamar Syn-Hershko Tue, 08 Dec 2009 15:36:12 -0800

I will have a look soon. Anyway, JFYI, CLucene's implementation of
StandardAnalyzer (mainly StandardTokenizer) differs from the current Java
Lucene one. Porting the current Java implementation shouldn't be too hard
a task, since it's JFlex-generated code. Perhaps if someone could
contribute this, it would help us avoid fighting to fix the current
implementation, which was not designed with reusableTokenStreams in mind...


Itamar.

-----Original Message-----
From: cel tix44 [mailto:[email protected]] 
Sent: Tuesday, December 08, 2009 3:14 AM
To: [email protected]
Subject: [CLucene-dev] Problem tokenizing dash-prefixed words - GIT 2.3.2

Team

I have encountered a problem when adding documents that contain words
starting with the "minus" character. It manifests as follows:

1) add a document containing just one word starting with an alpha character:
   doc->add( *_CLNEW Field(MY_FLD, "one", Field::STORE_NO | Field::INDEX_TOKENIZED ));
   This word gets into the index correctly, as "one".

2) add another document containing just one word starting with the minus
character:
   doc->add( *_CLNEW Field(MY_FLD, "-onetwo", Field::STORE_NO | Field::INDEX_TOKENIZED ));
   Out of this word, only the 2 rightmost characters -- "wo" -- get into
   the index.

To see this happening:
- in StandardTokenizer.cpp, set a breakpoint on line 154:
  tokenStart = rdPos; <<< you'll see that rdPos is 4,
  whereas it should be 0 -- as we're adding the first token

- do "Step Over" until you return to line 143: ch = readChar();
- step into line 143: ch = readChar();
- that'll take you into StandardTokenizer::readChar()
- line 116: rdPos++; <<<- note rdPos becomes 5 here
- step into return rd->GetNext();
- that'll take you into FastCharStream.cpp -> FastCharStream::GetNext()
- line 49: ++pos; <<<- note pos becomes 6 here
- line 51: readChar(ch); <<<- this reads the 6th character, which is "w".

Why this is happening:
- in StandardAnalyzer.cpp, TokenStream* StandardAnalyzer::reusableTokenStream()
  calls streams->tokenStream->reset(reader);
- that invokes StandardTokenizer::reset(Reader* _input)
- upon entry, rd->input is NULL, but rd->pos/col/line have not been reset
  from the previous use.
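
The mechanism described above can be illustrated with a small standalone
model. This is only a sketch: ToyCharStream and ToyTokenizer are hypothetical
names and are not CLucene classes. The model keeps a single position counter,
so the exact offset it skips differs from the real tokenizer, where both rdPos
and rd->pos carry over.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical stand-in for a char stream that tracks its own read position.
struct ToyCharStream {
    const std::string* input = nullptr;
    std::size_t pos = 0;

    // Returns the next character, or '\0' at end of input.
    char getNext() {
        if (input == nullptr || pos >= input->size()) return '\0';
        return (*input)[pos++];
    }
};

// Hypothetical tokenizer that is reused across documents.
struct ToyTokenizer {
    ToyCharStream rd;

    // Buggy reset: swaps the input but leaves rd.pos at its old value.
    void reset(const std::string* newInput) {
        assert(newInput != nullptr);
        rd.input = newInput;
    }

    // Reads every remaining character as a single token.
    std::string readAll() {
        std::string token;
        for (char ch = rd.getNext(); ch != '\0'; ch = rd.getNext()) token += ch;
        return token;
    }
};

// Tokenizes "one", then reuses the same tokenizer for "-onetwo".
std::string tokenizeSecondDocWithStaleState() {
    const std::string first = "one";
    const std::string second = "-onetwo";
    ToyTokenizer t;
    t.reset(&first);
    t.readAll();         // consumes "one"; rd.pos is now 3
    t.reset(&second);    // input changes, rd.pos is still 3
    return t.readAll();  // reading starts mid-word
}
```

In this model the second call returns "etwo" instead of "-onetwo", because
reading resumes at the stale position left over from the first document.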

Stopgap fix:
1) in _FastCharStream.h -> class FastCharStream, add declaration:
                void my_rewind();
2) in FastCharStream.cpp, add implementation:
                void FastCharStream::my_rewind(){
                        pos = 0;
                }
3) in StandardTokenizer.cpp, change StandardTokenizer::reset to read this:
        this->input = _input;
        rdPos = -1; //*add this line: to mimic rdPos upon first entry
        if (rd->input==NULL) {
                rd->input = _input->__asBufferedReader();
                rd->my_rewind(); //*add this line: to reset rd->pos to 0
        }
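
The idea behind the stopgap can be sketched on a standalone model as well.
The names ReusableCharStream and ReusableTokenizer are hypothetical and this
is not the CLucene code; unlike the patch above, the sketch rewinds on every
reset rather than only when the input was NULL, which is an assumption made
to keep the example short.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical char stream with an explicit rewind, mirroring my_rewind().
struct ReusableCharStream {
    const std::string* input = nullptr;
    std::size_t pos = 0;

    // Resets the read position so a new input is read from its start.
    void rewind() { pos = 0; }

    // Returns the next character, or '\0' at end of input.
    char getNext() {
        if (input == nullptr || pos >= input->size()) return '\0';
        return (*input)[pos++];
    }
};

// Hypothetical tokenizer whose reset() clears all positional state.
struct ReusableTokenizer {
    ReusableCharStream rd;

    void reset(const std::string* newInput) {
        assert(newInput != nullptr);
        rd.input = newInput;
        rd.rewind();  // the added step: forget the previous document's position
    }

    // Reads every remaining character as a single token.
    std::string readAll() {
        std::string token;
        for (char ch = rd.getNext(); ch != '\0'; ch = rd.getNext()) token += ch;
        return token;
    }
};

// Tokenizes "one", then reuses the same tokenizer for "-onetwo".
std::string tokenizeSecondDocAfterRewind() {
    const std::string first = "one";
    const std::string second = "-onetwo";
    ReusableTokenizer t;
    t.reset(&first);
    t.readAll();         // consumes "one"
    t.reset(&second);    // position is rewound to 0
    return t.readAll();  // reads the whole second word
}
```

With the rewind in place, the second document is read from its first
character and the model returns "-onetwo" in full.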

I have posted a test-case for Visual Studio 9 2008 to the Tracker, Item ID:
2910395.

Regards
Celto

------------------------------------------------------------------------------
Return on Information:
Google Enterprise Search pays you back
Get the facts.
http://p.sf.net/sfu/google-dev2dev
_______________________________________________
CLucene-developers mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/clucene-developers


