Team

I have encountered a problem when adding documents that contain words
starting with the "minus" character. It manifests as follows:

1) add a document containing just one word starting with an alpha character:
   doc->add( *_CLNEW Field(MY_FLD, "one",
       Field::STORE_NO | Field::INDEX_TOKENIZED ));
..........This word gets into the index correctly, as "one".

2) add another document containing just one word starting with the
minus character:
   doc->add( *_CLNEW Field(MY_FLD, "-onetwo",
       Field::STORE_NO | Field::INDEX_TOKENIZED ));
.........Out of this word, only the 2 rightmost characters -- "wo" -- will
get into the index.

To see this happening:
- in StandardTokenizer.cpp, set a breakpoint on line 154:
  tokenStart = rdPos; <<< you'll see that rdPos is 4,
  whereas it should be 0 -- as we're adding the first token

- do "Step Over" until you return to line 143: ch = readChar();
- step into line 143: ch = readChar();
- that'll take you into StandardTokenizer::readChar()
- line 116: rdPos++; <<<- note rdPos becomes 5 here
- step into return rd->GetNext();
- that'll take you into FastCharStream.cpp -> FastCharStream::GetNext()
- line 49: ++pos; <<<- note pos becomes 6 here
- line 51: readChar(ch); <<<- this reads the 6th character, which is "w".

Why this is happening:
- in StandardAnalyzer.cpp, TokenStream* StandardAnalyzer::reusableTokenStream()
  calls streams->tokenStream->reset(reader);
- that invokes StandardTokenizer::reset(Reader* _input)
- upon entry, rd->input is NULL, but rd->pos/col/line have not been
reset from the previous use, so reading of the new input resumes at
the stale position.

Stopgap fix:
1) in _FastCharStream.h -> class FastCharStream, add declaration:
                void my_rewind();
2) in FastCharStream.cpp, add implementation:
                void FastCharStream::my_rewind(){
                        pos = 0;
                }
3) in StandardTokenizer.cpp, change StandardTokenizer::reset to read as follows:
        this->input = _input;
        rdPos = -1; //*add this line: to mimic rdPos upon first entry
        if (rd->input==NULL) {
                rd->input = _input->__asBufferedReader();
                rd->my_rewind(); //*add this line: to reset rd->pos to 0
        }
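
The cause and the stopgap described above can be illustrated with a small
stand-alone model. This is only a sketch: ToyCharStream, ToyTokenizer,
resetBuggy, resetFixed and secondDocTokens are hypothetical names made up for
the illustration and are not CLucene classes or functions. The model only
reproduces the mechanism (a read position that survives reuse); its exact
offsets differ from the real tokenizer, so it drops the first characters of
"-onetwo" and keeps "etwo" rather than "wo".

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical stand-in for a character stream; NOT CLucene's FastCharStream.
struct ToyCharStream {
    std::string input;   // text currently being read
    std::size_t pos = 0; // read position, kept across reuse of the stream

    // Returns the next character, or '\0' once the input is exhausted.
    char next() {
        return pos < input.size() ? input[pos++] : '\0';
    }
};

// Hypothetical tokenizer that emits runs of alphanumeric characters.
struct ToyTokenizer {
    ToyCharStream rd;

    // Models the reported flaw: new input is installed, rd.pos is left stale.
    void resetBuggy(const std::string& text) { rd.input = text; }

    // Models the stopgap: install new input and rewind the position.
    void resetFixed(const std::string& text) {
        rd.input = text;
        rd.pos = 0;
    }

    // Reads the stream to its end and returns the tokens found.
    std::vector<std::string> tokens() {
        std::vector<std::string> out;
        std::string cur;
        for (char ch = rd.next(); ch != '\0'; ch = rd.next()) {
            if (std::isalnum(static_cast<unsigned char>(ch))) {
                cur += ch;
            } else if (!cur.empty()) {
                out.push_back(cur);
                cur.clear();
            }
        }
        if (!cur.empty()) out.push_back(cur);
        return out;
    }
};

// Tokenizes "one" and then "-onetwo" with the same tokenizer instance and
// returns the tokens of the second document.
std::vector<std::string> secondDocTokens(bool useFix) {
    ToyTokenizer tk;
    tk.resetFixed("one");
    std::vector<std::string> first = tk.tokens();
    assert(first.size() == 1 && first[0] == "one"); // first document reads fine
    if (useFix) {
        tk.resetFixed("-onetwo");
    } else {
        tk.resetBuggy("-onetwo");
    }
    return tk.tokens();
}
```

In this model, secondDocTokens(false) yields the truncated token "etwo",
while secondDocTokens(true) yields "onetwo". The only difference between the
two paths is whether the position is rewound when the stream is reused.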

I have posted a test-case for Visual Studio 9 2008 to the Tracker,
Item ID: 2910395.

Regards
Celto
