Team
I have encountered a problem when adding documents that contain words
starting with the "minus" character. It manifests as follows:
1) add a document containing just one word starting with an alpha character:
doc->add( *_CLNEW Field(MY_FLD, "one", Field::STORE_NO |
Field::INDEX_TOKENIZED ));
..........This word gets into the index correctly, as "one".
2) add another document containing just one word starting with the
minus character:
doc->add( *_CLNEW Field(MY_FLD, "-onetwo", Field::STORE_NO |
Field::INDEX_TOKENIZED ));
.........Out of this word, only the 2 rightmost characters -- "wo" -- will
get into the index.
To see this happening:
- in StandardTokenizer.cpp, set a breakpoint on line 154:
tokenStart = rdPos; <<< you'll see that rdPos is 4,
whereas it should be 0 -- as we're adding the first token
- do "Step Over" until you return to line 143: ch = readChar();
- step into line 143: ch = readChar();
- that'll take you into StandardTokenizer::readChar()
- line 116: rdPos++; <<<- note rdPos becomes 5 here
- step into return rd->GetNext();
- that'll take you into FastCharStream.cpp -> FastCharStream::GetNext()
- line 49: ++pos; <<<- note pos becomes 6 here
- line 51: readChar(ch); <<<- this reads the 6th character, which is "w".
Why this is happening:
- in StandardAnalyzer.cpp, TokenStream* StandardAnalyzer::reusableTokenStream()
calls streams->tokenStream->reset(reader);
- that invokes StandardTokenizer::reset(Reader* _input)
- upon entry, rd->input is NULL, but rd->pos/col/line have not been
reset from previous use.
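
The failure mode described above can be illustrated with a small stand-alone
model. This is not CLucene code: ToyCharStream, ToyTokenizer and readAll are
hypothetical names, and the model assumes the stream indexes its input by a
stored position, which simplifies what FastCharStream really does. It only
shows how a cursor left over from one document can make a reused tokenizer
skip the start of the next one.

```cpp
#include <cstddef>
#include <string>

// Hypothetical stand-in for a character stream that remembers its position.
struct ToyCharStream {
    const std::string* input = nullptr; // current source text, or nullptr
    std::size_t pos = 0;                // read cursor; survives across inputs

    // Returns the next character, or -1 at end of input.
    int getNext() {
        if (input == nullptr || pos >= input->size()) {
            input = nullptr; // model assumption: input is dropped at EOF
            return -1;
        }
        return static_cast<unsigned char>((*input)[pos++]);
    }
};

// Hypothetical stand-in for a tokenizer that is reused across documents.
struct ToyTokenizer {
    ToyCharStream rd;

    // Same shape as the problematic reset: the input is replaced, but the
    // cursor stays where the previous document finished.
    // Note: text must outlive the tokenizer's use of it.
    void reset(const std::string& text) {
        if (rd.input == nullptr) {
            rd.input = &text;
        }
    }

    // Collects every character the stream still yields.
    std::string readAll() {
        std::string out;
        for (int ch = rd.getNext(); ch != -1; ch = rd.getNext()) {
            out.push_back(static_cast<char>(ch));
        }
        return out;
    }
};
```

In this model, resetting onto "one" and then onto "-onetwo" yields "one"
followed by "etwo". The offsets differ from those seen in the debugger, but
the second document is truncated for the reason given above: the cursor was
never reset.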
Stopgap fix:
1) in _FastCharStream.h -> class FastCharStream, add declaration:
void my_rewind();
2) in FastCharStream.cpp, add implementation:
void FastCharStream::my_rewind(){
pos = 0;
}
3) in StandardTokenizer.cpp, change StandardTokenizer::reset to read as follows:
this->input = _input;
rdPos = -1; //*add this line: to mimic rdPos upon first entry
if (rd->input==NULL) {
rd->input = _input->__asBufferedReader();
rd->my_rewind(); //*add this line: to reset rd->pos to 0
}
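
The effect of the stopgap can be sketched with a stand-alone toy model. This
is not CLucene code and is not a substitute for the patch above: ToyCharStream,
ToyTokenizer, rewind and readAll are hypothetical names, and the stream is
assumed to index its input by a stored position. The sketch only shows that
resetting both counters in reset() lets a reused tokenizer see the next
document from its first character.

```cpp
#include <cstddef>
#include <string>

// Hypothetical stand-in for a character stream with a persistent cursor.
struct ToyCharStream {
    const std::string* input = nullptr; // current source text, or nullptr
    std::size_t pos = 0;                // read cursor

    // Counterpart of the proposed my_rewind(): move the cursor to the start.
    void rewind() { pos = 0; }

    // Returns the next character, or -1 at end of input.
    int getNext() {
        if (input == nullptr || pos >= input->size()) {
            input = nullptr; // model assumption: input is dropped at EOF
            return -1;
        }
        return static_cast<unsigned char>((*input)[pos++]);
    }
};

// Hypothetical stand-in for a tokenizer that is reused across documents.
struct ToyTokenizer {
    ToyCharStream rd;
    int rdPos = -1; // tokenizer-side position, -1 before the first read

    // Reset in the style of the stopgap: clear both the tokenizer's own
    // position and the stream's cursor when a new input is attached.
    // Note: text must outlive the tokenizer's use of it.
    void reset(const std::string& text) {
        rdPos = -1;
        if (rd.input == nullptr) {
            rd.input = &text;
            rd.rewind();
        }
    }

    // Advances the tokenizer position and fetches one character.
    int readChar() {
        ++rdPos;
        return rd.getNext();
    }

    // Collects every character the stream still yields.
    std::string readAll() {
        std::string out;
        for (int ch = readChar(); ch != -1; ch = readChar()) {
            out.push_back(static_cast<char>(ch));
        }
        return out;
    }
};
```

With this reset, feeding the model "one" and then "-onetwo" returns both
strings in full, and rdPos is back at -1 after each reset.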
I have posted a test-case for Visual Studio 9 2008 to the Tracker,
Item ID: 2910395.
Regards
Celto
_______________________________________________
CLucene-developers mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/clucene-developers