I will have a look soon. Anyway, JFYI, CLucene's implementation of StandardAnalyzer (mainly StandardTokenizer) differs from the one in current Java Lucene. Porting the current Java implementation shouldn't be too hard a task, since it is JFlex-generated code. If someone could contribute this, it would help us avoid fighting to fix the current implementation, which was not designed with reusableTokenStreams in mind...
Itamar.

-----Original Message-----
From: cel tix44 [mailto:[email protected]]
Sent: Tuesday, December 08, 2009 3:14 AM
To: [email protected]
Subject: [CLucene-dev] Problem tokenizing dash-prefixed words - GIT 2.3.2

Team

I have encountered a problem when adding documents containing words that start with the "minus" character. It manifests as follows:

1) Add a document containing just one word starting with an alpha character:

    doc->add( *_CLNEW Field(MY_FLD, "one", Field::STORE_NO | Field::INDEX_TOKENIZED ));

   This word gets into the index correctly, as "one".

2) Add another document containing just one word starting with the minus character:

    doc->add( *_CLNEW Field(MY_FLD, "-onetwo", Field::STORE_NO | Field::INDEX_TOKENIZED ));

   Out of this word, only the 2 rightmost characters -- "wo" -- get into the index.

To see this happening:
- in StandardTokenizer.cpp, set a breakpoint on line 154: tokenStart = rdPos;   <<< you'll see that rdPos is 4, whereas it should be 0, as we're adding the first token
- do "Step Over" until you return to line 143: ch = readChar();
- step into line 143: ch = readChar();
- that'll take you into StandardTokenizer::readChar()
- line 116: rdPos++;   <<< note rdPos becomes 5 here
- step into return rd->GetNext();
- that'll take you into FastCharStream.cpp -> FastCharStream::GetNext()
- line 49: ++pos;   <<< note pos becomes 6 here
- line 51: readChar(ch);   <<< this reads the 6th character, which is "w"

Why this is happening:
- in StandardAnalyzer.cpp, TokenStream* StandardAnalyzer::reusableTokenStream() calls streams->tokenStream->reset(reader);
- that invokes StandardTokenizer::reset(Reader* _input)
- upon entry, rd->input is NULL, but rd->pos/col/line have not been reset from the previous use, so reading the new input starts at the stale position.
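The failure mode described above (a reused stream that gets a new input but keeps its old read position) can be shown in isolation. What follows is only a minimal sketch, not CLucene code: ToyCharStream, ToyTokenizer and firstTokens are hypothetical stand-ins invented for illustration, and the toy treats every non-alphanumeric character as a separator, which is an assumption rather than StandardTokenizer's real grammar. The rewindOnReset flag switches between leaving the position stale and resetting it when a new input is attached.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical stand-in for FastCharStream: a cursor over a borrowed string.
class ToyCharStream {
public:
    const std::string* input = nullptr;  // nullptr means "no input attached"
    std::size_t pos = 0;                 // index of the next character to read

    bool atEnd() const { return input == nullptr || pos >= input->size(); }
    char getNext() {
        assert(input != nullptr && pos < input->size());
        return (*input)[pos++];
    }
    void rewind() { pos = 0; }           // plays the role of a position reset
};

// Hypothetical stand-in for StandardTokenizer: yields runs of alphanumerics.
class ToyTokenizer {
public:
    explicit ToyTokenizer(bool rewindOnReset) : rewindOnReset_(rewindOnReset) {}

    // Attach a new input to the reused stream. The caller keeps `text` alive.
    void reset(const std::string& text) {
        rd_.input = &text;
        if (rewindOnReset_) rd_.rewind();  // without this, pos stays stale
    }

    // Return the next token, or an empty string once the input is exhausted.
    std::string next() {
        std::string token;
        while (!rd_.atEnd()) {
            char ch = rd_.getNext();
            if (std::isalnum(static_cast<unsigned char>(ch))) {
                token.push_back(ch);
            } else if (!token.empty()) {
                break;  // a separator ends the current token
            }
        }
        return token;
    }

    // Detach the input after a document; the position is deliberately kept.
    void close() { rd_.input = nullptr; }

private:
    ToyCharStream rd_;
    bool rewindOnReset_;
};

// Run every document through ONE reused tokenizer and collect the first
// token produced for each document.
std::vector<std::string> firstTokens(const std::vector<std::string>& docs,
                                     bool rewindOnReset) {
    ToyTokenizer tokenizer(rewindOnReset);
    std::vector<std::string> out;
    for (const std::string& doc : docs) {
        tokenizer.reset(doc);
        std::string first = tokenizer.next();
        while (!tokenizer.next().empty()) {
            // drain the rest of the document
        }
        out.push_back(first);
        tokenizer.close();
    }
    return out;
}
```

Calling firstTokens with the documents "one" and "-onetwo" and rewindOnReset set to false yields "one" and "etwo": the second document is read from the position where the first one ended. With rewindOnReset set to true the second token is "onetwo". The exact offset in this toy differs from the "wo" seen in CLucene, since the real tokenizer advances its position differently, but the mechanism is the same.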
Stopgap fix:

1) In _FastCharStream.h -> class FastCharStream, add the declaration:

    void my_rewind();

2) In FastCharStream.cpp, add the implementation:

    void FastCharStream::my_rewind(){
        pos = 0;
    }

3) In StandardTokenizer.cpp, change StandardTokenizer::reset to read as follows:

    this->input = _input;
    rdPos = -1;               //*add this line: to mimic rdPos upon first entry
    if (rd->input==NULL) {
        rd->input = _input->__asBufferedReader();
        rd->my_rewind();      //*add this line: to reset rd->pos to 0
    }

I have posted a test-case for Visual Studio 9 2008 to the Tracker, Item ID: 2910395.

Regards
Celto

_______________________________________________
CLucene-developers mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/clucene-developers
