Just a thought - are the files you're indexing larger than 10,000 terms
(the default maxFieldLength)? If so, perhaps either your code or Lucene
2.3.* changed something in how maxFieldLength is applied...
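To rule that out, a minimal sketch of raising the limit (assuming the 2.3.x IndexWriter API; the index path is a placeholder):

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;

public class RaiseFieldLimit {
    public static void main(String[] args) throws Exception {
        // IndexWriter silently stops indexing terms in a field once
        // maxFieldLength is reached (default: 10,000 terms).
        IndexWriter writer = new IndexWriter("/tmp/test-index",
                new StandardAnalyzer(), true);
        // Raise the cap so very large documents are indexed in full.
        writer.setMaxFieldLength(Integer.MAX_VALUE);
        // ... add documents here ...
        writer.close();
    }
}
```

If the "missing" tokens come back after this change, the documents are simply hitting the truncation limit rather than being mis-tokenized.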

Itamar.

-----Original Message-----
From: Dan Rugg [mailto:[EMAIL PROTECTED] 
Sent: Friday, May 16, 2008 6:37 PM
To: java-user@lucene.apache.org
Subject: Version 2.3 Does Not Index/Digest All Document Tokens

After upgrading to version 2.3.x from 2.2.0, we started experiencing issues
with our index searches.  Some searches produced false positives, while
others produced no hits for terms known to be in specific documents that
were digested.  After setting up tests that created indexes containing
single documents, we found that version 2.3.x did not add all the tokens
from a document to the index, while 2.2.0 did.  The only thing that changed
between the tests was the Lucene jar being used, and a fresh index was
created for each test.


It seems to be some random action that 2.3.x is taking, or not taking.
While a token such as 'traffic' will not be digested in one document, it
will be in another.  Token frequency, order, and relative position seem not
to matter, as indexed and non-indexed tokens were scattered across the board.
The documents being ingested were XML, and the tokenizer for the documents
was the same for 2.2.0 and 2.3.x.  We even did a token dump of the
documents and verified the documents were being tokenized correctly.
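For anyone trying to reproduce this, a token dump along these lines is one way to compare the analyzer's output against what lands in the index (a sketch using the pre-2.4 TokenStream.next() API; the field name, analyzer, and sample text are placeholders, not our actual setup):

```java
import java.io.StringReader;
import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardAnalyzer;

public class TokenDump {
    public static void main(String[] args) throws Exception {
        // Print every token the analyzer emits for a field value.
        // Any term present here but absent from the index points at
        // indexing (not tokenization) as the culprit.
        TokenStream stream = new StandardAnalyzer()
                .tokenStream("body", new StringReader("sample traffic report"));
        Token token;
        while ((token = stream.next()) != null) {
            System.out.println(token.termText() + " @" + token.startOffset());
        }
        stream.close();
    }
}
```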


I did notice that rebuilding the index was quicker with 2.3.x and the index
was smaller, but I guess if you aren't adding tokens to the index it is
bound to be smaller.  BTW, we tested versions 2.3.1, 2.3.2, and 2.2.0.  We
are now back to using 2.2.0.


Daniel Rugg




---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
