: I'm using an analyzer with WordDelimiterFilter before a LengthFilter; when I
: ran org.apache.lucene.index.CheckIndex on my Solr index, I got some errors.
: CheckIndex complains that some docs have terms incorrectly positioned at -1.
: Note that Solr doesn't complain in any way, but Luke's "Reconstruct&Edit" and
: CheckIndex complain about a "damaged" index.

I can reproduce the behavior you observed, but I'm not sure whether the 
problem lies entirely in LengthFilter, the Lucene indexing code, or 
CheckIndex.

First question: other than Luke's Reconstruct&Edit feature and CheckIndex, 
are you seeing any other evidence of a problem?  Solr not working?  These 
documents not showing up in your results?

The root issue seems to be that when LengthFilter throws away tokens 
because they are too short (or too long), it doesn't account for the 
positionIncrement of the tokens it removed -- I'm fairly certain it 
should; I can't think of any legitimate reason for it not to.  So if the 
first token LengthFilter lets pass says its increment is "0", that throws 
off the internal position counter when the document is added 
(LengthFilter is at fault, but IndexWriter should arguably catch it).
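For reference, here's a minimal sketch of what that accounting could look like. This is plain Java, not the actual Lucene TokenStream API; the `Token` and `LengthFilterSketch` names are hypothetical, purely for illustration of the increment bookkeeping:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical, simplified model of a token carrying a position increment;
// NOT the Lucene API, just an illustration of the accounting.
class Token {
    final String text;
    int posIncr; // increment relative to the previous *kept* token
    Token(String text, int posIncr) { this.text = text; this.posIncr = posIncr; }
}

class LengthFilterSketch {
    // Drop tokens whose length falls outside [min, max], but fold the
    // increments of dropped tokens into the next surviving token -- the
    // accounting the current LengthFilter appears to be missing.
    static List<Token> filter(List<Token> in, int min, int max) {
        List<Token> out = new ArrayList<>();
        int pending = 0; // increments accumulated from dropped tokens
        for (Token t : in) {
            int len = t.text.length();
            if (len < min || len > max) {
                pending += t.posIncr; // carry the dropped token's increment
            } else {
                t.posIncr += pending; // fold the accumulated increments in
                pending = 0;
                out.add(t);
            }
        }
        return out;
    }
}
```

With the "U.S.A." stream U(1) S(1) A(1) USA(0) and a minimum length of 2, the three dropped single-letter tokens contribute their increments to USA, so it comes out with increment 3 instead of 0.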

That said: even with this error from CheckIndex, Lucene+Solr seems able to 
use the index fine (though I didn't stress-test it).

: If you analyze the query "U.S.A. and U.K." with analysis.jsp, you will notice
: USA positioned at 0 while initially it was at 1 - I don't know if this just
: means "the token wasn't originally present in the doc", but since CheckIndex
: complains about it, it makes me think this is a bug.

...bingo, but more specifically: "USA" is at the same position as "A" after 
WordDelimiterFilter processes the string, and then LengthFilter gets rid 
of U, S, and A, leaving USA at the same position as ... "nothing".
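To make the arithmetic concrete (a toy sketch, not Lucene code): the writer's position counter starts at -1 and each token advances it by its increment, so a first surviving token with increment 0 lands at the position -1 that CheckIndex flags:

```java
public class PositionDemo {
    /** Positions start at -1; each token advances the counter by its increment. */
    static int positionOfFirstToken(int firstIncrement) {
        return -1 + firstIncrement;
    }

    public static void main(String[] args) {
        // WordDelimiterFilter turns "U.S.A." into U(1) S(1) A(1) USA(0);
        // the buggy LengthFilter drops U/S/A but leaves USA's increment at 0,
        // so the first token written has increment 0:
        System.out.println("USA -> position " + positionOfFirstToken(0)); // -1
        // a filter that carried the three dropped increments would give 2,
        // the position "A" (and hence USA) held in the original stream:
        System.out.println("USA -> position " + positionOfFirstToken(3)); // 2
    }
}
```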


Would you mind opening a Solr bug to fix LengthFilter?  I'll spin up a 
[EMAIL PROTECTED] thread asking whether IndexWriter should protect against 
this, or whether CheckIndex is being paranoid and this is considered a 
legitimate state for the data.


-Hoss
