: I'm using an analyzer with WordDelimiterFilter before a LengthFilter; When I
: ran org.apache.lucene.index.CheckIndex on my solr index, I get some errors.
: CheckIndex complains that some docs have terms incorrectly positioned at -1,
: beware that solr doesn't complain in any way, but Luke "Reconstruct&Edit" and
: CheckIndex complain about "damaged" index.

i can reproduce the behavior you observed, but i'm not sure if the entire
problem is the fault of LengthFilter, the Lucene indexing code, or
CheckIndex.

First question: other than Luke's Reconstruct&Edit feature or CheckIndex,
are you seeing any other evidence of a problem? Solr not working? these
documents not showing up in your results?

The root issue seems to be that when LengthFilter throws away Tokens because
they are too short (or too long) it doesn't seem to account for the
positionIncrement of the tokens it discards -- i'm fairly certain it should,
i can't think of any legitimate reason for it not to -- so if the first
token LengthFilter lets pass says its increment is "0", that seems to throw
off the internal position counter when the document is added (LengthFilter
is at fault, but IndexWriter should catch it).

that said: even with this error from CheckIndex, Lucene+Solr seems to be
able to use the index fine (but i didn't stress test it)

: If you analyze with analysis.jsp the query "U.S.A. and U.K." you will notice
: USA to be positioned to 0, while initially it was on 1 - I don't know if this
: just means "the token wasn't originally present in the doc", but as CheckIndex
: complains about it, it makes me think this is a bug.

...bingo, but more specifically, "USA" is at the same position as "A" after
WD processes the string, and then LengthFilter gets rid of U, S, and A,
leaving USA at the same position as ... "nothing"

Would you mind opening a Solr bug to fix LengthFilter, and i'll spin up a
[EMAIL PROTECTED] thread asking if IndexWriter should protect against this
or if CheckIndex is paranoid and it's considered a legitimate state for the
data.

-Hoss
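
To make the positionIncrement problem described above concrete, here is a minimal sketch in plain Java. It does not use the real Lucene classes: `Tok`, `naiveLengthFilter`, `adjustedLengthFilter` and `firstPosition` are hypothetical names invented for illustration. It assumes WordDelimiterFilter emits U, S, A with increment 1 and then USA with increment 0 (as the thread describes), that the length bounds are 2 and 100, and that the indexer computes the first position as -1 plus the first token's increment. None of this is confirmed against the actual filter or IndexWriter source.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class PositionIncrementSketch {

    /** Hypothetical stand-in for a token: its text plus its position increment. */
    static final class Tok {
        final String text;
        final int posInc;
        Tok(String text, int posInc) { this.text = text; this.posInc = posInc; }
    }

    /** Assumed WordDelimiterFilter output for "U.S.A.": USA stacked on A. */
    static List<Tok> sample() {
        return Arrays.asList(new Tok("U", 1), new Tok("S", 1),
                             new Tok("A", 1), new Tok("USA", 0));
    }

    /** Drops tokens outside [min, max] and leaves the increments untouched. */
    static List<Tok> naiveLengthFilter(List<Tok> in, int min, int max) {
        List<Tok> out = new ArrayList<>();
        for (Tok t : in) {
            int len = t.text.length();
            if (len >= min && len <= max) out.add(t);
        }
        return out;
    }

    /** Same filter, but carries the increments of dropped tokens forward. */
    static List<Tok> adjustedLengthFilter(List<Tok> in, int min, int max) {
        List<Tok> out = new ArrayList<>();
        int skipped = 0; // increments of tokens dropped since the last kept one
        for (Tok t : in) {
            int len = t.text.length();
            if (len >= min && len <= max) {
                out.add(new Tok(t.text, t.posInc + skipped));
                skipped = 0;
            } else {
                skipped += t.posInc;
            }
        }
        return out;
    }

    /** Assumed indexer behaviour: counter starts at -1, increment is added. */
    static int firstPosition(List<Tok> toks) {
        return toks.isEmpty() ? -1 : -1 + toks.get(0).posInc;
    }

    public static void main(String[] args) {
        List<Tok> in = sample();
        System.out.println(firstPosition(naiveLengthFilter(in, 2, 100)));    // -1
        System.out.println(firstPosition(adjustedLengthFilter(in, 2, 100))); // 2
    }
}
```

Under these assumptions the naive filter leaves USA as the first token with increment 0, which lands it at position -1, the value CheckIndex reports. The adjusted variant puts USA at position 2, where A originally sat.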
