Thank you Steven. I had problems running those searches; I think it is because the StandardAnalyzer treats the badly-encoded characters as separators, and hence never produces such tokens when reading the input...
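For example, a quick token dump like the following (a minimal sketch; the field name and the mojibake sample string are only placeholders) makes the behaviour visible: characters that the StandardTokenizer grammar does not treat as letters or digits act as separators, so the bad sequence never survives into a searchable token:

    import java.io.StringReader;
    import org.apache.lucene.analysis.Token;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;

    public class TokenDump {
        public static void main(String[] args) throws Exception {
            // "café" mis-decoded as Latin-1 typically shows up as "cafÃ©";
            // this sample string is only an illustration.
            String text = "caf\u00c3\u00a9";
            TokenStream stream =
                new StandardAnalyzer().tokenStream("body", new StringReader(text));
            for (Token tok = stream.next(); tok != null; tok = stream.next()) {
                // Non-letter characters such as \u00a9 act as separators,
                // so the full mojibake sequence is never indexed as one token.
                System.out.println(tok.termText());
            }
            stream.close();
        }
    }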
Regarding the other idea you describe: did you mean that if a document contains many previously unseen terms, that may indicate encoding problems? What I would also like is to be able to at least measure the impact of such problems, so I can decide whether the effort will pay off :) (I have attempted a rough sketch of your unseen-terms check at the end of this message.)

Cheers,
P

Steven Rowe wrote:
>
> poeta simbolista wrote:
>> I'd like to know the best way to look for strange encodings in a Lucene
>> index. I have several inputs, and each input may have been encoded in a
>> different charset. I do not always know whether my guess about the
>> encoding was right. Hence, I thought of querying the index for some
>> typical strings that would reveal bad encodings.
>
> In my experience, the best thing to do first is to look at a random
> sample of the data you suspect to be problematic, and keep track of what
> you find. Then decide, based on what you find, whether it is worth
> pursuing further. (Data is messy, and sometimes it is not worth the
> effort to find and fix everything, as long as you know that the
> probability of problems is relatively low.)
>
> If you do find that it is worth pursuing, I'd guess that the best spot
> to find problems is at index time rather than query time, mostly because
> at query time you don't necessarily know what to look for. If you did,
> then you could already improve your guesser at index time, right?
>
> One technique that you might find useful is to see whether a document
> contains too many previously unseen terms. You could index documents in
> the same language and subject domain as those which might have
> problematic charset-conversion issues, but which do not have those
> issues themselves; then tokenize the potentially problematic documents,
> checking for the existence of each term in the index [1] and keeping
> track of the ratio of previously unseen terms to the total number of
> terms. If you compare this ratio to that of the average known-good
> document (and/or the worst-case near-last addition to the index), you
> could get an idea of whether or not the document in question has issues.
>
> Steve
>
> [1] <http://lucene.apache.org/java/2_2_0/api/org/apache/lucene/index/IndexReader.html#terms(org.apache.lucene.index.Term)>
>
> --
> Steve Rowe
> Center for Natural Language Processing
> http://www.cnlp.org/tech/lucene.asp
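For reference, here is a minimal sketch of how the unseen-terms ratio Steve describes could be computed against a "known good" reference index, using the Lucene 2.2 API. I use IndexReader.docFreq() as the existence check (which amounts to the same thing as walking the TermEnum from footnote [1]); the field name and the choice of StandardAnalyzer are placeholder assumptions:

    import java.io.IOException;
    import java.io.StringReader;
    import org.apache.lucene.analysis.Token;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.index.Term;

    public class UnseenTermRatio {
        /**
         * Tokenizes the given text with StandardAnalyzer and returns the
         * fraction of tokens that never occur in the reference index.
         */
        public static double unseenRatio(IndexReader reader, String field,
                                         String text) throws IOException {
            TokenStream stream =
                new StandardAnalyzer().tokenStream(field, new StringReader(text));
            int total = 0, unseen = 0;
            for (Token tok = stream.next(); tok != null; tok = stream.next()) {
                total++;
                // docFreq() == 0 means no document in the known-good
                // index ever contained this term.
                if (reader.docFreq(new Term(field, tok.termText())) == 0) {
                    unseen++;
                }
            }
            stream.close();
            return total == 0 ? 0.0 : (double) unseen / total;
        }
    }

A suspect document whose ratio is far above that of the known-good sample would then be a candidate for bad charset conversion; the actual threshold would have to be calibrated on your own data.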