poeta simbolista wrote: > I'd want to know the best way to look for strange encodings on a Lucene > index. > i have several inputs where input can have been encoded on different sets. I > not always know if my guess about the encoding has been ok. Hence, I'd > thought of querying the index for some typical strings that would show bad > encodings.
In my experience, the best thing to do first is to look at a random sample of the data you suspect to be problematic, and keep track of what you find. Then, decide based on what you find whether it's worth it to pursue it further. (Data is messy, and sometimes it's not worth the effort to find and fix everything, as long as you know that the probability of problems is relatively low.) If you do find that it's worth pursuing, I'd guess that the best spot to find problems is at index time rather than query time, mostly because at query time, you don't necessarily know what to look for. If you did, then you could already improve your guesser at index time, right? One technique that you might find useful is to see if a document contains too many previously unseen terms. You could index documents in the same language and subject domain as those which might have problematic charset conversion issues, but which do not have those issues themselves, and then tokenize potentially problematically converted documents, checking for the existence of each term in the index[1] and keeping track of the ratio of previously unseen terms to the total number of terms. If you compare this ratio to that for the average known good document (and/or the worst-case near-last addition to the index), you could get an idea about whether or not the document in question has issues. Steve [1] <http://lucene.apache.org/java/2_2_0/api/org/apache/lucene/index/IndexReader.html#terms(org.apache.lucene.index.Term)> -- Steve Rowe Center for Natural Language Processing http://www.cnlp.org/tech/lucene.asp --------------------------------------------------------------------- To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]