Thanks Robert, I've been thinking about this since you suggested it on another thread. One problem is that it would also remove real words: apparently 40-60% of the words in large corpora occur only once (http://en.wikipedia.org/wiki/Hapax_legomenon).
There are a couple of use cases where removing words that occur only once might be a problem. One is genealogical searches, where a user might want to retrieve a document even if their relative is mentioned only once in it. We have quite a few government documents and other resources such as the "Lineage Book" of the Daughters of the American Revolution. Another use case is humanities researchers doing phrase searching for quotes. If we remove one of the words in the quote because it occurs only once in a document, the phrase search will fail. For example, if someone searching Macbeth entered the phrase query "Eye of newt and toe of frog", it would fail if we had removed "newt" from the index, because "newt" occurs only once in Macbeth.

I ran a quick check against a couple of our copies of Macbeth and found that out of about 5,000 unique words, about 3,000 occurred only once. Of these, about 1,800 were in the unix dictionary, so at least 1,800 of the words that would be removed would be "real" words as opposed to OCR errors (a spot check of the words not in the unix /usr/share/dict/words file revealed that most of them were also real words rather than OCR errors).

I also ran a quick check against a document with bad OCR: out of about 30,000 unique words, 20,000 occurred only once. Of those 20,000, only about 300 were in the unix dictionary, so your intuition that a lot of OCR errors will occur only once seems spot on. A quick look at the words not in the dictionary revealed a mix of technical terms, common names, and obvious OCR nonsense such as "ffllllll.lj'slall'lm".

I guess the question I need to answer is whether the benefit of removing words that occur only once outweighs the costs in terms of the two use cases outlined above.
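For what it's worth, the quick checks above can be reproduced with a short script along these lines. This is a minimal sketch, not the exact procedure I used: the regex tokenizer is an assumption, and in practice you would load the document text and /usr/share/dict/words from disk rather than inline strings.

```python
import re
from collections import Counter

def hapax_stats(text, dictionary):
    """Return (unique terms, hapax count, hapaxes found in dictionary)."""
    counts = Counter(re.findall(r"[a-z']+", text.lower()))
    hapaxes = {term for term, n in counts.items() if n == 1}
    return len(counts), len(hapaxes), len(hapaxes & dictionary)

# Tiny illustration; for the real check, read the OCR'd document into
# `text` and the word list from /usr/share/dict/words into `dictionary`.
sample = "eye of newt and toe of frog"
words = {"eye", "of", "newt", "and", "toe", "frog"}
print(hapax_stats(sample, words))  # "of" occurs twice; the other five are hapaxes
```

Comparing the hapax-in-dictionary count against the total hapax count is what separates the Macbeth case (mostly real words) from the bad-OCR case (mostly garbage).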
When we get our new test server set up, sometime in the next month, I think I will go ahead and prune a test index of 500K docs and do some performance testing just to get an idea of the potential performance gains of pruning the index. I have some other questions about index pruning, but I want to do a bit more reading and then I'll post a question to either the Solr or Lucene list. Can you suggest which list I should post an index pruning question to?

Tom

-----Original Message-----
From: Robert Muir [mailto:rcm...@gmail.com]
Sent: Tuesday, March 09, 2010 2:36 PM
To: solr-user@lucene.apache.org
Subject: Re: Cleaning up dirty OCR

> Can anyone suggest any practical solutions to removing some fraction of the
> tokens containing OCR errors from our input stream?

one approach would be to try
http://issues.apache.org/jira/browse/LUCENE-1812 and filter terms that
only appear once in the document.

--
Robert Muir
rcm...@gmail.com