Thanks Robert, I've been thinking about this since you suggested it on another thread. One problem is that it would also remove real words: apparently 40-60% of the words in large corpora occur only once (http://en.wikipedia.org/wiki/Hapax_legomenon).
There are a couple of use cases where removing words that occur only once might be a problem. One is genealogical searches, where a user might want to retrieve a document even if their relative is mentioned only once in it. We have quite a few government documents and other resources such as the "Lineage Book" of the Daughters of the American Revolution. Another use case is humanities researchers doing phrase searching for quotes. If we remove one of the words in the quote because it occurs only once in a document, the phrase search will fail. For example, if someone searching Macbeth entered the phrase query "Eye of newt and toe of frog", it would fail if we had removed "newt" from the index, because "newt" occurs only once in Macbeth.

I ran a quick check against a couple of our copies of Macbeth and found that out of about 5,000 unique words, about 3,000 occurred only once. Of these, about 1,800 were in the unix dictionary, so at least 1,800 of the words that would be removed would be "real" words as opposed to OCR errors (a spot check of the words not in the unix /usr/share/dict/words file revealed that most of them were also real words rather than OCR errors).

I also ran a quick check against a document with bad OCR: out of about 30,000 unique words, 20,000 occurred only once. Of those 20,000, only about 300 were in the unix dictionary, so your intuition that a lot of OCR errors will occur only once seems spot on. A quick look at the words not in the dictionary revealed a mix of technical terms, common names, and obvious OCR nonsense such as "ffllllll.lj'slall'lm".

I guess the question I need to answer is whether the benefit of removing words that occur only once outweighs the costs in terms of the two use cases outlined above.
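For what it's worth, the quick checks above can be reproduced with a short script along these lines. This is a minimal sketch, not the exact procedure I used: the regex tokenizer is an assumption, and in practice you would load the document text and /usr/share/dict/words from disk rather than inline strings.

```python
import re
from collections import Counter

def hapax_stats(text, dictionary):
    """Return (unique terms, hapax count, hapaxes found in dictionary)."""
    counts = Counter(re.findall(r"[a-z']+", text.lower()))
    hapaxes = {term for term, n in counts.items() if n == 1}
    return len(counts), len(hapaxes), len(hapaxes & dictionary)

# Tiny illustration; for the real check, read the OCR'd document into
# `text` and the word list from /usr/share/dict/words into `dictionary`.
sample = "eye of newt and toe of frog"
words = {"eye", "of", "newt", "and", "toe", "frog"}
print(hapax_stats(sample, words))  # "of" occurs twice; the other five are hapaxes
```

Comparing the hapax-in-dictionary count against the total hapax count is what separates the Macbeth case (mostly real words) from the bad-OCR case (mostly garbage).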
When we get our new test server set up, sometime in the next month, I think I will go ahead and prune a test index of 500K docs and do some performance testing just to get an idea of the potential performance gains of pruning the index. I have some other questions about index pruning, but I want to do a bit more reading and then I'll post a question to either the Solr or Lucene list. Can you suggest which list I should post an index pruning question to?

Tom

-----Original Message-----
From: Robert Muir [mailto:rcm...@gmail.com]
Sent: Tuesday, March 09, 2010 2:36 PM
To: solr-user@lucene.apache.org
Subject: Re: Cleaning up dirty OCR

> Can anyone suggest any practical solutions to removing some fraction of the
> tokens containing OCR errors from our input stream?

one approach would be to try
http://issues.apache.org/jira/browse/LUCENE-1812 and filter terms that
only appear once in the document.

--
Robert Muir
rcm...@gmail.com