RE: Cleaning up dirty OCR

2010-03-11 Thread Burton-West, Tom
: Cleaning up dirty OCR Can anyone suggest any practical solutions to removing some fraction of the tokens containing OCR errors from our input stream? One approach would be to try http://issues.apache.org/jira/browse/LUCENE-1812 and filter terms that only appear once in the document.
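A minimal sketch of the single-occurrence idea (not the actual LUCENE-1812 patch, and the class name is made up): a buffering TokenFilter that counts terms within one field instance and replays only those seen more than once. A real filter would also fix up position increments for the dropped tokens.

import java.io.IOException;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.Iterator;
import java.util.List;
import java.util.Map;

import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.util.AttributeSource;

// Buffers the whole token stream for a field, counts term frequencies,
// then replays only the tokens whose term occurs more than once.
public final class DropSingletonTermsFilter extends TokenFilter {
  private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
  private Iterator<AttributeSource.State> replay;
  private Map<String, Integer> counts;

  public DropSingletonTermsFilter(TokenStream input) {
    super(input);
  }

  @Override
  public boolean incrementToken() throws IOException {
    if (replay == null) {                      // first call: buffer everything
      List<AttributeSource.State> buffered = new ArrayList<>();
      counts = new HashMap<>();
      while (input.incrementToken()) {
        buffered.add(captureState());
        counts.merge(termAtt.toString(), 1, Integer::sum);
      }
      replay = buffered.iterator();
    }
    while (replay.hasNext()) {                 // replay pass: skip singletons
      restoreState(replay.next());
      if (counts.get(termAtt.toString()) > 1) {
        return true;
      }
    }
    return false;
  }

  @Override
  public void reset() throws IOException {
    super.reset();
    replay = null;
    counts = null;
  }
}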

Re: Cleaning up dirty OCR

2010-03-11 Thread Robert Muir
On Thu, Mar 11, 2010 at 3:37 PM, Burton-West, Tom tburt...@umich.edu wrote: Thanks Robert, I've been thinking about this since you suggested it on another thread. One problem is that it would also remove real words. Apparently 40-60% of the words in large corpora occur only once.

Re: Cleaning up dirty OCR

2010-03-11 Thread Tom Burton-West
for runs of punctuation, unlikely mixes of alpha/numeric/punctuation, and also eliminated longer words which consisted of runs of not-occurring-in-English bigrams. Hope this helps -Simon
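Something along the lines Simon describes could start with simple character-class checks; the patterns and thresholds below are only guesses, and the class is illustrative. His last step, rejecting longer tokens built from character bigrams that never occur in English, would additionally need a bigram frequency table.

import java.util.regex.Pattern;

// Crude token-level heuristics for OCR garbage: long runs of punctuation,
// and tokens that mix letters, digits, and punctuation all at once.
public class OcrNoiseHeuristics {

  private static final Pattern PUNCT_RUN  = Pattern.compile("\\p{Punct}{3,}");
  private static final Pattern HAS_LETTER = Pattern.compile("\\p{L}");
  private static final Pattern HAS_DIGIT  = Pattern.compile("\\p{Nd}");
  private static final Pattern HAS_PUNCT  = Pattern.compile("\\p{Punct}");

  public static boolean looksLikeNoise(String token) {
    if (PUNCT_RUN.matcher(token).find()) {
      return true;                             // e.g. ",,;;" or "---~~"
    }
    // a token containing letters, digits, and punctuation together
    // (e.g. "t3x;t") is rarely a real word
    return HAS_LETTER.matcher(token).find()
        && HAS_DIGIT.matcher(token).find()
        && HAS_PUNCT.matcher(token).find();
  }
}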

Re: Cleaning up dirty OCR

2010-03-11 Thread Robert Muir
On Thu, Mar 11, 2010 at 4:14 PM, Tom Burton-West tburtonw...@gmail.com wrote: Thanks Simon, We can probably implement your suggestion about runs of punctuation and unlikely mixes of alpha/numeric/punctuation. I'm also thinking about looking for unlikely mixes of Unicode character blocks.
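One stab at the mixed-character-block check, using java.lang.Character.UnicodeScript and a deliberately simplistic "more than one script per token" rule (legitimate exceptions such as Latin loanwords inside CJK text would need whitelisting):

import java.util.EnumSet;
import java.util.Set;

// Flags tokens whose letters come from more than one Unicode script,
// e.g. CJK ideographs with stray Cyrillic characters mixed in.
public class MixedScriptCheck {

  public static boolean mixesScripts(String token) {
    Set<Character.UnicodeScript> scripts = EnumSet.noneOf(Character.UnicodeScript.class);
    token.codePoints().filter(Character::isLetter).forEach(cp -> {
      Character.UnicodeScript s = Character.UnicodeScript.of(cp);
      if (s != Character.UnicodeScript.COMMON && s != Character.UnicodeScript.INHERITED) {
        scripts.add(s);
      }
    });
    return scripts.size() > 1;
  }

  public static void main(String[] args) {
    System.out.println(mixesScripts("языкword"));  // true: Cyrillic + Latin
    System.out.println(mixesScripts("東京"));       // false: Han only
  }
}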

Re: Cleaning up dirty OCR

2010-03-11 Thread Chris Hostetter
: We can probably implement your suggestion about runs of punctuation and unlikely mixes of alpha/numeric/punctuation. I'm also thinking about looking for unlikely mixes of Unicode character blocks. For example, some of the CJK material ends up with Cyrillic characters. (except we would

Re: Cleaning up dirty OCR

2010-03-11 Thread Walter Underwood
On Mar 11, 2010, at 1:34 PM, Chris Hostetter wrote: I wonder if one way to try and generalize the idea of unlikely letter combinations into a math problem (instead of a grammar/spelling problem) would be to score all the hapax legomenon words in your index. Hmm, how about a classifier?

Re: Cleaning up dirty OCR

2010-03-11 Thread Tom Burton-West
words are the yes training set, hapax legomena are the no set, and n-grams are the features. But why isn't the OCR program already doing this? wunder
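A toy version of the classifier Walter describes. Naive Bayes, character bigram features, and add-one smoothing are my choices here; the thread only specifies dictionary words as the yes set, hapax terms as the no set, and n-grams as the features.

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy naive Bayes: known-good dictionary words are the "yes" class, suspect
// hapax terms the "no" class, and character bigrams are the features.
public class CharNgramClassifier {

  private final Map<String, Integer> yesCounts = new HashMap<>();
  private final Map<String, Integer> noCounts  = new HashMap<>();
  private int yesTotal, noTotal;

  private static List<String> bigrams(String word) {
    List<String> grams = new ArrayList<>();
    String padded = "_" + word.toLowerCase() + "_";   // mark word boundaries
    for (int i = 0; i + 2 <= padded.length(); i++) {
      grams.add(padded.substring(i, i + 2));
    }
    return grams;
  }

  public void train(Iterable<String> dictionaryWords, Iterable<String> hapaxTerms) {
    for (String w : dictionaryWords)
      for (String g : bigrams(w)) { yesCounts.merge(g, 1, Integer::sum); yesTotal++; }
    for (String w : hapaxTerms)
      for (String g : bigrams(w)) { noCounts.merge(g, 1, Integer::sum); noTotal++; }
  }

  // true if the term's bigrams look more like the dictionary than like noise
  public boolean looksLikeRealWord(String term) {
    double yesLog = 0, noLog = 0;
    for (String g : bigrams(term)) {
      // add-one smoothing so unseen bigrams do not dominate the score
      yesLog += Math.log((yesCounts.getOrDefault(g, 0) + 1.0) / (yesTotal + 1.0));
      noLog  += Math.log((noCounts.getOrDefault(g, 0) + 1.0) / (noTotal + 1.0));
    }
    return yesLog >= noLog;   // uniform class priors assumed for simplicity
  }
}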

Re: Cleaning up dirty OCR

2010-03-11 Thread Chris Hostetter
: Interesting. I wonder though if we have 4 million English documents and 250 in Urdu, if the Urdu words would score badly when compared to n-gram statistics for the entire corpus. Well, it doesn't have to be a strict ratio cutoff... you could look at the average frequency of all character
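A minimal sketch of the average-frequency idea (bigrams rather than longer n-grams, and unweighted averaging, are assumptions on my part). As the exchange above notes, statistics pooled over the whole corpus will still tend to penalize small-language material such as the Urdu documents, which is where the language-partitioning idea later in the thread helps.

import java.util.HashMap;
import java.util.Map;

// Scores a term by the average corpus frequency of its character bigrams:
// no hard cutoff, just a number you can threshold per collection.
public class NgramFrequencyScore {

  private final Map<String, Long> bigramCounts = new HashMap<>();
  private long total;

  // feed every indexed term (or a sample) through this to build the statistics
  public void addTerm(String term) {
    String padded = "_" + term.toLowerCase() + "_";
    for (int i = 0; i + 2 <= padded.length(); i++) {
      bigramCounts.merge(padded.substring(i, i + 2), 1L, Long::sum);
      total++;
    }
  }

  // average relative frequency of the term's bigrams; OCR garbage tends to
  // score far lower than real words from any well-represented language
  public double score(String term) {
    if (total == 0) return 0;
    String padded = "_" + term.toLowerCase() + "_";
    int n = 0;
    double sum = 0;
    for (int i = 0; i + 2 <= padded.length(); i++, n++) {
      sum += bigramCounts.getOrDefault(padded.substring(i, i + 2), 0L) / (double) total;
    }
    return n == 0 ? 0 : sum / n;
  }
}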

Re: Cleaning up dirty OCR

2010-03-11 Thread Robert Muir
I don't deal with a lot of multi-lingual stuff, but my understanding is that this sort of thing gets a lot easier if you can partition your docs by language -- and even if you can't, doing some language detection on the (dirty) OCRed text to get a language guess (and then partition by
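A rough sketch of the language-guess step, assuming Apache Tika's older LanguageIdentifier is on the classpath (newer Tika releases replace it with a LanguageDetector API); any dedicated language-identification library would do.

import org.apache.tika.language.LanguageIdentifier;

// Guesses a language for a page of (dirty) OCR text so documents can be
// routed to a per-language index or analysis chain.
public class LanguageRouter {

  public static String guessLanguage(String ocrText) {
    LanguageIdentifier identifier = new LanguageIdentifier(ocrText);
    // fall back to a catch-all bucket when the detector is unsure,
    // which happens often on short or very noisy pages
    return identifier.isReasonablyCertain() ? identifier.getLanguage() : "unknown";
  }
}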

Cleaning up dirty OCR

2010-03-09 Thread Burton-West, Tom
Hello all, We have been indexing a large collection of OCR'd text: about 5 million books in over 200 languages. With 1.5 billion OCR'd pages, even a small OCR error rate creates a relatively large number of meaningless unique terms. (See

Re: Cleaning up dirty OCR

2010-03-09 Thread Robert Muir
Can anyone suggest any practical solutions to removing some fraction of the tokens containing OCR errors from our input stream? One approach would be to try http://issues.apache.org/jira/browse/LUCENE-1812 and filter terms that only appear once in the document. -- Robert Muir

Re: Cleaning up dirty OCR

2010-03-09 Thread simon
On Tue, Mar 9, 2010 at 2:35 PM, Robert Muir rcm...@gmail.com wrote: Can anyone suggest any practical solutions to removing some fraction of the tokens containing OCR errors from our input stream? One approach would be to try http://issues.apache.org/jira/browse/LUCENE-1812 and filter