Cleaning up dirty OCR
Can anyone suggest any practical solutions to removing some fraction of the
tokens containing OCR errors from our input stream?
One approach would be to try http://issues.apache.org/jira/browse/LUCENE-1812
and filter terms that only appear once in the document.
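For illustration, here is a rough standalone sketch of that kind of filter -- this is not the LUCENE-1812 patch itself, and the class and method names are made up:

import java.util.*;

// Toy sketch: drop tokens that occur only once within a single document's
// token list, on the theory that much OCR garbage shows up as singletons.
public class SingletonTermFilter {

    public static List<String> dropSingletons(List<String> tokens) {
        Map<String, Integer> counts = new HashMap<>();
        for (String t : tokens) {
            counts.merge(t, 1, Integer::sum);
        }
        List<String> kept = new ArrayList<>();
        for (String t : tokens) {
            if (counts.get(t) > 1) {
                kept.add(t);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<String> tokens = Arrays.asList("the", "quick", "tlie", "the", "quick");
        System.out.println(dropSingletons(tokens)); // [the, quick, the, quick]
    }
}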
On Thu, Mar 11, 2010 at 3:37 PM, Burton-West, Tom tburt...@umich.edu wrote:
Thanks Robert,
I've been thinking about this since you suggested it on another thread. One
problem is that it would also remove real words. Apparently 40-60% of the
words in large corpora occur only once.
We filtered for runs of punctuation and unlikely mixes of
alpha/numeric/punctuation, and also eliminated longer words which consisted
of runs of bigrams that do not occur in English.
Hope this helps
-Simon
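To make those heuristics concrete, here is a rough Java sketch; the patterns and thresholds are illustrative guesses, not Simon's actual rules, and the bigram check would additionally require a table of valid English bigrams:

import java.util.regex.Pattern;

// Heuristic checks for tokens that are probably OCR noise.
public class OcrNoiseHeuristics {

    // Three or more punctuation characters in a row, e.g. ",;:-"
    private static final Pattern PUNCT_RUN = Pattern.compile("\\p{Punct}{3,}");

    // Letters, digits and punctuation all in one token, e.g. "w0rd;x",
    // is an unlikely combination in real words.
    private static final Pattern HAS_LETTER = Pattern.compile("\\p{L}");
    private static final Pattern HAS_DIGIT  = Pattern.compile("\\p{N}");
    private static final Pattern HAS_PUNCT  = Pattern.compile("\\p{Punct}");

    public static boolean looksLikeOcrNoise(String token) {
        if (PUNCT_RUN.matcher(token).find()) {
            return true;
        }
        return HAS_LETTER.matcher(token).find()
                && HAS_DIGIT.matcher(token).find()
                && HAS_PUNCT.matcher(token).find();
    }
}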
On Thu, Mar 11, 2010 at 4:14 PM, Tom Burton-West tburtonw...@gmail.com wrote:
Thanks Simon,
We can probably implement your suggestion about runs of punctuation and
unlikely mixes of alpha/numeric/punctuation. I'm also thinking about
looking for unlikely mixes of Unicode character blocks.
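A possible sketch of that check; the rule "more than one block among the letters" is a simplification, and a real version would need to group related blocks (Latin alone spans several):

import java.util.HashSet;
import java.util.Set;

// Flag tokens whose letters come from more than one Unicode block,
// e.g. Cyrillic characters mixed into a CJK word.
public class BlockMixDetector {

    public static boolean mixesBlocks(String token) {
        Set<Character.UnicodeBlock> blocks = new HashSet<>();
        for (int i = 0; i < token.length(); ) {
            int cp = token.codePointAt(i);
            if (Character.isLetter(cp)) {
                blocks.add(Character.UnicodeBlock.of(cp));
            }
            i += Character.charCount(cp);
        }
        // Caveat: accented Latin text legitimately spans several Latin blocks,
        // so related blocks should really be collapsed before this test.
        return blocks.size() > 1;
    }
}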
: We can probably implement your suggestion about runs of punctuation and
: unlikely mixes of alpha/numeric/punctuation. I'm also thinking about
: looking for unlikely mixes of Unicode character blocks. For example, some of
: the CJK material ends up with Cyrillic characters. (except we would
On Mar 11, 2010, at 1:34 PM, Chris Hostetter wrote:
I wonder if one way to try and generalize the idea of unlikely letter
combinations into a math problem (instead of a grammar/spelling problem)
would be to score all the hapax legomenon words in your index.
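One way that scoring could look, sketched with character bigrams and add-one smoothing (both arbitrary choices, not anything specified in the thread):

import java.util.HashMap;
import java.util.Map;

// Build character-bigram counts from the corpus vocabulary, then give each
// word an average log-probability; words made of rare bigrams score very low
// and can be dropped below some cutoff.
public class NgramScorer {

    private final Map<String, Long> bigramCounts = new HashMap<>();
    private long total = 0;

    public void addWord(String word) {
        String w = "^" + word.toLowerCase() + "$";   // mark word boundaries
        for (int i = 0; i < w.length() - 1; i++) {
            bigramCounts.merge(w.substring(i, i + 2), 1L, Long::sum);
            total++;
        }
    }

    public double averageLogProb(String word) {
        String w = "^" + word.toLowerCase() + "$";
        double sum = 0;
        int n = 0;
        for (int i = 0; i < w.length() - 1; i++) {
            long c = bigramCounts.getOrDefault(w.substring(i, i + 2), 0L);
            // add-one smoothing so unseen bigrams do not zero out the score
            sum += Math.log((c + 1.0) / (total + bigramCounts.size() + 1.0));
            n++;
        }
        return sum / n;
    }
}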
Hmm, how about a classifier?
Words are the yes training set, hapax legomena are the no set, and n-grams
are the features.
But why isn't the OCR program already doing this?
wunder
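A minimal sketch of that classifier idea, using Naive Bayes over character bigrams; the thread does not name a particular learner, and class priors are left out for brevity:

import java.util.HashMap;
import java.util.Map;

// Frequent corpus words are the "yes" (real word) examples, hapax legomena
// the "no" examples, and character bigrams the features.
public class WordVsNoiseClassifier {

    private final Map<String, Integer> yesCounts = new HashMap<>();
    private final Map<String, Integer> noCounts  = new HashMap<>();
    private int yesTotal = 0, noTotal = 0;

    public void train(String word, boolean isRealWord) {
        for (String g : bigrams(word)) {
            if (isRealWord) { yesCounts.merge(g, 1, Integer::sum); yesTotal++; }
            else            { noCounts.merge(g, 1, Integer::sum);  noTotal++;  }
        }
    }

    public boolean looksReal(String word) {
        double yes = 0, no = 0;
        for (String g : bigrams(word)) {
            yes += Math.log((yesCounts.getOrDefault(g, 0) + 1.0) / (yesTotal + 1.0));
            no  += Math.log((noCounts.getOrDefault(g, 0) + 1.0) / (noTotal + 1.0));
        }
        return yes >= no;
    }

    private static String[] bigrams(String word) {
        String w = "^" + word.toLowerCase() + "$";
        String[] out = new String[w.length() - 1];
        for (int i = 0; i < w.length() - 1; i++) {
            out[i] = w.substring(i, i + 2);
        }
        return out;
    }
}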
: Interesting. I wonder though if we have 4 million English documents and 250
: in Urdu, if the Urdu words would score badly when compared to ngram
: statistics for the entire corpus.
Well, it doesn't have to be a strict ratio cutoff .. you could look at the
average frequency of all the character n-grams in a word.
I don't deal with a lot of multi-lingual stuff, but my understanding is
that this sort of thing gets a lot easier if you can partition your docs
by language -- and even if you can't, doing some language detection on the
(dirty) OCRed text to get a language guess (and then partitioning by that
guess) should help.
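A sketch of how the per-language partitioning might combine with the scoring idea above; it reuses the hypothetical NgramScorer from the earlier sketch and assumes a language code from document metadata or a language guesser:

import java.util.HashMap;
import java.util.Map;

// Keep a separate n-gram model per language so that, say, Urdu words are
// scored against Urdu statistics instead of a mostly-English corpus.
public class PerLanguageScoring {

    private final Map<String, NgramScorer> byLanguage = new HashMap<>();

    public void addWord(String languageCode, String word) {
        byLanguage.computeIfAbsent(languageCode, k -> new NgramScorer()).addWord(word);
    }

    public double score(String languageCode, String word) {
        NgramScorer scorer = byLanguage.get(languageCode);
        return scorer == null ? Double.NEGATIVE_INFINITY : scorer.averageLogProb(word);
    }
}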
Hello all,
We have been indexing a large collection of OCR'd text: about 5 million books
in over 200 languages. With 1.5 billion OCR'd pages, even a small OCR error
rate creates a relatively large number of meaningless unique terms. (See
Can anyone suggest any practical solutions to removing some fraction of the
tokens containing OCR errors from our input stream?
One approach would be to try http://issues.apache.org/jira/browse/LUCENE-1812
and filter terms that only appear once in the document.
--
Robert Muir