Re: Cleaning up dirty OCR

Robert Muir Thu, 11 Mar 2010 13:19:02 -0800

On Thu, Mar 11, 2010 at 4:14 PM, Tom Burton-West <[email protected]> wrote:
>
> Thanks Simon,
>
> We can probably implement your suggestion about runs of punctuation and
> unlikely mixes of alpha/numeric/punctuation.  I'm also thinking about
> looking for unlikely mixes of unicode character blocks.  For example some of
> the CJK material ends up with Cyrillic characters. (except we would have to
> watch out for any Russian-Chinese dictionaries:)
>


OK, this is a new one for me. I am just curious: have you figured out
why this is happening?
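
For anyone following along, the mixed-block check Tom describes could be
sketched roughly like this. It is only an illustration, not what anyone
in this thread actually runs: Python's standard library has no script
property, so the first word of each character's Unicode name is used as
a rough stand-in, and the names rough_scripts and looks_suspicious are
made up for this example.

```python
import unicodedata

def rough_scripts(token):
    """Return a rough set of script labels for the letters in token.

    Assumption: the first word of the Unicode character name
    (e.g. "CJK", "CYRILLIC", "LATIN") is a good-enough proxy for script.
    """
    scripts = set()
    for ch in token:
        if not ch.isalpha():
            continue  # ignore digits, punctuation, whitespace
        name = unicodedata.name(ch, "")
        if name:
            scripts.add(name.split()[0])
    return scripts

def looks_suspicious(token):
    """Flag tokens that mix CJK ideographs with Cyrillic letters."""
    scripts = rough_scripts(token)
    return "CJK" in scripts and "CYRILLIC" in scripts
```

A check like this works per token, so a Russian-Chinese dictionary page
with separate Russian and Chinese words would not be flagged, only
tokens that mix both inside a single word.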

Separately, I would love to see some character frequency data for
your non-English text. Are you OCR'ing that data too? Are you
using Unicode normalization or anything else to prevent an explosion
of terms that are really the same?
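
To make the normalization question concrete, here is a minimal sketch of
what I mean, using Python's standard unicodedata module. The choice of
NFKC is an assumption for illustration (another form may suit your data
better), and the helper names are hypothetical.

```python
import unicodedata
from collections import Counter

def normalize_term(term, form="NFKC"):
    """Fold equivalent Unicode sequences into one representation."""
    return unicodedata.normalize(form, term)

def distinct_terms(terms, form="NFKC"):
    """Return the set of terms that remain distinct after normalization."""
    return {normalize_term(t, form) for t in terms}

def char_frequencies(text, form="NFKC"):
    """Count characters in the normalized text, skipping whitespace."""
    normalized = unicodedata.normalize(form, text)
    return Counter(ch for ch in normalized if not ch.isspace())
```

With this, a precomposed "é" and an "e" followed by a combining acute
accent collapse into one term, and fullwidth Latin letters fold to their
ASCII forms, so they stop showing up as separate index terms.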

-- 
Robert Muir
[email protected]
