On Thu, Mar 11, 2010 at 4:14 PM, Tom Burton-West <tburtonw...@gmail.com> wrote:
>
> Thanks Simon,
>
> We can probably implement your suggestion about runs of punctuation and
> unlikely mixes of alpha/numeric/punctuation.  I'm also thinking about
> looking for unlikely mixes of unicode character blocks.  For example some of
> the CJK material ends up with Cyrillic characters. (except we would have to
> watch out for any Russian-Chinese dictionaries:)
>

OK, this is a new one for me. I am just curious: have you figured out
why this is happening?
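
For what it's worth, here is a rough sketch of the "unlikely mix of
character blocks" check you describe. It is only an approximation: Python's
standard library does not expose the Unicode Script property, so this uses
the first word of each character's Unicode name as a stand-in. The function
names and the choice of CJK plus Cyrillic as the "unlikely" pair are my own
assumptions, and a real Russian-Chinese dictionary would need a whitelist.

```python
import unicodedata

def scripts_in(token):
    """Approximate the scripts in a token from Unicode character names.

    The first word of the name ('CJK', 'CYRILLIC', 'LATIN', ...) is used as
    a rough proxy for the script; this is a heuristic, not the real
    Unicode Script property.
    """
    found = set()
    for ch in token:
        if not ch.isalpha():
            continue  # ignore digits, punctuation, and whitespace
        name = unicodedata.name(ch, "")
        if name:
            found.add(name.split()[0])
    return found

def is_suspicious_mix(token, unlikely=frozenset({"CJK", "CYRILLIC"})):
    """Return True if the token contains every script in `unlikely`."""
    return unlikely <= scripts_in(token)
```

Running this over the term dictionary and looking at the flagged tokens
might also hint at where the Cyrillic is coming from.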

Separately, I would love to see some character frequency data
for your non-English text. Are you OCR'ing that data too? Are you
using Unicode normalization or anything else to prevent an explosion of
terms that are really the same?
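
To illustrate what I mean by normalization, here is a minimal sketch using
Python's unicodedata module. The sample terms and the helper name
term_counts are made up for illustration, and NFKC is just one possible
choice of form; whether compatibility folding is appropriate depends on
your data.

```python
import unicodedata
from collections import Counter

def term_counts(terms, form=None):
    """Count distinct terms, optionally applying a Unicode normalization form."""
    if form:
        terms = (unicodedata.normalize(form, t) for t in terms)
    return Counter(terms)

# Hypothetical sample: the same two words, each encoded two different ways.
terms = [
    "caf\u00e9",            # precomposed e-acute
    "cafe\u0301",           # 'e' followed by a combining acute accent
    "\uff21\uff22\uff23",   # fullwidth "ABC"
    "ABC",
]

raw = term_counts(terms)             # 4 distinct terms
folded = term_counts(terms, "NFKC")  # 2 distinct terms
```

Without normalization each variant becomes its own entry in the term
dictionary; with NFKC the four strings collapse to two.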

-- 
Robert Muir
rcm...@gmail.com
