On Thu, Mar 11, 2010 at 4:14 PM, Tom Burton-West <tburtonw...@gmail.com> wrote:
>
> Thanks Simon,
>
> We can probably implement your suggestion about runs of punctuation and
> unlikely mixes of alpha/numeric/punctuation. I'm also thinking about
> looking for unlikely mixes of unicode character blocks. For example some of
> the CJK material ends up with Cyrillic characters. (except we would have to
> watch out for any Russian-Chinese dictionaries:)
>

OK, this is a new one for me. I am just curious: have you figured out why this is happening?

Separately, I would love to see some character frequency data for your non-English text. Are you OCR'ing that data too? Are you using Unicode normalization or anything else to prevent an explosion of terms that are really the same?

--
Robert Muir
rcm...@gmail.com
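
To make the two ideas in this thread concrete (flagging unlikely mixes of character blocks, and normalizing terms that are really the same), here is a rough Python sketch. It is only an illustration under stated assumptions, not what anyone in the thread actually runs: the function names are hypothetical, the check is done per token, and the "script" is approximated by the first word of each character's Unicode name because the standard library does not expose the script property directly.

```python
import unicodedata


def rough_scripts(token):
    """Return rough script labels for the letters in a token.

    Assumption: the first word of the Unicode character name
    (e.g. "CJK", "CYRILLIC", "LATIN") is a good-enough proxy for the script.
    """
    scripts = set()
    for ch in token:
        if not ch.isalpha():
            continue  # ignore digits, punctuation, whitespace
        name = unicodedata.name(ch, "")
        if name:
            scripts.add(name.split()[0])
    return scripts


def is_suspicious_mix(token):
    """Flag a single token that contains both CJK and Cyrillic letters."""
    scripts = rough_scripts(token)
    return "CJK" in scripts and "CYRILLIC" in scripts


def normalize_term(token):
    """Fold compatibility variants (fullwidth forms, decomposed accents)
    into one form using NFKC, so equivalent terms index identically."""
    return unicodedata.normalize("NFKC", token)


if __name__ == "__main__":
    for tok in ["中文", "слово", "中д", "ＡＢＣ"]:
        print(repr(tok), sorted(rough_scripts(tok)),
              is_suspicious_mix(tok), repr(normalize_term(tok)))
```

Checking within a token rather than across a whole document is a deliberate choice in this sketch: a Russian-Chinese dictionary would normally keep each word in a single script, so it should not trip the check, although that is an assumption about the data rather than something verified here.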