Re: Detecting CJKV / Asian language pages

Gavin Thomas Nicol Tue, 02 Aug 2005 08:25:50 -0700


On Aug 1, 2005, at 5:31 PM, Ken Krugler wrote:

Or you can derive the language from the host URL, if it includes a country code.

That's not really sufficient... many Japanese sites also have pages in English. Actually, that's true for most non-English sites from what I've seen.
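
To make the country-code idea concrete, here is a minimal sketch in Python. The mapping table and the function name language_hint_from_url are made up for illustration, and the result is only a hint: a .jp host serving English pages will still come back as "ja", which is exactly the weakness noted above.

```python
from urllib.parse import urlparse

# Hypothetical, deliberately tiny mapping from country-code TLD to a
# language guess. A real table would be larger and still only a hint.
CCTLD_LANGUAGE_HINTS = {"jp": "ja", "kr": "ko", "cn": "zh", "tw": "zh", "vn": "vi"}


def language_hint_from_url(url):
    """Return a language guess from the host's country code, or None."""
    host = urlparse(url).hostname or ""
    tld = host.rsplit(".", 1)[-1]
    return CCTLD_LANGUAGE_HINTS.get(tld)


# The path suggests an English page, but the host alone still says "ja".
print(language_hint_from_url("http://www.example.co.jp/en/index.html"))
print(language_hint_from_url("http://www.example.com/"))
```

The hint is probably best used as a prior to be combined with what the page content itself says, not as the answer.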

It's hard to detect all the various encodings... EUC-JP, SHIFT-JIS, ISO-2022-KR/JP, BIG5, etc. and many servers do not correctly identify the encodings.
See the latest release of ICU (3.4), which now supports charset detection.
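
This is not how ICU's detector works; as a much cruder stand-in, the sketch below simply tries a strict decode with each candidate encoding using Python's standard codecs and keeps the ones that do not fail. The candidate list and the name plausible_encodings are assumptions for illustration.

```python
# Candidate encodings taken from the list above, using Python's codec names.
CANDIDATE_ENCODINGS = ["iso2022_jp", "iso2022_kr", "euc_jp", "shift_jis", "big5"]


def plausible_encodings(data):
    """Return the candidate encodings under which `data` decodes without error.

    This only rules encodings out. It says nothing about which survivor
    is the right one, and several can survive for the same bytes.
    """
    plausible = []
    for name in CANDIDATE_ENCODINGS:
        try:
            data.decode(name)  # strict decoding: any invalid sequence raises
        except UnicodeDecodeError:
            continue
        plausible.append(name)
    return plausible


sample = "日本語のページです"
print(plausible_encodings(sample.encode("euc_jp")))
print(plausible_encodings(sample.encode("shift_jis")))
```

Because the multi-byte ranges of these encodings overlap, a successful decode is weak evidence on its own, which is why a detector needs some kind of scoring on top.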

Yes, I forgot about that... but even then I wonder how well it will do. For largish blocks of text (1k or so) it's not bad... you can use statistical modelling to give you accurate probabilities, but for smallish blocks (e.g. query strings) you have a much harder time.
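
A toy illustration of the scoring idea, and of why short inputs are hard: decode the bytes under each candidate and measure what fraction of the result is kana. This is a crude heuristic standing in for a real statistical model, assumed here only for Japanese; the names kana_ratio and score_encodings are hypothetical.

```python
def kana_ratio(text):
    """Fraction of characters in the hiragana/katakana blocks (U+3040..U+30FF)."""
    if not text:
        return 0.0
    kana = sum(1 for ch in text if "\u3040" <= ch <= "\u30ff")
    return kana / len(text)


def score_encodings(data, encodings=("euc_jp", "shift_jis")):
    """Map each encoding that decodes cleanly to the kana ratio of its output."""
    scores = {}
    for name in encodings:
        try:
            text = data.decode(name)
        except UnicodeDecodeError:
            continue  # not a valid byte sequence for this encoding
        scores[name] = kana_ratio(text)
    return scores


# A sentence written in hiragana: the correct decoding stands out.
print(score_encodings("これはにほんごのぶんしょうです".encode("euc_jp")))

# A short ASCII query string decodes the same way under both candidates,
# so the scores tie and carry no information about the encoding.
print(score_encodings(b"q=tokyo"))
```

With a kilobyte of text the ratio is estimated from hundreds of characters; with a query string it may rest on a handful of bytes or, as in the second call, on nothing at all.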
