On Aug 1, 2005, at 5:31 PM, Ken Krugler wrote:
Or you can derive the language from the host URL, if it includes a
country code.
That's not really sufficient... many Japanese sites also have pages
in English. Actually, that's true for most non-English sites from
what I've seen.
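
To make the point above concrete, here is a minimal sketch of the country-code idea. The mapping table and the function name language_hint_from_url are hypothetical, invented for illustration. The result is only a weak hint, since a .jp host can still serve English pages.

```python
from urllib.parse import urlsplit

# Hypothetical, deliberately tiny mapping from country-code TLD to a
# language hint. A real table would be much larger and still imperfect.
CCTLD_LANGUAGE_HINTS = {"jp": "ja", "kr": "ko", "tw": "zh", "de": "de", "fr": "fr"}


def language_hint_from_url(url):
    """Return a language hint based on the host's last label, or None."""
    host = urlsplit(url).hostname
    if not host:
        return None
    # The last dot-separated label of the host is the top-level domain.
    tld = host.rsplit(".", 1)[-1].lower()
    return CCTLD_LANGUAGE_HINTS.get(tld)
```

Note that an English page under a /en/ path on a .jp host would still get the hint "ja", which is exactly the weakness described above.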
It's hard to detect all the various encodings... EUC-JP, SHIFT-JIS,
ISO-2022-KR/JP, BIG5, etc., and many servers do not correctly
identify the encodings.
See the latest release of ICU (3.4), which now supports charset
detection.
Yes, I forgot about that... but even then I wonder how well it will
do. For largish blocks of text (1k or so) it's not bad... you can use
statistical modelling to give you accurate probabilities, but for
smallish blocks (e.g. query strings) you have a much harder time.
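
As a rough illustration of why short inputs are harder, here is a sketch that only checks which candidate encodings can decode a byte string without error. This is not ICU's detector and it does no statistical modelling. The candidate list and the name plausible_encodings are assumptions made for the example.

```python
# Python codec names for the encodings mentioned in this thread.
CANDIDATE_ENCODINGS = ["euc_jp", "shift_jis", "iso2022_jp", "iso2022_kr", "big5"]


def plausible_encodings(data):
    """Return the candidate encodings under which `data` decodes cleanly."""
    matches = []
    for name in CANDIDATE_ENCODINGS:
        try:
            data.decode(name)
        except UnicodeDecodeError:
            # Invalid byte sequence for this encoding, so rule it out.
            continue
        matches.append(name)
    return matches
```

A short query string often decodes cleanly under more than one candidate, so validity alone cannot pick a winner. A longer block gives a statistical model more byte sequences to weigh, which is why detection tends to improve with size.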