Perhaps a meta question is this: how often are you going to encounter
unBOMed UTF-32 or UTF-16 text? It's pretty rare --- certainly I've never
seen it during the development of our language/encoding identifier.
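
The BOMed case, for contrast, is just a longest-prefix check. A minimal
Python sketch (the BOMS table and sniff_bom are my own names, not from
any particular library); note that the UTF-32 signatures have to be
tested before the UTF-16 ones, since the UTF-32 LE BOM begins with the
UTF-16 LE BOM:

    BOMS = [
        (b"\x00\x00\xfe\xff", "utf-32-be"),
        (b"\xff\xfe\x00\x00", "utf-32-le"),
        (b"\xef\xbb\xbf",     "utf-8-sig"),
        (b"\xfe\xff",         "utf-16-be"),
        (b"\xff\xfe",         "utf-16-le"),
    ]

    def sniff_bom(data: bytes):
        # Return the codec name implied by a leading BOM, or None if
        # the text is BOM-less and needs statistical detection instead.
        for bom, encoding in BOMS:
            if data.startswith(bom):
                return encoding
        return None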

Sure, it's an interesting thought experiment, but in practice it
doesn't happen. And fortunately detecting UTF-8 is relatively easy.
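
"Easy" because UTF-8's multi-byte sequences have a rigid structure
(restricted lead-byte ranges, continuation bytes of the form 10xxxxxx),
so legacy 8-bit text virtually never validates by accident. A sketch of
the validity test, assuming Python:

    def looks_like_utf8(data: bytes) -> bool:
        # Strict decoding enforces UTF-8's structural rules. Note that
        # pure ASCII also passes, so True means "UTF-8 or ASCII".
        try:
            data.decode("utf-8", errors="strict")
            return True
        except UnicodeDecodeError:
            return False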

The real problem is differentiating among the ISO 8859-x family, and
between EUC-CN and EUC-KR. These are wonderfully ambiguous.
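
Ambiguous because validity checks tell you nothing here: every byte
decodes in ISO 8859-1, and EUC-CN and EUC-KR both encode their two-byte
characters with both bytes in the 0xA1-0xFE range, so the same bytes
decode cleanly as either Chinese or Korean. A quick illustration in
Python (gb2312 standing in for EUC-CN):

    sample = b"\xb0\xa1\xb0\xa2"
    print(sample.decode("gb2312"))  # valid-looking Chinese
    print(sample.decode("euc_kr"))  # equally valid-looking Korean
    # Telling them apart takes character-frequency statistics,
    # not decodability.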

The key to doing this right is having _a lot_ of valid training data.
You also have to deal with the oddities of language: I tried one open
source implementation of the Cavnar and Trenkel algorithm that claimed
THAT SHOUTED ENGLISH WAS ACTUALLY CZECH.
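
For reference, Cavnar and Trenkel's method builds a rank-ordered
profile of the most frequent 1- to 5-grams per language and classifies
by "out-of-place" rank distance; one plausible cause of the all-caps
failure is a profile built without case folding. A minimal Python
sketch (function names are mine; the lowercasing step is the fix being
suggested):

    from collections import Counter

    def ngram_profile(text, max_rank=300):
        # Top-ranked 1- to 5-grams over padded, case-folded tokens.
        counts = Counter()
        for token in text.lower().split():
            padded = f"_{token}_"
            for n in range(1, 6):
                for i in range(len(padded) - n + 1):
                    counts[padded[i:i + n]] += 1
        return [g for g, _ in counts.most_common(max_rank)]

    def out_of_place(doc_profile, lang_profile):
        # Sum of rank differences; n-grams absent from the language
        # profile incur the maximum penalty. The language whose profile
        # yields the lowest total wins.
        ranks = {g: r for r, g in enumerate(lang_profile)}
        penalty = len(lang_profile)
        return sum(abs(r - ranks[g]) if g in ranks else penalty
                   for r, g in enumerate(doc_profile))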

It's difficult to separate language detection from encoding detection
when dealing with non-Unicode text.

    -tree  

--
Tom Emerson                                          Basis Technology Corp.
Software Architect                                 http://www.basistech.com
  "Beware the lollipop of mediocrity: lick it once and you suck forever" 
