Perhaps the meta question is this: how often are you actually going to
encounter unBOMed UTF-32 or UTF-16 text? It's pretty rare; certainly I've
never seen it during the development of our language/encoding identifier.
Sure, it's an interesting thought problem, but in practice it doesn't
come up. And fortunately, detecting UTF-8 is relatively easy. The real
problem is differentiating among the ISO 8859-x family, and EUC-CN
vs. EUC-KR; these are wonderfully ambiguous. The key to doing this right
is having a _lot_ of valid training data.

You also have to deal with the oddities of language: I tried one open
source implementation of the Cavnar and Trenkel algorithm that claimed
SHOUTED ENGLISH WAS ACTUALLY CZECH. It's difficult to separate language
detection from encoding detection when dealing with non-Unicode text.

    -tree

--
Tom Emerson                                       Basis Technology Corp.
Software Architect                             http://www.basistech.com
  "Beware the lollipop of mediocrity: lick it once and you suck forever"
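P.S. A minimal sketch of why UTF-8 detection is easy (Python; the
function name is mine, and it deliberately skips the E0/ED/F0/F4
overlong/surrogate corner cases, so it's not a full validator):

    def looks_like_utf8(data: bytes) -> bool:
        # The lead byte fixes the number of continuation bytes, and
        # every continuation byte must fall in 0x80-0xBF, so text in
        # a legacy encoding almost always fails within a few bytes.
        i = 0
        while i < len(data):
            b = data[i]
            if b < 0x80:                 # ASCII
                i += 1
                continue
            if 0xC2 <= b <= 0xDF:        # 2-byte sequence
                n = 1
            elif 0xE0 <= b <= 0xEF:      # 3-byte sequence
                n = 2
            elif 0xF0 <= b <= 0xF4:      # 4-byte sequence
                n = 3
            else:
                return False             # invalid lead byte
            if i + n >= len(data):
                return False             # truncated sequence
            if any(not 0x80 <= data[i + j] <= 0xBF
                   for j in range(1, n + 1)):
                return False             # bad continuation byte
            i += n + 1
        return True

In practice you'd just try data.decode('utf-8') and catch the
UnicodeDecodeError, but spelling it out shows why false positives on
legacy-encoded text are so unlikely.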
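P.P.S. And a toy Cavnar & Trenkel-style profiler (again Python; the
names and the rank cutoff are hypothetical) showing where the shouting
bug comes from: skip the lower() and "HELLO WORLD" builds an n-gram
profile disjoint from the English training profile, so the nearest
match can end up being anything. Czech, in the case I hit.

    from collections import Counter

    def ngram_profile(text, max_n=3, cutoff=300):
        text = text.lower()              # the case fold that was missing
        counts = Counter()
        for n in range(1, max_n + 1):
            for i in range(len(text) - n + 1):
                counts[text[i:i + n]] += 1
        # Rank-ordered list of the most frequent n-grams.
        return [g for g, _ in counts.most_common(cutoff)]

    def out_of_place(doc, ref):
        # Sum of rank differences; n-grams missing from the reference
        # profile get the maximum penalty.
        rank = {g: r for r, g in enumerate(ref)}
        return sum(abs(r - rank.get(g, len(ref)))
                   for r, g in enumerate(doc))

    # Classify by minimum out-of-place distance against per-language
    # reference profiles built from a _lot_ of training text, e.g.:
    #   best = min(profiles, key=lambda lang:
    #              out_of_place(ngram_profile(doc), profiles[lang]))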