Brijesh Sharma <bssharma at quark dot co dot in> wrote:

> I'm writing a small tool to get text from a txt file into an edit box.
> Now this txt file could be in any encoding, e.g. UTF-8, UTF-16, Mac
> Roman, Windows ANSI, Western (ISO-8859-1), JIS, Shift-JIS, etc.
> My problem is that I can distinguish between UTF-8 and UTF-16 using
> the BOM.
> But how do I auto-detect the others?
> Any kind of help will be appreciated.
This has always been an interesting topic to me, even before the Unicode era. The best information I have ever seen on this topic is Li and Momoi's paper. To reiterate the URL:

http://www.mozilla.org/projects/intl/UniversalCharsetDetection.html

If you are "writing a small tool," however, you may not have the space or time to implement everything Li and Momoi described. You probably need to divide the problem into (1) detection of Unicode encodings and (2) detection of non-Unicode encodings, because these are really different problems.

Detecting Unicode encodings, of course, is trivial if the stream begins with a signature (BOM, U+FEFF), but as Jon Hanna pointed out, you can't always count on the signature being present. You need to rely primarily on what Li and Momoi call the "coding scheme method," searching for valid (and invalid) sequences in the various encoding schemes. This works well for UTF-8 in particular; most non-contrived text that contains at least one valid multibyte UTF-8 sequence and no invalid UTF-8 sequences is very likely to be UTF-8. (A rough sketch of this check appears after the notes below.)

In UTF-16 practically any sequence of bytes is valid, and since you can't assume you know the language, you can't employ distribution statistics. Twelve years ago, when most text was not Unicode and all Unicode text was UTF-16, Microsoft documentation suggested a heuristic of checking every other byte to see if it was zero, which of course would only work for Latin-1 text encoded in UTF-16. If you need to detect the encoding of non-Western-European text, you would have to be more sophisticated than this.

Here are some notes I've taken on detecting a byte stream known to be in a Unicode encoding scheme (UTF-8, UTF-16, UTF-32, or SCSU). This is a work in progress and is not expected to be complete or perfect, so feel free to send corrections and enhancements but not flames:

0A 00
    inverse of U+000A LINE FEED
    U+0A00 = unassigned Gurmukhi code point
    may indicate little-endian UTF-16

0A 0D
    8-bit line feed + carriage return
    U+0A0D = unassigned Gurmukhi code point
    probably indicates an 8-bit encoding

0D 00
    inverse of U+000D CARRIAGE RETURN
    U+0D00 = unassigned Malayalam code point
    may indicate little-endian UTF-16

0D 0A
    8-bit carriage return + line feed
    U+0D0A = MALAYALAM LETTER UU
    text should include other Malayalam characters (U+0D00 through U+0D7F)
    otherwise, probably indicates an 8-bit encoding

20 00
    inverse of U+0020 SPACE
    U+2000 = EN QUAD (infrequent character)
    may indicate UTF-16 (probably little-endian)

28 20
    inverse of U+2028 LINE SEPARATOR
    U+2820 = BRAILLE PATTERN DOTS-6
    text should include other Braille characters (U+2800 through U+28FF)
    may indicate little-endian UTF-16
    but may also indicate 8-bit 20 28 or 28 20 (space + left parenthesis)

E2 80 A8
    UTF-8 representation of U+2028 LINE SEPARATOR
    probably indicates UTF-8

05 28
    SCSU representation of U+2028 LINE SEPARATOR
    U+0528 is unassigned; U+2805 is BRAILLE PATTERN DOTS-13 and should be
    surrounded by other Braille characters
    otherwise, probably indicates SCSU

00 00 00
    probably a Basic Latin character in UTF-32 (either byte order)

Detecting non-Unicode encodings is quite another matter, and here you really need to study the techniques described by Li and Momoi. Distinguishing straight ASCII from ISO 8859-1 from Windows-1252 is easy -- just check which subsets of Windows-1252 are present -- but throwing Mac Roman and East Asian double-byte sets into the mix is another matter.
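To put the coding scheme method in concrete terms, here is a minimal sketch in Python of the signature-plus-UTF-8 test described above. The function name and return values are my own invention, and this is only an illustration of the idea, not a complete detector:

    def sniff_unicode_scheme(data):
        """Rough coding-scheme test for a byte stream that may be in a
        Unicode encoding scheme.  Returns a guess, or None if inconclusive."""
        # 1. Trust the signature (BOM) when one is present.
        #    Check UTF-32 before UTF-16, since FF FE is a prefix of FF FE 00 00.
        if data.startswith(b"\xef\xbb\xbf"):
            return "UTF-8"
        if data.startswith((b"\xff\xfe\x00\x00", b"\x00\x00\xfe\xff")):
            return "UTF-32"
        if data.startswith((b"\xff\xfe", b"\xfe\xff")):
            return "UTF-16"
        # 2. No signature: look for valid (and invalid) UTF-8 sequences.
        try:
            data.decode("utf-8")
        except UnicodeDecodeError:
            return None        # not UTF-8; some other test has to decide
        if any(b >= 0x80 for b in data):
            return "UTF-8"     # valid multibyte sequences present
        return "ASCII"         # pure ASCII decodes as almost anything

Note that a None result only means "not UTF-8 and no signature"; deciding between unmarked UTF-16, SCSU, and the 8-bit encodings still takes something like the byte-pair notes above.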
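And the "easy" subset check for 8-bit Western text can be as small as the following sketch (Python again; the function name is invented, and the test is deliberately crude). It relies on the fact that ASCII uses only bytes below 0x80, and that the bytes 0x80 through 0x9F are C1 control codes in ISO 8859-1 -- which almost never occur in real text -- but printable characters in Windows-1252:

    def classify_ascii_8859_1252(data):
        """Crude subset test: US-ASCII < ISO 8859-1 < Windows-1252."""
        if all(b < 0x80 for b in data):
            return "US-ASCII"
        # 0x80-0x9F: C1 controls in ISO 8859-1, but curly quotes, dashes,
        # the euro sign, etc. in Windows-1252.
        if any(0x80 <= b <= 0x9F for b in data):
            return "windows-1252"
        return "iso-8859-1"

Throw Mac Roman or a double-byte East Asian encoding at this and it will cheerfully give you a wrong answer, which is exactly why the harder cases need Li and Momoi's machinery.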
I once wrote a program to detect the encoding of a text sample known to be in one of the following Cyrillic encodings:

- KOI8-R
- Windows code page 1251
- ISO 8859-5
- MS-DOS code page 866
- MS-DOS code page 855
- Mac Cyrillic

Given the Unicode scalar values corresponding to each byte value, the program calculates the proportion of Cyrillic characters (as opposed to punctuation and dingbats) when interpreted in each possible encoding, and picks the encoding with the highest proportion (confidence level). This is a dumbed-down version of Li and Momoi's character distribution method, but works surprisingly well so long as the text really is in one of these Cyrillic encodings. It fails spectacularly for text in Latin-1, Mac Roman, UTF-8, etc. It would probably also be unable to detect differences between almost-identical character sets, like KOI8-R and KOI8-U.

The smaller your list of "possible" encodings, the easier your job of detecting one of them.
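In case it helps, here is a minimal Python sketch of that scoring idea. It is a reconstruction of the approach rather than the original program: the function name is mine, the codec names are Python's, and the score is simply the proportion of high bytes that decode to Cyrillic letters (U+0400 through U+04FF):

    def guess_cyrillic_encoding(data):
        """Pick the candidate encoding whose interpretation of the sample
        yields the highest proportion of Cyrillic letters among high bytes."""
        candidates = ["koi8-r", "cp1251", "iso8859-5",
                      "cp866", "cp855", "mac-cyrillic"]
        def cyrillic_ratio(enc):
            text = data.decode(enc, errors="replace")   # one char per byte
            high = [ch for b, ch in zip(data, text) if b >= 0x80]
            if not high:
                return 0.0                              # plain ASCII: no evidence
            return sum("\u0400" <= ch <= "\u04ff" for ch in high) / len(high)
        return max(candidates, key=cyrillic_ratio)

With only six single-byte candidates the ratios separate cleanly for real Russian text; feed it Latin-1 or UTF-8 and, as I said, it falls over.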
-Doug Ewell
 Fullerton, California
 http://users.adelphia.net/~dewell/
