One thing I have done in the past that was along similar lines: If you know that it is a UTF, and if you know that you support the latest version of Unicode, then you can walk through the bytes in 7 parallel paths, with each path fetching a code point in one of the 7 encoding schemes and testing it. If you hit an illegal sequence or unassigned code point, then you 'turn off' that path. If you are left with a single path at any point, then jump to a faster routine to do the rest of the conversion. (I actually had 8 paths, since I also could have Latin-1.)
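A minimal sketch of this parallel-path idea in Python (my own illustration, not Mark's code): for simplicity it runs each candidate decoder over the whole buffer rather than walking all paths in lockstep, checks only well-formedness (not unassigned code points), and omits SCSU, which Python's codec set does not include.

```python
import codecs

# Candidate paths; Mark's version had 8, including Latin-1.
PATHS = ["utf-8", "utf-16-le", "utf-16-be",
         "utf-32-le", "utf-32-be", "latin-1"]

def surviving_paths(data: bytes):
    """Return the candidate encodings whose path was never 'turned off'."""
    alive = []
    for name in PATHS:
        decoder = codecs.getincrementaldecoder(name)("strict")
        try:
            decoder.decode(data, final=True)  # illegal sequence -> error
        except UnicodeDecodeError:
            continue  # this path is turned off
        alive.append(name)
    return alive
```

Note that Latin-1 never turns off (every byte sequence is valid Latin-1), which is exactly why a priority order among surviving paths is still needed.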
I never put in anything to settle the cases where you end up with more than one path, except for a simple priority order. In those rare cases where necessary, I suspect something simple like capturing the frequency of some common characters, such as new lines, space, and certain punctuation, and some uncommon characters (most controls) would go a long way.

Mark
__________________________________
http://www.macchiato.com

----- Original Message -----
From: "Doug Ewell" <[EMAIL PROTECTED]>
To: "Unicode Mailing List" <[EMAIL PROTECTED]>
Cc: "Brijesh Sharma" <[EMAIL PROTECTED]>
Sent: Sun, 2004 Jan 11 21:48
Subject: Re: Detecting encoding in Plain text

> Brijesh Sharma <bssharma at quark dot co dot in> wrote:
>
> > I'm writing a small tool to get text from a txt file into an edit
> > box. Now this txt file could be in any encoding, e.g. UTF-8, UTF-16,
> > Mac Roman, Windows ANSI, Western (ISO-8859-1), JIS, Shift-JIS, etc.
> > My problem is that I can distinguish between UTF-8 and UTF-16 using
> > the BOM. But how do I auto-detect the others?
> > Any kind of help will be appreciated.
>
> This has always been an interesting topic to me, even before the
> Unicode era. The best information I have ever seen on this topic is Li
> and Momoi's paper. To reiterate the URL:
>
> http://www.mozilla.org/projects/intl/UniversalCharsetDetection.html
>
> If you are "writing a small tool," however, you may not have the space
> or time to implement everything Li and Momoi described.
>
> You probably need to divide the problem into (1) detection of Unicode
> encodings and (2) detection of non-Unicode encodings, because these
> are really different problems.
>
> Detecting Unicode encodings, of course, is trivial if the stream
> begins with a signature (BOM, U+FEFF), but as Jon Hanna pointed out,
> you can't always count on the signature being present.
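The signature case Doug mentions is mechanical enough to sketch directly (illustrative code; the table and function name are mine). The one subtlety is ordering: FF FE is a prefix of the UTF-32-LE signature FF FE 00 00, so the longer signatures must be tested first.

```python
# Unicode signature (BOM) table, longest signatures first so that the
# UTF-32-LE signature FF FE 00 00 is not misread as UTF-16-LE's FF FE.
BOMS = [
    (b"\x00\x00\xfe\xff", "utf-32-be"),
    (b"\xff\xfe\x00\x00", "utf-32-le"),
    (b"\xef\xbb\xbf",     "utf-8"),
    (b"\xfe\xff",         "utf-16-be"),
    (b"\xff\xfe",         "utf-16-le"),
]

def sniff_bom(data: bytes):
    """Return the encoding named by a leading signature, or None."""
    for bom, name in BOMS:
        if data.startswith(bom):
            return name
    return None
```

As the thread stresses, a None result proves nothing; it just hands the problem to the heuristics below.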
> You need to rely primarily on what Li and Momoi call the "coding
> scheme method," searching for valid (and invalid) sequences in the
> various encoding schemes. This works well for UTF-8 in particular;
> most non-contrived text that contains at least one valid multibyte
> UTF-8 sequence and no invalid UTF-8 sequences is very likely to be
> UTF-8.
>
> In UTF-16 practically any sequence of bytes is valid, and since you
> can't assume you know the language, you can't employ distribution
> statistics. Twelve years ago, when most text was not Unicode and all
> Unicode text was UTF-16, Microsoft documentation suggested a heuristic
> of checking every other byte to see if it was zero, which of course
> would only work for Latin-1 text encoded in UTF-16. If you need to
> detect the encoding of non-Western-European text, you would have to be
> more sophisticated than this.
>
> Here are some notes I've taken on detecting a byte stream known to be
> in a Unicode encoding scheme (UTF-8, UTF-16, UTF-32, or SCSU).
This is a
> work in progress and is not expected to be complete or perfect, so
> feel free to send corrections and enhancements but not flames:
>
> 0A 00
> - inverse of U+000A LINE FEED
> - U+0A00 = unassigned Gurmukhi code point
> - may indicate little-endian UTF-16
>
> 0A 0D
> - 8-bit line feed + carriage return
> - U+0A0D = unassigned Gurmukhi code point
> - probably indicates 8-bit encoding
>
> 0D 00
> - inverse of U+000D CARRIAGE RETURN
> - U+0D00 = unassigned Malayalam code point
> - may indicate little-endian UTF-16
>
> 0D 0A
> - 8-bit carriage return + line feed
> - U+0D0A = MALAYALAM LETTER UU
> - text should include other Malayalam characters (U+0D00-U+0D7F)
> - otherwise, probably indicates 8-bit encoding
>
> 20 00
> - inverse of U+0020 SPACE
> - U+2000 = EN QUAD (infrequent character)
> - may indicate UTF-16 (probably little-endian)
>
> 28 20
> - inverse of U+2028 LINE SEPARATOR
> - U+2820 = BRAILLE PATTERN DOTS-6
> - text should include other Braille characters (U+2800-U+28FF)
> - may indicate little-endian UTF-16
> - but may also indicate 8-bit 20 28 or 28 20 (space + left parenthesis)
>
> E2 80 A8
> - UTF-8 representation of U+2028 LINE SEPARATOR
> - probably indicates UTF-8
>
> 05 28
> - SCSU representation of U+2028 LINE SEPARATOR
> - U+0528 is unassigned
> - U+2805 is BRAILLE PATTERN DOTS-13
> - should be surrounded by other Braille characters
> - otherwise, probably indicates SCSU
>
> 00 00 00
> - probably a Basic Latin character in UTF-32 (either byte order)
>
> Detecting non-Unicode encodings is quite another matter, and here you
> really need to study the techniques described by Li and Momoi.
> Distinguishing straight ASCII from ISO 8859-1 from Windows-1252 is
> easy -- just check which subsets of Windows-1252 are present -- but
> throwing Mac Roman and East Asian double-byte sets into the mix is
> another matter.
> I once wrote a program to detect the encoding of a text sample known
> to be in one of the following Cyrillic encodings:
>
> - KOI8-R
> - Windows code page 1251
> - ISO 8859-5
> - MS-DOS code page 866
> - MS-DOS code page 855
> - Mac Cyrillic
>
> Given the Unicode scalar values corresponding to each byte value, the
> program calculates the proportion of Cyrillic characters (as opposed
> to punctuation and dingbats) when interpreted in each possible
> encoding, and picks the encoding with the highest proportion
> (confidence level). This is a dumbed-down version of Li and Momoi's
> character distribution method, but works surprisingly well so long as
> the text really is in one of these Cyrillic encodings. It fails
> spectacularly for text in Latin-1, Mac Roman, UTF-8, etc. It would
> probably also be unable to detect differences between
> almost-identical character sets, like KOI8-R and KOI8-U.
>
> The smaller your list of "possible" encodings, the easier your job of
> detecting one of them.
>
> -Doug Ewell
> Fullerton, California
> http://users.adelphia.net/~dewell/
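Doug's confidence approach could be re-created along these lines (my reconstruction, not his original code; the exact scoring rule is guessed from his description, counting what share of non-ASCII characters land in the Cyrillic block):

```python
# Reconstruction (not Doug's actual program) of the Cyrillic
# confidence-scoring idea: decode under each candidate encoding and
# score by the share of non-ASCII characters in the Cyrillic block.
CANDIDATES = ("koi8-r", "cp1251", "iso8859-5",
              "cp866", "cp855", "mac-cyrillic")

def guess_cyrillic_encoding(data: bytes, candidates=CANDIDATES) -> str:
    def score(name):
        text = data.decode(name, errors="replace")
        high = [c for c in text if ord(c) >= 0x80]  # ignore plain ASCII
        if not high:
            return 0.0
        # Fraction of non-ASCII characters in U+0400..U+04FF (Cyrillic).
        return sum(0x0400 <= ord(c) <= 0x04FF for c in high) / len(high)
    return max(candidates, key=score)  # ties go to the earlier candidate
```

As Doug says of the original, this fails badly when the text is not actually in one of the listed encodings, and ties between near-identical candidates are settled only by their order in the list.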