Doug Ewell wrote:

> In UTF-16 practically any sequence of bytes is valid, and since you
> can't assume you know the language, you can't employ distribution
> statistics. Twelve years ago, when most text was not Unicode and all
> Unicode text was UTF-16, Microsoft documentation suggested a heuristic
> of checking every other byte to see if it was zero, which of course
> would only work for Latin-1 text encoded in UTF-16.

I beg to differ. IMHO, analyzing zero bytes is a viable method for detecting BOM-less UTF-16 and UTF-32. BTW, I didn't know (and I don't much care) that this method was first suggested by Microsoft: to me, it seems quite self-evident.

It is extremely unlikely that a text file encoded in any single- or multi-byte encoding (including UTF-8) would contain a zero byte, so the presence of zero bytes is a strong enough hint for UTF-16 (or UCS-2) or UTF-32.

The next step is distinguishing between UTF-16 and UTF-32. (In what follows, byte indexes are counted from 1.) A bullet-proof negative heuristic for UTF-32 is that a text file *cannot* be UTF-32 unless at least 1/4 of its bytes are zero: no code point lies above U+10FFFF, so the most significant byte of every 32-bit unit is zero. A positive heuristic for UTF-32 is detecting sequences of two consecutive zero bytes, the first of which has an odd index: as it is very unlikely that a UTF-16 file would contain a NULL character, zero 16-bit words must be part of a UTF-32 character. The combination of these two methods should be enough to tell UTF-16 and UTF-32 apart.

Once you have determined whether the file is in UTF-16 or in UTF-32, a statistical analysis of the *indexes* of zero bytes should be enough to determine the UTF's endianness. UTF-16 is likely to be little-endian if zero bytes are more frequent at even indexes than at odd indexes, and vice versa. This is due to the fact that, in any language, shared characters in the Latin-1 range (controls, space, digits, punctuation, etc.) should be more frequent than occasional code points of the form <U+??00>.

For UTF-32, determining endianness is even simpler: if *all* bytes whose index is divisible by 4 are zero, then it is little-endian, else it is big-endian.

Of course, all this works only if the basic assumption holds that the file is a plain text file: this method is not enough to tell text files apart from binary files.

_ Marco
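
For illustration, the heuristics above could be sketched in Python roughly as follows. This is only a minimal sketch, not a tested detector: the function name `guess_utf_by_zero_bytes` is made up for the example, the check that the length is a multiple of 4 is an extra assumption not mentioned in the post, and the code uses 0-based indexes, so "even index" in the post (counting from 1) corresponds to an odd offset here.

```python
from typing import Optional


def guess_utf_by_zero_bytes(data: bytes) -> Optional[str]:
    """Guess the encoding of BOM-less text from the positions of zero bytes.

    Returns 'utf-32-le', 'utf-32-be', 'utf-16-le', 'utf-16-be', or None
    when no zero bytes are found or the evidence is balanced.
    Hypothetical helper: assumes `data` really is plain text.
    """
    # 0-based offsets of all zero bytes.
    zeros = [i for i, b in enumerate(data) if b == 0]
    if not zeros:
        # No zero bytes: most likely a single- or multi-byte encoding.
        return None

    # Negative test: UTF-32 needs at least 1/4 of the bytes to be zero.
    # The length check is an extra assumption added for this sketch.
    could_be_utf32 = len(data) % 4 == 0 and 4 * len(zeros) >= len(data)

    # Positive test: a zero 16-bit word on a 16-bit boundary
    # (offsets 0, 2, 4, ... here; odd indexes when counting from 1).
    has_zero_word = any(
        data[i] == 0 and data[i + 1] == 0
        for i in range(0, len(data) - 1, 2)
    )

    if could_be_utf32 and has_zero_word:
        # Last byte of every 4-byte unit always zero: little-endian.
        if all(data[i] == 0 for i in range(3, len(data), 4)):
            return "utf-32-le"
        return "utf-32-be"

    # UTF-16: compare how often zero bytes fall on even vs. odd offsets.
    even = sum(1 for i in zeros if i % 2 == 0)
    odd = len(zeros) - even
    if odd > even:
        return "utf-16-le"  # zero high bytes follow the low bytes
    if even > odd:
        return "utf-16-be"  # zero high bytes come first
    return None


if __name__ == "__main__":
    for codec in ("utf-8", "utf-16-le", "utf-16-be", "utf-32-le", "utf-32-be"):
        sample = "Hello, world".encode(codec)
        print(codec, "->", guess_utf_by_zero_bytes(sample))
```

Note that a non-Latin text still gives a usable signal as long as it contains some spaces, digits or punctuation: in a Greek sentence encoded as UTF-16, the only zero bytes come from those shared characters, and they all fall on the same side.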

