RE: Detecting encoding in Plain text

Marco Cimarosti Mon, 12 Jan 2004 04:49:55 -0800

Doug Ewell wrote:
> In UTF-16 practically any sequence of bytes is valid, and since you
> can't assume you know the language, you can't employ distribution
> statistics.  Twelve years ago, when most text was not Unicode and all
> Unicode text was UTF-16, Microsoft documentation suggested a heuristic
> of checking every other byte to see if it was zero, which of course
> would only work for Latin-1 text encoded in UTF-16.


I beg to differ. IMHO, analyzing zero bytes is a viable for detecting
BOM-less UTF-16 and UTF-32. BTW, I didn't know (and I don't quite care) that
this method was suggested first by Microsoft: to me, it seems quite
self-evident.

It is extremely unlikely that a text file encoded in any single- or
multi-byte encoding (including UTF-8) would contain a zero byte, so the
presence of zero bytes is a strong enough hint for UTF-16 (or UCS-2) or
UTF-32.

The next step is distinguishing between UTF-16 and UTF-32. A bullet-proof
negative heuristic for UTF-32, is that a text file *cannot* be UTF-32 unless
at least 1/4 of its bytes are zero. A positive heuristics for UTF-32 is
detecting sequences of two consecutive zero bytes, the first of which having
an odd index: as it is very unlikely that a UTF-16 file would a NULL
character, zero 16-bit words must be part of a UTF-32 character. The
combination of these two methods is pretty enough to tell apart UTF-16 and
UTF-32.

Once you determined whether the file is in UTF-16 or in UTF-32, a
statistical analysis of the *indexes* of zero bytes should be pretty enough
to determine the UTF's endianness. UTF-16 is likely to be little-endian if
zero bytes are more frequent at even indexes than at odd indexed, and vice
versa. This is due to the fact that, in any language, shared characters in
the Latin-1 range (controls, space, digits, punctuation, etc.) should be
more frequent than occasional code points of form <U+??00>. For UTF-32,
determining endianness is even simpler: if *all* bytes whose index is
divisible by 4 are zero, then it is little-endian, else it is big-endian.

Of course, all this works only if it is true the basic assumption that the
file is a plain text file: this method is not quite enough for telling apart
text files from binary files.

_ Marco

RE: Detecting encoding in Plain text

Reply via email to