Quoting Marco Cimarosti <[EMAIL PROTECTED]>:

> Doug Ewell wrote:
> > In UTF-16 practically any sequence of bytes is valid, and since you
> > can't assume you know the language, you can't employ distribution
> > statistics.  Twelve years ago, when most text was not Unicode and all
> > Unicode text was UTF-16, Microsoft documentation suggested a heuristic
> > of checking every other byte to see if it was zero, which of course
> > would only work for Latin-1 text encoded in UTF-16.
> 
> I beg to differ. IMHO, analyzing zero bytes is a viable method for detecting
> BOM-less UTF-16 and UTF-32. BTW, I didn't know (and I don't quite care) that
> this method was suggested first by Microsoft: to me, it seems quite
> self-evident.
> 
> It is extremely unlikely that a text file encoded in any single- or
> multi-byte encoding (including UTF-8) would contain a zero byte, so the
> presence of zero bytes is a strong enough hint for UTF-16 (or UCS-2) or
> UTF-32.

False positives can be caused by U+0000 (almost always encoded as a 0x00 octet), 
which some applications do use in text files. Hence you need to look for 
sequences with a null octet in every other position, which in turn increases the 
risk of false negatives:

False negatives can be caused by text that doesn't contain any Latin-1 
characters, since code units outside the U+0000 to U+00FF range generally have 
no zero octet in UTF-16 and so show no tell-tale pattern.
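For illustration, a rough sketch of that every-other-octet check in Python (the 
function name, threshold and cut-off values are my own illustrative choices, not 
taken from any existing library):

    from typing import Optional

    def guess_bomless_utf16(data: bytes, threshold: float = 0.4) -> Optional[str]:
        # Illustrative heuristic only.  For text that is mostly in the
        # U+0000..U+00FF range, UTF-16BE puts 0x00 at even offsets and
        # UTF-16LE puts 0x00 at odd offsets.
        if len(data) < 2:
            return None
        even = data[0::2]   # octets at offsets 0, 2, 4, ...
        odd = data[1::2]    # octets at offsets 1, 3, 5, ...
        even_zero = even.count(0) / len(even)
        odd_zero = odd.count(0) / len(odd)
        if even_zero > threshold and odd_zero < 0.05:
            return "UTF-16BE"
        if odd_zero > threshold and even_zero < 0.05:
            return "UTF-16LE"
        return None

The same idea extends to UTF-32 by counting zero octets per position modulo four.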

The method can be used reliably with text files that are guaranteed to contain 
large amounts of Latin-1 characters, in particular files in which certain ASCII 
characters are given an application-specific meaning: XML and HTML files, 
comma-delimited files, tab-delimited files, vCards and so on. It is particularly 
reliable where certain ASCII characters will always begin the document (e.g. 
XML), as sketched below.
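As a concrete case, the fixed ASCII prolog "<?xml" gives each BOM-less encoding 
form a distinctive opening octet pattern, so a few prefix comparisons are enough 
(again a hypothetical sketch, similar in spirit to the autodetection appendix of 
the XML specification):

    XML_PROLOG_SIGNATURES = {
        b"\x3C\x00\x3F\x00\x78\x00\x6D\x00\x6C\x00": "UTF-16LE",  # < ? x m l
        b"\x00\x3C\x00\x3F\x00\x78\x00\x6D\x00\x6C": "UTF-16BE",
        b"\x3C\x00\x00\x00\x3F\x00\x00\x00": "UTF-32LE",          # < ?
        b"\x00\x00\x00\x3C\x00\x00\x00\x3F": "UTF-32BE",
        b"\x3C\x3F\x78\x6D\x6C": "UTF-8 or another ASCII-compatible encoding",
    }

    def sniff_xml_encoding(head: bytes):
        # Compare the start of the file against the known encodings of "<?xml".
        for signature, family in XML_PROLOG_SIGNATURES.items():
            if head.startswith(signature):
                return family
        return None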

--
Jon Hanna
<http://www.hackcraft.net/>
*Thought provoking quote goes here*
