From: "Peter Kirk" <[EMAIL PROTECTED]>

> On 12/01/2004 03:09, Marco Cimarosti wrote:
>
> > ...
> >
> > It is extremely unlikely that a text file encoded in any single- or
> > multi-byte encoding (including UTF-8) would contain a zero byte, so the
> > presence of zero bytes is a strong enough hint for UTF-16 (or UCS-2) or
> > UTF-32.
>
> Is it not dangerous to assume that U+0000 is not used? This is a valid
> character and is commonly used e.g. as a string terminator. Perhaps it
> should not be used in truly plain text. But it is likely to occur in
> files which are basically text but include certain kinds of markup.
This character is invalid at least in HTML, XML, XHTML, SGML and text/plain files. Its presence in a file simply indicates that the file is not plain text, so it could have arbitrary supplementary content that does not use any relevant text encoding.

More precisely, I think it's safer to consider that any file which appears to contain NUL characters is not a text file or, if it really is one, that it uses a non-8-bit Unicode encoding scheme such as UTF-16 or UTF-32, or a legacy 16-bit charset. Any attempt to match a file containing a NUL byte as a plain-text file in an 8-bit charset should fail (at least if the autodetection is needed to parse an HTML or XML text file in a browser).

Note that this check extends to the byte 0x01, which also unambiguously indicates that the file, if it really is plain text, cannot use a legacy 8-bit charset but could be matched with UTF-16, UTF-32, SCSU or a legacy 16-bit charset. (However, I can't remember whether this applies to VISCII: does it encode a plain-text Unicode character at position 0x01, instead of a C0 control?)

My opinion is that most C0 and C1 controls are used as part of an out-of-band protocol; they are not valid and should not be present in plain-text files once those have been decoded and converted to Unicode, where only a few should remain: TAB, LF, FF, CR, NEL. Some controls are needed in encoded plain-text files only for certain encoding schemes, but they do not encode actual characters once the encoding scheme has been parsed: BS, SO, SI, ESC, DLE, SS2, SS3... If there is no specific, precise support for these legacy encoding schemes, there should be no attempt to "detect" them by assuming they could be present in a plain-text file.
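The NUL-byte heuristic discussed above can be sketched roughly as follows. This is only an illustrative sketch, not a complete charset detector: the function name and return strings are hypothetical, and a real detector would also need to handle BOM-less UTF-16/UTF-32 (e.g. by looking at the positions of the zero bytes) and the 0x01 case mentioned above.

```python
def guess_encoding_family(data: bytes) -> str:
    """Classify a byte stream using the NUL-byte heuristic.

    Illustrative sketch only; not a full charset detector.
    """
    if 0x00 not in data:
        # No NUL bytes: compatible with any single- or multi-byte
        # 8-bit encoding, including UTF-8 and legacy charsets.
        return "8-bit (e.g. UTF-8 or a legacy charset)"

    # NUL bytes present: either not plain text at all, or a
    # 16/32-bit encoding scheme. Check BOMs, testing the longer
    # UTF-32 BOM before UTF-16 (0xFF 0xFE is a prefix of the
    # little-endian UTF-32 BOM 0xFF 0xFE 0x00 0x00).
    if data.startswith(b"\xff\xfe\x00\x00") or data.startswith(b"\x00\x00\xfe\xff"):
        return "UTF-32 (BOM detected)"
    if data.startswith(b"\xff\xfe") or data.startswith(b"\xfe\xff"):
        return "UTF-16 (BOM detected)"
    return "16/32-bit encoding or binary (NUL bytes, no BOM)"
```

Note the ordering of the BOM tests: since the little-endian UTF-16 BOM is a prefix of the little-endian UTF-32 BOM, a detector that tests UTF-16 first would misclassify UTF-32 input.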