On 12/01/2004 03:09, Marco Cimarosti wrote:
> ...
> It is extremely unlikely that a text file encoded in any single- or
> multi-byte encoding (including UTF-8) would contain a zero byte, so the
> presence of zero bytes is a strong enough hint for UTF-16 (or UCS-2) or
> UTF-32.
Is it not dangerous to assume that U+0000 is not used? This is a valid
character and is commonly used, e.g. as a string terminator. Perhaps it
should not be used in truly plain text, but it is likely to occur in
files which are basically text but include certain kinds of markup.
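
To make the concern concrete, here is a minimal Python sketch. The
function name and the sample data are my own assumptions for
illustration, not anything from the original proposal. It shows that
U+0000 encodes to a single zero byte in UTF-8, so a detector that treats
any zero byte as a hint for UTF-16 or UTF-32 would misfire on such a
file.

```python
def has_zero_byte(data: bytes) -> bool:
    """Naive hint (hypothetical): any zero byte suggests UTF-16/UTF-32."""
    return b"\x00" in data


# Assumed example: text fields separated by U+0000, stored as UTF-8.
nul_delimited = "name\u0000value\u0000".encode("utf-8")

# Ordinary UTF-8 text without U+0000, for comparison.
plain = "name=value\n".encode("utf-8")

print(has_zero_byte(nul_delimited))  # the hint fires although this is valid UTF-8
print(has_zero_byte(plain))
```

The first file is perfectly valid UTF-8 and decodes back to the same
string, yet the zero-byte hint would classify it as a wide encoding.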
> ... This is due to the fact that, in any language, shared characters in
> the Latin-1 range (controls, space, digits, punctuation, etc.) should be
> more frequent than occasional code points of form <U+??00>. ...
This one also looks dangerous. Some scripts include their own digits and
punctuation; not all scripts use spaces; and controls are not
necessarily used if U+2028 LINE SEPARATOR is used for new lines.
Moreover, there may be some characters of the form U+??00 which are used
rather commonly in a particular script and so occur commonly in some
text files.
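
As a rough illustration, here is one possible reading of that heuristic
in Python: guess the UTF-16 byte order from whether zero bytes fall
mostly on even or odd offsets. The function is my own sketch under that
assumption, not the implementation being discussed. The sample uses
U+4E00 (the CJK ideograph for "one"), U+3000 IDEOGRAPHIC SPACE and
U+2028 LINE SEPARATOR, so it contains no Latin-1 characters at all.

```python
def guess_utf16_byte_order(data: bytes) -> str:
    """Guess byte order from zero-byte positions (hypothetical heuristic).

    Assumes characters of the form U+00?? outnumber those of the form
    U+??00, so the zero byte is usually the high byte of a code unit.
    """
    even_zeros = data[0::2].count(0)  # zeros at offsets 0, 2, 4, ...
    odd_zeros = data[1::2].count(0)   # zeros at offsets 1, 3, 5, ...
    if even_zeros > odd_zeros:
        return "big"      # high byte comes first
    if odd_zeros > even_zeros:
        return "little"   # high byte comes second
    return "unknown"


ascii_text = "abc".encode("utf-16-be")                # 00 61 00 62 00 63
cjk_text = "\u4e00\u3000\u4e00\u2028".encode("utf-16-be")  # 4E 00 30 00 4E 00 20 28

print(guess_utf16_byte_order(ascii_text))  # big (correct)
print(guess_utf16_byte_order(cjk_text))    # little (wrong: the data is big-endian)
```

In the second sample every zero byte is a low byte, so the guess comes
out backwards. Real text would of course be longer and more mixed, but
the example shows the kind of input on which the assumption fails.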
--
Peter Kirk
[EMAIL PROTECTED] (personal)
[EMAIL PROTECTED] (work)
http://www.qaya.org/