Re: Detecting encoding in Plain text

Peter Kirk Mon, 12 Jan 2004 05:53:58 -0800

On 12/01/2004 03:09, Marco Cimarosti wrote:

...

It is extremely unlikely that a text file encoded in any single- or
multi-byte encoding (including UTF-8) would contain a zero byte, so the
presence of zero bytes is a strong enough hint for UTF-16 (or UCS-2) or
UTF-32.

Is it not dangerous to assume that U+0000 is not used? It is a valid character and is commonly used, e.g. as a string terminator. Perhaps it should not appear in truly plain text, yet it is likely to occur in files that are basically text with certain kinds of markup.
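
To illustrate the concern, here is a minimal Python sketch. The sample data and the function names are hypothetical, made up for this example. It applies the "any zero byte" test to UTF-8 records terminated by U+0000 and to genuine UTF-16 text, and then compares the share of zero bytes, which is only one possible refinement and not something proposed in the quoted message.

```python
def has_zero_byte(data: bytes) -> bool:
    """The quoted hint: any zero byte suggests UTF-16/UCS-2 or UTF-32."""
    return b"\x00" in data


def zero_byte_ratio(data: bytes) -> float:
    """Share of zero bytes in the data (assumed refinement, for comparison only)."""
    return data.count(0) / len(data) if data else 0.0


# Hypothetical samples: UTF-8 records terminated by U+0000, and UTF-16 text.
utf8_records = "name=Peter\x00lang=en\x00".encode("utf-8")
utf16_text = "name=Peter".encode("utf-16-le")

print(has_zero_byte(utf8_records))  # True, although the data is UTF-8
print(has_zero_byte(utf16_text))    # True

# The ratio separates these two samples, but it is still only a heuristic.
print(zero_byte_ratio(utf8_records) < zero_byte_ratio(utf16_text))  # True
```

Both samples pass the zero-byte test, so on its own the test cannot tell them apart. The ratio works here only because the UTF-16 sample is pure ASCII; UTF-16 text in a script with few U+00?? or U+??00 characters would contain few zero bytes.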

... This is due to the fact that, in any language, shared characters in
the Latin-1 range (controls, space, digits, punctuation, etc.) should be
more frequent than occasional code points of form <U+??00>. ...

This one also looks dangerous. Some scripts have their own digits and punctuation; not all scripts use spaces; and controls are not necessarily present if U+2028 LINE SEPARATOR is used for line breaks. Conversely, some characters of the form U+??00 may be used rather commonly in a particular script and so occur frequently in some text files.
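
A small Python sketch of such a counter-example follows. The sample strings and the function name are hypothetical, and the counting is only a guess at the kind of statistic the quoted passage relies on, since the full heuristic is elided above. The CJK sample uses U+4E00 (the ideograph for "one"), U+3000 IDEOGRAPHIC SPACE, U+3002 IDEOGRAPHIC FULL STOP and U+2028 LINE SEPARATOR, so it contains no Latin-1 range characters at all.

```python
def count_classes(text: str) -> tuple[int, int]:
    """Count (code points in U+0000..U+00FF, code points of the form U+xx00 above U+00FF)."""
    latin1 = sum(1 for ch in text if ord(ch) <= 0xFF)
    xx00 = sum(1 for ch in text if ord(ch) > 0xFF and ord(ch) & 0xFF == 0)
    return latin1, xx00


# Hypothetical samples, chosen only to illustrate the two extremes.
english_sample = "Line 1, page 2.\n"
cjk_sample = "一月一日\u3000一人一回。\u2028"

print(count_classes(english_sample))  # (16, 0)
print(count_classes(cjk_sample))      # (0, 5)
```

In the second sample the U+??00 characters outnumber the Latin-1 range characters five to zero, which is the opposite of what the quoted assumption expects.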


--
Peter Kirk
[EMAIL PROTECTED] (personal)
[EMAIL PROTECTED] (work)
http://www.qaya.org/
