Philippe Verdy wrote:
> the input:determine strategy will work fine for UTF-8 or SCSU, provided that
> the leading BOM is explicitly encoded. ...

With "determine" I do not mean to restrict to checking for a BOM. There are several ways to determine the input charset, depending on the protocol and document type etc., including but not limited to BOM, protocol field, in-doc specification, heuristics (guessing)...


About the BOM, or more precisely the Unicode signature byte sequences: Despite a theoretical ambiguity, it works quite well for discovering a Unicode charset, but unprepared and Unicode-unaware tools may choke on it.
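
To illustrate, here is a minimal sketch of such a signature check in Python (nothing from any particular library; the function name is made up). The sequences are tested longest-first because of the theoretical ambiguity mentioned above: FF FE 00 00 is both the UTF-32LE signature and a UTF-16LE signature followed by U+0000.

    import codecs

    # Unicode signature byte sequences, longest first, so that the UTF-32LE
    # signature (FF FE 00 00) is not misread as the UTF-16LE one (FF FE).
    SIGNATURES = [
        (codecs.BOM_UTF32_BE, "UTF-32BE"),
        (codecs.BOM_UTF32_LE, "UTF-32LE"),
        (codecs.BOM_UTF8,     "UTF-8"),
        (codecs.BOM_UTF16_BE, "UTF-16BE"),
        (codecs.BOM_UTF16_LE, "UTF-16LE"),
    ]

    def sniff_signature(data):
        """Return (charset name, signature length), or (None, 0) if no match."""
        for sig, name in SIGNATURES:
            if data.startswith(sig):
                return name, len(sig)
        return None, 0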

> The idea that "if a text (without BOM) looks like valid UTF-8, then it is
> UTF-8; else it uses another legacy encoding" does not work in practice and
> also leads to too many false positives.

It may not work in all cases, but something that works in >95% or so of cases in practice seems to me to work quite well.
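
False positives are rare because bytes from a legacy 8-bit charset only rarely happen to form well-formed UTF-8 multi-byte sequences. As a rough Python sketch of the heuristic (the fallback charset is an arbitrary placeholder, not a recommendation):

    def guess_charset(data, fallback="windows-1252"):
        """If the bytes are well-formed UTF-8, call it UTF-8; otherwise assume
        some single-byte legacy charset.  Pure ASCII decodes either way, so it
        is harmlessly reported as UTF-8."""
        try:
            data.decode("UTF-8")
            return "UTF-8"
        except UnicodeDecodeError:
            return fallback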


>> - if you are absolutely certain that they suffice - use US-ASCII or ISO
>> 8859-1.

> OK for US-ASCII, but even ISO-8859-1 should no more be used without explicit labelling (with meta-data or other means) of its encoding: ...

If possible, *all* text should have its charset specified in some way.
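
For example, with MIME the label travels in the Content-Type header, so the receiver never has to guess; a quick Python sketch using only the standard library:

    from email.mime.text import MIMEText

    # The charset is declared explicitly in the protocol metadata
    # (the Content-Type header) rather than left for the receiver to guess.
    msg = MIMEText("Grüße", _subtype="plain", _charset="utf-8")
    print(msg["Content-Type"])    # text/plain; charset="utf-8"

The same idea applies to HTTP's Content-Type charset parameter, HTML's <meta charset=...>, and the XML encoding declaration.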


> I just wonder why Unicode still maintains that a BOM _should_ not be used in
> UTF-8 texts.

I believe that "Unicode" does not say that. It is a concern among users of Unicode-unaware software, such as classic Unix-style command-line tools that have been slow to add good Unicode support. You are right that the signatures work quite well with more modern tools.
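
Many modern environments already handle the signature transparently. In Python, for instance, the "utf-8-sig" codec writes the EF BB BF signature and strips it when reading, while a plain "utf-8" reader sees it as a leading U+FEFF, which is exactly what trips up signature-unaware tools. A small sketch (the file name is made up):

    with open("note.txt", "w", encoding="utf-8-sig") as f:
        f.write("hello")                 # stored as EF BB BF + "hello"

    with open("note.txt", encoding="utf-8-sig") as f:
        print(repr(f.read()))            # 'hello'        (signature stripped)
    with open("note.txt", encoding="utf-8") as f:
        print(repr(f.read()))            # '\ufeffhello'  (signature visible as U+FEFF)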


markus



