Philippe Verdy wrote:
> the input:determine strategy will work fine for UTF-8 or SCSU, provided that
> the leading BOM is explicitly encoded. ...

With "determine" I do not mean to restrict to checking for a BOM. There are several ways to determine the input charset, depending on the protocol and document type etc., including but not limited to BOM, protocol field, in-doc specification, heuristics (guessing)...


About the BOM, or more precisely the Unicode signature byte sequences: Despite a theoretical ambiguity, it works quite well for discovering a Unicode charset, but unprepared and Unicode-unaware tools may choke on it.
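
To illustrate, here is a minimal sketch of such a signature check in Python (nothing from any particular library; the function name is made up). The sequences are tested longest-first because of the theoretical ambiguity mentioned above: FF FE 00 00 is both the UTF-32LE signature and a UTF-16LE signature followed by U+0000.

    import codecs

    # Unicode signature byte sequences, longest first, so that the UTF-32LE
    # signature (FF FE 00 00) is not misread as the UTF-16LE one (FF FE).
    SIGNATURES = [
        (codecs.BOM_UTF32_BE, "UTF-32BE"),
        (codecs.BOM_UTF32_LE, "UTF-32LE"),
        (codecs.BOM_UTF8,     "UTF-8"),
        (codecs.BOM_UTF16_BE, "UTF-16BE"),
        (codecs.BOM_UTF16_LE, "UTF-16LE"),
    ]

    def sniff_signature(data):
        """Return (charset name, signature length), or (None, 0) if no match."""
        for sig, name in SIGNATURES:
            if data.startswith(sig):
                return name, len(sig)
        return None, 0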

> The idea that "if a text (without BOM) looks like valid UTF-8, then it is
> UTF-8; else it uses another legacy encoding" does not work in practice and
> also leads to too many false positives.

It may not work in all cases, but something that works in >95% or so of cases in practice seems to me to work quite well.
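
False positives are rare because bytes from a legacy 8-bit charset only rarely happen to form well-formed UTF-8 multi-byte sequences. As a rough Python sketch of the heuristic (the fallback charset is an arbitrary placeholder, not a recommendation):

    def guess_charset(data, fallback="windows-1252"):
        """If the bytes are well-formed UTF-8, call it UTF-8; otherwise assume
        some single-byte legacy charset.  Pure ASCII decodes either way, so it
        is harmlessly reported as UTF-8."""
        try:
            data.decode("UTF-8")
            return "UTF-8"
        except UnicodeDecodeError:
            return fallback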


>> - if you are absolutely certain that they suffice - use US-ASCII or ISO
>> 8859-1.

> OK for US-ASCII, but even ISO-8859-1 should no more be used without explicit labelling (with meta-data or other means) of its encoding: ...

If possible, *all* text should have its charset specified in some way.
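
For example, with MIME the label travels in the Content-Type header, so the receiver never has to guess; a quick Python sketch using only the standard library:

    from email.mime.text import MIMEText

    # The charset is declared explicitly in the protocol metadata
    # (the Content-Type header) rather than left for the receiver to guess.
    msg = MIMEText("Grüße", _subtype="plain", _charset="utf-8")
    print(msg["Content-Type"])    # text/plain; charset="utf-8"

The same idea applies to HTTP's Content-Type charset parameter, HTML's <meta charset=...>, and the XML encoding declaration.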


> I just wonder why Unicode still maintains that a BOM _should_ not be used in
> UTF-8 texts.

I believe that "Unicode" does not say that. It is a concern among users of Unicode-unaware software, such as classic Unix-style command-line tools that have been slow to add good Unicode support. You are right that the signatures work quite well with more modern tools.
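
Many modern environments already handle the signature transparently. In Python, for instance, the "utf-8-sig" codec writes the EF BB BF signature and strips it when reading, while a plain "utf-8" reader sees it as a leading U+FEFF, which is exactly what trips up signature-unaware tools. A small sketch (the file name is made up):

    with open("note.txt", "w", encoding="utf-8-sig") as f:
        f.write("hello")                 # stored as EF BB BF + "hello"

    with open("note.txt", encoding="utf-8-sig") as f:
        print(repr(f.read()))            # 'hello'        (signature stripped)
    with open("note.txt", encoding="utf-8") as f:
        print(repr(f.read()))            # '\ufeffhello'  (signature visible as U+FEFF)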


markus



