John Burger wrote:

> Asmus distinguishes between two kinds of cases: The first is guessing
> the charset incorrectly in a way that completely degrades the text,
> e.g. 8859-1 vs. 8859-2. The second is a more subtle kind of mistake,
> and arguably much less objectionable, e.g., 8859-1 vs. 1252, or the
> "smart quotes" problem.

I'd say the key distinction is between protocol-incorrect behavior (such as ignoring a properly specified character encoding because of some “heuristics”) and error handling. If a document is declared as ISO-8859-1 encoded, it is protocol-incorrect to treat it as anything else, provided that all the octets are defined in ISO-8859-1 and allowed in the data format. However, an HTML 4.01 document declared as ISO-8859-1 encoded and containing, say, octet 80 (hexadecimal) is by definition malformed. A browser may decide to refuse to display it at all (not a good decision in practice) or to perform some error correction, such as interpreting the data as windows-1252 encoded instead.
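
To make the two paths concrete, here is a minimal sketch in Python;
decode_html_body and the octet-range check are illustrative assumptions,
not any browser's actual logic:

    def decode_html_body(octets, declared="iso-8859-1"):
        """Decode per the declared charset. If the document is malformed
        (octets 80-9F hex are not allowed in HTML 4.01 character data),
        recover by reinterpreting the data as windows-1252. Returns the
        text and a flag saying whether error correction was applied."""
        if declared.lower() == "iso-8859-1" and any(0x80 <= b <= 0x9F for b in octets):
            # Error handling, not protocol-correct behavior: the data
            # is malformed, so we guess windows-1252 instead.
            # (A few octets, e.g. 81 hex, are undefined there too.)
            return octets.decode("windows-1252", errors="replace"), True
        # Protocol-correct path: honor the declared encoding.
        return octets.decode(declared), False

For example, decode_html_body(b'smart \x93quotes\x94') takes the
recovery path, since octets 93 and 94 (hex) are the windows-1252
smart quotes.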

> I like this distinction, and would point out that we can probably
> quantify this into a continuum,

No, I think this requires discretion. It is a matter of incorrect behavior vs. error handling (where the handling may vary, though strong arguments may favor one approach over another).

If you ask me, error recovery should be signalled to the end user, though perhaps discretely (pun intended) in cases where the correction seems “obvious”.
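
Continuing the sketch above, the signalling could be as discreet as a
status-bar note; show_status_note is a hypothetical UI hook, not a real
API:

    text, recovered = decode_html_body(data, declared_charset)
    if recovered:
        # Discreet, non-blocking notice rather than a modal error.
        show_status_note("Encoding corrected: treated as windows-1252")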

--
Yucca, http://www.cs.tut.fi/~jkorpela/
