John Burger wrote:

> Asmus distinguishes between two kinds of cases: The first is guessing
> the charset incorrectly in a way that completely degrades the text,
> e.g. 8859-1 vs. 8859-2. The second is a more subtle kind of mistake,
> and arguably much less objectionable, e.g., 8859-1 vs. 1252, or the
> "smart quotes" problem.

I'd say the key distinction is between protocol-incorrect behavior (such as ignoring a properly specified character encoding because of some “heuristics”) and error handling. If a document is declared as ISO-8859-1 encoded, it is protocol-incorrect to treat it as anything else, provided that all the octets are defined in ISO-8859-1 and allowed in the data format. However, an HTML 4.01 document declared as ISO-8859-1 encoded and containing, say, octet 80 (hexadecimal) is by definition malformed. A browser may decide to refuse to display it at all (not a good decision in practice) or to perform some error correction, such as interpreting the data as windows-1252 encoded instead.
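
To make the two paths concrete, here is a minimal sketch in Python;
decode_html_body and the octet-range check are illustrative assumptions,
not any browser's actual logic:

    def decode_html_body(octets, declared="iso-8859-1"):
        """Decode per the declared charset. If the document is malformed
        (octets 80-9F hex are not allowed in HTML 4.01 character data),
        recover by reinterpreting the data as windows-1252. Returns the
        text and a flag saying whether error correction was applied."""
        if declared.lower() == "iso-8859-1" and any(0x80 <= b <= 0x9F for b in octets):
            # Error handling, not protocol-correct behavior: the data
            # is malformed, so we guess windows-1252 instead.
            # (A few octets, e.g. 81 hex, are undefined there too.)
            return octets.decode("windows-1252", errors="replace"), True
        # Protocol-correct path: honor the declared encoding.
        return octets.decode(declared), False

For example, decode_html_body(b'smart \x93quotes\x94') takes the
recovery path, since octets 93 and 94 (hex) are the windows-1252
smart quotes.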

> I like this distinction, and would point out that we can probably
> quantify this into a continuum,

No, I think this requires discretion. It is a matter of incorrect behavior vs. error handling (where the handling may vary, though strong arguments may favor one approach over another).

If you ask me, error recovery should be signalled to the end user, though perhaps discretely (pun intended) in cases where the correction seems “obvious”.
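
Continuing the sketch above, the signalling could be as discreet as a
status-bar note; show_status_note is a hypothetical UI hook, not a real
API:

    text, recovered = decode_html_body(data, declared_charset)
    if recovered:
        # Discreet, non-blocking notice rather than a modal error.
        show_status_note("Encoding corrected: treated as windows-1252")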

--
Yucca, http://www.cs.tut.fi/~jkorpela/
