On 7/1/2010 11:29 AM, John Burger wrote:
Andreas Prilop wrote:

The problem with slavishly following the charset parameter is
that it is often incorrect.

I wonder how you could draw such a conclusion. In order to make
such a statement, there must be some other (god-given?) parameter,
which is the "real charset".


If you have never encountered a web page in which the charset parameter encoded in the page (or in the HTTP headers) did not accurately reflect the "real charset", as indicated by the actual data in the page, then your experience differs sharply from mine, and from that of everyone else I have ever met.

Let's unravel this.

First, there are qualitative vs. quantitative arguments. Yes, mis-tagging occurs (for all the reasons Shawn gave in his reply). But Andreas' point was that for languages needing more than ASCII, there's a nice corrective: if many (most) viewers now base their display on the charset parameter, then more documents would be expected to be correctly tagged for those types of text, because they tend to degrade dramatically otherwise and users (authors) would take action to correct the situation. The example of this is reading a text as 8859-1 when it is actually 8859-2 (Eastern European).

This is different from the issue of selecting the correct charset when the choice only affects some special symbols (copyright, punctuation marks, the euro sign). In these cases, the text degrades in much more subtle ways, and usually remains readable. I would expect the incidence of mis-tagging in such a situation to be larger. The example for this is reading a text as 8859-1 when it is actually 1252 (the Windows code page with extra characters not in ISO 8859-1 - Shawn mentioned this case as well).
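For concreteness, here is a minimal Python sketch of the two degradation modes (the sample strings are my own; the byte values follow from the published code charts for ISO 8859-2 and Windows-1252):

    # Eastern European case: visibly broken, so authors notice and fix it.
    polish = "żółć".encode("iso8859-2")
    print(polish.decode("iso8859-1"))   # -> '¿ó³æ': dramatic degradation

    # Windows-1252 case: degrades subtly, so mis-tagging tends to persist.
    quoted = "€100 and “smart quotes”".encode("cp1252")
    print(quoted.decode("iso8859-1"))   # -> invisible C1 control codes;
                                        #    the rest stays readable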

If I were to design a charset-verifier, I would distinguish between these two cases. If something came tagged with a region-specific charset, I would honor that, unless I found strong evidence of the "this can't be right" nature. In some cases, collecting such evidence would require significant statistics. The rule here should be "do no harm": that is, destroying a document by incorrectly changing a true charset should receive a much higher penalty than failing to detect a broken charset. (That way, you don't penalize people who live by the rules :).
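One way to encode that asymmetry, as a rough sketch (the function, the confidence input, and the 10:1 cost ratio are all hypothetical, not taken from any existing detector):

    COST_FALSE_OVERRIDE = 10.0  # destroying a correctly tagged document
    COST_MISSED_BROKEN = 1.0    # leaving a mis-tagged document alone

    def should_override(declared, detected, confidence):
        """Override the declared charset only on strong evidence.

        'confidence' is the estimated probability (0..1) that the
        declared charset really is correct.
        """
        if detected == declared:
            return False
        # Expected cost of each action; override only when clearly cheaper.
        cost_of_override = confidence * COST_FALSE_OVERRIDE
        cost_of_keeping = (1.0 - confidence) * COST_MISSED_BROKEN
        return cost_of_override < cost_of_keeping

With a 10:1 ratio, the verifier overrides only when the evidence against the declared tag is overwhelming (confidence below about 0.09), which is exactly the "do no harm" bias.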

When it comes to a document tagged with 8859-1, I might relax this slightly, as that tag is one of the common default tags and is more likely to have been applied blindly.

When it comes to deciding whether something is a Windows code page or a true ISO charset, the bar can be set lower - one is usually a superset of the other, and detecting any characters that exist only in the superset should trigger a reassignment. Unlike the other case, the "penalties" for getting this wrong are much less severe.
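A sketch of that lower bar, again in Python (the function name is mine, but the byte facts come from the code charts: 0x80-0x9F are C1 control codes in ISO 8859-1 and printable characters in Windows-1252, except for five undefined positions):

    C1_BYTES = set(range(0x80, 0xA0))  # control codes in ISO 8859-1
    CP1252_UNDEFINED = {0x81, 0x8D, 0x8F, 0x90, 0x9D}  # holes in cp1252

    def refine_latin1_tag(data: bytes) -> str:
        """Pick the charset for data that arrived tagged as ISO 8859-1."""
        c1_seen = set(data) & C1_BYTES
        if c1_seen and not (c1_seen & CP1252_UNDEFINED):
            # Printable in Windows-1252, but control codes in ISO 8859-1,
            # which almost never appear in real text.
            return "windows-1252"
        return "iso-8859-1"

Since Windows-1252 agrees with ISO 8859-1 everywhere else, retagging costs nothing when the heuristic fires on genuine control codes - the worst case is rendering a few C1 bytes as punctuation.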

A./
