On 7/1/2010 11:29 AM, John Burger wrote:
Andreas Prilop wrote:

The problem with slavishly following the charset parameter is
that it is often incorrect.

I wonder how you could draw such a conclusion. In order to make
such a statement, there must be some other (god-given?) parameter,
which is the "real charset".


If you have never encountered a web page in which the charset parameter encoded in the page (or in the HTTP headers) did not accurately reflect the "real charset", as indicated by the actual data in the page, then your experience differs sharply from mine, and from that of everyone else I have ever met.

Let's unravel this.

First, there are qualitative vs. quantitative arguments. Yes, mis-tagging occurs (for all the reasons Shawn gave in his reply). But Andreas' point was that for languages needing more than ASCII, there's a nice corrective: if many (most) viewers now base their display on the charset parameter, then more documents would be expected to be correctly tagged for those types of text, because they tend to degrade dramatically otherwise and users (authors) would take action to correct the situation. The example of this is reading a text as 8859-1 when it is actually 8859-2 (Eastern European).

This is different from the issue of selecting the correct charset when the choice only affects some special symbols (copyright, punctuation marks, the euro sign). In these cases, the text degrades in much more subtle ways, and usually remains readable. I would expect the incidence of mis-tagging in such a situation to be larger. The example for this is reading a text as 8859-1 when it is actually 1252 (the Windows code page with extra characters not in ISO 8859-1 - Shawn mentioned this case as well).
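For concreteness, here is a minimal Python sketch of the two degradation modes (the sample strings are my own; the byte values follow from the published code charts for ISO 8859-2 and Windows-1252):

    # Eastern European case: visibly broken, so authors notice and fix it.
    polish = "żółć".encode("iso8859-2")
    print(polish.decode("iso8859-1"))   # -> '¿ó³æ': dramatic degradation

    # Windows-1252 case: degrades subtly, so mis-tagging tends to persist.
    quoted = "€100 and “smart quotes”".encode("cp1252")
    print(quoted.decode("iso8859-1"))   # -> invisible C1 control codes;
                                        #    the rest stays readable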

If I were to design a charset-verifier, I would distinguish between these two cases. If something came tagged with a region-specific charset, I would honor that, unless I found strong evidence of the "this can't be right" nature. In some cases, collecting such evidence would require significant statistics. The rule here should be "do no harm": that is, destroying a document by incorrectly changing a true charset should receive a much higher penalty than failing to detect a broken charset. (That way, you don't penalize people who live by the rules :).
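One way to encode that asymmetry, as a rough sketch (the function, the confidence input, and the 10:1 cost ratio are all hypothetical, not taken from any existing detector):

    COST_FALSE_OVERRIDE = 10.0  # destroying a correctly tagged document
    COST_MISSED_BROKEN = 1.0    # leaving a mis-tagged document alone

    def should_override(declared, detected, confidence):
        """Override the declared charset only on strong evidence.

        'confidence' is the estimated probability (0..1) that the
        declared charset really is correct.
        """
        if detected == declared:
            return False
        # Expected cost of each action; override only when clearly cheaper.
        cost_of_override = confidence * COST_FALSE_OVERRIDE
        cost_of_keeping = (1.0 - confidence) * COST_MISSED_BROKEN
        return cost_of_override < cost_of_keeping

With a 10:1 ratio, the verifier overrides only when the evidence against the declared tag is overwhelming (confidence below about 0.09), which is exactly the "do no harm" bias.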

When it comes to a document tagged with 8859-1, I might relax this slightly, as that tag is one of the common default tags and is more likely to have been applied blindly.

When it comes to deciding whether something is a Windows code page or a true ISO charset, the bar can be set lower - one is usually a superset of the other, and detecting any characters that exist only in the superset should trigger a reassignment. Unlike the other case, the "penalties" for getting this wrong are much less severe.
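A sketch of that lower bar, again in Python (the function name is mine, but the byte facts come from the code charts: 0x80-0x9F are C1 control codes in ISO 8859-1 and printable characters in Windows-1252, except for five undefined positions):

    C1_BYTES = set(range(0x80, 0xA0))  # control codes in ISO 8859-1
    CP1252_UNDEFINED = {0x81, 0x8D, 0x8F, 0x90, 0x9D}  # holes in cp1252

    def refine_latin1_tag(data: bytes) -> str:
        """Pick the charset for data that arrived tagged as ISO 8859-1."""
        c1_seen = set(data) & C1_BYTES
        if c1_seen and not (c1_seen & CP1252_UNDEFINED):
            # Printable in Windows-1252, but control codes in ISO 8859-1,
            # which almost never appear in real text.
            return "windows-1252"
        return "iso-8859-1"

Since Windows-1252 agrees with ISO 8859-1 everywhere else, retagging costs nothing when the heuristic fires on genuine control codes - the worst case is rendering a few C1 bytes as punctuation.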

A./
