Re: [whatwg] Superset encodings [Re: ISO-8859-* and the C1 control range]

Øistein E. Andersen Tue, 09 Jun 2009 15:08:40 -0700

On 3 June 2009 at 23:19, Ian Hickson wrote:

On Tue, 14 Apr 2009, Øistein E. Andersen wrote:
HTML5 currently contains a table of encoding aliases,
[...]
GB2312 and GB_2312-80 technically refer to the *character set* GB2312-80,
[...]. GBK, on the other hand, is an encoding.
[...]
There is a large number of unregistered charset strings, however, and the
other mappings in this table are between encodings. Unless x-x-big5 is
actually supposed to refer to an encoding distinct from Big5, [this mapping]
should be removed.
[...]
I believe you misunderstand the purpose of this table. The idea is to give
a mapping of _labels_ to encodings, not encodings to encodings. I've
clarified the text to this effect.

You seem to have added "specified by a label" to the phrase which now reads "an encoding specified by a label given in the first column of the following table" without changing the column heading ("Input encoding") and without defining what a "label" actually is. The reference to "encoding aliasing" is also intact, which seems misleading if the table is not supposed to map between encodings.

The concept of "misinterpret[ation] for compatibility" seems inappropriate for the mapping from x-x-big5 to Big5 unless the "label" x-x-big5 is actually supposed to specify an encoding distinct from Big5.

It is not at all clear to me what you mean by "label". It might be the MIME charset string with which the HTML document is labelled, but that would require an inordinate number of strings to be specified (e.g., iso-ir-100, latin1 and IBM819 amongst others alongside ISO-8859-1), so this cannot possibly be the intended meaning. It might be a normalised form of the MIME charset string, using the IANA charset registry to map an "alias" to its corresponding "name" (or to the "alias" qualified as "preferred MIME name" if there is such an entry), but that does not quite seem to work either, since aliases not registered in the IANA charset registry would then not be covered by the aliasing mechanism (e.g., it would cause content labelled as x-sjis to be handled as unaugmented Shift_JIS despite the mapping from Shift_JIS to Windows-31J, since x-sjis does not and cannot figure in the IANA charset registry).
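
To make the second reading concrete, here is a minimal Python sketch of a two-step lookup: first normalise the charset string to a registered name, then apply the HTML5-style mapping between encodings. The table names, the function resolve and its parameters are hypothetical, and the tables contain only the examples mentioned in this thread; they are not copies of the IANA registry or of the HTML5 table.

```python
# Hypothetical excerpt of registered names and aliases (examples from this thread only).
REGISTERED = {
    "iso-8859-1": "ISO-8859-1",
    "iso-ir-100": "ISO-8859-1",
    "latin1": "ISO-8859-1",
    "ibm819": "ISO-8859-1",
    "shift_jis": "Shift_JIS",
    "windows-31j": "Windows-31J",
    "big5": "Big5",
}

# Hypothetical list of strings that are not in the registry
# but are mentioned in this thread as occurring in practice.
UNREGISTERED = {
    "x-sjis": "Shift_JIS",
    "x-x-big5": "Big5",
    "windows-932": "Windows-31J",
}

# Mapping between encodings, as in the HTML5 table (one example only).
ENCODING_OVERRIDES = {
    "Shift_JIS": "Windows-31J",
}


def resolve(label, extra_labels=None):
    """Return the encoding to use for a charset string, or None if unknown."""
    labels = dict(REGISTERED)
    if extra_labels:
        labels.update(extra_labels)
    # Step 1: normalise the string to a registered name.
    name = labels.get(label.strip().lower())
    if name is None:
        return None
    # Step 2: apply the mapping between encodings.
    return ENCODING_OVERRIDES.get(name, name)


print(resolve("latin1"))
print(resolve("Shift_JIS"))
print(resolve("x-sjis"))                # not covered by the registry alone
print(resolve("x-sjis", UNREGISTERED))  # covered once the extra strings are listed
```

With the registry alone, x-sjis is not resolved at all, so the mapping from Shift_JIS to Windows-31J never applies to it; it only takes effect once unregistered strings are listed explicitly.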

I did indeed believe that the table was supposed to map between encodings, and this interpretation still seems to give the correct result in practice for non-CJK encodings (unless, of course, content labelled TIS-620-2533 should actually be interpreted as TIS-620 rather than windows-874).



On 9 June 2009 at 10:55, Anne van Kesteren wrote:

On Tue, 09 Jun 2009 01:42:57 +0200, Øistein E. Andersen wrote:


Shift-JIS and Windows-932 are commonly used names/labels for the
encodings that are registered as Shift_JIS and Windows-31J
(respectively) in the IANA charset registry. [...]
So should HTML5 mention that Windows-932 maps to Windows-31J? (It does not appear in the IANA registry.)

That is an interesting question. My (apparently wrong) understanding was that the table was merely supposed to provide mappings between encodings, since such mappings are inappropriate in non-HTML contexts and cannot be added to the IANA registry. It might be useful to include a set of MIME charset strings which cannot be or have not yet been registered (e.g., x-x-big5, x-sjis, windows-932) as well as information on how CJK character sets are implemented in practice, both of which seem to be necessary for compatibility.


Such information does not fit comfortably in the current table, though.


--
Øistein E. Andersen
