On 3 June 2009 at 23:19, Ian Hickson wrote:

On Tue, 14 Apr 2009, Øistein E. Andersen wrote:

HTML5 currently contains a table of encoding aliases,
[...]
GB2312 and GB_2312-80 technically refer to the *character set* GB 2312-80,
[...]. GBK, on the other hand, is an encoding.
[...]
There is a large number of unregistered charset strings, however, and the
other mappings in this table are between encodings. Unless x-x-big5 is
actually supposed to refer to an encoding distinct from Big5, [this mapping]
should be removed.
[...]

I believe you misunderstand the purpose of this table. The idea is to give
a mapping of _labels_ to encodings, not encodings to encodings. I've
clarified the text to this effect.

You seem to have added "specified by a label" to the phrase which now reads "an encoding specified by a label given in the first column of the following table" without changing the column heading ("Input encoding") and without defining what a "label" actually is. The reference to "encoding aliasing" is also intact, which seems misleading if the table is not supposed to map between encodings.

The concept of "misinterpret[ation] for compatibility" seems inappropriate for the mapping from x-x-big5 to Big5 unless the "label" x-x-big5 is actually supposed to specify an encoding distinct from Big5.

It is not at all clear to me what you mean by "label". It might be the MIME charset string with which the HTML document is labelled, but that would require an inordinate number of strings to be specified (e.g., iso-ir-100, latin1 and IBM819 amongst others alongside ISO-8859-1), so this cannot possibly be the intended meaning. It might be a normalised form of the MIME charset string, using the IANA charset registry to map an "alias" to its corresponding "name" (or to the "alias" qualified as "preferred MIME name" if there is such an entry), but that does not quite seem to work either, since aliases not registered in the IANA charset registry would then not be covered by the aliasing mechanism (e.g., it would cause content labelled as x-sjis to be handled as unaugmented Shift_JIS despite the mapping from Shift_JIS to Windows-31J, since x-sjis does not and cannot figure in the IANA charset registry).
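
To make the second reading concrete, here is a minimal sketch of a two-step lookup: first normalise the label through an alias table, then apply the compatibility mapping. It is an illustration under stated assumptions, not the algorithm in the HTML5 draft. The names IANA_ALIASES, HTML_OVERRIDES and resolve are hypothetical, and both tables are tiny hand-picked subsets containing only strings mentioned in this thread.

```python
# Hypothetical, hand-picked subset of registry aliases (lower-cased keys).
# The real IANA charset registry is far larger; this is for illustration only.
IANA_ALIASES = {
    "iso-8859-1": "ISO-8859-1",
    "iso-ir-100": "ISO-8859-1",
    "latin1": "ISO-8859-1",
    "ibm819": "ISO-8859-1",
    "shift_jis": "Shift_JIS",
    "windows-31j": "Windows-31J",
    "big5": "Big5",
}

# Compatibility mappings mentioned in this thread (assumed subset of the table).
HTML_OVERRIDES = {
    "shift_jis": "Windows-31J",
    "x-x-big5": "Big5",
}


def resolve(label: str) -> str:
    """Return the encoding name a label would resolve to under this sketch."""
    key = label.strip().lower()
    # Step 1: normalise registered aliases to a single name.
    # Unregistered labels pass through unchanged.
    canonical = IANA_ALIASES.get(key, label.strip())
    # Step 2: apply the compatibility mapping, matched case-insensitively.
    return HTML_OVERRIDES.get(canonical.lower(), canonical)


if __name__ == "__main__":
    for label in ("latin1", "Shift_JIS", "x-x-big5", "x-sjis"):
        print(label, "->", resolve(label))
```

In this sketch x-sjis falls through both tables unchanged, because it appears in neither. It therefore never reaches the Shift_JIS to Windows-31J mapping, which is the gap described above.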

I did indeed believe that the table was supposed to map between encodings, and this interpretation still seems to give the correct result in practice for non-CJK encodings (unless, of course, content labelled TIS-620-2533 should actually be interpreted as TIS-620 rather than windows-874).


On 9 June 2009 at 10:55, Anne van Kesteren wrote:

On Tue, 09 Jun 2009 01:42:57 +0200, Øistein E. Andersen wrote:

Shift-JIS and Windows-932 are commonly used names/labels for the
encodings that are registered as Shift_JIS and Windows-31J
(respectively) in the IANA charset registry. [...]

So should HTML5 mention that Windows-932 maps to Windows-31J? (It does not appear in the IANA registry.)


That is an interesting question. My (apparently wrong) understanding was that the table was merely supposed to provide mappings between encodings, since such mappings are inappropriate in non-HTML contexts and cannot be added to the IANA registry. It might be useful to include a set of MIME charset strings which cannot be or have not yet been registered (e.g., x-x-big5, x-sjis, windows-932) as well as information on how CJK character sets are implemented in practice, both of which seem to be necessary for compatibility.

Such information does not fit comfortably in the current table, though.


--
Øistein E. Andersen
