On 2010/11/11 6:28, Mark Davis ☕ wrote:

That is actually not the case. There are superset relations among some of
the CJK character sets, and also -- practically speaking -- between some of
the windows and ISO-8859 sets. I say practically speaking because in general
environments, the C1 controls are really unused, so where a non-ISO-8859 set
is the same except for 80..9F, you can treat it pragmatically as a superset.

Yes, except that the terms superset/subset (and set in general) shouldn't be used unless you are speaking strictly about the repertoire of characters, and not the encoding itself. So, e.g., the repertoire of iso-8859-1 is a subset of the repertoire of UTF-8. However, iso-8859-1 is not a subset of UTF-8, not because you can't label some text encoded as iso-8859-1 as UTF-8, but because subset relationships among the encodings themselves don't make sense. Likewise, US-ASCII is not a subset of UTF-8, because when you just use the names of the character encodings, you mean the character encodings, and character encodings don't have subset relationships.

It may well be possible to use (create?) the term sub-encoding, saying that an encoding A is a sub-encoding of an encoding B if all (legal) byte sequences in encoding A are also legal byte sequences in encoding B and are interpreted as the same characters in both cases. In this sense, US-ASCII is clearly a sub-encoding of UTF-8, as well as a sub-encoding of many other encodings. You can also say that iso-8859-1 is a sub-encoding of windows-1252 if the former is interpreted as not including the C1 range (bytes 80..9F).
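
To make the proposed definition concrete, here is a rough sketch in Python. It is only an illustration under some assumptions: the function names are made up for this mail, it relies on Python's built-in codecs ("ascii", "utf-8", "latin-1", "cp1252") as stand-ins for the encodings discussed, and it only compares single bytes. That is enough when A is a single-byte encoding and B decodes those bytes independently of their neighbours, but it is not a general test for multi-byte or stateful encodings.

```python
def decode_or_none(byte_value, encoding):
    """Decode one byte in the given encoding; return None if it is not legal."""
    try:
        return bytes([byte_value]).decode(encoding)
    except UnicodeDecodeError:
        return None


def is_single_byte_sub_encoding(a, b, skip=range(0)):
    """Check the sub-encoding condition for single bytes only.

    Every byte that is legal in encoding `a` must also be legal in
    encoding `b` and decode to the same character. Byte values in
    `skip` are treated as not belonging to `a` (hypothetical parameter,
    used here to leave out the C1 range).
    """
    for value in range(256):
        if value in skip:
            continue
        char_a = decode_or_none(value, a)
        if char_a is None:
            continue  # not a legal byte in A, so nothing to check
        if decode_or_none(value, b) != char_a:
            return False
    return True


C1_RANGE = range(0x80, 0xA0)  # bytes 80..9F

print(is_single_byte_sub_encoding("ascii", "utf-8"))
print(is_single_byte_sub_encoding("latin-1", "utf-8"))
print(is_single_byte_sub_encoding("latin-1", "cp1252"))
print(is_single_byte_sub_encoding("latin-1", "cp1252", skip=C1_RANGE))
```

With Python's codec tables, the first and last checks hold and the middle two do not: "latin-1" byte 80 is a C1 control, while the same byte is not a legal sequence on its own in "utf-8" and is the euro sign in "cp1252".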

Regards,   Martin.

--
#-# Martin J. Dürst, Professor, Aoyama Gakuin University
#-# http://www.sw.it.aoyama.ac.jp   mailto:due...@it.aoyama.ac.jp
