Hello, Jungshik!

http://tagunov.tripod.com/survey2.html is largely an answer, so, if you allow, I will comment with links into this page :)

JS> On the other hand, no one with *sufficient understanding*
JS> of the issue uses 'character set' to mean encoding.

ISO> coded character set; code
ISO> A set of unambiguous rules that establishes a
ISO> character set and the one-to-one relationship between the
ISO> characters of the set and their coded representation.

AT> Hmmm... can this potentially lead to messing "character set" for
AT> a short form of "coded character set" (in the ISO meaning)?

JS> I think Dan is right when he wrote that EUC-JP,EUC-KR,EUC-CN,
JS> EUC-TW and even UTF-8 could be regarded as both CCS and CES.

They can :) http://tagunov.tripod.com/survey2.html#BD classifies this as the ISO point of view: every encoding inevitably defines a "Character Set" too. I understand that this is a CCS, not a character repertoire. And you?

JS> Even though
JS> they involve multiple character set standards, the mapping from abstract
JS> characters in those multiple character set standards to integers (despite
JS> being of multiple 'lengths') is strictly one-to-one. I didn't realize
JS> that it's possible to view things that way until he wrote that.

Neither did I!

JS> On the other hand, as he wrote, any encoding that utilize any form of
JS> escape sequence (locking/single shift, designator, etc) , whether
JS> defined in ISO 2022 or not (I have HZ in mind here) cannot be called
JS> a CCS because just providing the mapping alone cannot fully specify
JS> the way actual text in that encoding is 'serialized' in octet-sequence.

I agree that EUC-JP is "more" of a CCS than ISO-2022-JP :-) Still, as I write at http://tagunov.tripod.com/survey2.html#BD, I think that the [RFC 2130] approach is better than the ISO one. And you? ;)

JS> Therefore, I believe the below doesn't hold true for all encodings
JS> we have to deal with although it's the case for some encodings.

I'm afraid I just do not understand you well here, Jungshik.

AT> "coded character set" (= CCS + encoding = CCS + CES),

My statement is

    "ISO coded character set" = CCS + CES

This always holds, doesn't it?

JS> Then, I realize that RFC 1345 has the following after quoting
JS> ISO definition of coded character set which you quoted above.

1345> This memo does not put further
1345> restrictions on the term of "coded character set" than the following:
1345> "A coded character set is a set of rules that unambiguously and
1345> completely determines which sequence of characters, if any, is
1345> represented by each possible sequence of n-bit bytes for a certain
1345> value of n." This implies that e.g. a coded character set extended
1345> with one or more other coded character sets by means of the extension
1345> techniques of ISO 2022 constitutes a coded character set in its own
1345> right. In this memo the term "charset" is used to refer to the above
1345> interpretation of the ISO term "coded character set".

JS> However, even RFC 1345 came up with a new term 'charset' for its
JS> *extended* definition of 'coded character set' to distinguish it from
JS> the original ISO definition. The definition of 'charset' in RFC 1345
JS> is actually in line with RFC 2130/2278.

I was more than happy when I opened 2277. The 'charset' definition there is the best I have seen :-)) Yes, the second definition of "coded character set" in 1345, also named 'charset', is identical to the one in RFC 2130/2277/2278.

JS> Therefore, what I wrote about
JS> the statement that "coded character set" (= CCS + encoding = CCS + CES)
JS> is still the case, IMO.

I'm sorry, Jungshik. I'm afraid I did not understand that. Could you explain it again?

DOC> Is a collection of characters in which each character is distinguished
DOC> with unique ID (in most cases, ID is number).

JS> Some people like to distinguish between a mere collection of characters
JS> and a collection of characters with uniq(numeric) ID /code points.
JS> The former is sometimes refered to as a character repertoire
JS> or a character set whereas the latter is called a 'coded character set'.

AT> or rather CCS to rule out the ISO understanding

JS> I don't see any conflict between RFC 2130 CCS and ISO coded character
JS> set _quoted_ in RFC 1345.

Thanks to Markus G. Kuhn we now have the http://www.evertype.com/standards/iso8859/8859-14-en.pdf link :) Both 8859-14-en.pdf and ECMA 35 contain a very close, slightly reworded definition:

ISO 8859-14> coded character set; code
ISO 8859-14> A set of unambiguous rules that establishes a
ISO 8859-14> character set and the one-to-one relationship between the
ISO 8859-14> characters of the set and their bit combinations.

2130> A Coded Character Set (CCS) is a mapping from a set of abstract
2130> characters to a set of integers.

Does the conflict look more evident now? The [RFC 2130] CCS is not at all about encoding. It is rather about _enumerating_ a set of characters, IMO. Here's how I try to reword the [RFC 2130] CCS definition: http://tagunov.tripod.com/survey2.html#BB What do you think of it? ;-)

JS> It's not the original ISO definition of 'coded
JS> character set' but RFC 1345's extension of the definition that made
JS> things complicated. However, even RFC 1345 gave it a new term 'charset'
JS> to tell it from the original ISO defintion.

Yes, they do conflict: '[RFC 2130] CCS' and '[RFC 2277] charset' == encoding. Furthermore, my opinion (http://tagunov.tripod.com/survey2.html#A3.1) is that

    ISO coded character set == CCS + CES

Do you approve? So,

    'ISO coded character set' is a 'charset' (not vice versa)
    'ISO coded character set' is a CCS (not vice versa)
    'charset' == 'encoding' == 'RFC 1345 second definition'

DOC> =item Character I<Encoding>
DOC> A character encoding may also encode character set as-is (also called
DOC> a I<raw> encoding. i.e. US-ascii) or processed (i.e.
DOC> EUC-JP, US-ascii is

JS> In a strict sense, the concept of 'raw' or 'as-is' (which you
JS> apparently use to mean a coded character set invoked on GL) is not
JS> appropriate. Because JIS X 0208, JIS X 0208 and KS X 1001 don't map
JS> characters to their GL position when enumerating characters in their
JS> charts.

AT> Looks like RFC 1345 has made one big pile:
AT>   JIS_C6226-1978, JIS_C6226-1978 = JIS_C6226-1983
AT>   GB_1988-80
AT>   KS_C_5601-1987
AT>
AT> are all listed in a similar manner there. Does this RFC change
AT> anything?

JS> As we all know well now (and you documented), at least Encode cannot
JS> use 'ks_c_5601-1987' to mean what's described in RFC 1345 (mapping
JS> bet. characters and row/column numbers) because MS took it away for
JS> their own CP949. A similar misuse of GB2312 made it not desirable to
JS> use GB_2312-80 to mean row/column (or GL) repr. of GB 2312-1980 in Encode.

Yes, yes, yes! But we're speaking about beautiful theory, not rude practice! :-) And even in theory the situation is funny to me:

    GB 2312-80 _has_ defined a raw CES
    JIS X 0208 and KS C 5601 _haven't_

But [RFC 1345] has lumped them together and has defined a raw encoding for each, hasn't it?

JS> The numeric ID used in JIS X 0208, JIS X 0212 and KS X 1001
JS> are row (ku) and column(ten?) while GB 2312-80 appears to use GL
JS> codepoints.

AT> Thanks a lot! I would have never caught this subtlety from what
AT> reading I have.

JS> Then, you also have to note what Dan wrote about the difference. JIS and
JS> KS may have tried to 'please' the decimal-oriented :-)

:-) Given that we're hex-oriented rather than decimal-oriented, does http://tagunov.tripod.com/survey2.html#BB please us?

JS> Reading what RFC wrote about GB 2312-80,

1345> Considering the Chinese standard GB 2312-1980, the
1345> Japanese standards JIS X0208 and JIS X0212, and the Korean standard
1345> KS C 5601, they are all given by row and column numbers between 1 and
1345> 94.
1345> So two positions for row and column and a character set
1345> identifier of one character would be almost as short as possible

Just what I was speaking about. [RFC 1345] has neglected that difference and has lumped them all together. And it has presented us with raw encodings for each!! (Quite useless ones, as I retell your, Autrijus's and Dan's explanations in http://tagunov.tripod.com/survey2.html#A5.3)

JS> I developed a reservation about what I wrote about GB 2312-80. Either I
JS> (or Ken Lunde) am(is) wrong or the author of RFC 1345 was wrong. Or,
JS> both could be right because it's possible that the printed version of
JS> GB 2312-80 in Chinese used GL code points while the English document
JS> submitted to ISO to register GB 2312-80 used row/column number.

The world is a mess :-) And it seems [RFC 2130] has added to the mess. No matter that Microsoft has stolen the name, the raw encoding continues to live. As I've recently heard on perl5-porters, jis201-raw and jis208-raw are probably going to come back, because of some issues I do not understand. I'm indifferent about it, just noting that I blame (or praise :-) [RFC 1345] for bringing them to us.

JS><snip/>

JS> Not that I'd encourage people to use UTF-16 for their web pages,
JS> but UTF-16 is supported by MS IE and KOI8-U is supported by both MS IE
JS> and Mozilla.

Was in my last patch.

JS> Why don't you also refer to a successor to
JS> CJK.inf, CJKV Information Processing
JS> ...
JS> Hmm, is it me :-) ?

;-)

JS> ...
JS> along with many other issues faced by anyone trying to
JS> better support CJKV languages/scripts in all the areas of information
JS> processing.

Done. Thanks to Dan for the speedy application!

My ultimate regards,
- Anton

P.S.

JS> Hmm, I feel like being treated as 'the' ultimate something here, which
JS> I'm certainly not and never wanted to be :-)

Settled :)
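
P.P.S. A minimal sketch, in Python, of the row/column versus GL versus "processed" distinction discussed above. It is only an illustration under stated assumptions: the function names (kuten_to_gl, gl_to_gr, wrap_iso2022jp) are made up for this note and come from neither RFC 1345 nor Encode, and the codec names are the ones Python's standard library happens to ship. The arithmetic assumed is the usual one: a row/column number in 1..94 plus 0x20 gives the GL byte, and setting the high bit gives the GR byte used by the EUC encodings.

```python
def kuten_to_gl(row: int, col: int) -> bytes:
    """Turn a row/column pair (each 1..94) into two GL bytes (0x21..0x7E)."""
    if not (1 <= row <= 94 and 1 <= col <= 94):
        raise ValueError("row and column must both be in 1..94")
    return bytes((0x20 + row, 0x20 + col))


def gl_to_gr(gl: bytes) -> bytes:
    """Set the high bit of every byte: the GR form used by EUC-JP/KR/CN."""
    return bytes(b | 0x80 for b in gl)


def wrap_iso2022jp(gl: bytes) -> bytes:
    """Frame GL bytes with the ISO-2022-JP designators (hypothetical helper).

    ESC $ B designates JIS X 0208-1983, ESC ( B switches back to ASCII.
    """
    return b"\x1b$B" + gl + b"\x1b(B"


if __name__ == "__main__":
    gl = kuten_to_gl(16, 1)   # row 16, column 1
    gr = gl_to_gr(gl)

    # The same row/column pair names a different character in each CCS.
    print(gr.decode("euc_jp"))   # JIS X 0208 row 16, column 1
    print(gr.decode("euc_kr"))   # KS X 1001 row 16, column 1
    print(gr.decode("gb2312"))   # GB 2312-80 row 16, column 1

    # The same CCS (JIS X 0208) under a different CES needs escape sequences.
    print(wrap_iso2022jp(gl).decode("iso2022_jp"))
```

The point of the sketch is only that the mapping "character <-> row/column" (the CCS in the RFC 2130 sense) is separate from how the two numbers end up as octets (the CES), and that ISO-2022-JP needs the designator bytes on top of the mapping while EUC-JP does not.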