Hello Dan!

DK> ... I have found that most of Chinese (Continental; seems like
DK> Taiwanese are much more technically correct) and Korean mails and web
DK> pages confuse "charset" and "encodings".

I'm revising a short article on exactly that right now. (Maybe you have
already read the first edition, but I'm rewriting about 75% of it, precisely
because of the "charset" vs "encoding" terminology.) It will probably be
ready by 26 March, 23:00 GMT. I'll post a message to perl-unicode, since
there's interest in the terminology!

DK> That is, charset="gb2312"
DK> really means euc-cn and

In defense of the continental Chinese I must say that this one is okay: the
IANA registry (http://www.iana.org/assignments/character-sets) has

  Name: GB2312  (preferred MIME name)
  MIBenum: 2025
  Source: Chinese for People's Republic of China (PRC) mixed one byte,
          two byte set:
            20-7E = one byte ASCII
            A1-FE = two byte PRC Kanji
          See GB 2312-80
          PCL Symbol Set Id: 18C
  Alias: csGB2312

This looks pretty much like EUC-CN (or CN-GB, which Autrijus has confirmed
is an alias for EUC-CN).

DK> charset="ks_c_5601-1987" really menas euc-kr.

Here I agree 150%: the IANA registry really has

  Name: KS_C_5601-1987  [RFC1345,KXS2]
  MIBenum: 36
  Alias: iso-ir-149
  Alias: KS_C_5601-1989
  Alias: KSC_5601
  Alias: korean
  Alias: csKSC56011987

and RFC 1345 really has

  &charset KS_C_5601-1987
  &alias iso-ir-149
  &alias KS_C_5601-1989
  &alias KSC_5601

but this looks to me like a 94x94-character table rather than EUC-KR. It
makes me genuinely sad that people have done the wrong thing here. If only
they had used 'KS5601' (like GB2312), it would not have clashed with the
IANA registration and with RFC 1345 :,-(((

FYI (an interesting detail): Ken Lunde in his cjk.inf
(http://www.oreilly.com/people/authors/lunde/cjk_inf.html) says that
KS_C_5601-1987, KS_C_5601-1989 and the 1992 version of this standard are
identical as far as the characters and their code points are concerned.

DK> Sadly this misconception is enbedded to popular browsers.
DK> So when you try something like
DK>   my ($encname) = /^Content-Type:.*charset=[\"\']?([0-9A-Za-z_-]+)/o;
DK>   ....
DK>   my $utf8 = encode($encname, $string);
DK> You are in big trouble.
DK> Aliases is no salvation because most web
DK> pages in *.cn happily includes
DK>   <META http-equiv="Content-Type" content="text/html; charset=gb2312">

Yup. It's a big problem when people do not send a correct charset in their
Content-Type header. The META tag is so much harder to catch!

DK> ... Anton has wistfully :-)
DK> stated this in Encode::Supported

I guess it may be removed now, and GB2312 listed as a first-class preferred
MIME name :-) I will send a patch in about 12 hours, after syncing and
finishing that article on "charset" vs "encoding", if you do not mind and if
nobody patches it before then!

DK> * rename gb2312 to gb2312-raw, ksc5601 to ksc5601-raw
DK> * and alias gb2312 and ksc5601 to euc-(cn|kr)

I'm very glad that the issue has finally been resolved! 8*)

DK> I know it's technically wrong

For GB2312 it's OK. It is OK even for ksc5601. It is _VERY_ wrong for
ks_c_5601-1987. It is very, very wrong :,-( But if they _do use_ it as the
Content-Type charset and mean EUC-KR, ah! It seems we have to do
ks_c_5601-1987 a wrong favour :-(

Please do tell me once more, so that I can get properly upset: is it
really ks_c_5601-1987, not ks5601? Is it really EUC-KR (8-bit)?

DK> but perl opts more for practical than
DK> technical....

The show must go on!

- Anton, really upset by the Koreans' (mis)behavior

P.S. Maybe put a BIG poster in Supported.pod or somewhere nearby saying
that we _badly_ need a Korean volunteer for testing and advising?
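
P.P.S. To make the "practical" reading of these labels concrete, here is a minimal sketch of mapping a Content-Type charset label to the encoding people actually mean before decoding. It is only an illustration and not how Encode itself implements its aliases: the table %LABEL_TO_ENCODING and the subs charset_from_header() and decode_labelled() are hypothetical names made up for this mail. The only Encode call it relies on is find_encoding(), with the encoding names euc-cn and euc-kr.

```perl
use strict;
use warnings;
use Encode qw(find_encoding);

# Hypothetical table (an assumption for this sketch): charset labels as
# seen in the wild, mapped to the encoding the sender most likely means.
my %LABEL_TO_ENCODING = (
    'gb2312'         => 'euc-cn',
    'ksc5601'        => 'euc-kr',
    'ks_c_5601-1987' => 'euc-kr',
);

# Pull the charset label out of a Content-Type header line.
# Returns undef when the header carries no charset parameter.
sub charset_from_header {
    my ($header) = @_;
    my ($label) =
        $header =~ /^Content-Type:.*?charset=["']?([0-9A-Za-z_-]+)/i;
    return $label;
}

# Decode octets into Perl characters, treating the label "practically":
# known mislabels go through the table, anything else is passed to
# find_encoding() unchanged.
sub decode_labelled {
    my ($label, $octets) = @_;
    my $key  = lc $label;
    my $name = exists $LABEL_TO_ENCODING{$key}
             ? $LABEL_TO_ENCODING{$key}
             : $label;
    my $enc  = find_encoding($name)
        or die "Unknown encoding: $label\n";
    return $enc->decode($octets);
}

# Example: two Chinese characters sent as EUC-CN but labelled "gb2312".
my $label = charset_from_header('Content-Type: text/html; charset=gb2312');
my $text  = decode_labelled($label, "\xD6\xD0\xCE\xC4");
printf "%s -> %d characters\n", $label, length $text;
```

Keeping the table separate from the decoding step means a raw 94x94 table (the proposed gb2312-raw and ksc5601-raw) can still be requested explicitly by name, while the mislabelled mail and web pages keep working.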