This is a forwarded message
From: Anton Tagunov <[EMAIL PROTECTED]>
To: [EMAIL PROTECTED] <[EMAIL PROTECTED]>
Date: Tuesday, March 19, 2002, 9:48:06 PM
Subject: [PATCH][docs] Encode.pm

===8<==============Original message text===============
Hello, developers!

With my upgraded knowledge of encoding naming, I propose this.

Justification:

1) Shift-JIS -> Shift_JIS does not hurt anyone, because it does not
   work either way: Encode::encode understands only 'shiftjis'.
   I would prefer to settle the naming first; I am going to submit
   a separate bug report later for all the aliases that do not work.

2) I do not care too much if I have classified some encodings wrongly:
   I hope that as soon as something like this gets into the docs
   we'll get plenty of feedback, enough to correct even the worst
   mistakes :-) To me it looks good enough just to start the section.

<DISCLAIMER>
The main goal was to separate MIME names from ISO names
from proprietary names.
</DISCLAIMER>

Comment:
  JIS 0201
  JIS 0208
  JIS 0212
  GB 1988
  GB 2312
are under my strong suspicion, but I have posted separate mails on them.

Grumbling:
  CNS 11643
  GB 12345
really hurt my feelings because they have a space inside, but I have
found no reason to touch them: neither IANA nor RFC 1345 names them,
and everywhere I've seen them they are written with a space.
Do you think they could still be changed to CNS-.., GB-.. for
consistency and beauty? :-)

Proposition:
Should
  Name: HZ-GB-2312
be established as a synonym for HZ? Or is it not worth the trouble?

Looking forward to your opinions!
:-)))
- Anton

--- ext/Encode/Encode.pm.orig Mon Mar 18 00:20:24 2002
+++ ext/Encode/Encode.pm Tue Mar 19 21:42:26 2002
@@ -500,34 +500,34 @@
   ISO 10646-1 => UCS-2

-The ISO 8859 and KOI:
+The ISO-8859 and KOI:

-  ISO 8859-1  ISO 8859-6   ISO 8859-11         KOI8-F
-  ISO 8859-2  ISO 8859-7   (12 doesn't exist)  KOI8-R
-  ISO 8859-3  ISO 8859-8   ISO 8859-13         KOI8-U
-  ISO 8859-4  ISO 8859-9   ISO 8859-14
-  ISO 8859-5  ISO 8859-10  ISO 8859-15
-                           ISO 8859-16
-
-  Latin1 => 8859-1  Latin6  => 8859-10
-  Latin2 => 8859-2  Latin7  => 8859-13
-  Latin3 => 8859-3  Latin8  => 8859-14
-  Latin4 => 8859-4  Latin9  => 8859-15
-  Latin5 => 8859-9  Latin10 => 8859-16
-
-  Cyrillic => 8859-5
-  Arabic   => 8859-6
-  Greek    => 8859-7
-  Hebrew   => 8859-8
-  Thai     => 8859-11
-  TIS620   => 8859-11
+  ISO-8859-1  ISO-8859-6   ISO-8859-11         KOI8-F
+  ISO-8859-2  ISO-8859-7   (12 doesn't exist)  KOI8-R
+  ISO-8859-3  ISO-8859-8   ISO-8859-13         KOI8-U
+  ISO-8859-4  ISO-8859-9   ISO-8859-14
+  ISO-8859-5  ISO-8859-10  ISO-8859-15
+                           ISO-8859-16
+
+  Latin1 => ISO-8859-1  Latin6  => ISO-8859-10
+  Latin2 => ISO-8859-2  Latin7  => ISO-8859-13
+  Latin3 => ISO-8859-3  Latin8  => ISO-8859-14
+  Latin4 => ISO-8859-4  Latin9  => ISO-8859-15
+  Latin5 => ISO-8859-9  Latin10 => ISO-8859-16
+
+  Cyrillic => ISO-8859-5
+  Arabic   => ISO-8859-6
+  Greek    => ISO-8859-7
+  Hebrew   => ISO-8859-8
+  Thai     => ISO-8859-11
+  TIS620   => ISO-8859-11

 The CJKV: Chinese, Japanese, Korean, Vietnamese:

-  ISO 2022     ISO 2022 JP-1  JIS 0201  GB 1988   Big5        EUC-CN
-  ISO 2022 CN  ISO 2022 JP-2  JIS 0208  GB 2312   HZ          EUC-JP
-  ISO 2022 JP  ISO 2022 KR    JIS 0210  GB 12345  CNS 11643   EUC-JP-0212
-  Shift-JIS                             GBK       Big5-HKSCS  EUC-KR
+  ISO-2022     ISO-2022-JP-1  JIS 0201  GB 1988   Big5        EUC-CN
+  ISO-2022-CN  ISO-2022-JP-2  JIS 0208  GB 2312   HZ          EUC-JP
+  ISO-2022-JP  ISO-2022-KR    JIS 0210  GB 12345  CNS 11643   EUC-JP-0212
+  Shift_JIS                             GBK       Big5-HKSCS  EUC-KR
   VISCII  ISO-IR-165

 (Due to size concerns, additional Chinese encodings including C<GB 18030>,
@@ -572,6 +572,59 @@
   DingBats  Roman8  GSM 0338  Symbol

+=head2 Encoding Classification
+
+Encodings
+
+  US-ASCII UTF-8 KOI8-R ISO-8859-*
+  ISO-2022-CN ISO-2022-JP ISO-2022-KR Big5
+  EUC-CN EUC-JP EUC-KR
+
+are L<http://www.iana.org/assignments/character-sets>-registered
+as preferred MIME names and may probably be used over the Internet.
+So is
+
+  Shift_JIS
+
+but despite its wide spread it bears the label of being
+Microsoft proprietary.
+
+  UTF-16 KOI8-U ISO-2022-JP-2
+
+are IANA-registered preferred MIME names but probably should
+be avoided as encoding for web pages due to lack of browser
+support.
+
+
+  ISO-2022      (http://www.ecma.ch/ecma1/STAND/ECMA-035.HTM)
+  ISO-2022-JP-1 (http://www.faqs.org/rfcs/rfc2237.html)
+  ISO-IR-165    (http://www.faqs.org/rfcs/rfc1345.html)
+  GBK
+  VISCII
+  GB 12345      (only planes 1 and 2 available)
+  GB 18030
+  CNS 11643
+
+are totally valid encodings but not registered at IANA.
+
+  BIG5PLUS
+  EUC-JP-0212 (Encode::lib::Encode::Tcl::Extended)
+
+are a bit proprietary
+
+You may probably get some info on CJK encodings at
+
+  brief description for most of the mentioned CJK encodings
+  http://www.debian.org.ru/doc/manuals/intro-i18n/ch-codes.html
+
+  several years old, but still useful
+  http://www.oreilly.com/people/authors/lunde/cjk_inf.html
+
+  and some in-depth reading for the heroes :-)
+  http://www.ecma.ch/ecma1/STAND/ECMA-035.HTM (eq ISO-2022)
+  http://www.faqs.org/rfcs/rfc1345.txt
+
+
 =head1 PERL ENCODING API

 =head2 Generic Encoding Interface
@@ -598,7 +651,7 @@
 internal form and returns the resulting string. For CHECK see
 L</"Handling Malformed Data">.

-For example to convert ISO 8859-1 data to UTF-8:
+For example to convert ISO-8859-1 data to UTF-8:

   $utf8 = decode("latin1", $latin1);

@@ -611,7 +664,7 @@
 encode() or through PerlIO: See L</"Encoding and IO">. For CHECK see
 L</"Handling Malformed Data">.

-For example to convert ISO 8859-1 data to UTF-8:
+For example to convert ISO-8859-1 data to UTF-8:

   from_to($data, "iso-8859-1", "utf-8");

@@ -848,7 +901,7 @@
 "character operations" (e.g. C<lc>, C</\W+/>, ...).

 You can also use PerlIO to convert larger amounts of data you don't
-want to bring into memory. For example to convert between ISO 8859-1
+want to bring into memory. For example to convert between ISO-8859-1
 (Latin 1) and UTF-8 (or UTF-EBCDIC in EBCDIC machines):

   open(F, "<:encoding(iso-8859-1)", "data.txt") or die $!;
===8<===========End of original message text===========

--
Best regards,
 Anton  mailto:[EMAIL PROTECTED]
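
Point 1 of the justification (which spellings Encode actually accepts) can be checked directly rather than taken on trust. The following is a minimal sketch, assuming an installed Encode release in which find_encoding() returns an encoding object with a name() method; canonical_name is a hypothetical helper introduced here for illustration, not part of Encode, and what each spelling resolves to depends on the Encode version in use.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical helper (not part of Encode): return the canonical name
# that Encode resolves $name to, or undef if the name is not recognized.
sub canonical_name {
    my ($name) = @_;
    my $enc = Encode::find_encoding($name);
    return ref $enc ? $enc->name : undef;
}

# Spellings discussed in the message; the output is version-dependent.
for my $name ('Shift-JIS', 'Shift_JIS', 'shiftjis', 'HZ-GB-2312', 'HZ') {
    my $canon = canonical_name($name);
    printf "%-12s => %s\n", $name,
        defined $canon ? $canon : '(not recognized)';
}
```

Running this against the Encode in question shows which of the proposed names would need an alias before the documentation can safely use them.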