Hello! I have just read Jungshik's mail and patched Supported.pod a bit more: this adds the (x-)windows-949 aliases.
--- ext/Encode/lib/Encode/Supported.orig.pod Fri Apr 5 01:00:36 2002
+++ ext/Encode/lib/Encode/Supported.pod Fri Apr 5 15:18:25 2002
@@ -63,7 +63,7 @@
   ascii       US-ascii                          [ECMA]
   iso-8859-1  latin1                            [ISO]
   utf8        UTF-8                             [RFC2279]
-  UCS-2       ucs2, iso-10646-1, UTF-16LE       [IANA, UC]
+  UCS-2       ucs2, iso-10646-1, UTF-16BE       [IANA, UC]
   UTF-16LE    UCS-2LE                           [UC]
   ----------------------------------------------------------------

@@ -188,8 +188,11 @@

   ----------------------------------------------------------------
   euc-kr                        MacKorean  [RFC1557]
-  cp949                                    ks_c_5601-1987 is an alias
-                                           thereof.
+  cp949                                    ks_c_5601-1987
+                                           windows-949
+                                           x-windows-949
+                                           uhc
+                                           are aliases thereof.
   iso-2022-kr                              [RFC1557]
   johab                                    [KS X 1001:1998, Annex 3]
   ksc5601-raw                              KSC5601 as is
@@ -456,14 +459,42 @@
 C<KS_C_5601-1987> I<raw> encoding is available as C<kcs5601-raw>
 with Encode. See L<Encode::KR -- Korea> for details.

-  UTF-16
+  UTF-16 UTF-16BE UTF-16LE

-=for comment
-waiting for comments from Jungshik Shin to soften this - Anton
+are IANA-registered C<charset>s. See [RFC 2781] for details.
+Jungshik Shin reports that UTF-16 with a BOM is well accepted
+by MS IE 5/6 and NS 4/6. Beware however that
+
+=over 2
+
+=item *
+
+C<UTF-16> support in any software you're going to be
+using/interoperating with has probably been less tested
+than C<UTF-8> support
+
+=item *
+
+data coded with C<UTF-8> seamlessly passes traditional
+command piping (C<cat>, C<more>, etc.) while UTF-16 coded
+data is likely to cause confusion (with its zero bytes,
+for example)
+
+=item *
+
+it is beyond the power of words to describe the way HTML browsers
+encode non-C<ASCII> form data. To get a general impression refer to
+L<http://ppewww.ph.gla.ac.uk/~flavell/charset/form-i18n.html>.
+While encoding of form data has stabilized for C<UTF-8> coded pages
+(at least IE 5/6, NS 6, Opera 6 behave consistently), be sure to
+expect fun (and cross-browser discrepancies) with C<UTF-16> coded
+pages!
+
+=back
+
+The rule of thumb is to use C<UTF-8> unless you know what
+you're doing and unless you really benefit from using C<UTF-16>.

-is a IANA-registered preferred MIME name
-but probably should be avoided as encoding for web pages due to
-the lack of browser support.
   ISO-IR-165 (http://www.faqs.org/rfcs/rfc1345.html)

   GBK
@@ -498,7 +529,8 @@
 for details.

 Encode aliases C<KS_C_5601-1987> to C<cp949> to reflect
-this common misusage.
+this common misusage. Other aliases are C<x-windows-949> (used by
+Mozilla), C<windows-949> and C<uhc>.
 I<Raw> C<KS_C_5601-1987> encoding is available as C<kcs5601-raw>.
 See L<Encode::KR -- Korea> for details.

@@ -515,7 +547,7 @@

 Encode aliases C<GB2312> to C<euc-cn> in full agreement with
 IANA registration. C<cp936> is supported separately.
-I<Raw> C<GB_2312-80> encoding is available as C<kcs5601-raw>.
+I<Raw> C<GB_2312-80> encoding is available as C<gb2312-raw>.
 See L<Encode::CN -- Continental China> for details.

@@ -568,6 +600,23 @@
 belongs. 7-bit ISO-2022 is a CES but it cannot be a CCS. EUC is an
 example of being both a CCS and CES.

+=item charset (in MIME context)
+
+has long been used in the meaning of C<encoding>, CES.
+
+While the C<character set> word combination has lost this meaning
+in MIME context since [RFC 2130], the C<charset> abbreviation has
+retained it. This is how [RFC 2277], [RFC 2278] bless C<charset>:
+
+
+  This document uses the term "charset" to mean a set of rules for
+  mapping from a sequence of octets to a sequence of characters, such
+  as the combination of a coded character set and a character encoding
+  scheme; this is also what is used as an identifier in MIME "charset="
+  parameters, and registered in the IANA charset registry ... (Note
+  that this is NOT a term used by other standards bodies, such as ISO).
+                                                          [RFC 2277]
+
 =item EUC

 Extended Unix Character. See ISO-2022
@@ -683,7 +732,7 @@

 =item czyborra.com

-<http://czyborra.com/>
+L<http://czyborra.com/>

 Contains a a lot of useful information, especially gory details of ISO
 vs. vendor mappings.
@@ -697,6 +746,37 @@
 L<ftp://ftp.oreilly.com/pub/examples/nutshell/cjkv/pdf/GB18030_Summary.pdf>

 You will find brief info on C<EUC-CN>, C<GBK> and mostly on C<GB 18030>
+
+=item Jungshik Shin's Hangul FAQ
+
+L<http://jshin.net/faq>
+
+And especially its subject 8
+
+L<http://jshin.net/faq/qa8.html>
+
+a comprehensive overview of the Korean (C<KS *>) standards.
+
+=back
+
+=head2 Offline sources
+
+=over 2
+
+=item Ken Lunde
+
+CJKV Information Processing
+1999 O'Reilly & Associates, ISBN : 1-56592-224-7
+
+The modern successor of the C<CJK.inf>.
+
+Features comprehensive coverage of CJKV character sets and
+encodings along with many other issues faced by anyone trying
+to better support CJKV languages/scripts in all the areas of
+information processing.
+
+To purchase this book visit
+L<http://www.oreilly.com/catalog/cjkvinfo/>

 =back

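
For anyone who wants to see the UTF-16 caveats from the patch concretely, here is a minimal sketch. It is written in Python rather than with Perl's Encode, purely as an illustration: the codec names (`utf-16`, `utf-16-be`, `utf-16-le`) and the `uhc` alias lookup are Python's own and are not claims about how Encode behaves. The helper names `has_zero_bytes` and `starts_with_bom` are made up for this example.

```python
import codecs

text = "ab"  # plain ASCII sample, chosen so the byte layout is easy to read

utf8 = text.encode("utf-8")
utf16 = text.encode("utf-16")        # Python's generic codec prepends a BOM
utf16be = text.encode("utf-16-be")   # explicit byte order, no BOM
utf16le = text.encode("utf-16-le")   # explicit byte order, no BOM


def has_zero_bytes(data: bytes) -> bool:
    """True if the byte string contains NUL bytes, which byte-oriented tools may mishandle."""
    return b"\x00" in data


def starts_with_bom(data: bytes) -> bool:
    """True if the data begins with either UTF-16 byte order mark."""
    return data[:2] in (codecs.BOM_UTF16_BE, codecs.BOM_UTF16_LE)


# ASCII text survives UTF-8 encoding byte for byte, so it pipes cleanly.
print("utf-8    :", utf8, "zero bytes:", has_zero_bytes(utf8))

# Every ASCII character picks up a zero byte in either UTF-16 byte order.
print("utf-16-be:", utf16be, "zero bytes:", has_zero_bytes(utf16be))
print("utf-16-le:", utf16le, "zero bytes:", has_zero_bytes(utf16le))

# The BOM tells a reader which byte order follows.
print("utf-16 has BOM:", starts_with_bom(utf16))

# In Python (an assumption about Python only), "uhc" resolves to the cp949 codec.
print("uhc ->", codecs.lookup("uhc").name)
```

The zero bytes in the two UTF-16 outputs are what the patch text refers to when it says UTF-16 data is likely to cause confusion in traditional command piping.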