Worik Macky Turei Stanton wrote:
> Friends
>
> (At least in the case of...) When a message is sent via the http
> interface to smsbox, smsbox_req_handle calls charset_processing to
> translate the text.
>
> If the coding is UCS2 and the charset is not UTF-16BE it is converted
> to utf8, thence to UTF-16BE.
>
> So far so good.
>
> But I have some questions about charset_to_utf8.
>
> charset_to_utf8 uses libxml2 to do the grunt work of translating the
> encoding (xmlFindCharEncodingHandler and xmlCharEncInFunc). I have
> been looking at...
>
> http://xmlsoft.org/html/libxml-encoding.html#XMLCHARENCODING
>
> This lists a few character codings. (I have listed them via cut
> and paste below). My email archives have emails in quite a few
> different encodings. What about them?
>
> The char types I have found in my email archive...
>
> BIG5
> EUC-KR
> GB2312
> GB2312_CHARSET
> ISO-10646
> ISO-2022-JP
> ISO-8859-1
> ISO-8859-2
> ISO-8859-4
> ISO-8859-7
> ISO-8859-9;
> KOI8-R
> UNKNOWN-8BIT
> US-ASCII
> UTF-8
> Windows-1250
> Windows-1251
> Windows-1252
> X-UNKNOWN
> big5
> euc-kr
> gb2312
> iso-2022-kr
> iso-8859-1
> iso-8859-1
> iso-8859-13
> iso-8859-15
> koi8-r
> ks_c_5601-1987
> unknown-8bit
> windows-1256
> x-user-defined
>
> The chartypes listed on http://xmlsoft.org/html/libxml-encoding.html
>
> XML_CHAR_ENCODING_ERROR=     -1, /* No char encoding detected */
> XML_CHAR_ENCODING_NONE=       0, /* No char encoding detected */
> XML_CHAR_ENCODING_UTF8=       1, /* UTF-8 */
> XML_CHAR_ENCODING_UTF16LE=    2, /* UTF-16 little endian */
> XML_CHAR_ENCODING_UTF16BE=    3, /* UTF-16 big endian */
> XML_CHAR_ENCODING_UCS4LE=     4, /* UCS-4 little endian */
> XML_CHAR_ENCODING_UCS4BE=     5, /* UCS-4 big endian */
> XML_CHAR_ENCODING_EBCDIC=     6, /* EBCDIC uh! */
> XML_CHAR_ENCODING_UCS4_2143=  7, /* UCS-4 unusual ordering */
> XML_CHAR_ENCODING_UCS4_3412=  8, /* UCS-4 unusual ordering */
> XML_CHAR_ENCODING_UCS2=       9, /* UCS-2 */
> XML_CHAR_ENCODING_8859_1=    10, /* ISO-8859-1 ISO Latin 1 */
> XML_CHAR_ENCODING_8859_2=    11, /* ISO-8859-2 ISO Latin 2 */
> XML_CHAR_ENCODING_8859_3=    12, /* ISO-8859-3 */
> XML_CHAR_ENCODING_8859_4=    13, /* ISO-8859-4 */
> XML_CHAR_ENCODING_8859_5=    14, /* ISO-8859-5 */
> XML_CHAR_ENCODING_8859_6=    15, /* ISO-8859-6 */
> XML_CHAR_ENCODING_8859_7=    16, /* ISO-8859-7 */
> XML_CHAR_ENCODING_8859_8=    17, /* ISO-8859-8 */
> XML_CHAR_ENCODING_8859_9=    18, /* ISO-8859-9 */
> XML_CHAR_ENCODING_2022_JP=   19, /* ISO-2022-JP */
> XML_CHAR_ENCODING_SHIFT_JIS= 20, /* Shift_JIS */
> XML_CHAR_ENCODING_EUC_JP=    21, /* EUC-JP */
> XML_CHAR_ENCODING_ASCII=     22  /* pure ASCII */
>
> Worik

Yes, it's true that libxml doesn't offer built-in support for most of
these character sets, but then again, that is not really its job. It
seems that the note about the iconv library has been removed from the
libxml site. If libxml is compiled with iconv support, it tries the
iconv character set conversion functions whenever its own converters
fall short. The actual number of supported character sets therefore
depends on the iconv library of the target platform. There is a GNU
version, though, that can be used on most *nixes when needed.
GNU libiconv provides the following character sets
(http://www.gnu.org/software/libiconv/):

European languages
    ASCII, ISO-8859-{1,2,3,4,5,7,9,10,13,14,15,16}, KOI8-R, KOI8-U,
    KOI8-RU, CP{1250,1251,1252,1253,1254,1257}, CP{850,866},
    Mac{Roman,CentralEurope,Iceland,Croatian,Romania},
    Mac{Cyrillic,Ukraine,Greek,Turkish}, Macintosh
Semitic languages
    ISO-8859-{6,8}, CP{1255,1256}, CP862, Mac{Hebrew,Arabic}
Japanese
    EUC-JP, SHIFT-JIS, CP932, ISO-2022-JP, ISO-2022-JP-2, ISO-2022-JP-1
Chinese
    EUC-CN, HZ, GBK, GB18030, EUC-TW, BIG5, CP950, BIG5-HKSCS,
    ISO-2022-CN, ISO-2022-CN-EXT
Korean
    EUC-KR, CP949, ISO-2022-KR, JOHAB
Armenian
    ARMSCII-8
Georgian
    Georgian-Academy, Georgian-PS
Thai
    TIS-620, CP874, MacThai
Laotian
    MuleLao-1, CP1133
Vietnamese
    VISCII, TCVN, CP1258
Platform specifics
    HP-ROMAN8, NEXTSTEP
Full Unicode
    UTF-8
    UCS-2, UCS-2BE, UCS-2LE
    UCS-4, UCS-4BE, UCS-4LE
    UTF-16, UTF-16BE, UTF-16LE
    UTF-32, UTF-32BE, UTF-32LE
    UTF-7
    JAVA
Full Unicode, in terms of uint16_t or uint32_t
(with machine dependent endianness and alignment)
    UCS-2-INTERNAL, UCS-4-INTERNAL

Oh yes, those Windows charsets... They are aliases for the
corresponding CP charsets. Actually, this reminds me of one thing on my
todo list:

2001-11-15  Tuomas Luttinen  <[EMAIL PROTECTED]>

    * gw/wml_definitions.h, gw/wml_compiler.c: Removed the windows
    character set registration from this module; it shouldn't have
    been here in the first place.
    * gwlib/gwlib.c (gwlib_init): Added a call to charset_init.
    * gwlib/charset.[ch]: New function charset_init added that
    registers windows charsets into the libxml character set aliases.

OK, this patch is now in CVS, so the Windows charsets should now work
with SMSes too, not just in the WML decks.

--
Tuomas Luttinen
Application Developer -- Reach U
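
P.S. To make the alias idea concrete, here is a minimal, self-contained
sketch of mapping "windows-125x" names onto the "CP125x" names from the
libiconv list above. It is only an illustration and not the code from
gwlib/charset.c: the table, the struct and the function names
(windows_aliases, charset_alias, resolve_charset_alias) are all
hypothetical. libxml2 keeps its own alias registry
(xmlAddEncodingAlias); that charset_init uses exactly that call is an
assumption on my part, so the sketch avoids libxml2 entirely.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical alias table, NOT taken from Kannel's charset.c:
 * "windows-125x" names mapped to the "CP125x" names libiconv lists. */
struct charset_alias {
    const char *alias;  /* name as it arrives, e.g. in a charset parameter */
    const char *name;   /* name the converter is assumed to understand */
};

static const struct charset_alias windows_aliases[] = {
    { "windows-1250", "CP1250" },
    { "windows-1251", "CP1251" },
    { "windows-1252", "CP1252" },
    { "windows-1253", "CP1253" },
    { "windows-1254", "CP1254" },
    { "windows-1255", "CP1255" },
    { "windows-1256", "CP1256" },
    { "windows-1257", "CP1257" },
    { "windows-1258", "CP1258" },
};

/* Case-insensitive comparison of two ASCII strings; returns 1 if equal. */
static int ascii_equal_nocase(const char *a, const char *b)
{
    while (*a != '\0' && *b != '\0') {
        if (tolower((unsigned char)*a) != tolower((unsigned char)*b))
            return 0;
        a++;
        b++;
    }
    /* Equal only if both strings ended at the same position. */
    return *a == *b;
}

/* Return the CP name for a known Windows alias, or the input pointer
 * unchanged when the charset is not in the table. */
const char *resolve_charset_alias(const char *charset)
{
    size_t i;

    assert(charset != NULL);
    for (i = 0; i < sizeof windows_aliases / sizeof windows_aliases[0]; i++) {
        if (ascii_equal_nocase(charset, windows_aliases[i].alias))
            return windows_aliases[i].name;
    }
    return charset;
}
```

With this, both "Windows-1252" and "windows-1252" from the archive list
resolve to CP1252, while names that are not in the table (KOI8-R, say)
pass through untouched.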