Re: Encoding and charset_to_utf8

Tuomas Luttinen Thu, 15 Nov 2001 08:22:26 -0800

Worik Macky Turei Stanton wrote:

> Friends


> (At least in the case of...) When a message is sent via the http
> interface to smsbox smsbox_req_handle calls charset_processing to
> translate the text.

> If the coding is UCS2 and the charset is not UTF-16BE it is converted
> to utf8, thence to UTF-16BE.

> So far so good.

> But I have some questions about the charset_to_utf8.  

> charset_to_utf8 uses libxml2 to do the grunt work of translating the
> encoding (xmlFindCharEncodingHandler and xmlCharEncInFunc).  I have
> been looking at...

> http://xmlsoft.org/html/libxml-encoding.html#XMLCHARENCODING

> This lists a few character codings.  (I have listed them via cut
> and paste below).  My email archives have emails in quite a few
> different encodings.  What about them?

> The char types I have found in my email archive...

>          BIG5
>          EUC-KR
>          GB2312
>          GB2312_CHARSET
>          ISO-10646
>          ISO-2022-JP
>          ISO-8859-1
>          ISO-8859-2
>          ISO-8859-4
>          ISO-8859-7
>          ISO-8859-9;
>          KOI8-R
>          UNKNOWN-8BIT
>          US-ASCII
>          UTF-8
>          Windows-1250
>          Windows-1251
>          Windows-1252
>          X-UNKNOWN
>          big5
>          euc-kr
>          gb2312
>          iso-2022-kr
>          iso-8859-1
>          iso-8859-1
>          iso-8859-13
>          iso-8859-15
>          koi8-r
>          ks_c_5601-1987
>          unknown-8bit
>          windows-1256
>          x-user-defined

> The chartypes listed on http://xmlsoft.org/html/libxml-encoding.html


>          XML_CHAR_ENCODING_ERROR=   -1, /* No char encoding detected */
>          XML_CHAR_ENCODING_NONE=      0, /* No char encoding detected */
>          XML_CHAR_ENCODING_UTF8=      1, /* UTF-8 */
>          XML_CHAR_ENCODING_UTF16LE=   2, /* UTF-16 little endian */
>          XML_CHAR_ENCODING_UTF16BE=   3, /* UTF-16 big endian */
>          XML_CHAR_ENCODING_UCS4LE=    4, /* UCS-4 little endian */
>          XML_CHAR_ENCODING_UCS4BE=    5, /* UCS-4 big endian */
>          XML_CHAR_ENCODING_EBCDIC=    6, /* EBCDIC uh! */
>          XML_CHAR_ENCODING_UCS4_2143=7, /* UCS-4 unusual ordering */
>          XML_CHAR_ENCODING_UCS4_3412=8, /* UCS-4 unusual ordering */
>          XML_CHAR_ENCODING_UCS2=      9, /* UCS-2 */
>          XML_CHAR_ENCODING_8859_1=    10,/* ISO-8859-1 ISO Latin 1 */
>          XML_CHAR_ENCODING_8859_2=    11,/* ISO-8859-2 ISO Latin 2 */
>          XML_CHAR_ENCODING_8859_3=    12,/* ISO-8859-3 */
>          XML_CHAR_ENCODING_8859_4=    13,/* ISO-8859-4 */
>          XML_CHAR_ENCODING_8859_5=    14,/* ISO-8859-5 */
>          XML_CHAR_ENCODING_8859_6=    15,/* ISO-8859-6 */
>          XML_CHAR_ENCODING_8859_7=    16,/* ISO-8859-7 */
>          XML_CHAR_ENCODING_8859_8=    17,/* ISO-8859-8 */
>          XML_CHAR_ENCODING_8859_9=    18,/* ISO-8859-9 */
>          XML_CHAR_ENCODING_2022_JP=  19,/* ISO-2022-JP */
>          XML_CHAR_ENCODING_SHIFT_JIS=20,/* Shift_JIS */
>          XML_CHAR_ENCODING_EUC_JP=   21,/* EUC-JP */
>          XML_CHAR_ENCODING_ASCII=    22 /* pure ASCII */

> Worik

Yes, it's true that libxml doesn't offer built-in support for most of
these character sets, but then again, that is not really its job.
It seems that the note about the iconv library has been removed from
the libxml site.


If libxml is compiled with iconv library support, it tries the iconv
character set conversion functions whenever its own converters fall
short. The actual set of supported character sets therefore depends on
the iconv library of the target platform. There is a GNU version,
though, that can be used on most *nixes when needed. It provides the
following character sets (http://www.gnu.org/software/libiconv/):

European languages
ASCII, ISO-8859-{1,2,3,4,5,7,9,10,13,14,15,16}, KOI8-R, KOI8-U, KOI8-RU, 
CP{1250,1251,1252,1253,1254,1257}, CP{850,866}, 
Mac{Roman,CentralEurope,Iceland,Croatian,Romania}, 
Mac{Cyrillic,Ukraine,Greek,Turkish}, Macintosh

Semitic languages
ISO-8859-{6,8}, CP{1255,1256}, CP862, Mac{Hebrew,Arabic}

Japanese
EUC-JP, SHIFT-JIS, CP932, ISO-2022-JP, ISO-2022-JP-2, ISO-2022-JP-1

Chinese
EUC-CN, HZ, GBK, GB18030, EUC-TW, BIG5, CP950, BIG5-HKSCS, ISO-2022-CN, 
ISO-2022-CN-EXT

Korean
EUC-KR, CP949, ISO-2022-KR, JOHAB

Armenian
ARMSCII-8

Georgian
Georgian-Academy, Georgian-PS

Thai
TIS-620, CP874, MacThai

Laotian
MuleLao-1, CP1133

Vietnamese
VISCII, TCVN, CP1258

Platform specifics
HP-ROMAN8, NEXTSTEP

Full Unicode
     UTF-8
     UCS-2, UCS-2BE, UCS-2LE
     UCS-4, UCS-4BE, UCS-4LE
     UTF-16, UTF-16BE, UTF-16LE
     UTF-32, UTF-32BE, UTF-32LE
     UTF-7
     JAVA
Full Unicode, in terms of uint16_t or uint32_t (with machine dependent 
endianness and alignment)
UCS-2-INTERNAL, UCS-4-INTERNAL
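
To make the iconv fallback described above concrete, here is a minimal
sketch of converting a buffer to UTF-8 by calling iconv directly. The
helper name to_utf8 and its return convention are made up for this
example; this is not Kannel's charset_to_utf8 and not what libxml does
internally, only the plain POSIX iconv calls that such a fallback
would rely on.

```c
#include <assert.h>
#include <iconv.h>
#include <stddef.h>

/* Hypothetical helper (not part of Kannel or libxml): convert in_len
 * bytes of `in`, encoded in charset `from`, to NUL-terminated UTF-8 in
 * `out`. Returns the number of UTF-8 bytes written (excluding the NUL),
 * or -1 if the charset is unknown to iconv or the conversion fails. */
long to_utf8(const char *from, const char *in, size_t in_len,
             char *out, size_t out_size)
{
    assert(from != NULL && in != NULL && out != NULL);
    if (out_size == 0)
        return -1;

    /* iconv_open takes the target charset first, then the source. */
    iconv_t cd = iconv_open("UTF-8", from);
    if (cd == (iconv_t)-1)
        return -1;                  /* this iconv does not know `from` */

    char *src = (char *)in;         /* iconv wants a non-const pointer */
    char *dst = out;
    size_t src_left = in_len;
    size_t dst_left = out_size - 1; /* keep one byte for the NUL */

    size_t rc = iconv(cd, &src, &src_left, &dst, &dst_left);
    if (rc != (size_t)-1)
        rc = iconv(cd, NULL, NULL, &dst, &dst_left);  /* flush state */
    iconv_close(cd);

    if (rc == (size_t)-1)
        return -1;                  /* bad input or output too small */

    *dst = '\0';
    return (long)(dst - out);
}
```

Whether a call such as to_utf8("BIG5", ...) succeeds depends entirely
on the iconv library installed on the platform, which is the point
made above.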

Oh yes, those Windows charsets... They are aliases to those CP charsets.
Actually this reminds me about one thing on my todo list:

2001-11-15  Tuomas Luttinen  <[EMAIL PROTECTED]>

     * gw/wml_definitions.h, gw/wml_compiler.c: Removed the windows
       character set registration from this module; it shouldn't have
       been here in the first place.

     * gwlib/gwlib.c (gwlib_init): Added a call to charset_init.

     * gwlib/charset.[ch]: New function charset_init added that registers
       windows charsets into the libxml character set aliases.

Ok, this patch is now in CVS, so the Windows charsets should now
work with SMS messages too, not just in the wml decks.
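
For anyone wondering what such an alias amounts to, below is a small
standalone sketch of the idea: a table that maps the Windows names to
the corresponding CP names before the charset is handed to the
converter. The table and the function resolve_charset are invented for
illustration and are not the code in gwlib/charset.c; the real
charset_init registers the aliases with libxml instead (libxml2 has
xmlAddEncodingAlias for that purpose).

```c
#include <assert.h>
#include <stddef.h>
#include <strings.h>    /* strcasecmp (POSIX) */

/* Hypothetical alias table: Windows charset name -> iconv CP name. */
struct charset_alias {
    const char *alias;
    const char *name;
};

static const struct charset_alias windows_aliases[] = {
    { "windows-1250", "CP1250" },
    { "windows-1251", "CP1251" },
    { "windows-1252", "CP1252" },
    { "windows-1253", "CP1253" },
    { "windows-1254", "CP1254" },
    { "windows-1255", "CP1255" },
    { "windows-1256", "CP1256" },
    { "windows-1257", "CP1257" },
    { "windows-1258", "CP1258" },
};

/* Return the CP name for a Windows charset name (compared without
 * regard to case), or the input pointer unchanged if no alias exists. */
const char *resolve_charset(const char *charset)
{
    assert(charset != NULL);
    size_t count = sizeof windows_aliases / sizeof windows_aliases[0];
    for (size_t i = 0; i < count; i++) {
        if (strcasecmp(charset, windows_aliases[i].alias) == 0)
            return windows_aliases[i].name;
    }
    return charset;
}
```

With this, both "Windows-1252" and "windows-1252" resolve to "CP1252",
while a name like "KOI8-R" passes through untouched.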


-- 
Tuomas Luttinen
     Application Developer -- Reach U
         **************
