On Jan 10, 2014, at 12:28 PM, John Gilmore <jwgli...@gmail.com> wrote:
> Briefly, effective rules for encoding any 'character' recognized as a
> Unicode one as a 'longer' UTF-8 one do not in general exist.

Sure they do. From http://www.unicode.org/faq/utf_bom.html#UTF8:

"UTF-8 is the byte-oriented encoding form of Unicode. For details of its definition, see Section 2.5, Encoding Forms and Section 3.9, Unicode Encoding Forms in the Unicode Standard."

Also, at http://www.unicode.org/resources/utf8.html:

• ANSI C implementation of UTF-8 (http://www.bsdua.org/files/unicode.tar.gz) — converts UTF-8 into UCS4 and vice versa. Source code is BSD licensed.

> Moreover, even when they are available, my experience with them has
> been bad. In dealing recently with a document containing mixed
> English, German, Korean and Japanese text I found that the UTF-8
> version was 23% longer than the UTF-16 version.

As far as I've been able to see, the Unicode Consortium views UTF-8 and UTF-16 as equally viable. Which is preferable depends entirely on the character of the texts you're processing. (Well, with UTF-16 you have to worry about endianness, but with UTF-8 you don't.) If your text is mostly Latin and related characters, UTF-8 will probably be shorter. If it includes a significant amount of CJK (Chinese/Japanese/Korean) characters, as you apparently had here, UTF-16 will probably be shorter.

--
Curtis Pew (c....@its.utexas.edu)
ITS Systems Core
The University of Texas at Austin

----------------------------------------------------------------------
For IBM-MAIN subscribe / signoff / archive access instructions,
send email to lists...@listserv.ua.edu with the message: INFO IBM-MAIN
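[The size trade-off discussed above can be checked directly. The following is a minimal Python sketch, not from the original thread; the sample strings are illustrative assumptions. In the BMP, Latin text costs 1-2 bytes per character in UTF-8 but always 2 in UTF-16, while CJK characters cost 3 bytes in UTF-8 but only 2 in UTF-16.]

```python
# Compare UTF-8 vs UTF-16 byte lengths for Latin-heavy vs CJK-heavy text.
# Sample strings are illustrative assumptions, not from the original message.
samples = {
    "latin": "Grüße aus Österreich",      # mostly 1-byte, a few 2-byte UTF-8 chars
    "cjk": "日本語と한국어のテキスト",        # BMP CJK/Hangul: 3 bytes each in UTF-8
}

for label, text in samples.items():
    utf8_len = len(text.encode("utf-8"))
    # "utf-16-le" avoids counting the 2-byte BOM that plain "utf-16" prepends
    utf16_len = len(text.encode("utf-16-le"))
    print(f"{label}: {len(text)} chars, "
          f"UTF-8 = {utf8_len} bytes, UTF-16 = {utf16_len} bytes")
```

[For the Latin sample UTF-8 comes out shorter; for the CJK sample UTF-16 does, matching the 23% difference reported above for mixed Korean/Japanese text.]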