On Jan 10, 2014, at 12:28 PM, John Gilmore <jwgli...@gmail.com> wrote:

> Briefly, effective rules for encoding any 'character' recognized as a
> Unicode one as a 'longer' UTF-8 one do not in general exist.

Sure they do. From http://www.unicode.org/faq/utf_bom.html#UTF8:

"UTF-8 is the byte-oriented encoding form of Unicode. For details of its 
definition, see Section 2.5, Encoding Forms and Section 3.9, Unicode Encoding 
Forms ” in the Unicode Standard.”

Also, at http://www.unicode.org/resources/utf8.html:

        • ANSI C implementation of UTF-8 
(http://www.bsdua.org/files/unicode.tar.gz): converts UTF-8 into UCS4 and 
vice versa. Source code is BSD licensed.
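
For what it's worth, the encoding rules themselves are simple enough to sketch 
in a few lines of C. This is only a rough illustration of the byte patterns 
described in Section 3.9, not the bsdua.org code:

#include <stdio.h>
#include <stdint.h>

/* Encode one Unicode scalar value (U+0000..U+10FFFF, excluding surrogates)
   as UTF-8.  Writes 1-4 bytes into out and returns the count, or 0 if the
   value is not encodable. */
static int utf8_encode(uint32_t cp, unsigned char out[4])
{
    if (cp <= 0x7F) {                          /* 0xxxxxxx */
        out[0] = (unsigned char)cp;
        return 1;
    } else if (cp <= 0x7FF) {                  /* 110xxxxx 10xxxxxx */
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    } else if (cp <= 0xFFFF) {                 /* 1110xxxx 10xxxxxx 10xxxxxx */
        if (cp >= 0xD800 && cp <= 0xDFFF)      /* surrogates are not scalar values */
            return 0;
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    } else if (cp <= 0x10FFFF) {               /* 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx */
        out[0] = (unsigned char)(0xF0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;
}

int main(void)
{
    /* Sample code points: 'A', e-acute, a Hangul syllable, an emoji. */
    uint32_t samples[] = { 0x0041, 0x00E9, 0xD55C, 0x1F600 };
    unsigned char buf[4];
    for (int i = 0; i < 4; i++) {
        int n = utf8_encode(samples[i], buf);
        printf("U+%04X -> %d byte(s):", (unsigned)samples[i], n);
        for (int j = 0; j < n; j++)
            printf(" %02X", (unsigned)buf[j]);
        printf("\n");
    }
    return 0;
}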

> Moreover, even when they are available, my experience with them has
> been bad.  In dealing recently with a document containing mixed
> English, German, Korean and Japanese text I found that the UTF-8
> version was 23% longer than the UTF-16 version.

As far as I’ve been able to see, the Unicode consortium views UTF-8 and UTF-16 
as equally viable. Which is preferable depends entirely on the character of the 
texts you’re processing. (Well, with UTF-16 you have to worry about endianness, 
but with UTF-8 you don’t.) If your text is mostly Latin and related characters, 
UTF-8 will probably be shorter. If it includes a significant number of CJK 
(Chinese/Japanese/Korean) characters, as you apparently had here, UTF-16 will 
probably be shorter.
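
A rough way to see where that 23% comes from: per code point, ASCII text costs 
1 byte in UTF-8 against 2 in UTF-16, while Hangul and most CJK ideographs (which 
sit in the BMP) cost 3 bytes in UTF-8 against 2 in UTF-16. A minimal sketch, 
with a few sample code points picked just for illustration:

#include <stdio.h>
#include <stdint.h>

/* Bytes needed to store one code point in each encoding form. */
static int utf8_bytes(uint32_t cp)
{
    if (cp <= 0x7F)   return 1;   /* ASCII */
    if (cp <= 0x7FF)  return 2;   /* accented Latin, Greek, Cyrillic, ... */
    if (cp <= 0xFFFF) return 3;   /* rest of the BMP, incl. Hangul and most CJK */
    return 4;                     /* supplementary planes */
}

static int utf16_bytes(uint32_t cp)
{
    return cp <= 0xFFFF ? 2 : 4;  /* one 16-bit unit, or a surrogate pair */
}

int main(void)
{
    struct { const char *name; uint32_t cp; } samples[] = {
        { "Latin  U+0061", 0x0061 },
        { "Latin  U+00F6", 0x00F6 },
        { "Hangul U+D55C", 0xD55C },
        { "CJK    U+6F22", 0x6F22 },
    };
    for (int i = 0; i < 4; i++)
        printf("%-15s  UTF-8: %d  UTF-16: %d\n",
               samples[i].name, utf8_bytes(samples[i].cp), utf16_bytes(samples[i].cp));
    return 0;
}

The 3-vs-2 ratio on the Korean and Japanese portion, averaged against the 
1-vs-2 ratio on the English and German portion, is what pushes the overall 
size one way or the other for a mixed document like the one described.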

-- 
Curtis Pew (c....@its.utexas.edu)
ITS Systems Core
The University of Texas at Austin
