Re: Nicest UTF

Philippe Verdy Thu, 02 Dec 2004 13:53:38 -0800

There's no *universal* best encoding.

UTF-8, however, is certainly the best encoding today for portable communication and data storage. It now competes with SCSU, a compressed form in which, for most documents, each Unicode character takes about one byte on average; other schemes apply deflate compression to UTF-8.

The problem with UTF-16 and UTF-32 is byte ordering, where "byte" means the unit of portable networking and file storage, i.e. 8 bits in almost all current technologies. With UTF-16 and UTF-32, you need a way to determine how the bytes of each code unit are ordered when reading from a byte-oriented stream. With UTF-8 you do not.
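To make the byte-order point concrete, here is a minimal Python sketch. The helper name decode_utf16_with_bom and the sample string are illustrative assumptions, and the big-endian fallback when no BOM is present is just one possible policy, not something every protocol prescribes.

```python
import codecs

text = "A\u20ac"  # 'A' followed by the euro sign U+20AC

# The same two code units, serialized in the two possible byte orders.
utf16_be = text.encode("utf-16-be")  # bytes 00 41 20 AC
utf16_le = text.encode("utf-16-le")  # bytes 41 00 AC 20

# UTF-8 is defined as a byte sequence, so there is only one serialization.
utf8 = text.encode("utf-8")          # bytes 41 E2 82 AC


def decode_utf16_with_bom(data: bytes) -> str:
    """Decode UTF-16 bytes, using a leading BOM to pick the byte order.

    Hypothetical helper: falls back to big-endian when no BOM is found.
    """
    if data.startswith(codecs.BOM_UTF16_LE):
        return data[2:].decode("utf-16-le")
    if data.startswith(codecs.BOM_UTF16_BE):
        return data[2:].decode("utf-16-be")
    return data.decode("utf-16-be")


# Guessing the wrong byte order silently yields different characters.
misread = utf16_le.decode("utf-16-be")
```

Reading the little-endian bytes as big-endian does not raise an error here; it simply produces other characters, which is why the byte order has to be signalled out of band or with a BOM.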

The problem with UTF-8 is that it is often inefficient or awkward to work with inside applications and libraries, which find it easier to access strings and count characters when they are stored in fixed-width code units.

Although UTF-16 is not strictly fixed-width, it is quite easy to work with, and is often more efficient than UTF-32 because it needs less memory.

UTF-32, however, is the easiest solution when applications really want to handle every Unicode code point as a single code unit.
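The fixed-width versus variable-width trade-off can be seen by counting code units for one string in each form. This is only a sketch: the function name code_units and the sample string are assumptions chosen for illustration.

```python
def code_units(text: str) -> dict:
    """Count code points and the code units needed in each UTF."""
    return {
        "code_points": len(text),                      # Python 3 str is indexed by code point
        "utf8": len(text.encode("utf-8")),             # 8-bit code units
        "utf16": len(text.encode("utf-16-le")) // 2,   # 16-bit code units
        "utf32": len(text.encode("utf-32-le")) // 4,   # 32-bit code units
    }


# 'a' (U+0061), e-acute (U+00E9), euro sign (U+20AC), musical G clef (U+1D11E)
sample = "a\u00e9\u20ac\U0001D11E"
counts = code_units(sample)
```

For this sample, only UTF-32 has as many code units as code points; UTF-16 needs one extra unit for the supplementary character, and UTF-8 needs between one and four units per character.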

All UTF encodings, as well as the SCSU compressed encoding, the byte-oriented compressed encoding BOCU-1, and the Chinese GB18030 standard (which is now a valid representation of Unicode), have their pros and cons.

Choose among them, because they are widely documented and interoperate well across the many libraries that handle them with similar semantics.

If none of these encodings satisfies your application, you may even create your own (as Sun did when it modified UTF-8 so that any Unicode string can be represented within a null-terminated C string, and so that any sequence of 16-bit code units, even invalid ones with unpaired surrogates, can be represented on 8-bit streams). If you do that, don't expect your encoding to be easily portable or recognized by other systems, unless you document it with a complete specification and make it freely available for others to implement.
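As a rough illustration of that kind of modification, the sketch below encodes each 16-bit code unit on its own, writes U+0000 as the two bytes C0 80 so that no zero byte appears in the output, and lets unpaired surrogates through. The function names are hypothetical, and this is an approximation of the idea described above, not a reference implementation of Sun's format.

```python
import struct


def utf16_units(text: str) -> list:
    """Return the 16-bit code units of text, letting lone surrogates through."""
    data = text.encode("utf-16-le", "surrogatepass")
    return [unit for (unit,) in struct.iter_unpack("<H", data)]


def encode_modified_utf8(text: str) -> bytes:
    """Hypothetical sketch of a modified UTF-8 encoder.

    Each 16-bit code unit is encoded separately, so a supplementary
    character becomes two 3-byte sequences, and U+0000 becomes C0 80.
    """
    out = bytearray()
    for u in utf16_units(text):
        if 0x01 <= u <= 0x7F:
            out.append(u)                       # plain ASCII, one byte
        elif u <= 0x7FF:
            # Two bytes; this branch also covers u == 0, giving C0 80.
            out += bytes((0xC0 | (u >> 6), 0x80 | (u & 0x3F)))
        else:
            # Three bytes, including surrogate code units D800..DFFF.
            out += bytes((0xE0 | (u >> 12),
                          0x80 | ((u >> 6) & 0x3F),
                          0x80 | (u & 0x3F)))
    return bytes(out)


with_nul = encode_modified_utf8("A\x00B")
lone_surrogate = encode_modified_utf8("\ud800")
g_clef = encode_modified_utf8("\U0001D11E")
```

A standard UTF-8 decoder will reject both the C0 80 sequence and the encoded surrogates, which is exactly the portability cost mentioned above.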

----- Original Message -----
From: "Arcane Jill" <[EMAIL PROTECTED]>
To: "Unicode" <[EMAIL PROTECTED]>
Sent: Thursday, December 02, 2004 2:19 PM
Subject: RE: Nicest UTF

Oh for a chip with 21-bit wide registers!
:-)
Jill
-----Original Message-----
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]] On Behalf Of Antoine Leca
Sent: 02 December 2004 12:12
To: Unicode Mailing List
Subject: Re: Nicest UTF
There are other factors that might influence your choice. For example, the relative cost of using 16-bit entities: on a Pentium it is cheap, on more modern x86 processors the price is a bit higher, and on some RISC chips it is prohibitive (that is, short may become 32 bits; obviously, in such a case, UTF-16 is not really a good choice). At the other extreme, you have processors where bytes are 16 bits; obviously again, UTF-8 is not optimal there. ;-)
