At 10:21 AM 9/21/01 -0700, Kenneth Whistler wrote:

>It is my impression, however, that most significant applications
>tend, these days, to be I/O bound and/or network
>transport bound, rather than compute bound.
...
>We don't hear
>much, anymore, about how "wasteful" Unicode is in its storage
>of characters.

These points are well taken, particularly in the context of the discussion 
in which they appeared. However, there are still situations where a doubling 
of storage, such as going from UTF-16 to UTF-32 for average data, can have a 
direct impact.

The typical situation involves cases where large data sets are cached in 
memory for immediate access. Going to UTF-32 effectively cuts the cache 
capacity in half, with no comparable gain in processing efficiency to balance 
out the extra cache misses, and each cache miss is orders of magnitude more 
expensive than a cache hit.
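
To put numbers on it (purely for illustration; the data set size below is 
arbitrary), the footprint difference for BMP-only text looks like this in C:

#include <stdio.h>

int main(void)
{
    const unsigned long n_chars = 1000000UL;   /* arbitrary cached data set size */
    unsigned long utf16_bytes = n_chars * 2;   /* BMP characters: one 16-bit unit */
    unsigned long utf32_bytes = n_chars * 4;   /* always one 32-bit unit */

    printf("UTF-16: %lu bytes\n", utf16_bytes);
    printf("UTF-32: %lu bytes\n", utf32_bytes); /* double the footprint, so half as
                                                   many characters fit in the cache */
    return 0;
}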

For specialized data sets (heavy in ASCII), keeping such a cache in UTF-8 
might conceivably reduce cache misses further, to the point where on-the-fly 
conversion to UTF-16 could be amortized. However, such an optimization is not 
robust unless the ASCII-heavy assumption follows from the nature of the data 
(e.g. HTML markup) rather than merely from its source (e.g. the US market). 
In the latter case, such an architecture scales badly as the market changes.
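
To sketch what that on-the-fly conversion might look like, here is just the 
ASCII fast path (illustrative C only; a real converter would also decode 
multi-byte sequences, and supplementary characters, in a slow path):

#include <stddef.h>
#include <stdint.h>

/* Converts the leading run of ASCII bytes and returns its length; the
   caller hands any remainder to a full UTF-8 decoder (not shown). */
size_t utf8_ascii_to_utf16(const uint8_t *src, size_t len, uint16_t *dst)
{
    size_t i;
    for (i = 0; i < len; i++) {
        if (src[i] >= 0x80)        /* not ASCII: stop the fast path */
            break;
        dst[i] = src[i];           /* ASCII maps 1:1 to a UTF-16 code unit */
    }
    return i;
}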

[The decision to use UTF-16, on the other hand, is much more robust, 
because the code paths that deal with surrogate pairs will be exercised 
with low frequency, due to the deliberate concentration of nearly all 
modern-use characters into the BMP (i.e. the first 64K code points).]
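
To illustrate that point, the extra work in a UTF-16 reader amounts to one 
rarely taken branch, roughly like this (sketch assumes well-formed input):

#include <stdint.h>

/* Reads one code point starting at s[0], stores it in *cp, and returns
   the number of 16-bit units consumed (1 or 2). */
int utf16_next(const uint16_t *s, uint32_t *cp)
{
    uint16_t u = s[0];
    if (u >= 0xD800 && u <= 0xDBFF) {          /* high surrogate: rare path */
        uint16_t lo = s[1];                    /* low surrogate follows */
        *cp = 0x10000u + (((uint32_t)(u - 0xD800) << 10) | (uint32_t)(lo - 0xDC00));
        return 2;
    }
    *cp = u;                                   /* BMP character: common path */
    return 1;
}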

A./
