Arcane Jill <arcanejill at ramonsky dot com> wrote:

> Probably a dumb question, but how come nobody's invented "UTF-24" yet?
> I just made that up, it's not an official standard, but one could
> easily define UTF-24 as UTF-32 with the most-significant byte (which
> is always zero) removed, hence all characters are stored in exactly
> three bytes and all are treated equally. You could have UTF-24LE and
> UTF-24BE variants, and even UTF-24 BOMs. Of course, I'm not suggesting
> this is a particularly brilliant idea, but I just wonder why no-one's
> suggested it before.

It has been suggested before, by Pim Blokland on April 3, 2003, in a
message titled "UTF-24."  If you get the digest, it's in Digest V3 #79.

> The "UTF-24" thing seems a reasonably sensible question though. Is it
> just that we don't like it because some processors have alignment
> restrictions or something?

Almost all do.  In addition, no programming language I know of has a
3-byte-wide integer data type (maybe INTERCAL does), so every character
would have to be packed and unpacked by hand, and whatever space UTF-24
saved would be paid for in processing time, in software as well as in
hardware.
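
To make the software cost concrete, here is a minimal sketch in Python
of what handling such a format might look like.  "UTF-24" is not a
standard, and the function names utf24_encode and utf24_decode are
made up for illustration; this is not the experimental implementation
mentioned in this message.  Without a native 3-byte integer type, each
code point has to be split into bytes and reassembled one at a time.

```python
def utf24_encode(text: str, byteorder: str = "big") -> bytes:
    """Pack each code point into exactly three bytes (hypothetical UTF-24)."""
    out = bytearray()
    for ch in text:
        cp = ord(ch)
        # Surrogate code points are not valid scalar values in any UTF.
        if 0xD800 <= cp <= 0xDFFF:
            raise ValueError(f"cannot encode surrogate U+{cp:04X}")
        # No 3-byte integer type exists, so serialize byte by byte.
        out += cp.to_bytes(3, byteorder)
    return bytes(out)


def utf24_decode(data: bytes, byteorder: str = "big") -> str:
    """Reassemble code points from consecutive 3-byte groups."""
    if len(data) % 3 != 0:
        raise ValueError("length is not a multiple of 3 bytes")
    chars = []
    for i in range(0, len(data), 3):
        cp = int.from_bytes(data[i:i + 3], byteorder)
        # Three bytes can hold values beyond the Unicode range.
        if cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
            raise ValueError(f"invalid code point 0x{cp:06X}")
        chars.append(chr(cp))
    return "".join(chars)
```

Passing "little" as the byte order gives the "UTF-24LE" variant from
the quoted proposal, and "big" gives "UTF-24BE".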

Besides that, there were the usual protests that supplementary
characters would be vanishingly rare in the context of "normal" text,
and that one should use compression (SCSU, BOCU, or general-purpose
tools) if size is an issue.
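
The size argument can be checked with a little arithmetic.  The sketch
below (again Python, with the made-up helper name encoded_sizes)
counts the bytes each form would need for a string.  The "UTF-24"
figure is simply three bytes per code point, since the format is only
hypothetical; the other figures come from Python's built-in codecs,
without a BOM.

```python
def encoded_sizes(text: str) -> dict:
    """Return the byte count of text in each encoding form (no BOM)."""
    return {
        "UTF-8": len(text.encode("utf-8")),
        "UTF-16": len(text.encode("utf-16-le")),
        # Hypothetical UTF-24: a fixed three bytes per code point.
        "UTF-24": 3 * len(text),
        "UTF-32": len(text.encode("utf-32-le")),
    }


# Text made up entirely of BMP characters: UTF-16 needs 2 bytes each,
# so the 3-byte form is larger, not smaller.
print(encoded_sizes("hello"))

# A single supplementary character: here 3 bytes beats the 4 bytes
# needed by the other three forms.
print(encoded_sizes("\U00010400"))
```

Only text dominated by supplementary characters comes out ahead in
the 3-byte form, which is the "vanishingly rare" case.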

None of this stopped me from experimentally implementing it, of course,
but I haven't touched it since finishing the implementation.

-Doug Ewell
 Fullerton, California
 http://users.adelphia.net/~dewell/


