Re: UTF-24

Markus Scherer Thu, 03 Apr 2003 12:32:51 -0800

Pim Blokland wrote:

> Why is there no UTF-24?

Well, I once proposed UTF-20...

> See, these MathText characters take up a lot of space. No matter how
> you encode them; UTF-8, UTF-16 or UTF-32; they always are 4 bytes
> long.

True for them alone, in those UTFs. Short of defining another Unicode encoding, there are two answers that I can offer you:

1. Such characters are expected to be a minority of the text, I suppose even in math text, because such documents contain lots of other characters - punctuation, spaces, digits, regular text - that are mostly in the BMP and thus shorter. So a whole math document with some MathText supplementary characters will use, on average, fewer than 3 bytes per code point in UTF-8 or UTF-16.

2. If you want compression, use the existing Unicode compression schemes SCSU (UTR #6) and BOCU-1 (UTN #6), or a general-purpose compressor like bzip2.

Note that this matters only for text interchange - the majority of Unicode-aware software uses UTF-16 internally.
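
To make the byte counts in point 1 concrete, here is a small Python sketch. The helper name and the sample strings are made up for illustration, and the numbers depend entirely on how many supplementary characters a given text contains. It measures the average encoded size per code point in each UTF, without a BOM.

```python
def bytes_per_code_point(text):
    """Average encoded size per code point in each UTF (no BOM)."""
    n = len(text)  # a Python str is a sequence of code points
    codecs = (("UTF-8", "utf-8"), ("UTF-16", "utf-16-be"), ("UTF-32", "utf-32-be"))
    return {name: len(text.encode(codec)) / n for name, codec in codecs}

# Two supplementary characters alone:
# U+1D465 MATHEMATICAL ITALIC SMALL X, U+1D466 MATHEMATICAL ITALIC SMALL Y
math_only = "\U0001D465\U0001D466"

# Hypothetical sample: the same characters surrounded by ordinary BMP text
mixed = "Let \U0001D465 + \U0001D466 = 2."

print(bytes_per_code_point(math_only))  # 4.0 in all three UTFs
print(bytes_per_code_point(mixed))      # below 3 in UTF-8 and UTF-16, 4.0 in UTF-32
```

For the mixed sample, 12 of the 14 code points are ASCII, so UTF-8 needs 20 bytes and UTF-16 needs 32 bytes in total, while UTF-32 always needs 4 bytes per code point.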

Best regards,
markus

--
Opinions expressed here may not reflect my company's positions unless otherwise noted.
