Re: [Patch] optimize utf8_to_ucs4

Georg Baum Mon, 30 Oct 2006 12:22:10 -0800

On Monday, 30 October 2006 20:49, Joost Verburg wrote:
> Georg Baum wrote:
> > OK, so it is like that: Up to 4 bytes per code point are used for the 
> > currently defined 21 bits of UCS4, but UTF8 is designed in such a way 
> > that it is possible to encode all 36 bits of UCS4 with at most 6 bytes 
> > per code point.
> 
> Not really. Some years ago there was not yet a real limit in the Unicode 
> specification for the number of code points (the theoretical limit was 
> 2^31 if I remember correctly).
>
> However, the limit has now been set to 2^20+2^16 code points. There is 
> still a lot of space available, but there will _never_ be any more code 
> points than 2^20+2^16 (also not in UCS-4!).
> 
> So by definition UTF-8 allows a maximum of 4 bytes per character. Any 5 
> or 6 byte sequences are invalid.


So you say Markus Kuhn is wrong? That would be surprising to me, since he 
is considered to be a Unicode expert.

> To summarize:
> 
> * UTF-8 uses 1-4 bytes (1 byte for US-ASCII, 2 bytes for other Latin 
> characters, 3 bytes for Chinese etc. and 4 bytes for rare things).
> 
> * UTF-16 uses 2 bytes for Latin, Chinese etc. and 4 bytes for rare 
> characters.
> 
> * UTF-32 has a fixed length of 4 bytes per character and is functionally 
> equivalent to UCS-4.
> 
> Please keep things simple and call the encodings UTF-8, UTF-16 and 
> UTF-32.
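
The byte counts in the summary above can be sketched in C++ as follows. 
This is only an illustration under the 4-byte limit described in the 
quoted text; the function names are hypothetical and are not part of LyX. 
A return value of 0 means "not encodable" (a surrogate or a value above 
U+10FFFF).

```cpp
#include <cstddef>

// Hypothetical helpers for illustration; not taken from LyX.
// A Unicode scalar value is any code point up to U+10FFFF that is
// not a UTF-16 surrogate (U+D800..U+DFFF).
inline bool is_scalar_value(char32_t cp) {
    return cp <= 0x10FFFF && !(cp >= 0xD800 && cp <= 0xDFFF);
}

// Bytes needed in UTF-8: 1 to 4, or 0 if cp is not encodable.
inline std::size_t utf8_bytes(char32_t cp) {
    if (!is_scalar_value(cp)) return 0;
    if (cp < 0x80) return 1;      // US-ASCII
    if (cp < 0x800) return 2;     // e.g. other Latin letters
    if (cp < 0x10000) return 3;   // rest of the BMP, e.g. Chinese
    return 4;                     // supplementary planes
}

// Bytes needed in UTF-16: 2 for the BMP, 4 for a surrogate pair.
inline std::size_t utf16_bytes(char32_t cp) {
    if (!is_scalar_value(cp)) return 0;
    return cp < 0x10000 ? 2 : 4;
}

// Bytes needed in UTF-32: always 4 for an encodable value.
inline std::size_t utf32_bytes(char32_t cp) {
    return is_scalar_value(cp) ? 4 : 0;
}
```

For example, U+00E9 takes 2 bytes in UTF-8 and 2 in UTF-16, while 
U+1F600 takes 4 bytes in all three encodings.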

As long as LyX calls conversion utilities with "UCS4" I will call the 
encoding that LyX uses "UCS4". Although the difference between UTF32 and 
UCS4 is only theoretical, I think it should be made clear what is meant. 
Since UTF8 and UTF16 are variable-width encodings and we want a 
fixed-width encoding, I find it natural to call it UCS4 and not UTF32, 
even though the two are identical for all defined Unicode characters.
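
To make the variable-width versus fixed-width distinction concrete, here 
is a minimal sketch of a decoder from UTF8 into 32-bit units. It is not 
the utf8_to_ucs4 from the patch; the name decode_utf8_strict is 
hypothetical, and the sketch assumes the stricter 4-byte reading from the 
quoted text, so 5 and 6 byte sequences are treated as errors.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper for illustration; not LyX's utf8_to_ucs4.
// Decodes UTF-8 into fixed-width 32-bit values, one per code point.
// Returns false on malformed input (out may then hold a partial result).
inline bool decode_utf8_strict(const std::string& in,
                               std::vector<char32_t>& out) {
    // Smallest code point that legitimately needs a sequence of each length.
    static const char32_t min_cp[] = {0, 0, 0x80, 0x800, 0x10000};
    out.clear();
    std::size_t i = 0;
    while (i < in.size()) {
        const unsigned char lead = static_cast<unsigned char>(in[i]);
        std::size_t len;
        char32_t cp;
        if (lead < 0x80) { len = 1; cp = lead; }
        else if (lead >= 0xC2 && lead <= 0xDF) { len = 2; cp = lead & 0x1F; }
        else if ((lead & 0xF0) == 0xE0) { len = 3; cp = lead & 0x0F; }
        else if (lead >= 0xF0 && lead <= 0xF4) { len = 4; cp = lead & 0x07; }
        else return false;  // stray continuation byte, C0/C1, or F5..FF
                            // (the latter covers 5 and 6 byte lead bytes)
        if (in.size() - i < len) return false;  // truncated sequence
        for (std::size_t k = 1; k < len; ++k) {
            const unsigned char c = static_cast<unsigned char>(in[i + k]);
            if ((c & 0xC0) != 0x80) return false;  // not a continuation byte
            cp = (cp << 6) | (c & 0x3F);
        }
        if (cp < min_cp[len]) return false;              // overlong form
        if (cp > 0x10FFFF) return false;                 // beyond Unicode
        if (cp >= 0xD800 && cp <= 0xDFFF) return false;  // surrogate
        out.push_back(cp);
        i += len;
    }
    return true;
}
```

After a successful call, out[n] is the n-th code point regardless of how 
many bytes it occupied in the input, which is the property we want from a 
fixed-width encoding.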


Georg
