On Monday, 30 October 2006 20:49, Joost Verburg wrote:
> Georg Baum wrote:
> > OK, so it is like that: Up to 4 bytes per code point are used for the
> > currently defined 21 bits of UCS4, but UTF8 is designed in such a way that
> > it is possible to encode all 36 bits of UCS4 with at most 6 bytes per code
> > point.
>
> Not really. Some years ago there was not yet a real limit in the Unicode
> specification for the number of code points (the theoretical limit was
> 2^31 if I remember correctly).
>
> However, the limit has now been set to 2^20+2^16 code points. There is
> still a lot of space available, but there will _never_ be any more code
> points than 2^20+2^16 (also not in UCS-4!).
>
> So by definition UTF-8 allows a maximum of 4 bytes per character. Any 5
> or 6 byte sequences are invalid.

So you say Markus Kuhn is wrong? That would be surprising to me, since he is
considered to be a Unicode expert.

> To summarize:
>
> * UTF-8 uses 1-4 bytes (1 byte for US-ASCII, 2 bytes for other Latin
> characters, 3 bytes for Chinese etc. and 4 bytes for rare things).
>
> * UTF-16 uses 2 bytes for Latin, Chinese etc. and 4 bytes for rare
> characters.
>
> * UTF-32 has a fixed length of 4 bytes per character and is functionally
> equivalent to UCS-4.
>
> Please keep things simple and call the encodings UTF-8, UTF-16 and UTF-32.

As long as LyX calls conversion utilities with "UCS4" I will call the encoding
that LyX uses "UCS4". Although the difference between UTF32 and UCS4 is only a
theoretical one, I think that it should be made clear what is meant. Since UTF8
and UTF16 are variable-byte encodings and we want a fixed-byte encoding, I find
it only natural to call it UCS4 and not UTF32, even if the difference is only
theoretical and UTF32 and UCS4 are identical for all defined Unicode characters.

Georg
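
For illustration, the byte counts in the quoted summary and the 2^20+2^16
limit can be checked with a short Python sketch. This is only a sketch and
not LyX code; the helper name encoded_lengths and the four sample characters
are assumptions chosen for this example.

```python
# Upper bound on the number of code points mentioned in the thread.
CODE_POINT_LIMIT = 2**20 + 2**16  # equals 0x110000


def encoded_lengths(ch):
    """Return the byte length of one character in UTF-8, UTF-16 and UTF-32.

    Hypothetical helper for this example. The "-le" codec variants are used
    so that no byte order mark is counted.
    """
    return {
        "utf-8": len(ch.encode("utf-8")),
        "utf-16": len(ch.encode("utf-16-le")),
        "utf-32": len(ch.encode("utf-32-le")),
    }


samples = [
    "A",           # US-ASCII
    "\u00e9",      # Latin small letter e with acute
    "\u4e2d",      # a CJK ideograph
    "\U0001d11e",  # musical symbol G clef, outside the 16-bit range
]

for ch in samples:
    print(f"U+{ord(ch):04X}", encoded_lengths(ch))
```

The highest code point, CODE_POINT_LIMIT - 1 (U+10FFFF), still fits in 4
UTF-8 bytes, and Python's chr() rejects anything at or above the limit with
a ValueError.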
