Re: [Patch] optimize utf8_to_ucs4

Abdelrazak Younes Sun, 29 Oct 2006 13:45:12 -0800

Joost Verburg wrote:

Abdelrazak Younes wrote:
utf8 can use up to 6 characters. 4 bytes would not have been enough to store 4 bytes of data plus the protocol necessary to decode utf8.
Unicode does not use the full 32 bits. There are only 2^20+2^16 code points, so the whole of Unicode would actually fit in 21 bits (there is no 21-bit integer type, of course). It is therefore possible to encode Unicode data in 1-4 bytes as UTF-8.
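To check the arithmetic above, here is a minimal Python sketch. The names MAX_CODE_POINT and utf8_length are made up for illustration and are not taken from the patch; the byte-length thresholds are the standard UTF-8 ranges.

```python
# Highest Unicode code point: 17 planes of 2**16 code points each.
MAX_CODE_POINT = 0x10FFFF
code_points = MAX_CODE_POINT + 1


def utf8_length(cp):
    """Return how many bytes UTF-8 needs to encode code point cp."""
    if cp < 0:
        raise ValueError("negative code point")
    if cp < 0x80:
        return 1          # 7 payload bits
    if cp < 0x800:
        return 2          # 11 payload bits
    if cp < 0x10000:
        return 3          # 16 payload bits
    if cp <= MAX_CODE_POINT:
        return 4          # 21 payload bits
    raise ValueError("beyond the Unicode code point limit")


# 2^20 + 2^16 code points in total, and the largest one fits in 21 bits.
print(code_points == 2**20 + 2**16)
print(MAX_CODE_POINT.bit_length())

# Compare the hand-written rule with Python's own UTF-8 encoder.
for cp in (0x41, 0xE9, 0x20AC, 0x1F600, MAX_CODE_POINT):
    print(hex(cp), utf8_length(cp), len(chr(cp).encode("utf-8")))
```

Surrogate code points (U+D800 to U+DFFF) are left out of the comparison because Python's encoder refuses to encode them on their own.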
By the way, UCS-4 and UTF-32 can be taken to be identical. So it is better to only use the names UTF-8, UTF-16 and UTF-32. Names like "UTF-8 to UCS-4" confuse people.
(Historically, when the Unicode specification did not yet contain this code point limit, there was indeed a difference between UCS-4 and UTF-32, and the theoretical possibility of having characters that use more than 4 bytes in UTF-8. This is no longer the case.)
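The two points above can be illustrated with a short Python sketch, assuming the usual UTF-8 bit layout (a lead byte with n leading one bits followed by a zero, then 6 payload bits per continuation byte). The helper name payload_bits is hypothetical.

```python
def payload_bits(n):
    """Payload bits carried by an n-byte sequence in the UTF-8 bit layout."""
    if n == 1:
        return 7                      # 0xxxxxxx
    # Lead byte: n one bits, a zero bit, then 7 - n payload bits.
    # Each of the n - 1 continuation bytes (10xxxxxx) adds 6 bits.
    return (7 - n) + 6 * (n - 1)


# Sequences of 5 and 6 bytes belong to the old, pre-limit scheme only.
for n in range(1, 7):
    print(n, "bytes ->", payload_bits(n), "bits")

# A UTF-32 code unit is simply the code point stored as a 32-bit integer.
euro = "\u20ac"
as_utf32 = int.from_bytes(euro.encode("utf-32-be"), "big")
print(as_utf32 == ord(euro))

# Values past the limit are not code points at all.
try:
    chr(0x110000)
except ValueError:
    print("0x110000 is rejected")
```

Four bytes give exactly the 21 bits needed for today's limit, while the 6-byte form would have carried 31 bits, which is where the "up to 6" figure comes from.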
Hopefully this makes things clear.


Indeed. Thanks a lot.

Abdel.
