Re: [Patch] optimize utf8_to_ucs4

Joost Verburg Mon, 30 Oct 2006 14:13:59 -0800

Georg Baum wrote:

So you say Markus Kuhn is wrong? That would be surprising to me, since he is considered to be a Unicode expert.

His information is outdated. RFC 2279 (the old UTF-8 specification) did include support for a 31-bit code space. Because the Unicode code space was later restricted, the RFC has been updated as RFC 3629 and is restricted to the range 0000-10FFFF. There will never be any characters outside this range. RFC 2279 is obsolete.

So the _current_ definition of UTF-8 (RFC 3629) does _not_ allow 5- and 6-byte sequences. See http://www.faqs.org/rfcs/rfc3629.html
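
To make the restriction concrete, here is a minimal sketch of a decoder for a single code point that follows the RFC 3629 limits. It is only an illustration, not the utf8_to_ucs4 code from the patch under discussion; the function names utf8_sequence_length and decode_one are made up for this example.

```python
def utf8_sequence_length(lead: int) -> int:
    """Return the sequence length implied by a lead byte, or 0 if the
    byte can never start a sequence under RFC 3629."""
    if lead <= 0x7F:
        return 1
    if 0xC2 <= lead <= 0xDF:
        return 2
    if 0xE0 <= lead <= 0xEF:
        return 3
    if 0xF0 <= lead <= 0xF4:
        return 4
    # 0x80-0xC1 are continuation bytes or overlong 2-byte leads.
    # 0xF5-0xFF would encode values above 10FFFF; this includes the
    # old RFC 2279 5-byte (0xF8-0xFB) and 6-byte (0xFC-0xFD) leads.
    return 0


def decode_one(data: bytes) -> int:
    """Decode the code point at the start of data (hypothetical helper).

    Raises ValueError for anything RFC 3629 does not allow.
    """
    if not data:
        raise ValueError("empty input")
    n = utf8_sequence_length(data[0])
    if n == 0:
        raise ValueError("invalid lead byte")
    if len(data) < n:
        raise ValueError("truncated sequence")
    if n == 1:
        return data[0]

    # Keep only the payload bits of the lead byte (5, 4 or 3 bits).
    cp = data[0] & (0xFF >> (n + 1))
    for byte in data[1:n]:
        if byte & 0xC0 != 0x80:
            raise ValueError("invalid continuation byte")
        cp = (cp << 6) | (byte & 0x3F)

    # Smallest code point that legitimately needs n bytes.
    minimum = (0, 0, 0x80, 0x800, 0x10000)[n]
    if cp < minimum:
        raise ValueError("overlong encoding")
    if 0xD800 <= cp <= 0xDFFF:
        raise ValueError("surrogate code point")
    if cp > 0x10FFFF:
        raise ValueError("code point above 10FFFF")
    return cp
```

With this sketch, b"\xf4\x8f\xbf\xbf" decodes to 10FFFF, while b"\xf4\x90\x80\x80" (the next value up) and any sequence starting with a 5- or 6-byte lead such as 0xF8 or 0xFC are rejected.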

The discussion about differences between UCS-4 and UTF-32 is only theoretical. UCS-4 (part of ISO 10646) defines a theoretical 31-bit code space but also defines that no characters above 10FFFF will be defined. So in practice you end up with exactly the same thing as UTF-32.


Joost
