Georg Baum wrote:
> So you say Markus Kuhn is wrong? That would be surprising to me, since he
> is considered to be a Unicode expert.
His information is outdated. RFC 2279 (the old UTF-8 specification) did
include support for a 31-bit code space. Because the Unicode code space
was later restricted, RFC 2279 was replaced by RFC 3629, which limits
UTF-8 to the range U+0000 to U+10FFFF. There will never be any characters
outside this range. RFC 2279 is obsolete.
So the _current_ definition of UTF-8 (RFC 3629) does _not_ allow 5- and
6-byte sequences. See http://www.faqs.org/rfcs/rfc3629.html
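
To make the restriction concrete, here is a minimal sketch of a decoder that
follows the RFC 3629 rules. It is not the utf8_to_ucs4 code from the patch;
the names utf8_sequence_length and decode_utf8 are made up for illustration,
and the function decodes a single code point only.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical helper (not the patch's code): length of the UTF-8 sequence
// introduced by `lead`, or 0 if RFC 3629 does not allow that lead byte.
inline std::size_t utf8_sequence_length(std::uint8_t lead)
{
    if (lead <= 0x7F) return 1;                 // 0xxxxxxx
    if (lead >= 0xC2 && lead <= 0xDF) return 2; // 110xxxxx (C0/C1 would be overlong)
    if (lead >= 0xE0 && lead <= 0xEF) return 3; // 1110xxxx
    if (lead >= 0xF0 && lead <= 0xF4) return 4; // 11110xxx, F5..F7 would exceed 10FFFF
    // Continuation bytes, C0/C1, and F5..FF. This includes F8..FD, the
    // lead bytes of the 5- and 6-byte forms that RFC 2279 allowed.
    return 0;
}

// Decodes one code point from the n bytes at s.
// Returns the number of bytes consumed, or 0 if the input is not valid.
inline std::size_t decode_utf8(const std::uint8_t* s, std::size_t n,
                               std::uint32_t& out)
{
    assert(s != nullptr);
    if (n == 0) return 0;

    const std::size_t len = utf8_sequence_length(s[0]);
    if (len == 0 || len > n) return 0;
    if (len == 1) { out = s[0]; return 1; }

    // Payload bits of the lead byte: 0x1F, 0x0F, 0x07 for len 2, 3, 4.
    std::uint32_t cp = s[0] & (0x7Fu >> len);
    for (std::size_t i = 1; i < len; ++i) {
        if ((s[i] & 0xC0) != 0x80) return 0;    // must be 10xxxxxx
        cp = (cp << 6) | (s[i] & 0x3Fu);
    }

    // Smallest code point that legitimately needs `len` bytes.
    static const std::uint32_t min_cp[] = {0, 0, 0x80, 0x800, 0x10000};
    if (cp < min_cp[len]) return 0;               // overlong encoding
    if (cp >= 0xD800 && cp <= 0xDFFF) return 0;   // surrogates are not characters
    if (cp > 0x10FFFF) return 0;                  // outside the Unicode code space

    out = cp;
    return len;
}
```

With this sketch, the bytes E2 82 AC decode to U+20AC, while F4 90 80 80
(which would be 110000) and anything starting with F8 or FC are rejected.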
The discussion about differences between UCS-4 and UTF-32 is purely
theoretical. UCS-4 (part of ISO 10646) defines a 31-bit code space, but
ISO 10646 also states that no characters above 10FFFF will ever be
assigned. So in practice you end up with exactly the same thing as UTF-32.
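
As a rough illustration of the UCS-4 versus UTF-32 point, the two range
checks could be written like this. Both function names are hypothetical, and
the surrogate exclusion in the UTF-32 check is my reading of the Unicode
rules rather than something stated above.

```cpp
#include <cstdint>

// Hypothetical check: does the value fit the original 31-bit UCS-4 space?
constexpr bool fits_ucs4_31bit(std::uint32_t c)
{
    return c <= 0x7FFFFFFFu;
}

// Hypothetical check: is the value a valid UTF-32 code unit, i.e. at most
// 10FFFF and not in the surrogate range D800..DFFF?
constexpr bool is_valid_utf32(std::uint32_t c)
{
    return c <= 0x10FFFFu && !(c >= 0xD800u && c <= 0xDFFFu);
}

// 110000 fits the theoretical 31-bit space but will never be a character.
static_assert(fits_ucs4_31bit(0x110000u), "inside the 31-bit UCS-4 space");
static_assert(!is_valid_utf32(0x110000u), "outside the Unicode code space");
```

Every value that can ever be assigned to a character passes both checks, so
code that stores characters as 32-bit integers can treat the two as the same.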
Joost