Georg Baum wrote:
So you say Markus Kuhn is wrong? That would surprise me, since he is considered to be a Unicode expert.

His information is outdated. RFC 2279 (the old UTF-8 specification) did include support for a 31-bit code space. Because the Unicode code space was later restricted, RFC 2279 was replaced by RFC 3629, which limits UTF-8 to the range 0000-10FFFF. No characters will ever be assigned outside this range. RFC 2279 is obsolete.

So the _current_ definition of UTF-8 (RFC 3629) does _not_ allow 5- and 6-byte sequences. See http://www.faqs.org/rfcs/rfc3629.html
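
To make this concrete, here is a small Python sketch. The helper names utf8_length and is_valid_utf8 are made up for this illustration, and it assumes Python 3's built-in "utf-8" codec, which decodes strictly. It shows that U+10FFFF fits in 4 bytes, and that both an RFC 2279 style 5-byte sequence and a 4-byte sequence for a value just above 10FFFF are rejected.

```python
def utf8_length(code_point: int) -> int:
    """Number of bytes UTF-8 (as restricted by RFC 3629) needs for a code point.

    Hypothetical helper for illustration only.
    """
    if not 0 <= code_point <= 0x10FFFF:
        raise ValueError("outside the Unicode code space 0000-10FFFF")
    if 0xD800 <= code_point <= 0xDFFF:
        raise ValueError("surrogate code points cannot be encoded in UTF-8")
    if code_point < 0x80:
        return 1
    if code_point < 0x800:
        return 2
    if code_point < 0x10000:
        return 3
    return 4  # everything up to 10FFFF fits in 4 bytes


def is_valid_utf8(data: bytes) -> bool:
    """True if Python's strict UTF-8 decoder accepts the bytes."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


# The highest code point, U+10FFFF, encodes as F4 8F BF BF.
top = chr(0x10FFFF).encode("utf-8")

# 0x200000 in the old RFC 2279 5-byte form (lead byte 111110xx).
old_five_byte = bytes([0xF8, 0x88, 0x80, 0x80, 0x80])

# 0x110000 in 4-byte form: well-formed bit pattern, but above 10FFFF.
beyond_range = bytes([0xF4, 0x90, 0x80, 0x80])

print(top.hex(), utf8_length(0x10FFFF))
print(is_valid_utf8(top))
print(is_valid_utf8(old_five_byte))
print(is_valid_utf8(beyond_range))
```

Other decoders may be more lenient than Python's, so treat this as a demonstration of the RFC 3629 rules rather than of what every implementation does.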

The discussion about differences between UCS-4 and UTF-32 is only theoretical. UCS-4 (part of ISO 10646) defines a 31-bit code space, but ISO 10646 also states that no characters will ever be assigned above 10FFFF. So in practice you end up with exactly the same thing as UTF-32.

Joost
