Dan Sugalski writes:
: Have they changed that again? Last I checked, UTF-8 was capped at 4 bytes, 
: but that's in the Unicode 3.0 standard.

Doesn't really matter where they install the artificial cap, because
for philosophical reasons Perl is gonna support larger values anyway.
It's just that 4 bytes of UTF-8 happens to be large enough to represent
anything UTF-16 can represent with surrogates.  So they refuse to
believe in anything longer than 4 bytes, even though the representation
can be extended much further.  (Perl 5 extends it all the way to 64-bit
values, represented in 13 bytes!)
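
To make the byte counts above concrete, here is a minimal Python sketch of the length arithmetic. It is not Perl's actual implementation. The name `utf8_length` is made up for illustration, the 5- to 7-byte cases assume the original pre-cap UTF-8 bit pattern simply keeps going, and the 13-byte case just follows the figure quoted above.

```python
# Largest value UTF-16 can reach with a surrogate pair:
# 0x10000 + (top high-surrogate offset << 10) + top low-surrogate offset.
MAX_UTF16 = 0x10000 + ((0xDBFF - 0xD800) << 10) + (0xDFFF - 0xDC00)


def utf8_length(value: int) -> int:
    """Bytes needed for `value` in a UTF-8-style encoding with no 4-byte cap.

    Hypothetical helper: a sketch of the arithmetic only.
    """
    if value < 0:
        raise ValueError("value must be non-negative")
    if value < 0x80:
        return 1  # plain ASCII, 7 payload bits
    # An n-byte sequence (2 <= n <= 7) carries 5n + 1 payload bits:
    # (7 - n) bits in the lead byte plus 6 bits per continuation byte.
    for n in range(2, 8):
        if value < (1 << (5 * n + 1)):
            return n
    # Assumption: anything wider, up to 64 bits, takes the 13-byte form
    # mentioned in the message above.
    if value < (1 << 64):
        return 13
    raise ValueError("value does not fit in 64 bits")
```

For example, `utf8_length(MAX_UTF16)` is 4, which is the whole reason the cap sits where it does, while `utf8_length(0x200000)` already needs a fifth byte.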

They also arbitrarily define UTF-32 to not use higher values than
0x10ffff, but that doesn't mean we're gonna send in the high-bit Nazis
if people want higher values for their own purposes.

But since the names UTF-8 and UTF-32 are becoming associated with those
arbitrary restrictions, it's getting even more important to refer to
Perl's looser style as utf8 (and, potentially, utf32).  I don't know
if Perl will have a utf16 that is distinguished from UTF-16.

Larry
