At 04:44 PM 6/5/2001 -0700, Larry Wall wrote:
>Dan Sugalski writes:
>: Have they changed that again? Last I checked, UTF-8 was capped at 4 bytes,
>: but that's in the Unicode 3.0 standard.
>
>Doesn't really matter where they install the artificial cap, because
>for philosophical reasons Perl is gonna support larger values anyway.
>It's just that 4 bytes of UTF-8 happens to be large enough to represent
>anything UTF-16 can represent with surrogates. So they refuse to
>believe in anything longer than 4 bytes, even though the representation
>can be extended much further. (Perl 5 extends it all the way to 64-bit
>values, represented in 13 bytes!)
I know we can, but is it really a good idea? 32 bits is really stretching
it for character encoding, and 64 seems rather excessive. Really
space-wasteful as well, if we maintain a character type with a fixed width
large enough to hold the largest decoded variable-width character. And I
really, *really* want to do as little as possible internally with
variable-width encodings. Yech.
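
To put rough numbers on the sizes being discussed, here is a minimal Python sketch. The 1 to 6 byte rows follow the original UTF-8 design. The 7-byte and 13-byte rows are an assumption about how Perl 5's extended utf8 is laid out, chosen to match the "64-bit values in 13 bytes" figure quoted above. The names utf8_length and fixed_width_bytes are hypothetical helpers, not part of any Perl or Python API.

```python
# Upper bound of each sequence length in the original (pre-cap) UTF-8 design.
CLASSIC_LIMITS = [
    (0x7F, 1),
    (0x7FF, 2),
    (0xFFFF, 3),
    (0x1FFFFF, 4),
    (0x3FFFFFF, 5),
    (0x7FFFFFFF, 6),
]


def utf8_length(code_point):
    """Return how many bytes a UTF-8-style encoding needs for code_point."""
    if code_point < 0 or code_point >= 2**64:
        raise ValueError("code point must fit in 64 bits")
    for limit, length in CLASSIC_LIMITS:
        if code_point <= limit:
            return length
    # Assumption: Perl-style extension, 7 bytes for values up to 36 bits,
    # then a 13-byte form for everything else up to 64 bits.
    if code_point <= 0xFFFFFFFFF:
        return 7
    return 13


# Largest value a UTF-16 surrogate pair can express:
# 0x10000 plus 10 bits from the high surrogate and 10 bits from the low one.
MAX_VIA_SURROGATES = 0x10000 + (0x3FF << 10) + 0x3FF


def fixed_width_bytes(text, width_bytes):
    """Storage needed if every character takes width_bytes bytes."""
    return len(text) * width_bytes
```

Under these assumptions, utf8_length(MAX_VIA_SURROGATES) is 4, which is why a 4-byte cap is enough for anything UTF-16 can reach. The space concern shows up in fixed_width_bytes("hello", 8), which is 40 bytes for text that takes 5 bytes as UTF-8.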
>They also arbitrarily define UTF-32 to not use higher values than
>0x10ffff, but that doesn't mean we're gonna send in the high-bit Nazis
>if people want higher values for their own purposes.
Well, that'd be inappropriate since a good chunk of the rest of the set's
been dedicated to future expansion. I think it might be a reasonable idea
for -w to grumble if someone's used a character in the unassigned range,
though. (IIRC there's a range set aside as private use, for folks to do
whatever they want with.)
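
A rough Python sketch of what such a warning check might look like, using the standard unicodedata module. The classify and grumble helpers are hypothetical names, and the result for a given code point depends on which Unicode version the Python build ships with, so treat this as an illustration rather than a proposal for how -w should work.

```python
import unicodedata

UNICODE_MAX = 0x10FFFF  # the cap the standard puts on code points


def classify(code_point):
    """Sort a code point into the buckets discussed in this thread."""
    if code_point > UNICODE_MAX:
        # Values past the cap: allowed by a looser utf8, unknown to Unicode.
        return "beyond Unicode"
    category = unicodedata.category(chr(code_point))
    if category == "Co":
        return "private use"
    if category == "Cs":
        return "surrogate"
    if category == "Cn":
        # Cn covers unassigned code points and noncharacters.
        return "unassigned"
    return "assigned"


def grumble(code_points):
    """Return one warning string per code point in the unassigned range."""
    return [
        "U+%04X is in the unassigned range" % cp
        for cp in code_points
        if classify(cp) == "unassigned"
    ]
```

Private-use characters pass silently here, and values above 0x10FFFF are classified but not warned about, which matches the "no high-bit police" position quoted above.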
>But since the names UTF-8 and UTF-32 are becoming associated with those
>arbitrary restrictions, it's getting even more important to refer to
>Perl's looser style as utf8 (and, potentially, utf32). I don't know
>if Perl will have a utf16 that is distinguished from UTF-16.
I'd as soon not do UTF-16 at all, or at least no more than we need to
convert to UTF-32 or UTF-8.
Dan
--------------------------------------"it's like this"-------------------
Dan Sugalski                          even samurai
[EMAIL PROTECTED]                     have teddy bears and even
                                      teddy bears get drunk