Paul Hoffman <[EMAIL PROTECTED]> writes:
> At 5:19 PM -0800 1/18/01, Gisle Aas wrote:
> > > How do I get this to put out UTF8, which is what I need?
> >
> >The ->utf8 method should do that.
>
> OK, I now see that you meant to do this for the output. However, this
> doesn't fix what I need, which is for length and substr to not go to
> surrogates. For that matter, hex goes to surrogates as well!

I agree this ought to be fixed. The easiest way is probably to change
Unicode::String so that it uses UTF-32 internally. Then length/substr
can still be as simple (and fast) as they are now.

> =====
> #!/usr/bin/perl -w
>
> use Unicode::String qw(utf8 utf16 uchr);
> Unicode::String->stringify_as('utf8');
>
> @TestVectors = ("0x0010", "0x0100", "0x1000", "0x10000", "0x100000");
>
> foreach $ThisVector (@TestVectors) {
> $SomeUTF8 = uchr(hex($ThisVector))->utf8;
> $TheLen = length($SomeUTF8);
> $TheHex = utf8($SomeUTF8)->hex;
> print "$ThisVector $TheLen $TheHex Ords: ";
> @TheOctets = split(//, $SomeUTF8);
> foreach $ThisOctet (@TheOctets) { print ord($ThisOctet), " " };
> print "\n";
> }
> =====
>
> 0x0010 1 U+0010 Ords: 16
> 0x0100 2 U+0100 Ords: 196 128
> 0x1000 3 U+1000 Ords: 225 128 128
> 0x10000 4 U+d800 U+dc00 Ords: 240 144 128 128
> 0x100000 4 U+dbc0 U+dc00 Ords: 244 128 128 128
>
> Clearly, $SomeUTF8 is in UTF8, as exhibited by the lengths and by the
> ords. But hex turns it into a surrogate before outputting the hex
> values.
>
> Is there any way that I can break a UTF8 string into individual
> characters without doing some kludge of using UTF16 characters and
> checking manually for half-surrogates?

You could try Perl's native UTF-8 support for that, for example
unpack("U*", ...).
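
For what it's worth, here is a minimal sketch of splitting UTF-8 octets
into whole characters without going through UTF-16. It is a different
route from unpack: it assumes a Perl that ships the core Encode module
(5.8 or later), and utf8_code_points is a hypothetical helper name, not
part of Unicode::String.

```perl
#!/usr/bin/perl -w
use strict;
use Encode qw(decode encode);

# Hypothetical helper: take a string of UTF-8 octets and return one
# code point per character, with no surrogate halves involved.
sub utf8_code_points {
    my ($octets) = @_;
    my $chars = decode('UTF-8', $octets);    # octets -> character string
    return map { ord($_) } split //, $chars; # one element per character
}

# Same test vectors as in the quoted script.
foreach my $cp (0x0010, 0x0100, 0x1000, 0x10000, 0x100000) {
    my $octets = encode('UTF-8', chr($cp));  # build the UTF-8 octets
    my @points = utf8_code_points($octets);
    printf "0x%04X octets=%d chars=%d %s\n",
        $cp, length($octets), scalar(@points),
        join(' ', map { sprintf 'U+%04X', $_ } @points);
}
```

Each test vector should come back as a single code point, so length and
substr on the decoded string count characters rather than octets or
surrogate halves.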
--Gisle