Re: Hassles with Unicode::String

Gisle Aas Thu, 18 Jan 2001 20:16:01 -0800
Paul Hoffman <[EMAIL PROTECTED]> writes:

> At 5:19 PM -0800 1/18/01, Gisle Aas wrote:
> >  > How do I get this to put out UTF8, which is what I need?
> >
> >The ->utf8 method should do that.
> 
> OK, I now see that you meant to do this for the output. However, this
> doesn't fix what I need, which is for length and substr to not go to
> surrogates. For that matter, hex goes to surrogates as well!

I agree this ought to be fixed.  The easiest way is probably to change
Unicode::String so that it uses UTF32 internally.  Then length/substr
can still be as simple (and fast) as they are now.
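
To illustrate why a fixed-width representation keeps length/substr
trivial, here is a rough sketch. The helper names (to_utf32, u32_length,
u32_substr) are made up for illustration and are not Unicode::String's
actual internals; it only assumes that each code point is stored as one
32-bit big-endian unit.

```perl
use strict;
use warnings;

# Hypothetical helpers, not Unicode::String's real internals:
# store each code point as one 32-bit big-endian unit.
sub to_utf32 { return pack("N*", @_) }

# Length in characters is just the byte length divided by four.
sub u32_length { return length($_[0]) / 4 }

# substr on characters maps to substr on bytes, scaled by four.
sub u32_substr {
    my ($buf, $offset, $count) = @_;
    return substr($buf, 4 * $offset, 4 * $count);
}

my $buf = to_utf32(0x10, 0x10000, 0x100000);
print u32_length($buf), "\n";                        # 3 characters
printf "U+%04X\n", unpack("N*", u32_substr($buf, 1, 1));
```

With UTF16, the same arithmetic breaks as soon as a character above
U+FFFF needs two 16-bit units, which is the surrogate problem described
in the quoted message.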

> =====
> #!/usr/bin/perl -w
> 
> use Unicode::String qw(utf8 utf16 uchr);
> Unicode::String->stringify_as('utf8');
> 
> @TestVectors = ("0x0010", "0x0100", "0x1000", "0x10000", "0x100000");
> 
> foreach $ThisVector (@TestVectors) {
>      $SomeUTF8 = uchr(hex($ThisVector))->utf8;
>      $TheLen = length($SomeUTF8);
>      $TheHex = utf8($SomeUTF8)->hex;
>      print "$ThisVector   $TheLen   $TheHex   Ords: ";
>      @TheOctets = split(//, $SomeUTF8);
>      foreach $ThisOctet (@TheOctets) { print ord($ThisOctet), " " };
>      print "\n";
> }
> =====
> 
> 0x0010   1   U+0010   Ords: 16
> 0x0100   2   U+0100   Ords: 196 128
> 0x1000   3   U+1000   Ords: 225 128 128
> 0x10000   4   U+d800 U+dc00   Ords: 240 144 128 128
> 0x100000   4   U+dbc0 U+dc00   Ords: 244 128 128 128
> 
> Clearly, $SomeUTF8 is in UTF8, as exhibited by the lengths and by the
> ords. But hex turns it into a surrogate before outputting the hex
> values.
> 
> Is there any way that I can break a UTF8 string into individual
> characters without doing some kludge of using UTF16 characters and
> checking manually for half-surrogates?

You could try using perl's native UTF8 support for that, e.g.
unpack("U*", ...), which returns one code point per character.
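
A minimal sketch of that approach, under some assumptions: it uses the
core Encode module (perl 5.8 or later, so newer than this thread) to
turn the UTF8 octets into perl characters first, and code_points is a
made-up helper name, not something Unicode::String provides.

```perl
#!/usr/bin/perl -w
use strict;
use Encode qw(decode);

# Hypothetical helper (not part of Unicode::String): take a string of
# UTF-8 octets and return one code point per character.
sub code_points {
    my ($octets) = @_;
    my $chars = decode('UTF-8', $octets);   # octets -> perl characters
    return unpack("U*", $chars);            # characters -> code points
}

# U+100000 as UTF-8 octets (244 128 128 128), as in the last test vector
my $octets = "\xF4\x80\x80\x80";
my @cps = code_points($octets);
printf "U+%04X\n", $_ for @cps;
```

Each element of @cps is a whole code point, so nothing above U+FFFF is
split into a surrogate pair and no manual half-surrogate checking is
needed.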

--Gisle