Paul Hoffman <[EMAIL PROTECTED]> writes:

> Using Unicode-String-2.06, I have the following test program:
> 
> =====
> 
> #!/usr/bin/perl -w
> 
> use Unicode::String qw(utf8 utf16 uchr);
> Unicode::String->stringify_as('utf8');
> 
> @TestArr = ("0061 0062", "0063 12345");
> 
> foreach $TheString (@TestArr) {
>      @AllHexIn = split(/\s+/, $TheString);
>      $OutString = '';
>      foreach $PartString (@AllHexIn)
>          { $OutString .= utf8(uchr(hex("0x$PartString"))); }
> 
>      $TheLen = utf8($OutString)->length;
> 
>      $HexOfInput = '';
>      foreach($i=0; $i<utf8($OutString)->length; $i++) {
>          $HexOfInput .= utf8($OutString)->substr($i, 1)->hex . ' | ';
>      }
>      print "$TheString  $TheLen    $HexOfInput\n";
> }
> 
> =====
> 
> The output is:
> 
> 0061 0062  2    U+0061 | U+0062 |
> 0063 12345  3    U+0063 | U+d808 | U+df45 |
> 
> Why is uchr putting out UTF16 instead of UTF8 for the non-BMP character?

Unicode::String simply stores its data as UTF-16 internally.  The
->length and ->substr methods operate directly on the UTF-16
representation without looking for surrogates, so the surrogate pair
for U+12345 is counted and sliced as two separate units (U+d808 and
U+df45).  This is actually the wrong thing to do.  If these were fixed
to know about surrogates then I think this example would work as you
expected.  The ->hex method should probably also be made surrogate
aware.
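
As a rough sketch of what "surrogate aware" would mean here, the
following plain Perl splits code points into UTF-16 code units and
then rejoins surrogate pairs while counting.  It is not the code
inside Unicode::String; the helper names to_utf16_units and
utf16_units_to_code_points are made up for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Split a code point into UTF-16 code units:
# one unit inside the BMP, a high/low surrogate pair above it.
sub to_utf16_units {
    my ($cp) = @_;
    return ($cp) if $cp < 0x10000;
    my $v = $cp - 0x10000;
    return (0xD800 + ($v >> 10), 0xDC00 + ($v & 0x3FF));
}

# Walk UTF-16 code units and rejoin each high/low surrogate pair
# into a single code point (hypothetical helper, for illustration).
sub utf16_units_to_code_points {
    my @units = @_;
    my @cps;
    while (@units) {
        my $u = shift @units;
        if ($u >= 0xD800 && $u <= 0xDBFF
            && @units && $units[0] >= 0xDC00 && $units[0] <= 0xDFFF) {
            my $low = shift @units;
            $u = 0x10000 + (($u - 0xD800) << 10) + ($low - 0xDC00);
        }
        push @cps, $u;
    }
    return @cps;
}

# Same input as the second test string: "0063 12345".
my @units = map { to_utf16_units($_) } (0x0063, 0x12345);
my @cps   = utf16_units_to_code_points(@units);

printf "units:       %d  %s\n", scalar @units,
    join(' | ', map { sprintf 'U+%04x', $_ } @units);
printf "code points: %d  %s\n", scalar @cps,
    join(' | ', map { sprintf 'U+%04x', $_ } @cps);
```

Counting raw units gives 3 (U+0063 | U+d808 | U+df45), which is what
the test program printed; counting after rejoining the pair gives 2
(U+0063 | U+12345).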

> Even if uchr is putting out UTF16, why isn't the utf8() call coercing
> the value from UTF16 to UTF8?

utf8() actually converts in the other direction: it takes a UTF-8
encoded string and turns it into the internal UTF-16 form.  uchr()
converts a numeric code point value to UTF-16.

> How do I get this to put out UTF8, which is what I need?

The ->utf8 method should do that.
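
To check what the UTF-8 result ought to look like, independently of
Unicode::String, here is a small sketch that uses the core Encode
module instead (an assumption on my part that it is available; it is
a different tool from the ->utf8 method).  It takes the two UTF-16
units from the output above and converts them to UTF-8.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode qw(decode encode);

# The two UTF-16 code units from the test output, packed as
# big-endian bytes.
my $utf16be = pack('n*', 0xD808, 0xDF45);

# Decoding pairs the surrogates into the single character U+12345 ...
my $chars = decode('UTF-16BE', $utf16be);

# ... and encoding that character as UTF-8 yields one four-byte
# sequence.
my $utf8_bytes = encode('UTF-8', $chars);

printf "characters:  %d\n", length $chars;
printf "UTF-8 bytes: %s\n",
    join(' ', map { sprintf '%02x', $_ } unpack('C*', $utf8_bytes));
```

This prints 1 character and the bytes f0 92 8d 85, so a correct UTF-8
encoding of the pair is one four-byte sequence, not two separately
encoded surrogates.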

--Gisle
