Paul Hoffman <[EMAIL PROTECTED]> writes:
> Using Unicode-String-2.06, I have the following test program:
>
> =====
>
> #!/usr/bin/perl -w
>
> use Unicode::String qw(utf8 utf16 uchr);
> Unicode::String->stringify_as('utf8');
>
> @TestArr = ("0061 0062", "0063 12345");
>
> foreach $TheString (@TestArr) {
> @AllHexIn = split(/\s+/, $TheString);
> $OutString = '';
> foreach $PartString (@AllHexIn)
> { $OutString .= utf8(uchr(hex("0x$PartString"))); }
>
> $TheLen = utf8($OutString)->length;
>
> $HexOfInput = '';
> foreach($i=0; $i<utf8($OutString)->length; $i++) {
> $HexOfInput .= utf8($OutString)->substr($i, 1)->hex . ' | ';
> }
> print "$TheString $TheLen $HexOfInput\n";
> }
>
> =====
>
> The output is:
>
> 0061 0062 2 U+0061 | U+0062 |
> 0063 12345 3 U+0063 | U+d808 | U+df45 |
>
> Why is uchr putting out UTF16 instead of UTF8 for the non-BMP character?
Unicode::String simply stores UTF16 internally. The ->length and
->substr methods both operate directly on the UTF16 representation
without looking for surrogates, so a non-BMP character is counted as
two. This is actually the wrong thing to do. If these were fixed to
know about surrogates then I think this example would work as you
expected. The ->hex function should probably also be made surrogate
aware.
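To illustrate why U+12345 shows up as U+d808 | U+df45 and counts as two, here is a minimal sketch in plain Perl (it does not use Unicode::String at all). The helper names to_utf16_units and to_code_points are made up for this example; the second one shows roughly what "surrogate aware" counting would have to do.

```perl
#!/usr/bin/perl -w
use strict;

# Split a code point into UTF16 code units: one unit inside the BMP,
# a high/low surrogate pair above it.
sub to_utf16_units {
    my ($cp) = @_;
    return ($cp) if $cp < 0x10000;
    my $v = $cp - 0x10000;                  # 20 bits remain
    return (0xD800 + ($v >> 10),            # high surrogate: top 10 bits
            0xDC00 + ($v & 0x3FF));         # low surrogate: bottom 10 bits
}

# Recombine UTF16 code units into code points, pairing up surrogates.
sub to_code_points {
    my @units = @_;
    my @cps;
    while (@units) {
        my $u = shift @units;
        if ($u >= 0xD800 && $u <= 0xDBFF && @units
            && $units[0] >= 0xDC00 && $units[0] <= 0xDFFF) {
            my $lo = shift @units;
            $u = 0x10000 + (($u - 0xD800) << 10) + ($lo - 0xDC00);
        }
        push @cps, $u;
    }
    return @cps;
}

my @units = map { to_utf16_units($_) } (0x0063, 0x12345);
my @cps   = to_code_points(@units);

# Counting code units gives 3; counting code points gives 2.
printf "%d code units: %s\n", scalar(@units),
    join(' ', map { sprintf 'U+%04x', $_ } @units);
printf "%d characters: %s\n", scalar(@cps),
    join(' ', map { sprintf 'U+%04x', $_ } @cps);
```

The code-unit line matches the "3 U+0063 | U+d808 | U+df45" in your output; the character line is what a surrogate-aware ->length and ->substr would report.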
> Even if uchr is putting out UTF16, why isn't the utf8() call coercing
> the value from UTF16 to UTF8?
The utf8() function goes the other way: it takes UTF8 input and
converts it to the internal UTF16. uchr() converts a numeric value
to UTF16.
> How do I get this to put out UTF8, which is what I need?
The ->utf8 method should do that.
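A minimal sketch of that, assuming the ->utf8 method combines surrogate pairs when it encodes (I have not checked every release). The variable names $OutBytes and $HexOfOutput are only for illustration, and the utf8::encode step is a precaution in case the installed version hands back a character string rather than raw bytes. The correct UTF8 encoding of U+12345 is F0 92 8D 85, so that is what to look for in the output.

```perl
#!/usr/bin/perl -w
use strict;
use Unicode::String qw(uchr);

# Build each character with uchr() and ask the object for its UTF8 form
# with the ->utf8 method, instead of passing it through the utf8() function.
my $OutBytes = '';
foreach my $PartString (split /\s+/, "0063 12345") {
    $OutBytes .= uchr(hex($PartString))->utf8;
}

# Assumption: depending on the module version the result may be flagged as
# a character string; normalise it to raw bytes before dumping it.
utf8::encode($OutBytes) if utf8::is_utf8($OutBytes);

# Show the resulting bytes in hex.
my $HexOfOutput = join ' ', map { sprintf '%02x', $_ } unpack('C*', $OutBytes);
print "$HexOfOutput\n";
```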
--Gisle