Re: Variation In Decoding Between Encode and XML::LibXML

David E. Wheeler Thu, 17 Jun 2010 10:18:02 -0700

On Jun 16, 2010, at 6:03 PM, Marvin Humphrey wrote:

> On Wed, Jun 16, 2010 at 05:34:44PM -0700, David E. Wheeler wrote:
> 
>> So the UTF8 flag is enabled, and yet it has "\303\204\302\215" in it. What 
>> is that crap?
> 
> That's octal notation, which I think Dump() uses for any byte greater than 127
> and for control characters, so that it can output pure ASCII.


Okay.

> That sequence is only four bytes: 
> 
>  mar...@smokey:~ $ perl -MEncode -MDevel::Peek -e '$s = "\303\204\302\215"; Encode::_utf8_on($s); Dump $s'
>  SV = PV(0x801038) at 0x80e880
>    REFCNT = 1
>    FLAGS = (POK,pPOK,UTF8)
>    PV = 0x2012f0 "\303\204\302\215"\0 [UTF8 "\x{c4}\x{8d}"]
>    CUR = 4   <----------------------------------------------- four bytes
>    LEN = 8
>  mar...@smokey:~ $ 
> 
> The logical content of the string follows in the second quote:
> 
>> [UTF8 "<p>Tomas Laurinavi\x{c4}\x{8d}ius</p>"]
> 
> That's valid UTF-8.

In what sense? Legally perhaps, but I can make XML::LibXML choke on it.

>> my $str = '<p>Tomas Laurinavi????ius</p>';
> 
> In source code, I try to stick to pure ASCII and use \x escapes -- like Dump()
> does.
> 
>  my $str = "<p>Tomas Laurinavi\x{c4}\x{8d}ius</p>"

Okay, that makes it easier to test things (I've been pulling stuff out of the 
broken feed I downloaded).

> However, because those code points are both representable as Latin-1, Perl
> will create a Latin-1 string.  If you want to force its internal encoding to
> UTF-8, you need to do additional work.
> 
>  mar...@smokey:~ $ perl -MDevel::Peek -e '$s = "\x{c4}"; Dump $s; utf8::upgrade($s); Dump $s'
>  SV = PV(0x801038) at 0x80e870
>    REFCNT = 1
>    FLAGS = (POK,pPOK)
>    PV = 0x2012e0 "\304"\0
>    CUR = 1
>    LEN = 4
>  SV = PV(0x801038) at 0x80e870
>    REFCNT = 1
>    FLAGS = (POK,pPOK,UTF8)
>    PV = 0x2008f0 "\303\204"\0 [UTF8 "\x{c4}"]
>    CUR = 2
>    LEN = 3
>  mar...@smokey:~ $ 
> 
>> Confused and frustrated,
> 
> IMO, to get UTF-8 right consistently in a large Perl system, you need to
> understand the internals and you need Devel::Peek at hand.  Perl tries to hide
> the details, but there are too many ways for it to fail silently.  ("perl -C",
> $YAML::Syck::ImplicitUnicode, etc.)

Bleh. Such a PITA. I'd like not to have to think about this stuff, but I must 
because other people haven't.

So here's my test:

    use 5.12.0;
    use Devel::Peek;

    my $str = "<p>Laurinavi\x{c3}\x{84}\x{c2}\x{8d}ius</p>";
    say $str;
    utf8::upgrade($str);
    binmode STDOUT, ':utf8';
    say $str;
    Dump $str;

The output is still broken, however, in both cases, looking like this:

    LaurinaviÄius
    LaurinaviÃÂius
    SV = PV(0x100801c78) at 0x10082ac40
      REFCNT = 1
      FLAGS = (PADMY,POK,pPOK,UTF8)
      PV = 0x100202170 "Laurinavi\303\203\302\204\303\202\302\215ius"\0 [UTF8 "Laurinavi\x{c3}\x{84}\x{c2}\x{8d}ius"]
      CUR = 20
      LEN = 32

So it may be valid UTF-8, but why does it come out looking like crap? That is, 
"LaurinaviÃÂius"? I suppose there's an argument that "LaurinaviÄius" is 
correct and valid, if ugly. Maybe?
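
One possible reading, sketched here rather than asserted: the bytes C4 8D are the UTF-8 encoding of U+010D (c with caron), so the feed text may have been UTF-8 encoded more than once. Under that assumption, the four characters in the test string above would need two decoding passes to get back to a single character. The variable names below are only for illustration, and the markup is left out to keep it short.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Same four suspect characters as in the test above, without the <p> tags.
my $str = "Laurinavi\x{c3}\x{84}\x{c2}\x{8d}ius";

# Dump()'s octal escapes and \x escapes name the same bytes.
die "octal and hex forms differ" unless "\303\204\302\215" eq "\xc3\x84\xc2\x8d";

# First pass: treat the characters as bytes and decode them as UTF-8.
# This yields the two characters U+00C4 and U+008D.
my $once = decode('UTF-8', $str);

# Second pass: ASSUMPTION that the text was encoded twice. Map the
# characters back to single bytes via ISO-8859-1, then decode again.
my $twice = decode('UTF-8', encode('ISO-8859-1', $once));

# Print code points instead of raw characters so the result is pure ASCII.
printf "%s\n", join ' ', map { sprintf 'U+%04X', ord } split //, $twice;
```

If the assumption holds, the middle of the name collapses to U+010D after the second pass. If the feed was only encoded once, the second pass is the wrong thing to do.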

Thanks,

David
