Variation In Decoding Between Encode and XML::LibXML

David E. Wheeler Tue, 15 Jun 2010 22:55:53 -0700

Fellow Perlers,

I'm parsing a lot of XML these days, and came upon a Yahoo! Pipes feed that 
appears to mangle an originating Flickr feed. But the curious thing is, when I 
pull the offending string out of the RSS and just stick it in a script, Encode 
knows how to decode it properly, while XML::LibXML (and my Unicode-aware 
editors) cannot.


The attached script demonstrates. $str contains the bogus-looking character. 
Encode, however, seems to properly convert it to the "č" in "Laurinavičius" in 
the output. XML::LibXML, OTOH, outputs it as "LaurinaviÄius" -- that is, 
broken. (If things look truly borked in this email too, please look at the 
attached script.)

So my question is, what gives? Is this truly a broken representation of the 
character and Encode just figures that out and fixes it? Or is there something 
off with my editor and with XML::LibXML?

FWIW, the character looks correct in my editor when I load it from the original 
Flickr feed. It's only after processing by Yahoo! Pipes that it comes out 
looking mangled.

Any insights would be appreciated.

Best,

David

#!/usr/local/bin/perl -w

use strict;
use Encode;
use XML::LibXML;

my $parser = XML::LibXML->new({
    no_network => 1,
    encoding   => 'utf-8',
});

my $str = '<p>Tomas LaurinaviÃÂius</p>';
print $str, $/;

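# With CHECK set to 1 (Encode::FB_CROAK), decode() dies on malformed input
# and may modify its argument in place, so decode a copy.
my $copy = $str;
my $utf8 = decode('utf-8', $copy, 1);
print $utf8, $/;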

my $doc = $parser->parse_html_string($str, encoding => 'utf-8');
print $doc->documentElement->toString, $/;
