Fellow Perlers,
I'm parsing a lot of XML these days, and came upon a Yahoo! Pipes feed that
appears to mangle an originating Flickr feed. But the curious thing is, when I
pull the offending string out of the RSS and just stick it in a script, Encode
knows how to decode it properly, while XML::LibXML (and my Unicode-aware
editors) cannot.
The attached script demonstrates the problem: $str contains the bogus-looking character.
Encode, however, seems to properly convert it to the "č" in "Laurinavičius" in
the output. XML::LibXML, OTOH, outputs it as "LaurinaviÄius" -- that is,
broken. (If things look truly borked in this email too, please look at the
attached script.)
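For what it's worth, one way to see what a string really contains, independent of any editor or mail client, is to dump its code points under each candidate encoding. The following is only a minimal sketch: codepoints() is a hypothetical helper, and the two sample bytes (0xC4 0x8D, the UTF-8 encoding of "č", U+010D) are an assumption for illustration, not bytes copied from the feed.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: decode $bytes as $encoding and list the
# resulting characters as U+XXXX code points.
sub codepoints {
    my ($encoding, $bytes) = @_;
    my $chars = decode($encoding, $bytes);
    return join ' ', map { sprintf 'U+%04X', ord($_) } split //, $chars;
}

# Assumed sample: the two bytes that encode U+010D in UTF-8.
my $bytes = "\xC4\x8D";

# Read as UTF-8, the two bytes form a single character.
print 'UTF-8:   ', codepoints('UTF-8', $bytes), "\n";

# Read as Latin-1, each byte becomes its own character
# (U+00C4 is "Ä"; U+008D is an invisible control character).
print 'Latin-1: ', codepoints('ISO-8859-1', $bytes), "\n";
```

If the same two bytes show up as one character under UTF-8 and as "Ä" plus an invisible character under Latin-1, that would at least suggest the bytes are fine and something downstream is reading them with the wrong encoding.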
So my question is, what gives? Is this truly a broken representation of the
character and Encode just figures that out and fixes it? Or is there something
off with my editor and with XML::LibXML?
FWIW, the character looks correct in my editor when I load it from the original
Flickr feed. It's only after processing by Yahoo! Pipes that it comes out
looking mangled.
Any insights would be appreciated.
Best,
David
#!/usr/local/bin/perl -w
use strict;
use Encode;
use XML::LibXML;

my $parser = XML::LibXML->new({
    no_network => 1,
    encoding   => 'utf-8',
});
my $str = '<p>Tomas LaurinaviÃÂius</p>';
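# Print the string exactly as stored in the script.
print $str, $/;

# Decode a copy with Encode. CHECK = 1 makes decode() die on malformed
# input and may modify its source argument, hence the copy.
my $copy = $str;
my $utf8 = decode('utf-8', $copy, 1);
print $utf8, $/;

# Parse the same string as HTML with XML::LibXML and serialize the result.
my $doc = $parser->parse_html_string($str, encoding => 'utf-8');
print $doc->documentElement->toString, $/;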