Hello (loved your PostgreSQL presentation at the most recent OSCON, BTW)

Which editor do you use? When I load the script in Komodo IDE 5.2, the string looks broken. When I run the script (ActivePerl 5.10.1 on Windows), only the second line is correct; the first (no surprise) and third are broken.
Loading the file in UltraEdit-32 13.20+3, set to not convert the script on loading, it becomes obvious that what should have been one character is represented by 4 bytes, \xC3 \x84 \xC2 \x8D, which modern editors would probably show as 2 characters and as broken. It looks to me like the string is being displayed as a byte representation of the characters, if that makes sense.

My English isn't perfect :-/ and what I am trying to say is that this is a problem that I am quite familiar with. It happens whenever the source and the reader do not agree on whether a string is encoded in UTF-8 or not.

Apparently Encode fixes the incorrect string, which is nice. The interesting thing is, where should this be fixed? If it's at Yahoo! Pipes, you'll probably have to use Encode as a work-around for some time...

Best regards
Henning Michael Møller Just

-----Original Message-----
From: David E. Wheeler [mailto:da...@kineticode.com]
Sent: Wednesday, June 16, 2010 7:56 AM
To: perl-unicode@perl.org
Subject: Variation In Decoding Between Encode and XML::LibXML

Fellow Perlers,

I'm parsing a lot of XML these days, and came upon a Yahoo! Pipes feed that appears to mangle an originating Flickr feed. But the curious thing is, when I pull the offending string out of the RSS and just stick it in a script, Encode knows how to decode it properly, while XML::LibXML (and my Unicode-aware editors) cannot.

The attached script demonstrates. $str has the bogus-looking character. Encode, however, seems to properly convert it to the "č" in "Laurinavičius" in the output. XML::LibXML, OTOH, outputs it as "LaurinaviÄius" -- that is, broken. (If things look truly borked in this email too, please look at the attached script.)

So my question is, what gives? Is this truly a broken representation of the character and Encode just figures that out and fixes it? Or is there something off with my editor and with XML::LibXML?

FWIW, the character looks correct in my editor when I load it from the original Flickr feed. It's only after processing by Yahoo! Pipes that it comes out looking mangled.

Any insights would be appreciated.

Best,

David
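
For illustration only: the four bytes reported in the reply (\xC3 \x84 \xC2 \x8D) are what one would get if the UTF-8 bytes of "č" were misread as Latin-1 and then encoded as UTF-8 a second time. Whether Yahoo! Pipes actually does this is an assumption, not something the thread confirms. The sketch below is in Python rather than Perl (the attached script is not reproduced here), and the helper names `double_encode` and `repair` are hypothetical.

```python
def double_encode(text: str) -> bytes:
    """Simulate the suspected mangling: UTF-8 bytes misread as Latin-1,
    then encoded as UTF-8 again. (Assumed failure mode, not confirmed.)"""
    utf8_bytes = text.encode("utf-8")        # correct UTF-8 bytes
    misread = utf8_bytes.decode("latin-1")   # each byte becomes one character
    return misread.encode("utf-8")           # second, unwanted UTF-8 pass


def repair(data: bytes) -> str:
    """Undo one round of double encoding by reversing the steps above."""
    once_decoded = data.decode("utf-8")      # yields the two misread characters
    original_bytes = once_decoded.encode("latin-1")
    return original_bytes.decode("utf-8")


mangled = double_encode("č")                 # U+010D, as in "Laurinavičius"
print(mangled)                               # b'\xc3\x84\xc2\x8d'
print(repair(mangled) == "č")                # True
```

If this is indeed what happens upstream, a reader that decodes the bytes as UTF-8 only once will see two characters ("Ä" plus an invisible control character) instead of "č", which matches the "LaurinaviÄius" output described above.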