On Jun 17, 2010, at 12:30 PM, Henning Michael Møller Just wrote:
>> So it may be valid UTF-8, but why does it come out looking like crap? That
>> is, "LaurinaviÃ≥Ÿius"? I suppose there's an argument that
>> "LaurinaviÄŸius" is correct and valid, if ugly. Maybe?
>
> I am unsure if this is the explanation you are looking for but here goes:
>
> I think the original data contained the character \x{010d}. In utf-8, that
> means that it should be represented as the bytes \x{c4} and \x{8d}. If those
> bytes are not marked as in fact being a two-byte utf-8 encoding of a single
> character, or if an application reading the data mistakenly thinks it is not
> encoded (both common errors), somewhere along the transmission an application
> may decide that it needs to re-encode the characters in utf-8.
>
> So the original character \x{010d} is represented by the bytes \x{c4} and
> \x{8d}, an application thinks those are in fact characters and encodes them
> again as \x{c3} + \x{84} and \x{c2} + \x{8d}, respectively. Which I believe
> is your broken data.
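
The double encoding described above can be reproduced in a few lines. This is only a minimal sketch, written in Python for illustration rather than Perl. The sample name and the assumption that the intermediate application misread the bytes as Latin-1 are mine; the actual Yahoo pipeline may have done something different.

```python
# Sketch of the double-encoding scenario: U+010D is encoded as UTF-8,
# the bytes are mistaken for individual characters, and those
# characters are encoded as UTF-8 a second time.
original = "Laurinavi\u010dius"  # assumed sample containing U+010D

# Correct UTF-8: U+010D becomes the two bytes C4 8D.
utf8_bytes = original.encode("utf-8")

# Assumption: an application treats each byte as a Latin-1 character,
# so C4 8D turns into the characters U+00C4 and U+008D.
misread = utf8_bytes.decode("latin-1")

# Encoding again gives C3 84 for U+00C4 and C2 8D for U+008D.
double_encoded = misread.encode("utf-8")

# If that is what happened, reversing the steps recovers the text.
repaired = double_encoded.decode("utf-8").encode("latin-1").decode("utf-8")

print(utf8_bytes.hex(" "))
print(double_encoded.hex(" "))
print(repaired == original)
```

The result is still valid UTF-8, which is why a parser accepts it; it just decodes to "Ä" followed by an invisible control character instead of the intended letter.
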
I see. That makes sense. FYI, the original source is at:
http://pipes.yahoo.com/pipes/pipe.run?Size=Medium&_id=f53b7bed8b88412fab9715a995629722&_render=rss&max=50&nsid=1025993%40N22
Look for "Tomas" in the output. If it doesn't show up, change max=50 to max=75
or something.
> I think the error comes from Perl's handling of utf-8 data, which has changed
> in subtle ways ever since Perl 5.6. We have supported utf-8 in our
> applications since Perl 5.6 and have run into this repeatedly. Every major
> upgrade of Perl, and indeed the much-needed upgrade of DBD::ODBC that Martin
> Evans provided, has given us a lot of work sorting out these troubles.
Maintaining the backwards compatibility from the pre-utf8 days must make it far
more difficult than it otherwise would be.
> I wonder if your code would work fine in Perl 5.8? We are "only" at 5.10(.1)
> but the upgrade from 5.8 to 5.10 also gave us some utf-8 trouble. If it works
> fine in Perl 5.8 maybe the error is in an assumption somewhere in XML::LibXML?
In my application, I finally got XML::LibXML to choke on the invalid
characters, and then found that the problem was that I was running
Encode::ZapCP1252::zap_cp1252 against the string before passing it to XML::LibXML.
Once I removed that, it stopped choking. So clearly zap_cp1252 was changing
bytes it should not have. I now have it running fix_cp1252 *after* the parsing,
when everything is already UTF-8. Now that I think about it, though, I should
probably change it so that it searches on characters instead of bytes when
working on a utf8 string. Will have to look into that.
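
To show why searching on bytes is risky once the data is UTF-8, here is a rough sketch, again in Python for illustration. It is not how zap_cp1252 or fix_cp1252 are actually implemented; the three-entry table and the function names fix_bytes and fix_chars are hypothetical.

```python
# Hypothetical, abbreviated table of CP1252 "gremlins" and their
# Unicode replacements (curly quotes only).
CP1252_FIXES = {0x92: "\u2019", 0x93: "\u201c", 0x94: "\u201d"}

def fix_bytes(data: bytes) -> bytes:
    """Replace gremlins byte by byte, ignoring UTF-8 structure."""
    out = bytearray()
    for b in data:
        if b in CP1252_FIXES:
            out += CP1252_FIXES[b].encode("utf-8")
        else:
            out.append(b)
    return bytes(out)

def fix_chars(text: str) -> str:
    """Replace gremlins on decoded characters (U+0092..U+0094) only."""
    return text.translate(CP1252_FIXES)

# U+0113 is C4 93 in UTF-8; its second byte looks like a CP1252 quote.
sample = "\u0113"
damaged = fix_bytes(sample.encode("utf-8"))

try:
    damaged.decode("utf-8")
    still_valid = True
except UnicodeDecodeError:
    still_valid = False

print(still_valid)                  # byte-wise fix broke the sequence
print(fix_chars(sample) == sample)  # character-wise fix leaves it alone
```

The byte-wise version rewrites a continuation byte in the middle of a legitimate two-byte character, which is the kind of damage that would make a parser choke. The character-wise version only touches code points that really are stray CP1252 values.
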
In the meantime, I'll just accept that sometimes the characters are valid UTF-8
and look like shit. Frankly, when I run the above feed through NetNewsWire, the
offending byte sequence displays as "Ä", just as it does in my app's output. So
I blame Yahoo.
Thanks for the detailed explanation, Henning, much appreciated.
Best,
David