> So it may be valid UTF-8, but why does it come out looking like crap? That
> is, "LaurinaviÃÂius"? I suppose there's an argument that "LaurinaviÄius"
> is correct and valid, if ugly. Maybe?

I am not sure this is the explanation you are looking for, but here goes:

I think the original data contained the character \x{010d}. In UTF-8 that character is represented by the two bytes \x{c4} and \x{8d}. If those bytes are not marked as being a two-byte UTF-8 encoding of a single character, or if an application reading the data mistakenly thinks it is not encoded (both common errors), then somewhere along the transmission an application may decide that it needs to re-encode the characters in UTF-8. So the original character \x{010d} is represented by the bytes \x{c4} and \x{8d}; an application treats those bytes as two separate characters and encodes them again as \x{c3} + \x{84} and \x{c2} + \x{8d}, respectively. That, I believe, is your broken data.

I think the error comes from Perl's handling of UTF-8 data, which has changed in subtle ways ever since Perl 5.6. We have supported UTF-8 in our applications since Perl 5.6 and have run into this repeatedly. Every major upgrade of Perl, and indeed the much-needed upgrade of DBD::ODBC that Martin Evans provided, has given us a lot of work sorting out these troubles.

I wonder if your code would work fine in Perl 5.8? We are "only" at 5.10(.1), but the upgrade from 5.8 to 5.10 also gave us some UTF-8 trouble. If it works fine in Perl 5.8, maybe the error is in an assumption somewhere in XML::LibXML?

Best regards
Henning Michael Møller Just
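
P.S. To make the byte arithmetic above concrete, here is a minimal sketch of the double encoding. It is written in Python rather than Perl purely to show the bytes involved; it does not claim to reproduce what Perl, DBD::ODBC or XML::LibXML do internally. The names `double_encode` and `repair` are hypothetical helpers invented for this illustration, and the "misreading" step assumes the bytes were interpreted as Latin-1 characters.

```python
# Sketch only: shows how U+010D ends up as the four bytes c3 84 c2 8d.
# double_encode() and repair() are hypothetical names for this illustration.

def double_encode(text: str) -> bytes:
    """Encode to UTF-8, misread the bytes as Latin-1 characters, encode again."""
    utf8_bytes = text.encode("utf-8")        # correct encoding
    misread = utf8_bytes.decode("latin-1")   # the mistake: bytes taken as characters
    return misread.encode("utf-8")           # second, unwanted encoding


def repair(data: bytes) -> str:
    """Undo exactly one extra round of encoding (assumes Latin-1 misreading)."""
    return data.decode("utf-8").encode("latin-1").decode("utf-8")


original = "\u010d"                          # the character discussed above
once = original.encode("utf-8")
twice = double_encode(original)

print(once.hex(" "))                         # bytes of the correct encoding
print(twice.hex(" "))                        # bytes after double encoding
print(repair(twice) == original)             # round trip recovers the character
```

If the broken data really is double-encoded in this way, reversing a single round as in `repair` should recover the original text. If it raises a decoding error instead, the data was probably damaged in some other way.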