Gisle Aas wrote:
> To me it feels wrong to add such a kludge to HTML::Entities. It just
> seems to be the wrong level to do such manipulations. I would suggest
> that you just post-process the string that decode_entities() returns
> to fixup the Windows mess using tr///; example:
>
> sub cp1252_fixup {
>     # replaces the additional WinLatin-1 chars in the 0x80 - 0x9F range
>     # with the corresponding Unicode character
>     my $str = shift;
>     $str =~
>       tr/\x80-\x9f/\x{20AC}\x{FFFD}\x{201A}\x{192}\x{201E}\x{2026}\x{2020}\x{2021}\x{2C6}\x{2030}\x{160}\x{2039}\x{152}\x{FFFD}\x{17D}\x{FFFD}\x{FFFD}\x{2018}\x{2019}\x{201C}\x{201D}\x{2022}\x{2013}\x{2014}\x{2DC}\x{2122}\x{161}\x{203A}\x{153}\x{FFFD}\x{17E}\x{178}/;
>     $str;
> }

Thanks for the response! I agree that my quick hack isn't entirely the
right way to go, and I thought about a tr/// solution as well, which may
be the most appropriate. I'll likely adopt whatever is proposed as the
best solution.

In the meantime, I went with a native HTML::Entities "fix" myself,
primarily because I concluded that all modern browsers seem to do this
particular NCR conversion on the fly, so it seemed somewhat logical for
HTML::Entities to mirror that "canonical" behaviour. It also happens to
save a pass over the data, which is a minor consideration, really,
unless one happens to be dealing with enormously long strings.

What I began to imagine as a more elegant solution was an option for
HTML::Entities that would affect all users of the internal
decode_entities() routine, including HTML::Parser and so forth.
Something perhaps like:

    $p = HTML::Parser->new(fixup_cp1252 => 1);
    $p->parse("<html><body>&#151; &#151;</body></html>");

    decode_entities({fixup_cp1252 => 1}, "&#151; &#151;");

or maybe something involving a stateful option set on the HTML::Entities
class as a whole.

Anyway, do as you see best; thanks again for the great work!

Chris.

-- 
GPG Key ID: 366A375B
GPG Key Fingerprint: 485E 5041 17E1 E2BB C263 E4DE C8E3 FA36 366A 375B
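
For illustration only: the fixup_cp1252 option floated above does not exist in HTML::Entities, but its effect could be approximated without patching the module. This is a different technique from the tr/// pass suggested in the quoted message. It is a minimal sketch, assuming it is acceptable to rewrite numeric character references in the 0x80-0x9F range to their Unicode equivalents before decoding. The names remap_cp1252_ncrs and @CP1252_C1 are hypothetical, and the mapping table is copied from the tr/// list quoted above.

```perl
use strict;
use warnings;

# Hypothetical table (not part of HTML::Entities): Unicode code points for
# the CP1252 positions 0x80..0x9F, in order. Unassigned slots map to U+FFFD,
# matching the tr/// replacement list in the quoted message.
my @CP1252_C1 = (
    0x20AC, 0xFFFD, 0x201A, 0x0192, 0x201E, 0x2026, 0x2020, 0x2021,
    0x02C6, 0x2030, 0x0160, 0x2039, 0x0152, 0xFFFD, 0x017D, 0xFFFD,
    0xFFFD, 0x2018, 0x2019, 0x201C, 0x201D, 0x2022, 0x2013, 0x2014,
    0x02DC, 0x2122, 0x0161, 0x203A, 0x0153, 0xFFFD, 0x017E, 0x0178,
);

# Hypothetical helper: rewrite NCRs such as &#151; or &#x97; to the NCR of
# the corresponding Unicode character (&#x2014;). All other text, including
# named entities and NCRs outside 0x80..0x9F, is returned unchanged.
# Only semicolon-terminated references are handled in this sketch.
sub remap_cp1252_ncrs {
    my ($html) = @_;
    $html =~ s{(&#(?:[xX]([0-9a-fA-F]{1,6})|([0-9]{1,7}));)}{
        my $n = defined $2 ? hex($2) : $3;
        ($n >= 0x80 && $n <= 0x9F)
            ? sprintf('&#x%X;', $CP1252_C1[$n - 0x80])
            : $1;
    }ge;
    return $html;
}
```

The result can then be handed to decode_entities() as usual, e.g. decode_entities(remap_cp1252_ncrs($html)). Unlike a tr/// pass over the decoded string, this only touches characters that were written as references and leaves literal characters in the 0x80-0x9F range alone. The cost is an extra regex pass over the input, so it does not address the "save a pass" point.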