Re: HTML::Entities and WinLatin1 NCRs [PATCH]

Chris Darroch Tue, 07 Mar 2006 07:31:42 -0800

Gisle Aas wrote:

> To me it feels wrong to add such a kludge to HTML::Entities.  It just
> seems to be the wrong level to do such manipulations.  I would suggest
> that you just post-process the string that decode_entities() returns
> to fixup the Windows mess using tr///; example:
> 
>     sub cp1252_fixup {
>         # replaces the additional WinLatin-1 chars in the 0x80 - 0x9F range
>         # with the corresponding Unicode character
>         my $str = shift;
>         $str =~ tr/\x80-\x9f/\x{20AC}\x{FFFD}\x{201A}\x{192}\x{201E}\x{2026}\x{2020}\x{2021}\x{2C6}\x{2030}\x{160}\x{2039}\x{152}\x{FFFD}\x{17D}\x{FFFD}\x{FFFD}\x{2018}\x{2019}\x{201C}\x{201D}\x{2022}\x{2013}\x{2014}\x{2DC}\x{2122}\x{161}\x{203A}\x{153}\x{FFFD}\x{17E}\x{178}/;
>         $str;
>     }


   Thanks for the response!  I agree that my quick hack isn't entirely
the right way to go, and I thought about a tr/// solution as well,
which may be the most appropriate.  I'll likely adopt whatever is
proposed as the best solution.

   In the meantime, I went with a native HTML::Entities "fix" myself
primarily because all modern browsers seem to do this particular NCR
conversion on the fly (treating references in the 128-159 range, such
as &#151;, as their Windows-1252 characters), so it seemed somewhat
logical for HTML::Entities to mirror that "canonical" behaviour.
It also saves a pass over the data, though that is a minor
consideration unless one is dealing with enormously long strings.
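
   To make the browser-style mapping concrete, here is a minimal
sketch that rewrites NCRs in the 128-159 range to the NCR of the
corresponding Unicode character before any decoding happens.  The
names remap_cp1252_ncrs and @CP1252 are made up for this example and
are not part of HTML::Entities; the table simply repeats the mapping
from the tr/// quoted above.

```perl
use strict;
use warnings;

# Windows-1252 code points 0x80-0x9F mapped to Unicode.  Positions that
# cp1252 leaves undefined map to U+FFFD, as in the quoted tr/// table.
my @CP1252 = (
    0x20AC, 0xFFFD, 0x201A, 0x0192, 0x201E, 0x2026, 0x2020, 0x2021,
    0x02C6, 0x2030, 0x0160, 0x2039, 0x0152, 0xFFFD, 0x017D, 0xFFFD,
    0xFFFD, 0x2018, 0x2019, 0x201C, 0x201D, 0x2022, 0x2013, 0x2014,
    0x02DC, 0x2122, 0x0161, 0x203A, 0x0153, 0xFFFD, 0x017E, 0x0178,
);

# Hypothetical helper (not an HTML::Entities function): rewrite decimal
# and hex NCRs in the 128-159 range, leave every other reference alone.
sub remap_cp1252_ncrs {
    my ($html) = @_;
    $html =~ s{&#(?:(\d{1,7})|[xX]([0-9a-fA-F]{1,6}));}{
        my $n = defined $1 ? $1 : hex $2;
        ($n >= 0x80 && $n <= 0x9F)
            ? sprintf('&#%d;', $CP1252[$n - 0x80])
            : $&;
    }ge;
    return $html;
}

# prints "&#8212; &#8212; &#8212;"
print remap_cp1252_ncrs("&#151; &#x97; &#8212;"), "\n";
```

   The output can be handed to decode_entities() unchanged.  Note that
this is still an extra pass over the input, so it only illustrates the
mapping and does not address the efficiency point.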

   What I began to imagine as a more elegant solution was an
option for HTML::Entities that would affect all users
of the internal decode_entities() routine, including HTML::Parser
and so forth.  Something perhaps like:

$p = HTML::Parser->new(fixup_cp1252 => 1);
$p->parse("<html><body>&#151; &#8212;</body></html>");

decode_entities({fixup_cp1252 => 1}, "&#151; &#8212;");

or maybe something involving a stateful option set on the
HTML::Entities class as a whole.  Anyway, do as you see best;
thanks again for the great work!

Chris.

-- 
GPG Key ID: 366A375B
GPG Key Fingerprint: 485E 5041 17E1 E2BB C263  E4DE C8E3 FA36 366A 375B
