Re: HTML::Entities and WinLatin1 NCRs [PATCH]

Gisle Aas Tue, 07 Mar 2006 05:05:30 -0800

Gisle Aas <[EMAIL PROTECTED]> writes:

> Chris Darroch <[EMAIL PROTECTED]> writes:
> 
> >    I use the HTML::Entities module quite a bit and have really
> > appreciated its support for Unicode characters > 256 with Perl 5.8.
> > 
> >    I do have one particular issue that crops up for me, and I thought
> > it might affect others as well, so I'm including a crude set of
> > patches with my "fix".  In short, I have to support HTML documents
> > authored by a wide variety of people, and over time they've
> > accumulated numeric character references to the troublesome set
> > of characters between 128 and 159, mostly due to authors working
> > on Windows platforms.  The same documents now may also have
> > character references to the Unicode code points for those characters.
> > 
> >    Here's a simple example: "two &#151; em &#8212; dashes".
> > 
> >    Now, in my particular situation, I sometimes want to decode
> > these entities to the same code point, so that, for example, I can
> > match strings against each other.  At first I thought I might
> > get away with this:
> > 
> > $a = Encode::encode('utf8', $a);  # force no utf8 flag
> > HTML::Entities::decode_entities($a);
> > $a = Encode::decode('cp1252', $a) unless (Encode::is_utf8($a));
> > 
> >    But while that will turn "&#151;" into U+2014, it turns
> > "&#151;&#8212;" into U+0097 U+2014, which doesn't help.
> > 
> >    So, I whacked into place a decode_entities_cp1252() function
> > that decodes any numeric character references in the 128-159
> > range (except for a couple of undefined ones) to the UTF-8
> > equivalents.  I'm positive there are nicer, more elegant, and
> > probably more flexible ways to do this, but lacking additional
> > time to experiment, this is where I stopped.
> 
> To me it feels wrong to add such a kludge to HTML::Entities.  It just
> seems to be the wrong level to do such manipulations.  I would suggest
> that you just post-process the string that decode_entities() returns
> to fix up the Windows mess using tr///; example:
> 
>     sub cp1252_fixup {
>         # replaces the additional WinLatin-1 chars in the 0x80 - 0x9F range
>         # with the corresponding Unicode character
>         my $str = shift;
>         $str =~ tr/\x80-\x9f/\x{20AC}\x{FFFD}\x{201A}\x{192}\x{201E}\x{2026}\x{2020}\x{2021}\x{2C6}\x{2030}\x{160}\x{2039}\x{152}\x{FFFD}\x{17D}\x{FFFD}\x{FFFD}\x{2018}\x{2019}\x{201C}\x{201D}\x{2022}\x{2013}\x{2014}\x{2DC}\x{2122}\x{161}\x{203A}\x{153}\x{FFFD}\x{17E}\x{178}/;
>         $str;
>     }
>     
>     
>     my $str = "Here's a simple example: two &#151; em &#8212; dashes";
>     
>     use HTML::Entities;
>     $str = cp1252_fixup(HTML::Entities::decode($str));
>     
>     use Data::Dump;
>     print Data::Dump::dump($str), "\n";
> 
> Dan: Would it make sense to make Encode provide something like
> cp1252_fixup or is there already a way to do this with Encode?
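
For readers following along, here is a minimal sketch of why the
is_utf8() check in Chris's snippet stops helping once the two kinds of
reference are mixed, and what the tr/// approach does instead.  It uses
only core modules.  The two input strings are hypothetical stand-ins
for what decode_entities() hands back (as described in the quoted
message), and decode_unless_flagged(), fixup_dashes() and code_points()
are names made up for this illustration.  The tr/// here is cut down to
the two dash characters to keep it short.

```perl
use strict;
use warnings;
use Encode ();

# Assumed stand-ins for decode_entities() output, per the quoted report:
# "&#151;" alone stays a byte string (no utf8 flag), while adding
# "&#8212;" yields a string that holds a wide character (flag on).
my $only_c1 = "\x97";
my $mixed   = "\x{97}\x{2014}";

# Third step of the quoted snippet: reinterpret as cp1252 only when
# the utf8 flag is off.
sub decode_unless_flagged {
    my $s = shift;
    return Encode::is_utf8($s) ? $s : Encode::decode('cp1252', $s);
}

# tr/// fixup reduced to 0x96 (en dash) and 0x97 (em dash).
sub fixup_dashes {
    my $s = shift;
    $s =~ tr/\x{96}\x{97}/\x{2013}\x{2014}/;
    return $s;
}

# Render a string as a list of code points for inspection.
sub code_points {
    my $s = shift;
    return join ' ', map { sprintf 'U+%04X', ord($_) } split //, $s;
}

print code_points(decode_unless_flagged($only_c1)), "\n";  # U+2014
print code_points(decode_unless_flagged($mixed)),   "\n";  # U+0097 U+2014
print code_points(fixup_dashes($mixed)),            "\n";  # U+2014 U+2014
```

The flag check is all-or-nothing for the whole string, while tr///
works character by character, so it does not matter whether the string
also contains characters above 255.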


Looking at CPAN I found http://search.cpan.org/dist/Encode-ZapCP1252/
which looks like a good home for this.  The current version of
Encode-ZapCP1252 only translates to ASCII approximations.  David, would
you consider extending this module with the function above for
supporting Chris's use case?

Regards,
Gisle
