Gisle Aas <[EMAIL PROTECTED]> writes:
> Chris Darroch <[EMAIL PROTECTED]> writes:
>
> > I use the HTML::Entities module quite a bit and have really
> > appreciated its support for Unicode characters > 256 with Perl 5.8.
> >
> > I do have one particular issue that crops up for me, and I thought
> > it might affect others as well, so I'm including a crude set of
> > patches with my "fix". In short, I have to support HTML documents
> > authored by a wide variety of people, and over time they've
> > accumulated numeric character references to the troublesome set
> > of characters between 128 and 159, mostly due to authors working
> > on Windows platforms. The same documents now may also have
> > character references to the Unicode code points for those characters.
> >
> > Here's a simple example: "two &#151; em &#8212; dashes".
> >
> > Now, in my particular situation, I sometimes want to decode
> > these entities to the same code point, so that, for example, I can
> > match strings against each other. At first I thought I might
> > get away with this:
> >
> > $a = Encode::encode('utf8', $a); # force no utf8 flag
> > HTML::Entities::decode_entities($a);
> > $a = Encode::decode('cp1252', $a) unless (Encode::is_utf8($a));
> >
> > But while that will turn "&#151;" into U+2014, it turns
> > "&#151;&#8212;" into U+0097 U+2014, which doesn't help.
> >
> > So, I whacked into place a decode_entities_cp1252() function
> > that decodes any numeric character references in the 128-159
> > range (except for a couple of undefined ones) to the UTF-8
> > equivalents. I'm positive there are nicer, more elegant, and
> > probably more flexible ways to do this, but lacking additional
> > time to experiment, this is where I stopped.
>
> To me it feels wrong to add such a kludge to HTML::Entities. It just
> seems to be the wrong level to do such manipulations. I would suggest
> that you just post-process the string that decode_entities() returns
> to fix up the Windows mess using tr///. For example:
>
> sub cp1252_fixup {
>     # replaces the additional WinLatin-1 chars in the 0x80 - 0x9F range
>     # with the corresponding Unicode character
>     my $str = shift;
>     $str =~
>       tr/\x80-\x9f/\x{20AC}\x{FFFD}\x{201A}\x{192}\x{201E}\x{2026}\x{2020}\x{2021}\x{2C6}\x{2030}\x{160}\x{2039}\x{152}\x{FFFD}\x{17D}\x{FFFD}\x{FFFD}\x{2018}\x{2019}\x{201C}\x{201D}\x{2022}\x{2013}\x{2014}\x{2DC}\x{2122}\x{161}\x{203A}\x{153}\x{FFFD}\x{17E}\x{178}/;
>     $str;
> }
>
>
> my $str = "Here's a simple example: two &#151; em &#8212; dashes";
>
> use HTML::Entities;
> $str = cp1252_fixup(HTML::Entities::decode($str));
>
> use Data::Dump;
> print Data::Dump::dump($str), "\n";
>
> Dan: Would it make sense to make Encode provide something like
> cp1252_fixup or is there already a way to do this with Encode?

Looking at CPAN, I found http://search.cpan.org/dist/Encode-ZapCP1252/
which looks like a good home for this. The current version of
Encode-ZapCP1252 only translates to ASCII approximations. David, would
you consider extending this module with the function above to
support Chris's use case?
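
If post-processing the decoded string is not an option, the same
mapping could in principle be applied to the references themselves
before decoding, which is closer to what Chris described. The
following is only a rough sketch under that assumption:
fix_cp1252_refs is a made-up name, not part of HTML::Entities or
Encode-ZapCP1252, and the table is simply the one from the tr///
above. It touches only numeric references in the 128-159 range and
leaves everything else in the text alone.

```perl
use strict;
use warnings;

# Unicode code points for the CP1252 bytes 0x80..0x9F, in order.
# 0xFFFD (REPLACEMENT CHARACTER) stands in for the five bytes that
# CP1252 leaves undefined, as in the tr/// table above.
my @CP1252_C1 = (
    0x20AC, 0xFFFD, 0x201A, 0x0192, 0x201E, 0x2026, 0x2020, 0x2021,
    0x02C6, 0x2030, 0x0160, 0x2039, 0x0152, 0xFFFD, 0x017D, 0xFFFD,
    0xFFFD, 0x2018, 0x2019, 0x201C, 0x201D, 0x2022, 0x2013, 0x2014,
    0x02DC, 0x2122, 0x0161, 0x203A, 0x0153, 0xFFFD, 0x017E, 0x0178,
);

# Hypothetical helper: rewrite decimal or hex numeric character
# references in the 128..159 range so that they name the Unicode
# code point the author most likely meant. Other references and
# all other text are returned unchanged.
sub fix_cp1252_refs {
    my $html = shift;
    $html =~ s{(&#(?:(\d{1,7})|[xX]([0-9a-fA-F]{1,6}));)}{
        my $n = defined $2 ? $2 : hex $3;
        ($n >= 0x80 && $n <= 0x9F)
            ? sprintf('&#%d;', $CP1252_C1[$n - 0x80])
            : $1;
    }ge;
    return $html;
}
```

The result can then be passed to decode_entities() as usual. With
the example string above, both references become &#8212; and so
decode to U+2014.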
Regards,
Gisle