I ran into this problem during mod_perl development, and I'm posting it to
this list hoping that other mod_perl developers have dealt with the same
thing and have good solutions :)

I've found that strings collected while processing XML with XML::Parser do
not play nicely with the HTML::Entities module. Here's a sample program
illustrating the problem:

    #!/usr/bin/perl -w
    use strict;
    use HTML::Entities;
    use XML::Parser;

    my $buffer;
    my $p = XML::Parser->new(Handlers => { Char => \&xml_char });
    my $xml = '<?xml version="1.0" encoding="iso-8859-1"?><test>' .
              chr(0xE9) . '</test>';
    $p->parse($xml);
    print encode_entities($buffer), "\n";

    sub xml_char
    {
        my($expat, $string) = @_;
        $buffer .= $string;
    }

The output unfortunately looks like this:

    &Atilde;&copy;

Which makes very little sense, since the correct entity for 0xE9 is:

    &eacute;

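
One possible explanation, offered as an assumption rather than something
established above: if the buffer holds U+00E9 internally as UTF-8 and the
entity encoding works on the underlying bytes, each byte would be escaped
on its own. The sketch below only shows that U+00E9 becomes two bytes in
UTF-8. It uses the Encode module, which is not part of the sample program.

```perl
use strict;
use warnings;
use Encode qw(encode);   # assumption: Encode is available (core in newer Perls)

# U+00E9 as a single Perl character.
my $char = chr(0xE9);

# Encode it as UTF-8 and look at the individual bytes.
my @bytes = unpack("C*", encode("UTF-8", $char));

# Prints: 2 byte(s): 0xC3 0xA9
printf "%d byte(s): %s\n", scalar(@bytes),
    join(" ", map { sprintf "0x%02X", $_ } @bytes);
```

In iso-8859-1, 0xC3 is A-tilde and 0xA9 is the copyright sign, so a
byte-by-byte entity encoding of those two bytes would give two entities
instead of one.
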
My current work-around is to run the buffer through a (lossy!?) pack/unpack
cycle:

    my $buffer2 = pack("C*", unpack("U*", $buffer));
    print encode_entities($buffer2), "\n";

This works and prints:

    &eacute;

I hope it is not lossy when using iso-8859-1 encoding, but I'm guessing it
will maul UTF-8 or UTF-16. This seems like quite an evil hack.
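
To make that worry concrete, here is a minimal sketch of the same
pack/unpack cycle applied to a character above 0xFF. The helper name
squash_to_bytes is hypothetical and exists only for this illustration,
and the behaviour described is that of a reasonably recent Perl.

```perl
use strict;
use warnings;

# Hypothetical helper wrapping the pack/unpack work-around.
sub squash_to_bytes {
    my ($string) = @_;
    # pack "C" keeps only the low 8 bits of each value and warns about it;
    # the warning is silenced here so the demonstration stays quiet.
    no warnings 'pack';
    return pack("C*", unpack("U*", $string));
}

my $latin1 = squash_to_bytes(chr(0xE9));    # U+00E9 fits in one byte
my $euro   = squash_to_bytes("\x{20AC}");   # U+20AC (euro sign) does not

printf "U+00E9 -> 0x%02X\n", ord($latin1);  # 0xE9, unchanged
printf "U+20AC -> 0x%02X\n", ord($euro);    # 0xAC, high bits lost
```

So the cycle looks safe as long as every character is below 0x100, which
is the case for iso-8859-1 input, and silently truncates anything above
that.
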

So, what is the Right Thing to do here? Which module, if any, is at fault?
Is there some combination of Perl Unicode-related "use" statements that will
help me here? Has anyone else run into this problem?

-John