I ran into this problem during mod_perl development, and I'm posting it to this list hoping that other mod_perl developers have dealt with the same thing and have good solutions :)
I've found that strings collected while processing XML using XML::Parser do not play nice with the HTML::Entities module. Here's a sample program illustrating the problem:

    #!/usr/bin/perl -w

    use strict;

    use HTML::Entities;
    use XML::Parser;

    my $buffer;

    my $p = XML::Parser->new(Handlers => { Char => \&xml_char });

    my $xml = '<?xml version="1.0" encoding="iso-8859-1"?><test>' .
              chr(0xE9) . '</test>';

    $p->parse($xml);

    print encode_entities($buffer), "\n";

    sub xml_char
    {
      my($expat, $string) = @_;

      $buffer .= $string;
    }

The output unfortunately looks like this:

    &Atilde;&copy;

Which makes very little sense, since the correct entity for 0xE9 is:

    &eacute;

My current work-around is to run the buffer through a (lossy!?) pack/unpack cycle:

    my $buffer2 = pack("C*", unpack("U*", $buffer));
    print encode_entities($buffer2), "\n";

This works and prints:

    &eacute;

I hope it is not lossy when using iso-8859-1 encoding, but I'm guessing it will maul UTF-8 or UTF-16. This seems like quite an evil hack.

So, what is the Right Thing to do here? Which module, if any, is at fault? Is there some combination of Perl Unicode-related "use" statements that will help me here? Has anyone else run into this problem?

-John
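
The lossiness worry about the pack/unpack cycle could be checked explicitly rather than hoped for. What follows is only a minimal sketch, assuming a Perl that ships the core Encode module (5.8 or later, which may not match the setup described above); the helper name to_latin1 is hypothetical and not part of any module. It converts a character string to ISO-8859-1 bytes and returns undef when some character does not fit in that encoding, instead of silently mangling it.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical helper (not from XML::Parser or HTML::Entities):
# returns the ISO-8859-1 byte string for $chars, or undef if any
# character cannot be represented in ISO-8859-1.
sub to_latin1 {
    my ($chars) = @_;

    # FB_CROAK makes encode() die on an unmappable character;
    # LEAVE_SRC keeps encode() from modifying $chars in place.
    my $bytes = eval {
        Encode::encode('iso-8859-1', $chars,
                       Encode::FB_CROAK() | Encode::LEAVE_SRC());
    };

    return $bytes;    # undef if the eval died
}

my $ok  = to_latin1("caf\x{E9}");       # e-acute fits in ISO-8859-1
my $bad = to_latin1("smile \x{263A}");  # U+263A does not

print defined $ok  ? "converted\n" : "would be lossy\n";
print defined $bad ? "converted\n" : "would be lossy\n";
```

Whether passing the resulting bytes to encode_entities is the Right Thing is still the open question; the sketch only shows a way to detect data that a byte-oriented conversion would destroy.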