I ran into this problem during mod_perl development, and I'm posting it to
this list hoping that other mod_perl developers have dealt with the same
thing and have good solutions :)

I've found that strings collected while processing XML with XML::Parser do
not play nicely with the HTML::Entities module.  Here's a sample program
illustrating the problem:

    #!/usr/bin/perl -w

    use strict;

    use HTML::Entities;
    use XML::Parser;

    my $buffer = '';  # accumulates character data from the Char handler

    my $p = XML::Parser->new(Handlers => { Char  => \&xml_char });

    my $xml = '<?xml version="1.0" encoding="iso-8859-1"?><test>' .
              chr(0xE9) . '</test>';

    $p->parse($xml);

    print encode_entities($buffer), "\n";

    sub xml_char
    {
      my($expat, $string) = @_;
  
      $buffer .= $string;
    }

The output unfortunately looks like this:

    &Atilde;&copy;

Which makes very little sense, since the correct entity for 0xE9 is:

    &eacute;

My current work-around is to run the buffer through a (lossy!?) pack/unpack
cycle:

    my $buffer2 = pack("C*", unpack("U*", $buffer));
    print encode_entities($buffer2), "\n";

This works and prints:

    &eacute;

I hope it is not lossy when the input is iso-8859-1, but I'm guessing it
will maul any character above 0xFF, which UTF-8 or UTF-16 input can easily
contain.  This seems like quite an evil hack.
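
For what it's worth, &Atilde;&copy; is exactly what you would get if the
buffer held the two UTF-8 bytes for U+00E9 (0xC3 0xA9) and each byte were
encoded as its own Latin-1 character.  Here is a minimal sketch of that guess,
assuming the Encode module is installed.  The hard-coded $bytes stands in for
the parser output and is only an assumption about what XML::Parser hands back:

```perl
#!/usr/bin/perl -w

use strict;

use Encode qw(decode);
use HTML::Entities;

# Assumption: the buffer holds the UTF-8 encoding of U+00E9 as two
# separate bytes rather than as one character.
my $bytes = "\xC3\xA9";

# Encoding the raw bytes treats each one as its own Latin-1 character.
my $as_bytes = encode_entities($bytes);

# Decoding the UTF-8 bytes into a character string first gives
# encode_entities a single character to work on.
my $chars    = decode('UTF-8', $bytes);
my $as_chars = encode_entities($chars);

print "$as_bytes\n";   # &Atilde;&copy;
print "$as_chars\n";   # &eacute;
```

If that guess is right, decoding would not be limited to characters below
0x100 the way pack("C*", ...) is, so it might be a less lossy route than my
work-around.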

So, what is the Right Thing to do here?  Which module, if any, is at fault?
Is there some combination of Perl Unicode-related "use" statements that will
help me here?  Has anyone else run into this problem?

-John
