John Siracusa wrote:
> I ran into this problem during mod_perl development, and I'm posting it to
> this list hoping that other mod_perl developers have dealt with the same
> thing and have good solutions :)

I did ;-)

> I've found that strings collected while processing XML using XML::Parser do
> not play nice with the HTML::Entities module.  Here's the sample program
> illustrating the problem:
> 
>     #!/usr/bin/perl -w
> 
>     use strict;
> 
>     use HTML::Entities;
>     use XML::Parser;
> 
>     my $buffer;
> 
>     my $p = XML::Parser->new(Handlers => { Char  => \&xml_char });
> 
>     my $xml = '<?xml version="1.0" encoding="iso-8859-1"?><test>' .
>               chr(0xE9) . '</test>';
> 
>     $p->parse($xml);
> 
>     print encode_entities($buffer), "\n";
> 
>     sub xml_char
>     {
>       my($expat, $string) = @_;
>   
>       $buffer .= $string;
>     }
> 
> The output unfortunately looks like this:
> 
>     &Atilde;&copy;
> 
> Which makes very little sense, since the correct entity for 0xE9 is:
> 
>     &eacute;

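That's an XML::Parser issue, not an HTML::Entities one.
XML::Parser gives UTF-8 to your Char handler, as specified in the manpage:
"Whatever the encoding of the string in the original document,
this is given to the handler in UTF-8."
In UTF-8, 0xE9 becomes the two bytes 0xC3 0xA9. HTML::Entities reads them
as two iso-8859-1 characters, hence &Atilde;&copy;.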
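
To see where the two entities come from, here is a minimal sketch that
reproduces the output without XML::Parser at all. It assumes the buffer
holds the UTF-8 encoding of e-acute as a plain byte string, which matches
the output you report; the Encode module is my addition here and is used
only to build those bytes.

```perl
use strict;
use warnings;

use Encode qw(encode);
use HTML::Entities qw(encode_entities);

# U+00E9 (e-acute) encoded as UTF-8 is a two-byte string.
my $utf8_bytes = encode('UTF-8', chr(0xE9));

# Show the individual byte values in hex.
printf "%vX\n", $utf8_bytes;                # C3.A9

# encode_entities treats each byte as one iso-8859-1 character:
# 0xC3 is A-tilde and 0xA9 is the copyright sign.
print encode_entities($utf8_bytes), "\n";   # &Atilde;&copy;
```

Depending on the Perl and XML::Parser versions in use, the handler may
instead receive a character string that HTML::Entities handles correctly,
so treat this as an illustration of the byte-level mix-up only.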

The workaround I used is to write the handler like this:

sub xml_char
{
   my ($expat) = @_;
   $buffer .= $expat->original_string;
}

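original_string returns the text exactly as it appears in the document,
in the document's own encoding, so there is no need to convert UTF-8
back to iso-8859-1. One caveat: because the text is verbatim, entity
references such as &amp; should come through unexpanded.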
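
For completeness, here is your sample program with that handler dropped
in. It is a sketch under the same assumptions as your original script
(an iso-8859-1 document passed to parse() as a string); the only other
changes are "use warnings" and initialising $buffer to an empty string.

```perl
#!/usr/bin/perl

use strict;
use warnings;

use HTML::Entities qw(encode_entities);
use XML::Parser;

my $buffer = '';

# Append the text as it appears in the source document (original
# encoding) instead of the UTF-8 string passed as the second argument.
sub xml_char
{
    my ($expat) = @_;
    $buffer .= $expat->original_string;
}

my $p = XML::Parser->new(Handlers => { Char => \&xml_char });

my $xml = '<?xml version="1.0" encoding="iso-8859-1"?><test>' .
          chr(0xE9) . '</test>';

$p->parse($xml);

# The buffer now holds iso-8859-1 bytes, which is what
# encode_entities expects for a byte string.
print encode_entities($buffer), "\n";
```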

> My current work-around is to run the buffer through a (lossy!?) pack/unpack
> cycle:
> 
>     my $buffer2 = pack("C*", unpack("U*", $buffer));
>     print encode_entities($buffer2), "\n";
> 
> This works and prints:
> 
>     &eacute;
> 
> I hope it is not lossy when using iso-8859-1 encoding, but I'm guessing it
> will maul UTF-8 or UTF-16.  This seems like quite an evil hack.
> 
> So, what is the Right Thing to do here?  Which module, if any, is at fault?
> Is there some combination of Perl Unicode-related "use" statements that will
> help me here?  Has anyone else run into this problem?
> 
> -John
> 



-- 
Rafael Garcia-Suarez
