HTML::Entities chokes on XML::Parser strings
I ran into this problem during mod_perl development, and I'm posting it to this list hoping that other mod_perl developers have dealt with the same thing and have good solutions :)

I've found that strings collected while processing XML using XML::Parser do not play nice with the HTML::Entities module. Here's the sample program illustrating the problem:

    #!/usr/bin/perl -w
    use strict;
    use HTML::Entities;
    use XML::Parser;

    my $buffer;
    my $p = XML::Parser->new(Handlers => { Char => \&xml_char });
    my $xml = '<?xml version="1.0" encoding="iso-8859-1"?><test>' . chr(0xE9) . '</test>';
    $p->parse($xml);
    print encode_entities($buffer), "\n";

    sub xml_char {
      my($expat, $string) = @_;
      $buffer .= $string;
    }

The output unfortunately looks like this:

    &Atilde;&copy;

Which makes very little sense, since the correct entity for 0xE9 is:

    &eacute;

My current work-around is to run the buffer through a (lossy!?) pack/unpack cycle:

    my $buffer2 = pack("C*", unpack("U*", $buffer));
    print encode_entities($buffer2), "\n";

This works and prints:

    &eacute;

I hope it is not lossy when using iso-8859-1 encoding, but I'm guessing it will maul UTF-8 or UTF-16. This seems like quite an evil hack.

So, what is the Right Thing to do here? Which module, if any, is at fault? Is there some combination of Perl Unicode-related "use" statements that will help me here? Has anyone else run into this problem?

-John
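
The &Atilde;&copy; output can be reproduced without either module. The sketch below is only an illustration: it assumes Perl 5.8 or later (for the core Encode module), and the three-entry %latin1_entity table and the name_entities() helper are hypothetical stand-ins for what HTML::Entities does, not its real implementation. It shows that 0xE9 becomes the two bytes 0xC3 0xA9 in UTF-8, and that naming those bytes one at a time as ISO-8859-1 characters yields exactly the puzzling output.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical stand-in for a Latin-1 entity table; only the three
# code points needed for this demonstration are listed.
my %latin1_entity = (
    0xC3 => 'Atilde',
    0xA9 => 'copy',
    0xE9 => 'eacute',
);

# Replace each element of the string (byte or character) with its
# named entity if the table knows it, otherwise leave it alone.
sub name_entities {
    my ($string) = @_;
    return join '', map {
        my $code = ord;
        exists $latin1_entity{$code} ? "&$latin1_entity{$code};" : $_;
    } split //, $string;
}

my $e_acute    = chr(0xE9);                  # one character, U+00E9
my $utf8_bytes = encode('UTF-8', $e_acute);  # two bytes, 0xC3 0xA9

my $hex = join ' ', map { sprintf '%02X', ord } split //, $utf8_bytes;

print "UTF-8 bytes:      $hex\n";
print "bytes as Latin-1: ", name_entities($utf8_bytes), "\n";
print "single character: ", name_entities($e_acute), "\n";
```

So the entity encoder is not inventing anything; it is being handed two bytes where one character was expected.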
Re: HTML::Entities chokes on XML::Parser strings
The output from your example looks like UTF-8 data (&Atilde; is a commonly seen UTF-8 escape sequence).

XML::Parser converts all incoming text into UTF-8. You will need to convert it back to iso-8859-1. My favorite is Text::Iconv:

    use Text::Iconv;
    my $converter = Text::Iconv->new("UTF-8", "ISO8859-1");
    my $buffer_latin1 = $converter->convert($buffer);

On Tue, May 07, 2002 at 10:51:10AM -0400, John Siracusa wrote:
> I've found that strings collected while processing XML using XML::Parser do not play nice with the HTML::Entities module.
> [...]
> The output unfortunately looks like this:
>
>     &Atilde;&copy;
>
> Which makes very little sense, since the correct entity for 0xE9 is:
>
>     &eacute;

-- 
Paul Lindner
mod_perl Developer's Cookbook http://www.modperlcookbook.org/
Human Rights Declaration http://www.unhchr.ch/udhr/
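
For readers without Text::Iconv installed, the same UTF-8 to ISO-8859-1 conversion can be sketched with the core Encode module. This swaps in a different technique from the one in the message above, and it rests on two assumptions: Perl 5.8 or later, and a buffer that holds raw UTF-8 bytes. The helper name utf8_to_latin1() is made up for this example.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Convert a UTF-8 byte string to ISO-8859-1 bytes.
# Dies if the input is not valid UTF-8, or if it contains a
# character that ISO-8859-1 cannot represent.
sub utf8_to_latin1 {
    my ($utf8_bytes) = @_;
    my $check = Encode::FB_CROAK() | Encode::LEAVE_SRC();
    my $chars = decode('UTF-8', $utf8_bytes, $check);
    return encode('ISO-8859-1', $chars, $check);
}

# Assumed input: the two bytes collected for the character 0xE9.
my $buffer        = "\xC3\xA9";
my $buffer_latin1 = utf8_to_latin1($buffer);

printf "converted to %d byte(s), first is 0x%02X\n",
    length $buffer_latin1, ord $buffer_latin1;
```

Passing LEAVE_SRC keeps Encode from modifying the caller's buffer while it checks the data.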
Re: HTML::Entities chokes on XML::Parser strings
John Siracusa wrote:
> I ran into this problem during mod_perl development, and I'm posting it to this list hoping that other mod_perl developers have dealt with the same thing and have good solutions :)

I did ;-)

> I've found that strings collected while processing XML using XML::Parser do not play nice with the HTML::Entities module.
> [...]
> The output unfortunately looks like this:
>
>     &Atilde;&copy;
>
> Which makes very little sense, since the correct entity for 0xE9 is:
>
>     &eacute;

That's an XML::Parser issue. XML::Parser gives UTF-8 to your Char handler, as specified in the manpage:

    Whatever the encoding of the string in the original document, this is given to the handler in UTF-8.

The workaround I used is to write the handler like this:

    sub xml_char {
      my ($expat) = @_;
      $buffer .= $expat->original_string;
    }

Reading the original string, no need to convert UTF-8 back to iso-8859-1.

-- 
Rafael Garcia-Suarez
Re: HTML::Entities chokes on XML::Parser strings
On 5/7/02 10:58 AM, Paul Lindner wrote:
> The output from your example looks like UTF-8 data (&Atilde; is a commonly seen UTF-8 escape sequence).
>
> XML::Parser converts all incoming text into UTF-8. You will need to convert it back to iso-8859-1. My favorite is Text::Iconv:
>
>     use Text::Iconv;
>     my $converter = Text::Iconv->new("UTF-8", "ISO8859-1");
>     my $buffer_latin1 = $converter->convert($buffer);

So HTML::Entities only works with ISO8859-1 (or ASCII, presumably)? What if I have actual UTF-8 data? Won't conversion to ISO8859-1 in service of HTML::Entities result in data loss?

-John
Re: HTML::Entities chokes on XML::Parser strings
On 5/7/02 11:06 AM, Rafael Garcia-Suarez wrote:
> The workaround I used is to write the handler like this:
>
>     sub xml_char {
>       my ($expat) = @_;
>       $buffer .= $expat->original_string;
>     }
>
> Reading the original string, no need to convert UTF-8 back to iso-8859-1.

Doh! I dunno why I didn't think of that, since I've used that expat method plenty of times before. This seems safer than forcing a conversion from UTF-8 to something else (although the other technique is nice to know too :)

-John
Re: HTML::Entities chokes on XML::Parser strings
John Siracusa writes:
> On 5/7/02 10:58 AM, Paul Lindner wrote:
>> XML::Parser converts all incoming text into UTF-8. You will need to convert it back to iso-8859-1.
>> [...]
>
> So HTML::Entities only works with ISO8859-1 (or ASCII, presumably)?

Not true. But the Unicode support in perl-5.6.x has many bugs. With 5.8 things will be better. It is a bad idea for XML::Parser to give out strings with the UTF8 flag set.

Regards,
Gisle
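
To illustrate why entity encoding need not be tied to ISO-8859-1 once a string is handled as characters rather than bytes, here is a rough sketch. It assumes Perl 5.8 or later with the core Encode module, and numeric_entities() is a hypothetical helper written for this example, not a function from HTML::Entities. It emits decimal character references, so it works for any code point, and it shows how the result differs when the same routine is handed undecoded UTF-8 bytes.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: replace every non-ASCII element of the string
# with a decimal character reference built from its ordinal value.
sub numeric_entities {
    my ($string) = @_;
    $string =~ s/([^\x00-\x7F])/sprintf('&#%d;', ord $1)/ge;
    return $string;
}

# U+00E9 and U+263A as raw UTF-8 bytes, separated by a space.
my $utf8_bytes = "\xC3\xA9 \xE2\x98\xBA";

# Decoding turns the five non-ASCII bytes into two characters.
my $chars = decode('UTF-8', $utf8_bytes);

my $from_chars = numeric_entities($chars);
my $from_bytes = numeric_entities($utf8_bytes);

print "decoded first: $from_chars\n";
print "raw bytes:     $from_bytes\n";
```

The first line carries one reference per character; the second carries one per byte, which is the same mistake as &Atilde;&copy; in numeric form.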
Re: HTML::Entities chokes on XML::Parser strings
On 5/7/02 11:25 AM, Gisle Aas wrote:
> John Siracusa writes:
>> So HTML::Entities only works with ISO8859-1 (or ASCII, presumably)?
>
> Not true. But the Unicode support in perl-5.6.x has many bugs. With 5.8 things will be better. It is a bad idea for XML::Parser to give out strings with the UTF8 flag set.

Well, I'll let you guys figure it out (all fixed in 5.8, right? :) In the meantime, I guess I'll stick with the workaround(s) posted... :)

-John
Re: HTML::Entities chokes on XML::Parser strings
On Tue, May 07, 2002 at 11:13:43AM -0400, John Siracusa wrote:
> On 5/7/02 10:58 AM, Paul Lindner wrote:
>> XML::Parser converts all incoming text into UTF-8. You will need to convert it back to iso-8859-1.
>> [...]
>
> So HTML::Entities only works with ISO8859-1 (or ASCII, presumably)? What if I have actual UTF-8 data? Won't conversion to ISO8859-1 in service of HTML::Entities result in data loss?

Yes, HTML::Entities is based on ISO8859-1 input only.

BTW, for better performance in mod_perl consider using Apache::Util::escape_html():

    escape_html
        This routine replaces unsafe characters in $string with their entity representation.

            my $esc = Apache::Util::escape_html($html);

Anyway, back to character entities. Text::Iconv will fail if you try to convert unconvertible text, so at least you can test for that condition (and adjust accordingly).

BasisTech sells a comprehensive Unicode library called Rosette that knows how to automatically convert to a target character set while incorporating SGML entities for any character set. Perhaps it's time for an open implementation of that.

Also see http://rf.net/~james/perli18n.html for a Perl i18n FAQ.

-- 
Paul Lindner
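
The "test for that condition and adjust accordingly" idea can be sketched as follows. This is not Text::Iconv or Rosette: it uses the core Encode module instead (so it assumes Perl 5.8 or later), and to_latin1_with_fallback() is a made-up name. A strict conversion is tried first; if it fails because some character has no ISO-8859-1 equivalent, the conversion is repeated with Encode's FB_HTMLCREF mode, which substitutes decimal character references for those characters.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Return ISO-8859-1 bytes for a UTF-8 byte string. Characters that
# ISO-8859-1 cannot represent come back as &#NNN; references.
sub to_latin1_with_fallback {
    my ($utf8_bytes) = @_;
    my $chars = decode('UTF-8', $utf8_bytes);

    # Strict attempt: encode() dies on the first unconvertible character.
    my $latin1 = eval {
        encode('ISO-8859-1', $chars,
               Encode::FB_CROAK() | Encode::LEAVE_SRC());
    };
    return $latin1 if defined $latin1;

    # Fallback: let Encode insert character references instead of dying.
    return encode('ISO-8859-1', $chars, Encode::FB_HTMLCREF());
}

my $plain = to_latin1_with_fallback("\xC3\xA9");               # U+00E9 only
my $mixed = to_latin1_with_fallback("\xC3\xA9 \xE2\x98\xBA");  # U+00E9, U+263A

printf "plain: %d byte(s)\n", length $plain;
printf "mixed: %d byte(s)\n", length $mixed;
```

Markup characters such as & and < are not escaped by this sketch, so it would have to run after, not instead of, ordinary HTML escaping.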