Re: HTML::Entities chokes on XML::Parser strings

Paul Lindner Tue, 07 May 2002 08:18:46 -0700

On Tue, May 07, 2002 at 11:13:43AM -0400, John Siracusa wrote:
> On 5/7/02 10:58 AM, Paul Lindner wrote:
> > The output from your example looks like UTF-8 data (&Atilde; is what
> > you typically see when UTF-8 bytes are treated as iso-8859-1).
> > XML::Parser converts all incoming text into UTF-8.  You will need to
> > convert it back to iso-8859-1.
> > 
> > My favorite is Text::Iconv:
> > 
> >        use Text::Iconv;
> >        my $converter = Text::Iconv->new("UTF-8", "ISO8859-1");
> > 
> >        my $buffer_latin1 = $converter->convert($buffer);
> 
> So HTML::Entities only works with ISO8859-1 (or ASCII, presumably)?  What if
> I have actual UTF-8 data?  Won't conversion to ISO8859-1 in service of
> HTML::Entities result in data loss?


Yes, HTML::Entities assumes ISO8859-1 input only.  BTW, for better
performance under mod_perl, consider using Apache::Util::escape_html():


 escape_html
           This routine replaces unsafe characters in $string
           with their entity representation.

            my $esc = Apache::Util::escape_html($html);


Anyway, back to character entities.

Text::Iconv will fail if you try to convert unconvertible text, so at
least you can test for that condition (and adjust accordingly).
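
As a rough sketch of that kind of check, here is the same idea using
the Encode module that ships with Perl 5.8 and later, swapped in for
Text::Iconv so the example has no non-core dependencies.  The helper
name to_latin1_or_undef is made up for illustration.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical helper: takes UTF-8 bytes and returns ISO-8859-1 bytes,
# or undef if the input is malformed or contains a character that
# ISO-8859-1 cannot represent.
sub to_latin1_or_undef {
    my ($utf8_bytes) = @_;
    my $check = Encode::FB_CROAK | Encode::LEAVE_SRC;
    my $latin1 = eval {
        # FB_CROAK makes both steps die instead of silently substituting.
        my $chars = decode('UTF-8', $utf8_bytes, $check);
        encode('ISO-8859-1', $chars, $check);
    };
    return $latin1;    # undef when either step died
}

my $converted = to_latin1_or_undef("caf\xC3\xA9");
print defined $converted ? "converted OK\n" : "cannot convert\n";
```

A caller can then decide what to do when undef comes back, for
example leaving the data as UTF-8 and declaring that charset instead.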

BasisTech sells a comprehensive Unicode library called Rosette that
knows how to automatically convert to a target character set while
incorporating SGML entities for any character set.  Perhaps it's time
for an open implementation of that.
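
For what it's worth, a small part of that idea can be approximated
with the core Encode module (Perl 5.8 and later), which is my
suggestion here and has no connection to Rosette.  Its FB_HTMLCREF
fallback writes any character the target set lacks as a decimal
character reference.  The helper name utf8_to_latin1_with_refs is
hypothetical.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical helper: UTF-8 bytes in, ISO-8859-1 bytes out.  Characters
# with no ISO-8859-1 equivalent become decimal references (&#NNN;)
# instead of being dropped or causing a failure.
sub utf8_to_latin1_with_refs {
    my ($utf8_bytes) = @_;
    my $chars = decode('UTF-8', $utf8_bytes);
    return encode('ISO-8859-1', $chars, Encode::FB_HTMLCREF);
}

# The euro sign (U+20AC) is not in ISO-8859-1, so it comes out as &#8364;
print utf8_to_latin1_with_refs("price: \xE2\x82\xAC5"), "\n";
```

Note that this only handles unmappable characters; it does not escape
&, < or >, so an HTML-escaping pass is still needed for markup safety.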

Also see http://rf.net/~james/perli18n.html for a Perl i18n FAQ.




-- 
Paul Lindner    [EMAIL PROTECTED]   ||||| | | | |  |  |  |   |   |

    mod_perl Developer's Cookbook   http://www.modperlcookbook.org/
         Human Rights Declaration   http://www.unhchr.ch/udhr/
