On Tue, May 07, 2002 at 11:13:43AM -0400, John Siracusa wrote:
> On 5/7/02 10:58 AM, Paul Lindner wrote:
> > The output from your example looks like UTF-8 data (Ã is a
> > commonly seen UTF-8 escape sequence). XML::Parser converts all
> > incoming text into UTF-8. You will need to convert it back to
> > iso-8859-1.
> >
> > My favorite is Text::Iconv
> >
> > use Text::Iconv;
> > $utf8tolatin1 = Text::Iconv->new("UTF-8", "ISO8859-1");
> >
> > my $buffer_latin1 = $utf8tolatin1->convert($buffer);
>
> So HTML::Entities only works with ISO8859-1 (or ASCII, presumably)? What if
> I have actual UTF-8 data? Won't conversion to ISO8859-1 in service of
> HTML::Entities result in data loss?

Yes, HTML::Entities is based on ISO8859-1 input only. BTW, for better
performance in mod_perl consider using Apache::Util::escape_html():

  escape_html
      This routine replaces unsafe characters in $string
      with their entity representation.

      my $esc = Apache::Util::escape_html($html);

Anyway, back to character entities.

Text::Iconv will fail if you try to convert unconvertible text, so at
least you can test for that condition (and adjust accordingly).
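
Here is a rough sketch of testing for that condition. It relies on
Text::Iconv's convert() returning undef when the conversion fails.
Whether an unmappable character counts as a failure depends on the
underlying iconv library (some substitute a placeholder instead), so
treat the behavior on the second string as an assumption. to_latin1()
is a hypothetical helper name, not part of Text::Iconv.

```perl
use strict;
use warnings;
use Text::Iconv;

# Assumption: input is UTF-8 encoded bytes, e.g. text from XML::Parser.
my $converter = Text::Iconv->new("UTF-8", "ISO-8859-1");

# Hypothetical helper: returns the ISO-8859-1 bytes, or undef if the
# converter reported a failure.
sub to_latin1 {
    my ($buffer) = @_;
    return scalar $converter->convert($buffer);
}

my $ok  = to_latin1("caf\xC3\xA9");            # e-acute exists in ISO-8859-1
my $bad = to_latin1("price: \xE2\x82\xAC5");   # the euro sign does not

print defined $ok  ? "converted\n" : "conversion failed\n";
print defined $bad ? "converted\n" : "conversion failed, adjust accordingly\n";
```

If you would rather get an exception than check for undef, calling
Text::Iconv->raise_error(1) before converting makes failures die.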

BasisTech sells a comprehensive Unicode library called Rosette that
knows how to automatically convert to a target character set while
incorporating SGML entities for any character set. Perhaps it's time
for an open implementation of that.
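
As a small illustration of the idea (this is not Rosette, and not
Text::Iconv either), the Encode module bundled with Perl 5.8 and later
can do a limited version of it: encode to the target character set and
fall back to numeric character references for anything the target
cannot represent. A minimal sketch, assuming the input is UTF-8 bytes;
utf8_to_latin1_html() is a hypothetical helper name.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical helper: UTF-8 bytes in, ISO-8859-1 bytes out, with
# unmappable characters replaced by decimal references (&#NNNN;).
sub utf8_to_latin1_html {
    my ($utf8_bytes) = @_;
    my $chars = decode("UTF-8", $utf8_bytes);
    # FB_HTMLCREF tells Encode to emit &#NNNN; instead of failing.
    return encode("ISO-8859-1", $chars, Encode::FB_HTMLCREF);
}

# e-acute maps to a single Latin-1 byte; the euro sign becomes &#8364;
print utf8_to_latin1_html("caf\xC3\xA9 costs \xE2\x82\xAC5"), "\n";
```

Note that this only covers characters missing from the target set. It
does not escape <, > or &, so you still need escape_html() or similar
for those.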

Also see http://rf.net/~james/perli18n.html for a Perl i18n FAQ.

--
Paul Lindner [EMAIL PROTECTED] ||||| | | | | | | | | |
mod_perl Developer's Cookbook http://www.modperlcookbook.org/
Human Rights Declaration http://www.unhchr.ch/udhr/