Now for a much less pressing issue: Does anybody know of something
similar to the HTML::FormatText module that can take utf-8 input, and
generate utf-8 output?

Doubt it. But if you run it on Unicode characters (as indicated above), then unless it is doing something too clever, it should just work.

Could it be that the problem is with HTML::TreeBuilder (which is required for pre-processing by HTML::FormatText)? Does anybody know if this module has issues with Unicode?

These are the crucial parts of my script:

***************************************************************************************

use Encode;   # provides find_encoding(), used below
use HTML::TreeBuilder;
use HTML::FormatText;
binmode STDOUT, ":utf8";

# now I get html files from the web, and I guess their charset
# by looking at the html code

# ...

# the $charset variable contains the charset of the page, and
# $text contains the whole page

# now I convert to unicode:

my $encoding = find_encoding($charset);
my $unicode = $encoding->decode($text);

# the previous operation works fine, in the sense that if at this point
# I print $unicode, I do get unicode as output

# now, I would like to get rid of the html code and format the
# page as text

# however, after the following steps, $cleaned_text contains character salad

my $tree = HTML::TreeBuilder->new_from_content($unicode);

my $formatter = HTML::FormatText->new();
my $cleaned_text = $formatter->format($tree);

***************************************************************************************
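
For anyone who wants to reproduce this, here is a minimal self-contained sketch of the same steps. The sample page, the Latin-1 charset, and the margin settings are made up for illustration, since the real script fetches pages from the web and guesses their charset. It is only meant to isolate the decode, parse, and format steps, not to show a confirmed fix:

```perl
use strict;
use warnings;
use Encode qw(find_encoding);
use HTML::TreeBuilder;
use HTML::FormatText;

# Encode characters as UTF-8 when they are printed.
binmode STDOUT, ':encoding(UTF-8)';

# Hypothetical stand-ins for the downloaded page and its guessed charset.
my $charset = 'iso-8859-1';
my $text    = "<html><body><p>Universit\xE0 di Bologna</p></body></html>";

# Bytes -> Perl character string.
my $encoding = find_encoding($charset)
    or die "Unknown charset: $charset\n";
my $unicode = $encoding->decode($text);

# Parse the decoded characters (not the raw bytes) and format as plain text.
my $tree         = HTML::TreeBuilder->new_from_content($unicode);
my $formatter    = HTML::FormatText->new(leftmargin => 0, rightmargin => 72);
my $cleaned_text = $formatter->format($tree);
$tree->delete;    # free the parse tree explicitly

print $cleaned_text;
```

If a small test like this prints readable text but the real script does not, the next thing I would compare is what $unicode holds just before it is passed to new_from_content.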


Any advice?

Thanks again!

Marco


---
Marco Baroni
University of Bologna
http://sslmit.unibo.it/~baroni


