Marco Baroni <[EMAIL PROTECTED]> writes:
>A few days ago, I queried this list about my problems with a script
>that finds the charset of Japanese web pages and translates their text
>into utf-8.
>
>The following solution, proposed by Nick Ing-Simmons, worked for my
>case:
>
>> binmode STDOUT,":utf8";
>> my $encoding = find_encoding($charset);
>> my $unicode = $encoding->decode($text);

Run HTML::FormatText here, on the decoded Unicode characters.

>> print $unicode;
>>
>($charset is the charset as extracted from the html code of the page
>and $text is all the text from the page itself, as returned by the LWP
>agent.)
>
>Thanks a lot to Nick and to all the others who responded to my plea for
>help.
>
>Now for a much less pressing issue: Does anybody know of something
>similar to the HTML::FormatText module that can take utf-8 input, and
>generate utf-8 output?

I doubt it. But if you run it on Unicode characters (as indicated
above), then unless it is doing something too clever, it should just
work.

>In other words, of a module or command line tool
>to which I could feed my Japanese html pages, or html documents in
>other non-Latin alphabets, and get nicely formatted plain utf-8 text as
>output?
>(HTML::FormatText seems to break with utf-8 and with the Japanese
>encodings.)
>
>Thanks in advance.
>
>Regards,
>
>Marco
>
>
>---
>Marco Baroni
>University of Bologna
>http://sslmit.unibo.it/~baroni
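
To make the ordering suggested above concrete, here is a minimal sketch of "decode first, format the characters, encode only on output". It uses only the core Encode module. The helper name bytes_to_formatted_chars, the sample Shift_JIS input, and the tag-stripping code ref are all assumptions made for illustration. The code ref merely stands in for the formatting step; in the setup discussed in this thread, that is the point where HTML::FormatText would presumably be run on the decoded text.

```perl
use strict;
use warnings;
use Encode qw(find_encoding encode);

# Hypothetical helper (not from the thread): decode raw bytes in $charset
# into Perl characters, then hand those characters to a formatter code ref.
sub bytes_to_formatted_chars {
    my ($bytes, $charset, $format) = @_;
    my $encoding = find_encoding($charset)
        or die "Unknown charset '$charset'\n";
    my $chars = $encoding->decode($bytes);    # bytes -> Unicode characters
    return $format->($chars);                 # formatter sees characters only
}

# Stand-in for what the LWP agent returned: three kanji inside a <p>
# element, encoded as Shift_JIS bytes (sample data, assumed).
my $charset = 'shiftjis';
my $text    = encode($charset, "<p>\x{65E5}\x{672C}\x{8A9E}</p>");

# Stand-in formatter that only strips tags. A real formatter such as
# HTML::FormatText would take its place at this step.
my $strip_tags = sub {
    my ($chars) = @_;
    $chars =~ s/<[^>]*>//g;
    return $chars;
};

my $plain = bytes_to_formatted_chars($text, $charset, $strip_tags);

# Encode to UTF-8 only at the output boundary.
binmode STDOUT, ':utf8';
print $plain, "\n";
```

The point of the design is that the formatter never sees encoded bytes, so it does not need to know anything about Shift_JIS, EUC-JP, or UTF-8. Whether HTML::FormatText itself copes with wide characters is not confirmed here; as noted above, it should work unless it does something too clever.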