Re: utf8, japanese, web-pages: beginning to see the light...

Nick Ing-Simmons Tue, 18 May 2004 02:10:05 -0700

Marco Baroni <[EMAIL PROTECTED]> writes:
>A few days ago, I queried this list about my problems with a script 
>that finds the charset of Japanese web pages and translates their text 
>into utf-8.
>
>The following solution, proposed by Nick Ing-Simmons, worked for my 
>case:
>
>>    binmode STDOUT,":utf8";
>>    my $encoding = find_encoding($charset);
>>    my $unicode = $encoding->decode($text);


      Run HTML::FormatText here, on the decoded Unicode characters.

>>    print $unicode;
>>
>($charset is the charset as extracted from the html code of the page 
>and $text is all the text from the page itself, as returned by the LWP 
>agent.)
>
>Thanks a lot to Nick and to all the others who responded to my plea for 
>help.
>
>Now for a much less pressing issue: Does anybody know of something 
>similar to the HTML::FormatText module that can take utf-8 input, and 
>generate utf-8 output? 

I doubt it. But if you run it on Unicode characters (as indicated above),
then unless it is doing something too clever it should just work.
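
A minimal sketch of that pipeline, for illustration only: decode the
raw bytes to Perl characters, format the parsed tree, and let the
:utf8 layer encode on output. The helper name html_to_text, the
sample page, and the margin settings are assumptions made for this
example, and whether HTML::FormatText wraps Japanese text sensibly
is not something this sketch establishes.

```perl
use strict;
use warnings;
use Encode qw(find_encoding);
use HTML::TreeBuilder;
use HTML::FormatText;

# Hypothetical helper: $charset is the charset named by the page,
# $html_bytes is the undecoded content as fetched (e.g. by LWP).
sub html_to_text {
    my ($charset, $html_bytes) = @_;

    my $encoding = find_encoding($charset)
        or die "Unknown charset: $charset\n";

    # Bytes -> Perl characters; everything below works on characters.
    my $unicode = $encoding->decode($html_bytes);

    my $tree = HTML::TreeBuilder->new;
    $tree->parse($unicode);
    $tree->eof;

    # Margins are arbitrary choices for this sketch.
    my $formatter = HTML::FormatText->new(leftmargin => 0, rightmargin => 72);
    my $text = $formatter->format($tree);

    $tree->delete;    # release the parse tree
    return $text;
}

# Sample input: a tiny EUC-JP page built from escaped code points.
my $sample_html  = "<html><body><p>\x{65E5}\x{672C}\x{8A9E}</p></body></html>";
my $sample_bytes = find_encoding("euc-jp")->encode($sample_html);

binmode STDOUT, ":utf8";    # characters -> UTF-8 on output
print html_to_text("euc-jp", $sample_bytes);
```

The point of the arrangement is that the formatter never sees encoded
bytes: decoding happens once on the way in, encoding once on the way
out.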

>In other words, of a module or command line tool 
>to which I could feed my Japanese html pages, or html documents in 
>other non-Latin alphabets, and get nicely formatted plain utf-8 text as 
>output?
>(HTML::FormatText seems to break with utf-8 and with the Japanese 
>encodings.)
>
>Thanks in advance.
>
>Regards,
>
>Marco
>
>
>---
>Marco Baroni
>University of Bologna
>http://sslmit.unibo.it/~baroni
