Dear perl-unicode gurus,

I've been struggling with this for days, but I still can't see the light at the end of the byte/character tunnel... Any advice will be greatly appreciated.
I have a script that takes a list of urls, retrieves the corresponding web pages and prints their contents. I would like to adapt it to Japanese. Input is in utf8; output can be in utf8 or any other encoding, as long as I know which encoding it is.

Of course, the Japanese web pages can be in one of many encodings, so I thought I would extract the charset from the html (/charset=([^\"]+)\"/) and recode everything to utf8 using from_to.

Here is a simplified version of the script with what I believe to be the relevant parts:

******************************************************
#!/usr/bin/perl
# print_pages_from_url_list.jp.pl

use strict;
use warnings;
use LWP;
use Encode 'from_to';

# output will be utf8
binmode(STDOUT, ":utf8");

my $browser;
my $html_text;
my $ifile = shift;

# input will be utf8
open my $in, "<:encoding(utf8)", $ifile or die;

while (<$in>) {
    # input file is in one-url-per-line format

    # just in case there is some ftp left in the list
    if ($_ !~ /^http/) { next; }

    my ($url) = $_;
    chomp $url;

    # avoid ps, pdf, word and the like
    if ($url !~ /\.(ps)|(gz)|(pdf)|(gif)|(jpg)|(jpeg)|(doc)|(xls)|(ppt)|(rtf)$/i) {
        # retrieve web pages
        if ($html_text = do_GET($url)) {
            if ($html_text =~ /charset=([^\"]+)\"/) {
                # find out charset and convert to utf8
                my $charset = $1;
                from_to($html_text, $charset, "utf8");
                # here, I used to send $html_text to HTML::TreeBuilder;
                # for debugging purposes now I'm just stripping off
                # tags
                $html_text =~ s/<[^>]*>//g;
                print "CURRENT URL $url\n$html_text\n";
            }
            else {
                # charset was not specified, we better leave the
                # page alone
                next;
            }
        }
    }
}

sub do_GET {
    # this is taken from the perl & lwp book (but I changed it a bit)
    $browser = LWP::UserAgent->new() unless $browser;
    $browser->timeout(10);
    $browser->env_proxy();
    my $response;
    eval { $response = $browser->get(@_); };
    if ($@) {
        print STDERR "something went wrong: $@";
        return;
    }
    return unless $response->is_success;
    return $response->content;
}
******************************************************

In this version, it runs and it doesn't complain, but the output doesn't look like utf8. For example, I cannot view it with more (which on the computer I'm using works fine with other utf8 files), and if I try the following:

$ recode utf8..euc-jp <newpages

I immediately get the error:

Invalid input in step `UTF-8..EUC-JP'

What am I doing wrong? Why do encodings always cause so much pain?

Arigato!!!!

Marco

--
Marco Baroni
SSLMIT, University of Bologna
http://sslmit.unibo.it/~baroni