On Wed, Oct 14, 2009 at 6:00 PM, Oliver Block <li...@oliver-block.eu> wrote:
>
> my $ua = LWP::UserAgent->new;
> my $response = $ua->get($form->{'url'});
>
> my $tree = HTML::TreeBuilder->new();
> $tree->parse($response->content);
>
> # ...
> # encoding of content of $tree is ISO-8859-1 at this point
> $template = $tree->as_HTML('<>&');
>
> # encoding of content of $template is UTF-8
>
> Now the following problem arises. The encoding of the content of
> $template (UTF-8) is not the same than the content of $tree
> (ISO-8859-1). So it is obvious, that as_HTML converts the encoding to
> UTF-8.
>

I'm not really sure what the problem is, sorry. But the terminology
above seems a bit off. UTF-8 and ISO-8859-1 are encodings (encoded
octets), not characters. Characters are an abstraction. You should use
characters inside Perl and encoded octets outside. (Ignore the fact
that Perl's internal encoding is UTF-8 and just pretend they are
character abstractions.)

So, in general, I would bring character data into Perl like:

    my $characters = $response->decoded_content;

Then you work with $characters as needed. And when you want to
output, you convert back to whatever encoding you need:

    use Encode qw(encode_utf8);

    $utf8_octets = encode_utf8( $characters );
    send_to_client( $utf8_octets );

For your case you might try:

    $tree->parse( $response->decoded_content );

Or, if you have raw UTF-8 octets that you need to parse, I think you
can call $tree->utf8_mode( 1 ) to tell the parser to decode. But I'd
prefer the first.

(One thing I'm not clear on is when or if the parsers detect the
encoding by looking for a charset in the content. XML::LibXML will use
the <?xml encoding= declaration from the content, for example. But I'm
not clear whether HTML::Parser will look at an encoding set in a
<meta> tag.)

-- 
Bill Moseley
mose...@hank.org
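
P.S. To make the "characters inside, octets outside" idea concrete, here is a minimal sketch that uses only the core Encode module. The sample string and the variable names are made up for illustration; they are not from your code, and no network fetch or HTML parsing is involved.

```perl
use strict;
use warnings;
use Encode qw(decode encode_utf8);

# Hypothetical input: the ISO-8859-1 octets for "café"
# (é is the single byte 0xE9 in that encoding).
my $latin1_octets = "caf\xE9";

# Decode at the input boundary: octets in, Perl characters out.
my $characters = decode('ISO-8859-1', $latin1_octets);

# Inside Perl, length counts characters, not bytes.
my $char_count = length $characters;

# Encode at the output boundary: characters in, UTF-8 octets out.
my $utf8_octets = encode_utf8($characters);
my $byte_count  = length $utf8_octets;

printf "%d characters, %d UTF-8 octets\n", $char_count, $byte_count;
```

The character count and the octet count differ because é takes two octets in UTF-8. That is the same kind of difference you would see between a decoded string and its encoded output.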