On Wed, Oct 14, 2009 at 6:00 PM, Oliver Block <li...@oliver-block.eu> wrote:

>
>        my $ua = LWP::UserAgent->new;
>        my $response = $ua->get($form->{'url'});
>
>        my $tree = HTML::TreeBuilder->new();
>        $tree->parse($response->content);
>
> # ...
> # encoding of content of $tree is ISO-8859-1 at this point
>        $template = $tree->as_HTML('<>&');
>
> # encoding of content of $template is UTF-8
>
> Now the following problem arises. The encoding of the content of
> $template (UTF-8) is not the same than the content of $tree
> (ISO-8859-1). So it is obvious, that as_HTML converts the encoding to
> UTF-8.
>

I'm not really sure what the problem is, sorry.  But, the terminology above
seems a bit off.

UTF-8 and ISO-8859-1 are encodings (encoded octets), not characters.
Characters are an abstraction.  You should use characters inside Perl and
encoded octets outside.  (Ignore the fact that Perl's internal encoding is
UTF-8 and just pretend they are character abstractions.)

So, in general, I would bring character data into Perl like:

my $characters = $response->decoded_content;

Then you work with $characters as needed.

And then when you want to output you convert back to whatever encoding you
need:

my $utf8_octets = encode_utf8( $characters );  # encode_utf8 is exported by Encode

send_to_client( $utf8_octets );
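
To make the decode/work/encode pattern concrete, here is a minimal sketch
using only the core Encode module.  The input string is a made-up stand-in
for ISO-8859-1 octets read from outside the program; it is not taken from
the original code.

```perl
use strict;
use warnings;
use Encode qw(decode encode_utf8);

# Hypothetical input: the octets for "café" in ISO-8859-1 (é is the single byte 0xE9).
my $latin1_octets = "caf\xE9";

# Decode at the boundary: octets in, characters out.
my $characters = decode('ISO-8859-1', $latin1_octets);

# Inside the program, work with characters.
my $char_count = length $characters;

# Encode at the boundary: characters in, octets out.
my $utf8_octets = encode_utf8($characters);
my $octet_count = length $utf8_octets;   # é is two octets in UTF-8 (0xC3 0xA9)

printf "%d characters, %d UTF-8 octets\n", $char_count, $octet_count;
```

The character count and the octet count differ because length() counts
characters on the decoded string and octets on the encoded one.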

For your case you might try $tree->parse( $response->decoded_content ).  Or,
if you have raw UTF-8 octets that you need to parse, I think you can call
$tree->utf8_mode( 1 ) to tell the parser to decode.  But I'd prefer the
first.
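
A rough sketch of the first option, assuming HTML::TreeBuilder is installed.
The literal HTML string and the decode() call are stand-ins for
$response->content and $response->decoded_content, so the example runs
without a network request.

```perl
use strict;
use warnings;
use Encode qw(decode encode_utf8);
use HTML::TreeBuilder;

# Stand-in for $response->content: ISO-8859-1 octets (é is 0xE9).
my $octets = "<html><body><p>caf\xE9</p></body></html>";

# Stand-in for $response->decoded_content: a character string.
my $characters = decode('ISO-8859-1', $octets);

# Feed characters, not octets, to the parser.
my $tree = HTML::TreeBuilder->new;
$tree->parse($characters);
$tree->eof;

# as_HTML returns a character string; only <, > and & become entities.
my $template = $tree->as_HTML('<>&');
$tree->delete;

# Pick the output encoding explicitly when the data leaves the program.
my $utf8_octets = encode_utf8($template);
```

With this arrangement the tree and $template both hold characters, and an
encoding only comes into play at the final encode step.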

(One thing I'm not clear on is when or if the parsers detect encoding by
looking for a charset in the content.  XML::LibXML will use the <?xml
encoding= from the content, for example.  But I'm not clear if the
HTML::Parser will look at an encoding set in a <meta> tag.)

-- 
Bill Moseley
mose...@hank.org
