Bill Moseley wrote:
> So, in general, I would bring character data into Perl like:
>
> my $characters = $response->decoded_content;
>
> Then you work with $characters as needed.
>
> And then when you want to output you convert back to whatever encoding
> you need:
>
> $utf8_octets = encode_utf8( $characters );
>
> send_to_client( $utf8_octets );
>
> For your case you might try $tree->parse( $response->decoded_content
> );  Or, if you have raw utf-8 octets that you need to parse I think
> you can call $tree->utf8_mode( 1 ) to tell the parser to decode.  But,
> I'd prefer the first.
>
That seems to be a good idea. I only have to make some modifications,
because incoming documents do not always use the same encoding. It can
be latin1, utf-8 or something else, and the people who create the web
pages are not always precise about it. That is why HTML::Parser is such
a good choice in these cases: it is tolerant.


I thought that not touching the encoding would be the best idea, but
decoding seems to be better once characters with code points higher
than 255 are involved. It might also be a good idea to use
$response->decoded_content and encode the content again later, at least
if $response always provides a ->content_charset.
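
A minimal sketch of that decode-then-re-encode round trip, using only
the core Encode module. The helper name octets_to_characters and the
ISO-8859-1 fallback are assumptions made for illustration; with LWP the
octets and the charset would presumably come from $response->content
and $response->content_charset instead of the literals used here.

```perl
use strict;
use warnings;
use Encode qw(decode encode find_encoding);

# Hypothetical helper: turn raw octets into Perl characters.
# Falls back to ISO-8859-1 (an assumed default) when the declared
# charset is missing or unknown to Encode.
sub octets_to_characters {
    my ($octets, $charset) = @_;
    my $name = (defined $charset && find_encoding($charset))
        ? $charset
        : 'iso-8859-1';
    return decode($name, $octets);
}

# The same word, "café", as it arrives in two different encodings.
my $latin1_octets = "caf\xE9";
my $utf8_octets   = "caf\xC3\xA9";

my $from_latin1  = octets_to_characters($latin1_octets, 'iso-8859-1');
my $from_utf8    = octets_to_characters($utf8_octets,   'UTF-8');
my $from_unknown = octets_to_characters($latin1_octets, undef);

# Work with characters internally, then encode once on output.
my $out = encode('UTF-8', $from_latin1);
```

After decoding, all three variables hold the same four-character
string regardless of the input encoding, and $out holds its UTF-8
octets. The fallback is a guess: octets in some other single-byte
encoding would be decoded without error but to the wrong characters.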

Thank you.

Best regards,

Oliver Block
