(You will find the Perl code at the end.)

A close look at the dump of $tree and a comparison with
$response->content showed the following:

The following markup from $response->content

<td colspan="8" align="left" bgcolor="#FFFFFF" class="Rubrik">&raquo;
Kontakt &nbsp;&rsaquo; Kontaktformular</td>

appears in $tree as

bless( {
'_parent' => $VAR1->{'_content'}[1]{'_content'}[0]{'_content'}[1]{'_content'}[5],
'_content' => [
                         "\x{bb} Kontakt \x{a0}\x{203a} Kontaktformular"
                       ],
'colspan' => '8',
'align' => 'left',
'bgcolor' => '#FFFFFF',
'_tag' => 'td',
'class' => 'Rubrik'
}, 'HTML::Element' )

Last but not least, a snippet of the hexdump of the HTML markup
($response->content) above:

00001f10  09 3c 74 64 20 63 6f 6c  73 70 61 6e 3d 22 38 22  |.<td colspan="8"|
00001f20  20 61 6c 69 67 6e 3d 22  6c 65 66 74 22 20 62 67  | align="left" bg|
00001f30  63 6f 6c 6f 72 3d 22 23  46 46 46 46 46 46 22 20  |color="#FFFFFF" |
00001f40  63 6c 61 73 73 3d 22 52  75 62 72 69 6b 22 3e 26  |class="Rubrik">&|
00001f50  72 61 71 75 6f 3b 20 4b  6f 6e 74 61 6b 74 20 26  |raquo; Kontakt &|
00001f60  6e 62 73 70 3b 26 72 73  61 71 75 6f 3b 20 4b 6f  |nbsp;&rsaquo; Ko|
00001f70  6e 74 61 6b 74 66 6f 72  6d 75 6c 61 72 3c 2f 74  |ntaktformular</t|
00001f80  64 3e                                             |d>|

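I still do not understand why this happens, but join is certainly not
the cause: the dump shows that the entities are already decoded into
characters (\x{bb}, \x{a0}, \x{203a}) inside the tree, before as_HTML
is called.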
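
The observation can be reproduced without the tree. The following is a
minimal sketch using only the core Encode module; the @html list is a
hypothetical stand-in for the pieces that as_HTML joins, and the text
node is copied from the dump above. It only shows that join leaves the
characters untouched and that the string contains a character above
0xFF, which may be why the saved output looks like UTF-8.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Text node exactly as it appears in the dump of $tree.
my $text = "\x{bb} Kontakt \x{a0}\x{203a} Kontaktformular";

# Hypothetical stand-in for the @html list that as_HTML joins.
my @html   = ('<td class="Rubrik">', $text, '</td>');
my $joined = join('', @html, "\n");

# join only concatenates: the text node is still present unchanged.
my $unchanged = index($joined, $text) >= 0 ? 1 : 0;

# U+203A is above 0xFF, so there is no one-byte-per-character form.
my $has_wide = $joined =~ /[^\x00-\xff]/ ? 1 : 0;

# Octets that result when the string is written out as UTF-8.
my $bytes = encode('UTF-8', $joined);

printf "unchanged by join: %d\n", $unchanged;
printf "contains wide characters: %d\n", $has_wide;
printf "%d characters, %d bytes as UTF-8\n",
    length($joined), length($bytes);
```

The character count and the byte count differ because \x{bb}, \x{a0}
and \x{203a} each need more than one byte in UTF-8.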

If you have any idea how to avoid the conversion to UTF-8 and how to
ensure that the output of $tree->as_HTML() can be saved in the same
encoding as stated in $response, please let me know.
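
One approach that might work is to encode the string explicitly before
saving it. This is a sketch under assumptions, not a confirmed fix:
$template is a stand-in for the value returned by as_HTML, and $charset
is hardcoded here, whereas in the real script it would have to be taken
from $response. Encode::FB_HTMLCREF asks the core Encode module to write
characters that do not exist in the target charset as decimal character
references.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Stand-in for the string returned by $tree->as_HTML('<>&').
my $template = "<td>\x{bb} Kontakt \x{a0}\x{203a} Kontaktformular</td>\n";

# Assumption: the charset announced by the server; hardcoded for the sketch.
my $charset = 'ISO-8859-1';

# Encode a copy so the original string is left alone. Characters missing
# from the target charset become decimal character references.
my $copy   = $template;
my $octets = encode($charset, $copy, Encode::FB_HTMLCREF);

# Write the octets as they are, without any encoding layer.
binmode(STDOUT);
print $octets;
```

With this input, \x{bb} and \x{a0} become single ISO-8859-1 bytes and
\x{203a}, which ISO-8859-1 cannot represent, is written as &#8250;.
Another option may be to call as_HTML without the '<>&' argument; as far
as I understand the HTML::Element documentation, it then encodes all
unsafe characters as entities, so the result would be plain ASCII.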

Best Regards,

Oliver Block


Oliver Block wrote:
> Hello everyone,
>
> the following code is used to load a web page from a certain web server
> and parse it into an html tree. At the end a variable is assigned the
> string representation of that tree.
>
>         use LWP::UserAgent;
>         use HTML::TreeBuilder;
>
>         my $ua = LWP::UserAgent->new;
>         my $response = $ua->get($form->{'url'});
>
>         my $tree = HTML::TreeBuilder->new();
>         $tree->parse($response->content);
>
> # ...
> # encoding of content of $tree is ISO-8859-1 at this point
>         $template = $tree->as_HTML('<>&');
>
> # encoding of content of $template is UTF-8
>
> Now the following problem arises. The encoding of the content of
> $template (UTF-8) is not the same as that of the content of $tree
> (ISO-8859-1). So it seems obvious that as_HTML converts the encoding
> to UTF-8.
>
> I debugged everything and everything is fine up to the last line of code of
> sub HTML::Element::as_HTML, which is:
>
>   return join('', @html, "\n");
>
> This would mean that join seems to modify the encoding of the content.
>
> Any suggestions?
>
>
> Best Regards,
>
> Oliver Block
>
>
>   
