I'm trying to learn from this, too.  Please (anyone) correct me if I'm wrong
below.

On Thu, Oct 15, 2009 at 11:18 AM, Oliver Block <li...@oliver-block.eu> wrote:

>
> I think I've found out what causes the problem. As I mentioned earlier
> the content of a td tag in my case "&raquo; Kontakt &nbsp;&rsaquo;
> Kontaktformular" will be represented by the following ...  characters
> (?) "\x{bb} Kontakt \x{a0}\x{203a} Kontaktformular" and the reason seems
> to be that there is nothing like a character representation in the
> ISO-8859-1 encoding. The codepoint (for &rsaquo;) is U+203A or &#8250;
> This seems to be a legal character in ISO-8859-1-encoded html documents
> when it appears in the form of a character entity reference.
>

Well, I think you are mixing a couple of things up there.  But it's probably
mostly a matter of terminology.

The 8 letters and symbols that make up "&rsaquo;" are all valid ISO-8859-1
characters.  The character that the entity represents (U+203A) is not an
ISO-8859-1 character.  One point of the entity is to allow the browser to
render characters that are not in the encoding used to transmit the document
from the server to the browser.
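
Here is a rough sketch of that difference, using only the core Encode module.
The variable names are mine and purely illustrative; LEAVE_SRC is there so
that encode() does not modify the strings it is given.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Die on unmappable characters, but leave the source string untouched.
my $check = Encode::FB_CROAK | Encode::LEAVE_SRC;

# The entity reference is eight ASCII characters, so it encodes cleanly.
my $entity_octets = encode('iso-8859-1', '&rsaquo;', $check);
print length($entity_octets), " octets for the entity reference\n";

# The character the entity stands for (U+203A) has no Latin-1 mapping.
my $char = "\x{203a}";
my $encoded_ok = eval { encode('iso-8859-1', $char, $check); 1 };
print $encoded_ok
    ? "U+203A encoded as ISO-8859-1\n"
    : "U+203A cannot be encoded as ISO-8859-1\n";
```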

What I think is happening in your case is that when the parser decodes the
*entities*, some of them end up as wide characters (code points above 255),
so Perl has to upgrade the whole string -- that is, it sets the utf8 flag on
the data so that Perl can represent the *character*.
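
You can watch that happen with a small sketch like the one below.  It only
peeks at the flag with utf8::is_utf8 to illustrate the point; the flag is an
internal detail, and normal code should not need to test it.

```perl
use strict;
use warnings;

# Every code point here is below 256, so Perl can store one byte per character.
my $text = "\x{bb} Kontakt";
my $flag_before = utf8::is_utf8($text) ? 1 : 0;

# Appending U+203A forces the whole string into Perl's internal UTF-8 form.
$text .= " \x{203a} Kontaktformular";
my $flag_after = utf8::is_utf8($text) ? 1 : 0;

print "utf8 flag before: $flag_before, after: $flag_after\n";
```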

Now, this won't happen if you don't have entities (well, entities that
represent wide characters).  If, for example, you have just utf8 characters
in the web page you are parsing and don't decode it (which I consider a
programming error) then you won't end up with the utf8 flag on. That is, you
have octets instead of characters inside Perl.

And w/o the utf8 flag set you won't get "wide character in print" errors,
either, so you don't even know you are doing it wrong. ;)
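
To make the octets-versus-characters point concrete, here is a minimal sketch
with a made-up input string (the UTF-8 bytes for U+203A followed by some
ASCII).  Nothing in it warns or fails, which is exactly why the missing
decode is easy to overlook.

```perl
use strict;
use warnings;
use Encode qw(decode_utf8);

# UTF-8 octets for U+203A (E2 80 BA), as they might arrive from the network.
my $octets = "\xe2\x80\xba Kontaktformular";

# Undecoded: Perl counts each octet as a character, and the utf8 flag is off.
my $octet_len = length($octets);

# Decoded: the three octets collapse into the single character U+203A.
my $chars    = decode_utf8($octets);
my $char_len = length($chars);

print "undecoded length: $octet_len, decoded length: $char_len\n";
printf "first decoded character: U+%04X\n", ord($chars);
```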

So, what I think you should do is:

use Encode qw(decode_utf8);
my $octets = $response->content;
$tree->parse( decode_utf8( $octets ) );  # assuming you know it's utf8

But, HTTP::Response determines how to decode for you, so do:

$tree->parse( $response->decoded_content );

Now you have characters inside Perl.  Entities no longer exist -- just
characters.


> So, changing the parameter for as_HTML from
>
> $tree->as_HTML('<>&');
>
>
> to
>
> $tree->as_HTML();
>

That depends on what you want to do.

If you are creating a web page AND you set the charset in the HTTP headers
to utf-8, then it's this:

print encode_utf8( $tree->as_HTML( '<>&' ) );

That says: generate the HTML, and convert <, > and & to entities inside text
elements.  There is no need to create entities for other characters (like
&rsaquo;), because utf-8 can represent them directly.

Now, if you for some reason really want to encode to latin1, then you are
right, you do this:

print encode( 'iso-8859-1', $tree->as_HTML, Encode::FB_CROAK );

The $tree->as_HTML will convert "unsafe" characters to entities that can be
represented in Latin1.
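
As an aside, Encode can do a similar substitution itself at the encoding
step.  This is a different technique from relying on as_HTML, shown only as a
sketch: the FB_HTMLCREF check value replaces anything Latin1 cannot represent
with a decimal character reference instead of dying.  The sample string is
the td content from your message.

```perl
use strict;
use warnings;
use Encode qw(encode);

# The decoded text from the td cell, as Perl characters.
my $chars = "\x{bb} Kontakt \x{a0}\x{203a} Kontaktformular";

# FB_HTMLCREF substitutes a decimal character reference for anything
# the target encoding cannot represent, instead of dying.
my $latin1 = encode('iso-8859-1', $chars, Encode::FB_HTMLCREF);

print $latin1 =~ /&#8250;/
    ? "U+203A was written as a character reference\n"
    : "no character reference emitted\n";
```

The characters that do exist in Latin1 (\x{bb} and \x{a0}) come out as single
octets; only U+203A is turned into a reference.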

But, I'd stick with encoding to utf-8.  Decode and encode character data at
the edges of your program.




-- 
Bill Moseley
mose...@hank.org
