Re: Unexpected behaviour with HTMLParser...

Diez B. Roggisch Tue, 09 Oct 2007 14:45:53 -0700

Just Another Victim of the Ambient Morality wrote:
>     HTMLParser is behaving in, what I find to be, strange ways and I would 
> like to better understand what it is doing and why.
> 
>     First, it doesn't appear to translate HTML escape characters.  I don't 
> know the actual terminology but things like &amp; don't get translated into 
> & as one would like.  Furthermore, not only does HTMLParser not translate it 
> properly, it seems to omit it altogether!  This prevents me from even doing 
> the translation myself, so I can't even work around the issue.
>     Why is it doing this?  Is there some mode I need to set?  Can anyone 
> else duplicate this behaviour?  Is it a bug?


Without code, that's hard to determine. But are you aware of, e.g.,

handle_entityref(name)
handle_charref(ref)

?
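
A minimal sketch of how those two callbacks could be used to do the translation yourself. It assumes Python 3, where the module is html.parser rather than HTMLParser, and it passes convert_charrefs=False so that the callbacks are actually invoked (newer Python 3 versions otherwise convert the references before handle_data() sees them). The names EntityAwareParser and extract_text are made up for illustration.

```python
from html.entities import name2codepoint
from html.parser import HTMLParser


class EntityAwareParser(HTMLParser):
    """Hypothetical parser that rebuilds text, translating references itself."""

    def __init__(self):
        # Assumption: Python 3. convert_charrefs=False makes the parser
        # report references via handle_entityref()/handle_charref().
        super().__init__(convert_charrefs=False)
        self.pieces = []

    def handle_data(self, data):
        # Plain text between tags and references.
        self.pieces.append(data)

    def handle_entityref(self, name):
        # Called with "amp" for "&amp;"; unknown names are kept verbatim.
        codepoint = name2codepoint.get(name)
        if codepoint is None:
            self.pieces.append("&%s;" % name)
        else:
            self.pieces.append(chr(codepoint))

    def handle_charref(self, name):
        # Called with "62" for "&#62;" and "x3C" for "&#x3C;".
        if name.lower().startswith("x"):
            self.pieces.append(chr(int(name[1:], 16)))
        else:
            self.pieces.append(chr(int(name)))


def extract_text(html):
    """Return the text content of html with references translated."""
    parser = EntityAwareParser()
    parser.feed(html)
    parser.close()
    return "".join(parser.pieces)


print(extract_text("<p>Tom &amp; Jerry &#62; &#x3C;</p>"))
```

If the references seem to vanish from your output, the likely cause is that these handlers are not overridden: the base class implementations do nothing, so the text never reaches handle_data().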

>     Secondly, HTMLParser often calls handle_data() consecutively, without 
> any calls to handle_starttag() in between.  I did not expect this.  In HTML, 
> you either have text or you have tags.  Why split up my text into successive 
> handle_data() calls?  This makes no sense to me.  At the very least, it does 
> this in response to text with &amp; like escape sequences (or whatever 
> they're called), so that it may successively avoid those translations.

That's the way XML/HTML parsing is defined - there is no guarantee that 
you get the text between two tags in one piece. If you need that, you can 
collect the snippets yourself and deliver them as a whole on the next 
end-tag.
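
A rough sketch of that collecting approach, again assuming Python 3's html.parser with convert_charrefs=False. The class name TextCollector and its texts attribute are hypothetical, and flushing on start tags as well as end tags is a design choice of this sketch, not something the parser requires.

```python
import html
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Hypothetical parser that joins consecutive text snippets into one run."""

    def __init__(self):
        # Assumption: Python 3; references arrive through their own callbacks.
        super().__init__(convert_charrefs=False)
        self._buffer = []   # snippets seen since the last tag
        self.texts = []     # completed text runs, one per tag boundary

    def _flush(self):
        # Deliver everything collected so far as a single string.
        if self._buffer:
            self.texts.append("".join(self._buffer))
            self._buffer = []

    def handle_data(self, data):
        self._buffer.append(data)

    def handle_entityref(self, name):
        # html.unescape() leaves unknown references unchanged.
        self._buffer.append(html.unescape("&%s;" % name))

    def handle_charref(self, name):
        self._buffer.append(html.unescape("&#%s;" % name))

    def handle_starttag(self, tag, attrs):
        self._flush()

    def handle_endtag(self, tag):
        self._flush()

    def close(self):
        super().close()
        self._flush()  # text after the last tag, if any


collector = TextCollector()
collector.feed("<p>Tom &amp; Jerry</p><b>x</b>")
collector.close()
print(collector.texts)
```

The buffer here grows only until the next tag, but nothing bounds the length of a text run, which is why the parser itself does not promise to deliver it in a single call.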


>     Again, why is it doing this?  Is there some mode I need to set?  Can 
> anyone else duplicate this behaviour?  Is it a bug?

No. It's the way it is, because guaranteeing a single handle_data() call 
per run of text would require buffering with unlimited capacity.

>     These are serious problems for me and I would greatly appreciate a 
> deeper understanding of these issues.

HTH, and read the docs.

Diez
-- 
http://mail.python.org/mailman/listinfo/python-list
