Just Another Victim of the Ambient Morality wrote:
> HTMLParser is behaving in, what I find to be, strange ways and I would
> like to better understand what it is doing and why.
>
> First, it doesn't appear to translate HTML escape characters. I don't
> know the actual terminology but things like &amp; don't get translated into
> & as one would like. Furthermore, not only does HTMLParser not translate it
> properly, it seems to omit it altogether! This prevents me from even doing
> the translation myself, so I can't even work around the issue.
> Why is it doing this? Is there some mode I need to set? Can anyone
> else duplicate this behaviour? Is it a bug?

Without code, that's hard to determine. But are you aware of e.g.

    handle_entityref(name)
    handle_charref(ref)

? If you don't override them, references such as &amp; are reported there
and never reach handle_data(), which would explain why they seem to vanish.

> Secondly, HTMLParser often calls handle_data() consecutively, without
> any calls to handle_starttag() in between. I did not expect this. In HTML,
> you either have text or you have tags. Why split up my text into successive
> handle_data() calls? This makes no sense to me. At the very least, it does
> this in response to text with &amp;-like escape sequences (or whatever
> they're called), so that it may successively avoid those translations.

That's the way XML/HTML parsing is defined - there is no guarantee that you
get the text as a whole. If you must, you can collect the snippets yourself
and deliver them as a whole on the next end tag.

> Again, why is it doing this? Is there some mode I need to set? Can
> anyone else duplicate this behaviour? Is it a bug?

No. It is the way it is because guaranteeing this property would require a
buffer of unlimited capacity.

> These are serious problems for me and I would greatly appreciate a
> deeper understanding of these issues.

HTH, and read the docs.

Diez
--
http://mail.python.org/mailman/listinfo/python-list
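
For illustration, here is a minimal sketch of the approach described above: decode the references yourself and collect the snippets until the next tag. It assumes Python 3's html.parser module with convert_charrefs=False, so that references still arrive through handle_entityref() and handle_charref() as discussed in this thread. The class name TextCollector and its attributes are made up for this example.

```python
from html.parser import HTMLParser
from html.entities import name2codepoint


class TextCollector(HTMLParser):
    """Hypothetical helper: joins text fragments and decodes references."""

    def __init__(self):
        # Assumption: convert_charrefs=False, so references are reported
        # via handle_entityref()/handle_charref() instead of handle_data().
        super().__init__(convert_charrefs=False)
        self._parts = []   # fragments seen since the last tag
        self.texts = []    # one complete string per run of text

    def handle_data(self, data):
        # May be called several times in a row for a single run of text.
        self._parts.append(data)

    def handle_entityref(self, name):
        # Named reference, e.g. "amp" for &amp;. Unknown names are kept verbatim.
        codepoint = name2codepoint.get(name)
        if codepoint is not None:
            self._parts.append(chr(codepoint))
        else:
            self._parts.append("&%s;" % name)

    def handle_charref(self, ref):
        # Numeric reference: decimal ("38") or hexadecimal ("x26").
        if ref[0] in "xX":
            self._parts.append(chr(int(ref[1:], 16)))
        else:
            self._parts.append(chr(int(ref)))

    def _flush(self):
        # Deliver the collected fragments as one string.
        if self._parts:
            self.texts.append("".join(self._parts))
            self._parts = []

    def handle_starttag(self, tag, attrs):
        self._flush()

    def handle_endtag(self, tag):
        self._flush()

    def close(self):
        super().close()
        self._flush()


parser = TextCollector()
parser.feed("<p>Tom &amp; Jerry &#38; friends</p><p>a &lt; b</p>")
parser.close()
print(parser.texts)  # ['Tom & Jerry & friends', 'a < b']
```

Flushing on every start and end tag keeps the buffer bounded by the longest run of text between two tags, which is the trade-off mentioned above: the parser cannot do this for you without buffering an arbitrary amount of input.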