John Salerno <[email protected]> writes:
> The Beautiful Soup 4 documentation was very clear, and BS4 itself is
> so simple and Pythonic. And best of all, since version 4 no longer
> does the parsing itself, you can choose your own parser, and it works
> with lxml, so I'll still be using lxml, but with a nice, clean overlay
> for navigating the tree structure.
I haven't used BS4 but have made good use of earlier versions. The main thing to understand is that an awful lot of HTML in the real world is malformed and will break an XML parser or anything else that expects syntactically valid HTML. People tend to write HTML that works well enough to render decently in browsers, whose parsers therefore have to be tolerant of even serious errors. Beautiful Soup also tries to make sense of crappy, malformed HTML. Partly as a result, it's dog slow compared to any serious XML parser. But it works very well if you don't mind the low speed.
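
To make the point about malformed HTML concrete, here is a rough sketch rather than anything definitive. The sample markup and the names BAD_HTML, parses_as_xml, LinkCollector and collect_links are made up for illustration. It uses only the standard library to show a strict XML parser rejecting sloppy markup that a tolerant HTML parser handles. The last part assumes the bs4 package may or may not be installed and is skipped if it isn't.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

try:
    from bs4 import BeautifulSoup  # optional third-party package
except ImportError:
    BeautifulSoup = None

# Made-up sample of typical sloppy markup: unclosed <p> and <b> tags
# and an unquoted attribute value.
BAD_HTML = (
    "<html><body><p>First<p>Second <b>bold"
    "<a href=x.html>link</a></body></html>"
)


def parses_as_xml(markup):
    """Return True if the strict XML parser accepts the markup."""
    try:
        ET.fromstring(markup)
        return True
    except ET.ParseError:
        return False


class LinkCollector(HTMLParser):
    """Tolerant stdlib HTML parser that records href values of <a> tags."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.links.extend(value for name, value in attrs if name == "href")


def collect_links(markup):
    """Feed markup to the tolerant parser and return the hrefs it saw."""
    parser = LinkCollector()
    parser.feed(markup)
    parser.close()
    return parser.links


if __name__ == "__main__":
    print("accepted as XML:", parses_as_xml(BAD_HTML))
    print("links via html.parser:", collect_links(BAD_HTML))
    if BeautifulSoup is not None:
        # BS4 takes the parser name as its second argument; "html.parser"
        # is the stdlib one, and "lxml" can be used if lxml is installed.
        soup = BeautifulSoup(BAD_HTML, "html.parser")
        print("links via BS4:", [a.get("href") for a in soup.find_all("a")])
```

Swapping "html.parser" for "lxml" in the BeautifulSoup call is the parser choice John mentions. How each parser repairs the broken tree can differ, so results on really mangled pages may not match between them.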
