Stefan Behnel <stefan...@behnel.de> writes: > Well, if multi-core performance is so important here, then there's a pretty > simple thing the OP can do: switch to lxml. > > http://blog.ianbicking.org/2008/03/30/python-html-parser-performance/
Well, lxml uses libxml2, a fast XML parser written in C, but AFAIK it only works on well-formed XML. The point of Beautiful Soup is that it works on all kinds of garbage hand-written legacy HTML, with mismatched tags and other sorts of errors. Beautiful Soup is slower for that reason: it is full of special cases and hacks, and it is written in Python.

Writing something that complex in C, to handle so much potentially malicious input, would be quite a lot of work, and it would be very difficult to ensure it was really safe. Look at the many browser vulnerabilities we've seen over the years due to that sort of problem, for example. But for web crawling, you really do need to handle messy and wrong HTML properly.

-- 
http://mail.python.org/mailman/listinfo/python-list
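To make the "garbage HTML" point concrete, here is a minimal sketch using Python's stdlib html.parser module (which tolerant parsers like Beautiful Soup can sit on top of). The broken markup and the TextExtractor class are illustrative assumptions, not anything from Beautiful Soup itself; real-world handling of bad HTML is far more elaborate than this:

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect text content even from HTML with mismatched tags."""
    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        # Called for each run of text between tags; the parser does not
        # raise on the mismatched </p> below, it just keeps going.
        self.parts.append(data)

# Legacy-style markup: <b> is never closed, </p> closes it implicitly,
# and there is a bare ampersand -- all common in hand-written HTML.
broken = "<p>Hello <b>world</p> & goodbye"
p = TextExtractor()
p.feed(broken)
p.close()  # flush any buffered data
print("".join(p.parts))
```

A strict XML parser would reject this input outright; the lenient parser recovers the text anyway, which is what you want when crawling real pages.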