On raw performance, lxml easily outperforms Beautiful Soup. For what it's worth, the code runs fine if you switch from urllib2 to urllib (the exceptions raised are different, obviously). I have no experience using urllib2 in a threaded environment, so I'm not sure why it breaks; urllib does OK, though.
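Roughly the shape of threaded fetch I have in mind, as a minimal sketch rather than a drop-in fix. It assumes Python 3, where urllib2 was folded into urllib.request; the names fetch, fetch_all and max_workers are made up for illustration and are not from robean's code.

```python
import urllib.error
import urllib.request
from concurrent.futures import ThreadPoolExecutor


def fetch(url, timeout=10):
    """Return the body of `url` as bytes, or None if the request fails."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.read()
    except (urllib.error.URLError, OSError, ValueError):
        # URLError/OSError: network or HTTP failure; ValueError: malformed URL.
        return None


def fetch_all(urls, fetcher=fetch, max_workers=8):
    """Fetch all URLs on a bounded pool of threads; return {url: body or None}."""
    urls = list(urls)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map yields results in the same order as the input URLs.
        bodies = list(pool.map(fetcher, urls))
    return dict(zip(urls, bodies))
```

Each worker only returns a value and never touches shared state, so no explicit locking is needed. Capping max_workers also keeps a run over several hundred pages from opening several hundred connections at once.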
- Shailen

On May 1, 9:29 am, Stefan Behnel <stefan...@behnel.de> wrote:
> robean wrote:
> > I am writing a program that involves visiting several hundred webpages
> > and extracting specific information from the contents. I've written a
> > modest 'test' example here that uses a multi-threaded approach to
> > reach the urls with urllib2. The actual program will involve fairly
> > elaborate scraping and parsing (I'm using Beautiful Soup for that)
>
> Try lxml.html instead. It often parses HTML pages better than BS, can parse
> directly from HTTP/FTP URLs, frees the GIL doing so, and is generally a lot
> faster and more memory friendly than the combination of urllib2 and BS,
> especially when threading is involved. It also supports CSS selectors for
> finding page content, so your "elaborate scraping" might actually turn out
> to be a lot simpler than you think.
>
> http://codespeak.net/lxml/
>
> These might be worth reading:
>
> http://blog.ianbicking.org/2008/12/10/lxml-an-underappreciated-web-sc...
> http://blog.ianbicking.org/2008/03/30/python-html-parser-performance/
>
> Stefan

--
http://mail.python.org/mailman/listinfo/python-list