Re: urllib2 and threading

shailen . tuli Fri, 01 May 2009 11:20:53 -0700

On performance, lxml easily outperforms Beautiful Soup.

For what it's worth, the code runs fine if you switch from urllib2 to
urllib (different exceptions are raised, obviously). I have no
experience using urllib2 in a threaded environment, so I'm not sure
why it breaks; urllib does OK, though.
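
A rough sketch of the threaded-fetch pattern under discussion, in case it
helps. It is written for Python 3, where urllib2's functionality lives in
urllib.request. The names fetch_url and fetch_all, the worker count, and the
timeout are my own assumptions, not robean's code. Each worker pulls URLs
from a shared queue and records either the page body or the exception it hit:

```python
import queue
import threading
import urllib.error
import urllib.request


def fetch_url(url, timeout=10):
    """Return the body of `url` as bytes (hypothetical helper)."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read()


def fetch_all(urls, fetch=fetch_url, num_workers=4):
    """Fetch every URL using a small pool of worker threads.

    Returns a dict mapping each URL to its body, or to the exception
    that was raised while fetching it.
    """
    tasks = queue.Queue()
    for url in urls:
        tasks.put(url)

    results = {}
    lock = threading.Lock()

    def worker():
        while True:
            try:
                url = tasks.get_nowait()
            except queue.Empty:
                return  # queue drained, this worker is done
            try:
                outcome = fetch(url)
            except (urllib.error.URLError, OSError, ValueError) as exc:
                outcome = exc  # keep the error instead of killing the thread
            with lock:
                results[url] = outcome

    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return results
```

Calling fetch_all(list_of_urls) returns once every worker has finished. The
fetch argument is there so the download step can be swapped out, for example
with a stub when testing without network access.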


- Shailen

On May 1, 9:29 am, Stefan Behnel <stefan...@behnel.de> wrote:
> robean wrote:
> > I am writing a program that involves visiting several hundred webpages
> > and extracting specific information from the contents. I've written a
> > modest 'test' example here that uses a multi-threaded approach to
> > reach the urls with urllib2. The actual program will involve fairly
> > elaborate scraping and parsing (I'm using Beautiful Soup for that)
>
> Try lxml.html instead. It often parses HTML pages better than BS, can parse
> directly from HTTP/FTP URLs, frees the GIL doing so, and is generally a lot
> faster and more memory friendly than the combination of urllib2 and BS,
> especially when threading is involved. It also supports CSS selectors for
> finding page content, so your "elaborate scraping" might actually turn out
> to be a lot simpler than you think.
>
> http://codespeak.net/lxml/
>
> These might be worth reading:
>
> http://blog.ianbicking.org/2008/12/10/lxml-an-underappreciated-web-sc...
> http://blog.ianbicking.org/2008/03/30/python-html-parser-performance/
>
> Stefan
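
To make Stefan's suggestion concrete, here is a minimal sketch of pulling
links out of a page with lxml.html. The sample markup, the "story" class, and
the function name extract_story_links are invented for illustration. It uses
an XPath query; the CSS selector form would be doc.cssselect("div.story > a"),
which as far as I know needs the separate cssselect package in current lxml
releases. For a live page, lxml.html.parse(url) can read from the URL
directly, as Stefan mentions.

```python
import lxml.html

# Made-up page content, used only for this example.
SAMPLE_PAGE = """
<html><body>
  <div class="story"><a href="/a">First headline</a></div>
  <div class="story"><a href="/b">Second headline</a></div>
  <div class="ad"><a href="/x">Buy now</a></div>
</body></html>
"""


def extract_story_links(html_text):
    """Return (link text, href) pairs for links inside div.story elements."""
    doc = lxml.html.fromstring(html_text)
    links = doc.xpath('//div[@class="story"]/a')
    return [(a.text_content().strip(), a.get("href")) for a in links]
```

extract_story_links(SAMPLE_PAGE) picks up the two "story" links and skips the
one inside the "ad" block.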

--
http://mail.python.org/mailman/listinfo/python-list
