In article ,
robean wrote:
>
>Here's the problem: the script simply crashes after getting a couple
>of urls and takes a long time to run (slower than a non-threaded
>version that I wrote and ran). Can anyone figure out what I am doing
>wrong? I am new to both threading and urllib2, so it's possi
> robean (R) wrote:
>R> def get_info_from_url(url):
>R>     """ A dummy version of the function simply visits urls and prints
>R>     the url of the page. """
>R>     try:
>R>         page = urllib2.urlopen(url)
>R>     except urllib2.URLError, e:
>R>         print " error ", e.reason
>R>     except urll
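One thing worth checking in the quoted function is the order of the except clauses. HTTPError is a subclass of URLError, so a URLError clause placed first also catches HTTP errors, and on older Python 2 releases an HTTPError may not carry a reason attribute, which would raise inside the thread. Whether that is what actually breaks the script is not confirmed anywhere in this thread. The sketch below is only an illustration, written against Python 3's urllib.request and urllib.error (the successors of urllib2); the opener and timeout parameters and the tuple return value are assumptions added so the function can be exercised without a network.

```python
import urllib.error
import urllib.request


def get_info_from_url(url, opener=urllib.request.urlopen, timeout=10):
    """Visit url; return (final_url, None) or (None, error_text).

    opener and timeout are hypothetical parameters, not part of the
    original post; they let a fake opener stand in for the network.
    """
    try:
        page = opener(url, timeout=timeout)
    except urllib.error.HTTPError as e:
        # Must come first: HTTPError is a subclass of URLError.
        return None, "http error %d" % e.code
    except urllib.error.URLError as e:
        return None, "url error %s" % (e.reason,)
    try:
        return page.geturl(), None
    finally:
        # Close the response so sockets are not left open across threads.
        page.close()
```

Returning the error text instead of printing it keeps the function free of shared output, so the caller can decide where (and under which lock) to report it.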
For better performance, lxml easily outperforms Beautiful Soup.
For what it's worth, the code runs fine if you switch from urllib2 to
urllib (different exceptions are raised, obviously). I have no
experience using urllib2 in a threaded environment, so I'm not sure
why it breaks; urllib does OK, tho
robean wrote:
> I am writing a program that involves visiting several hundred webpages
> and extracting specific information from the contents. I've written a
> modest 'test' example here that uses a multi-threaded approach to
> reach the urls with urllib2. The actual program will involve fairly
>
Thanks for your reply. Obviously you make several good points about
Beautiful Soup and Queue. But here's the problem: even if I do nothing
whatsoever with the threads beyond just visiting the urls with
urllib2, the program chokes. If I replace
    else:
        ulock.acquire()
        print page.geturl() #
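For reference, the Queue-plus-lock arrangement discussed in this exchange could look roughly like the sketch below. It is not the code from the original post: it is written for Python 3 (where the module is named queue), and fetch_all, its fetch argument, and num_workers are hypothetical names. Two choices are deliberate. The lock is taken with a with statement, so it is released even if the guarded statement raises, which a bare acquire() does not guarantee. Each worker also catches exceptions from the fetch so that one bad URL does not kill its thread.

```python
import queue
import threading


def fetch_all(urls, fetch, num_workers=4):
    """Run fetch(url) for every url on a small pool of threads.

    Returns a dict mapping each url to fetch's result, or to the
    exception it raised. All names here are illustrative only.
    """
    tasks = queue.Queue()
    results = {}
    lock = threading.Lock()  # guards results, as ulock guards print

    def worker():
        while True:
            url = tasks.get()
            if url is None:          # sentinel: no more work
                tasks.task_done()
                return
            try:
                value = fetch(url)
            except Exception as e:   # keep the thread alive on failure
                value = e
            with lock:               # released even if the body raises
                results[url] = value
            tasks.task_done()

    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for t in threads:
        t.start()
    for url in urls:
        tasks.put(url)
    for _ in threads:
        tasks.put(None)              # one sentinel per worker
    tasks.join()
    for t in threads:
        t.join()
    return results
```

Passing the fetch function in as an argument means the pool can be tried first with a trivial stand-in (one that just returns the URL, say) to see whether the threading itself is at fault before urllib2 is involved at all.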
robean writes:
> reach the urls with urllib2. The actual program will involve fairly
> elaborate scraping and parsing (I'm using Beautiful Soup for that) but
> the example shown here is simplified and just confirms the url of the
> site visited.
Keep in mind Beautiful Soup is pretty slow, so if y
I am writing a program that involves visiting several hundred webpages
and extracting specific information from the contents. I've written a
modest 'test' example here that uses a multi-threaded approach to
reach the urls with urllib2. The actual program will involve fairly
elaborate scraping and p