Hi guys,
I've got a little issue with my Scrapy project. Basically, I want to build a tool that iterates through a list of URLs (more than 30k) and runs some tests on each response. I use a structure close to the one shown in the docs <http://doc.scrapy.org/en/latest/topics/practices.html#running-multiple-spiders-in-the-same-process> for running multiple spiders in the same process:

from twisted.internet import reactor
from scrapy import signals
from scrapy.crawler import Crawler
from scrapy.utils.project import get_project_settings
# plus the import of PokeSpider from my project

def setup_crawler(link, merchant, dead_links):
    # One crawler/spider pair per URL, all attached to the same reactor.
    spider = PokeSpider(link, merchant, dead_links)
    settings = get_project_settings()
    crawler = Crawler(settings)
    crawler.signals.connect(incrementCount, signal=signals.spider_closed)
    crawler.configure()
    crawler.crawl(spider)
    crawler.start()

def incrementCount():
    # Called on each spider_closed; stop the reactor once all spiders are done.
    global count
    global links
    count += 1
    if count % 10 == 0:
        print(count)
    if count == len(links):
        reactor.stop()

# links, count and dead_links are defined earlier in launch.py
for link in links:
    link = link.replace('\n', '')
    link = link.replace('\"', '')
    elements = link.split(';')  # semicolon-separated fields; URL and merchant among them
    setup_crawler(elements[4], elements[2], dead_links)
reactor.run()

It works just like I want it to, but only when the list contains fewer than 1000 URLs; otherwise I get this error:

--- <exception caught here> ---
  File "/usr/local/lib/python2.7/dist-packages/Twisted-13.2.0-py2.7-linux-x86_64.egg/twisted/internet/base.py", line 824, in runUntilCurrent
  File "/usr/local/lib/python2.7/dist-packages/Twisted-13.2.0-py2.7-linux-x86_64.egg/twisted/internet/tcp.py", line 421, in resolveAddress
  File "/usr/local/lib/python2.7/dist-packages/Twisted-13.2.0-py2.7-linux-x86_64.egg/twisted/internet/base.py", line 569, in resolve
  File "/usr/local/lib/python2.7/dist-packages/Twisted-13.2.0-py2.7-linux-x86_64.egg/twisted/internet/base.py", line 270, in getHostByName
  File "/usr/local/lib/python2.7/dist-packages/Twisted-13.2.0-py2.7-linux-x86_64.egg/twisted/internet/base.py", line 989, in getThreadPool
  File "/usr/local/lib/python2.7/dist-packages/Twisted-13.2.0-py2.7-linux-x86_64.egg/twisted/internet/base.py", line 954, in _initThreadPool
exceptions.ImportError: cannot import name threadpool

Unhandled Error
Traceback (most recent call last):
  File "launch.py", line 54, in <module>
    reactor.run()
  File "/usr/local/lib/python2.7/dist-packages/Twisted-13.2.0-py2.7-linux-x86_64.egg/twisted/internet/_glibbase.py", line 301, in run
  File "/usr/local/lib/python2.7/dist-packages/Twisted-13.2.0-py2.7-linux-x86_64.egg/twisted/internet/_glibbase.py", line 333, in _simulate

I didn't use a single spider with a big start_urls list because I also need to pass some arguments to the spider in order to compare them with the response (see the simplified spider sketch in the PS below).

I hope I've been clear enough.

Thanks
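PS: for reference, here is a simplified sketch of what each spider looks like. Only the constructor arguments (link, merchant, dead_links) match my real PokeSpider; the class body and the check in parse() are just illustrative of the kind of comparison I make against the response:

from scrapy.spider import Spider

class PokeSpider(Spider):
    name = 'poke'

    def __init__(self, link, merchant, dead_links, *args, **kwargs):
        super(PokeSpider, self).__init__(*args, **kwargs)
        self.start_urls = [link]       # one URL per spider instance
        self.merchant = merchant       # value to compare against the response
        self.dead_links = dead_links   # shared list collecting failing URLs

    def parse(self, response):
        # Illustrative test: record the URL if the merchant name is not in the page.
        if self.merchant not in response.body:
            self.dead_links.append(response.url)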
