Hi guys,
I've got a little issue with my Scrapy project. Basically, I want to build a tool that iterates through a list of URLs (more than 30k) and runs some tests on each response. I use a structure close to the one shown in the docs <http://doc.scrapy.org/en/latest/topics/practices.html#running-multiple-spiders-in-the-same-process> for running multiple spiders in the same process:

from twisted.internet import reactor
from scrapy import signals
from scrapy.crawler import Crawler
from scrapy.utils.project import get_project_settings
# plus the import of PokeSpider from my project

def setup_crawler(link, merchant, dead_links):
    # One crawler/spider pair per URL, all attached to the same reactor.
    spider = PokeSpider(link, merchant, dead_links)
    settings = get_project_settings()
    crawler = Crawler(settings)
    crawler.signals.connect(incrementCount, signal=signals.spider_closed)
    crawler.configure()
    crawler.crawl(spider)
    crawler.start()

def incrementCount():
    # Called on each spider_closed; stop the reactor once all spiders are done.
    global count
    global links
    count += 1
    if count % 10 == 0:
        print(count)
    if count == len(links):
        reactor.stop()

# links, count and dead_links are defined earlier in launch.py
for link in links:
    link = link.replace('\n', '')
    link = link.replace('\"', '')
    elements = link.split(';')  # semicolon-separated fields; URL and merchant among them
    setup_crawler(elements[4], elements[2], dead_links)
reactor.run()

It works just like I want it to, but only when the list contains fewer than 1000 URLs; otherwise I get this error:

--- <exception caught here> ---
  File "/usr/local/lib/python2.7/dist-packages/Twisted-13.2.0-py2.7-linux-x86_64.egg/twisted/internet/base.py", line 824, in runUntilCurrent
  File "/usr/local/lib/python2.7/dist-packages/Twisted-13.2.0-py2.7-linux-x86_64.egg/twisted/internet/tcp.py", line 421, in resolveAddress
  File "/usr/local/lib/python2.7/dist-packages/Twisted-13.2.0-py2.7-linux-x86_64.egg/twisted/internet/base.py", line 569, in resolve
  File "/usr/local/lib/python2.7/dist-packages/Twisted-13.2.0-py2.7-linux-x86_64.egg/twisted/internet/base.py", line 270, in getHostByName
  File "/usr/local/lib/python2.7/dist-packages/Twisted-13.2.0-py2.7-linux-x86_64.egg/twisted/internet/base.py", line 989, in getThreadPool
  File "/usr/local/lib/python2.7/dist-packages/Twisted-13.2.0-py2.7-linux-x86_64.egg/twisted/internet/base.py", line 954, in _initThreadPool
exceptions.ImportError: cannot import name threadpool

Unhandled Error
Traceback (most recent call last):
  File "launch.py", line 54, in <module>
    reactor.run()
  File "/usr/local/lib/python2.7/dist-packages/Twisted-13.2.0-py2.7-linux-x86_64.egg/twisted/internet/_glibbase.py", line 301, in run
  File "/usr/local/lib/python2.7/dist-packages/Twisted-13.2.0-py2.7-linux-x86_64.egg/twisted/internet/_glibbase.py", line 333, in _simulate

I didn't use a single spider with a big start_urls list because I also need to pass some arguments to the spider in order to compare them with the response (see the simplified spider sketch in the PS below).

I hope I've been clear enough.

Thanks
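PS: for reference, here is a simplified sketch of what each spider looks like. Only the constructor arguments (link, merchant, dead_links) match my real PokeSpider; the class body and the check in parse() are just illustrative of the kind of comparison I make against the response:

from scrapy.spider import Spider

class PokeSpider(Spider):
    name = 'poke'

    def __init__(self, link, merchant, dead_links, *args, **kwargs):
        super(PokeSpider, self).__init__(*args, **kwargs)
        self.start_urls = [link]       # one URL per spider instance
        self.merchant = merchant       # value to compare against the response
        self.dead_links = dead_links   # shared list collecting failing URLs

    def parse(self, response):
        # Illustrative test: record the URL if the merchant name is not in the page.
        if self.merchant not in response.body:
            self.dead_links.append(response.url)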
