Currently I use Scrapy with multiprocessing. I made a proof of concept (POC)
in order to run many spiders in parallel. My code looks like this:
#!/usr/bin/python
# -*- coding: utf-8 -*-
import os
from multiprocessing import Process, Queue, current_process


def worker(work_queue, done_queue):
    try:
        # Consume commands until the 'STOP' sentinel is received.
        for action in iter(work_queue.get, 'STOP'):
            status_code = run_spider(action)
    except Exception, e:
        done_queue.put("%s failed on %s with: %s"
                       % (current_process().name, action, e.message))
    return True


def run_spider(action):
    # Shell out to 'scrapy crawl ...' and return its exit status.
    return os.system(action)


def main():
    sites = (
        "scrapy crawl level1 -a url='https://www.example.com/test.html'",
        "scrapy crawl level1 -a url='https://www.example.com/test1.html'",
        "scrapy crawl level1 -a url='https://www.example.com/test2.html'",
        "scrapy crawl level1 -a url='https://www.example.com/test3.html'",
        "scrapy crawl level1 -a url='https://www.anotherexample.com/test4.html'",
        "scrapy crawl level1 -a url='https://www.anotherexample.com/test5.html'",
        "scrapy crawl level1 -a url='https://www.anotherexample.com/test6.html'",
        "scrapy crawl level1 -a url='https://www.anotherexample.com/test7.html'",
        "scrapy crawl level1 -a url='https://www.anotherexample.com/test8.html'",
        "scrapy crawl level1 -a url='https://www.anotherexample.com/test9.html'",
        "scrapy crawl level1 -a url='https://www.anotherexample.com/test10.html'",
        "scrapy crawl level1 -a url='https://www.anotherexample.com/test11.html'",
    )
    workers = 2
    work_queue = Queue()
    done_queue = Queue()
    processes = []

    for action in sites:
        work_queue.put(action)

    for w in xrange(workers):
        p = Process(target=worker, args=(work_queue, done_queue))
        p.start()
        processes.append(p)
        # One 'STOP' sentinel per worker, so every process terminates.
        work_queue.put('STOP')

    for p in processes:
        p.join()

    done_queue.put('STOP')
    for status in iter(done_queue.get, 'STOP'):
        print status


if __name__ == '__main__':
    main()
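As a side note, run_spider could also shell out via subprocess instead of
os.system, to get the child's real exit code back without going through the
shell (a minimal sketch, not what I currently run):

import shlex
import subprocess


def run_spider(action):
    # Split the 'scrapy crawl ...' command line into arguments
    # (shlex handles the quoted url=... part) and run it directly,
    # returning the child process's exit code.
    return subprocess.call(shlex.split(action))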
In your opinion, what is the best solution to run multiple instances of
Scrapy?
Would it be better to launch one Scrapy instance per URL, or to launch a
single spider with x URLs (e.g. 1 spider with 100 links)?
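To be concrete, the second option would look roughly like this (a minimal
sketch; the 'urls' argument name and the comma-separated convention are my
assumptions, not code I have run):

import scrapy


class Level1Spider(scrapy.Spider):
    name = 'level1'

    def __init__(self, urls=None, *args, **kwargs):
        super(Level1Spider, self).__init__(*args, **kwargs)
        # 'urls' arrives via '-a urls=...' as a comma-separated string;
        # Scrapy passes -a spider arguments as keyword args to __init__.
        self.start_urls = urls.split(',') if urls else []

which would be launched with a single command such as:

scrapy crawl level1 -a urls='https://www.example.com/test.html,https://www.example.com/test1.html'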
Thanks in advance