Currently I use Scrapy with multiprocessing. I made a proof of concept (POC)
in order to run many spiders in parallel. My code looks like this:
#!/usr/bin/python
# -*- coding: utf-8 -*-
import os
from multiprocessing import Process, Queue, current_process


def worker(work_queue, done_queue):
    try:
        # Consume commands until the 'STOP' sentinel is received.
        for action in iter(work_queue.get, 'STOP'):
            status_code = run_spider(action)
    except Exception, e:
        done_queue.put("%s failed on %s with: %s"
                       % (current_process().name, action, e.message))
    return True


def run_spider(action):
    # Shell out to 'scrapy crawl ...' and return its exit status.
    return os.system(action)


def main():
    sites = (
        "scrapy crawl level1 -a url='https://www.example.com/test.html'",
        "scrapy crawl level1 -a url='https://www.example.com/test1.html'",
        "scrapy crawl level1 -a url='https://www.example.com/test2.html'",
        "scrapy crawl level1 -a url='https://www.example.com/test3.html'",
        "scrapy crawl level1 -a url='https://www.anotherexample.com/test4.html'",
        "scrapy crawl level1 -a url='https://www.anotherexample.com/test5.html'",
        "scrapy crawl level1 -a url='https://www.anotherexample.com/test6.html'",
        "scrapy crawl level1 -a url='https://www.anotherexample.com/test7.html'",
        "scrapy crawl level1 -a url='https://www.anotherexample.com/test8.html'",
        "scrapy crawl level1 -a url='https://www.anotherexample.com/test9.html'",
        "scrapy crawl level1 -a url='https://www.anotherexample.com/test10.html'",
        "scrapy crawl level1 -a url='https://www.anotherexample.com/test11.html'",
    )
    workers = 2
    work_queue = Queue()
    done_queue = Queue()
    processes = []

    for action in sites:
        work_queue.put(action)

    for w in xrange(workers):
        p = Process(target=worker, args=(work_queue, done_queue))
        p.start()
        processes.append(p)
        # One 'STOP' sentinel per worker, so every process terminates.
        work_queue.put('STOP')

    for p in processes:
        p.join()

    done_queue.put('STOP')
    for status in iter(done_queue.get, 'STOP'):
        print status


if __name__ == '__main__':
    main()
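As a side note, run_spider could also shell out via subprocess instead of
os.system, to get the child's real exit code back without going through the
shell (a minimal sketch, not what I currently run):

import shlex
import subprocess


def run_spider(action):
    # Split the 'scrapy crawl ...' command line into arguments
    # (shlex handles the quoted url=... part) and run it directly,
    # returning the child process's exit code.
    return subprocess.call(shlex.split(action))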
In your opinion, what is the best solution to run multiple instances of
Scrapy?
Would it be better to launch one Scrapy instance per URL, or to launch a
single spider with x URLs (e.g. 1 spider with 100 links)?
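To be concrete, the second option would look roughly like this (a minimal
sketch; the 'urls' argument name and the comma-separated convention are my
assumptions, not code I have run):

import scrapy


class Level1Spider(scrapy.Spider):
    name = 'level1'

    def __init__(self, urls=None, *args, **kwargs):
        super(Level1Spider, self).__init__(*args, **kwargs)
        # 'urls' arrives via '-a urls=...' as a comma-separated string;
        # Scrapy passes -a spider arguments as keyword args to __init__.
        self.start_urls = urls.split(',') if urls else []

which would be launched with a single command such as:

scrapy crawl level1 -a urls='https://www.example.com/test.html,https://www.example.com/test1.html'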
Thanks in advance