Hi there,

I'm designing a crawler that scrapes the same kind of item from different 
sources A, B, and C. Items are first found on source A. Then some business 
logic decides whether the same item should also be looked up on sources B 
and C. If so, new requests are issued by the corresponding spiders B and C.

AFAIK, one spider can't communicate directly with another.

To solve this within the Scrapy stack, I suppose I would have to write a 
Python script that instantiates spiders A, B, and C. This script would 
embed the orchestration logic by listening to spider signals such as 
item_scraped and spider_idle. Does Scrapy handle resources correctly this 
way?
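To make this concrete, here is a rough sketch of what I have in mind, 
following the "run multiple spiders in the same process" pattern with 
CrawlerRunner. SpiderA/SpiderB/SpiderC and needs_lookup() are placeholders 
for my spiders and business logic, and I'm assuming B and C accept the 
collected items as a spider argument:

    from twisted.internet import defer, reactor

    from scrapy import signals
    from scrapy.crawler import CrawlerRunner
    from scrapy.utils.log import configure_logging

    from myproject.spiders import SpiderA, SpiderB, SpiderC  # placeholders

    configure_logging()
    runner = CrawlerRunner()

    items_to_follow = []

    def on_item_scraped(item, response, spider):
        # Business logic: decide whether this item should also be looked
        # up on sources B and C (needs_lookup is a placeholder predicate).
        if needs_lookup(item):
            items_to_follow.append(item)

    @defer.inlineCallbacks
    def crawl():
        crawler_a = runner.create_crawler(SpiderA)
        crawler_a.signals.connect(on_item_scraped,
                                  signal=signals.item_scraped)
        yield runner.crawl(crawler_a)  # run A to completion first
        if items_to_follow:
            # Fan out to B and C concurrently with the collected items
            # (assumes their __init__ accepts an "items" argument).
            yield defer.DeferredList([
                runner.crawl(SpiderB, items=items_to_follow),
                runner.crawl(SpiderC, items=items_to_follow),
            ])
        reactor.stop()

    crawl()
    reactor.run()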

An alternative would be to deploy the spiders to scrapyd and write the 
orchestration logic in a separate program that communicates with scrapyd 
and the spiders through its REST API. Resources and the job queue would be 
handled better by scrapyd, wouldn't they?
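Something along these lines, assuming a scrapyd instance on localhost:6800 
and a deployed project named "myproject" (extra parameters to schedule.json 
are passed through to the spider as arguments):

    import requests

    SCRAPYD = "http://localhost:6800"
    PROJECT = "myproject"  # assumed project name

    def schedule(spider, **spider_args):
        # schedule.json enqueues a run and returns its job id.
        payload = {"project": PROJECT, "spider": spider, **spider_args}
        r = requests.post(f"{SCRAPYD}/schedule.json", data=payload)
        r.raise_for_status()
        return r.json()["jobid"]

    def running_jobs():
        # listjobs.json reports pending/running/finished jobs per project.
        r = requests.get(f"{SCRAPYD}/listjobs.json",
                         params={"project": PROJECT})
        r.raise_for_status()
        return r.json()["running"]

    # Orchestration: run A, wait for it to finish, read its output, then
    # fan out to B and C for the items that need a second lookup.
    job_a = schedule("spider_a")
    # ... poll running_jobs() until job_a is done, collect A's items ...
    job_b = schedule("spider_b", item_ids="1,2,3")  # hypothetical spider arg
    job_c = schedule("spider_c", item_ids="1,2,3")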

My feeling is that scrapyd plus external orchestration through REST is the 
better approach. Do you think so too?

Thanks in advance for your feedback.

Cheers,

Jeremy
