Are those URLs from the same domain?

If not, then you can use as many processes as you want, distributed across
different servers, to max out all the available bandwidth. You can either
split the input among the Scrapy processes (e.g. by passing an argument that
tells each process to read its batch of URLs from a file) or share a single
scheduler queue and start processes on demand (e.g. by using Redis [1] or any
other database).
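
For the file-splitting route, here is a minimal sketch of a spider that
reads its batch from a file passed on the command line. The spider name,
the one-URL-per-line format and the urls_file argument are my own choices,
not anything Scrapy imposes:

    # Run one process per chunk of the input, e.g.:
    #   scrapy crawl urlbatch -a urls_file=part-01.txt
    import scrapy

    class UrlBatchSpider(scrapy.Spider):
        name = "urlbatch"

        def __init__(self, urls_file=None, *args, **kwargs):
            super().__init__(*args, **kwargs)
            # One URL per line; the file name comes in via the -a argument.
            with open(urls_file) as f:
                self.start_urls = [line.strip() for line in f if line.strip()]

        def parse(self, response):
            # Replace with whatever extraction you actually need.
            yield {"url": response.url,
                   "title": response.css("title::text").get()}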

Otherwise, mass distributed crawling of a single domain might qualify as a
DDoS attack. You would need to manage a pool of IPs [2] and crawl gently
from each IP to avoid being blocked (in case the website does block you).
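
If you do go distributed against a single site, something along these
lines keeps each process polite and spreads the requests over a proxy
pool. DOWNLOAD_DELAY, CONCURRENT_REQUESTS_PER_DOMAIN, AUTOTHROTTLE_ENABLED
and request.meta["proxy"] are standard Scrapy knobs; the proxy addresses
and the middleware itself are only illustrative:

    # settings.py -- throttle each process so no single IP hammers the site
    DOWNLOAD_DELAY = 1.0                  # seconds between requests to a domain
    CONCURRENT_REQUESTS_PER_DOMAIN = 2
    AUTOTHROTTLE_ENABLED = True           # back off if the site slows down

    # Register the (hypothetical) proxy middleware; the dotted path depends
    # on your project layout.
    DOWNLOADER_MIDDLEWARES = {
        "myproject.middlewares.RandomProxyMiddleware": 350,
    }

    # middlewares.py
    import random

    PROXY_POOL = [
        "http://10.0.0.1:8080",   # placeholder addresses -- use your own pool
        "http://10.0.0.2:8080",
    ]

    class RandomProxyMiddleware:
        def process_request(self, request, spider):
            # Scrapy's built-in HttpProxyMiddleware honours meta["proxy"].
            request.meta["proxy"] = random.choice(PROXY_POOL)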

Rolando

[1] Here is a sample implementation of a custom scheduler with redis as a
backend:
https://github.com/darkrho/scrapy-redis/blob/master/scrapy_redis/scheduler.py
[2] Shameless plug: Crawlera's paid service does this for you:
http://crawlera.com/
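
For completeness, wiring a Redis-backed scheduler like [1] into a project
is just a matter of settings; roughly like below, though the exact setting
names come from the scrapy-redis project, so double-check its README:

    # settings.py
    SCHEDULER = "scrapy_redis.scheduler.Scheduler"
    DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"
    SCHEDULER_PERSIST = True               # keep the queue across restarts
    REDIS_URL = "redis://localhost:6379"   # shared Redis instance (placeholder)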



On Thu, Mar 6, 2014 at 5:16 PM, James Ford <[email protected]> wrote:

> Hello,
>
> How do I get the most out of Scrapy when crawling a list of static URLs?
> The list can range from one URL to thousands.
>
> Should I divide the list, distribute X URLs to Y spiders, and let them
> run in the same process? Or should I let one spider handle them all? Maybe
> it doesn't matter?
>
> Thanks
>
