What site are you scraping? Lots of sites have good caching on common pages, but if you go a link or two deep, the site has to recreate the page.
What I'm getting at is this: I think Scrapy should handle this situation out of the box, so I'm wondering whether the remote server is throttling you. Have you profiled the scrape of the URLs to determine whether there are throttling or timing issues?

On Sat, Feb 6, 2016 at 8:25 PM, kris brown <[email protected]> wrote:
> Hello everyone! Apologies if this topic appeared twice, my first attempt
> to post it did not seem to show up in the group.
>
> Anyways, this is my first scrapy project and I'm trying to crawl multiple
> domains (about 100), which has presented a scheduling issue. In trying to
> be polite to the sites I'm crawling, I've set a reasonable download delay
> and limited the IP concurrency to 1 for any particular domain. What I
> think is happening is that the url queue fills up with many urls for a
> single domain, which of course ends up dragging the crawl rate down to about
> 15/minute. I've been thinking about writing a scheduler that would return
> the next url based on a heap sorted by the earliest time a domain can be
> crawled next. However, I'm sure others have faced a similar problem, and as
> I'm a total beginner to scrapy I wanted to hear some different opinions on
> how to resolve this. Thanks!
>
> --
> You received this message because you are subscribed to the Google Groups
> "scrapy-users" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to [email protected].
> To post to this group, send email to [email protected].
> Visit this group at https://groups.google.com/group/scrapy-users.
> For more options, visit https://groups.google.com/d/optout.
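
For reference, the heap idea from the original question (return the next url from whichever domain can be crawled soonest) might look roughly like the sketch below. It is a standalone illustration only and does not plug into Scrapy's scheduler interface. The class name `DomainHeapScheduler`, its `push`/`pop` methods, the `delay` parameter, and the example URLs are all hypothetical.

```python
import heapq
from collections import defaultdict, deque
from urllib.parse import urlparse


class DomainHeapScheduler:
    """Hypothetical sketch: hand out URLs from the domain that is ready soonest."""

    def __init__(self, delay=2.0):
        self.delay = delay                # assumed per-domain politeness delay (seconds)
        self.queues = defaultdict(deque)  # domain -> pending URLs, FIFO
        self.heap = []                    # entries are (earliest_allowed_time, domain)
        self.next_allowed = {}            # domain -> earliest time of its next fetch

    def push(self, url):
        domain = urlparse(url).netloc
        if not self.queues[domain]:
            # A domain is kept in the heap only while it has pending URLs,
            # so a domain whose queue was empty has to be (re-)added here.
            ready_at = self.next_allowed.get(domain, 0.0)
            heapq.heappush(self.heap, (ready_at, domain))
        self.queues[domain].append(url)

    def pop(self, now):
        """Return a URL whose domain may be fetched at `now`, or None if none is ready."""
        if not self.heap or self.heap[0][0] > now:
            return None
        _, domain = heapq.heappop(self.heap)
        url = self.queues[domain].popleft()
        self.next_allowed[domain] = now + self.delay
        if self.queues[domain]:
            # More URLs pending for this domain: re-queue it at its next allowed time.
            heapq.heappush(self.heap, (self.next_allowed[domain], domain))
        return url
```

In a real crawl, `now` would presumably come from something like `time.monotonic()`, and a `None` result would mean "wait until the time at the top of the heap". The point of the structure is that a long run of urls for one domain can no longer block urls for other domains that are already allowed to be fetched.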
