Try these settings:

CONCURRENT_REQUESTS = 100
CONCURRENT_REQUESTS_PER_DOMAIN = 1
CONCURRENT_REQUESTS_PER_IP = 1
DOWNLOAD_TIMEOUT = 3
COOKIES_ENABLED = False
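
If you'd rather not change the project-wide settings.py, the same values can also live on the spider itself via custom_settings. A rough sketch (the spider name and start URL are just placeholders):

    import scrapy

    class BroadSpider(scrapy.Spider):
        name = "broad"                            # placeholder name
        start_urls = ["http://example.com"]       # placeholder start URL
        custom_settings = {
            "CONCURRENT_REQUESTS": 100,           # many requests in flight overall
            "CONCURRENT_REQUESTS_PER_DOMAIN": 1,  # but only one at a time per domain...
            "CONCURRENT_REQUESTS_PER_IP": 1,      # ...and per IP (a non-zero per-IP limit is used instead of the per-domain one)
            "DOWNLOAD_TIMEOUT": 3,                # seconds; slow responses get dropped
            "COOKIES_ENABLED": False,             # broad crawls rarely need cookies
        }

        def parse(self, response):
            # placeholder: extract items / follow links here
            pass

Some back-of-the-envelope numbers and a rough sketch of the heap-queue idea are at the bottom, below the quoted thread.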
On Tuesday, February 9, 2016 at 10:46:36 AM UTC, Dimitris Kouzis - Loukas wrote:
>
> Let me give you a very clear example. Let's assume you have 10k fast URLs where each takes 1 second and 10k slow URLs where each takes 15 seconds. The first ones need 2.7 download hours while the slow ones need 41 download hours.
>
> That's your workload - there isn't much you can do about it if you want to do all this work. By increasing the number of Requests you run in parallel, e.g. from 1 to 10, you divide those numbers by 10. By using a more clever scheduling algorithm, what you can do is bring those fast URLs first, so in 3 hours you have most of the fast URLs and a few slow ones. By setting the download timeout to e.g. 3 seconds, you trim the slow job from 41 hours to 8 hours (and of course potentially lose many URLs).
>
> The most effective way to attack the problem you have is to find ways to do less.
>
> On Tuesday, February 9, 2016 at 10:34:10 AM UTC, Dimitris Kouzis - Loukas wrote:
>>
>> Yes - certainly bring your concurrency level to 8 requests per IP and be less nice. See if it fixes the problem and if anyone complains. Beyond that, make sure you don't download stuff you've already downloaded recently. If a site is slow, it likely doesn't have much updated content... not even comments... so don't recrawl the same boring old static pages. Or crawl them once a month, instead of on every crawl.
>>
>> On Monday, February 8, 2016 at 11:19:15 PM UTC, kris brown wrote:
>>>
>>> Dimitris,
>>> Thanks for the tips. So I have taken your advice and put a download timeout on my crawl and a download delay of just a couple of seconds, but I still face the issue of a long string of URLs for a single domain in my queues. Since I'm only making a single request to a given IP address, this seems to bring down the crawl rate despite having a quick download time for any given single request. This is where I'm wondering what my best option is for doing a broad crawl. Implement my own scheduling queue? Run multiple spiders? Not sure what the best option is. For my current project, with no pipelines implemented, I'm maxing out at about 200-300 crawls per minute. Definitely need to get that number much higher to have any kind of reasonable performance on the crawl. Thanks again for everyone's advice!
>>>
>>> On Monday, February 8, 2016 at 2:29:22 PM UTC-6, Dimitris Kouzis - Loukas wrote:
>>>>
>>>> Just a quick ugly tip... Set download timeout <http://doc.scrapy.org/en/latest/topics/settings.html#download-timeout> to 3 seconds... get done with handling the responsive websites and then try another approach with the slower ones (or skip them altogether?)
>>>>
>>>> Also, don't be over-polite... if you could do something with a browser, I think it's fair to do it with Scrapy.
>>>>
>>>> On Monday, February 8, 2016 at 1:01:19 AM UTC, kris brown wrote:
>>>>>
>>>>> So the project is scraping several university websites. I've profiled the crawl as it's going to see the engine and downloader slots, which eventually converge to just having a single domain that URLs come from. Having looked at the download latency in the headers, I don't see any degradation of response times. The drift towards an extremely long series of responses from a single domain is what led me to think I need a different scheduler. If there's any other info I can provide that would be more useful, let me know.
>>>>>
>>>>> On Sunday, February 7, 2016 at 4:09:03 PM UTC-6, Travis Leleu wrote:
>>>>>>
>>>>>> What site are you scraping? Lots of sites have good caching on common pages, but if you go a link or two deep, the site has to recreate the page.
>>>>>>
>>>>>> What I'm getting at is this - I think Scrapy should handle this situation out of the box, and I'm wondering if the remote server is throttling you.
>>>>>>
>>>>>> Have you profiled the scrape of the URLs to determine if there are throttling or timing issues?
>>>>>>
>>>>>> On Sat, Feb 6, 2016 at 8:25 PM, kris brown <[email protected]> wrote:
>>>>>>
>>>>>>> Hello everyone! Apologies if this topic appeared twice; my first attempt to post it did not seem to show up in the group.
>>>>>>>
>>>>>>> Anyway, this is my first Scrapy project and I'm trying to crawl multiple domains (about 100), which has presented a scheduling issue. In trying to be polite to the sites I'm crawling, I've set a reasonable download delay and limited the IP concurrency to 1 for any particular domain. What I think is happening is that the URL queue fills up with many URLs for a single domain, which of course ends up dragging the crawl rate down to about 15/minute. I've been thinking about writing a scheduler that would return the next URL based on a heap sorted by the earliest time a domain can be crawled next. However, I'm sure others have faced a similar problem, and as I'm a total beginner to Scrapy I wanted to hear some different opinions on how to resolve this. Thanks!
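
To make the arithmetic in Dimitris's example above concrete, here is a quick back-of-the-envelope script (nothing Scrapy-specific, just the math, using the numbers from the quote):

    # Rough crawl-time arithmetic for the example quoted above:
    # 10k URLs at 1 s each and 10k URLs at 15 s each.
    def download_hours(n_urls, secs_each, concurrency=1, timeout=None):
        # With a download timeout, slow requests are cut off (and lost) at that point.
        per_request = min(secs_each, timeout) if timeout is not None else secs_each
        return n_urls * per_request / concurrency / 3600.0

    print(download_hours(10_000, 1))                   # ~2.8 h for the fast URLs
    print(download_hours(10_000, 15))                  # ~41.7 h for the slow URLs
    print(download_hours(10_000, 15, concurrency=10))  # ~4.2 h with 10 requests in parallel
    print(download_hours(10_000, 15, timeout=3))       # ~8.3 h with DOWNLOAD_TIMEOUT = 3, at the cost of losing the slow pages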

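And on the "implement my own scheduling queue?" question: the heap idea kris describes (pop the domain whose next allowed crawl time comes earliest) can be sketched in plain Python like this. It is not wired into Scrapy's scheduler interface - just the core data structure, with a made-up per-domain delay:

    import heapq
    import time
    from collections import deque

    class DomainQueue:
        # Politeness-aware queue: domains live in a heap keyed by the earliest
        # time each one may be hit again, so URLs round-robin across domains
        # instead of one slow domain monopolizing the queue.

        def __init__(self, delay_per_domain=2.0):
            self.delay = delay_per_domain   # hypothetical per-domain delay, in seconds
            self.heap = []                  # entries of (next_allowed_time, domain)
            self.urls = {}                  # domain -> deque of pending URLs

        def push(self, domain, url):
            if domain not in self.urls:
                self.urls[domain] = deque()
                heapq.heappush(self.heap, (time.time(), domain))  # ready immediately
            self.urls[domain].append(url)

        def pop(self):
            # Return (domain, url) for the domain that may be crawled soonest,
            # or None if every queued domain is still inside its delay window.
            while self.heap:
                next_time, domain = self.heap[0]
                if next_time > time.time():
                    return None
                heapq.heappop(self.heap)
                pending = self.urls[domain]
                if not pending:
                    del self.urls[domain]   # domain drained; forget it
                    continue
                url = pending.popleft()
                # Re-schedule the domain after its politeness delay.
                heapq.heappush(self.heap, (time.time() + self.delay, domain))
                return domain, url
            return None

Whether it's worth plugging something like this in behind the SCHEDULER setting, versus simply feeding more domains into the crawl at once so the per-domain limits stop being the bottleneck, is another question - but that's the shape of the approach.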