Yes - certainly bring your concurrency level up to 8 requests per IP and be less nice. See whether it fixes the problem and whether anyone complains. Beyond that, make sure you don't re-download stuff you've already downloaded recently. If a site is slow, it likely doesn't have much updated content... not even comments... so don't recrawl the same boring old static pages. Or crawl them once a month instead of on every crawl.
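For concreteness, that first bit is only a few lines in settings.py. This is a minimal sketch, not a drop-in config; the 3-second timeout is Dimitris' tip from the thread below, and the last few lines assume you've installed the scrapy-deltafetch plugin to skip pages fetched on earlier runs:

    # settings.py -- minimal sketch, assuming Scrapy defaults otherwise

    # Be less nice: 8 parallel requests per IP instead of 1.
    # When this is non-zero it is enforced instead of the per-domain limit.
    CONCURRENT_REQUESTS_PER_IP = 8

    # Keep a small delay so you're still not hammering anyone.
    DOWNLOAD_DELAY = 1

    # Dimitris' tip from below: give up quickly on slow sites and move on.
    DOWNLOAD_TIMEOUT = 3

    # Raise the global ceiling so ~100 domains can actually run in parallel.
    CONCURRENT_REQUESTS = 100

    # Optional: skip pages already fetched in earlier runs
    # (assumes the scrapy-deltafetch plugin is installed).
    SPIDER_MIDDLEWARES = {
        "scrapy_deltafetch.DeltaFetch": 100,
    }
    DELTAFETCH_ENABLED = True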
On Monday, February 8, 2016 at 11:19:15 PM UTC, kris brown wrote:

Dimitris,
Thanks for the tips. I have taken your advice and put a download timeout on my crawl and a download delay of just a couple of seconds, but I still face the issue of a long string of URLs for a single domain in my queues. Since I'm only making a single request to a given IP address at a time, this brings the crawl rate down even though any single request downloads quickly. This is where I'm wondering what my best option is for doing a broad crawl. Implement my own scheduling queue? Run multiple spiders? Not sure what the best option is. For my current project, with no pipelines implemented, I'm maxing out at about 200-300 crawls per minute. I definitely need to get that number much higher to have any kind of reasonable performance on the crawl. Thanks again for everyone's advice!

On Monday, February 8, 2016 at 2:29:22 PM UTC-6, Dimitris Kouzis - Loukas wrote:

Just a quick ugly tip... set download timeout <http://doc.scrapy.org/en/latest/topics/settings.html#download-timeout> to 3 seconds... get done with handling the responsive websites and then try another approach with the slower ones (or skip them altogether?).

Also, don't be over-polite... if you could do something with a browser, I think it's fair to do it with Scrapy.

On Monday, February 8, 2016 at 1:01:19 AM UTC, kris brown wrote:

The project is scraping several university websites. I've profiled the crawl as it's going, watching the engine and downloader slots, which eventually converge to just a single domain that URLs come from. Having looked at the download latency on the headers, I don't see any degradation of response times. The drift towards an extremely long series of responses from a single domain is what led me to think I need a different scheduler. If there's any other info I can provide that would be more useful, let me know.

On Sunday, February 7, 2016 at 4:09:03 PM UTC-6, Travis Leleu wrote:

What site are you scraping? Lots of sites have good caching on common pages, but if you go a link or two deep, the site has to recreate the page.

What I'm getting at is this - I think Scrapy should handle this situation out of the box, and I'm wondering if the remote server is throttling you.

Have you profiled the scrape of the URLs to determine if there are throttling or timing issues?

On Sat, Feb 6, 2016 at 8:25 PM, kris brown <[email protected]> wrote:

Hello everyone! Apologies if this topic appears twice; my first attempt to post it did not seem to show up in the group.

Anyway, this is my first Scrapy project and I'm trying to crawl multiple domains (about 100), which has presented a scheduling issue. In trying to be polite to the sites I'm crawling, I've set a reasonable download delay and limited the IP concurrency to 1 for any particular domain. What I think is happening is that the URL queue fills up with many URLs for a single domain, which of course ends up dragging the crawl rate down to about 15 per minute. I've been thinking about writing a scheduler that would return the next URL based on a heap sorted by the earliest time a domain can be crawled next.
However, I'm sure others have faced a similar problem, and as I'm a total beginner with Scrapy I wanted to hear some different opinions on how to resolve this. Thanks!
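And for what it's worth, if you do end up rolling your own queue, the heap idea from your first message is roughly the sketch below. It's only an illustration of the data structure, not Scrapy's actual scheduler API; the class name, the 1-second politeness interval, and the standalone push/pop interface are all made up for the example - a real integration would go through a custom SCHEDULER setting instead.

    import heapq
    import time
    from urllib.parse import urlparse

    class NextCrawlQueue:
        """Toy queue: always hand out a URL from the domain that can be crawled soonest."""

        def __init__(self, delay_per_domain=1.0):
            self.delay = delay_per_domain      # politeness gap per domain (seconds)
            self.heap = []                     # (next_allowed_time, domain)
            self.pending = {}                  # domain -> queued URLs
            self.next_allowed = {}             # domain -> earliest next crawl time

        def push(self, url):
            domain = urlparse(url).netloc
            if domain not in self.pending:
                self.pending[domain] = []
                when = self.next_allowed.get(domain, time.time())
                heapq.heappush(self.heap, (when, domain))
            self.pending[domain].append(url)

        def pop(self):
            """Return the next crawlable URL, or None if every domain is still cooling down."""
            if not self.heap:
                return None
            when, domain = self.heap[0]
            if when > time.time():
                return None                    # even the "soonest" domain must wait
            heapq.heappop(self.heap)
            url = self.pending[domain].pop(0)
            self.next_allowed[domain] = time.time() + self.delay
            if self.pending[domain]:           # domain still has work: re-schedule it
                heapq.heappush(self.heap, (self.next_allowed[domain], domain))
            else:
                del self.pending[domain]
            return url

The point of the heap ordering is that when one domain dominates the frontier, pop() keeps handing out URLs from the other 99 domains while the busy one waits out its delay, instead of stalling the whole crawl behind it.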
