Try these settings:

CONCURRENT_REQUESTS = 100
CONCURRENT_REQUESTS_PER_DOMAIN = 1
CONCURRENT_REQUESTS_PER_IP = 1
DOWNLOAD_TIMEOUT = 3
COOKIES_ENABLED = False
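
If you'd rather not change the project-wide settings.py, the same values can also live on the spider itself via custom_settings. A rough sketch (the spider name and start URL are just placeholders):

    import scrapy

    class BroadSpider(scrapy.Spider):
        name = "broad"                            # placeholder name
        start_urls = ["http://example.com"]       # placeholder start URL
        custom_settings = {
            "CONCURRENT_REQUESTS": 100,           # many requests in flight overall
            "CONCURRENT_REQUESTS_PER_DOMAIN": 1,  # but only one at a time per domain...
            "CONCURRENT_REQUESTS_PER_IP": 1,      # ...and per IP (a non-zero per-IP limit is used instead of the per-domain one)
            "DOWNLOAD_TIMEOUT": 3,                # seconds; slow responses get dropped
            "COOKIES_ENABLED": False,             # broad crawls rarely need cookies
        }

        def parse(self, response):
            # placeholder: extract items / follow links here
            pass

Some back-of-the-envelope numbers and a rough sketch of the heap-queue idea are at the bottom, below the quoted thread.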
On Tuesday, February 9, 2016 at 10:46:36 AM UTC, Dimitris Kouzis - Loukas wrote:
>
> Let me give you a very clear example. Let's assume you have 10k fast URLs where each takes 1 second and 10k slow URLs where each takes 15 seconds. The first ones need 2.7 download hours while the slow ones need 41 download hours.
>
> That's your workload - there isn't much you can do about it if you want to do all this work. By increasing the number of Requests you run in parallel, e.g. from 1 to 10, you divide those numbers by 10. By using a more clever scheduling algorithm, what you can do is bring those fast URLs first, so in 3 hours you have most of the fast URLs and a few slow ones. By setting the download timeout to e.g. 3 seconds, you trim the slow job from 41 hours to 8 hours (and of course potentially lose many URLs).
>
> The most effective way to attack the problem you have is to find ways to do less.
>
> On Tuesday, February 9, 2016 at 10:34:10 AM UTC, Dimitris Kouzis - Loukas wrote:
>>
>> Yes - certainly bring your concurrency level to 8 requests per IP and be less nice. See if it fixes the problem and if anyone complains. Beyond that, make sure you don't download stuff you've already downloaded recently. If a site is slow, it likely doesn't have much updated content... not even comments... so don't recrawl the same boring old static pages. Or crawl them once a month, instead of on every crawl.
>>
>> On Monday, February 8, 2016 at 11:19:15 PM UTC, kris brown wrote:
>>>
>>> Dimitris,
>>> Thanks for the tips. So I have taken your advice and put a download timeout on my crawl and a download delay of just a couple of seconds, but I still face the issue of a long string of URLs for a single domain in my queues. Since I'm only making a single request to a given IP address, this seems to bring down the crawl rate despite having a quick download time for any given single request. This is where I'm wondering what my best option is for doing a broad crawl. Implement my own scheduling queue? Run multiple spiders? Not sure what the best option is. For my current project, with no pipelines implemented, I'm maxing out at about 200-300 crawls per minute. Definitely need to get that number much higher to have any kind of reasonable performance on the crawl. Thanks again for everyone's advice!
>>>
>>> On Monday, February 8, 2016 at 2:29:22 PM UTC-6, Dimitris Kouzis - Loukas wrote:
>>>>
>>>> Just a quick ugly tip... Set download timeout <http://doc.scrapy.org/en/latest/topics/settings.html#download-timeout> to 3 seconds... get done with handling the responsive websites and then try another approach with the slower ones (or skip them altogether?)
>>>>
>>>> Also, don't be over-polite... if you could do something with a browser, I think it's fair to do it with Scrapy.
>>>>
>>>> On Monday, February 8, 2016 at 1:01:19 AM UTC, kris brown wrote:
>>>>>
>>>>> So the project is scraping several university websites. I've profiled the crawl as it's going to see the engine and downloader slots, which eventually converge to just having a single domain that URLs come from. Having looked at the download latency in the headers, I don't see any degradation of response times. The drift towards an extremely long series of responses from a single domain is what led me to think I need a different scheduler. If there's any other info I can provide that would be more useful, let me know.
>>>>>
>>>>> On Sunday, February 7, 2016 at 4:09:03 PM UTC-6, Travis Leleu wrote:
>>>>>>
>>>>>> What site are you scraping? Lots of sites have good caching on common pages, but if you go a link or two deep, the site has to recreate the page.
>>>>>>
>>>>>> What I'm getting at is this - I think Scrapy should handle this situation out of the box, and I'm wondering if the remote server is throttling you.
>>>>>>
>>>>>> Have you profiled the scrape of the URLs to determine if there are throttling or timing issues?
>>>>>>
>>>>>> On Sat, Feb 6, 2016 at 8:25 PM, kris brown <[email protected]> wrote:
>>>>>>
>>>>>>> Hello everyone! Apologies if this topic appeared twice; my first attempt to post it did not seem to show up in the group.
>>>>>>>
>>>>>>> Anyway, this is my first Scrapy project and I'm trying to crawl multiple domains (about 100), which has presented a scheduling issue. In trying to be polite to the sites I'm crawling, I've set a reasonable download delay and limited the IP concurrency to 1 for any particular domain. What I think is happening is that the URL queue fills up with many URLs for a single domain, which of course ends up dragging the crawl rate down to about 15/minute. I've been thinking about writing a scheduler that would return the next URL based on a heap sorted by the earliest time a domain can be crawled next. However, I'm sure others have faced a similar problem, and as I'm a total beginner to Scrapy I wanted to hear some different opinions on how to resolve this. Thanks!
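
To make the arithmetic in Dimitris's example above concrete, here is a quick back-of-the-envelope script (nothing Scrapy-specific, just the math, using the numbers from the quote):

    # Rough crawl-time arithmetic for the example quoted above:
    # 10k URLs at 1 s each and 10k URLs at 15 s each.
    def download_hours(n_urls, secs_each, concurrency=1, timeout=None):
        # With a download timeout, slow requests are cut off (and lost) at that point.
        per_request = min(secs_each, timeout) if timeout is not None else secs_each
        return n_urls * per_request / concurrency / 3600.0

    print(download_hours(10_000, 1))                   # ~2.8 h for the fast URLs
    print(download_hours(10_000, 15))                  # ~41.7 h for the slow URLs
    print(download_hours(10_000, 15, concurrency=10))  # ~4.2 h with 10 requests in parallel
    print(download_hours(10_000, 15, timeout=3))       # ~8.3 h with DOWNLOAD_TIMEOUT = 3, at the cost of losing the slow pages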

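And on the "implement my own scheduling queue?" question: the heap idea kris describes (pop the domain whose next allowed crawl time comes earliest) can be sketched in plain Python like this. It is not wired into Scrapy's scheduler interface - just the core data structure, with a made-up per-domain delay:

    import heapq
    import time
    from collections import deque

    class DomainQueue:
        # Politeness-aware queue: domains live in a heap keyed by the earliest
        # time each one may be hit again, so URLs round-robin across domains
        # instead of one slow domain monopolizing the queue.

        def __init__(self, delay_per_domain=2.0):
            self.delay = delay_per_domain   # hypothetical per-domain delay, in seconds
            self.heap = []                  # entries of (next_allowed_time, domain)
            self.urls = {}                  # domain -> deque of pending URLs

        def push(self, domain, url):
            if domain not in self.urls:
                self.urls[domain] = deque()
                heapq.heappush(self.heap, (time.time(), domain))  # ready immediately
            self.urls[domain].append(url)

        def pop(self):
            # Return (domain, url) for the domain that may be crawled soonest,
            # or None if every queued domain is still inside its delay window.
            while self.heap:
                next_time, domain = self.heap[0]
                if next_time > time.time():
                    return None
                heapq.heappop(self.heap)
                pending = self.urls[domain]
                if not pending:
                    del self.urls[domain]   # domain drained; forget it
                    continue
                url = pending.popleft()
                # Re-schedule the domain after its politeness delay.
                heapq.heappush(self.heap, (time.time() + self.delay, domain))
                return domain, url
            return None

Whether it's worth plugging something like this in behind the SCHEDULER setting, versus simply feeding more domains into the crawl at once so the per-domain limits stop being the bottleneck, is another question - but that's the shape of the approach.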