Yes - certainly bring your concurrency level up to 8 requests per IP and be less nice. See whether it fixes the problem and whether anyone complains. Beyond that, make sure you don't re-download stuff you've already downloaded recently. If a site is slow, it likely doesn't have much updated content... not even comments... so don't recrawl the same boring old static pages. Or crawl them once a month instead of on every crawl.
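For concreteness, that first bit is only a few lines in settings.py. This is a minimal sketch, not a drop-in config; the 3-second timeout is Dimitris' tip from the thread below, and the last few lines assume you've installed the scrapy-deltafetch plugin to skip pages fetched on earlier runs:

    # settings.py -- minimal sketch, assuming Scrapy defaults otherwise

    # Be less nice: 8 parallel requests per IP instead of 1.
    # When this is non-zero it is enforced instead of the per-domain limit.
    CONCURRENT_REQUESTS_PER_IP = 8

    # Keep a small delay so you're still not hammering anyone.
    DOWNLOAD_DELAY = 1

    # Dimitris' tip from below: give up quickly on slow sites and move on.
    DOWNLOAD_TIMEOUT = 3

    # Raise the global ceiling so ~100 domains can actually run in parallel.
    CONCURRENT_REQUESTS = 100

    # Optional: skip pages already fetched in earlier runs
    # (assumes the scrapy-deltafetch plugin is installed).
    SPIDER_MIDDLEWARES = {
        "scrapy_deltafetch.DeltaFetch": 100,
    }
    DELTAFETCH_ENABLED = True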
On Monday, February 8, 2016 at 11:19:15 PM UTC, kris brown wrote:

Dimitris,
Thanks for the tips. I have taken your advice and put a download timeout on my crawl and a download delay of just a couple of seconds, but I still face the issue of a long string of URLs for a single domain in my queues. Since I'm only making a single request to a given IP address at a time, this brings the crawl rate down even though any single request downloads quickly. This is where I'm wondering what my best option is for doing a broad crawl. Implement my own scheduling queue? Run multiple spiders? Not sure what the best option is. For my current project, with no pipelines implemented, I'm maxing out at about 200-300 crawls per minute. I definitely need to get that number much higher to have any kind of reasonable performance on the crawl. Thanks again for everyone's advice!

On Monday, February 8, 2016 at 2:29:22 PM UTC-6, Dimitris Kouzis - Loukas wrote:

Just a quick ugly tip... set download timeout <http://doc.scrapy.org/en/latest/topics/settings.html#download-timeout> to 3 seconds... get done with handling the responsive websites and then try another approach with the slower ones (or skip them altogether?).

Also, don't be over-polite... if you could do something with a browser, I think it's fair to do it with Scrapy.

On Monday, February 8, 2016 at 1:01:19 AM UTC, kris brown wrote:

The project is scraping several university websites. I've profiled the crawl as it's going, watching the engine and downloader slots, which eventually converge to just a single domain that URLs come from. Having looked at the download latency on the headers, I don't see any degradation of response times. The drift towards an extremely long series of responses from a single domain is what led me to think I need a different scheduler. If there's any other info I can provide that would be more useful, let me know.

On Sunday, February 7, 2016 at 4:09:03 PM UTC-6, Travis Leleu wrote:

What site are you scraping? Lots of sites have good caching on common pages, but if you go a link or two deep, the site has to recreate the page.

What I'm getting at is this - I think Scrapy should handle this situation out of the box, and I'm wondering if the remote server is throttling you.

Have you profiled the scrape of the URLs to determine if there are throttling or timing issues?

On Sat, Feb 6, 2016 at 8:25 PM, kris brown <[email protected]> wrote:

Hello everyone! Apologies if this topic appears twice; my first attempt to post it did not seem to show up in the group.

Anyway, this is my first Scrapy project and I'm trying to crawl multiple domains (about 100), which has presented a scheduling issue. In trying to be polite to the sites I'm crawling, I've set a reasonable download delay and limited the IP concurrency to 1 for any particular domain. What I think is happening is that the URL queue fills up with many URLs for a single domain, which of course ends up dragging the crawl rate down to about 15 per minute. I've been thinking about writing a scheduler that would return the next URL based on a heap sorted by the earliest time a domain can be crawled next.
However, I'm sure others have faced a similar problem, and as I'm a total beginner with Scrapy I wanted to hear some different opinions on how to resolve this. Thanks!
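And for what it's worth, if you do end up rolling your own queue, the heap idea from your first message is roughly the sketch below. It's only an illustration of the data structure, not Scrapy's actual scheduler API; the class name, the 1-second politeness interval, and the standalone push/pop interface are all made up for the example - a real integration would go through a custom SCHEDULER setting instead.

    import heapq
    import time
    from urllib.parse import urlparse

    class NextCrawlQueue:
        """Toy queue: always hand out a URL from the domain that can be crawled soonest."""

        def __init__(self, delay_per_domain=1.0):
            self.delay = delay_per_domain      # politeness gap per domain (seconds)
            self.heap = []                     # (next_allowed_time, domain)
            self.pending = {}                  # domain -> queued URLs
            self.next_allowed = {}             # domain -> earliest next crawl time

        def push(self, url):
            domain = urlparse(url).netloc
            if domain not in self.pending:
                self.pending[domain] = []
                when = self.next_allowed.get(domain, time.time())
                heapq.heappush(self.heap, (when, domain))
            self.pending[domain].append(url)

        def pop(self):
            """Return the next crawlable URL, or None if every domain is still cooling down."""
            if not self.heap:
                return None
            when, domain = self.heap[0]
            if when > time.time():
                return None                    # even the "soonest" domain must wait
            heapq.heappop(self.heap)
            url = self.pending[domain].pop(0)
            self.next_allowed[domain] = time.time() + self.delay
            if self.pending[domain]:           # domain still has work: re-schedule it
                heapq.heappush(self.heap, (self.next_allowed[domain], domain))
            else:
                del self.pending[domain]
            return url

The point of the heap ordering is that when one domain dominates the frontier, pop() keeps handing out URLs from the other 99 domains while the busy one waits out its delay, instead of stalling the whole crawl behind it.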
