Also, if I set

SCHEDULER_MEMORY_QUEUE = 'scrapy.squeues.FifoMemoryQueue'

then the same site that previously gave 100k+ scraped pages gives only 5k. Is that 
normal? Shouldn't it scrape the same pages either way?
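
(For context, the breadth-first example in the Scrapy FAQ sets these three settings 
together; so far I have only changed the memory-queue line, so the other two below 
are just what the docs show, not something I am actually running:)

# settings.py - breadth-first (FIFO) crawl, per the Scrapy FAQ
DEPTH_PRIORITY = 1
SCHEDULER_DISK_QUEUE = 'scrapy.squeues.PickleFifoDiskQueue'
SCHEDULER_MEMORY_QUEUE = 'scrapy.squeues.FifoMemoryQueue'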

On Saturday, August 15, 2015 at 2:43:51 PM UTC+3, ShapeR wrote:
>
> OK, I was wrong that updating Scrapy fixed it. I had just accidentally set 
> the scan depth to 10k instead of 100k. When I set it back, the problem is 
> still the same: the Request count just keeps going up. 
>
> >>> prefs()
> Live References
>
> LinkCrawlItem                      33   oldest: 3s ago
> HtmlResponse                       59   oldest: 5s ago
> ExternalLinkSpider                  1   oldest: 352s ago
> Request                        411836   oldest: 349s ago
>
>
> The oldest Request stays in memory no matter what, so at a 100k page count 
> the spider can reach 40 GB of memory use on some sites, which is clearly broken. 
> Can you elaborate on the duplicate request filter? I don't work with Requests 
> anywhere in my code, only with Responses.
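>
> (I suppose I could turn on the dupefilter's debug logging to see what it is 
> actually dropping; as far as I can tell from the docs that is just this stock 
> setting, nothing spider-specific:)
>
> # settings.py
> DUPEFILTER_DEBUG = True  # log every duplicate request that is filtered, not only the first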
>
> As for JOBDIR - I guess the crawl will become slow, and I don't think it's 
> normal for this spider to consume 30 GB of memory in the first place.
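>
> (For reference, as I understand it, enabling it is just a matter of running 
> the crawl with a job directory - the path below is only a placeholder:)
>
> scrapy crawl external_link_spider -s JOBDIR=crawls/run-1
>
> That should move the scheduler's pending request queue to disk instead of 
> keeping it all in memory, at the cost of some speed.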
>
>
>
> On Monday, July 27, 2015 at 7:48:13 PM UTC+3, fernando vasquez wrote:
>>
>> You are not processing Requests as fast as you capture them. I had the 
>> same problem, although the cause could be different. In my case the link 
>> extractor was capturing duplicated Requests, so I decided to filter out the 
>> duplicates myself. The problem with Scrapy is that the duplicate filter only 
>> runs after the link extractor has already created the Request objects, so 
>> you end up with tons of Requests sitting in the queue.
>>
>> In conclusion, you might have duplicated requests; just filter them out 
>> before the for loop.
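>>
>> (Roughly what I mean, as a sketch only - the "seen_urls" attribute is made 
>> up, and the extractor/item names are just taken from your example:)
>>
>>     seen_urls = set()  # kept on the spider, shared across responses
>>
>>     def parse_obj(self, response):
>>         if not isinstance(response, HtmlResponse):
>>             return
>>         links = LxmlLinkExtractor(
>>             allow=(), deny=self.allowed_domains).extract_links(response)
>>         # drop already-seen and nofollow links before entering the loop
>>         fresh = [l for l in links
>>                  if l.url not in self.seen_urls and not l.nofollow]
>>         for link in fresh:
>>             self.seen_urls.add(link.url)
>>             yield LinkCrawlItem(domain=link.url)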
>>
>> On Thursday, July 23, 2015 at 12:33:19 PM UTC-5, ShapeR wrote:
>>>
>>> My spider has a serious memory leak. After 15 minutes of running it uses 
>>> 5 GB of memory, and prefs() shows about 900k Request objects and little 
>>> else. What could be the reason for this high number of live Request 
>>> objects? The Request count only goes up and never comes down, while all 
>>> other object counts stay close to zero.
>>>
>>> My spider looks like this:
>>>
>>> from scrapy.http import HtmlResponse
>>> from scrapy.linkextractors.lxmlhtml import LxmlLinkExtractor
>>> from scrapy.spiders import CrawlSpider, Rule
>>> # LinkCrawlItem is defined in my project's items module
>>>
>>> class ExternalLinkSpider(CrawlSpider):
>>>     name = 'external_link_spider'
>>>     allowed_domains = ['']
>>>     start_urls = ['']
>>>
>>>     rules = (
>>>         Rule(LxmlLinkExtractor(allow=()), callback='parse_obj', follow=True),
>>>     )
>>>
>>>     def parse_obj(self, response):
>>>         if not isinstance(response, HtmlResponse):
>>>             return
>>>         for link in LxmlLinkExtractor(
>>>                 allow=(), deny=self.allowed_domains).extract_links(response):
>>>             if not link.nofollow:
>>>                 yield LinkCrawlItem(domain=link.url)
>>>
>>> Here is the output of prefs():
>>>
>>>
>>> HtmlResponse                        2   oldest: 0s ago
>>> ExternalLinkSpider                  1   oldest: 3285s ago
>>> LinkCrawlItem                       2   oldest: 0s ago
>>> Request                       1663405   oldest: 3284s ago
>>>
>>>
>>> Any ideas or suggestions?
>>>
>>
