Also, if I set SCHEDULER_MEMORY_QUEUE = 'scrapy.squeues.FifoMemoryQueue', then the same site that gave 100k+ scraped pages gives only about 5k. Is that normal? Shouldn't the page count be the same regardless of the queue type?

(At the bottom of this message, below the quoted thread, I have added rough sketches of how I understand the duplicate-filtering and JOBDIR suggestions.)
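For reference, the queue-related settings I am experimenting with look roughly like this. Only the SCHEDULER_MEMORY_QUEUE line is from my actual config; the other two lines are the companion settings the Scrapy docs suggest for breadth-first crawling, and I have not verified that they matter here.

# settings.py (sketch): switch the scheduler to FIFO / breadth-first order.
# DEPTH_PRIORITY and the disk-queue line are taken from the Scrapy docs'
# breadth-first example, not from my running project.
DEPTH_PRIORITY = 1
SCHEDULER_MEMORY_QUEUE = 'scrapy.squeues.FifoMemoryQueue'
SCHEDULER_DISK_QUEUE = 'scrapy.squeues.PickleFifoDiskQueue'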
On Saturday, August 15, 2015 at 2:43:51 PM UTC+3, ShapeR wrote:
>
> OK, I was wrong that updating Scrapy fixed it. I had just accidentally set
> the scan depth to 10k instead of 100k. When I set it back, the problem is
> still the same: the number of live Request objects just keeps going up.
>
> >>> prefs()
> Live References
>
> LinkCrawlItem             33   oldest: 3s ago
> HtmlResponse              59   oldest: 5s ago
> ExternalLinkSpider         1   oldest: 352s ago
> Request               411836   oldest: 349s ago
>
> The oldest Request stays in memory no matter what, so at a 100k page count
> the spider can reach 40 GB of memory use on some sites, which is totally
> broken. Can you elaborate on the duplicate requests filter? I don't work
> with Requests anywhere in my code, only with Responses.
>
> As for JOBDIR: I guess it will become slow, and I don't think it's normal
> for this spider to consume 30 GB of memory.
>
> On Monday, July 27, 2015 at 7:48:13 PM UTC+3, fernando vasquez wrote:
>>
>> You are not processing Requests as fast as you capture them. I had the
>> same problem, although the cause could be different in your case. In my
>> case the link extractor was capturing duplicated Requests, so I decided
>> to filter out the duplicates myself. The problem with Scrapy is that the
>> duplicates filter runs after the link extractor has already created the
>> Requests, so you get tons of Request objects.
>>
>> In conclusion, you might have duplicated requests; just filter them
>> before the for loop.
>>
>> On Thursday, July 23, 2015 at 12:33:19 (UTC-5), ShapeR wrote:
>>>
>>> My spider has a serious memory leak. After 15 minutes of running it
>>> uses 5 GB of memory, and Scrapy reports (using prefs()) around 900k
>>> live Request objects and little else. What could be the reason for this
>>> high number of live Request objects? The Request count only goes up and
>>> never goes down. All other object counts are close to zero.
>>>
>>> My spider looks like this:
>>>
>>> class ExternalLinkSpider(CrawlSpider):
>>>     name = 'external_link_spider'
>>>     allowed_domains = ['']
>>>     start_urls = ['']
>>>
>>>     rules = (Rule(LxmlLinkExtractor(allow=()), callback='parse_obj',
>>>                   follow=True),)
>>>
>>>     def parse_obj(self, response):
>>>         if not isinstance(response, HtmlResponse):
>>>             return
>>>         for link in LxmlLinkExtractor(
>>>                 allow=(), deny=self.allowed_domains).extract_links(response):
>>>             if not link.nofollow:
>>>                 yield LinkCrawlItem(domain=link.url)
>>>
>>> Here is the output of prefs():
>>>
>>> HtmlResponse              2   oldest: 0s ago
>>> ExternalLinkSpider        1   oldest: 3285s ago
>>> LinkCrawlItem             2   oldest: 0s ago
>>> Request             1663405   oldest: 3284s ago
>>>
>>> Any ideas or suggestions?
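P.S. To make sure I understand the "filter before the for loop" suggestion, here is a rough, untested sketch of dropping duplicate links before CrawlSpider turns them into Requests. process_links is the standard Rule hook for this; dedupe_links, seen_urls and the myproject.items import are names I made up, and this may not be exactly what fernando meant.

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors.lxmlhtml import LxmlLinkExtractor
from scrapy.http import HtmlResponse

# LinkCrawlItem lives in my own project; the module path here is a guess.
from myproject.items import LinkCrawlItem


class ExternalLinkSpider(CrawlSpider):
    name = 'external_link_spider'
    allowed_domains = ['']
    start_urls = ['']

    # process_links runs on each batch of extracted links *before*
    # CrawlSpider turns them into Requests, so filtered links never
    # reach the scheduler queue.
    rules = (Rule(LxmlLinkExtractor(allow=()), callback='parse_obj',
                  follow=True, process_links='dedupe_links'),)

    def __init__(self, *args, **kwargs):
        super(ExternalLinkSpider, self).__init__(*args, **kwargs)
        self.seen_urls = set()  # URLs already queued for crawling

    def dedupe_links(self, links):
        # Keep only links whose URL we have not seen yet.
        fresh = []
        for link in links:
            if link.url not in self.seen_urls:
                self.seen_urls.add(link.url)
                fresh.append(link)
        return fresh

    def parse_obj(self, response):
        # Unchanged from my original spider.
        if not isinstance(response, HtmlResponse):
            return
        for link in LxmlLinkExtractor(
                allow=(), deny=self.allowed_domains).extract_links(response):
            if not link.nofollow:
                yield LinkCrawlItem(domain=link.url)

The idea is that links dropped in dedupe_links never become Requests, so the live Request count should stop growing with every duplicate link on a page.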

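P.P.S. And this is how I read the JOBDIR suggestion: persist the scheduler queues to disk so pending Requests are not all held in memory (the directory name below is arbitrary).

# settings.py (sketch): keep the pending request queues on disk and make
# the crawl pausable/resumable.
JOBDIR = 'crawls/external_link_spider-run1'

The same thing can be passed on the command line, e.g. scrapy crawl external_link_spider -s JOBDIR=crawls/external_link_spider-run1.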