Ok, I was wrong that updating Scrapy fixed it. I had just accidentally set the scan depth to 10k instead of 100k. When I set it back, the same problem returns: the number of Requests just keeps going up.
>>> prefs()
Live References
LinkCrawlItem                      33   oldest: 3s ago
HtmlResponse                       59   oldest: 5s ago
ExternalLinkSpider                  1   oldest: 352s ago
Request                        411836   oldest: 349s ago

The oldest Request stays in memory no matter what. At a 100k page count, on some sites the spider can reach 40 GB of memory use, which is clearly broken.

Can you elaborate on the duplicate requests filter? I don't work with Requests anywhere in my code, only with Responses.

As for JOBDIR - I guess it will just become slow, and I don't think it's normal for this spider to consume 30 GB of memory in the first place. (I've put sketches of both the filter idea and the JOBDIR setup at the end of this mail.)

On Monday, July 27, 2015 at 7:48:13 PM UTC+3, fernando vasquez wrote:
>
> You are not processing Requests as fast as you capture them. I had the
> same problem; however, the cause could be different. In my case the link
> extractor was capturing duplicate Requests, so I decided to filter out the
> duplicates. The problem with Scrapy is that the duplicates filter works
> after the link extractor has saved the Request, so you end up with tons of
> Requests.
>
> In conclusion, you might have duplicate requests; just filter them before
> the for loop.
>
> On Thursday, July 23, 2015 at 12:33:19 (UTC-5), ShapeR wrote:
>>
>> My spider has a serious memory leak. After 15 minutes of running, its
>> memory is at 5 GB, and Scrapy reports (using prefs()) that there are 900k
>> Request objects and little else. What can be the reason for this high
>> number of live Request objects? The Request count only goes up and never
>> comes down. All other objects are close to zero.
>>
>> My spider looks like this:
>>
>> class ExternalLinkSpider(CrawlSpider):
>>     name = 'external_link_spider'
>>     allowed_domains = ['']
>>     start_urls = ['']
>>
>>     rules = (Rule(LxmlLinkExtractor(allow=()), callback='parse_obj',
>>                   follow=True),)
>>
>>     def parse_obj(self, response):
>>         if not isinstance(response, HtmlResponse):
>>             return
>>         for link in LxmlLinkExtractor(allow=(),
>>                 deny=self.allowed_domains).extract_links(response):
>>             if not link.nofollow:
>>                 yield LinkCrawlItem(domain=link.url)
>>
>> Here is the output of prefs():
>>
>> HtmlResponse              2   oldest: 0s ago
>> ExternalLinkSpider        1   oldest: 3285s ago
>> LinkCrawlItem             2   oldest: 0s ago
>> Request             1663405   oldest: 3284s ago
>>
>> Any ideas or suggestions?
>>
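For reference, here is a minimal sketch of the "filter before the for loop" idea as I understand it: keep an in-memory set of URLs that have already produced an item and skip them on later pages. The seen_urls set, the placeholder LinkCrawlItem definition, and the example.com domains are my own assumptions (the original post left the domains blank), and the imports assume Scrapy 1.0 module paths.

import scrapy
from scrapy.http import HtmlResponse
from scrapy.linkextractors.lxmlhtml import LxmlLinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class LinkCrawlItem(scrapy.Item):
    # placeholder for the project's real item definition
    domain = scrapy.Field()


class ExternalLinkSpider(CrawlSpider):
    name = 'external_link_spider'
    allowed_domains = ['example.com']      # placeholder; left blank in the original post
    start_urls = ['http://example.com/']   # placeholder

    rules = (Rule(LxmlLinkExtractor(allow=()), callback='parse_obj', follow=True),)

    def __init__(self, *args, **kwargs):
        super(ExternalLinkSpider, self).__init__(*args, **kwargs)
        # hypothetical in-memory set of external URLs we have already yielded
        self.seen_urls = set()

    def parse_obj(self, response):
        if not isinstance(response, HtmlResponse):
            return
        links = LxmlLinkExtractor(
            allow=(), deny=self.allowed_domains).extract_links(response)
        for link in links:
            # skip nofollow links and URLs that already produced an item
            if link.nofollow or link.url in self.seen_urls:
                continue
            self.seen_urls.add(link.url)
            yield LinkCrawlItem(domain=link.url)

Note that this only deduplicates the yielded items. The Requests counted by prefs() come from the CrawlSpider rule, and Scrapy's scheduler already drops duplicate Requests with its default RFPDupeFilter, so a large queue of pending unique URLs would still sit in memory.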

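And for completeness, this is what I understand the JOBDIR suggestion to mean: with a job directory set, the scheduler keeps its pending request queue (and the dupefilter state) on disk instead of in memory. The directory name below is just an example.

# settings.py
JOBDIR = 'crawls/external_link_spider-1'

# or on the command line:
#   scrapy crawl external_link_spider -s JOBDIR=crawls/external_link_spider-1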