Disabling DNSCACHE_ENABLED and HTTPCACHE_ENABLED doesn't help. Still facing the same issue.
Any idea how I can find out what the last few URLs are? Besides, I've already set DOWNLOAD_TIMEOUT to 15 and DNS_TIMEOUT to 10. In tcptrack, I don't see any connections lasting longer than 15 seconds.

On Monday, April 18, 2016 at 2:14:22 PM UTC+5, vishal singh wrote:
>
> Disable DNSCACHE_ENABLED and HTTPCACHE_ENABLED, and check whether you
> get the same results.
> Try opening the last URLs manually in scrapy shell and check whether they
> take longer than usual.
>
> On Mon, Apr 18, 2016 at 5:57 AM, Hyder Alamgir wrote:
>
>> I've got a set of 25,000+ URLs that I need to scrape. I'm consistently
>> seeing that after about 22,000 URLs the crawl rate drops drastically.
>>
>> Take a look at these *log lines* to get some perspective:
>>
>> 2016-04-18 00:14:06 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
>> 2016-04-18 00:15:06 [scrapy] INFO: Crawled 5324 pages (at *5324* pages/min), scraped 0 items (at 0 items/min)
>> 2016-04-18 00:16:06 [scrapy] INFO: Crawled 9475 pages (at *4151* pages/min), scraped 0 items (at 0 items/min)
>> 2016-04-18 00:17:06 [scrapy] INFO: Crawled 14416 pages (at *4941* pages/min), scraped 0 items (at 0 items/min)
>> 2016-04-18 00:18:07 [scrapy] INFO: Crawled 20575 pages (at *6159* pages/min), scraped 0 items (at 0 items/min)
>> 2016-04-18 00:19:06 [scrapy] INFO: Crawled 22036 pages (at *1461* pages/min), scraped 0 items (at 0 items/min)
>> 2016-04-18 00:20:06 [scrapy] INFO: Crawled 22106 pages (at *70* pages/min), scraped 0 items (at 0 items/min)
>> 2016-04-18 00:21:06 [scrapy] INFO: Crawled 22146 pages (at *40* pages/min), scraped 0 items (at 0 items/min)
>> 2016-04-18 00:22:06 [scrapy] INFO: Crawled 22189 pages (at *43* pages/min), scraped 0 items (at 0 items/min)
>> 2016-04-18 00:23:06 [scrapy] INFO: Crawled 22229 pages (at *40* pages/min), scraped 0 items (at 0 items/min)
>>
>> *Here are my settings:*
>>
>> # -*- coding: utf-8 -*-
>>
>> BOT_NAME = 'crawler'
>>
>> SPIDER_MODULES = ['crawler.spiders']
>> NEWSPIDER_MODULE = 'crawler.spiders'
>>
>> CONCURRENT_REQUESTS = 10
>> REACTOR_THREADPOOL_MAXSIZE = 100
>> LOG_LEVEL = 'INFO'
>> COOKIES_ENABLED = False
>> RETRY_ENABLED = False
>> DOWNLOAD_TIMEOUT = 15
>> DNSCACHE_ENABLED = True
>> DNSCACHE_SIZE = 1024000
>> DNS_TIMEOUT = 10
>> DOWNLOAD_MAXSIZE = 1024000 # 10 MB
>> DOWNLOAD_WARNSIZE = 819200 # 8 MB
>> REDIRECT_MAX_TIMES = 3
>> METAREFRESH_MAXDELAY = 10
>> ROBOTSTXT_OBEY = True
>> USER_AGENT = 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.36' #Chrome 41
>>
>> DEPTH_PRIORITY = 1
>> SCHEDULER_DISK_QUEUE = 'scrapy.squeues.PickleFifoDiskQueue'
>> SCHEDULER_MEMORY_QUEUE = 'scrapy.squeues.FifoMemoryQueue'
>>
>> #DOWNLOAD_DELAY = 1
>> #AUTOTHROTTLE_ENABLED = True
>> HTTPCACHE_ENABLED = True
>> HTTPCACHE_EXPIRATION_SECS = 604800 # 7 days
>> COMPRESSION_ENABLED = True
>>
>> DOWNLOADER_MIDDLEWARES = {
>>     'scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware': 100,
>>     'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware': 300,
>>     'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware': 350,
>>     'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': 400,
>>     'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware': 550,
>>     'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware': 580,
>>     'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 590,
>>     'scrapy.downloadermiddlewares.redirect.RedirectMiddleware': 600,
>>     'crawler.middlewares.RandomizeProxies': 740,
>>     'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware': 750,
>>     'scrapy.downloadermiddlewares.chunked.ChunkedTransferMiddleware': 830,
>>     'scrapy.downloadermiddlewares.stats.DownloaderStats': 850,
>>     'scrapy.downloadermiddlewares.httpcache.HttpCacheMiddleware': 900,
>> }
>>
>> PROXY_LIST = '/etc/scrapyd/proxy_list.txt'
>>
>> Memory and CPU consumption is *less than 10%*
>> tcptrack shows *no unusual network activity*
>> iostat shows *negligible disk i/o*
>>
>> What can I look at to debug this?
>>
>> --
>> You received this message because you are subscribed to the Google Groups
>> "scrapy-users" group.
>> To unsubscribe from this group and stop receiving emails from it, send an
>> email to [email protected].
>> To post to this group, send email to [email protected].
>> Visit this group at https://groups.google.com/group/scrapy-users.
>> For more options, visit https://groups.google.com/d/optout.
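
On the question of finding out which URLs are still outstanding when the rate collapses: one option is a small downloader middleware that records when each request starts and forgets it once a response or exception comes back. What follows is only a sketch under assumptions, not something shipped with Scrapy. The class name `InFlightTracker`, its `stale_after` and `report_every` parameters, and the `pending()` method are all names made up for illustration. It relies only on the standard middleware hooks (`process_request`, `process_response`, `process_exception`), so it needs no Scrapy imports.

```python
import logging
import time

logger = logging.getLogger(__name__)


class InFlightTracker:
    """Hypothetical downloader middleware that remembers unfinished requests."""

    def __init__(self, stale_after=15.0, report_every=60.0, clock=time.monotonic):
        self.stale_after = stale_after    # seconds before a request counts as stuck
        self.report_every = report_every  # minimum seconds between log reports
        self._clock = clock
        self._started = {}                # id(request) -> (url, start time)
        self._last_report = clock()

    def process_request(self, request, spider):
        # Remember when this request passed through the middleware.
        self._started[id(request)] = (request.url, self._clock())
        return None  # let the request continue as usual

    def process_response(self, request, response, spider):
        self._finish(request)
        return response

    def process_exception(self, request, exception, spider):
        self._finish(request)
        return None  # leave exception handling to the other middlewares

    def pending(self, older_than=0.0):
        """Return (age in seconds, url) pairs, oldest first."""
        now = self._clock()
        waiting = [(now - t0, url) for url, t0 in self._started.values()]
        return sorted((w for w in waiting if w[0] >= older_than), reverse=True)

    def _finish(self, request):
        # Forget the finished request, then occasionally report the stale ones.
        self._started.pop(id(request), None)
        now = self._clock()
        if now - self._last_report >= self.report_every:
            self._last_report = now
            for age, url in self.pending(self.stale_after):
                logger.warning("still pending after %.0fs: %s", age, url)
```

To try it, the class could be registered in DOWNLOADER_MIDDLEWARES under whatever module path it is saved in (for example a hypothetical 'crawler.middlewares.InFlightTracker') with a high order value such as 950, so that it sits close to the downloader and sees responses before the redirect middleware can replace them. The reported age is measured from the moment the request passed through this middleware, so it may include time spent waiting behind other requests and not only network time. Any URL that keeps showing up in the warnings is a candidate for opening manually in scrapy shell.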
