Disabling DNSCACHE_ENABLED and HTTPCACHE_ENABLED doesn't help. Still facing 
the same issue.

Any idea how I can find out what the last few URLs are?

Also, I've already set DOWNLOAD_TIMEOUT = 15 and DNS_TIMEOUT = 10. In 
tcptrack, I don't see any connections lasting longer than 15 seconds.
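
For now I'm thinking of logging slow URLs with a small downloader 
middleware, something like this sketch (it uses the download_latency value 
Scrapy records in request.meta; the class name and the 5-second threshold 
are arbitrary choices of mine):

```python
import logging

logger = logging.getLogger(__name__)

# Arbitrary threshold; tune to whatever "slow" means for your crawl.
SLOW_THRESHOLD_SECS = 5.0

class SlowResponseLogger:
    """Downloader middleware sketch: logs any URL whose download latency
    (recorded by Scrapy in request.meta['download_latency']) exceeds
    SLOW_THRESHOLD_SECS, so the slowest URLs show up in the log."""

    def process_response(self, request, response, spider):
        latency = request.meta.get('download_latency', 0.0)
        if latency > SLOW_THRESHOLD_SECS:
            logger.info('Slow response (%.1fs): %s', latency, request.url)
        return response  # pass the response through unchanged
```

Hooked in via DOWNLOADER_MIDDLEWARES, that should at least surface which 
URLs are dragging once the rate drops. The telnet console is another 
route: est() at the telnet prompt prints engine status, and 
engine.downloader.active holds the requests currently in flight.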


On Monday, April 18, 2016 at 2:14:22 PM UTC+5, vishal singh wrote:
>
> Disable DNSCACHE_ENABLED and HTTPCACHE_ENABLED, and check whether you get 
> the same results.
> Also try opening the last URLs manually in the Scrapy shell and check 
> whether they take longer than usual.
>
> On Mon, Apr 18, 2016 at 5:57 AM, Hyder Alamgir <[email protected]> wrote:
>
>> I've got a set of 25,000+ urls that I need to scrape. I'm consistently 
>> seeing that after about 22,000 urls the crawl rate drops drastically.
>>
>> Take a look at these *log lines* to get some perspective:
>>
>> 2016-04-18 00:14:06 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), 
>> scraped 0 items (at 0 items/min)
>> 2016-04-18 00:15:06 [scrapy] INFO: Crawled 5324 pages (at *5324* 
>> pages/min), scraped 0 items (at 0 items/min)
>> 2016-04-18 00:16:06 [scrapy] INFO: Crawled 9475 pages (at *4151* 
>> pages/min), scraped 0 items (at 0 items/min)
>> 2016-04-18 00:17:06 [scrapy] INFO: Crawled 14416 pages (at *4941* 
>> pages/min), scraped 0 items (at 0 items/min)
>> 2016-04-18 00:18:07 [scrapy] INFO: Crawled 20575 pages (at *6159* 
>> pages/min), scraped 0 items (at 0 items/min)
>> 2016-04-18 00:19:06 [scrapy] INFO: Crawled 22036 pages (at *1461* 
>> pages/min), scraped 0 items (at 0 items/min)
>> 2016-04-18 00:20:06 [scrapy] INFO: Crawled 22106 pages (at *70* 
>> pages/min), scraped 0 items (at 0 items/min)
>> 2016-04-18 00:21:06 [scrapy] INFO: Crawled 22146 pages (at *40* 
>> pages/min), scraped 0 items (at 0 items/min)
>> 2016-04-18 00:22:06 [scrapy] INFO: Crawled 22189 pages (at *43* 
>> pages/min), scraped 0 items (at 0 items/min)
>> 2016-04-18 00:23:06 [scrapy] INFO: Crawled 22229 pages (at *40* 
>> pages/min), scraped 0 items (at 0 items/min)
>>
>> *Here are my settings*
>>
>> # -*- coding: utf-8 -*-
>>
>> BOT_NAME = 'crawler'
>>
>> SPIDER_MODULES = ['crawler.spiders']
>> NEWSPIDER_MODULE = 'crawler.spiders'
>>
>> CONCURRENT_REQUESTS = 10
>> REACTOR_THREADPOOL_MAXSIZE = 100
>> LOG_LEVEL = 'INFO'
>> COOKIES_ENABLED = False
>> RETRY_ENABLED = False
>> DOWNLOAD_TIMEOUT = 15
>> DNSCACHE_ENABLED = True
>> DNSCACHE_SIZE = 1024000
>> DNS_TIMEOUT = 10
>> DOWNLOAD_MAXSIZE = 1024000 # ~1 MB (10 MB would be 10485760)
>> DOWNLOAD_WARNSIZE = 819200 # 800 KB (8 MB would be 8388608)
>> REDIRECT_MAX_TIMES = 3
>> METAREFRESH_MAXDELAY = 10
>> ROBOTSTXT_OBEY = True
>> USER_AGENT = 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, 
>> like Gecko) Chrome/41.0.2228.0 Safari/537.36' #Chrome 41
>>
>> DEPTH_PRIORITY = 1
>> SCHEDULER_DISK_QUEUE = 'scrapy.squeues.PickleFifoDiskQueue'
>> SCHEDULER_MEMORY_QUEUE = 'scrapy.squeues.FifoMemoryQueue'
>>
>> #DOWNLOAD_DELAY = 1
>> #AUTOTHROTTLE_ENABLED = True
>> HTTPCACHE_ENABLED = True
>> HTTPCACHE_EXPIRATION_SECS = 604800 # 7 days
>> COMPRESSION_ENABLED = True
>>
>> DOWNLOADER_MIDDLEWARES = {
>>     'scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware': 100,
>>     'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware': 300,
>>     'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware': 350,
>>     'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': 400,
>>     'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware': 550,
>>     'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware': 580,
>>     'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 590,
>>     'scrapy.downloadermiddlewares.redirect.RedirectMiddleware': 600,
>>     'crawler.middlewares.RandomizeProxies': 740,
>>     'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware': 750,
>>     'scrapy.downloadermiddlewares.chunked.ChunkedTransferMiddleware': 830,
>>     'scrapy.downloadermiddlewares.stats.DownloaderStats': 850,
>>     'scrapy.downloadermiddlewares.httpcache.HttpCacheMiddleware': 900,
>> }
>>
>> PROXY_LIST = '/etc/scrapyd/proxy_list.txt'
>>
>> Memory and CPU consumption is *less than 10%*
>> tcptrack shows *no unusual network activity*
>> iostat shows *negligible disk i/o*
>>
>>
>> What can I look at to debug this?
>>
>> -- 
>> You received this message because you are subscribed to the Google Groups 
>> "scrapy-users" group.
>> To unsubscribe from this group and stop receiving emails from it, send an 
>> email to [email protected] <javascript:>.
>> To post to this group, send email to [email protected] 
>> <javascript:>.
>> Visit this group at https://groups.google.com/group/scrapy-users.
>> For more options, visit https://groups.google.com/d/optout.
>>
>
>
