Disable DNSCACHE_ENABLED and HTTPCACHE_ENABLED, and check whether you
still get the same results.
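
For a diagnostic run, both caches can be switched off in settings.py. This is only a sketch that overrides the two values from the settings quoted below; everything else stays as it is:

```python
# settings.py (diagnostic run only; revert once the slowdown is understood)
DNSCACHE_ENABLED = False   # skip the in-memory DNS cache, resolve on every request
HTTPCACHE_ENABLED = False  # skip the HTTP cache, fetch every response from the network
```

The same overrides can be passed for a single run with `-s DNSCACHE_ENABLED=False -s HTTPCACHE_ENABLED=False` on the `scrapy crawl` command line, which avoids editing the file.
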
Also try opening the last URLs manually in scrapy shell and check whether
they take longer than usual to load.
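
If there are too many URLs to check by hand, a rough timing can also be scripted. The sketch below is an assumption on my side, not something from your project: it uses plain urllib instead of scrapy shell, so it bypasses the proxy middleware and Scrapy's downloader, and `time_fetch` is a made-up helper name. The 15 second default mirrors your DOWNLOAD_TIMEOUT.

```python
import sys
import time
import urllib.request


def time_fetch(url, timeout=15, opener=urllib.request.urlopen):
    """Return (seconds_elapsed, error) for a single GET of `url`.

    `opener` is injectable so the function can be tried without network access.
    """
    start = time.perf_counter()
    error = None
    try:
        with opener(url, timeout=timeout) as response:
            response.read()  # read the whole body so the timing covers the download
    except Exception as exc:  # timeouts, DNS failures, HTTP errors, ...
        error = exc
    return time.perf_counter() - start, error


if __name__ == "__main__":
    # URLs are taken from the command line, e.g. the ones crawled after the slowdown.
    for url in sys.argv[1:]:
        elapsed, error = time_fetch(url)
        print(f"{elapsed:6.2f}s  {url}  {error or 'ok'}")
```

If the URLs from the slow tail take many seconds here as well, the target hosts are the likely cause; if they are fast, the proxies or the crawler configuration are the more likely suspects.
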

On Mon, Apr 18, 2016 at 5:57 AM, Hyder Alamgir <[email protected]>
wrote:

> I've got a set of 25,000+ urls that I need to scrape. I'm consistently
> seeing that after about 22,000 urls the crawl rate drops drastically.
>
> Take a look at these *log lines* to get some perspective:
>
> 2016-04-18 00:14:06 [scrapy] INFO: Crawled 0 pages (at 0 pages/min),
> scraped 0 items (at 0 items/min)
> 2016-04-18 00:15:06 [scrapy] INFO: Crawled 5324 pages (at *5324*
> pages/min), scraped 0 items (at 0 items/min)
> 2016-04-18 00:16:06 [scrapy] INFO: Crawled 9475 pages (at *4151*
> pages/min), scraped 0 items (at 0 items/min)
> 2016-04-18 00:17:06 [scrapy] INFO: Crawled 14416 pages (at *4941*
> pages/min), scraped 0 items (at 0 items/min)
> 2016-04-18 00:18:07 [scrapy] INFO: Crawled 20575 pages (at *6159*
> pages/min), scraped 0 items (at 0 items/min)
> 2016-04-18 00:19:06 [scrapy] INFO: Crawled 22036 pages (at *1461*
> pages/min), scraped 0 items (at 0 items/min)
> 2016-04-18 00:20:06 [scrapy] INFO: Crawled 22106 pages (at *70*
> pages/min), scraped 0 items (at 0 items/min)
> 2016-04-18 00:21:06 [scrapy] INFO: Crawled 22146 pages (at *40*
> pages/min), scraped 0 items (at 0 items/min)
> 2016-04-18 00:22:06 [scrapy] INFO: Crawled 22189 pages (at *43*
> pages/min), scraped 0 items (at 0 items/min)
> 2016-04-18 00:23:06 [scrapy] INFO: Crawled 22229 pages (at *40*
> pages/min), scraped 0 items (at 0 items/min)
>
> *Here are my settings:*
>
> # -*- coding: utf-8 -*-
>
> BOT_NAME = 'crawler'
>
> SPIDER_MODULES = ['crawler.spiders']
> NEWSPIDER_MODULE = 'crawler.spiders'
>
> CONCURRENT_REQUESTS = 10
> REACTOR_THREADPOOL_MAXSIZE = 100
> LOG_LEVEL = 'INFO'
> COOKIES_ENABLED = False
> RETRY_ENABLED = False
> DOWNLOAD_TIMEOUT = 15
> DNSCACHE_ENABLED = True
> DNSCACHE_SIZE = 1024000
> DNS_TIMEOUT = 10
> DOWNLOAD_MAXSIZE = 1024000 # 1000 KiB (about 1 MB)
> DOWNLOAD_WARNSIZE = 819200 # 800 KiB
> REDIRECT_MAX_TIMES = 3
> METAREFRESH_MAXDELAY = 10
> ROBOTSTXT_OBEY = True
> USER_AGENT = 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.36'  # Chrome 41
>
> DEPTH_PRIORITY = 1
> SCHEDULER_DISK_QUEUE = 'scrapy.squeues.PickleFifoDiskQueue'
> SCHEDULER_MEMORY_QUEUE = 'scrapy.squeues.FifoMemoryQueue'
>
> #DOWNLOAD_DELAY = 1
> #AUTOTHROTTLE_ENABLED = True
> HTTPCACHE_ENABLED = True
> HTTPCACHE_EXPIRATION_SECS = 604800 # 7 days
> COMPRESSION_ENABLED = True
>
> DOWNLOADER_MIDDLEWARES = {
>     'scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware': 100,
>     'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware': 300,
>     'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware': 350,
>     'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': 400,
>     'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware': 550,
>     'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware': 580,
>     'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 590,
>     'scrapy.downloadermiddlewares.redirect.RedirectMiddleware': 600,
>     'crawler.middlewares.RandomizeProxies': 740,
>     'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware': 750,
>     'scrapy.downloadermiddlewares.chunked.ChunkedTransferMiddleware': 830,
>     'scrapy.downloadermiddlewares.stats.DownloaderStats': 850,
>     'scrapy.downloadermiddlewares.httpcache.HttpCacheMiddleware': 900,
> }
>
> PROXY_LIST = '/etc/scrapyd/proxy_list.txt'
>
> Memory and CPU consumption is *less than 10%*
> tcptrack shows *no unusual network activity*
> iostat shows *negligible disk i/o*
>
>
> What can I look at to debug this?
>
> --
> You received this message because you are subscribed to the Google Groups
> "scrapy-users" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to [email protected].
> To post to this group, send email to [email protected].
> Visit this group at https://groups.google.com/group/scrapy-users.
> For more options, visit https://groups.google.com/d/optout.
>
