I am using Scrapy to scrape multiple sites and Scrapyd to run the spiders.
I have written 7 spiders, and each job processes at least 50 start URLs. In
total there are around 7000 URLs, about 1000 per spider.
As I start placing jobs in Scrapyd with 50 start URLs per job, all spiders
initially respond fine, but after a while they slow down dramatically. When I
run Scrapyd on localhost I get very high performance; as soon as I publish
jobs to the Scrapyd server, the request/response time drops drastically, and
after some time every start URL takes a long time to respond.
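For reference, this is roughly how I place the jobs, via Scrapyd's
schedule.json endpoint (a minimal sketch; the host, spider name, and the
start_urls spider argument are illustrative, not my exact code):

import requests

# One batch of 50 start URLs for a single job (URLs are placeholders)
batch = ["http://example.com/page%d" % i for i in range(50)]

# Schedule one job on the Scrapyd server; any extra POST fields are
# passed through to the spider as keyword arguments.
resp = requests.post(
    "http://my-ec2-host:6800/schedule.json",  # host is illustrative
    data={
        "project": "service_scraper",
        "spider": "my_spider",              # hypothetical spider name
        "start_urls": ",".join(batch),      # assumes the spider splits this itself
    },
)
print(resp.json())  # e.g. {"status": "ok", "jobid": "..."}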
My Scrapy settings look like this:
BOT_NAME = 'service_scraper'

SPIDER_MODULES = ['service_scraper.spiders']
NEWSPIDER_MODULE = 'service_scraper.spiders'

CONCURRENT_REQUESTS = 30
# DOWNLOAD_DELAY = 0
CONCURRENT_REQUESTS_PER_DOMAIN = 1000

ITEM_PIPELINES = {
    'service_scraper.pipelines.MongoInsert': 300,
}
MONGO_URL = "mongodb://xxxxx:yyyy"

EXTENSIONS = {'scrapy.contrib.feedexport.FeedExporter': None}

HTTPCACHE_ENABLED = True
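One thing I noticed while writing this up: Scrapy enforces both limits at
once, so with CONCURRENT_REQUESTS = 30 the per-domain value of 1000 can never
take effect and the crawl is capped at 30 in-flight requests in total. An
internally consistent pair would look something like this (the numbers are
purely illustrative, not a recommendation):

CONCURRENT_REQUESTS = 100
CONCURRENT_REQUESTS_PER_DOMAIN = 100  # per-domain cap is only meaningful when <= the global cap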
We tried changing CONCURRENT_REQUESTS and CONCURRENT_REQUESTS_PER_DOMAIN,
but nothing helped. Scrapyd is hosted on an AWS EC2 instance.
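Since several jobs run on the same instance at once, I suspect Scrapyd's
process limits may also matter: Scrapyd decides how many crawler processes to
run in parallel from max_proc / max_proc_per_cpu in scrapyd.conf, and on a
small EC2 instance every extra process competes for the same CPU and memory.
For reference, a minimal scrapyd.conf with the relevant knobs looks like this
(the values shown are, as far as I know, the documented defaults):

[scrapyd]
bind_address = 0.0.0.0
http_port = 6800
# max_proc = 0 means "max_proc_per_cpu * number of CPUs" processes at most
max_proc = 0
max_proc_per_cpu = 4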