Triggering of the built-in spider_idle signal has some preconditions; see http://doc.scrapy.org/en/latest/topics/signals.html.
You will only see a handful of URLs in flight at a time, not 5000, so check the built-in settings CONCURRENT_ITEMS, CONCURRENT_REQUESTS, CONCURRENT_REQUESTS_PER_DOMAIN, CONCURRENT_REQUESTS_PER_IP and DOWNLOAD_DELAY, and make sure each of them has an appropriate value. The spider_idle signal gives a spider the ability to keep running as long as it has new URLs to schedule. If you need more speed, run multiple spiders (via Scrapyd, or launched manually) on several machines. An example settings snippet and a rough spider_idle handler sketch follow at the end of this message.

On Oct 3, 2014, at 11:34 PM, Drew Friestedt <[email protected]> wrote:

> I got everything working great, but for the final piece, restarting the
> scrape in spider_idle, self.start_requests() under spider_idle does not seem
> to work. I posted this question on Stack Overflow, but the initial feedback I
> got was to load 1M URLs in start_requests(), which I'm trying to avoid
> entirely.
>
> http://stackoverflow.com/questions/26179390/scrapy-spider-idle-call-to-restart-scrape
>
> Thx
>
> On Thursday, September 25, 2014 9:56:04 PM UTC-5, lnxpgn wrote:
> You can implement your own spider_idle signal handler to get new URLs from
> MongoDB when the spider is idle. That way you don't need to run Scrapy again
> and again.
>
> On Sep 25, 2014, at 10:12 PM, Drew Friestedt <[email protected]> wrote:
>
>> I'm trying to set up a scrape that targets 1M unique URLs on the same site.
>> The scrape has a proxy and a captcha breaker, so it's running pretty slowly,
>> and it's prone to crash because the target site goes down frequently (not
>> from me scraping it). Once the 1M pages are scraped, the scrape will grab
>> about 1000 incremental URLs per day.
>>
>> URL format:
>> http://www.foo.com/000000001 # the number sequence is a 'pin'
>> http://www.foo.com/000000002
>> http://www.foo.com/000000003
>> etc.
>>
>> Does my proposed setup make sense?
>>
>> Set up MongoDB with 1M pins and a 'scraped' flag. For example:
>> {'pin': '000000001', 'scraped': False}
>>
>> In the scrape I would set up a query to select 10,000 pins where 'scraped' =
>> False. I would then append 10,000 URLs to start_urls[]. The resulting
>> scrape would get inserted into another collection and the pin's 'scraped'
>> flag would get set to True. After the 10,000 pins are scraped, I would run
>> the scrape again until all 1M pins are scraped.
>>
>> Does this setup make sense, or is there a more efficient way to do this?
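As promised above, this is roughly what I mean by checking the concurrency settings. The numbers are only illustrative (mostly Scrapy's defaults), not recommendations for your particular site:

# settings.py (illustrative values only; tune for your target site)
CONCURRENT_REQUESTS = 16            # total concurrent requests made by the downloader
CONCURRENT_REQUESTS_PER_DOMAIN = 8  # cap per domain (all your URLs are on one domain)
CONCURRENT_REQUESTS_PER_IP = 0      # 0 = disabled; if non-zero it overrides the per-domain cap
CONCURRENT_ITEMS = 100              # items processed in parallel per response in the pipelines
DOWNLOAD_DELAY = 0.5                # seconds to wait between requests to the same site

With a proxy and a captcha breaker in the loop you will probably end up raising DOWNLOAD_DELAY rather than the concurrency numbers.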

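And here is a rough, untested sketch of the kind of spider_idle handler I have in mind, based on the pin collection from your original message. The 'foo' database, the 'pins' collection, the URL pattern and the batch size are just taken from your description, and the signal wiring and engine.crawl() call may need small adjustments depending on your Scrapy and pymongo versions:

import pymongo
from scrapy import Spider, Request, signals
from scrapy.exceptions import DontCloseSpider

class PinSpider(Spider):
    name = 'pins'

    def __init__(self, *args, **kwargs):
        super(PinSpider, self).__init__(*args, **kwargs)
        self.db = pymongo.MongoClient()['foo']

    @classmethod
    def from_crawler(cls, crawler, *args, **kwargs):
        spider = super(PinSpider, cls).from_crawler(crawler, *args, **kwargs)
        # run our handler every time the scheduler runs out of requests
        crawler.signals.connect(spider.handle_idle, signal=signals.spider_idle)
        return spider

    def start_requests(self):
        # first batch; later batches are scheduled from the idle handler
        for request in self.next_batch():
            yield request

    def next_batch(self, size=10000):
        # pins that have not been scraped yet
        for doc in self.db.pins.find({'scraped': False}).limit(size):
            yield Request('http://www.foo.com/' + doc['pin'], callback=self.parse_pin)

    def parse_pin(self, response):
        pin = response.url.rsplit('/', 1)[-1]
        # ... extract the item into your other collection here ...
        # pymongo 2.x style update; mark the pin as done
        self.db.pins.update({'pin': pin}, {'$set': {'scraped': True}})

    def handle_idle(self, spider):
        scheduled = False
        for request in self.next_batch():
            # hand requests straight to the engine; returning them from a
            # signal handler has no effect
            self.crawler.engine.crawl(request, spider)
            scheduled = True
        if scheduled:
            # keep the spider alive instead of letting it close as usual
            raise DontCloseSpider

The two pieces your version is probably missing are that requests created in a spider_idle handler have to be scheduled explicitly through the engine, and that you have to raise DontCloseSpider, otherwise the spider simply closes once the scheduler is empty.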