You can implement your own spider_idle signal handler to fetch new URLs from MongoDB whenever the spider goes idle. That way you don't need to run Scrapy again and again.
On 2014-9-25, at 10:12 PM, Drew Friestedt <[email protected]> wrote:

> I'm trying to set up a scrape that targets 1M unique URLs on the same site.
> The scrape has a proxy and captcha breaker, so it's running pretty slow and
> it's prone to crash because the target site goes down frequently (not from me
> scraping). Once the 1M pages are scraped, the scrape will grab about 1,000
> incremental URLs per day.
>
> URL format:
> http://www.foo.com/000000001  # the number sequence is a 'pin'
> http://www.foo.com/000000002
> http://www.foo.com/000000003
> etc.
>
> Does my proposed setup make sense?
>
> Set up MongoDB with 1M pins and a 'scraped' flag. For example:
> {'pin': '000000001', 'scraped': False}
>
> In the scrape I would set up a query to select 10,000 pins where 'scraped' =
> False. I would then append those 10,000 URLs to start_urls. The resulting
> scrape would get inserted into another collection, and each pin's 'scraped'
> flag would get set to True. After the 10,000 pins are scraped I would run the
> scrape again until all 1M pins are scraped.
>
> Does this setup make sense, or is there a more efficient way to do this?
>
> --
> You received this message because you are subscribed to the Google Groups
> "scrapy-users" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to [email protected].
> To post to this group, send email to [email protected].
> Visit this group at http://groups.google.com/group/scrapy-users.
> For more options, visit https://groups.google.com/d/optout.
