I got everything working great, but I'm stuck on the final piece: restarting the 
scrape from spider_idle.  Calling self.start_requests() inside the spider_idle 
handler does not seem to work.  I posted this question on Stack Overflow, but 
the initial feedback I got was to load all 1M URLs in start_requests(), which 
is exactly what I'm trying to avoid.

http://stackoverflow.com/questions/26179390/scrapy-spider-idle-call-to-restart-scrape

Thx




On Thursday, September 25, 2014 9:56:04 PM UTC-5, lnxpgn wrote:
>
> You can implement your own spider_idle signal handler to fetch new URLs 
> from MongoDB whenever the spider goes idle. That way you don't need to run 
> Scrapy again and again.
>
> On 2014-9-25, at 10:12 PM, Drew Friestedt <[email protected]> 
> wrote:
>
> I'm trying to set up a scrape that targets 1M unique URLs on the same 
> site.  The scrape runs through a proxy and a captcha breaker, so it's 
> pretty slow, and it's prone to crash because the target site goes down 
> frequently (not from my scraping).  Once the 1M pages are scraped, the 
> scrape will pick up about 1,000 incremental URLs per day.
>
> URL Format:
> http://www.foo.com/000000001 #the number sequence is a 'pin'
> http://www.foo.com/000000002
> http://www.foo.com/000000003
> etc..
>
> Does my proposed setup make sense?  
>
> Set up MongoDB with 1M pins and a scraped flag.  For example:
> {'pin': '000000001', 'scraped': False}
>
> In the scrape I would run a query to select 10,000 pins where 'scraped' 
> = False, then append those 10,000 URLs to start_urls.  The scraped 
> results would be inserted into another collection and each pin's 
> 'scraped' flag set to True.  Once those 10,000 pins are done I would 
> run the scrape again, repeating until all 1M pins are scraped.
>
> Does this setup make sense or is there a more efficient way to do this?  
>
> -- 
> You received this message because you are subscribed to the Google Groups 
> "scrapy-users" group.
> To unsubscribe from this group and stop receiving emails from it, send an 
> email to [email protected].
> To post to this group, send email to [email protected].
> Visit this group at http://groups.google.com/group/scrapy-users.
> For more options, visit https://groups.google.com/d/optout.
>
>
>

