Triggering of the built-in spider_idle signal has some preconditions; see http://doc.scrapy.org/en/latest/topics/signals.html.
You will only see a handful of URLs in flight at a time, not 5000, so check the built-in settings CONCURRENT_ITEMS, CONCURRENT_REQUESTS, CONCURRENT_REQUESTS_PER_DOMAIN, CONCURRENT_REQUESTS_PER_IP and DOWNLOAD_DELAY, and make sure each of them has an appropriate value. The spider_idle signal gives a spider the ability to keep running as long as it has new URLs to schedule. If you need more speed, run multiple spiders (via Scrapyd, or launched manually) on several machines. An example settings snippet and a rough spider_idle handler sketch follow at the end of this message.

On Oct 3, 2014, at 11:34 PM, Drew Friestedt <[email protected]> wrote:

> I got everything working great, but for the final piece, restarting the
> scrape in spider_idle, self.start_requests() under spider_idle does not seem
> to work. I posted this question on Stack Overflow, but the initial feedback I
> got was to load 1M URLs in start_requests(), which I'm trying to avoid
> entirely.
>
> http://stackoverflow.com/questions/26179390/scrapy-spider-idle-call-to-restart-scrape
>
> Thx
>
> On Thursday, September 25, 2014 9:56:04 PM UTC-5, lnxpgn wrote:
> You can implement your own spider_idle signal handler to get new URLs from
> MongoDB when the spider is idle. That way you don't need to run Scrapy again
> and again.
>
> On Sep 25, 2014, at 10:12 PM, Drew Friestedt <[email protected]> wrote:
>
>> I'm trying to set up a scrape that targets 1M unique URLs on the same site.
>> The scrape has a proxy and a captcha breaker, so it's running pretty slowly,
>> and it's prone to crash because the target site goes down frequently (not
>> from me scraping it). Once the 1M pages are scraped, the scrape will grab
>> about 1000 incremental URLs per day.
>>
>> URL format:
>> http://www.foo.com/000000001 # the number sequence is a 'pin'
>> http://www.foo.com/000000002
>> http://www.foo.com/000000003
>> etc.
>>
>> Does my proposed setup make sense?
>>
>> Set up MongoDB with 1M pins and a 'scraped' flag. For example:
>> {'pin': '000000001', 'scraped': False}
>>
>> In the scrape I would set up a query to select 10,000 pins where 'scraped' =
>> False. I would then append 10,000 URLs to start_urls[]. The resulting
>> scrape would get inserted into another collection and the pin's 'scraped'
>> flag would get set to True. After the 10,000 pins are scraped, I would run
>> the scrape again until all 1M pins are scraped.
>>
>> Does this setup make sense, or is there a more efficient way to do this?
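As promised above, this is roughly what I mean by checking the concurrency settings. The numbers are only illustrative (mostly Scrapy's defaults), not recommendations for your particular site:

# settings.py (illustrative values only; tune for your target site)
CONCURRENT_REQUESTS = 16            # total concurrent requests made by the downloader
CONCURRENT_REQUESTS_PER_DOMAIN = 8  # cap per domain (all your URLs are on one domain)
CONCURRENT_REQUESTS_PER_IP = 0      # 0 = disabled; if non-zero it overrides the per-domain cap
CONCURRENT_ITEMS = 100              # items processed in parallel per response in the pipelines
DOWNLOAD_DELAY = 0.5                # seconds to wait between requests to the same site

With a proxy and a captcha breaker in the loop you will probably end up raising DOWNLOAD_DELAY rather than the concurrency numbers.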

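And here is a rough, untested sketch of the kind of spider_idle handler I have in mind, based on the pin collection from your original message. The 'foo' database, the 'pins' collection, the URL pattern and the batch size are just taken from your description, and the signal wiring and engine.crawl() call may need small adjustments depending on your Scrapy and pymongo versions:

import pymongo
from scrapy import Spider, Request, signals
from scrapy.exceptions import DontCloseSpider

class PinSpider(Spider):
    name = 'pins'

    def __init__(self, *args, **kwargs):
        super(PinSpider, self).__init__(*args, **kwargs)
        self.db = pymongo.MongoClient()['foo']

    @classmethod
    def from_crawler(cls, crawler, *args, **kwargs):
        spider = super(PinSpider, cls).from_crawler(crawler, *args, **kwargs)
        # run our handler every time the scheduler runs out of requests
        crawler.signals.connect(spider.handle_idle, signal=signals.spider_idle)
        return spider

    def start_requests(self):
        # first batch; later batches are scheduled from the idle handler
        for request in self.next_batch():
            yield request

    def next_batch(self, size=10000):
        # pins that have not been scraped yet
        for doc in self.db.pins.find({'scraped': False}).limit(size):
            yield Request('http://www.foo.com/' + doc['pin'], callback=self.parse_pin)

    def parse_pin(self, response):
        pin = response.url.rsplit('/', 1)[-1]
        # ... extract the item into your other collection here ...
        # pymongo 2.x style update; mark the pin as done
        self.db.pins.update({'pin': pin}, {'$set': {'scraped': True}})

    def handle_idle(self, spider):
        scheduled = False
        for request in self.next_batch():
            # hand requests straight to the engine; returning them from a
            # signal handler has no effect
            self.crawler.engine.crawl(request, spider)
            scheduled = True
        if scheduled:
            # keep the spider alive instead of letting it close as usual
            raise DontCloseSpider

The two pieces your version is probably missing are that requests created in a spider_idle handler have to be scheduled explicitly through the engine, and that you have to raise DontCloseSpider, otherwise the spider simply closes once the scheduler is empty.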