I'm trying to set up a scrape that targets 1M unique URLs on the same site. The scrape runs through a proxy and a captcha breaker, so it's pretty slow, and it's prone to crashing because the target site goes down frequently (not because of my scraping). Once the 1M pages are scraped, the scrape will pick up about 1,000 incremental URLs per day.
URL format (the number sequence is a 'pin'):

http://www.foo.com/000000001
http://www.foo.com/000000002
http://www.foo.com/000000003
etc.

My proposed setup:

1. Set up MongoDB with 1M pins and a 'scraped' flag, e.g. {'pin': '000000001', 'scraped': False}.
2. In the scrape, query for 10,000 pins where 'scraped' = False and append those 10,000 URLs to start_urls[].
3. Insert the scraped results into another collection and set each pin's 'scraped' flag to True.
4. After those 10,000 pins are scraped, run the scrape again, and repeat until all 1M pins are done.

Does this setup make sense, or is there a more efficient way to do it?
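In code, roughly what I'm picturing (an untested sketch; 'scrape_db', 'pins', 'pages' and the spider name are names I've made up, the proxy/captcha middleware would be configured separately in settings.py, and I've used an overridden start_requests() rather than start_urls[] so the batch can be built with a query when the spider starts):

import pymongo
import scrapy

client = pymongo.MongoClient()
db = client["scrape_db"]

def seed_pins(total=1000000):
    # One-time load: every pin starts out unscraped.
    db.pins.create_index("scraped")
    db.pins.insert_many(
        {"pin": "%09d" % i, "scraped": False} for i in range(1, total + 1)
    )

class PinSpider(scrapy.Spider):
    name = "pins"

    def start_requests(self):
        # Materialize the batch up front so the MongoDB cursor can't
        # time out during a long, slow crawl.
        batch = [d["pin"] for d in
                 db.pins.find({"scraped": False}).limit(10000)]
        for pin in batch:
            yield scrapy.Request("http://www.foo.com/" + pin,
                                 callback=self.parse,
                                 meta={"pin": pin})

    def parse(self, response):
        pin = response.meta["pin"]
        # Store the page in the second collection, then flip the flag,
        # so a crash mid-batch only loses the in-flight requests.
        db.pages.insert_one({"pin": pin, "html": response.text})
        db.pins.update_one({"pin": pin}, {"$set": {"scraped": True}})

Flipping the flag per page in parse() instead of at the end of the batch means a crash only loses the requests that were in flight, which matters given how often the site goes down. The same spider should also cover the ~1,000 incremental pins per day once they're seeded into the collection.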
