I'm trying to set up a scrape that targets 1M unique URLs on the same site. The scrape runs through a proxy and a captcha breaker, so it's pretty slow, and it's prone to crashing because the target site goes down frequently (not because of my scraping). Once the 1M pages are scraped, the scrape will pick up about 1,000 incremental URLs per day.
URL format (the number sequence is a 'pin'):

http://www.foo.com/000000001
http://www.foo.com/000000002
http://www.foo.com/000000003
etc.

My proposed setup:

1. Set up MongoDB with 1M pins and a 'scraped' flag, e.g. {'pin': '000000001', 'scraped': False}.
2. In the scrape, query for 10,000 pins where 'scraped' = False and append those 10,000 URLs to start_urls[].
3. Insert the scraped results into another collection and set each pin's 'scraped' flag to True.
4. After those 10,000 pins are scraped, run the scrape again, and repeat until all 1M pins are done.

Does this setup make sense, or is there a more efficient way to do it?
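In code, roughly what I'm picturing (an untested sketch; 'scrape_db', 'pins', 'pages' and the spider name are names I've made up, the proxy/captcha middleware would be configured separately in settings.py, and I've used an overridden start_requests() rather than start_urls[] so the batch can be built with a query when the spider starts):

import pymongo
import scrapy

client = pymongo.MongoClient()
db = client["scrape_db"]

def seed_pins(total=1000000):
    # One-time load: every pin starts out unscraped.
    db.pins.create_index("scraped")
    db.pins.insert_many(
        {"pin": "%09d" % i, "scraped": False} for i in range(1, total + 1)
    )

class PinSpider(scrapy.Spider):
    name = "pins"

    def start_requests(self):
        # Materialize the batch up front so the MongoDB cursor can't
        # time out during a long, slow crawl.
        batch = [d["pin"] for d in
                 db.pins.find({"scraped": False}).limit(10000)]
        for pin in batch:
            yield scrapy.Request("http://www.foo.com/" + pin,
                                 callback=self.parse,
                                 meta={"pin": pin})

    def parse(self, response):
        pin = response.meta["pin"]
        # Store the page in the second collection, then flip the flag,
        # so a crash mid-batch only loses the in-flight requests.
        db.pages.insert_one({"pin": pin, "html": response.text})
        db.pins.update_one({"pin": pin}, {"$set": {"scraped": True}})

Flipping the flag per page in parse() instead of at the end of the batch means a crash only loses the requests that were in flight, which matters given how often the site goes down. The same spider should also cover the ~1,000 incremental pins per day once they're seeded into the collection.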
