Scrapy is built on Twisted, an asynchronous networking library, so it makes multiple HTTP requests "simultaneously". Is that what you mean by "process one url at a time"?
I.e., when I run a scrapy crawl process, I see it load around 20 URLs from my database into the (internal) queuing system. Depending on your CONCURRENT_REQUESTS_PER_DOMAIN setting, scrapy will make that many parallel requests to each domain. In other words, scrapy does not wait until the previous request has returned before issuing the next one. If this seems confusing, you might want to read a little on how Twisted does asynchronous requests (I would offer more, but I don't really know much about that myself).
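For reference, the knobs involved live in settings.py. A minimal sketch with purely illustrative numbers (with a proxy and a captcha breaker in the mix you would probably tune these down):

    # settings.py -- illustrative values only
    CONCURRENT_REQUESTS = 16             # total requests scrapy keeps in flight
    CONCURRENT_REQUESTS_PER_DOMAIN = 8   # parallel requests to any single domain
    DOWNLOAD_DELAY = 0.5                 # optional pause (seconds) between requests to the same site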
Best of luck,
Travis

On Thu, Sep 25, 2014 at 5:06 PM, Drew Friestedt <[email protected]> wrote:
> If I implement this recommendation, will scrapy process more than one url
> at a time? After reading the documents it looks like it will only process
> one url at a time:
>
> start_requests()
>   query mongodb
>   select 1 pin
>   parse url
>   update mongodb
>   call start_requests()
>
> Can I construct a list of URLs and parse a list rather than an
> individual URL?
>
> From: "[email protected]" <[email protected]>
> Reply-To: "[email protected]" <[email protected]>
> Date: Thursday, September 25, 2014 at 11:45 AM
> To: scrapy-users <[email protected]>
> Subject: Re: 1M Page Scrape Setup
>
> Drew,
>
> Take a look at the start_requests() method in scrapy's Spider class.
> You'll override this method and yield the Request object for the next
> page to scrape. Ref:
> http://doc.scrapy.org/en/latest/topics/spiders.html?highlight=make_request#scrapy.spider.Spider.make_requests_from_url
>
> I like to use start_requests() when I'm pulling from a database, because
> you can write the function as a generator so it only pulls from the db
> when you need to. (I usually also mark the status as "QUEUED" in my DB
> once a URL has been handed to scrapy, and this is a good place to put
> that logic.)
>
> One gotcha I've run into with this: if you query mongo and keep a cursor
> pointing to your results, that cursor will time out much more quickly
> than I expected. I implemented start_requests() as a generator, as
> described above, but the cursor would time out between retrievals of the
> URLs! (You can check whether the cursor has timed out and re-acquire the
> result set in start_requests(), or you can move to using a queuing data
> structure, as I tend to prefer.)
>
> Hope this helps. If you get stuck with start_requests(), feel free to
> send me a link to a pastebin and I'll check it out when I have time.
>
> Thanks,
> Travis
>
> On Thu, Sep 25, 2014 at 7:45 AM, Nicolás Alejandro Ramírez Quiros
> <[email protected]> wrote:
>
>> If you already have the "pins" you want to crawl, just put them in a
>> file, then crawl the site. When the spider stops, calculate the
>> difference between the spider output and your total and launch the
>> spider again with that; you will have to repeat as many times as needed.
>>
>> On Thursday, September 25, 2014 11:12:04 UTC-3, Drew Friestedt wrote:
>>
>>> I'm trying to set up a scrape that targets 1M unique URLs on the same
>>> site. The scrape runs through a proxy and a captcha breaker, so it's
>>> running pretty slowly, and it's prone to crash because the target site
>>> goes down frequently (not from me scraping). Once the 1M pages are
>>> scraped, the scrape will grab about 1,000 incremental urls per day.
>>>
>>> URL format:
>>> http://www.foo.com/000000001  # the number sequence is a 'pin'
>>> http://www.foo.com/000000002
>>> http://www.foo.com/000000003
>>> etc.
>>>
>>> Does my proposed setup make sense?
>>>
>>> Set up mongodb with 1M pins and a 'scraped' flag. For example:
>>> {'pin': '000000001', 'scraped': False}
>>>
>>> In the scrape I would set up a query to select 10,000 pins where
>>> 'scraped' = False, then append the 10,000 urls to start_urls[]. The
>>> resulting scrape would get inserted into another collection and the
>>> pin's 'scraped' flag would get set to True. After the 10,000 pins are
>>> scraped, I would run the scrape again, until all 1M pins are scraped.
>>>
>>> Does this setup make sense, or is there a more efficient way to do this?
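To make the start_requests() suggestion earlier in the thread concrete, here is a rough, untested sketch of a generator that pulls pins from mongo in small batches, marks them QUEUED as they are handed to scrapy, and so also sidesteps the cursor-timeout gotcha. The db/collection names, the 'status' field and the batch size are placeholders, not something agreed in this thread:

    import pymongo
    import scrapy

    class PinSpider(scrapy.Spider):
        name = 'pins'

        def start_requests(self):
            pins = pymongo.MongoClient()['foo']['pins']  # placeholder db/collection names
            while True:
                # Re-query in small batches instead of holding one long-lived
                # cursor, so the cursor never sits idle long enough to time out.
                batch = list(pins.find({'scraped': False,
                                        'status': {'$ne': 'QUEUED'}}).limit(100))
                if not batch:
                    break
                for doc in batch:
                    # Record the hand-off to scrapy so the pin isn't selected twice.
                    pins.update_one({'_id': doc['_id']},
                                    {'$set': {'status': 'QUEUED'}})
                    yield scrapy.Request('http://www.foo.com/%s' % doc['pin'],
                                         callback=self.parse,
                                         meta={'pin': doc['pin']})

        def parse(self, response):
            # Extract the data here and yield it as an item; updating the
            # 'scraped' flag can live in an item pipeline (sketched below).
            pass

Because start_requests() is consumed lazily, scrapy only asks this generator for more URLs as its internal queue drains, so the full 1M-pin set never has to sit in memory at once.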

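For the "insert into another collection and set 'scraped' to True" half of the flow described above, an item pipeline is one natural home for that logic. Again a rough sketch with placeholder names, assuming each item carries the pin it came from:

    import pymongo

    class MongoResultsPipeline(object):

        def open_spider(self, spider):
            client = pymongo.MongoClient()
            self.results = client['foo']['results']  # placeholder output collection
            self.pins = client['foo']['pins']        # same pin collection the spider reads

        def process_item(self, item, spider):
            # Store the scraped data in a separate collection...
            self.results.insert_one(dict(item))
            # ...then flip the flag so the pin is never re-queued.
            self.pins.update_one({'pin': item['pin']},
                                 {'$set': {'scraped': True, 'status': 'DONE'}})
            return item

You would enable it with something like ITEM_PIPELINES = {'myproject.pipelines.MongoResultsPipeline': 300} in settings.py (the module path is, again, a placeholder).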