Use this for JavaScript, it works perfectly: https://github.com/brandicted/scrapy-webdriver
On Wednesday, December 3, 2014 at 4:24:36 PM UTC+1, Travis Leleu wrote:

> Hi Adi,
>
> I believe scrapy would meet your needs, especially since you have a decentralized queue to feed the URLs into it.
>
> 1. If you use the start_requests() method (see more: http://doc.scrapy.org/en/latest/topics/spiders.html#scrapy.spider.Spider.start_requests), you can just consume from the queue to feed URLs into scrapy. You can pop the queue, modify the URL as needed, and yield it to the scrapy core engine.
>
> 2. scrapyd is a convenient way to send jobs around to different systems without having to copy your codebase. It's essentially a deployment tool.
>
> Scrapy is pretty efficient for web scraping. Scraping is I/O bound, and scrapy uses Twisted, an async HTTP framework. So scrapy fires off a request, then forgets about it until the response comes back through Twisted. In the interim, it can process or fire off other requests.
>
> Processing requirements vary, but I would expect you could have hundreds, if not thousands, of concurrent scraping requests on a medium-sized EC2 server.
>
> In my experience, the only shortcomings of scrapy are the architectural complexity (it takes some time to master) and the lack of JavaScript support. Many sites are single-page apps that load their content via JS, and scrapy (to my knowledge) can't do anything with that.
>
> Hope this helps,
> Travis
>
> On Wed, Dec 3, 2014 at 4:01 AM, <[email protected]> wrote:
>
>> Hi,
>> I am building a back-end in which one of the modules needs to do web scraping of various sites. The URL originates from an end user, so the domain is known beforehand, but the full URL is dynamic.
>>
>> The back-end is planned to support thousands of requests per second. I like what I see in scrapy regarding feature coverage, extensibility, ease of use, and more, but I am concerned about these two points:
>>
>> 1. Passing the URL in real time as an argument to scrapy, where only the domain (and therefore the specific spider) is known.
>> 2. I've read that in order to invoke scrapy via an API one should use scrapyd with its JSON API, which spawns a process per scraping job. That means a process runs per request, and this is not scalable (imagine each request takes 1.5 seconds).
>>
>> Please advise,
>>
>> --
>> You received this message because you are subscribed to the Google Groups "scrapy-users" group.
>> To unsubscribe from this group and stop receiving emails from it, send an email to [email protected].
>> To post to this group, send email to [email protected].
>> Visit this group at http://groups.google.com/group/scrapy-users.
>> For more options, visit https://groups.google.com/d/optout.
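For anyone finding this thread later, here is a minimal sketch of the queue-fed start_requests() pattern Travis describes in point 1. Assumptions: an in-process queue.Queue stands in for the decentralized queue from the thread (in production this might be Redis or SQS), and the Request dataclass is a stand-in for scrapy.Request so the generator logic runs without scrapy installed; in a real spider you would subclass scrapy.Spider and yield scrapy.Request instead.

```python
# Sketch of a spider whose start_requests() pops URLs from an external
# queue instead of a fixed start_urls list. The queue and Request class
# here are stand-ins so the pattern is runnable on its own.

import queue
from dataclasses import dataclass


@dataclass
class Request:
    """Stand-in for scrapy.Request (url + callback only)."""
    url: str
    callback: object = None


# Stand-in for the decentralized queue that feeds URLs from end users.
url_queue = queue.Queue()
for path in ("/a", "/b", "/c"):
    url_queue.put("http://example.com" + path)


class QueueSpider:
    name = "queue_spider"

    def parse(self, response):
        pass  # normal scrapy parsing would go here

    def start_requests(self):
        # Drain the queue; a long-running spider would instead block on
        # get() or poll the external queue for newly submitted URLs.
        while True:
            try:
                url = url_queue.get_nowait()
            except queue.Empty:
                break
            # Modify the URL as needed, then hand it to the engine.
            yield Request(url=url, callback=self.parse)


requests = list(QueueSpider().start_requests())
print([r.url for r in requests])
# → ['http://example.com/a', 'http://example.com/b', 'http://example.com/c']
```

Because start_requests() is a generator, scrapy's engine pulls requests from it lazily, so the queue is consumed at whatever rate the downloader can sustain rather than all at once.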
