Use this for JavaScript, it works perfectly: https://github.com/brandicted/scrapy-webdriver
On Wednesday, December 3, 2014 at 4:24:36 PM UTC+1, Travis Leleu wrote:

> Hi Adi,
>
> I believe scrapy would meet your needs, especially since you have a decentralized queue to feed the URLs into it.
>
> 1. If you use the start_requests() method (see more: http://doc.scrapy.org/en/latest/topics/spiders.html#scrapy.spider.Spider.start_requests), you can just consume from the queue to feed URLs into scrapy. You can pop the queue, modify the URL as needed, and yield it to the scrapy core engine.
>
> 2. scrapyd is a convenient way to send jobs around to different systems without having to copy your codebase. It's essentially a deployment tool.
>
> Scrapy is pretty efficient for web scraping. Scraping is I/O bound, and scrapy uses Twisted, an async HTTP framework. So scrapy fires off a request, then forgets about it until the response comes back through Twisted. In the interim, it can process or fire off other requests.
>
> Processing requirements vary, but I would expect you could have hundreds, if not thousands, of concurrent scraping requests on a medium-sized EC2 server.
>
> In my experience, the only shortcomings of scrapy are the architectural complexity (it takes some time to master) and the lack of JavaScript support. Many sites are single-page apps that load their content via JS, and scrapy (to my knowledge) can't do anything with that.
>
> Hope this helps,
> Travis
>
> On Wed, Dec 3, 2014 at 4:01 AM, <[email protected]> wrote:
>
>> Hi,
>> I am building a back-end in which one of the modules needs to do web scraping of various sites. The URL originates from an end user, so the domain is known beforehand, but the full URL is dynamic.
>>
>> The back-end is planned to support thousands of requests per second. I like what I see in scrapy regarding feature coverage, extensibility, ease of use, and more, but I am concerned about these two points:
>>
>> 1. Passing the URL in real time as an argument to scrapy, where only the domain (and therefore the specific spider) is known.
>> 2. I've read that in order to invoke scrapy via an API one should use scrapyd with its JSON API, which spawns a process per scraping job. That means a process runs per request, and this is not scalable (imagine each request takes 1.5 seconds).
>>
>> Please advise,
>>
>> --
>> You received this message because you are subscribed to the Google Groups "scrapy-users" group.
>> To unsubscribe from this group and stop receiving emails from it, send an email to [email protected].
>> To post to this group, send email to [email protected].
>> Visit this group at http://groups.google.com/group/scrapy-users.
>> For more options, visit https://groups.google.com/d/optout.
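For anyone finding this thread later, here is a minimal sketch of the queue-fed start_requests() pattern Travis describes in point 1. Assumptions: an in-process queue.Queue stands in for the decentralized queue from the thread (in production this might be Redis or SQS), and the Request dataclass is a stand-in for scrapy.Request so the generator logic runs without scrapy installed; in a real spider you would subclass scrapy.Spider and yield scrapy.Request instead.

```python
# Sketch of a spider whose start_requests() pops URLs from an external
# queue instead of a fixed start_urls list. The queue and Request class
# here are stand-ins so the pattern is runnable on its own.

import queue
from dataclasses import dataclass


@dataclass
class Request:
    """Stand-in for scrapy.Request (url + callback only)."""
    url: str
    callback: object = None


# Stand-in for the decentralized queue that feeds URLs from end users.
url_queue = queue.Queue()
for path in ("/a", "/b", "/c"):
    url_queue.put("http://example.com" + path)


class QueueSpider:
    name = "queue_spider"

    def parse(self, response):
        pass  # normal scrapy parsing would go here

    def start_requests(self):
        # Drain the queue; a long-running spider would instead block on
        # get() or poll the external queue for newly submitted URLs.
        while True:
            try:
                url = url_queue.get_nowait()
            except queue.Empty:
                break
            # Modify the URL as needed, then hand it to the engine.
            yield Request(url=url, callback=self.parse)


requests = list(QueueSpider().start_requests())
print([r.url for r in requests])
# → ['http://example.com/a', 'http://example.com/b', 'http://example.com/c']
```

Because start_requests() is a generator, scrapy's engine pulls requests from it lazily, so the queue is consumed at whatever rate the downloader can sustain rather than all at once.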
