PhantomJS also crashes at random fairly often (how often depends on how
much work you push through it), so you have to handle that. And the
libraries for controlling it all suck except for the one I wrote (obviously
I say that with complete impartiality! /s).
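
One way to cope with the crashes is to supervise the PhantomJS process and
restart it when it dies. Below is a rough sketch built on Node's
child_process.spawn. It is not taken from any particular PhantomJS library:
supervise, maxRestarts, onGiveUp and spawnFn are names made up for this
example, and spawnFn is there only so the logic can be exercised without
PhantomJS installed.

```javascript
const { spawn } = require('child_process');

// Restart a crashing child process, up to maxRestarts times.
// supervise, maxRestarts, onGiveUp and spawnFn are hypothetical names.
function supervise(command, args, options) {
  const opts = options || {};
  const maxRestarts = opts.maxRestarts === undefined ? 5 : opts.maxRestarts;
  const spawnFn = opts.spawnFn || spawn;       // swappable for testing
  const onGiveUp = opts.onGiveUp || function () {};
  let restarts = 0;

  function start() {
    const child = spawnFn(command, args, { stdio: 'inherit' });
    let settled = false;  // 'error' and 'exit' can both fire for one child

    function ended(cleanly, reason) {
      if (settled) return;
      settled = true;
      if (cleanly) return;                     // normal exit, nothing to do
      if (restarts >= maxRestarts) return onGiveUp(reason);
      restarts += 1;
      start();                                 // crashed, so bring it back
    }

    // Exit code 0 counts as clean; a non-zero code or a signal is a crash.
    child.once('exit', (code, signal) => ended(code === 0, { code, signal }));
    // 'error' fires when the process could not be spawned at all.
    child.once('error', (err) => ended(false, err));
  }

  start();
  return { restarts: () => restarts };
}

// Example call (scrape.js is a placeholder script name):
// supervise('phantomjs', ['scrape.js'], { maxRestarts: 3 });
```

This restarts immediately; anything real would probably want a delay
between restarts and some way to re-queue whatever job was in flight when
the process died.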

But no, I don't know of anything that handles throttling, job distribution
and the rest of it for you. That said, it's not hard to get a sane level of
throttling out of something like Kue, or just async.queue.
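
To make the throttling idea concrete, here is a minimal sketch of a
concurrency-limited queue in plain Node. It is a hand-rolled stand-in for
the idea behind async.queue, not that library's API, so it runs without
installing anything. createThrottledQueue, fakeFetch and the example.com
URLs are all made up for illustration.

```javascript
// Minimal concurrency-limited queue. All names here are hypothetical.
function createThrottledQueue(worker, concurrency) {
  const backlog = [];  // jobs waiting for a free slot, in FIFO order
  let active = 0;      // jobs currently inside the worker

  function next() {
    // Start jobs until every slot is taken or the backlog is empty.
    while (active < concurrency && backlog.length > 0) {
      const entry = backlog.shift();
      let finished = false;
      active += 1;
      worker(entry.job, (err, result) => {
        if (finished) return;  // ignore a worker that calls back twice
        finished = true;
        active -= 1;
        if (entry.done) entry.done(err, result);
        next();                // a slot just opened up
      });
    }
  }

  return {
    push(job, done) { backlog.push({ job, done }); next(); },
    running() { return active; },
    waiting() { return backlog.length; }
  };
}

// Usage: pretend to fetch five pages, never more than two at a time.
// fakeFetch stands in for a real HTTP request.
function fakeFetch(url, callback) {
  setTimeout(() => callback(null, 'body of ' + url), 50);
}

const pageQueue = createThrottledQueue(fakeFetch, 2);
for (let i = 1; i <= 5; i += 1) {
  pageQueue.push('http://example.com/page/' + i, (err, body) => {
    if (err) return console.error(err);
    console.log(body);
  });
}
```

This only caps how many requests are in flight at once. Limiting requests
per second, retrying, or spreading jobs across machines would need more
than this, which is where something like Kue comes in.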


On Wed, Jan 15, 2014 at 10:36 PM, // ravi <ravi-li...@g8o.net> wrote:

> On Jan 15, 2014, at 9:09 PM, Victor Hooi <victorh...@gmail.com> wrote:
>
>
> I'm wondering if anybody knows of any web-scraping frameworks in Node.JS?
>
> Previously, there was node.io (https://github.com/chriso/node.io),
> however, the project was recently discontinued.
>
> Googling for Node.JS and web scraping, most of the guides online just talk
> about using requests and cheerio - it works, but you need to handle a whole
> bunch of things yourself (throttling, distributing jobs, configuration,
> managing jobs etc.).
>
>
> There are a few modules (node-crawler, simple-crawler, etc) that might
> help you. Ultimately you may have to wrap something around PhantomJS to
> deal with JS modifications to the DOM (which can in turn be a bit of a pain
> since PhantomJS for various reasons has to be run independently).
>
> —ravi
>
>
> On the Python side, I know of Scrapy (https://github.com/scrapy/scrapy),
> which is using Twisted for asynchronicity
> On the Ruby side, Nokogiri (http://nokogiri.org/) is meant to be good,
> although I haven't dived into it much.
> Is there anything equivalent in the Node world? Or what approaches are
> people using to tackle this problem?
>
>
>  --
> --
> Job Board: http://jobs.nodejs.org/
> Posting guidelines:
> https://github.com/joyent/node/wiki/Mailing-List-Posting-Guidelines
> You received this message because you are subscribed to the Google
> Groups "nodejs" group.
> To post to this group, send email to nodejs@googlegroups.com
> To unsubscribe from this group, send email to
> nodejs+unsubscr...@googlegroups.com
> For more options, visit this group at
> http://groups.google.com/group/nodejs?hl=en?hl=en
>
> ---
> You received this message because you are subscribed to the Google Groups
> "nodejs" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to nodejs+unsubscr...@googlegroups.com.
> For more options, visit https://groups.google.com/groups/opt_out.
>

