Re: [New Nutch Plugin] Delegate fetching to Selenium/Firefox for those jobs where you neeeeed javascript parsing

Bin Wang Tue, 29 Jul 2014 21:47:18 -0700

+1 for using selenium-grid.


On Tue, Jul 29, 2014 at 8:26 AM, Mohammed Omer <[email protected]>
wrote:

> Morning everyone,
>
> Figured I'd share a little plugin that delegates fetching and crawling
> to a Selenium Hub/Node system, so that you can rely on Firefox to render
> pages and execute JavaScript as a real browser would, and on Selenium to
> pull out the content you care about.
>
> At the moment, the plugin pulls just the innerHTML of the page's
> <body>, as I only needed a quick and dirty fix. It's forked from my
> patched version of another user's earlier attempt at getting Selenium
> standalone working with Nutch, which was in turn a fork of httpclient. That
> worked fine, but it tended to leave lots of zombie processes hanging
> around when errors occurred. I pretty much just patched it enough to get it
> working, so if you end up using it and patch things or remove anything
> unnecessary, send the changes up in a PR!
>
> Here, we rely on the Selenium Hub/Node self-healing setup: we just pass
> requests for pages to that system and receive HTML content as the
> response.
>
> I've been using it in production for a month now, and there are some
> obvious things that still need patching, such as:
>
> - Enabling it for HTTPS pages.
> - Retrieving all of the document's HTML rather than just the <body> tag
> (if one exists), which would probably suit the overall use case better.
>
> Available at: https://github.com/momer/nutch-selenium-grid-plugin
>
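
For anyone skimming the thread, the flow described in the quoted message (hand a URL to the hub, take back rendered HTML, prefer the <body>'s innerHTML) can be sketched roughly as below. This is not the plugin's code, which is a Java Nutch plugin; it is only a minimal Python illustration. The `Browser` interface, `fetch_rendered`, the `whole_document` flag, and `FakeBrowser` are all hypothetical names made up for this sketch, and always quitting the session in a `finally` block is my assumption about how to avoid leaving sessions behind on errors.

```python
from typing import List, Optional, Protocol


class Browser(Protocol):
    """Hypothetical stand-in for one remote browser session on the hub."""

    def get(self, url: str) -> None: ...
    def body_inner_html(self) -> Optional[str]: ...
    def page_source(self) -> str: ...
    def quit(self) -> None: ...


def fetch_rendered(browser: Browser, url: str, whole_document: bool = False) -> str:
    """Load `url` in the remote browser and return the rendered HTML.

    By default this returns the innerHTML of <body>, falling back to the
    full page source when there is no <body>. With whole_document=True the
    full page source is always returned.
    """
    try:
        browser.get(url)
        if not whole_document:
            body = browser.body_inner_html()
            if body is not None:
                return body
        return browser.page_source()
    finally:
        # Assumption: release the session even when loading fails, so a
        # failed fetch does not leave a browser hanging around.
        browser.quit()


class FakeBrowser:
    """In-memory fake so the sketch runs without a Selenium hub."""

    def __init__(self, source: str, body: Optional[str]) -> None:
        self.source = source
        self.body = body
        self.visited: List[str] = []
        self.closed = False

    def get(self, url: str) -> None:
        self.visited.append(url)

    def body_inner_html(self) -> Optional[str]:
        return self.body

    def page_source(self) -> str:
        return self.source

    def quit(self) -> None:
        self.closed = True
```

In a real setup the `Browser` role would be played by a thin wrapper around a Selenium remote driver pointed at the hub. The `whole_document` switch corresponds to the second item on the to-do list in the quoted message.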
