Morning everyone! Figured I'd share a little plugin that delegates fetching and crawling to a Selenium Hub/Node setup, so you can rely on Firefox to render and execute JavaScript as a real browser would, and on Selenium to pull out the content you care about.
At the moment, the plugin pulls just the innerHTML of the page's <body>, as I needed a quick and dirty fix. It's forked from my patching of another user's earlier attempt at getting Selenium standalone working with Nutch, which was in turn a fork of the httpclient plugin. That worked fine, but it was prone to leaving lots of zombie processes around when errors occurred. Here, we instead rely on the Selenium Hub/Node system's self-healing setup: we just pass page requests to it and receive the HTML content as the response.

I've been using it in production for a month now. I pretty much just patched it enough to get it working, so if you end up using it and patch things or strip out unnecessary bits, send them up in a PR! Some obvious things still need patching:

- Enabling https pages
- Retrieving the document's full HTML rather than just the <body> tag (if it exists), which would probably suit the general use case better

Available at: https://github.com/momer/nutch-selenium-grid-plugin
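For anyone curious what the hub-delegated fetch boils down to, here's a minimal sketch using Selenium's RemoteWebDriver. The hub URL and target page are placeholders, and the actual plugin wires this into Nutch's protocol layer, so treat this as an illustration rather than the plugin's exact code:

```java
import java.net.URL;

import org.openqa.selenium.WebDriver;
import org.openqa.selenium.remote.DesiredCapabilities;
import org.openqa.selenium.remote.RemoteWebDriver;

public class GridFetchSketch {
    public static void main(String[] args) throws Exception {
        // Placeholder hub address; point this at your own Selenium Hub.
        WebDriver driver = new RemoteWebDriver(
                new URL("http://localhost:4444/wd/hub"),
                DesiredCapabilities.firefox());
        try {
            driver.get("http://example.com/");
            // Grab just the <body> innerHTML, mirroring what the plugin
            // currently returns as the page content.
            String body = (String) ((RemoteWebDriver) driver)
                    .executeScript("return document.body.innerHTML;");
            System.out.println(body);
        } finally {
            // Always quit so the node releases the browser session --
            // the hub's self-healing handles crashed nodes, which is what
            // avoids the zombie-process problem the old fork had.
            driver.quit();
        }
    }
}
```

Because the driver runs on a remote node rather than as a local subprocess, a crash on the crawler side can't orphan a Firefox process on your fetch machines; the hub reaps dead sessions on its own.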

