Morning everyone! Figured I'd share a little plugin that delegates fetching and crawling to a Selenium Hub/Node setup, so you can rely on Firefox to render and execute JavaScript as a real browser would, and on Selenium to pull out the content you care about.
At the moment, the plugin pulls just the innerHTML of the page's <body>, as I needed a quick and dirty fix. It's forked from my patching of another user's earlier attempt at getting Selenium standalone working with Nutch, which was in turn a fork of the httpclient plugin. That worked fine, but it was prone to leaving lots of zombie processes around when errors occurred. Here, we instead rely on the Selenium Hub/Node system's self-healing setup: we just pass page requests to it and receive the HTML content as the response.

I've been using it in production for a month now. I pretty much just patched it enough to get it working, so if you end up using it and patch things or strip out unnecessary bits, send them up in a PR! Some obvious things still need patching:

- Enabling https pages
- Retrieving the document's full HTML rather than just the <body> tag (if it exists), which would probably suit the general use case better

Available at: https://github.com/momer/nutch-selenium-grid-plugin
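For anyone curious what the hub-delegated fetch boils down to, here's a minimal sketch using Selenium's RemoteWebDriver. The hub URL and target page are placeholders, and the actual plugin wires this into Nutch's protocol layer, so treat this as an illustration rather than the plugin's exact code:

```java
import java.net.URL;

import org.openqa.selenium.WebDriver;
import org.openqa.selenium.remote.DesiredCapabilities;
import org.openqa.selenium.remote.RemoteWebDriver;

public class GridFetchSketch {
    public static void main(String[] args) throws Exception {
        // Placeholder hub address; point this at your own Selenium Hub.
        WebDriver driver = new RemoteWebDriver(
                new URL("http://localhost:4444/wd/hub"),
                DesiredCapabilities.firefox());
        try {
            driver.get("http://example.com/");
            // Grab just the <body> innerHTML, mirroring what the plugin
            // currently returns as the page content.
            String body = (String) ((RemoteWebDriver) driver)
                    .executeScript("return document.body.innerHTML;");
            System.out.println(body);
        } finally {
            // Always quit so the node releases the browser session --
            // the hub's self-healing handles crashed nodes, which is what
            // avoids the zombie-process problem the old fork had.
            driver.quit();
        }
    }
}
```

Because the driver runs on a remote node rather than as a local subprocess, a crash on the crawler side can't orphan a Firefox process on your fetch machines; the hub reaps dead sessions on its own.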

