+1 for using selenium-grid.
On Tue, Jul 29, 2014 at 8:26 AM, Mohammed Omer <[email protected]> wrote:

> Morning everyone,
>
> Figured I'd share out a little plugin that delegates fetching and crawling
> to a Selenium Hub/Node system, so that you can rely on Firefox to correctly
> render and parse javascript as it would, and Selenium to pull out the
> content you care about.
>
> At the moment, the plugin is set to pull just the innerHTML of the page's
> <body>; as I just needed a quick and dirty fix. It's forked from my
> patching of another user's previous attempt at getting Selenium standalone
> working with Nutch; that was in turn a fork of httpclient. That worked
> fine, but it was vulnerable to leaving lots of zombie processes hanging
> around when errors occurred. I pretty much just patched it enough to get it
> working - so if you end up using it and patching things / removing
> unnecessaries, send them up on a PR!
>
> Here, we rely on Selenium Hub/Node's self-healing set-up, and just pass
> requests for pages to that system, and receive html content as the
> response.
>
> I've been using it in production for a month now; and, there are some
> obvious things that need patching like
>
> - Enabling for https pages
> - It would probably be best for the overall use case to retrieve all of the
>   document's html, rather than just a <body> tag (if exists).
>
> Available at: https://github.com/momer/nutch-selenium-grid-plugin
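
For anyone skimming the thread, the hand-off described in the quoted message (pass a URL to the hub, get the rendered HTML back) could look roughly like the sketch below. This is not the plugin's code, which is Java inside Nutch; it is a small Python illustration using Selenium's remote driver. The function names, the hub URL, and the `whole_document` switch are my own assumptions for the example. The switch only illustrates the second to-do item (full document instead of just <body>).

```python
from typing import Any

# JavaScript run in the browser after the page has rendered.
BODY_SCRIPT = "return document.body ? document.body.innerHTML : '';"
DOCUMENT_SCRIPT = "return document.documentElement.outerHTML;"


def extract_html(driver: Any, whole_document: bool = False) -> str:
    """Return rendered HTML from a page the driver has already loaded.

    By default only the innerHTML of <body> is returned; with
    whole_document=True the full document markup is returned instead.
    """
    script = DOCUMENT_SCRIPT if whole_document else BODY_SCRIPT
    return driver.execute_script(script) or ""


def fetch_via_grid(
    url: str,
    hub_url: str = "http://localhost:4444/wd/hub",  # assumed hub address
    whole_document: bool = False,
) -> str:
    """Ask a Selenium hub for a Firefox session, load url, return its HTML."""
    # Imported lazily so extract_html can be used without Selenium installed.
    from selenium import webdriver

    options = webdriver.FirefoxOptions()
    driver = webdriver.Remote(command_executor=hub_url, options=options)
    try:
        driver.get(url)
        return extract_html(driver, whole_document)
    finally:
        # Always end the session so the node's browser is released,
        # even when loading or scripting fails.
        driver.quit()
```

With a hub running at the assumed address, `fetch_via_grid("http://example.com/")` would return the body markup after Firefox has executed the page's javascript. Ending the session in `finally` is the client-side part of avoiding leftover browser processes; the hub and nodes take care of the rest.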

