I am very interested in this plugin developed by Mohammed, and I am wondering whether there is anything I can do to help with the integration. I have been doing some homework on both Selenium Grid and how Nutch plugins work, and I have written a few posts about it on my blog, datafireball.com. Maybe I can help edit the wiki page for the plugin?

On Wed, Jul 30, 2014 at 4:22 PM, Sebastian Nagel <[email protected]> wrote:

> Hi Mohammed,
>
> sounds interesting. I'll give it a try soon.
>
> > I've been using it in production for a month now; and, there are some
> > obvious things that need patching like
> > - Enabling for https pages
> > - It would probably be best for the overall use case to retrieve all of the
> > document's html, rather than just a <body> tag (if exists).
>
> At first glance, it looks like long passages of code are from protocol-http.
> Would be good to pull out the parts specific to selenium and integrate
> them with the existing code base. This might require some refactoring.
>
> > (from https://github.com/momer/nutch-selenium-grid-plugin#nutch-selenium)
> > C) Not have to wait another 2 years for Nutch to patch in either the Ajax crawler
> > hashbang workaround and then, not having to patch it to get the use case of amending the
> > original url with the hashbang-workaround's content.
>
> You are right: it's a shame that many issues and patches lie around
> for years until they get integrated. On the other hand: everyone
> is welcome to participate, provide and review patches, improve code
> and documentation, etc. There is a lot of work to do...
>
> Thanks for sharing the plugin,
> would be great to hear more from you!
>
> Sebastian
>
> > On 07/30/2014 09:26 PM, Lewis John Mcgibbney wrote:
> > This looks fantastic. Are you interested in bringing it into the codebase? I
> > think that this would be very useful to many users of Nutch and would be
> > extremely interested in hashing out a patch with you in order to do so.
> > Thanks
> > Lewis
> >
> > On 07/29/2014 04:26 PM, Mohammed Omer wrote:
> > Morning everyone,
> >
> > Figured I'd share out a little plugin that delegates fetching and crawling
> > to a Selenium Hub/Node system, so that you can rely on Firefox to correctly
> > render and parse javascript as it would, and Selenium to pull out the
> > content you care about.
> >
> > At the moment, the plugin is set to pull just the innerHTML of the page's
> > <body>; as I just needed a quick and dirty fix. It's forked from my
> > patching of another user's previous attempt at getting Selenium standalone
> > working with Nutch; that was in turn a fork of httpclient. That worked
> > fine, but it was vulnerable to leaving lots of zombie processes hanging
> > around when errors occurred. I pretty much just patched it enough to get it
> > working - so if you end up using it and patching things / removing
> > unnecessaries, send them up on a PR!
> >
> > Here, we rely on Selenium Hub/Node's self-healing set-up, and just pass
> > requests for pages to that system, and receive html content as the response.
> >
> > I've been using it in production for a month now; and, there are some
> > obvious things that need patching like
> >
> > - Enabling for https pages
> > - It would probably be best for the overall use case to retrieve all of the
> > document's html, rather than just a <body> tag (if exists).
> >
> > Available at: https://github.com/momer/nutch-selenium-grid-plugin

