Sure, I'll have some time tomorrow to put together the general post you asked about on how to set up a Selenium hub/node on Debian; I'm open to any ideas or wiki additions. I figure only a small number of people will ever use it, so any notes or additional docs that make it easier are welcome.

Thank you,
Mo

This message was drafted on a tiny touch screen; please forgive brevity & tpyos

> On Jul 30, 2014, at 8:47 PM, Bin Wang <[email protected]> wrote:
>
> I am super interested in this plugin developed by Mohammed, and wondering
> if there is anything I can help with in this integration.
> I am doing some homework on both Selenium grid and how Nutch plugins work.
> I have also written a few posts on my blog - datafireball.com. Maybe I
> can help edit the Wiki page for the plugin?
>
>
> On Wed, Jul 30, 2014 at 4:22 PM, Sebastian Nagel <[email protected]> wrote:
>
>> Hi Mohammed,
>>
>> sounds interesting. I'll give it a try soon.
>>
>>> I've been using it in production for a month now; and, there are some
>>> obvious things that need patching like
>>> - Enabling for https pages
>>> - It would probably be best for the overall use case to retrieve all of
>>>   the document's html, rather than just a <body> tag (if exists).
>>
>> At a first glance, looks like long passages of code are from protocol-http.
>> Would be good to pull out the parts specific to selenium and integrate
>> them with the existing code base. This might require some refactoring.
>>
>>> (from https://github.com/momer/nutch-selenium-grid-plugin#nutch-selenium )
>>> C) Not have to wait another 2 years for Nutch to patch in either the
>>> Ajax crawler hashbang workaround and then, not having to patch it to get
>>> the use case of amending the original url with the hashbang-workaround's
>>> content.
>>
>> You are right: it's a shame for many issues and patches lying around
>> for years until they get integrated. On the other hand: everyone
>> is welcome to participate, provide and review patches, improve code
>> and documentation, etc. There is a lot of work to do...
>>
>> Thanks for sharing the plugin,
>> would be great to hear more from you!
>>
>> Sebastian
>>
>>
>>> On 07/30/2014 09:26 PM, Lewis John Mcgibbney wrote:
>>> This looks fantastic. Are you interested in bringing it into the
>>> codebase? I think that this would be very useful to many users of Nutch
>>> and would be extremely interested in hashing out a patch with you in
>>> order to do so.
>>> Thanks
>>> Lewis
>>
>>
>>> On 07/29/2014 04:26 PM, Mohammed Omer wrote:
>>> Morning everyone,
>>>
>>> Figured I'd share out a little plugin that delegates fetching and
>>> crawling to a Selenium Hub/Node system, so that you can rely on Firefox
>>> to correctly render and parse javascript as it would, and Selenium to
>>> pull out the content you care about.
>>>
>>> At the moment, the plugin is set to pull just the innerHTML of the page's
>>> <body>; as I just needed a quick and dirty fix. It's forked from my
>>> patching of another user's previous attempt at getting Selenium
>>> standalone working with Nutch; that was in turn a fork of httpclient.
>>> That worked fine, but it was vulnerable to leaving lots of zombie
>>> processes hanging around when errors occurred. I pretty much just patched
>>> it enough to get it working - so if you end up using it and patching
>>> things / removing unnecessaries, send them up on a PR!
>>>
>>> Here, we rely on Selenium Hub/Node's self-healing set-up, and just pass
>>> requests for pages to that system, and receive html content as the
>>> response.
>>>
>>> I've been using it in production for a month now; and, there are some
>>> obvious things that need patching like
>>>
>>> - Enabling for https pages
>>> - It would probably be best for the overall use case to retrieve all of
>>>   the document's html, rather than just a <body> tag (if exists).
>>>
>>> Available at: https://github.com/momer/nutch-selenium-grid-plugin

