All, a little post about how I arrived at using Selenium with Nutch is at http://soryy.com/blog/2014/ajax-javascript-enabled-parsing-apache-nutch-selenium/
I didn't have time to also go through setting up the individual components, but I'll save that for next week.

Figured it might make for a fun read for you all, and a reminder that while many sites promise to implement a work-around, not all of them keep that promise!

Mo

On Thu, Jul 31, 2014 at 5:43 PM, Mohammed Omer <[email protected]> wrote:

> Hey Julien,
>
> I definitely should have thanked everyone for all the work that goes into
> Nutch before that (at least I said that Nutch was an awesome, world class,
> web crawler though!). I get that patches are in the hands of the community,
> but for someone like me or the person who submitted
> https://issues.apache.org/jira/browse/NUTCH-1323 and asked for input, it
> didn't seem like any existing committers were willing to vote, review it,
> etc.
>
> I'll keep that in mind though about being more vocal and active in this
> and other Apache projects I use/am interested in!
>
> Back-porting this to Nutch 1.x isn't something I plan on doing; but, if
> someone is using 1.x and would like to make a PR for a 1.x branch, that'd
> be jiggy and I'd merge it in.
>
> Thank you,
>
> Mo
>
> On Thu, Jul 31, 2014 at 2:56 AM, Julien Nioche <[email protected]> wrote:
>
>> Hi,
>>
>> Just to add to what Seb said below:
>>
>> *> (from https://github.com/momer/nutch-selenium-grid-plugin#nutch-selenium)
>> > C) Not have to wait another 2 years for Nutch to patch in either the
>> > Ajax crawler hashbang workaround and then, not having to patch it to
>> > get the use case of amending the original url with the
>> > hashbang-workaround's content.
>> You are right: it's a shame for many issues and patches lying around
>> for years until they get integrated. On the other hand: everyone is
>> welcome to participate, provide and review patches, improve code and
>> documentation, etc. There is a lot of work to do...*
>>
>> Open source projects like Nutch rely on the participation of the
>> community. Everyone is welcome to contribute in any way possible.
>> If you wanted NUTCH-1323 to be committed quicker you could have helped
>> review the patch, voted for it, expressed yourself on the mailing list,
>> etc... Nutch is not a top-down organisation where things are decided
>> entirely by PMC members but an evolutionary process where things get done
>> because they are needed, get improved because they are used and so on...
>> Your contribution with this plugin is a good example of this: you needed
>> it, shared it and it might get improved as more people start using it.
>>
>> > Glad to see interest, and more importantly, people still interested in
>> > nutch on the mailing list!
>>
>> Crawling is a bit of a niche activity and the traffic on the lists is
>> never huge but Nutch is a very healthy project, and keeps getting better
>> and better (even if some JIRA issues do not get committed very quickly).
>> Having to maintain 2 versions definitely doesn't help focusing the effort.
>>
>> BTW what about porting your plugin to Nutch 1.x?
>>
>> Thanks again for sharing your work
>>
>> Julien
>>
>> On 31 July 2014 06:25, Mo Omer <[email protected]> wrote:
>>
>> > Sorry for the multiple emails, I didn't see the rest of your email
>> > Sebastian.
>> >
>> > Re httpclient - I had a total of just a few hours to hack together my
>> > previous selenium stand alone plugin, and even less time to put together
>> > this solution so there is looooots of stuff that can be pulled out
>> > that's leftover from httpclient!
>> >
>> > Unfortunately lately my work queue is heavy; and, I've already moved on
>> > from the project using this plugin. I'll happily look at and merge PRs,
>> > but can't promise any additional refactoring or curation on my end.
>> >
>> > I will put together a tutorial, as I mentioned in the previous email,
>> > showing
>> >
>> > A) What selenium is
>> > B) Why it's a good compromise
>> > C) Setting up Selenium Hub on Ubuntu 14.04
>> > D) Setting up Selenium Node on Ubuntu 14.04
>> > E) Some issues I've encountered with selenium node
>> >
>> > Glad to see interest, and more importantly, people still interested in
>> > nutch on the mailing list!
>> >
>> > Thank you,
>> >
>> > Mo
>> >
>> > This message was drafted on a tiny touch screen; please forgive brevity
>> > & tpyos
>> >
>> > > On Jul 30, 2014, at 5:22 PM, Sebastian Nagel <[email protected]>
>> > > wrote:
>> > >
>> > > Hi Mohammed,
>> > >
>> > > sounds interesting. I'll give it a try soon.
>> > >
>> > >> I've been using it in production for a month now; and, there are some
>> > >> obvious things that need patching like
>> > >> - Enabling for https pages
>> > >> - It would probably be best for the overall use case to retrieve all
>> > >> of the document's html, rather than just a <body> tag (if exists).
>> > > At a first glance, looks like long passages of code are from
>> > > protocol-http. Would be good to pull out the parts specific to
>> > > selenium and integrate them with the existing code base. This might
>> > > require some refactoring.
>> > >
>> > >> (from https://github.com/momer/nutch-selenium-grid-plugin#nutch-selenium)
>> > >> C) Not have to wait another 2 years for Nutch to patch in either the
>> > >> Ajax crawler hashbang workaround and then, not having to patch it to
>> > >> get the use case of amending the original url with the
>> > >> hashbang-workaround's content.
>> > > You are right: it's a shame for many issues and patches lying around
>> > > for years until they get integrated. On the other hand: everyone
>> > > is welcome to participate, provide and review patches, improve code
>> > > and documentation, etc. There is a lot of work to do...
>> > >
>> > > Thanks for sharing the plugin,
>> > > would be great to hear more from you!
>> > >
>> > > Sebastian
>> > >
>> > >> On 07/30/2014 09:26 PM, Lewis John Mcgibbney wrote:
>> > >> This looks fantastic. Are you interested in bringing it into the
>> > >> codebase? I think that this would be very useful to many users of
>> > >> Nutch and would be extremely interested in hashing out a patch with
>> > >> you in order to do so.
>> > >> Thanks
>> > >> Lewis
>> > >
>> > >> On 07/29/2014 04:26 PM, Mohammed Omer wrote:
>> > >> Morning everyone,
>> > >>
>> > >> Figured I'd share out a little plugin that delegates fetching and
>> > >> crawling to a Selenium Hub/Node system, so that you can rely on
>> > >> Firefox to correctly render and parse javascript as it would, and
>> > >> Selenium to pull out the content you care about.
>> > >>
>> > >> At the moment, the plugin is set to pull just the innerHTML of the
>> > >> page's <body>; as I just needed a quick and dirty fix. It's forked
>> > >> from my patching of another user's previous attempt at getting
>> > >> Selenium standalone working with Nutch; that was in turn a fork of
>> > >> httpclient. That worked fine, but it was vulnerable to leaving lots
>> > >> of zombie processes hanging around when errors occurred. I pretty
>> > >> much just patched it enough to get it working - so if you end up
>> > >> using it and patching things / removing unnecessaries, send them up
>> > >> on a PR!
>> > >>
>> > >> Here, we rely on Selenium Hub/Node's self-healing set-up, and just
>> > >> pass requests for pages to that system, and receive html content as
>> > >> the response.
>> > >>
>> > >> I've been using it in production for a month now; and, there are some
>> > >> obvious things that need patching like
>> > >>
>> > >> - Enabling for https pages
>> > >> - It would probably be best for the overall use case to retrieve all
>> > >> of the document's html, rather than just a <body> tag (if exists).
>> > >>
>> > >> Available at: https://github.com/momer/nutch-selenium-grid-plugin
>>
>> --
>> Open Source Solutions for Text Engineering
>> http://digitalpebble.blogspot.com/
>> http://www.digitalpebble.com
>> http://twitter.com/digitalpebble
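
For readers unfamiliar with the "hashbang workaround" mentioned in the thread: as I understand the AJAX crawling scheme, a crawler that sees a `#!` URL is expected to request an `_escaped_fragment_` variant instead, and the site is expected to answer with a pre-rendered snapshot. The sketch below only illustrates that URL mapping. It is not code from Nutch, from NUTCH-1323, or from the plugin; the function name `escaped_fragment_url` is made up for illustration, and the escaping is simplified.

```python
from urllib.parse import quote, urlsplit, urlunsplit


def escaped_fragment_url(url: str) -> str:
    """Map a '#!' URL to its '_escaped_fragment_' form.

    URLs without a hashbang fragment are returned unchanged.
    (Hypothetical helper, simplified escaping.)
    """
    parts = urlsplit(url)
    if not parts.fragment.startswith("!"):
        return url
    # Percent-encode the state after '#!'. Characters such as '&', '#',
    # '%', '+' and spaces are escaped; '=' and a few others are kept.
    state = quote(parts.fragment[1:], safe="=/:,;@!$'()*")
    param = "_escaped_fragment_=" + state
    # Append to an existing query string if there is one.
    query = parts.query + "&" + param if parts.query else param
    # Drop the fragment: it is never sent to the server anyway.
    return urlunsplit((parts.scheme, parts.netloc, parts.path, query, ""))


print(escaped_fragment_url("http://example.com/#!key=value"))
# → http://example.com/?_escaped_fragment_=key=value
```

The mapping only helps if the site actually serves a snapshot at the rewritten URL, which is the promise referred to at the top of this message; rendering the page in a real browser through Selenium sidesteps that dependency.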

