All, a little post about how I arrived at using Selenium with Nutch is at
http://soryy.com/blog/2014/ajax-javascript-enabled-parsing-apache-nutch-selenium/

I didn't have time to also go through setting up the individual components,
but I'll save that for next week.

Figured it might make for a fun read for you all, and a reminder that while
many sites promise to implement a work-around, not all of them keep that
promise!
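
For context, the work-around in question is (as far as I understand it)
Google's AJAX crawling scheme: a crawler rewrites a `#!` URL into an
`_escaped_fragment_` query parameter, and the site is supposed to answer
that URL with a pre-rendered HTML snapshot. Here is a minimal Python sketch
of the URL rewrite only. The function name and the example.com URLs are made
up for illustration, and this is not code from Nutch or from the plugin.

```python
from urllib.parse import quote, urlsplit, urlunsplit

# Characters left unescaped in the fragment value. Assumption: the scheme only
# requires escaping control characters, space, '#', '%', '&', '+' and non-ASCII
# bytes; quote() may escape a few more than that, which is harmless.
_SAFE = "!$'()*,/:;=?@"


def to_escaped_fragment_url(url: str) -> str:
    """Rewrite a '#!' URL to its '_escaped_fragment_' form.

    URLs without a hashbang fragment are returned unchanged.
    """
    parts = urlsplit(url)
    if not parts.fragment.startswith("!"):
        return url
    value = quote(parts.fragment[1:], safe=_SAFE)
    param = "_escaped_fragment_=" + value
    # Append to an existing query string if there is one.
    query = parts.query + "&" + param if parts.query else param
    return urlunsplit((parts.scheme, parts.netloc, parts.path, query, ""))


if __name__ == "__main__":
    print(to_escaped_fragment_url("http://example.com/#!key=value"))
```

If a site never serves a snapshot at the rewritten URL, the crawler gets
nothing useful back, which is why rendering the page in a real browser ended
up being the more dependable route for me.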

Mo


On Thu, Jul 31, 2014 at 5:43 PM, Mohammed Omer <[email protected]>
wrote:

> Hey Julien,
>
> I definitely should have thanked everyone for all the work that goes into
> Nutch before that (at least I said that Nutch was an awesome, world class,
> web crawler though!). I get that patches are in the hands of the community,
> but to someone like me or the person who submitted
> https://issues.apache.org/jira/browse/NUTCH-1323 and asked for input, it
> didn't seem like any existing committers were willing to vote, review it,
> etc.
>
> I'll keep that in mind though about being more vocal and active in this
> and other Apache projects I use/am interested in!
>
> Back-porting this to Nutch 1.x isn't something I plan on doing; but, if
> someone using 1.x would like to make a PR for a 1.x branch, that'd be
> jiggy and I'd merge it in.
>
> Thank you,
>
> Mo
>
>
> On Thu, Jul 31, 2014 at 2:56 AM, Julien Nioche <
> [email protected]> wrote:
>
>> Hi,
>>
>> Just to add to what Seb said below:
>>
>> > > (from https://github.com/momer/nutch-selenium-grid-plugin#nutch-selenium)
>> > > C) Not have to wait another 2 years for Nutch to patch in either the
>> > > Ajax crawler hashbang workaround and then, not having to patch it to
>> > > get the use case of amending the original url with the
>> > > hashbang-workaround's content.
>> >
>> > You are right: it's a shame that many issues and patches lie around
>> > for years until they get integrated. On the other hand: everyone is
>> > welcome to participate, provide and review patches, improve code and
>> > documentation, etc. There is a lot of work to do...
>>
>> Open source projects like Nutch rely on the participation of the
>> community. Everyone is welcome to contribute in any way possible.
>> If you wanted NUTCH-1323 to be committed quicker you could have helped
>> review the patch, voted for it, expressed yourself on the mailing list,
>> etc... Nutch is not a top-down organisation where things are decided
>> entirely by PMC members but an evolutionary process where things get done
>> because they are needed, get improved because they are used and so on...
>> Your contribution with this plugin is a good example of this : you needed
>> it, shared it and it might get improved as more people start using it.
>>
>> > Glad to see interest, and more importantly, people still interested in
>> > nutch on the mailing list!
>>
>>
>> Crawling is a bit of a niche activity and the traffic on the lists is
>> never huge but Nutch is a very healthy project, and keeps getting better
>> and better (even if some JIRA issues do not get committed very quickly).
>> Having to maintain 2 versions definitely doesn't help focusing the effort.
>>
>> BTW what about porting your plugin to Nutch 1.x?
>>
>> Thanks again for sharing your work
>>
>> Julien
>>
>>
>>
>>
>>
>>
>> On 31 July 2014 06:25, Mo Omer <[email protected]> wrote:
>>
>> > Sorry for the multiple emails, I didn't see the rest of your email
>> > Sebastian.
>> >
>> > Re httpclient - I had a total of just a few hours to hack together my
>> > previous selenium stand alone plugin, and even less time to put together
>> > this solution so there is looooots of stuff that can be pulled out
>> > that's leftover from httpclient!
>> >
>> > Unfortunately lately my work queue is heavy; and, I've already moved on
>> > from the project using this plugin. I'll happily look at and merge PRs,
>> > but can't promise any additional refactoring or curation on my end.
>> >
>> > I will put together a tutorial, as I mentioned in the previous email,
>> > showing
>> >
>> > A) What selenium is
>> > B) Why it's a good compromise
>> > C) Setting up Selenium Hub on Ubuntu 14.04
>> > D) Setting up Selenium Node on Ubuntu 14.04
>> > E) Some issues I've encountered with selenium node
>> >
>> > Glad to see interest, and more importantly, people still interested in
>> > nutch on the mailing list!
>> >
>> > Thank you,
>> >
>> > Mo
>> >
>> > This message was drafted on a tiny touch screen; please forgive brevity
>> > & tpyos
>> >
>> > > On Jul 30, 2014, at 5:22 PM, Sebastian Nagel <[email protected]>
>> > > wrote:
>> > >
>> > > Hi Mohammed,
>> > >
>> > > sounds interesting. I'll give it a try soon.
>> > >
>> > >> I've been using it in production for a month now; and, there are some
>> > >> obvious things that need patching like
>> > >> - Enabling for https pages
>> > >> - It would probably be best for the overall use case to retrieve all
>> > >> of the document's html, rather than just a <body> tag (if exists).
>> > > At first glance, looks like long passages of code are from
>> > > protocol-http. Would be good to pull out the parts specific to
>> > > selenium and integrate them with the existing code base. This might
>> > > require some refactoring.
>> > >
>> > >> (from https://github.com/momer/nutch-selenium-grid-plugin#nutch-selenium)
>> > >> C) Not have to wait another 2 years for Nutch to patch in either the
>> > >> Ajax crawler hashbang workaround and then, not having to patch it to
>> > >> get the use case of amending the original url with the
>> > >> hashbang-workaround's content.
>> > > You are right: it's a shame that many issues and patches lie around
>> > > for years until they get integrated. On the other hand: everyone
>> > > is welcome to participate, provide and review patches, improve code
>> > > and documentation, etc. There is a lot of work to do...
>> > >
>> > > Thanks for sharing the plugin,
>> > > would be great to hear more from you!
>> > >
>> > > Sebastian
>> > >
>> > >
>> > >
>> > >> On 07/30/2014 09:26 PM, Lewis John Mcgibbney wrote:
>> > >> This looks fantastic. Are you interested in bringing it into the
>> > >> codebase? I think that this would be very useful to many users of
>> > >> Nutch and I would be extremely interested in hashing out a patch with
>> > >> you in order to do so.
>> > >> Thanks
>> > >> Lewis
>> > >
>> > >
>> > >> On 07/29/2014 04:26 PM, Mohammed Omer wrote:
>> > >> Morning everyone,
>> > >>
>> > >> Figured I'd share out a little plugin that delegates fetching and
>> > >> crawling to a Selenium Hub/Node system, so that you can rely on
>> > >> Firefox to correctly render and parse javascript as it would, and
>> > >> Selenium to pull out the content you care about.
>> > >>
>> > >> At the moment, the plugin is set to pull just the innerHTML of the
>> > >> page's <body>; as I just needed a quick and dirty fix. It's forked
>> > >> from my patching of another user's previous attempt at getting
>> > >> Selenium standalone working with Nutch; that was in turn a fork of
>> > >> httpclient. That worked fine, but it was vulnerable to leaving lots
>> > >> of zombie processes hanging around when errors occurred. I pretty
>> > >> much just patched it enough to get it working - so if you end up
>> > >> using it and patching things / removing unnecessaries, send them up
>> > >> on a PR!
>> > >>
>> > >> Here, we rely on Selenium Hub/Node's self-healing set-up, and just
>> > >> pass requests for pages to that system, and receive html content as
>> > >> the response.
>> > >>
>> > >> I've been using it in production for a month now; and, there are some
>> > >> obvious things that need patching like
>> > >>
>> > >> - Enabling for https pages
>> > >> - It would probably be best for the overall use case to retrieve all
>> > >> of the document's html, rather than just a <body> tag (if exists).
>> > >>
>> > >> Available at: https://github.com/momer/nutch-selenium-grid-plugin
>> > >
>> >
>>
>>
>>
>> --
>>
>> Open Source Solutions for Text Engineering
>>
>> http://digitalpebble.blogspot.com/
>> http://www.digitalpebble.com
>> http://twitter.com/digitalpebble
>>
>
>
