Re: [New Nutch Plugin] Delegate fetching to Selenium/Firefox for those jobs where you neeeeed javascript parsing

Bin Wang Wed, 30 Jul 2014 18:47:56 -0700

I am super interested in this plugin developed by Mohammed, and I am wondering
if there is anything I can help with in this integration.
I am doing some homework on both Selenium Grid and how Nutch plugins work.
I have also written a few posts on my blog - datafireball.com. Maybe I
can help edit the Wiki page for the plugin?



On Wed, Jul 30, 2014 at 4:22 PM, Sebastian Nagel wrote:

> Hi Mohammed,
>
> sounds interesting. I'll give it a try soon.
>
> > I've been using it in production for a month now; and, there are some
> > obvious things that need patching like
> > - Enabling for https pages
> > - It would probably be best for the overall use case to retrieve all of the
> > document's html, rather than just a <body> tag (if exists).
> At a first glance, looks like long passages of code are from protocol-http.
> Would be good to pull-out the parts specific to selenium and integrate
> them with the existing code base. This might require some refactoring.
>
> > (from https://github.com/momer/nutch-selenium-grid-plugin#nutch-selenium)
> > C) Not have to wait another 2 years for Nutch to patch in either the Ajax crawler
> > hashbang workaround and then, not having to patch it to get the use case of amending the
> > original url with the hashbang-workaround's content.
> You are right: it's a shame that many issues and patches lie around
> for years before they get integrated. On the other hand: everyone
> is welcome to participate, provide and review patches, improve code
> and documentation, etc. There is a lot of work to do...
>
> Thanks for sharing the plugin,
> would be great to hear more from you!
>
> Sebastian
>
>
>
> On 07/30/2014 09:26 PM, Lewis John Mcgibbney wrote:
> > This looks fantastic. Are you interested in bringing it into the codebase?
> > I think that this would be very useful to many users of Nutch and I would be
> > extremely interested in hashing out a patch with you in order to do so.
> > Thanks
> > Lewis
> >
>
>
> On 07/29/2014 04:26 PM, Mohammed Omer wrote:
> > Morning everyone,
> >
> > Figured I'd share out a little plugin that delegates fetching and crawling
> > to a Selenium Hub/Node system, so that you can rely on Firefox to correctly
> > render and parse javascript as it would, and Selenium to pull out the
> > content you care about.
> >
> > At the moment, the plugin is set to pull just the innerHTML of the page's
> > <body>; as I just needed a quick and dirty fix. It's forked from my
> > patching of another user's previous attempt at getting Selenium standalone
> > working with Nutch; that was in turn a fork of httpclient. That worked
> > fine, but it was vulnerable to leaving lots of zombie processes hanging
> > around when errors occurred. I pretty much just patched it enough to get it
> > working - so if you end up using it and patching things / removing
> > unnecessaries, send them up on a PR!
> >
> > Here, we rely on Selenium Hub/Node's self-healing set-up, and just pass
> > requests for pages to that system, and receive html content as the response.
> >
> > I've been using it in production for a month now; and, there are some
> > obvious things that need patching like
> >
> > - Enabling for https pages
> > - It would probably be best for the overall use case to retrieve all of the
> > document's html, rather than just a <body> tag (if exists).
> >
> > Available at: https://github.com/momer/nutch-selenium-grid-plugin
> >
>
>
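
For readers who have not used a Selenium Hub/Node setup, the flow described in the original post (hand a URL to the hub, let Firefox render it, get HTML back) can be sketched roughly as below. This is only an illustration in Python, not code from the plugin, which is a Java Nutch plugin. The hub URL, the function names, and the two JavaScript snippets are assumptions made for the example; the `full_document` switch corresponds to the open question of returning the whole document instead of just the <body>.

```python
"""Rough sketch: ask a Selenium hub to render a page in Firefox and return its HTML."""

# Assumption: a hub listening on the common local default; adjust to your setup.
HUB_URL = "http://localhost:4444/wd/hub"

# JavaScript evaluated in the rendered page (hypothetical snippets for this example).
BODY_SCRIPT = "return document.body ? document.body.innerHTML : '';"
FULL_SCRIPT = "return document.documentElement.outerHTML;"


def extraction_script(full_document: bool = False) -> str:
    """Pick the snippet: the <body> innerHTML, or the whole rendered document."""
    return FULL_SCRIPT if full_document else BODY_SCRIPT


def fetch_rendered_html(url: str, hub_url: str = HUB_URL, full_document: bool = False) -> str:
    """Render `url` on a Firefox node behind the hub and return the resulting HTML."""
    # Imported lazily so the helper above can be used without Selenium installed.
    from selenium import webdriver

    driver = webdriver.Remote(
        command_executor=hub_url,
        options=webdriver.FirefoxOptions(),
    )
    try:
        driver.get(url)
        return driver.execute_script(extraction_script(full_document))
    finally:
        # Always release the remote session, even on errors, so that no
        # browser sessions are left hanging on the node.
        driver.quit()
```

Calling `fetch_rendered_html("http://example.com/")` would return the <body> markup after JavaScript has run, and passing `full_document=True` would return the whole document. Closing the session in `finally` is one way to avoid leaving browsers behind when a fetch fails.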
