Re: [New Nutch Plugin] Delegate fetching to Selenium/Firefox for those jobs where you neeeeed javascript parsing

Sebastian Nagel Wed, 30 Jul 2014 15:23:32 -0700

Hi Mohammed,

sounds interesting. I'll give it a try soon.


> I've been using it in production for a month now; and, there are some
> obvious things that need patching like
> - Enabling for https pages
> - It would probably be best for the overall use case to retrieve all of the
> document's html, rather than just a <body> tag (if exists).
At first glance, it looks like long passages of the code are taken from
protocol-http. It would be good to pull out the parts specific to Selenium
and integrate them with the existing code base. This might require some
refactoring.
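
A rough sketch of what I mean by pulling out the Selenium-specific part, not
taken from the plugin or from Nutch: everything browser-related sits behind one
small interface, and a flag chooses between the current <body> innerHTML
behaviour and the whole document. The names RenderedFetchSketch, ScriptRunner
and fetchRenderedHtml are hypothetical. The two JavaScript snippets assume the
script is evaluated as a function body (hence the "return"), and the main
method uses a stand-in runner so the sketch runs without a Selenium hub.

```java
public class RenderedFetchSketch {

    /** Hypothetical seam: the only piece that would need to talk to Selenium. */
    interface ScriptRunner {
        /** Loads the URL in a browser and returns the result of evaluating the script. */
        String loadAndEvaluate(String url, String script);
    }

    // Assumption: the runner evaluates the script as a function body,
    // so each snippet has to "return" its value.
    static final String FULL_DOCUMENT_JS =
        "return document.documentElement.outerHTML;";
    static final String BODY_ONLY_JS =
        "return document.body ? document.body.innerHTML : '';";

    /** Returns the rendered HTML, or an empty string if the runner yields null. */
    static String fetchRenderedHtml(ScriptRunner runner, String url, boolean fullDocument) {
        String script = fullDocument ? FULL_DOCUMENT_JS : BODY_ONLY_JS;
        String html = runner.loadAndEvaluate(url, script);
        return html == null ? "" : html;
    }

    public static void main(String[] args) {
        // Stand-in runner: returns canned markup instead of driving a browser.
        ScriptRunner fake = (url, script) ->
            script.equals(FULL_DOCUMENT_JS)
                ? "<html><head><title>t</title></head><body><p>hi</p></body></html>"
                : "<p>hi</p>";

        System.out.println(fetchRenderedHtml(fake, "http://example.com/", true));
        System.out.println(fetchRenderedHtml(fake, "http://example.com/", false));
    }
}
```

With a seam like this, the HTTP-related code could presumably stay shared with
protocol-http, and only the ScriptRunner implementation would depend on
Selenium.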

> (from https://github.com/momer/nutch-selenium-grid-plugin#nutch-selenium)
> C) Not have to wait another 2 years for Nutch to patch in either the Ajax
> crawler hashbang workaround and then, not having to patch it to get the
> use case of amending the original url with the hashbang-workaround's
> content.
You are right: it's a shame that many issues and patches lie around
for years before they get integrated. On the other hand: everyone
is welcome to participate, provide and review patches, improve code
and documentation, etc. There is a lot of work to do...

Thanks for sharing the plugin,
would be great to hear more from you!

Sebastian



On 07/30/2014 09:26 PM, Lewis John Mcgibbney wrote:
> This looks fantastic. Are you interested in bringing it into the codebase? I
> think that this would be very useful to many users of Nutch and would be
> extremely interested in hashing out a patch with you in order to do so.
> Thanks
> Lewis
>


On 07/29/2014 04:26 PM, Mohammed Omer wrote:
> Morning everyone,
> 
> Figured I'd share out a little plugin that delegates fetching and crawling
> to a Selenium Hub/Node system, so that you can rely on Firefox to correctly
> render and parse javascript as it would, and Selenium to pull out the
> content you care about.
> 
> At the moment, the plugin is set to pull just the innerHTML of the page's
> <body>; as I just needed a quick and dirty fix. It's forked from my
> patching of another user's previous attempt at getting Selenium standalone
> working with Nutch; that was in turn a fork of httpclient. That worked
> fine, but it was vulnerable to leaving lots of zombie processes hanging
> around when errors occurred. I pretty much just patched it enough to get it
> working - so if you end up using it and patching things / removing
> unnecessaries, send them up on a PR!
> 
> Here, we rely on Selenium Hub/Node's self-healing set-up, and just pass
> requests for pages to that system, and receive html content as the response.
> 
> I've been using it in production for a month now; and, there are some
> obvious things that need patching like
> 
> - Enabling for https pages
> - It would probably be best for the overall use case to retrieve all of the
> document's html, rather than just a <body> tag (if exists).
> 
> Available at: https://github.com/momer/nutch-selenium-grid-plugin
>
