Re: [New Nutch Plugin] Delegate fetching to Selenium/Firefox for those jobs where you neeeeed javascript parsing

Mo Omer Wed, 30 Jul 2014 22:15:07 -0700

Sure,

I'll have some time tomorrow to put together a general post about how to set up
the Selenium node / hub on Debian that you asked about; I'm open to any ideas or
wiki additions. I figure only a small number of people will ever use it, so any
notes or additional docs to make it easier are welcome.


Thank you,

Mo

This message was drafted on a tiny touch screen; please forgive brevity & tpyos

> On Jul 30, 2014, at 8:47 PM, Bin Wang <[email protected]> wrote:
> 
> I am super interested in this plugin developed by Mohammed, and wondering
> if there is anything I can do to help with this integration.
> I am doing some homework on both Selenium Grid and how Nutch plugins work.
> I have also written a few posts on my blog - datafireball.com. Maybe I
> can help edit the Wiki page for the plugin?
> 
> 
> On Wed, Jul 30, 2014 at 4:22 PM, Sebastian Nagel <[email protected]> wrote:
> 
>> Hi Mohammed,
>> 
>> sounds interesting. I'll give it a try soon.
>> 
>>> I've been using it in production for a month now; and, there are some
>>> obvious things that need patching like
>>> - Enabling for https pages
>>> - It would probably be best for the overall use case to retrieve all of
>>> the document's html, rather than just a <body> tag (if exists).
>> At a first glance, looks like long passages of code are from protocol-http.
>> Would be good to pull-out the parts specific to selenium and integrate
>> them with the existing code base. This might require some refactoring.
>> 
>>> (from https://github.com/momer/nutch-selenium-grid-plugin#nutch-selenium)
>>> C) Not have to wait another 2 years for Nutch to patch in either the
>>> Ajax crawler hashbang workaround and then, not having to patch it to get
>>> the use case of amending the original url with the hashbang-workaround's
>>> content.
>> You are right: it's a shame that many issues and patches lie around
>> for years before they get integrated. On the other hand: everyone
>> is welcome to participate, provide and review patches, improve code
>> and documentation, etc. There is a lot of work to do...
>> 
>> Thanks for sharing the plugin,
>> would be great to hear more from you!
>> 
>> Sebastian
>> 
>> 
>> 
>>> On 07/30/2014 09:26 PM, Lewis John Mcgibbney wrote:
>>> This looks fantastic. Are you interested in bringing it into the
>>> codebase? I think that this would be very useful to many users of Nutch
>>> and I would be extremely interested in hashing out a patch with you in
>>> order to do so.
>>> Thanks
>>> Lewis
>> 
>> 
>>> On 07/29/2014 04:26 PM, Mohammed Omer wrote:
>>> Morning everyone,
>>> 
>>> Figured I'd share out a little plugin that delegates fetching and
>>> crawling to a Selenium Hub/Node system, so that you can rely on Firefox
>>> to correctly render and parse javascript as it would, and Selenium to
>>> pull out the content you care about.
>>> 
>>> At the moment, the plugin is set to pull just the innerHTML of the page's
>>> <body>; as I just needed a quick and dirty fix. It's forked from my
>>> patching of another user's previous attempt at getting Selenium
>>> standalone working with Nutch; that was in turn a fork of httpclient.
>>> That worked fine, but it was vulnerable to leaving lots of zombie
>>> processes hanging around when errors occurred. I pretty much just patched
>>> it enough to get it working - so if you end up using it and patching
>>> things / removing unnecessaries, send them up on a PR!
>>> 
>>> Here, we rely on Selenium Hub/Node's self-healing set-up, and just pass
>>> requests for pages to that system, and receive html content as the
>>> response.
>>> 
>>> I've been using it in production for a month now; and, there are some
>>> obvious things that need patching like
>>> 
>>> - Enabling for https pages
>>> - It would probably be best for the overall use case to retrieve all of
>>> the document's html, rather than just a <body> tag (if exists).
>>> 
>>> Available at: https://github.com/momer/nutch-selenium-grid-plugin
>> 
>>
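
For readers following the thread, the approach described in the original
message (hand a URL to a Selenium hub, let Firefox render it, and take back
the <body>'s innerHTML) might look roughly like the sketch below. This is not
the plugin's code, which is Java inside Nutch; it is a Python illustration
that assumes the Selenium 4 Python bindings. The names fetch_rendered_body,
make_remote_firefox, Driver and BODY_HTML_SCRIPT are hypothetical. Falling
back to the whole document when there is no <body> is also an assumption,
loosely following the "(if exists)" remark above.

```python
from typing import Protocol


class Driver(Protocol):
    """The small slice of the Selenium WebDriver interface this sketch uses."""

    def get(self, url: str) -> None: ...
    def execute_script(self, script: str, *args): ...
    def quit(self) -> None: ...


# Return the <body>'s innerHTML when a body exists, else the whole document.
# (The fallback is an assumption made for this sketch.)
BODY_HTML_SCRIPT = (
    "return document.body ? document.body.innerHTML "
    ": document.documentElement.outerHTML;"
)


def make_remote_firefox(hub_url: str) -> Driver:
    """Open a Firefox session on a Selenium hub (needs the `selenium` package)."""
    # Imported lazily so the rest of the sketch runs without Selenium installed.
    from selenium import webdriver

    return webdriver.Remote(
        command_executor=hub_url,
        options=webdriver.FirefoxOptions(),
    )


def fetch_rendered_body(url: str, driver: Driver) -> str:
    """Load `url` in the remote browser and return the rendered body HTML."""
    try:
        driver.get(url)
        return driver.execute_script(BODY_HTML_SCRIPT)
    finally:
        # Always release the session, even on errors, so the hub can reclaim
        # the browser slot instead of leaving it hanging.
        driver.quit()
```

With a hub running, a call could look like
fetch_rendered_body("http://example.com/", make_remote_firefox(hub_url)),
where hub_url is whatever address your hub listens on (for example
http://localhost:4444/wd/hub on a default local setup, which is an
assumption here). Quitting the session in the finally block is one way to
avoid the leftover-browser problem mentioned for the earlier standalone
approach.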
