Thank you Lewis for your reply.

I initially looked into the protocol-htmlunit
<https://github.com/apache/nutch/tree/master/src/plugin/protocol-htmlunit>
and protocol-interactiveselenium
<https://github.com/apache/nutch/tree/master/src/plugin/protocol-interactiveselenium>
plugins you mentioned.

Based on Selenium, I created a microservice (which handles all the required
SSO redirections, OTP handling, etc.) and hosted it with a Selenium Grid in
the Kubernetes cluster for scaling.
I found that we couldn't scale this approach beyond a certain point, and the
Selenium hub in the Selenium Grid cannot be scaled horizontally.

Later we switched to using Puppeteer <https://github.com/puppeteer/puppeteer>
to drive headless Chrome and scaled this in Kubernetes using browserless
<https://github.com/browserless/chrome>.
We developed a Nutch plugin to call these hosted APIs. This helps, but it is
still very slow compared to the traditional httpclient approach.
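
One way the browser cost might be reduced is to run the browser-driven login
only once per host and reuse the resulting session cookies for plain HTTP
fetches. Below is a minimal sketch of that idea, not what we run today. It
assumes the SSO session cookies stay valid when replayed by a different HTTP
client (some SSO setups may bind them to a user agent or IP address), and the
class name SessionCookieCache, the browserLogin callback and the TTL are all
hypothetical names, not part of Nutch or Puppeteer.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;
import java.util.stream.Collectors;

/**
 * Hypothetical helper (not part of Nutch): caches the Cookie header obtained
 * from a browser-driven SSO login so that the slow login runs at most once
 * per host per TTL window.
 */
class SessionCookieCache {

  /** A cached Cookie header value and the instant at which it becomes stale. */
  private static final class Entry {
    final String cookieHeader;
    final Instant expiresAt;

    Entry(String cookieHeader, Instant expiresAt) {
      this.cookieHeader = cookieHeader;
      this.expiresAt = expiresAt;
    }
  }

  private final Map<String, Entry> byHost = new ConcurrentHashMap<>();
  // Assumption: given a host, performs the SSO/OTP login (e.g. by calling the
  // hosted browser service) and returns the session cookies as name -> value.
  private final Function<String, Map<String, String>> browserLogin;
  private final Duration ttl;

  SessionCookieCache(Function<String, Map<String, String>> browserLogin, Duration ttl) {
    this.browserLogin = browserLogin;
    this.ttl = ttl;
  }

  /** Returns the Cookie header for a host, logging in only if none is cached or it expired. */
  String cookieHeaderFor(String host, Instant now) {
    // compute() is atomic per key, so concurrent fetchers for the same host
    // trigger a single login; the login itself blocks them while it runs.
    Entry entry = byHost.compute(host, (h, old) -> {
      if (old != null && now.isBefore(old.expiresAt)) {
        return old;
      }
      return new Entry(toHeader(browserLogin.apply(h)), now.plus(ttl));
    });
    return entry.cookieHeader;
  }

  /** Drops the cached session, e.g. after a fetch is redirected back to the SSO login page. */
  void invalidate(String host) {
    byHost.remove(host);
  }

  /** Formats cookies as a single Cookie request header value: "a=1; b=2". */
  static String toHeader(Map<String, String> cookies) {
    return cookies.entrySet().stream()
        .map(e -> e.getKey() + "=" + e.getValue())
        .collect(Collectors.joining("; "));
  }
}
```

The returned string would be sent as the Cookie request header on an ordinary
HTTP fetch, with invalidate() called whenever a response redirects back to the
SSO login page. This only helps for pages that are rendered on the server side.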

As this is a common problem in the intranet environment, I was wondering
how people are handling this. I would be happy to discuss this further.

Thank you
Abhay





On Wed, Jun 9, 2021 at 6:41 PM Lewis John McGibbney <lewi...@apache.org>
wrote:

> Hi Abhay,
>
> This is a problem space we looked at a while ago and made quite a bit of
> progress on.
>
> Firstly, the protocol-httpclient plugin has been considered in a
> deprecated state for a while.
> https://github.com/apache/nutch/tree/master/src/plugin/protocol-httpclient
> I'm pretty sure that it will NOT cater for your use case. More information
> on the functionality and limits of this plugin can be found at
> https://cwiki.apache.org/confluence/display/nutch/HttpAuthenticationSchemes
> Some more recent initiatives can be found at
> https://cwiki.apache.org/confluence/display/nutch/HttpPostAuthentication
>
> Now, some of the plugins which may be used/adapted for your use case
> include
>
> 1.
> https://github.com/apache/nutch/tree/master/src/plugin/protocol-htmlunit
> - customizable through
> https://github.com/apache/nutch/blob/master/src/plugin/lib-htmlunit/src/java/org/apache/nutch/protocol/htmlunit/HtmlUnitWebDriver.java
>
> 2. both
> https://github.com/apache/nutch/tree/master/src/plugin/protocol-selenium
> and
> https://github.com/apache/nutch/tree/master/src/plugin/protocol-interactiveselenium
> Some documentation exists at
> https://cwiki.apache.org/confluence/display/NUTCH/AdvancedAjaxInteraction
>
> Admittedly, I've not tried to run these plugins against a modern SSO site
> recently. I suspect that some dependency updates would not go amiss, so
> please take that into consideration.
>
> Your note regarding the time it takes for the 'chaining' of systems
> together to achieve the login is well made. This was easily observed and
> needs a more consolidated/calculated approach IMHO.
>
> I would be interested to discuss this further with you...
>
> hth
> lewismc
>
> On 2021/06/07 02:45:54, Abhay Ratnaparkhi <abhay.ratnapar...@gmail.com>
> wrote:
> > Hello,
> >
> > We are using Nutch to crawl intranet pages behind SSO authentication.
> >
> > I would like to know if anyone has used/updated the httpclient protocol
> > plugin for crawling pages behind SSO authentication.
> >
> > The SSO auth redirects pages to the SSO server for login and optionally
> > asks for second factor authentication like TOTP.
> >
> > We have been using a custom plugin (which calls a Node.js service) which
> > uses Google's Puppeteer to drive a Chromium browser to do this login and
> > OTP handling. This is much slower and might not be required, as many of
> > these pages are rendered on the server side (so dynamic rendering isn't
> > required).
> >
> > Thank you
> > Abhay Ratnaparkhi
> >
>
