Thank you, Lewis, for your reply. I initially looked into the protocol-htmlunit <https://github.com/apache/nutch/tree/master/src/plugin/protocol-htmlunit> and protocol-interactiveselenium <https://github.com/apache/nutch/tree/master/src/plugin/protocol-interactiveselenium> plugins you mentioned.
Based on Selenium, I created a microservice (which handles all required SSO redirections, OTP handling, etc.) and hosted it with a Selenium Grid in the Kubernetes cluster for scaling. I found that we couldn't scale this approach beyond a certain point, and the Selenium hub in the Selenium Grid cannot be scaled horizontally.

Later we switched to using Puppeteer <https://github.com/puppeteer/puppeteer> to drive headless Chrome and scaled this in Kubernetes using browserless <https://github.com/browserless/chrome>. The Nutch plugin was developed to call these hosted APIs. This helps, but it is still very slow compared to the traditional httpclient approach.

As this is a common problem in intranet environments, I was wondering how other people are handling it. I would be happy to discuss this further.

Thank you
Abhay

On Wed, Jun 9, 2021 at 6:41 PM Lewis John McGibbney <lewi...@apache.org> wrote:

> Hi Abhay,
>
> This is a problem space we looked at a while ago and made quite a bit of
> progress on.
>
> Firstly, the protocol-httpclient plugin has been considered in a
> deprecated state for a while.
> https://github.com/apache/nutch/tree/master/src/plugin/protocol-httpclient
> I'm pretty sure that it will NOT cater for your use case. More information
> on the functionality and limits of this plugin can be found at
> https://cwiki.apache.org/confluence/display/nutch/HttpAuthenticationSchemes
> Some more recent initiatives can be found at
> https://cwiki.apache.org/confluence/display/nutch/HttpPostAuthentication
>
> Now, some of the plugins which may be used/adapted for your use case
> include
>
> 1.
> https://github.com/apache/nutch/tree/master/src/plugin/protocol-htmlunit
> - customizable through
> https://github.com/apache/nutch/blob/master/src/plugin/lib-htmlunit/src/java/org/apache/nutch/protocol/htmlunit/HtmlUnitWebDriver.java
>
> 2. both
> https://github.com/apache/nutch/tree/master/src/plugin/protocol-selenium
>
> https://github.com/apache/nutch/tree/master/src/plugin/protocol-interactiveselenium
> some documentation exists at
> https://cwiki.apache.org/confluence/display/NUTCH/AdvancedAjaxInteraction
>
> Admittedly, I've not tried to run these plugins against a modern SSO site
> recently. I suspect that some dependency updates would not go amiss, so
> please take that into consideration.
>
> Your note regarding the time it takes for the 'chaining' of systems
> together to achieve the login is well made. This was easily observed and
> needs a more consolidated/calculated approach IMHO.
>
> I would be interested to discuss this further with you...
>
> hth
> lewismc
>
> On 2021/06/07 02:45:54, Abhay Ratnaparkhi <abhay.ratnapar...@gmail.com>
> wrote:
> > Hello,
> >
> > We are using Nutch to crawl intranet pages behind SSO authentication.
> >
> > I would like to know if anyone has used/updated the httpclient protocol
> > plugin for crawling pages behind SSO authentication.
> >
> > The SSO auth redirects pages to the SSO server for login and optionally
> > asks for second-factor authentication like TOTP.
> >
> > We have been using a custom plugin (which calls a nodejs service) which
> > uses Google's Puppeteer to drive a Chromium browser to do this login and
> > OTP handling. This is much slower and might not be required, as many of
> > these pages are rendered on the server side (so dynamic rendering isn't
> > required).
> >
> > Thank you
> > Abhay Ratnaparkhi
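
For reference, the TOTP second factor mentioned in the original message does not need a browser to be computed; only the login form submission does. Below is a minimal sketch in Python (standard library only), assuming the SSO server uses standard RFC 6238 TOTP with HMAC-SHA1, a 30-second time step and 6 digits, and that the shared secret is available to the crawler. None of that is stated in this thread, and the function names are hypothetical.

```python
import base64
import hashlib
import hmac
import struct
import time
from typing import Optional


def hotp(key: bytes, counter: int, digits: int = 6) -> str:
    """HOTP (RFC 4226): HMAC-SHA1 over the 8-byte big-endian counter."""
    digest = hmac.new(key, struct.pack(">Q", counter), hashlib.sha1).digest()
    # Dynamic truncation: the low nibble of the last byte selects a 4-byte window.
    offset = digest[-1] & 0x0F
    code = struct.unpack(">I", digest[offset:offset + 4])[0] & 0x7FFFFFFF
    return str(code % 10 ** digits).zfill(digits)


def totp(key: bytes, now: Optional[float] = None,
         step: int = 30, digits: int = 6) -> str:
    """TOTP (RFC 6238): HOTP with the counter derived from Unix time."""
    if now is None:
        now = time.time()
    return hotp(key, int(now // step), digits)


def key_from_base32(secret: str) -> bytes:
    """Decode a base32 secret as usually shown at enrolment (assumed format)."""
    cleaned = secret.replace(" ", "").upper()
    padding = "=" * (-len(cleaned) % 8)
    return base64.b32decode(cleaned + padding)
```

A login helper could call `totp(key_from_base32(secret))` just before submitting the second-factor form. Whether a given SSO deployment uses these exact parameters has to be checked against its own configuration.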