Hello,

We are using Nutch to crawl intranet pages behind SSO authentication.

I would like to know if anyone has used or modified the httpclient protocol
plugin to crawl pages behind SSO authentication.

The SSO flow redirects page requests to the SSO server for login and
optionally asks for a second factor such as TOTP.

We have been using a custom plugin that calls a Node.js service, which in
turn uses Google's Puppeteer to drive a Chromium browser for the login and
OTP handling. This is much slower and may not be necessary, as many of these
pages are rendered on the server side (so dynamic rendering isn't required).
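
For context, the OTP step on its own should not need a browser. Below is a
minimal sketch of computing a TOTP code in plain Java, assuming the SSO server
follows the RFC 6238 defaults (HMAC-SHA1, 30-second step, 6 digits); that is
an assumption and may not match our setup. The class name TotpSketch and the
method totp are hypothetical names used only for illustration, and the secret
in main is the RFC test secret, not a real one.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.security.GeneralSecurityException;
import javax.crypto.Mac;
import javax.crypto.spec.SecretKeySpec;

// Hypothetical helper: computes a 6-digit TOTP code (RFC 6238 defaults assumed).
class TotpSketch {

    static String totp(byte[] secret, long unixSeconds) throws GeneralSecurityException {
        // Number of 30-second steps since the Unix epoch, as an 8-byte big-endian counter.
        long counter = unixSeconds / 30;
        byte[] message = ByteBuffer.allocate(8).putLong(counter).array();

        // HMAC-SHA1 of the counter, keyed with the shared secret.
        Mac mac = Mac.getInstance("HmacSHA1");
        mac.init(new SecretKeySpec(secret, "HmacSHA1"));
        byte[] hash = mac.doFinal(message);

        // Dynamic truncation (RFC 4226): the low 4 bits of the last byte select
        // the offset of a 31-bit big-endian integer inside the hash.
        int offset = hash[hash.length - 1] & 0x0f;
        int binary = ((hash[offset] & 0x7f) << 24)
                | ((hash[offset + 1] & 0xff) << 16)
                | ((hash[offset + 2] & 0xff) << 8)
                | (hash[offset + 3] & 0xff);

        // Keep the last 6 decimal digits, zero-padded on the left.
        return String.format("%06d", binary % 1_000_000);
    }

    public static void main(String[] args) throws Exception {
        // RFC 6238 test secret; a real secret is usually distributed Base32-encoded
        // and would need decoding to raw bytes first.
        byte[] secret = "12345678901234567890".getBytes(StandardCharsets.US_ASCII);
        System.out.println(totp(secret, System.currentTimeMillis() / 1000));
    }
}
```

With something like this, a protocol plugin could in principle submit the
login form and the OTP over plain HTTP and reuse the resulting session
cookies, though whether that works depends on how the SSO server's login
pages are built.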

Thank you
Abhay Ratnaparkhi
