Hi Sebastian, thanks for your ideas.
I have tried all cookie policies (NETSCAPE, BROWSER_COMPATIBILITY, RFC....) and none of them worked. I also changed http.enable.cookie.header to true, but this didn't change anything either. However, our Crowd logs indicate that the initial login with username/password succeeded. Maybe there is something wrong with the login cookie/token?

In the log, I see lines like this:

2022-05-10 15:13:32,873 TRACE api.HttpBase - fetched 6028 bytes of compressed content (expanded to 10203 bytes) from https://support.compname.com/hc/en-us/articles/203962208-Document-and-version-history-dates.html
2022-05-10 15:13:32,873 TRACE httpclient.Http - url: https://support.compname.com/hc/en-us/articles/203962208-Document-and-version-history-dates.html; status code: 403; bytes received: 6028; Content-Encoding: gzip; extracted to 10203 bytes
2022-05-10 15:13:33,171 INFO fetcher.FetcherThread - FetcherThread 59 fetch of https://support.compname.com/hc/en-us/articles/203962208-Document-and-version-history-dates.html failed with: Http code=403, url=https://support.compname.com/hc/en-us/articles/203962208-Document-and-version-history-dates.html
2022-05-10 15:13:33,362 INFO fetcher.Fetcher - -activeThreads=50, spinWaiting=50, fetchQueues.totalSize=2, fetchQueues.getQueueCount=1-

Best,
Michael

-----Original Message-----
From: Sebastian Nagel <wastl.na...@googlemail.com.INVALID>
Sent: Tuesday, 10 May 2022 10:23
To: user@nutch.apache.org
Subject: Re: FW: After update from 1.11 to 1.13 form login does not work

Hi Michael,

the only differences in the protocol-httpclient plugin between Nutch 1.11 and 1.13 are
- NUTCH-2280 [1], which allows configuring the cookie policy
- NUTCH-2355 [2], which allows setting an explicit cookie for a request URL

Could this be related?

Do the log messages give any useful hints about the cause if you set
  log4j.logger.org.apache.nutch.protocol.httpclient=TRACE
in log4j.properties?
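
For reference, a minimal sketch of that logging setting, assuming the stock conf/log4j.properties of a Nutch 1.x installation (log4j 1.x properties syntax). The file location and the commented-out wire logger are assumptions: the latter is the header wire log of the commons-httpclient 3.x library that protocol-httpclient builds on, and it is not something confirmed in this thread.

```properties
# conf/log4j.properties (assumed location in a Nutch 1.x runtime)

# TRACE-level output from the protocol-httpclient plugin, as suggested above
log4j.logger.org.apache.nutch.protocol.httpclient=TRACE

# Assumption: commons-httpclient 3.x header wire log; if enabled, it should
# show the Cookie / Set-Cookie headers exchanged during the form login
#log4j.logger.httpclient.wire.header=DEBUG
```

Comparing the cookies sent with the failing 403 request against those set during the login redirects may help narrow down whether the session cookie is dropped.
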

Best,
Sebastian

[1] https://issues.apache.org/jira/browse/NUTCH-2280
[2] https://issues.apache.org/jira/browse/NUTCH-2355

On 5/9/22 15:30, Fritsch, Michael wrote:
> Hello,
>
> I used Nutch 1.11 to crawl pages behind a login page.
> The http-auth configuration looked like this:
>
> ---------------------------------------------------------------------------
> <?xml version="1.0"?>
> <auth-configuration>
>   <credentials authMethod="formAuth"
>                loginUrl=loginURL<https://sso.coremedia.com/zendesk/index.jsp?brand_id=3187316&locale_id=1&return_to=https%3A%2F%2Fsupport.coremedia.com%2Fhc%2Fen-us&timestamp=1464019963>
>                loginFormId="loginForm"
>                loginRedirect="true">
>     <loginPostData>
>       <field name="user[email]"
>              value="username"/>
>       <field name="user[password]"
>              value="password"/>
>     </loginPostData>
>     <additionalPostHeaders>
>     </additionalPostHeaders>
>   </credentials>
> </auth-configuration>
> --------------------------------------------------------------------
>
> Everything worked fine. Then I updated to 1.13 (I also tried 1.18) and
> changed the configuration as described in the http-auth.xml file:
>
> -----------------------------------------------------------------------------
>
> <auth-configuration>
>   <credentials authMethod="formAuth"
>                loginUrl=loginURL<https://sso.coremedia.com/zendesk/index.jsp?brand_id=3187316&locale_id=1&return_to=https%3A%2F%2Fsupport.coremedia.com%2Fhc%2Fen-us&timestamp=1464019963>
>                loginFormId="loginForm"
>                loginRedirect="true">
>     <loginPostData>
>       <field name="user[email]"
>              value="username"/>
>       <field name="user[password]"
>              value="password"/>
>     </loginPostData>
>     <additionalPostHeaders>
>     </additionalPostHeaders>
>     <removedFormFields>
>     </removedFormFields>
>     <loginCookie>
>       <policy>BROWSER_COMPATIBILITY</policy>
>     </loginCookie>
>   </credentials>
>
> </auth-configuration>
>
> -----------------------------------------------
>
> Now the login no longer works. After some redirects, it returns an HTTP
> 403 response.
> I tried all loginCookie policy entries, but nothing worked.
> The login is to a Zendesk support system with Atlassian Crowd as the login
> provider. Has anything changed between 1.11 and 1.13? Is something stricter
> than before?
>
> I found a very similar question on this mailing list
> (https://www.mail-archive.com/user@nutch.apache.org/msg15746.html) from
> 2017, which has no solution.
>
> I would appreciate any help!
>
> Best regards
>
> Michael
>
> Dr. Michael Fritsch
> Technical Editor
> CoreMedia GmbH