Hi Sebastian,

Thanks for your ideas.

I have tried all cookie policies (NETSCAPE, BROWSER_COMPATIBILITY, RFC....) and 
none of them worked.
I also set http.enable.cookie.header to true, but this didn't change 
anything either.

However, when we looked into our Crowd logs, it seems that the initial login 
with username/password succeeded. Maybe there is something wrong with the 
login cookie/token?
In the log, I see lines like this:

2022-05-10 15:13:32,873 TRACE api.HttpBase - fetched 6028 bytes of compressed content (expanded to 10203 bytes) from https://support.compname.com/hc/en-us/articles/203962208-Document-and-version-history-dates.html
2022-05-10 15:13:32,873 TRACE httpclient.Http - url: https://support.compname.com/hc/en-us/articles/203962208-Document-and-version-history-dates.html; status code: 403; bytes received: 6028; Content-Encoding: gzip; extracted to 10203 bytes
2022-05-10 15:13:33,171 INFO  fetcher.FetcherThread - FetcherThread 59 fetch of https://support.compname.com/hc/en-us/articles/203962208-Document-and-version-history-dates.html failed with: Http code=403, url=https://support.compname.com/hc/en-us/articles/203962208-Document-and-version-history-dates.html
2022-05-10 15:13:33,362 INFO  fetcher.Fetcher - -activeThreads=50, spinWaiting=50, fetchQueues.totalSize=2, fetchQueues.getQueueCount=1-
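
To see whether every fetch is rejected or only some, the failures in the log 
can be tallied by status code. Below is a minimal Python sketch, not part of 
Nutch. It assumes the log contains FetcherThread messages of the form 
"failed with: Http code=..., url=..." as in the excerpt above; the function 
name count_fetch_failures is made up for illustration.

```python
import re
from collections import Counter

# Matches the FetcherThread failure message from the excerpt above, e.g.
# "failed with: Http code=403, url=https://...".
# \s* tolerates a line break between the comma and "url=" (mail wrapping).
FAILURE_RE = re.compile(r"failed with: Http code=(\d+),\s*url=(\S+)")


def count_fetch_failures(log_text):
    """Return a Counter mapping HTTP status code -> number of failed fetches."""
    return Counter(int(code) for code, _url in FAILURE_RE.findall(log_text))
```

Reading the fetcher log into a string and passing it to 
count_fetch_failures gives one count per status code. If 403 is the only code 
that shows up, that would fit the idea that the session cookie/token is not 
being sent or accepted after the login.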



Best, 
Michael


-----Original Message-----
From: Sebastian Nagel <wastl.na...@googlemail.com.INVALID> 
Sent: Tuesday, May 10, 2022 10:23
To: user@nutch.apache.org
Subject: Re: FW: After update from 1.11 to 1.13 form login does not work

Hi Michael,

The only differences in the protocol-httpclient plugin between Nutch 1.11 and
1.13 are
- NUTCH-2280 [1], which allows configuring the cookie policy
- NUTCH-2355 [2], which allows setting an explicit cookie for a request URL

Could this be related?

Do the log messages give any useful hints about the cause if you set
  log4j.logger.org.apache.nutch.protocol.httpclient=TRACE
in log4j.properties?

Best,
Sebastian

[1] https://issues.apache.org/jira/browse/NUTCH-2280
[2] https://issues.apache.org/jira/browse/NUTCH-2355

On 5/9/22 15:30, Fritsch, Michael wrote:
> Hello,
> 
> I used Nutch 1.11 to crawl pages behind a login page.
> The http-auth configuration looked like this:
> 
> ---------------------------------------------------------------------------
> <?xml version="1.0"?>
> <auth-configuration>
>   <credentials authMethod="formAuth"
>                loginUrl="https://sso.coremedia.com/zendesk/index.jsp?brand_id=3187316&amp;locale_id=1&amp;return_to=https%3A%2F%2Fsupport.coremedia.com%2Fhc%2Fen-us&amp;timestamp=1464019963"
>                loginFormId="loginForm"
>                loginRedirect="true">
>     <loginPostData>
>       <field name="user[email]"
>              value="username"/>
>       <field name="user[password]"
>              value="password"/>
>     </loginPostData>
>     <additionalPostHeaders>
>     </additionalPostHeaders>
>   </credentials>
> </auth-configuration>
> --------------------------------------------------------------------
> 
> Everything worked fine. Then I updated to 1.13 (I also tried 1.18) and 
> changed the configuration as described in the http-auth.xml file:
> 
> -----------------------------------------------------------------------------
> 
> <auth-configuration>
>   <credentials authMethod="formAuth"
>                loginUrl="https://sso.coremedia.com/zendesk/index.jsp?brand_id=3187316&amp;locale_id=1&amp;return_to=https%3A%2F%2Fsupport.coremedia.com%2Fhc%2Fen-us&amp;timestamp=1464019963"
>                loginFormId="loginForm"
>                loginRedirect="true">
>     <loginPostData>
>       <field name="user[email]"
>              value="username"/>
>       <field name="user[password]"
>              value="password"/>
>     </loginPostData>
>     <additionalPostHeaders>
>     </additionalPostHeaders>
>     <removedFormFields>
>     </removedFormFields>
>     <loginCookie>
>       <policy>BROWSER_COMPATIBILITY</policy>
>     </loginCookie>
>   </credentials>
> 
> </auth-configuration>
> 
> -----------------------------------------------
> 
> Now the login does not work anymore. After some redirects, it returns an 
> HTTP 403 response. I tried all loginCookie policy entries, but nothing worked.
> The login is to a Zendesk support system with Atlassian Crowd as the login 
> provider. Has anything changed between 1.11 and 1.13? Is something stricter 
> than before?
> 
> 
> I found a very similar question on this mailing list 
> (https://www.mail-archive.com/user@nutch.apache.org/msg15746.html) from 
> 2017, which has no solution.
> 
> I would appreciate any help!
> 
> Best regards
> 
> Michael
> 
> 
> Dr. Michael Fritsch
> Technical Editor
> 
> T: +49.40.325587.214
> E: michael.frit...@coremedia.com<mailto:michael.frit...@coremedia.com>
> 
> CoreMedia GmbH - Be iconic
> Ludwig-Erhard-Str. 18
> 20459 Hamburg, Germany
> www.coremedia.com<http://www.coremedia.com/>
> ------------------------------------------------------------
> Managing Director: Sören Stamer
> Commercial Register: Amtsgericht Hamburg, HR B 162480
> ----------------------------------------------------------------------
> Stay up to date and follow us on 
> LinkedIn<https://www.linkedin.com/company/coremedia-corp> or 
> Twitter<https://twitter.com/contentcloud>
> 
> 
