Hi Michael,

the only differences in the protocol-httpclient plugin between Nutch 1.11 and
1.13 are
- NUTCH-2280 [1], which allows configuring the cookie policy
- NUTCH-2355 [2], which allows setting an explicit cookie for a request URL

Could this be related?

Do the log messages give any useful hints about the cause if you set
  log4j.logger.org.apache.nutch.protocol.httpclient=TRACE
in log4j.properties?
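
As a rough sketch of what that could look like, assuming the log4j 1.x
properties file that Nutch 1.x ships as conf/log4j.properties: the first
logger is the one named above; the two Commons HttpClient loggers are an
addition of mine and only apply if the plugin is still built on Commons
HttpClient 3.x.

```properties
# conf/log4j.properties (fragment)

# Trace output of the protocol-httpclient plugin, as suggested above
log4j.logger.org.apache.nutch.protocol.httpclient=TRACE

# Assumption: protocol-httpclient uses Commons HttpClient 3.x.
# These loggers would then report its internal decisions and the
# request/response headers (including Cookie and Set-Cookie).
log4j.logger.org.apache.commons.httpclient=DEBUG
log4j.logger.httpclient.wire.header=DEBUG
```

If these loggers are active, the header log should show which cookies are
received and sent back on each redirect, which may help to narrow down
where the 403 comes from.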

Best,
Sebastian

[1] https://issues.apache.org/jira/browse/NUTCH-2280
[2] https://issues.apache.org/jira/browse/NUTCH-2355

On 5/9/22 15:30, Fritsch, Michael wrote:
> Hello,
> 
> I used nutch 1.11 to crawl pages behind a login page.
> The http-auth configuration looked like this:
> 
> ---------------------------------------------------------------------------
> <?xml version="1.0"?>
> <auth-configuration>
>   <credentials authMethod="formAuth"
>                loginUrl="https://sso.coremedia.com/zendesk/index.jsp?brand_id=3187316&amp;locale_id=1&amp;return_to=https%3A%2F%2Fsupport.coremedia.com%2Fhc%2Fen-us&amp;timestamp=1464019963"
>                loginFormId="loginForm"
>                loginRedirect="true">
>     <loginPostData>
>       <field name="user[email]"
>              value="username"/>
>       <field name="user[password]"
>              value="password"/>
>     </loginPostData>
>     <additionalPostHeaders>
>     </additionalPostHeaders>
>   </credentials>
> </auth-configuration>
> --------------------------------------------------------------------
> 
> Everything worked fine. Then I updated to 1.13 (I also tried 1.18) and 
> changed the configuration as described in the http-auth.xml file:
> 
> -----------------------------------------------------------------------------
> 
> <auth-configuration>
>   <credentials authMethod="formAuth"
>                loginUrl="https://sso.coremedia.com/zendesk/index.jsp?brand_id=3187316&amp;locale_id=1&amp;return_to=https%3A%2F%2Fsupport.coremedia.com%2Fhc%2Fen-us&amp;timestamp=1464019963"
>                loginFormId="loginForm"
>                loginRedirect="true">
>     <loginPostData>
>       <field name="user[email]"
>              value="username"/>
>       <field name="user[password]"
>              value="password"/>
>     </loginPostData>
>     <additionalPostHeaders>
>     </additionalPostHeaders>
>     <removedFormFields>
>     </removedFormFields>
>     <loginCookie>
>       <policy>BROWSER_COMPATIBILITY</policy>
>     </loginCookie>
>   </credentials>
> 
> </auth-configuration>
> 
> -----------------------------------------------
> 
> Now the login no longer works. After some redirects, it returns an HTTP 
> 403 response. I tried all loginCookie policy values, but none of them worked.
> The login is to a Zendesk support system with Atlassian Crowd as the login 
> provider. Has anything changed between 1.11 and 1.13, or is something 
> stricter than before?
> 
> 
> I found a very similar question on this mailing list 
> (https://www.mail-archive.com/user@nutch.apache.org/msg15746.html) from 
> 2017, which has no solution.
> 
> I would appreciate any help!
> 
> Best regards
> 
> Michael
> 
> 
> Dr. Michael Fritsch
> Technical Editor
> 
> T: +49.40.325587.214
> E: michael.frit...@coremedia.com<mailto:michael.frit...@coremedia.com>
> 
> CoreMedia GmbH - Be iconic
> Ludwig-Erhard-Str. 18
> 20459 Hamburg, Germany
> www.coremedia.com<http://www.coremedia.com/>
> ------------------------------------------------------------
> Managing Director: Sören Stamer
> Commercial Register: Amtsgericht Hamburg, HR B 162480
> ----------------------------------------------------------------------
> Stay up to date and follow us on 
> LinkedIn<https://www.linkedin.com/company/coremedia-corp> or 
> Twitter<https://twitter.com/contentcloud>
> 
> 
