[ 
https://issues.apache.org/jira/browse/CONNECTORS-1598?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16808808#comment-16808808
 ] 

Karl Wright commented on CONNECTORS-1598:
-----------------------------------------

Hi [~goovaertsr], if your site authentication involves cookies, you really need 
to have session-based authentication set up.  Furthermore, you do NOT want to 
include the login page in your seed list.  Instead, you want to set up a login 
sequence (which is attached to a specific set of URLs that define the 
session-protected part of your site), which will be triggered if the login 
needs to be done.

What session-based Auth does is the following:
- detects when accessing a content page fails because of missing session login
- provides a way of walking through the session login page sequence that sets 
the cookies
- retries the content page fetch with the cookies that were set during the 
login sequence
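
The three steps above can be sketched as a small loop. This is an illustration of the general technique only, not ManifoldCF's actual code: the names `fetch_with_session`, `fetch`, `login`, and `needs_login` are hypothetical, and the caller supplies the fetcher, the login sequence, and the detection rule.

```python
from typing import Callable, Dict, Tuple

# Hypothetical types: a response is (status code, body text), and a fetcher
# takes a URL plus the current cookie jar.
Response = Tuple[int, str]
Fetcher = Callable[[str, Dict[str, str]], Response]


def fetch_with_session(
    url: str,
    fetch: Fetcher,
    login: Callable[[Dict[str, str]], None],
    needs_login: Callable[[Response], bool],
    cookies: Dict[str, str],
) -> Response:
    """Fetch url; if the response indicates a missing session, log in once and retry."""
    response = fetch(url, cookies)
    if needs_login(response):
        # Walk the login page sequence; it is expected to fill in the cookie jar.
        login(cookies)
        # Retry the original content page with the session cookies now present.
        response = fetch(url, cookies)
    return response
```

Everything hinges on `needs_login`: if it never fires for the response the site sends when the session is missing, the login sequence is never triggered.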

It is therefore critical to configure the session-based access so that it 
reliably recognizes how the site you are crawling responds to an invalid, 
missing, or expired session cookie.  If you've already read the end-user 
documentation about this, then this should be clear.

If I understand your problem, your site does not redirect to a login page when 
there is no session cookie: it simply returns a 401.  That's not a very typical 
flow for session-based code, but you should nevertheless be able to match 
specific page contents associated with the 401 response.  In HTTP, all response 
codes can have content, and 401 is no different, so I assume this is the case?
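
As an illustration of matching page contents rather than the status code, here is a minimal sketch. It assumes the login form carries a `name` attribute of `loginform`; that name, the pattern, and the function `looks_like_login_page` are hypothetical and would need to be adapted to the real page. It shows the regular-expression idea only and says nothing about how the connector itself evaluates its identification regexp.

```python
import re

# Hypothetical pattern: a <form> tag whose name attribute is "loginform".
LOGIN_FORM = re.compile(r'<form[^>]*\bname=["\']loginform["\']', re.IGNORECASE)


def looks_like_login_page(body: str) -> bool:
    """Return True when the response body contains the login form."""
    return LOGIN_FORM.search(body) is not None
```

A check of this kind depends only on the body, so it behaves the same whether the form arrives with a 200, a 3xx landing page, or a 401.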

To answer your question:
{quote}
is 4xx tied to page base authentication and 3xx tied to session based 
authentication?
{quote}

4xx responses are handled only as page-based authentication.  That is what 
4xx responses typically mean in HTTP.


> session based authentication cannot register 401
> ------------------------------------------------
>
>                 Key: CONNECTORS-1598
>                 URL: https://issues.apache.org/jira/browse/CONNECTORS-1598
>             Project: ManifoldCF
>          Issue Type: Bug
>          Components: Web connector
>    Affects Versions: ManifoldCF 2.12
>            Reporter: roel goovaerts
>            Priority: Major
>
> Description:
> Access to a specific domain is restricted by being A) an intranet service B) 
> based on an employee/customer profile.
> For manifold to be able to be authenticated there is a specific 
> '\{{domain}}/login' page with a form where manifold was configured to enter 
> its username and password. A session-cookie is then set so manifold is 
> authenticated to access all resources. If a request for a resource is not 
> authenticated the service throws a 401. When the service returns a 401 the 
> actual content of the resource includes the same form as is present in 
> '\{{domain}}/login'.
> Problem:
> The only way we have been able to configure manifold to be authenticated was 
> by specifying session-based credentials AND providing '\{{domain}}/login' as 
> a seed in the job as well. The only other seed in the job is a sitemap.
> This is of course not ideal since it can easily happen that the seed for the 
> sitemap gets processed first, which then throws a 401 on the sitemap and the 
> job stops.
> Another possible scenario with this configuration is that the cookie expires 
> and all other resources throw 401 and get deleted from the index 
> (elasticsearch). There is also another job (different language, same domain), 
> usage of the cookie from the previous job has also been registered.
> Current session-based access credentials configuration:
> --url regular expression : https://\{{domain}}/
> --login pages: 
> ---login url regexp : 'login'
> ---page type : form
> ---identification regexp is set to match the form-name
> ---form parameters are filled with the correct parameters
> This is verified to work, but as I understand it, this only works because the 
> login-page is part of the seeds and so it matches the url when it comes 
> across it when crawling. There is no configuration yet which redirects (for 
> example) to this page when manifold receives a 401.
> My goal was then to remove the login-page from the seeds and configure the 
> job so that each time a fetch returns a 401, manifold knows to go to the 
> login page. in pseudo code:
> --If authenticated
> ---process 
> --else
> ---redirect to login
> ---retry resource
>  
> Based on the documentation here: 
> https://manifoldcf.apache.org/release/release-2.12/en_US/end-user-documentation.html#webrepository
>  I tried a few different configurations. The first thing to notice is in the 
> comparison table, 'page based authentication' only mentions 4xx and 'session 
> based authentication' only mentions 3xx.
> At this time my biggest question is: are these response codes bound to the 
> difference in settings between page and session based? As far as I have been 
> able to see, whenever manifold receives a 401 it logs "ignoring url 
> \{{url}} because it failed to fetch (status=401, ..."
> Am I not able to work with session based authentication when the service 
> returns 401's?
>  
> Configuration attempts (all failed):
> - for all attempts the login page was removed from the seeds.
> - in general I have kept the above configuration of page type 'form', in the 
> case I was able to redirect manifold to this page.
> - The kinds of content that a web connection can recognize as a login page 
> specified in the documentation lists an option "A page that has specific 
> content on it, as described by a regular expression". As the description of 
> this case specified I tried the page type 'content' setting, with 
> identification regexp set to '.*' for testing and an override url set to 
> '\{{domain}}/login'. My hopes were that in this test the match-all-regexp 
> would override to the login page for every url it fetches.
> - Since the content of a 401 also includes the same form as the login page, I 
> tried with page type 'form', supplied identification regexp and override form 
> parameters, just like above, only with the "login url regexp" set to '.*'. My 
> hopes were that each page has the possibility to have the form recognized if 
> it is returned as a 401.
> In both cases the only thing I could see is that manifold fetched the 
> sitemap, received a 401 and in manifold logged "ignoring url \{{url}} 
> because it failed to fetch (status=401, ..."
> Some questions:
> - Is there anything to be done when manifold receives a 401?
> - is 4xx tied to page base authentication and 3xx tied to session based 
> authentication?
> - is there some other configuration/logic that I am missing, that I could try 
> out?
> A minimal effort solution would be if there was a way to make manifold start 
> at the login and not do any crawling (most importantly no deleting) when it is 
> unable to be authenticated. Together with this a way to remove the session 
> cookie when the job is done would also be necessary, so as to avoid the 
> expiry of the cookie as a result of manifold using an old cookie.
> Side-note; is there any way to make manifold not delete documents when it 
> receives a 401?



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
