[ 
https://issues.apache.org/jira/browse/CONNECTORS-1598?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16808812#comment-16808812
 ] 

roel goovaerts commented on CONNECTORS-1598:
--------------------------------------------

Hi Karl, thanks for the reply. This is indeed what I expected; I wanted to be sure 
that there is no workaround via (for example) the configurations I tried. 
There have already been talks about returning 3xx instead of 4xx.

Regards

> session based authentication cannot register 401
> ------------------------------------------------
>
>                 Key: CONNECTORS-1598
>                 URL: https://issues.apache.org/jira/browse/CONNECTORS-1598
>             Project: ManifoldCF
>          Issue Type: Bug
>          Components: Web connector
>    Affects Versions: ManifoldCF 2.12
>            Reporter: roel goovaerts
>            Priority: Major
>
> Description:
> Access to a specific domain is restricted because it is A) an intranet service and B) 
> based on an employee/customer profile.
> For manifold to be able to be authenticated there is a specific 
> '\{{domain}}/login' page with a form where manifold was configured to enter 
> its username and password. A session-cookie is then set so manifold is 
> authenticated to access all resources. If a request for a resource is not 
> authenticated the service throws a 401. When the service returns a 401 the 
> actual content of the resource includes the same form as is present in 
> '\{{domain}}/login'.
> Problem:
> The only way we have been able to configure manifold to be authenticated was 
> by specifying session-based credentials AND providing '\{{domain}}/login' as 
> a seed in the job as well. The only other seed in the job is a sitemap.
> This is of course not ideal since it can easily happen that the seed for the 
> sitemap gets processed first, which then throws a 401 on the sitemap and the 
> job stops.
> Another possible scenario with this configuration is that the cookie expires 
> and all other resources throw 401 and get deleted from the index 
> (elasticsearch). There is also another job (different language, same domain); 
> that job has also been observed using the cookie from the previous job.
> Current session-based access credentials configuration:
> --url regular expression : https://\{{domain}}/
> --login pages: 
> ---login url regexp : 'login'
> ---page type : form
> ---identification regexp is set to match the form-name
> ---form parameters are filled with the correct parameters
> This is verified to work, but as I understand it this only works because the 
> login-page is part of the seeds and so it matches the url when it comes 
> across it when crawling. There is no configuration yet which redirects (for 
> example) to this page when manifold receives a 401.
> My goal was then to remove the login-page from the seeds and configure the 
> job so that each time a fetch returns a 401, manifold knows to go to the 
> login page. In pseudocode:
> --If authenticated
> ---process 
> --else
> ---redirect to login
> ---retry resource
>  
> Based on the documentation here: 
> https://manifoldcf.apache.org/release/release-2.12/en_US/end-user-documentation.html#webrepository
>  I tried a few different configurations. The first thing to notice is in the 
> comparison table, 'page based authentication' only mentions 4xx and 'session 
> based authentication' only mentions 3xx.
> At this time my biggest question is: are these response codes bound to the 
> difference in settings between page and session based? As far as I have been 
> able to see, whenever manifold receives a 401 it logs "ignoring url 
> \{{url}} because it failed to fetch (status=401, ..."
> Am I not able to work with session based authentication when the service 
> returns 401's?
>  
> Configuration attempts (all failed):
> - for all attempts the login page was removed from the seeds.
> - in general I have kept the above configuration of page type 'form', in the 
> case I was able to redirect manifold to this page.
> - The kinds of content that a web connection can recognize as a login page 
> specified in the documentation lists an option "A page that has specific 
> content on it, as described by a regular expression". As the description of 
> this case specified I tried the page type 'content' setting, with 
> identification regexp set to '.*' for testing and an override url set to 
> '\{{domain}}/login'. My hopes were that in this test the match-all-regexp 
> would override to the login page for every url it fetches.
> - Since the content of a 401 also includes the same form as the login page, I 
> tried with page type 'form', supplied identification regexp and override form 
> parameters, just like above, only with the "login url regexp" set to '.*'. My 
> hopes were that each page has the possibility to have the form recognized if 
> it is returned as a 401.
> In both cases the only thing I could see is that manifold fetched the 
> sitemap, received a 401 and logged "ignoring url \{{url}} 
> because it failed to fetch (status=401, ..."
> Some questions:
> - Is there anything to be done when manifold receives a 401?
> - Is 4xx tied to page based authentication and 3xx tied to session based 
> authentication?
> - is there some other configuration/logic that I am missing, that I could try 
> out?
> A minimal effort solution would be if there was a way to make manifold start 
> at the login and not do any crawling (most importantly no deleting) when it is 
> unable to be authenticated. Together with this a way to remove the session 
> cookie when the job is done would also be necessary, so as to avoid the 
> expiry of the cookie as a result of manifold using an old cookie.
> Side note: is there any way to make manifold not delete documents when it 
> receives a 401?
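
The "if authenticated, process, else log in and retry" flow from the pseudocode in the quoted description could be sketched roughly as below. This is only an illustration against a simulated service: FakeService, Crawler, fetch_with_login, the paths and the credentials are all made-up names, and none of it is ManifoldCF code or configuration.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Response:
    status: int
    body: str


@dataclass
class FakeService:
    """Hypothetical stand-in for the intranet site described in the issue.

    Any request without a valid session cookie gets a 401 whose body
    contains the login form, as the issue describes.
    """
    valid_user: str = "crawler"
    valid_password: str = "secret"
    sessions: set = field(default_factory=set)

    def get(self, path: str, cookie: Optional[str]) -> Response:
        if cookie in self.sessions:
            return Response(200, f"content of {path}")
        return Response(401, "<form name='login'>...</form>")

    def login(self, user: str, password: str) -> Optional[str]:
        """Simulates submitting the login form; returns a session cookie or None."""
        if (user, password) == (self.valid_user, self.valid_password):
            cookie = f"session-{len(self.sessions) + 1}"
            self.sessions.add(cookie)
            return cookie
        return None


@dataclass
class Crawler:
    """Hypothetical client that logs in whenever a fetch returns 401."""
    service: FakeService
    user: str
    password: str
    cookie: Optional[str] = None

    def fetch_with_login(self, path: str) -> Response:
        response = self.service.get(path, self.cookie)
        if response.status != 401:
            return response  # authenticated: process as usual
        # Not authenticated: submit the login form, then retry exactly once.
        self.cookie = self.service.login(self.user, self.password)
        if self.cookie is None:
            return response  # login failed: report the 401 instead of looping
        return self.service.get(path, self.cookie)


service = FakeService()
crawler = Crawler(service, "crawler", "secret")
first = crawler.fetch_with_login("/sitemap.xml")  # no cookie yet: logs in, retries

service.sessions.clear()                          # simulate the cookie expiring
second = crawler.fetch_with_login("/page")        # logs in again, retries

bad = Crawler(service, "crawler", "wrong").fetch_with_login("/page")
```

With this shape the login page would not need to be a seed, and an expired cookie would lead to a fresh login rather than a run of 401s. A fetch that still returns 401 after the single retry is reported as-is, so a caller could choose to abort the job at that point instead of deleting documents.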



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
