[ https://issues.apache.org/jira/browse/CONNECTORS-1598?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16808812#comment-16808812 ]
roel goovaerts commented on CONNECTORS-1598:
--------------------------------------------

Hi Karl, thanks for the reply. This is indeed what I expected; I wanted to be
sure that there is no workaround via (for example) the configurations I tried.
There have already been talks about returning 3xx instead of 4xx.

Regards

> session based authentication cannot register 401
> ------------------------------------------------
>
> Key: CONNECTORS-1598
> URL: https://issues.apache.org/jira/browse/CONNECTORS-1598
> Project: ManifoldCF
> Issue Type: Bug
> Components: Web connector
> Affects Versions: ManifoldCF 2.12
> Reporter: roel goovaerts
> Priority: Major
>
> Description:
> Access to a specific domain is restricted in two ways: A) it is an intranet
> service and B) access is based on an employee/customer profile.
> For manifold to be able to authenticate there is a specific
> '\{{domain}}/login' page with a form where manifold was configured to enter
> its username and password. A session cookie is then set so manifold is
> authenticated to access all resources. If a request for a resource is not
> authenticated the service returns a 401. When the service returns a 401 the
> response body includes the same form as is present in
> '\{{domain}}/login'.
> Problem:
> The only way we have been able to configure manifold to be authenticated was
> by specifying session-based credentials AND providing '\{{domain}}/login' as
> a seed in the job as well. The only other seed in the job is a sitemap.
> This is of course not ideal, since it can easily happen that the seed for the
> sitemap gets processed first, which then returns a 401 on the sitemap and the
> job stops.
> Another possible scenario with this configuration is that the cookie expires
> and all other resources return 401 and get deleted from the index
> (elasticsearch). There is also another job (different language, same domain);
> we have observed it reusing the cookie from the previous job.
> Current session-based access credentials configuration:
> --url regular expression : https://\{{domain}}/
> --login pages:
> ---login url regexp : 'login'
> ---page type : form
> ---identification regexp is set to match the form-name
> ---form parameters are filled with the correct parameters
> This is verified to work, but as I understand it, this only works because the
> login page is part of the seeds, so it matches the url when manifold comes
> across it while crawling. There is no configuration yet which redirects (for
> example) to this page when manifold receives a 401.
> My goal was then to remove the login page from the seeds and configure the
> job so that each time a fetch returns a 401, manifold knows to go to the
> login page. In pseudo code:
> --If authenticated
> ---process
> --else
> ---redirect to login
> ---retry resource
>
> Based on the documentation here:
> https://manifoldcf.apache.org/release/release-2.12/en_US/end-user-documentation.html#webrepository
> I tried a few different configurations. The first thing to notice is that in
> the comparison table, 'page based authentication' only mentions 4xx and
> 'session based authentication' only mentions 3xx.
> At this time my biggest question is: are these response codes bound to the
> difference in settings between page and session based? As far as I have been
> able to see, whenever manifold receives a 401 it logs "ignoring url
> \{{url}} because it failed to fetch (status=401, ..."
> Am I not able to work with session based authentication when the service
> returns 401s?
>
> Configuration attempts (all failed):
> - For all attempts the login page was removed from the seeds.
> - In general I have kept the above configuration of page type 'form', in
> case I was able to redirect manifold to this page.
> - The list of kinds of content that a web connection can recognize as a
> login page in the documentation includes the option "A page that has
> specific content on it, as described by a regular expression". As the
> description of this case specifies, I tried the page type 'content' setting,
> with the identification regexp set to '.*' for testing and an override url
> set to '\{{domain}}/login'. My hope was that in this test the match-all
> regexp would override to the login page for every url it fetches.
> - Since the content of a 401 also includes the same form as the login page, I
> tried page type 'form' with the identification regexp and override form
> parameters supplied just like above, only with the "login url regexp" set to
> '.*'. My hope was that the form could then be recognized on any page that is
> returned as a 401.
> In both cases the only thing I could see is that manifold fetched the
> sitemap, received a 401 and logged "ignoring url \{{url}}
> because it failed to fetch (status=401, ..."
> Some questions:
> - Is there anything to be done when manifold receives a 401?
> - Is 4xx tied to page based authentication and 3xx tied to session based
> authentication?
> - Is there some other configuration/logic that I am missing, that I could try
> out?
> A minimal-effort solution would be a way to make manifold start at the login
> page and not do any crawling (most importantly, no deleting) when it is
> unable to authenticate. Together with this, a way to remove the session
> cookie when the job is done would also be necessary, so that manifold does
> not end up using an old, expired cookie.
> Side note: is there any way to make manifold not delete documents when it
> receives a 401?

--
This message was sent by Atlassian JIRA (v7.6.3#76005)
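
For reference, the login-then-retry flow from the pseudo code in the issue description (if authenticated, process; otherwise go to the login page and retry the resource) could be sketched roughly as below. This is plain Python for illustration only, not ManifoldCF code or its API. Every name in it (`Session`, `fetch_with_login`, `fake_fetch`, `fake_login`, `decide_action`, the `example.invalid` URLs) is a hypothetical stand-in, and the "keep the document on 401" rule is only the behaviour asked for in the side note, not something the connector is known to do.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Tuple

# Hypothetical stand-in for '{{domain}}/login'; not a real endpoint.
LOGIN_URL = "https://example.invalid/login"

Response = Tuple[int, str]  # (HTTP status code, response body)


@dataclass
class Session:
    """Holds the session cookie shared by all fetches of one job."""
    cookie: Optional[str] = None


def fetch_with_login(
    url: str,
    session: Session,
    fetch: Callable[[str, Session], Response],
    login: Callable[[Session], None],
    max_logins: int = 1,
) -> Response:
    """Fetch url; on a 401, log in and retry, at most max_logins times."""
    status, body = fetch(url, session)
    logins = 0
    while status == 401 and logins < max_logins:
        login(session)                      # "redirect to login"
        logins += 1
        status, body = fetch(url, session)  # "retry resource"
    return status, body


def decide_action(status: int) -> str:
    """Assumed policy from the side note: never delete on a 401."""
    if status == 200:
        return "index"
    if status == 401:
        return "skip_keep_existing"
    return "skip"


# In-memory fakes so the sketch runs without a network connection.
def fake_fetch(url: str, session: Session) -> Response:
    if session.cookie != "valid":
        # Like the service in the issue: a 401 whose body is the login form.
        return 401, "<form name='login'></form>"
    return 200, f"content of {url}"


def fake_login(session: Session) -> None:
    # A real login would submit the form at LOGIN_URL and store the cookie.
    session.cookie = "valid"


if __name__ == "__main__":
    job_session = Session()
    result = fetch_with_login(
        "https://example.invalid/sitemap.xml", job_session, fake_fetch, fake_login
    )
    print(result[0], decide_action(result[0]))
```

With this shape the login page no longer needs to be a seed: the first 401 on the sitemap triggers the login, and a 401 that persists after the retry is reported to the caller instead of being treated as a missing document.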