[
https://issues.apache.org/jira/browse/CONNECTORS-1598?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16808812#comment-16808812
]
roel goovaerts commented on CONNECTORS-1598:
--------------------------------------------
Hi Karl, thanks for the reply. This is indeed what I expected; I wanted to be sure
that there is no workaround via (for example) the configurations I tried.
There have already been talks about returning 3xx instead of 4xx.
Regards
> session based authentication cannot register 401
> ------------------------------------------------
>
> Key: CONNECTORS-1598
> URL: https://issues.apache.org/jira/browse/CONNECTORS-1598
> Project: ManifoldCF
> Issue Type: Bug
> Components: Web connector
> Affects Versions: ManifoldCF 2.12
> Reporter: roel goovaerts
> Priority: Major
>
> Description:
> Access to a specific domain is restricted in two ways: A) it is an intranet
> service and B) access is based on an employee/customer profile.
> For manifold to be able to be authenticated there is a specific
> '\{{domain}}/login' page with a form where manifold was configured to enter
> its username and password. A session-cookie is then set so manifold is
> authenticated to access all resources. If a request for a resource is not
> authenticated the service throws a 401. When the service returns a 401 the
> actual content of the resource includes the same form as is present in
> '\{{domain}}/login'.
> Problem:
> The only way we have been able to configure manifold to be authenticated was
> by specifying session-based credentials AND providing '\{{domain}}/login' as
> a seed in the job as well. The only other seed in the job is a sitemap.
> This is of course not ideal since it can easily happen that the seed for the
> sitemap gets processed first, which then throws a 401 on the sitemap and the
> job stops.
> Another possible scenario with this configuration is that the cookie expires
> and all other resources throw 401 and get deleted from the index
> (elasticsearch). There is also another job (different language, same domain),
> and we have seen it use the cookie from the previous job as well.
> Current session-based access credentials configuration:
> --url regular expression : https://\{{domain}}/
> --login pages:
> ---login url regexp : 'login'
> ---page type : form
> ---identification regexp is set to match the form-name
> ---form parameters are filled with the correct parameters
> This is verified to work, but as I understand it, this only works because the
> login page is part of the seeds, so it matches the url when the crawler comes
> across it. There is no configuration yet which redirects (for
> example) to this page when manifold receives a 401.
> My goal was then to remove the login-page from the seeds and configure the
> job so that each time a fetch returns a 401, manifold knows to go to the
> login page. In pseudo code:
> --If authenticated
> ---process
> --else
> ---redirect to login
> ---retry resource
>
> Based on the documentation here:
> https://manifoldcf.apache.org/release/release-2.12/en_US/end-user-documentation.html#webrepository
> I tried a few different configurations. The first thing to notice is in the
> comparison table, 'page based authentication' only mentions 4xx and 'session
> based authentication' only mentions 3xx.
> At this time my biggest question is: are these response codes bound to the
> difference in settings between page and session based? As far as I have been
> able to see, whenever manifold receives a 401 it logs "ignoring url
> \{{url}} because it failed to fetch (status=401, ..."
> Am I not able to work with session based authentication when the service
> returns 401s?
>
> Configuration attempts (all failed):
> - For all attempts the login page was removed from the seeds.
> - In general I kept the above configuration of page type 'form', in
> case I was able to redirect manifold to this page.
> - The documentation's list of the kinds of content that a web connection can
> recognize as a login page includes the option "A page that has specific
> content on it, as described by a regular expression". Following the
> description of this case, I tried the page type 'content' setting, with
> identification regexp set to '.*' for testing and an override url set to
> '\{{domain}}/login'. My hope was that in this test the match-all regexp
> would override every url it fetches with the login page.
> - Since the content of a 401 also includes the same form as the login page, I
> tried page type 'form' with the identification regexp and override form
> parameters supplied just like above, only with the "login url regexp" set to
> '.*'. My hope was that every page could then have the form recognized if
> it is returned as a 401.
> In both cases the only thing I could see is that manifold fetched the
> sitemap, received a 401 and logged "ignoring url \{{url}}
> because it failed to fetch (status=401, ..."
> Some questions:
> - Is there anything to be done when manifold receives a 401?
> - Is 4xx tied to page based authentication and 3xx tied to session based
> authentication?
> - Is there some other configuration/logic that I am missing, that I could try
> out?
> A minimal-effort solution would be a way to make manifold start
> at the login page and not do any crawling (most importantly no deleting) when
> it is unable to authenticate. Together with this, a way to remove the session
> cookie when the job is done would also be necessary, so that manifold does
> not keep using an old cookie that then expires.
> Side note: is there any way to make manifold not delete documents when it
> receives a 401?
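
For illustration, the login-then-retry flow from the pseudo code in the description can be sketched in Python as below. This is only a sketch of the requested behaviour, not how ManifoldCF works: `fetch_with_relogin`, `FakeSite`, and the `max_logins` limit are hypothetical names, and the in-memory site is an assumed stand-in for the real service.

```python
from typing import Callable, Tuple

Response = Tuple[int, str]  # (HTTP status code, body)


def fetch_with_relogin(
    url: str,
    fetch: Callable[[str], Response],
    login: Callable[[], None],
    max_logins: int = 1,
) -> Response:
    """Fetch url; on a 401, log in and retry instead of giving up."""
    status, body = fetch(url)
    logins = 0
    while status == 401 and logins < max_logins:
        login()                    # "redirect to login": obtain a session
        logins += 1
        status, body = fetch(url)  # "retry resource"
    return status, body


class FakeSite:
    """Hypothetical in-memory stand-in for the protected service."""

    def __init__(self) -> None:
        self.has_session = False

    def login(self) -> None:
        self.has_session = True

    def fetch(self, url: str) -> Response:
        if not self.has_session:
            # Unauthenticated requests get a 401 whose body is the login form.
            return 401, "<form name='login'>"
        return 200, "content of " + url


site = FakeSite()
result = fetch_with_relogin("/sitemap.xml", site.fetch, site.login)
print(result)  # (200, 'content of /sitemap.xml')
```

If the login attempt does not produce a valid session, the function returns the 401 response unchanged, so the caller can decide to stop the crawl rather than delete documents.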