[jira] [Commented] (CONNECTORS-1599) response code 401 still gets deleted with the setting "keep unreachable documents"
[ https://issues.apache.org/jira/browse/CONNECTORS-1599?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16820875#comment-16820875 ]

roel goovaerts commented on CONNECTORS-1599:

Hi [~kwri...@metacarta.com]

Is there a resource that documents how ManifoldCF responds to the different HTTP response codes?

Regards,
Roel

> response code 401 still gets deleted with the setting "keep unreachable documents"
> ----------------------------------------------------------------------------------
>
>                 Key: CONNECTORS-1599
>                 URL: https://issues.apache.org/jira/browse/CONNECTORS-1599
>             Project: ManifoldCF
>          Issue Type: Bug
>          Components: Web connector
>    Affects Versions: ManifoldCF 2.12
>            Reporter: roel goovaerts
>            Assignee: Karl Wright
>            Priority: Major
>             Fix For: ManifoldCF 2.13
>
>
> Even with the "Hop count mode" set to "keep unreachable documents, 'for now'
> || forever" ManifoldCF deletes documents for which it receives a 401 response
> code.
> The documentation does not specify such a distinction as described above. Is
> there some information/configuration that I'm missing? Is there a reason
> behind the guaranteed deletion of a 401?
> Ideally, for our use-case, we would want to remove all documents that return
> 404, but keep everything which is due to the server not responding or the
> crawler being unauthenticated.
> Is there a way to configure this in a more granular fashion?
> Regards,
> roel

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
[jira] [Commented] (CONNECTORS-1599) response code 401 still gets deleted with the setting "keep unreachable documents"
[ https://issues.apache.org/jira/browse/CONNECTORS-1599?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16815471#comment-16815471 ]

roel goovaerts commented on CONNECTORS-1599:

Fair enough, so the hop count mode applies purely to documents that are "unreachable" in terms of pathing via the intrinsiclink-table.

Is there a resource where ManifoldCF's responses to the different HTTP response codes are documented?
[jira] [Commented] (CONNECTORS-1599) response code 401 still gets deleted with the setting "keep unreachable documents"
[ https://issues.apache.org/jira/browse/CONNECTORS-1599?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16815389#comment-16815389 ]

Karl Wright commented on CONNECTORS-1599:
-----------------------------------------

[~goovaertsr], 401 means the document is not accessible. This has nothing to do with being "unreachable", because "unreachable" means there is no path to it from the seeds.
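
The distinction drawn in this comment, between a document with no link path from the seeds and a document that answers with 401, can be illustrated with a small sketch. This is not ManifoldCF code and does not describe what the Web connector does with either kind of document. The function names, the labels, and the sample link graph are all hypothetical and exist only to show that the two conditions are computed from different inputs: reachability from the link graph, accessibility from the HTTP response.

```python
from collections import deque


def reachable_from_seeds(seeds, links):
    """Return every document that can be reached by following links from the seeds.

    `links` is a hypothetical mapping of document -> list of documents it links to.
    """
    seen = set(seeds)
    queue = deque(seeds)
    while queue:
        doc = queue.popleft()
        for target in links.get(doc, ()):
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return seen


def describe(doc, status, reachable):
    """Label a document using the two separate notions from the thread.

    The labels are illustrative only; they are not ManifoldCF states.
    """
    if doc not in reachable:
        # No path from the seeds: this is what "unreachable" refers to.
        return "unreachable"
    if status == 401:
        # A path exists, but the server refused the request.
        return "not accessible"
    return "other"


# Hypothetical link graph: "orphan" has no incoming path from the seed.
links = {"seed": ["a", "b"], "a": ["c"], "orphan": ["d"]}
reachable = reachable_from_seeds(["seed"], links)

label_b = describe("b", 401, reachable)
label_orphan = describe("orphan", 200, reachable)
```

In this sketch `b` is linked from the seed but answers 401, so it is labelled "not accessible", while `orphan` is labelled "unreachable" regardless of its status code.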