[jira] [Created] (CONNECTORS-1599) response code 401 still gets deleted with the setting "keep unreachable documents"

2019-04-11 Thread roel goovaerts (JIRA)
roel goovaerts created CONNECTORS-1599:
--

 Summary: response code 401 still gets deleted with the setting 
"keep unreachable documents"
 Key: CONNECTORS-1599
 URL: https://issues.apache.org/jira/browse/CONNECTORS-1599
 Project: ManifoldCF
  Issue Type: Bug
  Components: Web connector
Affects Versions: ManifoldCF 2.12
Reporter: roel goovaerts


Even with the "Hop count mode" set to "keep unreachable documents, for now" or 
"keep unreachable documents, forever", ManifoldCF deletes documents for which it 
receives a 401 response code.

The documentation does not mention that 401 responses are treated differently 
from unreachable documents. Is there some information/configuration that I'm 
missing? Is there a reasoning behind the guaranteed deletion of a 401?

Ideally, for our use case, we would want to remove all documents that return 
404, but keep everything that fails because the server is not responding or the 
crawler is unauthenticated.

Is there a way to configure this in a more granular fashion?

Regards,
roel
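
For illustration only, the policy requested above could be sketched as below. 
This is not an existing ManifoldCF setting or API as far as this thread shows; 
the names Action and action_for_fetch are hypothetical, and treating every 
other status code as "keep" is an assumption made for the sketch.

```python
from enum import Enum


class Action(Enum):
    INDEX = "index"    # fetch succeeded, (re)index the document
    DELETE = "delete"  # remove the document from the index
    KEEP = "keep"      # leave the previously indexed copy untouched


def action_for_fetch(status_code):
    """Map a fetch result to an action.

    status_code is an int HTTP status, or None when the server did not
    respond at all (hypothetical convention for this sketch).
    """
    if status_code is None:
        return Action.KEEP      # server not responding: keep
    if 200 <= status_code < 300:
        return Action.INDEX
    if status_code == 404:
        return Action.DELETE    # document is gone: remove it
    # 401 (crawler unauthenticated) and anything else: keep (assumption)
    return Action.KEEP
```

With this sketch, a 404 maps to Action.DELETE while a 401 or a missing 
response maps to Action.KEEP.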



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (CONNECTORS-1449) Add support for respecting the NoCrawl flag in Sharepoint

2019-04-11 Thread Durai (JIRA)


[ 
https://issues.apache.org/jira/browse/CONNECTORS-1449?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16815165#comment-16815165
 ] 

Durai commented on CONNECTORS-1449:
---

Karl,

I need this feature implemented in our current project.

Basically, we need to exclude SharePoint libraries and lists that are marked 
with NoCrawl = True.

Also, when this flag is changed to False, they need to be included in the 
ingest paths again.

Version: SharePoint 2013.

If you can give me an idea of where to plug in the code that checks the NoCrawl 
flag and excludes them during seeding, I can implement it and submit the code 
here for the community.

Thanks

> Add support for respecting the NoCrawl flag in Sharepoint
> -
>
> Key: CONNECTORS-1449
> URL: https://issues.apache.org/jira/browse/CONNECTORS-1449
> Project: ManifoldCF
>  Issue Type: New Feature
>  Components: SharePoint connector
>Reporter: Markus Schuch
>Assignee: Markus Schuch
>Priority: Major
> Fix For: ManifoldCF next
>
>
> There is a flag {{NoCrawl}} in sharepoint that indicates whether an object 
> should be crawled or not:
> Lists
> https://msdn.microsoft.com/en-us/library/office/microsoft.sharepoint.splist.nocrawl.aspx
> Web
> https://msdn.microsoft.com/en-us/library/office/microsoft.sharepoint.spweb.nocrawl.aspx
> Field
> https://msdn.microsoft.com/en-us/library/office/microsoft.sharepoint.spfield.nocrawl.aspx
> Wouldn't it be nice to respect that flag?





[jira] [Resolved] (CONNECTORS-1599) response code 401 still gets deleted with the setting "keep unreachable documents"

2019-04-11 Thread Karl Wright (JIRA)


 [ 
https://issues.apache.org/jira/browse/CONNECTORS-1599?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Karl Wright resolved CONNECTORS-1599.
-
   Resolution: Not A Problem
 Assignee: Karl Wright
Fix Version/s: ManifoldCF 2.13

Working as designed.


> response code 401 still gets deleted with the setting "keep unreachable 
> documents"
> --
>
> Key: CONNECTORS-1599
> URL: https://issues.apache.org/jira/browse/CONNECTORS-1599
> Project: ManifoldCF
>  Issue Type: Bug
>  Components: Web connector
>Affects Versions: ManifoldCF 2.12
>Reporter: roel goovaerts
>Assignee: Karl Wright
>Priority: Major
> Fix For: ManifoldCF 2.13
>
>
> Even with the "Hop count mode" set to "keep unreachable documents, for now" or 
> "keep unreachable documents, forever", ManifoldCF deletes documents for which 
> it receives a 401 response code.
> The documentation does not mention that 401 responses are treated differently 
> from unreachable documents. Is there some information/configuration that I'm 
> missing? Is there a reasoning behind the guaranteed deletion of a 401?
> Ideally, for our use case, we would want to remove all documents that return 
> 404, but keep everything that fails because the server is not responding or 
> the crawler is unauthenticated.
> Is there a way to configure this in a more granular fashion?
> Regards,
> roel





[jira] [Commented] (CONNECTORS-1599) response code 401 still gets deleted with the setting "keep unreachable documents"

2019-04-11 Thread Karl Wright (JIRA)


[ 
https://issues.apache.org/jira/browse/CONNECTORS-1599?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16815389#comment-16815389
 ] 

Karl Wright commented on CONNECTORS-1599:
-

[~goovaertsr], 401 means the document is not accessible.  This has nothing to 
do with being "unreachable", because "unreachable" means there is no path to it 
from the seeds.
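
A minimal sketch of that distinction, assuming a plain in-memory link graph 
rather than ManifoldCF's own tables: "unreachable" is a property of the link 
graph (no path from any seed), independent of what the server answers when a 
URL is fetched. The function name, the max_hops parameter, and the example.com 
URLs are hypothetical.

```python
from collections import deque


def reachable_from_seeds(links, seeds, max_hops=None):
    """Breadth-first search: return {url: hop_count} for reachable URLs.

    links maps a URL to the list of URLs it links to.
    """
    hops = {seed: 0 for seed in seeds}
    queue = deque(hops)
    while queue:
        url = queue.popleft()
        if max_hops is not None and hops[url] >= max_hops:
            continue  # do not follow links beyond the hop limit
        for target in links.get(url, ()):
            if target not in hops:
                hops[target] = hops[url] + 1
                queue.append(target)
    return hops


links = {
    "http://example.com/": ["http://example.com/a", "http://example.com/secret"],
    "http://example.com/a": ["http://example.com/b"],
    "http://example.com/orphan": ["http://example.com/b"],
}
hops = reachable_from_seeds(links, ["http://example.com/"])
```

Here /secret is reachable (one hop from the seed) even if fetching it were to 
return 401, while /orphan is unreachable because nothing on a path from the 
seed links to it.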




[jira] [Commented] (CONNECTORS-1449) Add support for respecting the NoCrawl flag in Sharepoint

2019-04-11 Thread Karl Wright (JIRA)


[ 
https://issues.apache.org/jira/browse/CONNECTORS-1449?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16815388#comment-16815388
 ] 

Karl Wright commented on CONNECTORS-1449:
-

[~durai-jira], I do not have a SharePoint instance here.  Nor do I have any 
indication that any of the web services included with SharePoint provide the 
value for this flag.  If you can show me how the standard services return the 
flag value, I can easily implement this.  Please let me know.




[jira] [Commented] (CONNECTORS-1599) response code 401 still gets deleted with the setting "keep unreachable documents"

2019-04-11 Thread roel goovaerts (JIRA)


[ 
https://issues.apache.org/jira/browse/CONNECTORS-1599?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16815471#comment-16815471
 ] 

roel goovaerts commented on CONNECTORS-1599:


Fair enough, so the hop count mode applies purely to documents being 
"unreachable" in terms of pathing via the intrinsiclink table.
Is there a resource that documents how ManifoldCF responds to the different 
HTTP codes?



[jira] [Commented] (CONNECTORS-1449) Add support for respecting the NoCrawl flag in Sharepoint

2019-04-11 Thread Drai (JIRA)


[ 
https://issues.apache.org/jira/browse/CONNECTORS-1449?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16815964#comment-16815964
 ] 

Drai commented on CONNECTORS-1449:
--

I am checking whether the SOAP response can return the NoCrawl flag.
But is it a possible/feasible design to do this at the transformer level? This
would involve creating a new SP doc filter transformer that makes an SP
REST API call to get the NoCrawl flag and decides whether to pass the
document on to the Solr output (or another output connector) or to restrict it.
Thanks



-- 
*Durai Kalaiselvan*
Founder, Cumilisys LLC
Office: 408 940-5135 Mobile: 408 835 0309
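
A rough sketch of the filter idea described above, under stated assumptions: 
it is not ManifoldCF's transformation connector API, the class name 
NoCrawlFilter and the fetch_nocrawl callback are hypothetical, and the flag is 
cached per list so that one REST call serves all documents of that list.

```python
class NoCrawlFilter:
    """Decide per document whether it may be passed on to the output."""

    def __init__(self, fetch_nocrawl):
        # fetch_nocrawl(list_title) -> bool; hypothetical callback that
        # would wrap the SharePoint REST call returning the NoCrawl flag.
        self._fetch_nocrawl = fetch_nocrawl
        self._cache = {}

    def should_pass(self, list_title):
        """Return True when documents of this list may go to the output."""
        if list_title not in self._cache:
            self._cache[list_title] = bool(self._fetch_nocrawl(list_title))
        return not self._cache[list_title]

    def reset(self):
        """Forget cached flags, for example at the start of a new job run."""
        self._cache.clear()
```

One trade-off of this sketch: a cached flag is only refreshed after reset(), 
so a change of NoCrawl from True to False is picked up on the next run rather 
than immediately.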







[jira] [Commented] (CONNECTORS-1449) Add support for respecting the NoCrawl flag in Sharepoint

2019-04-11 Thread Drai (JIRA)


[ 
https://issues.apache.org/jira/browse/CONNECTORS-1449?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16815644#comment-16815644
 ] 

Drai commented on CONNECTORS-1449:
--

Karl,

I enabled the connector log, and I could not see this flag in the log either.

But the REST API returns this flag as True/False based on the SharePoint 
library's advanced settings.

REST API URL:

http://:/sites//_api/web/lists/getbytitle('LibraryTitle')

 

API Response contains:

true

Not sure why the SOAP response does not contain this flag.

Is it possible to call the REST API to get this flag and skip crawling those 
library/list items?

Thanks
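
As a sketch of the REST lookup described above (not code from the SharePoint 
connector): the helper names, the $select=NoCrawl query option, and the two 
JSON response shapes handled here are assumptions, and the host name in the 
usage note is made up. Authentication and the actual HTTP request are left out.

```python
import json
from urllib.parse import quote


def list_nocrawl_url(site_url, list_title):
    """Build the getbytitle URL for a list, asking only for NoCrawl."""
    # A single quote inside an OData string literal is written twice.
    escaped = list_title.replace("'", "''")
    base = site_url.rstrip("/")
    return f"{base}/_api/web/lists/getbytitle('{quote(escaped)}')?$select=NoCrawl"


def parse_nocrawl(body):
    """Extract the NoCrawl flag from a JSON response body.

    Assumes either {"d": {"NoCrawl": true}} or {"NoCrawl": true};
    a missing flag is treated as False (assumption).
    """
    payload = json.loads(body)
    data = payload.get("d", payload)
    return bool(data.get("NoCrawl", False))
```

For example, list_nocrawl_url("http://sp.example.com/sites/team", "Shared 
Documents") produces a URL ending in 
getbytitle('Shared%20Documents')?$select=NoCrawl.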



[jira] [Commented] (CONNECTORS-1449) Add support for respecting the NoCrawl flag in Sharepoint

2019-04-11 Thread Karl Wright (JIRA)


[ 
https://issues.apache.org/jira/browse/CONNECTORS-1449?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16815765#comment-16815765
 ] 

Karl Wright commented on CONNECTORS-1449:
-

[~durai-jira] Rewriting the SharePoint connector to use the REST API is well 
beyond the scope of a bug fix of this kind.

If you want to attempt this work and contribute it to ManifoldCF, I can 
certainly *help*, but I cannot do something of this scale myself, while working 
full time on other things, without even a SharePoint instance to work with.


