[jira] [Commented] (CONNECTORS-1695) Sitemap xml not detected in version 2.17 webconnector

2022-01-24 Thread Karl Wright (Jira)


[ 
https://issues.apache.org/jira/browse/CONNECTORS-1695?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17481420#comment-17481420
 ] 

Karl Wright commented on CONNECTORS-1695:
-

The problem is that the web connector has no parser for sitemap XML, and a 
text/xml content type does not indicate what kind of XML a document is, so 
the connector cannot treat it specially.  The web crawler accepts files of 
this type but assumes they are XHTML, which fails because sitemaps are not.
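
To illustrate why the content type alone is not enough, here is a minimal 
Python sketch (not ManifoldCF code; the function name classify_xml is 
hypothetical) that tells a sitemap apart from other XML by looking at the root 
element and its namespace instead of the mime type:

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def classify_xml(data: bytes) -> str:
    """Return 'sitemapindex', 'urlset', or 'other' based on the root element.

    Hypothetical helper for illustration only; the web connector has no
    such function.
    """
    try:
        root = ET.fromstring(data)
    except ET.ParseError:
        # Not well-formed XML at all.
        return "other"
    # ElementTree renders a namespaced tag as '{namespace}localname'.
    if root.tag == f"{{{SITEMAP_NS}}}sitemapindex":
        return "sitemapindex"
    if root.tag == f"{{{SITEMAP_NS}}}urlset":
        return "urlset"
    return "other"
```

An XHTML page served as text/xml would come back as 'other' here, which is the 
distinction the mime type cannot make.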

As for processing the whole file, well certainly.  That is what should happen.  
The file should be read in and parsed and the documents within queued for 
further processing.
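
A rough sketch of that read, parse, and queue flow, again in Python rather 
than connector code.  The names extract_locations and queue_sitemap are 
hypothetical, and a plain deque stands in for the crawler's document queue:

```python
import xml.etree.ElementTree as ET
from collections import deque
from typing import Deque, List

# Prefix mapping for the sitemaps.org namespace.
NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def extract_locations(data: bytes) -> List[str]:
    """Return every <loc> value found one level below the root's children.

    Covers both layouts: <sitemapindex>/<sitemap>/<loc> and
    <urlset>/<url>/<loc>.
    """
    root = ET.fromstring(data)
    return [
        loc.text.strip()
        for loc in root.findall("./*/sm:loc", NS)
        if loc.text and loc.text.strip()
    ]


def queue_sitemap(data: bytes, queue: Deque[str]) -> int:
    """Parse a sitemap and append each listed URL to the queue.

    The deque is an assumption standing in for whatever queue the
    crawler actually uses.  Returns the number of URLs queued.
    """
    urls = extract_locations(data)
    queue.extend(urls)
    return len(urls)
```

Entries queued from a sitemap index point at further sitemaps, so they would 
need the same treatment when they are fetched in turn.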



> Sitemap xml not detected in version 2.17 webconnector
> -
>
> Key: CONNECTORS-1695
> URL: https://issues.apache.org/jira/browse/CONNECTORS-1695
> Project: ManifoldCF
>  Issue Type: Bug
>  Components: Web connector
>Affects Versions: ManifoldCF 2.17
>Reporter: DK
>Priority: Major
>
> When I try to index a sitemap XML, the web connector indexes the whole XML 
> file into Solr.
> Please fix in version 2.17.
> If any special configuration is needed, please describe it here and add it 
> to the documentation to make it clear.
>  
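> Sitemap.xml:
> <sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
> <sitemap>
> <loc>https:///sitemap_1.xml</loc>
> <lastmod>2022-01-21T16:04:45Z</lastmod>
> </sitemap>
> </sitemapindex>
>  
> sitemap_1.xml:
> <urlset>
> <url>
> <loc>https://</loc>
> <lastmod>2018-10-31T11:25:27Z</lastmod>
> </url>
> </urlset>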



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Commented] (CONNECTORS-1695) Sitemap xml not detected in version 2.17 webconnector

2022-01-24 Thread DK (Jira)


[ 
https://issues.apache.org/jira/browse/CONNECTORS-1695?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17481393#comment-17481393
 ] 

DK commented on CONNECTORS-1695:


The server returns a valid sitemap XML with the mime type text/xml. As per 
another defect, that type is in 'interestingMimeType' and should be 
supported. I also exclude it in the Solr output connector. However, I just 
get an error in the job history indicating that text/xml is restricted, and 
the web connector still tries to process sitemap.xml as one full XML file.

Appreciate any pointers or help fixing it.



[jira] [Updated] (CONNECTORS-1695) Sitemap xml not detected in version 2.17 webconnector

2022-01-24 Thread DK (Jira)


 [ 
https://issues.apache.org/jira/browse/CONNECTORS-1695?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

DK updated CONNECTORS-1695:
---
Summary: Sitemap xml not detected in version 2.17 webconnector  (was: 
Sitemap xml not detected in version 2.17)



[jira] [Created] (CONNECTORS-1695) Sitemap xml not detected in version 2.17

2022-01-24 Thread DK (Jira)
DK created CONNECTORS-1695:
--

 Summary: Sitemap xml not detected in version 2.17
 Key: CONNECTORS-1695
 URL: https://issues.apache.org/jira/browse/CONNECTORS-1695
 Project: ManifoldCF
  Issue Type: Bug
  Components: Web connector
Affects Versions: ManifoldCF 2.17
Reporter: DK


When I try to index a sitemap XML, the web connector indexes the whole XML 
file into Solr.

Please fix in version 2.17.

If any special configuration is needed, please describe it here and add it 
to the documentation to make it clear.

 

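Sitemap.xml:

<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
<sitemap>
<loc>https:///sitemap_1.xml</loc>
<lastmod>2022-01-21T16:04:45Z</lastmod>
</sitemap>
</sitemapindex>

sitemap_1.xml:

<urlset>
<url>
<loc>https://</loc>
<lastmod>2018-10-31T11:25:27Z</lastmod>
</url>
</urlset>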





[jira] [Commented] (CONNECTORS-1665) WebConnector: Add activity records for excluded URLs

2022-01-24 Thread Julien Massiera (Jira)


[ 
https://issues.apache.org/jira/browse/CONNECTORS-1665?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17480957#comment-17480957
 ] 

Julien Massiera commented on CONNECTORS-1665:
-

r1897405

> WebConnector: Add activity records for excluded URLs 
> -
>
> Key: CONNECTORS-1665
> URL: https://issues.apache.org/jira/browse/CONNECTORS-1665
> Project: ManifoldCF
>  Issue Type: Improvement
>  Components: Web connector
>Affects Versions: ManifoldCF 2.18
>Reporter: Julien Massiera
>Assignee: Julien Massiera
>Priority: Trivial
> Fix For: ManifoldCF 2.19
>
> Attachments: patch-CONNECTORS-1665
>
>
> It would be interesting to add activity records in the WebConnector to keep 
> track of excluded URLs that match an exclude filter


