[jira] [Commented] (CONNECTORS-1695) Sitemap xml not detected in version 2.17 webconnector
[ https://issues.apache.org/jira/browse/CONNECTORS-1695?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17481420#comment-17481420 ]

Karl Wright commented on CONNECTORS-1695:
-----------------------------------------

The problem is that the web connector has no parser for sitemap XML, and a text/xml MIME type alone does not indicate what kind of XML a document is, so nothing special can be done with it. The web crawler accepts files of this kind but assumes they are XHTML, which doesn't work because these aren't.

As for processing the whole file: certainly, that is what should happen. The file should be read in and parsed, and the documents listed in it queued for further processing.

> Sitemap xml not detected in version 2.17 webconnector
> -----------------------------------------------------
>
>                 Key: CONNECTORS-1695
>                 URL: https://issues.apache.org/jira/browse/CONNECTORS-1695
>             Project: ManifoldCF
>          Issue Type: Bug
>          Components: Web connector
>    Affects Versions: ManifoldCF 2.17
>            Reporter: DK
>            Priority: Major
>
> Trying to index sitemap xml and web connector index the whole xml into solr.
> Please fix in version 2.17.
> If it is any special config that needs to be taken care, please add here and
> add in documentation to make it clear.
>
> Sitemap.xml:
> http://www.sitemaps.org/schemas/sitemap/0.9
> https:///sitemap_1.xml
> 2022-01-21T16:04:45Z
>
> sitemap_1.xml:
> https://
> 2018-10-31T11:25:27Z

--
This message was sent by Atlassian Jira
(v8.20.1#820001)
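The approach described in the comment above can be sketched roughly as follows, assuming the sitemap follows the sitemaps.org 0.9 protocol: decide what kind of XML the document is from its root element rather than from the text/xml MIME type, then collect the <loc> entries that would be queued. The class SitemapSketch and its methods are hypothetical illustrations built on the JDK's DOM parser; they are not part of the ManifoldCF web connector and say nothing about how a real fix would be wired into it.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

/** Hypothetical helper for illustration; not a ManifoldCF class. */
final class SitemapSketch {
  /** Namespace defined by the sitemaps.org 0.9 protocol. */
  static final String SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9";

  /** Kind of XML document, decided from the root element, not the MIME type. */
  enum Kind { SITEMAP_INDEX, URL_SET, OTHER }

  /** Parses the fetched body into a namespace-aware DOM tree. */
  static Document parse(String xml) throws Exception {
    DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
    factory.setNamespaceAware(true);
    // The input comes from a crawled site, so refuse DOCTYPE declarations.
    factory.setFeature("http://apache.org/xml/features/disallow-doctype-decl", true);
    DocumentBuilder builder = factory.newDocumentBuilder();
    return builder.parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
  }

  /** Distinguishes a sitemap index, a URL set, and any other XML (for example XHTML). */
  static Kind classify(Document doc) {
    Element root = doc.getDocumentElement();
    if (!SITEMAP_NS.equals(root.getNamespaceURI())) {
      return Kind.OTHER;
    }
    switch (root.getLocalName()) {
      case "sitemapindex": return Kind.SITEMAP_INDEX;
      case "urlset": return Kind.URL_SET;
      default: return Kind.OTHER;
    }
  }

  /** Returns the text of every <loc> element, in document order. */
  static List<String> extractLocations(Document doc) {
    List<String> urls = new ArrayList<>();
    NodeList locs = doc.getElementsByTagNameNS(SITEMAP_NS, "loc");
    for (int i = 0; i < locs.getLength(); i++) {
      String url = locs.item(i).getTextContent().trim();
      if (!url.isEmpty()) {
        urls.add(url);
      }
    }
    return urls;
  }
}
```

Under these assumptions, a crawler would call classify() on each text/xml response. For a SITEMAP_INDEX, each extracted location is another sitemap to fetch and classify in turn; for a URL_SET, the locations are the documents to queue; anything classified as OTHER falls through to the existing handling.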
[jira] [Commented] (CONNECTORS-1695) Sitemap xml not detected in version 2.17 webconnector
[ https://issues.apache.org/jira/browse/CONNECTORS-1695?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17481393#comment-17481393 ]

DK commented on CONNECTORS-1695:
--------------------------------

The server returns valid sitemap XML with the MIME type text/xml. According to another defect, text/xml is in 'interestingMimeType' and should be supported. I also exclude it in the Solr output connector. However, I just get an error in the job history indicating that text/xml is restricted, and the web connector still tries to process sitemap.xml as one full XML file.

I would appreciate any pointers or help fixing it.
[jira] [Updated] (CONNECTORS-1695) Sitemap xml not detected in version 2.17 webconnector
[ https://issues.apache.org/jira/browse/CONNECTORS-1695?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

DK updated CONNECTORS-1695:
---------------------------
    Summary: Sitemap xml not detected in version 2.17 webconnector  (was: Sitemap xml not detected in version 2.17)
[jira] [Created] (CONNECTORS-1695) Sitemap xml not detected in version 2.17
DK created CONNECTORS-1695:
---------------------------

             Summary: Sitemap xml not detected in version 2.17
                 Key: CONNECTORS-1695
                 URL: https://issues.apache.org/jira/browse/CONNECTORS-1695
             Project: ManifoldCF
          Issue Type: Bug
          Components: Web connector
    Affects Versions: ManifoldCF 2.17
            Reporter: DK


When I try to index a sitemap XML file, the web connector indexes the whole XML into Solr.
Please fix this in version 2.17.
If any special configuration is needed, please describe it here and add it to the documentation to make it clear.

Sitemap.xml:
http://www.sitemaps.org/schemas/sitemap/0.9
https:///sitemap_1.xml
2022-01-21T16:04:45Z

sitemap_1.xml:
https://
2018-10-31T11:25:27Z
[jira] [Commented] (CONNECTORS-1665) WebConnector: Add activity records for excluded URLs
[ https://issues.apache.org/jira/browse/CONNECTORS-1665?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17480957#comment-17480957 ]

Julien Massiera commented on CONNECTORS-1665:
---------------------------------------------

r1897405

> WebConnector: Add activity records for excluded URLs
> ----------------------------------------------------
>
>                 Key: CONNECTORS-1665
>                 URL: https://issues.apache.org/jira/browse/CONNECTORS-1665
>             Project: ManifoldCF
>          Issue Type: Improvement
>          Components: Web connector
>    Affects Versions: ManifoldCF 2.18
>            Reporter: Julien Massiera
>            Assignee: Julien Massiera
>            Priority: Trivial
>             Fix For: ManifoldCF 2.19
>
>         Attachments: patch-CONNECTORS-1665
>
>
> It would be interesting to add activity records in the WebConnector to keep
> track of excluded URLs that match an exclude filter