[jira] [Commented] (CONNECTORS-1695) Sitemap xml not detected in version 2.17 webconnector

Karl Wright (Jira) Mon, 24 Jan 2022 13:48:04 -0800


    [ 
https://issues.apache.org/jira/browse/CONNECTORS-1695?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17481420#comment-17481420
 ]


Karl Wright commented on CONNECTORS-1695:
-----------------------------------------

The problem is that the web connector has no parser for sitemap XML, and a 
content type of text/xml is not a sufficient indication of what kind of XML a 
document is to do anything special with it.  The web crawler accepts files of 
this kind but assumes they are XHTML, which doesn't work because these aren't.
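
To illustrate why the content type alone is not enough, here is a minimal 
sketch of telling a sitemap apart from XHTML by looking at the root element 
instead.  It is not ManifoldCF code (the connector is written in Java); the 
names sitemap_kind, local_name and SITEMAP_ROOTS are hypothetical, and Python 
is used only to keep the example short.

```python
import io
import xml.etree.ElementTree as ET
from typing import Optional

# Root element names defined by the sitemaps.org protocol.
SITEMAP_ROOTS = {"sitemapindex", "urlset"}


def local_name(tag: str) -> str:
    """Strip a '{namespace}' prefix from an ElementTree tag name."""
    return tag.rsplit("}", 1)[-1]


def sitemap_kind(body: bytes) -> Optional[str]:
    """Return 'sitemapindex' or 'urlset' if the root element says so, else None.

    Only the first 'start' event (the root element) is inspected, so the
    check does not depend on the Content-Type header at all.
    """
    try:
        for _event, elem in ET.iterparse(io.BytesIO(body), events=("start",)):
            name = local_name(elem.tag)
            return name if name in SITEMAP_ROOTS else None
    except ET.ParseError:
        # Not well-formed XML, so certainly not a sitemap.
        return None
    return None
```

A document served as text/xml whose root is html would return None here and 
could still go down the existing XHTML path.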

As for processing the whole file: certainly, that is what should happen.  
The file should be read in and parsed, and the documents it lists queued for 
further processing.
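
The read, parse and queue flow described above could look roughly like the 
sketch below.  This is an assumption-laden illustration, not the web 
connector's implementation: extract_locs, crawl_sitemap and the fetch callable 
(url -> bytes) are hypothetical names, and Python stands in for the 
connector's Java.

```python
import xml.etree.ElementTree as ET
from collections import deque


def local_name(tag):
    """Strip a '{namespace}' prefix from an ElementTree tag name."""
    return tag.rsplit("}", 1)[-1]


def extract_locs(body):
    """Parse one sitemap file and return (root kind, list of <loc> values)."""
    root = ET.fromstring(body)
    kind = local_name(root.tag)
    if kind not in ("sitemapindex", "urlset"):
        raise ValueError("not a sitemap document: root is <%s>" % kind)
    locs = []
    for entry in root:            # <sitemap> or <url> entries
        for child in entry:       # <loc>, <lastmod>, ...
            if local_name(child.tag) == "loc" and child.text:
                locs.append(child.text.strip())
    return kind, locs


def crawl_sitemap(start_url, fetch):
    """Follow sitemap indexes and collect the document URLs they lead to.

    fetch is a hypothetical callable taking a URL and returning bytes.
    """
    pending = deque([start_url])
    seen = {start_url}
    documents = []
    while pending:
        kind, locs = extract_locs(fetch(pending.popleft()))
        for loc in locs:
            if kind == "sitemapindex":
                # An index points at further sitemap files: queue them once.
                if loc not in seen:
                    seen.add(loc)
                    pending.append(loc)
            else:
                # A urlset lists documents: hand them on for processing.
                documents.append(loc)
    return documents
```

In a real crawler the collected URLs would be added to the job queue rather 
than returned as a list; the list just keeps the sketch self-contained.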



> Sitemap xml not detected in version 2.17 webconnector
> -----------------------------------------------------
>
>                 Key: CONNECTORS-1695
>                 URL: https://issues.apache.org/jira/browse/CONNECTORS-1695
>             Project: ManifoldCF
>          Issue Type: Bug
>          Components: Web connector
>    Affects Versions: ManifoldCF 2.17
>            Reporter: DK
>            Priority: Major
>
> Trying to index a sitemap XML file, but the web connector indexes the whole 
> XML into Solr.
> Please fix in version 2.17.
> If any special configuration is needed, please describe it here and add it 
> to the documentation to make it clear.
>  
> Sitemap.xml:
> <sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
> <sitemap>
> <loc>https://<url>/sitemap_1.xml</loc>
> <lastmod>2022-01-21T16:04:45Z</lastmod>
> </sitemap>
> </sitemapindex>
>  
> sitemap_1.xml:
> <urlset>
> <url>
> <loc>https://<docurl></loc>
> <lastmod>2018-10-31T11:25:27Z</lastmod>
> </url>
> </urlset>



--
This message was sent by Atlassian Jira
(v8.20.1#820001)
