Hi Markus,
I don’t believe this is a valid sitemapindex. Each <sitemap> element should
contain exactly one <loc>, but the one in your file contains two.
See also https://www.sitemaps.org/protocol.html#index and
https://www.sitemaps.org/schemas/sitemap/0.9/siteindex.xsd.
I agree that this is not ideal error behaviour, but I guess the code was
written on the assumption that the document is valid and conformant.
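To make the point concrete, here is a minimal Python sketch (not the Nutch or
crawler-commons code, which I have not traced for this) that counts the <loc>
children of each <sitemap> entry. The helper names and the FIXED document are
my own assumptions for illustration. It also shows that simply gathering all
the text inside one <sitemap> yields the concatenated URL from your log, which
may be roughly what happens in the parser.

```python
import xml.etree.ElementTree as ET

# The index from the original message, reproduced as posted.
REPORTED = """<?xml version="1.0" encoding="UTF-8"?>
<ns2:sitemapindex xmlns:ns2="http://www.sitemaps.org/schemas/sitemap/0.9">
<sitemap>
<loc>https://www.saxion.nl/opleidingen-sitemap.xml</loc>
<loc>https://www.saxion.nl/content-sitemap.xml</loc>
</sitemap>
</ns2:sitemapindex>"""

# Assumed corrected form: one <sitemap> per <loc>.
FIXED = """<?xml version="1.0" encoding="UTF-8"?>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
<sitemap><loc>https://www.saxion.nl/opleidingen-sitemap.xml</loc></sitemap>
<sitemap><loc>https://www.saxion.nl/content-sitemap.xml</loc></sitemap>
</sitemapindex>"""


def local_name(tag):
    """Strip an ElementTree '{namespace}' prefix, if any."""
    return tag.rsplit("}", 1)[-1]


def sitemap_entries(xml_text):
    """Return every <sitemap> element, ignoring namespaces."""
    root = ET.fromstring(xml_text.encode("utf-8"))
    return [el for el in root.iter() if local_name(el.tag) == "sitemap"]


def locs_per_entry(xml_text):
    """For each <sitemap>, list the text of its <loc> children."""
    return [
        [(c.text or "").strip() for c in entry if local_name(c.tag) == "loc"]
        for entry in sitemap_entries(xml_text)
    ]


def flattened_text(xml_text):
    """For each <sitemap>, join all nested text, as a naive reader might."""
    return [
        "".join(t.strip() for t in entry.itertext())
        for entry in sitemap_entries(xml_text)
    ]


if __name__ == "__main__":
    for name, doc in (("reported", REPORTED), ("fixed", FIXED)):
        for locs in locs_per_entry(doc):
            status = "ok" if len(locs) == 1 else "INVALID"
            print(name, status, len(locs), "loc element(s)")
    print(flattened_text(REPORTED)[0])
```

Run against the reported document, the single <sitemap> entry has two <loc>
children and is flagged; the fixed document has two entries with one <loc>
each.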
Yossi.
> -----Original Message-----
> From: Markus Jelsma <[email protected]>
> Sent: 25 May 2018 23:45
> To: User <[email protected]>
> Subject: Sitemap URL's concatenated, causing status 14 not found
>
> Hello,
>
> We have a sitemap.xml pointing to further sitemaps. The XML seems fine, but
> Nutch thinks those two sitemap URL's are actually one consisting of both
> concatenated.
>
> Here is https://www.saxion.nl/sitemap.xml
>
> <?xml version="1.0" encoding="UTF-8"?>
> <ns2:sitemapindex
> xmlns:ns2="http://www.sitemaps.org/schemas/sitemap/0.9">
> <sitemap>
> <loc>https://www.saxion.nl/opleidingen-sitemap.xml</loc>
> <loc>https://www.saxion.nl/content-sitemap.xml</loc>
> </sitemap>
> </ns2:sitemapindex>
>
> This seems fine, but Nutch attempts, and obviously fails to load:
>
> 2018-05-25 16:27:50,515 ERROR [Thread-30]
> org.apache.nutch.util.SitemapProcessor: Error while fetching the sitemap.
> Status code: 14 for
> https://www.saxion.nl/opleidingen-sitemap.xmlhttps://www.saxion.nl/content-sitemap.xml
>
> What is going on here? Why does Nutch, or CC's sitemap util behave like this?
>
> Thanks,
> Markus