> I agree that the this is not the ideal error behaviour, but I guess the code
> was written from the assumption that the document is valid and conformant.

Over time the crawler-commons sitemap parser has been extended
to get as much as possible from non-conforming sitemaps as well.
Of course, it's hard to foresee and handle all possible mistakes...
The equivalent syntax error for sitemaps (missing closing/next <url>
in <urlset>) is handled.

@Markus: Please open an issue for crawler-commons
  https://github.com/crawler-commons/crawler-commons/issues/

Thanks,
Sebastian

On 05/26/2018 02:57 AM, Yossi Tamari wrote:
> Hi Markus,
>
> I don’t believe this is a valid sitemapindex. Each <sitemap> should include
> exactly one <loc>.
> See also https://www.sitemaps.org/protocol.html#index and
> https://www.sitemaps.org/schemas/sitemap/0.9/siteindex.xsd.
> I agree that the this is not the ideal error behaviour, but I guess the code
> was written from the assumption that the document is valid and conformant.
>
>     Yossi.
>
>> -----Original Message-----
>> From: Markus Jelsma <[email protected]>
>> Sent: 25 May 2018 23:45
>> To: User <[email protected]>
>> Subject: Sitemap URL's concatenated, causing status 14 not found
>>
>> Hello,
>>
>> We have a sitemap.xml pointing to further sitemaps. The XML seems fine, but
>> Nutch things those two sitemap URL's are actually one consisting of both
>> concatenated.
>>
>> Here is https://www.saxion.nl/sitemap.xml
>>
>> <?xml version="1.0" encoding="UTF-8"?>
>> <ns2:sitemapindex
>> xmlns:ns2="http://www.sitemaps.org/schemas/sitemap/0.9">
>> <sitemap>
>> <loc>https://www.saxion.nl/opleidingen-sitemap.xml</loc>
>> <loc>https://www.saxion.nl/content-sitemap.xml</loc>
>> </sitemap>
>> </ns2:sitemapindex>
>>
>> This seems fine, but Nutch attempts, and obviously fails to load:
>>
>> 2018-05-25 16:27:50,515 ERROR [Thread-30]
>> org.apache.nutch.util.SitemapProcessor: Error while fetching the sitemap.
>> Status code: 14 for
>> https://www.saxion.nl/opleidingen-sitemap.xmlhttps://www.saxion.nl/content-sitemap.xml
>>
>> What is going on here? Why does Nutch, or CC's sitemap util behave like this?
>>
>> Thanks,
>> Markus
>
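
For illustration only, here is a minimal Python sketch of one plausible way the concatenation in the log above can arise. It is not the crawler-commons code: the handler classes and the parse_index helper are hypothetical names invented for this example. The first handler assumes exactly one <loc> per <sitemap> and only emits a URL at </sitemap>, so the text of both <loc> elements ends up in a single string. The second emits a URL at every </loc>, which is one possible lenient treatment of this kind of non-conforming index.

```python
import xml.sax

# The sitemap index from the report, with two <loc> inside one <sitemap>.
SITEMAP_INDEX = b"""<?xml version="1.0" encoding="UTF-8"?>
<ns2:sitemapindex xmlns:ns2="http://www.sitemaps.org/schemas/sitemap/0.9">
<sitemap>
<loc>https://www.saxion.nl/opleidingen-sitemap.xml</loc>
<loc>https://www.saxion.nl/content-sitemap.xml</loc>
</sitemap>
</ns2:sitemapindex>
"""


class NaiveIndexHandler(xml.sax.ContentHandler):
    """Hypothetical handler: assumes one <loc> per <sitemap>."""

    def __init__(self):
        super().__init__()
        self.urls = []
        self._buf = []
        self._in_loc = False

    def startElement(self, name, attrs):
        if name == "loc":
            self._in_loc = True

    def characters(self, content):
        # Only text inside <loc> is collected.
        if self._in_loc:
            self._buf.append(content)

    def endElement(self, name):
        if name == "loc":
            self._in_loc = False
        elif name == "sitemap":
            # The buffer is flushed only here, so the text of several
            # <loc> elements is joined into one "URL".
            self.urls.append("".join(self._buf).strip())
            self._buf = []


class LenientIndexHandler(NaiveIndexHandler):
    """Hypothetical lenient variant: emits one URL per </loc>."""

    def endElement(self, name):
        if name == "loc":
            self._in_loc = False
            url = "".join(self._buf).strip()
            if url:
                self.urls.append(url)
            self._buf = []


def parse_index(data, handler):
    """Parse sitemap index bytes with the given handler, return its URLs."""
    xml.sax.parseString(data, handler)
    return handler.urls


if __name__ == "__main__":
    print(parse_index(SITEMAP_INDEX, NaiveIndexHandler()))
    print(parse_index(SITEMAP_INDEX, LenientIndexHandler()))
```

With the naive handler the result is a single entry made of both URLs run together, matching the URL in the error message. The lenient handler returns the two sitemap URLs separately.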

