Ah, of course, I missed that!
Thanks,
Markus

-----Original message-----
> From: Yossi Tamari <yossi.tam...@pipl.com>
> Sent: Saturday 26th May 2018 2:57
> To: user@nutch.apache.org
> Subject: RE: Sitemap URL's concatenated, causing status 14 not found
>
> Hi Markus,
>
> I don’t believe this is a valid sitemapindex. Each <sitemap> should include
> exactly one <loc>.
> See also https://www.sitemaps.org/protocol.html#index and
> https://www.sitemaps.org/schemas/sitemap/0.9/siteindex.xsd.
> I agree that this is not the ideal error behaviour, but I guess the code
> was written on the assumption that the document is valid and conformant.
>
> Yossi.
>
> > -----Original Message-----
> > From: Markus Jelsma <markus.jel...@openindex.io>
> > Sent: 25 May 2018 23:45
> > To: User <user@nutch.apache.org>
> > Subject: Sitemap URL's concatenated, causing status 14 not found
> >
> > Hello,
> >
> > We have a sitemap.xml pointing to further sitemaps. The XML seems fine, but
> > Nutch thinks those two sitemap URLs are actually one, consisting of both
> > concatenated.
> >
> > Here is https://www.saxion.nl/sitemap.xml
> >
> > <?xml version="1.0" encoding="UTF-8"?>
> > <ns2:sitemapindex
> >     xmlns:ns2="http://www.sitemaps.org/schemas/sitemap/0.9">
> >   <sitemap>
> >     <loc>https://www.saxion.nl/opleidingen-sitemap.xml</loc>
> >     <loc>https://www.saxion.nl/content-sitemap.xml</loc>
> >   </sitemap>
> > </ns2:sitemapindex>
> >
> > This seems fine, but Nutch attempts, and obviously fails, to load:
> >
> > 2018-05-25 16:27:50,515 ERROR [Thread-30]
> > org.apache.nutch.util.SitemapProcessor: Error while fetching the sitemap.
> > Status code: 14 for
> > https://www.saxion.nl/opleidingen-sitemap.xmlhttps://www.saxion.nl/content-sitemap.xml
> >
> > What is going on here? Why does Nutch, or CC's sitemap util, behave like
> > this?
> >
> > Thanks,
> > Markus
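
For reference, a minimal sketch of the rule Yossi describes (each <sitemap> entry in a sitemap index should carry exactly one <loc>), written in Python with only the standard library. The helper names local_name and check_sitemap_index are made up for this illustration and are not part of Nutch or crawler-commons. Elements are matched by local name only, which is a simplification chosen here because the child elements in the sample are unprefixed.

```python
import xml.etree.ElementTree as ET

# The sitemap index quoted in the thread: one <sitemap> with two <loc> children.
SITEMAP_INDEX = """<?xml version="1.0" encoding="UTF-8"?>
<ns2:sitemapindex xmlns:ns2="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap>
    <loc>https://www.saxion.nl/opleidingen-sitemap.xml</loc>
    <loc>https://www.saxion.nl/content-sitemap.xml</loc>
  </sitemap>
</ns2:sitemapindex>
"""


def local_name(tag):
    """Strip an ElementTree '{namespace}' prefix, if present (hypothetical helper)."""
    return tag.rsplit("}", 1)[-1]


def check_sitemap_index(xml_text):
    """Return the <loc> lists of every <sitemap> entry that does not have
    exactly one <loc> child (hypothetical helper, not a Nutch API)."""
    root = ET.fromstring(xml_text.encode("utf-8"))
    problems = []
    for entry in root:
        if local_name(entry.tag) != "sitemap":
            continue
        locs = [(child.text or "").strip()
                for child in entry
                if local_name(child.tag) == "loc"]
        if len(locs) != 1:
            problems.append(locs)
    return problems


if __name__ == "__main__":
    for locs in check_sitemap_index(SITEMAP_INDEX):
        print("sitemap entry with %d <loc> elements: %s" % (len(locs), locs))
```

Splitting the entry into two <sitemap> elements, each with a single <loc>, makes the check come back empty.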