Ah, of course, I missed that!

Thanks,
Markus
 
-----Original message-----
> From:Yossi Tamari <yossi.tam...@pipl.com>
> Sent: Saturday 26th May 2018 2:57
> To: user@nutch.apache.org
> Subject: RE: Sitemap URL's concatenated, causing status 14 not found
> 
> Hi Markus,
> 
> I don’t believe this is a valid sitemapindex. Each <sitemap> should include 
> exactly one <loc>.
> See also https://www.sitemaps.org/protocol.html#index and 
> https://www.sitemaps.org/schemas/sitemap/0.9/siteindex.xsd.
> I agree that this is not the ideal error behaviour, but I guess the code 
> was written on the assumption that the document is valid and conformant.
> 
>       Yossi.
> 
> > -----Original Message-----
> > From: Markus Jelsma <markus.jel...@openindex.io>
> > Sent: 25 May 2018 23:45
> > To: User <user@nutch.apache.org>
> > Subject: Sitemap URL's concatenated, causing status 14 not found
> > 
> > Hello,
> > 
> > We have a sitemap.xml pointing to further sitemaps. The XML seems fine, but
> > Nutch thinks those two sitemap URLs are actually one, consisting of both
> > concatenated.
> > 
> > Here is https://www.saxion.nl/sitemap.xml
> > 
> > <?xml version="1.0" encoding="UTF-8"?>
> > <ns2:sitemapindex
> > xmlns:ns2="http://www.sitemaps.org/schemas/sitemap/0.9">
> > <sitemap>
> > <loc>https://www.saxion.nl/opleidingen-sitemap.xml</loc>
> > <loc>https://www.saxion.nl/content-sitemap.xml</loc>
> > </sitemap>
> > </ns2:sitemapindex>
> > 
> > This seems fine, but Nutch attempts, and obviously fails, to load:
> > 
> > 2018-05-25 16:27:50,515 ERROR [Thread-30]
> > org.apache.nutch.util.SitemapProcessor: Error while fetching the sitemap.
> > Status code: 14 for
> > https://www.saxion.nl/opleidingen-sitemap.xmlhttps://www.saxion.nl/content-sitemap.xml
> > 
> > What is going on here? Why does Nutch, or CC's sitemap util, behave like 
> > this?
> > 
> > Thanks,
> > Markus
> 
> 

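To illustrate the point made in the thread, here is a minimal Python sketch (not Nutch's Java code) of how two <loc> elements inside a single <sitemap> could end up as one joined URL, and how the "exactly one <loc> per <sitemap>" rule can be checked before crawling. The helper names naive_locs and check_index are hypothetical, and the concatenating reader is only an assumption about how such a result might arise; it is not taken from SitemapProcessor or crawler-commons.

```python
import xml.etree.ElementTree as ET

# The sitemap index quoted in the thread: two <loc> inside one <sitemap>.
INVALID_INDEX = """<?xml version="1.0" encoding="UTF-8"?>
<ns2:sitemapindex xmlns:ns2="http://www.sitemaps.org/schemas/sitemap/0.9">
<sitemap>
<loc>https://www.saxion.nl/opleidingen-sitemap.xml</loc>
<loc>https://www.saxion.nl/content-sitemap.xml</loc>
</sitemap>
</ns2:sitemapindex>"""

# A conformant variant: one <sitemap> entry per referenced sitemap.
VALID_INDEX = """<?xml version="1.0" encoding="UTF-8"?>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
<sitemap><loc>https://www.saxion.nl/opleidingen-sitemap.xml</loc></sitemap>
<sitemap><loc>https://www.saxion.nl/content-sitemap.xml</loc></sitemap>
</sitemapindex>"""


def local_name(tag):
    """Strip a leading '{namespace}' from an ElementTree tag name."""
    return tag.rsplit("}", 1)[-1]


def sitemap_entries(xml_text):
    """Yield every <sitemap> element, ignoring namespaces."""
    root = ET.fromstring(xml_text.encode("utf-8"))
    for element in root.iter():
        if local_name(element.tag) == "sitemap":
            yield element


def naive_locs(xml_text):
    """Hypothetical reader that assumes one <loc> per <sitemap> and simply
    accumulates all <loc> text it sees inside each entry."""
    urls = []
    for entry in sitemap_entries(xml_text):
        parts = [
            (child.text or "").strip()
            for child in entry
            if local_name(child.tag) == "loc"
        ]
        urls.append("".join(parts))
    return urls


def check_index(xml_text):
    """Report every <sitemap> entry that does not hold exactly one <loc>."""
    problems = []
    for number, entry in enumerate(sitemap_entries(xml_text), start=1):
        count = sum(1 for child in entry if local_name(child.tag) == "loc")
        if count != 1:
            problems.append(
                f"<sitemap> #{number} has {count} <loc> elements, expected 1"
            )
    return problems


if __name__ == "__main__":
    print(naive_locs(INVALID_INDEX))   # one joined URL, as in the log line
    print(check_index(INVALID_INDEX))  # flags the offending entry
    print(check_index(VALID_INDEX))    # no problems reported
```

Running a check like check_index on a sitemap index before handing it to the crawler would surface the malformed entry directly, rather than as a "not found" status on a URL that never existed.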