Sebastian, I do not want to be a pain in the arse, but I do not have a 
GitHub account. If you would do the honours of opening a ticket, please do so.

Apologies,
Markus

 
 
-----Original message-----
> From:Sebastian Nagel <[email protected]>
> Sent: Tuesday 29th May 2018 11:33
> To: [email protected]
> Subject: Re: Sitemap URL's concatenated, causing status 14 not found
> 
> > I agree that this is not the ideal error behaviour, but I guess the 
> > code was written from the assumption that the document is valid and 
> > conformant.
> 
> Over time the crawler-commons sitemap parser has been extended to get as much 
> as possible from non-conforming sitemaps as well. Of course, it's hard to 
> foresee and handle all possible mistakes...
> The equivalent syntax error for sitemaps (a missing closing/next <url> in 
> <urlset>) is handled.
> 
> @Markus: Please open an issue for crawler-commons
>   https://github.com/crawler-commons/crawler-commons/issues/
> 
> Thanks,
> Sebastian
> 
> 
> On 05/26/2018 02:57 AM, Yossi Tamari wrote:
> > Hi Markus,
> > 
> > I don’t believe this is a valid sitemapindex. Each <sitemap> should include 
> > exactly one <loc>.
> > See also https://www.sitemaps.org/protocol.html#index and 
> > https://www.sitemaps.org/schemas/sitemap/0.9/siteindex.xsd.
> > I agree that this is not the ideal error behaviour, but I guess the 
> > code was written from the assumption that the document is valid and 
> > conformant.
> > 
> >     Yossi.
> > 
> >> -----Original Message-----
> >> From: Markus Jelsma <[email protected]>
> >> Sent: 25 May 2018 23:45
> >> To: User <[email protected]>
> >> Subject: Sitemap URL's concatenated, causing status 14 not found
> >>
> >> Hello,
> >>
> >> We have a sitemap.xml pointing to further sitemaps. The XML seems fine, but
> >> Nutch thinks those two sitemap URLs are actually one, consisting of both
> >> concatenated.
> >>
> >> Here is https://www.saxion.nl/sitemap.xml
> >>
> >> <?xml version="1.0" encoding="UTF-8"?>
> >> <ns2:sitemapindex
> >> xmlns:ns2="http://www.sitemaps.org/schemas/sitemap/0.9">
> >> <sitemap>
> >> <loc>https://www.saxion.nl/opleidingen-sitemap.xml</loc>
> >> <loc>https://www.saxion.nl/content-sitemap.xml</loc>
> >> </sitemap>
> >> </ns2:sitemapindex>
> >>
> >> This seems fine, but Nutch attempts, and obviously fails, to load:
> >>
> >> 2018-05-25 16:27:50,515 ERROR [Thread-30]
> >> org.apache.nutch.util.SitemapProcessor: Error while fetching the sitemap.
> >> Status code: 14 for
> >> https://www.saxion.nl/opleidingen-sitemap.xmlhttps://www.saxion.nl/content-sitemap.xml
> >>
> >> What is going on here? Why does Nutch, or CC's sitemap util behave like 
> >> this?
> >>
> >> Thanks,
> >> Markus
> > 
> 
> 
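
For illustration only, here is a minimal Python sketch of how the concatenation 
reported above could arise. It assumes a SAX-style parser that buffers <loc> 
text and only emits a URL when </sitemap> closes; that is a guess at the 
mechanism, not a reading of the crawler-commons source. NaiveHandler, 
LenientHandler and parse are hypothetical names made up for this sketch.

```python
import xml.sax

# The sitemap index quoted in the thread: two <loc> inside one <sitemap>.
SITEMAP_INDEX = b"""<?xml version="1.0" encoding="UTF-8"?>
<ns2:sitemapindex xmlns:ns2="http://www.sitemaps.org/schemas/sitemap/0.9">
<sitemap>
<loc>https://www.saxion.nl/opleidingen-sitemap.xml</loc>
<loc>https://www.saxion.nl/content-sitemap.xml</loc>
</sitemap>
</ns2:sitemapindex>
"""


class NaiveHandler(xml.sax.ContentHandler):
    """Hypothetical handler: buffers <loc> text, emits one URL per </sitemap>."""

    def __init__(self):
        super().__init__()
        self.urls = []
        self.buf = []
        self.in_loc = False

    def startElement(self, name, attrs):
        if name == "loc":
            self.in_loc = True

    def characters(self, content):
        # Only text inside <loc> is buffered; whitespace between tags is ignored.
        if self.in_loc:
            self.buf.append(content)

    def endElement(self, name):
        if name == "loc":
            self.in_loc = False
        elif name == "sitemap":
            # The buffer is never reset between <loc> elements, so a second
            # <loc> in the same <sitemap> is appended to the first.
            self.urls.append("".join(self.buf).strip())
            self.buf = []


class LenientHandler(NaiveHandler):
    """Hypothetical variant: emits a URL at every </loc> instead."""

    def endElement(self, name):
        if name == "loc":
            self.in_loc = False
            url = "".join(self.buf).strip()
            if url:
                self.urls.append(url)
            self.buf = []


def parse(handler_cls):
    """Parse SITEMAP_INDEX with the given handler class and return its URLs."""
    handler = handler_cls()
    xml.sax.parseString(SITEMAP_INDEX, handler)
    return handler.urls


if __name__ == "__main__":
    print(parse(NaiveHandler))
    print(parse(LenientHandler))
```

Under these assumptions the naive handler returns a single string made of both 
URLs joined together, matching the URL in the error log, while the lenient 
variant returns the two URLs separately. A conformant index avoids the problem 
altogether by wrapping each <loc> in its own <sitemap> element.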
