Sebastian, I do not want to be a pain in the arse, but I do not have a GitHub account. If you would do the honours of opening a ticket, please do so.

Apologies,
Markus

-----Original message-----
> From: Sebastian Nagel <[email protected]>
> Sent: Tuesday 29th May 2018 11:33
> To: [email protected]
> Subject: Re: Sitemap URL's concatenated, causing status 14 not found
>
> > I agree that this is not the ideal error behaviour, but I guess the
> > code was written from the assumption that the document is valid and
> > conformant.
>
> Over time the crawler-commons sitemap parser has been extended to get as much
> as possible from non-conforming sitemaps as well. Of course, it's hard to
> foresee and handle all possible mistakes...
> The equivalent syntax error for sitemaps (missing closing/next <url> in
> <urlset>) is handled.
>
> @Markus: Please open an issue for crawler-commons
> https://github.com/crawler-commons/crawler-commons/issues/
>
> Thanks,
> Sebastian
>
>
> On 05/26/2018 02:57 AM, Yossi Tamari wrote:
> > Hi Markus,
> >
> > I don’t believe this is a valid sitemapindex. Each <sitemap> should include
> > exactly one <loc>.
> > See also https://www.sitemaps.org/protocol.html#index and
> > https://www.sitemaps.org/schemas/sitemap/0.9/siteindex.xsd.
> > I agree that this is not the ideal error behaviour, but I guess the
> > code was written from the assumption that the document is valid and
> > conformant.
> >
> > Yossi.
> >
> >> -----Original Message-----
> >> From: Markus Jelsma <[email protected]>
> >> Sent: 25 May 2018 23:45
> >> To: User <[email protected]>
> >> Subject: Sitemap URL's concatenated, causing status 14 not found
> >>
> >> Hello,
> >>
> >> We have a sitemap.xml pointing to further sitemaps. The XML seems fine, but
> >> Nutch thinks those two sitemap URL's are actually one consisting of both
> >> concatenated.
> >>
> >> Here is https://www.saxion.nl/sitemap.xml
> >>
> >> <?xml version="1.0" encoding="UTF-8"?>
> >> <ns2:sitemapindex
> >> xmlns:ns2="http://www.sitemaps.org/schemas/sitemap/0.9">
> >> <sitemap>
> >> <loc>https://www.saxion.nl/opleidingen-sitemap.xml</loc>
> >> <loc>https://www.saxion.nl/content-sitemap.xml</loc>
> >> </sitemap>
> >> </ns2:sitemapindex>
> >>
> >> This seems fine, but Nutch attempts, and obviously fails to load:
> >>
> >> 2018-05-25 16:27:50,515 ERROR [Thread-30] org.apache.nutch.util.SitemapProcessor: Error while fetching the sitemap. Status code: 14 for https://www.saxion.nl/opleidingen-sitemap.xmlhttps://www.saxion.nl/content-sitemap.xml
> >>
> >> What is going on here? Why does Nutch, or CC's sitemap util behave like
> >> this?
> >>
> >> Thanks,
> >> Markus
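
For anyone following along, the concatenation reported above can be reproduced with a small SAX-style sketch. This is not the crawler-commons code, and I have not checked how its parser is actually implemented. It only illustrates one plausible mechanism: a handler that collects <loc> text per <sitemap> and emits it at </sitemap>, so two <loc> children end up glued together. The class names `NaiveIndexHandler` and `TolerantIndexHandler` and the helper `parse` are made up for this example. The second handler shows one possible lenient behaviour: emit one URL per </loc> and record a warning when a <sitemap> does not contain exactly one <loc>.

```python
import xml.sax

# The sitemap index quoted in the thread: one <sitemap> holding two <loc> children.
SITEMAP_INDEX = b"""<?xml version="1.0" encoding="UTF-8"?>
<ns2:sitemapindex xmlns:ns2="http://www.sitemaps.org/schemas/sitemap/0.9">
<sitemap>
<loc>https://www.saxion.nl/opleidingen-sitemap.xml</loc>
<loc>https://www.saxion.nl/content-sitemap.xml</loc>
</sitemap>
</ns2:sitemapindex>"""


class NaiveIndexHandler(xml.sax.ContentHandler):
    """Hypothetical handler: buffers all <loc> text and emits it at </sitemap>."""

    def __init__(self):
        super().__init__()
        self.locations = []
        self._buffer = []
        self._in_loc = False

    def startElement(self, name, attrs):
        if name == "sitemap":
            self._buffer = []       # buffer is reset per <sitemap>, not per <loc>
        elif name == "loc":
            self._in_loc = True

    def characters(self, content):
        if self._in_loc:
            self._buffer.append(content)

    def endElement(self, name):
        if name == "loc":
            self._in_loc = False
        elif name == "sitemap":
            # Assumes exactly one <loc>; a second one is silently appended.
            self.locations.append("".join(self._buffer).strip())


class TolerantIndexHandler(xml.sax.ContentHandler):
    """Hypothetical handler: emits one URL per </loc> and flags odd <sitemap>s."""

    def __init__(self):
        super().__init__()
        self.locations = []
        self.warnings = []
        self._buffer = []
        self._in_loc = False
        self._loc_count = 0

    def startElement(self, name, attrs):
        if name == "sitemap":
            self._loc_count = 0
        elif name == "loc":
            self._buffer = []       # buffer is reset for every <loc>
            self._in_loc = True

    def characters(self, content):
        if self._in_loc:
            self._buffer.append(content)

    def endElement(self, name):
        if name == "loc":
            self._in_loc = False
            self._loc_count += 1
            self.locations.append("".join(self._buffer).strip())
        elif name == "sitemap" and self._loc_count != 1:
            self.warnings.append(
                f"<sitemap> has {self._loc_count} <loc> elements, expected 1"
            )


def parse(handler):
    """Run the given handler over the sample index and return it."""
    xml.sax.parseString(SITEMAP_INDEX, handler)
    return handler


if __name__ == "__main__":
    print(parse(NaiveIndexHandler()).locations)
    tolerant = parse(TolerantIndexHandler())
    print(tolerant.locations)
    print(tolerant.warnings)
```

With the sample index, the naive handler returns a single string made of both URLs back to back, which matches the URL in the SitemapProcessor error. The tolerant handler returns the two URLs separately plus one warning. Whether a real parser should accept such an index or reject it is a design choice. The protocol itself requires exactly one <loc> per <sitemap>.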

