Hi Markus,

OK, no problem. Done: https://github.com/crawler-commons/crawler-commons/issues/213
Sebastian

On 06/07/2018 12:21 AM, Markus Jelsma wrote:
> Sebastian, I do not want to be a pain in the ass, but I do not have a
> GitHub account. If you would do the honours of opening a ticket, please do so.
>
> Apologies,
> Markus
>
> -----Original message-----
>> From:Sebastian Nagel <wastl.na...@googlemail.com>
>> Sent: Tuesday 29th May 2018 11:33
>> To: user@nutch.apache.org
>> Subject: Re: Sitemap URL's concatenated, causing status 14 not found
>>
>>> I agree that this is not the ideal error behaviour, but I guess the
>>> code was written from the assumption that the document is valid and
>>> conformant.
>>
>> Over time the crawler-commons sitemap parser has been extended to get as
>> much as possible from non-conforming sitemaps as well. Of course, it's hard
>> to foresee and handle all possible mistakes...
>> The equivalent syntax error for sitemaps (missing closing/next <url> in
>> <urlset>) is handled.
>>
>> @Markus: Please open an issue for crawler-commons
>> https://github.com/crawler-commons/crawler-commons/issues/
>>
>> Thanks,
>> Sebastian
>>
>> On 05/26/2018 02:57 AM, Yossi Tamari wrote:
>>> Hi Markus,
>>>
>>> I don’t believe this is a valid sitemapindex. Each <sitemap> should include
>>> exactly one <loc>.
>>> See also https://www.sitemaps.org/protocol.html#index and
>>> https://www.sitemaps.org/schemas/sitemap/0.9/siteindex.xsd.
>>> I agree that this is not the ideal error behaviour, but I guess the
>>> code was written from the assumption that the document is valid and
>>> conformant.
>>>
>>> Yossi.
>>>
>>>> -----Original Message-----
>>>> From: Markus Jelsma <markus.jel...@openindex.io>
>>>> Sent: 25 May 2018 23:45
>>>> To: User <user@nutch.apache.org>
>>>> Subject: Sitemap URL's concatenated, causing status 14 not found
>>>>
>>>> Hello,
>>>>
>>>> We have a sitemap.xml pointing to further sitemaps. The XML seems fine, but
>>>> Nutch thinks those two sitemap URLs are actually one consisting of both
>>>> concatenated.
>>>>
>>>> Here is https://www.saxion.nl/sitemap.xml
>>>>
>>>> <?xml version="1.0" encoding="UTF-8"?>
>>>> <ns2:sitemapindex
>>>> xmlns:ns2="http://www.sitemaps.org/schemas/sitemap/0.9">
>>>> <sitemap>
>>>> <loc>https://www.saxion.nl/opleidingen-sitemap.xml</loc>
>>>> <loc>https://www.saxion.nl/content-sitemap.xml</loc>
>>>> </sitemap>
>>>> </ns2:sitemapindex>
>>>>
>>>> This seems fine, but Nutch attempts, and obviously fails to load:
>>>>
>>>> 2018-05-25 16:27:50,515 ERROR [Thread-30]
>>>> org.apache.nutch.util.SitemapProcessor: Error while fetching the sitemap.
>>>> Status code: 14 for
>>>> https://www.saxion.nl/opleidingen-sitemap.xmlhttps://www.saxion.nl/content-sitemap.xml
>>>>
>>>> What is going on here? Why does Nutch, or CC's sitemap util behave like
>>>> this?
>>>>
>>>> Thanks,
>>>> Markus
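
For illustration only, here is a minimal Python sketch of how a <sitemap> entry carrying two <loc> children could end up as one concatenated URL, and how a more lenient reader might treat each <loc> separately. This is not the crawler-commons or Nutch code, and the actual cause inside the parser is not confirmed by this thread. The function names (local_name, naive_locs, lenient_locs) are hypothetical, and the "join all text under <sitemap>" behaviour is an assumption chosen because it reproduces the URL seen in the log.

```python
import xml.etree.ElementTree as ET

# The sitemap index quoted in the original message.
SITEMAP_INDEX = """<?xml version="1.0" encoding="UTF-8"?>
<ns2:sitemapindex xmlns:ns2="http://www.sitemaps.org/schemas/sitemap/0.9">
<sitemap>
<loc>https://www.saxion.nl/opleidingen-sitemap.xml</loc>
<loc>https://www.saxion.nl/content-sitemap.xml</loc>
</sitemap>
</ns2:sitemapindex>
"""


def local_name(tag):
    """Drop a '{namespace}' prefix so namespaced and plain tags compare equal."""
    return tag.rsplit("}", 1)[-1]


def sitemap_elements(xml_text):
    """Return every <sitemap> element, whatever namespace it is in."""
    root = ET.fromstring(xml_text.encode("utf-8"))
    return [el for el in root.iter() if local_name(el.tag) == "sitemap"]


def naive_locs(xml_text):
    """Assumed failure mode: one URL per <sitemap>, built from all text under it.

    With two <loc> children the two URLs run together.
    """
    return [
        "".join(text.strip() for text in el.itertext())
        for el in sitemap_elements(xml_text)
    ]


def lenient_locs(xml_text):
    """Lenient variant: emit one URL per <loc>, even if a <sitemap> has several."""
    urls = []
    for el in sitemap_elements(xml_text):
        for child in el:
            if local_name(child.tag) == "loc" and child.text:
                urls.append(child.text.strip())
    return urls


if __name__ == "__main__":
    print(naive_locs(SITEMAP_INDEX))
    print(lenient_locs(SITEMAP_INDEX))
```

The naive variant returns a single string matching the URL in the error message, while the lenient variant returns the two sitemap URLs separately. A conforming index would instead wrap each <loc> in its own <sitemap> element, as Yossi points out.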