Hi Markus,

ok, no problem. Done:
  https://github.com/crawler-commons/crawler-commons/issues/213

Sebastian

On 06/07/2018 12:21 AM, Markus Jelsma wrote:
> Sebastian, I do not want to be a pain in the arse, but I do not have a 
> GitHub account. If you would do the honours of opening a ticket, please do so.
> 
> Sorry,
> Markus
> 
>  
>  
> -----Original message-----
>> From:Sebastian Nagel <wastl.na...@googlemail.com>
>> Sent: Tuesday 29th May 2018 11:33
>> To: user@nutch.apache.org
>> Subject: Re: Sitemap URL's concatenated, causing status 14 not found
>>
>>> I agree that this is not the ideal error behaviour, but I guess the 
>>> code was written from the assumption that the document is valid and
>>> conformant.
>>
>> Over time the crawler-commons sitemap parser has been extended to get as 
>> much as possible from
>> non-conforming sitemaps as well. Of course, it's hard to foresee and handle 
>> all possible mistakes...
>> The equivalent syntax error for sitemaps (missing closing/next <url> in 
>> <urlset>) is handled.
>>
>> @Markus: Please open an issue for crawler-commons
>>   https://github.com/crawler-commons/crawler-commons/issues/
>>
>> Thanks,
>> Sebastian
>>
>>
>> On 05/26/2018 02:57 AM, Yossi Tamari wrote:
>>> Hi Markus,
>>>
>>> I don’t believe this is a valid sitemapindex. Each <sitemap> should include 
>>> exactly one <loc>.
>>> See also https://www.sitemaps.org/protocol.html#index and 
>>> https://www.sitemaps.org/schemas/sitemap/0.9/siteindex.xsd.
>>> I agree that this is not the ideal error behaviour, but I guess the 
>>> code was written from the assumption that the document is valid and 
>>> conformant.
>>>
>>>     Yossi.
>>>
>>>> -----Original Message-----
>>>> From: Markus Jelsma <markus.jel...@openindex.io>
>>>> Sent: 25 May 2018 23:45
>>>> To: User <user@nutch.apache.org>
>>>> Subject: Sitemap URL's concatenated, causing status 14 not found
>>>>
>>>> Hello,
>>>>
>>>> We have a sitemap.xml pointing to further sitemaps. The XML seems fine, but
>>>> Nutch thinks those two sitemap URL's are actually one consisting of both
>>>> concatenated.
>>>>
>>>> Here is https://www.saxion.nl/sitemap.xml
>>>>
>>>> <?xml version="1.0" encoding="UTF-8"?>
>>>> <ns2:sitemapindex
>>>> xmlns:ns2="http://www.sitemaps.org/schemas/sitemap/0.9">
>>>> <sitemap>
>>>> <loc>https://www.saxion.nl/opleidingen-sitemap.xml</loc>
>>>> <loc>https://www.saxion.nl/content-sitemap.xml</loc>
>>>> </sitemap>
>>>> </ns2:sitemapindex>
>>>>
>>>> This seems fine, but Nutch attempts, and obviously fails to load:
>>>>
>>>> 2018-05-25 16:27:50,515 ERROR [Thread-30]
>>>> org.apache.nutch.util.SitemapProcessor: Error while fetching the sitemap.
>>>> Status code: 14 for
>>>> https://www.saxion.nl/opleidingen-sitemap.xmlhttps://www.saxion.nl/content-sitemap.xml
>>>>
>>>> What is going on here? Why does Nutch, or CC's sitemap util behave like 
>>>> this?
>>>>
>>>> Thanks,
>>>> Markus
>>>
>>
>>
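
For readers of the archive, the concatenation reported at the bottom of the thread can be reproduced with a short sketch. This is not the crawler-commons code. It only assumes, as Yossi guessed, that a parser expecting exactly one <loc> per <sitemap> collects all character data inside the <sitemap> element into a single URL. The function names naive_locs and tolerant_locs are hypothetical and exist only for illustration.

```python
import xml.etree.ElementTree as ET

# The sitemap index from the original report.
SITEMAP_INDEX = """<?xml version="1.0" encoding="UTF-8"?>
<ns2:sitemapindex xmlns:ns2="http://www.sitemaps.org/schemas/sitemap/0.9">
<sitemap>
<loc>https://www.saxion.nl/opleidingen-sitemap.xml</loc>
<loc>https://www.saxion.nl/content-sitemap.xml</loc>
</sitemap>
</ns2:sitemapindex>
"""


def local_name(tag):
    """Strip a '{namespace}' prefix from an ElementTree tag, if present."""
    return tag.rsplit("}", 1)[-1]


def naive_locs(root):
    """Assumed behaviour: one URL per <sitemap>, built from all text inside it."""
    urls = []
    for el in root.iter():
        if local_name(el.tag) == "sitemap":
            # Every text fragment below <sitemap> is run together, so two
            # <loc> children end up as one concatenated string.
            urls.append("".join(text.strip() for text in el.itertext()))
    return urls


def tolerant_locs(root):
    """Lenient alternative: treat every non-empty <loc> as its own URL."""
    urls = []
    for el in root.iter():
        if local_name(el.tag) == "loc" and el.text and el.text.strip():
            urls.append(el.text.strip())
    return urls


root = ET.fromstring(SITEMAP_INDEX.encode("utf-8"))
print(naive_locs(root))
print(tolerant_locs(root))
```

Under that assumption the naive variant yields the single concatenated URL seen in the Nutch log, while the lenient variant recovers both sitemap URLs even though the index does not conform to the schema.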
