RE: SitemapProcessor destroyed our CrawlDB

2018-01-17 Thread Markus Jelsma
> Subject: Re: SitemapProcessor destroyed our CrawlDB > > Hi Markus, > > What a disaster... do/did you have any crazy rules, replacements and/or > substitutions present in the urlnormalizer-regex configuration? > Lewis > > On Wed, Jan 17, 2018 at 2:51 AM, wrote: > > > >

Re: SitemapProcessor destroyed our CrawlDB

2018-01-17 Thread lewis john mcgibbney
Hi Markus, What a disaster... do/did you have any crazy rules, replacements and/or substitutions present in the urlnormalizer-regex configuration? Lewis On Wed, Jan 17, 2018 at 2:51 AM, wrote: > > From: Markus Jelsma > To: User > Cc: > Bcc: > Date: Wed, 17 Jan 2018 10:51:49 + > Subject: S

RE: SitemapProcessor destroyed our CrawlDB

2018-01-17 Thread Markus Jelsma
I'll fix NUTCH-2466 this afternoon. -Original message- > From: Sebastian Nagel > Sent: Wednesday 17th January 2018 14:09 > To: user@nutch.apache.org > Subject: Re: SitemapProcessor destroyed our CrawlDB > > It was finally Omkar who brought NUTCH-2442 forward. >

Re: SitemapProcessor destroyed our CrawlDB

2018-01-17 Thread Sebastian Nagel
y bad, thanks! > Markus > > -Original message- >> From: Sebastian Nagel >> Sent: Wednesday 17th January 2018 13:32 >> To: user@nutch.apache.org >> Subject: Re: SitemapProcessor destroyed our CrawlDB >> >> Hi Markus, >> >> the problem shoul

RE: SitemapProcessor destroyed our CrawlDB

2018-01-17 Thread Markus Jelsma
SitemapProcessor destroyed our CrawlDB > > Hi Markus, > > the problem should be fixed with NUTCH-2442. It wasn't the case with the > first version of the > sitemap processor. It's mandatory to check also the return value of > job.waitForCompletion(true), > only c

Re: SitemapProcessor destroyed our CrawlDB

2018-01-17 Thread Sebastian Nagel
Hi Markus, the problem should be fixed with NUTCH-2442. It wasn't the case with the first version of the sitemap processor. It's also mandatory to check the return value of job.waitForCompletion(true); only checking for exceptions isn't enough! Sebastian On 01/17/2018 11:51 AM, Markus Jelsma w
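
To illustrate Sebastian's point about job.waitForCompletion(true), here is a minimal sketch, not the actual NUTCH-2442 change. Hadoop's Job.waitForCompletion(boolean) returns false when the job fails, and it does not necessarily throw in that case, so a caller that only catches exceptions may go on to install incomplete output. To keep the example runnable without Hadoop on the classpath, CompletableJob is a hypothetical stand-in with the same method shape, and runAndInstall, install and cleanup are illustrative names that do not come from Nutch.

```java
import java.io.IOException;

public class JobResultCheck {

    /** Hypothetical stand-in mirroring the shape of Hadoop's Job.waitForCompletion(boolean). */
    public interface CompletableJob {
        boolean waitForCompletion(boolean verbose)
                throws IOException, InterruptedException, ClassNotFoundException;
    }

    /**
     * Runs the job and installs its output only if the job reports success.
     * Both a thrown exception and a false return value count as failure.
     *
     * @return true if the output was installed, false if it was discarded
     */
    public static boolean runAndInstall(CompletableJob job, Runnable install, Runnable cleanup) {
        boolean success;
        try {
            success = job.waitForCompletion(true);
        } catch (InterruptedException e) {
            // Restore the interrupt flag so callers can still observe it.
            Thread.currentThread().interrupt();
            success = false;
        } catch (IOException | ClassNotFoundException e) {
            success = false;
        }

        if (!success) {
            // Discard temporary output and leave the existing data untouched.
            cleanup.run();
            return false;
        }

        // Only a job that returned true may replace the existing data.
        install.run();
        return true;
    }

    public static void main(String[] args) {
        // A job that fails quietly: no exception, just a false return value.
        boolean installed = runAndInstall(
                verbose -> false,
                () -> System.out.println("installing new CrawlDB"),
                () -> System.out.println("job failed, keeping old CrawlDB"));
        System.out.println("installed = " + installed);
    }
}
```

The design choice is to treat the boolean result and the exception path as one failure condition, so the install step is reachable only after an explicit success.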