Re: Getting Error

2018-01-17 Thread govind nitk
Hi Sebastian and Lewis, I did a build on another machine and diffed the runtime logs. That made the issue pretty clear: the build was not proper. Got it resolved. Happy crawling. Regards, GoViNd On Mon, Jan 15, 2018 at 2:04 AM, Sebastian Nagel wrote: > Hi Govind, > > thanks. At least, although it's

SitemapProcessor destroyed our CrawlDB

2018-01-17 Thread Markus Jelsma
Hello, We noticed some abnormalities in our crawl cycle caused by a sudden reduction of our CrawlDB's size. The SitemapProcessor ran, failed (timed out, see below) and left us with a decimated CrawlDB. This is odd because of:     } catch (Exception e) {   if (fs.exists(tempCrawlDb))   

Re: SitemapProcessor destroyed our CrawlDB

2018-01-17 Thread Sebastian Nagel
Hi Markus, the problem should be fixed with NUTCH-2442. It wasn't the case with the first version of the sitemap processor. It's mandatory to also check the return value of job.waitForCompletion(true); only checking for exceptions isn't enough! Sebastian On 01/17/2018 11:51 AM, Markus Jelsma w
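
To illustrate the point about the return value, here is a minimal sketch, not the actual SitemapProcessor code or the NUTCH-2442 patch. Hadoop's Job.waitForCompletion(boolean) returns false when the job fails without throwing, so a catch block alone never sees that failure. To keep the example runnable without Hadoop on the classpath, JobLike, Actions, runAndInstall and the logged step names are hypothetical stand-ins invented for this sketch.

```java
import java.util.ArrayList;
import java.util.List;

public class CrawlDbSwapSketch {

    /** Hypothetical stand-in for org.apache.hadoop.mapreduce.Job (only this one call is modeled). */
    public interface JobLike {
        boolean waitForCompletion(boolean verbose) throws Exception;
    }

    /** Hypothetical stand-in for the filesystem steps; it only records what was done. */
    public static final class Actions {
        public final List<String> log = new ArrayList<>();
        void deleteTemp() { log.add("delete-temp"); }    // discard the temporary output
        void installTemp() { log.add("install-temp"); }  // replace the current CrawlDB
    }

    /** Installs the temporary output only if the job reports success. */
    public static boolean runAndInstall(JobLike job, Actions fs) {
        try {
            boolean success = job.waitForCompletion(true);
            if (!success) {
                // The job failed without throwing (for example a task timeout):
                // the output may be partial, so it must not replace the CrawlDB.
                fs.deleteTemp();
                return false;
            }
        } catch (Exception e) {
            // Failure reported via exception: same cleanup.
            fs.deleteTemp();
            return false;
        }
        fs.installTemp();
        return true;
    }

    public static void main(String[] args) {
        Actions failed = new Actions();
        boolean installed = runAndInstall(verbose -> false, failed);
        System.out.println("installed=" + installed + " steps=" + failed.log);

        Actions ok = new Actions();
        installed = runAndInstall(verbose -> true, ok);
        System.out.println("installed=" + installed + " steps=" + ok.log);
    }
}
```

With only the catch block, the first case above would fall through to the install step and overwrite the CrawlDB with whatever the failed job left behind.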

RE: SitemapProcessor destroyed our CrawlDB

2018-01-17 Thread Markus Jelsma
Ah thanks! I knew you'd fixed some of these, now I know my patch of NUTCH-2466 silently removes your commit! My bad, thanks! Markus -Original message- > From:Sebastian Nagel > Sent: Wednesday 17th January 2018 13:32 > To: user@nutch.apache.org > Subject: Re: SitemapProcessor destroye

Re: SitemapProcessor destroyed our CrawlDB

2018-01-17 Thread Sebastian Nagel
It was finally Omkar who brought NUTCH-2442 forward. Time to review the patch of NUTCH-2466! On 01/17/2018 01:53 PM, Markus Jelsma wrote: > Ah thanks! > > I knew you'd fixed some of these, now i know my patch of NUTCH-2466 silently > removes your commit! > > My bad, thanks! > Markus > >

RE: SitemapProcessor destroyed our CrawlDB

2018-01-17 Thread Markus Jelsma
I'll fix NUTCH-2466 this afternoon. -Original message- > From:Sebastian Nagel > Sent: Wednesday 17th January 2018 14:09 > To: user@nutch.apache.org > Subject: Re: SitemapProcessor destroyed our CrawlDB > > It was finally Omkar who brought NUTCH-2442 forward. > Time to review the patch

Re: SitemapProcessor destroyed our CrawlDB

2018-01-17 Thread lewis john mcgibbney
Hi Markus, What a disaster... do/did you have any crazy rules, replacements and/or substitutions present in the urlnormalizer-regex configuration? Lewis On Wed, Jan 17, 2018 at 2:51 AM, wrote: > > From: Markus Jelsma > To: User > Cc: > Bcc: > Date: Wed, 17 Jan 2018 10:51:49 + > Subject: S

RE: SitemapProcessor destroyed our CrawlDB

2018-01-17 Thread Markus Jelsma
Hello Lewis, We do have some weird and complicated rules, but these should not time out for 450 seconds, i.e. keep the JVM busy for that amount of time. We haven't fully investigated yet, so it is possible that some sitemap entries are very long and complicated. But 450 seconds, very odd,