Hi Sebastian and Lewis,
I did a build on another machine and diffed the runtime logs, which made the
issue pretty clear: yes, the build was not proper. It is resolved now.
Happy crawling.
Regards,
GoViNd
On Mon, Jan 15, 2018 at 2:04 AM, Sebastian Nagel wrote:
> Hi Govind,
>
> thanks. At least, although it's
Hello,
We noticed some abnormalities in our crawl cycle caused by a sudden reduction
of our CrawlDB's size. The SitemapProcessor ran, failed (timed out, see below)
and left us with a decimated CrawlDB.
This is odd because of:
} catch (Exception e) {
if (fs.exists(tempCrawlDb))
Hi Markus,
the problem should be fixed with NUTCH-2442. It wasn't the case with the first
version of the sitemap processor. It's mandatory to also check the return value
of job.waitForCompletion(true); only checking for exceptions isn't enough!
Sebastian
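
A minimal sketch of the check described above, not the actual NUTCH-2442 patch. To stay self-contained it uses a hypothetical SimpleJob interface in place of Hadoop's Job and plain local files in place of the CrawlDB directories; the names JobResultCheck, SimpleJob and runAndInstall are assumptions made for illustration only. The point is that a job can fail without throwing, so the boolean result has to gate the swap of the temporary output into place.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

class JobResultCheck {

    /** Hypothetical stand-in for a Hadoop job: returns false on failure instead of throwing. */
    interface SimpleJob {
        boolean waitForCompletion(boolean verbose) throws Exception;
    }

    /**
     * Runs the job and installs its temporary output only if the job reports success.
     * Returns true if the output was installed, false if the job failed and the
     * existing data was left untouched.
     */
    static boolean runAndInstall(SimpleJob job, Path tempOutput, Path current)
            throws IOException {
        boolean success;
        try {
            success = job.waitForCompletion(true);
        } catch (Exception e) {
            // Failure signalled by an exception: discard the partial output.
            Files.deleteIfExists(tempOutput);
            throw new IOException("Job threw an exception", e);
        }
        if (!success) {
            // Failure signalled only by the return value (e.g. timed-out tasks):
            // discard the partial output and keep the current data.
            Files.deleteIfExists(tempOutput);
            return false;
        }
        // Only a successful job may replace the current data.
        Files.move(tempOutput, current, StandardCopyOption.REPLACE_EXISTING);
        return true;
    }
}
```

With only the catch block in place, a job that returns false would fall through to the move and overwrite good data with partial output, which matches the symptom reported in this thread.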
On 01/17/2018 11:51 AM, Markus Jelsma w
Ah thanks!
I knew you'd fixed some of these, now I know my patch of NUTCH-2466 silently
removes your commit!
My bad, thanks!
Markus
-Original message-
> From:Sebastian Nagel
> Sent: Wednesday 17th January 2018 13:32
> To: user@nutch.apache.org
> Subject: Re: SitemapProcessor destroye
It was finally Omkar who brought NUTCH-2442 forward.
Time to review the patch of NUTCH-2466!
On 01/17/2018 01:53 PM, Markus Jelsma wrote:
> Ah thanks!
>
> I knew you'd fixed some of these, now i know my patch of NUTCH-2466 silently
> removes your commit!
>
> My bad, thanks!
> Markus
>
>
I'll fix NUTCH-2466 this afternoon.
-Original message-
> From:Sebastian Nagel
> Sent: Wednesday 17th January 2018 14:09
> To: user@nutch.apache.org
> Subject: Re: SitemapProcessor destroyed our CrawlDB
>
> It was finally Omkar who brought NUTCH-2442 forward.
> Time to review the patch
Hi Markus,
What a disaster... do/did you have any crazy rules, replacements and/or
substitutions present in the urlnormalizer-regex configuration?
Lewis
On Wed, Jan 17, 2018 at 2:51 AM, wrote:
>
> From: Markus Jelsma
> To: User
> Cc:
> Bcc:
> Date: Wed, 17 Jan 2018 10:51:49 +
> Subject: S
Hello Lewis,
We do have some weird and complicated rules, but these should not time out for
450 seconds, i.e. keep the JVM busy for that amount of time. We haven't fully
investigated yet, so it is possible that some sitemap entries are very
long and complicated. But 450 seconds, very odd,