From: "Semyon Semyonov" <semyon.semyo...@mail.com>
To: "usernutch.apache.org" <user@nutch.apache.org>
Subject: Usage previous stage HostDb data for generate(fetched deltas)
Dear all,
I plan to improve the hostdb functionality to have a DB_FETCHED delta for the
generate stage.
L
Hi Yossi,
What you say makes sense if you run Nutch in the "whole Internet crawling"
mode. In other words, you don't specify the set of hosts you want to crawl, but
crawl up to infinity.
Our case is different. We crawl specific hosts for each country (around
20). For each host we set
t you will not crawl it, because your delta condition is still not
satisfied.
What am I missing?
Yossi.
> -----Original Message-----
> From: Semyon Semyonov [mailto:semyon.semyo...@mail.com]
> Sent: 14 December 2017 15:08
> To: usernutch.apache.org <user@nutch.apache.org>
I have created an issue for this functionality:
https://issues.apache.org/jira/browse/NUTCH-2481
Sent: Thursday, December 14, 2017 at 2:07 PM
From: "Semyon Semyonov" <semyon.semyo...@mail.com>
To: "usernutch.apache.org" <user@nutch.apache.org>
Subject:
Dear all,
I plan to improve the hostdb functionality to have a DB_FETCHED delta for the
generate stage.
Let's say that for each website we keep generating while the number of fetched
pages is < 150.
The problem is that some websites will (almost) never reach that limit,
because of their structure.
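
The stopping rule discussed in this thread might be sketched roughly as follows. This is only an illustration in Python, not Nutch code: the names HostStats, should_generate, fetched_prev, fetched_now and min_delta are hypothetical, and it assumes that "delta" means the change in a host's fetched count between two consecutive rounds.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class HostStats:
    """Per-host counters (hypothetical; not the actual HostDatum layout)."""
    fetched_now: int                    # fetched pages after the current round
    fetched_prev: Optional[int] = None  # fetched pages after the previous round, if any


def should_generate(stats: HostStats, max_fetched: int = 150, min_delta: int = 1) -> bool:
    """Decide whether to keep generating URLs for a host.

    Assumed rule: stop once the fetched limit is reached, or once the
    previous round added fewer than min_delta newly fetched pages.
    """
    if stats.fetched_now >= max_fetched:
        return False  # limit reached, nothing more to generate
    if stats.fetched_prev is None:
        return True   # no previous round recorded, so no delta to compare yet
    delta = stats.fetched_now - stats.fetched_prev
    return delta >= min_delta
```

With a rule like this, a host stuck at 120 fetched pages would be dropped from generate after one round without progress, instead of being retried indefinitely because 120 < 150 still holds.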