Re: RE: Usage previous stage HostDb data for generate(fetched deltas)

Semyon Semyonov Sat, 16 Dec 2017 03:15:06 -0800

Hi Yossi,

What you say makes sense if you run Nutch in "whole Internet crawling" mode,
in other words, when you don't specify the set of hosts you want to crawl and
simply crawl without any bound.


Our case is different. We crawl specific hosts per country (around 200000).
For each host we set up a stop condition in generate, with an expression based
on the number of pages fetched per host, let's say db_fetched < 100 (see
https://issues.apache.org/jira/browse/NUTCH-2368).

The problem is that for really deep websites this stop condition can be hard
to reach (in practice, it is never reached). As an illustration, imagine a
website with the following structure: 1-10-15-5-1-1-1-...

Therefore I want to have a mechanism to stop at a specific point for such a
host even though the db_fetched limit has not been reached yet.
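
As a rough illustration of such a mechanism (plain Python, not Nutch or JEXL
code), the sketch below stops generating for a host once the per-round fetched
delta stays small for a few rounds. It assumes the numbers in the structure
above are pages fetched per crawl round; the names should_generate, min_delta
and patience are hypothetical, as are their default values.

```python
from itertools import accumulate


def should_generate(fetched_history, max_fetched=100, min_delta=2, patience=2):
    """Decide whether a host should still get URLs in the next generate round.

    fetched_history: cumulative db_fetched for the host after each round
    (assumed to come from HostDb snapshots of previous rounds).
    All parameter names and defaults are hypothetical.
    """
    if not fetched_history:
        return True  # nothing fetched yet, keep going
    if fetched_history[-1] >= max_fetched:
        return False  # the usual db_fetched limit has been reached
    # Per-round deltas: difference between consecutive cumulative counts.
    previous = [0] + fetched_history[:-1]
    deltas = [cur - prev for prev, cur in zip(previous, fetched_history)]
    recent = deltas[-patience:]
    # Stalled: the last `patience` rounds each fetched fewer than min_delta pages.
    stalled = len(recent) == patience and all(d < min_delta for d in recent)
    return not stalled


# The example site, read as pages fetched per round (an assumption).
per_round = [1, 10, 15, 5, 1, 1, 1]
history = list(accumulate(per_round))  # cumulative db_fetched per round

for rounds in range(1, len(history) + 1):
    print(rounds, history[rounds - 1], should_generate(history[:rounds]))
```

With these defaults the host keeps being generated through round five and is
dropped from round six on, long before db_fetched would reach 100.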

Semyon.
