Re: Usage previous stage HostDb data for generate(fetched deltas)

2018-01-19 Thread Semyon Semyonov
m: "Semyon Semyonov" <semyon.semyo...@mail.com> To: "usernutch.apache.org" <user@nutch.apache.org> Subject: Usage previous stage HostDb data for generate(fetched deltas) Dear all, I plan to improve hostdb functionality to have a DB_FETCHED delta for generate stage. L

Re: RE: Usage previous stage HostDb data for generate(fetched deltas)

2017-12-16 Thread Semyon Semyonov
Hi Yossi, What you say makes sense if you run Nutch in the "whole Internet crawling" mode. In other words, you don't specify the set of hosts you want to crawl, but crawl up to infinity. Our case is different. We crawl specific hosts for each country (around 20). For each host we set

RE: Usage previous stage HostDb data for generate(fetched deltas)

2017-12-15 Thread Yossi Tamari
t you will not crawl it, because your delta condition is still not satisfied. What am I missing? Yossi. > -Original Message- > From: Semyon Semyonov [mailto:semyon.semyo...@mail.com] > Sent: 14 December 2017 15:08 > To: usernutch.apache.org <user@nutch.apache.org>

Fw: Usage previous stage HostDb data for generate(fetched deltas)

2017-12-15 Thread Semyon Semyonov
I have created an issue for this functionality: https://issues.apache.org/jira/browse/NUTCH-2481 Sent: Thursday, December 14, 2017 at 2:07 PM From: "Semyon Semyonov" <semyon.semyo...@mail.com> To: "usernutch.apache.org" <user@nutch.apache.org> Subject:

Usage previous stage HostDb data for generate(fetched deltas)

2017-12-14 Thread Semyon Semyonov
Dear all, I plan to improve the hostdb functionality to have a DB_FETCHED delta for the generate stage. Let's say that for each website we have a generate condition: keep generating while the number of fetched pages is < 150. The problem is that for some websites that condition will (almost) never be satisfied, because of their structure.
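
To make the idea above concrete, here is a minimal sketch of what a per-host generate condition based on a fetched delta might look like. It is not Nutch code and not the implementation proposed in NUTCH-2481. The names `HostStats`, `fetched_delta`, `should_generate` and the `MIN_FETCHED_DELTA` threshold are hypothetical, and the rule "stop when the fetched count no longer grows between two HostDb snapshots" is an assumption about what the delta is meant for. Only the limit of 150 comes from the message.

```python
from dataclasses import dataclass

FETCHED_LIMIT = 150      # from the example in the message: generate while fetched < 150
MIN_FETCHED_DELTA = 1    # hypothetical threshold, not taken from the thread


@dataclass
class HostStats:
    """Fetched counts for one host in two consecutive HostDb snapshots (hypothetical type)."""
    fetched_previous: int  # fetched count in the previous stage's HostDb
    fetched_current: int   # fetched count in the current HostDb


def fetched_delta(stats: HostStats) -> int:
    """Number of pages newly fetched for the host since the previous snapshot."""
    return stats.fetched_current - stats.fetched_previous


def should_generate(stats: HostStats,
                    limit: int = FETCHED_LIMIT,
                    min_delta: int = MIN_FETCHED_DELTA) -> bool:
    """Decide whether to keep generating URLs for the host.

    Assumed rule: stop once the host has reached the fetched limit, or once
    the last round added fewer than min_delta fetched pages, since such a
    host is unlikely to ever reach the limit.
    """
    if stats.fetched_current >= limit:
        return False
    return fetched_delta(stats) >= min_delta


if __name__ == "__main__":
    print(should_generate(HostStats(fetched_previous=40, fetched_current=90)))
    print(should_generate(HostStats(fetched_previous=60, fetched_current=60)))
```

In this sketch a host that stalls at 60 fetched pages is dropped from generation instead of being retried indefinitely. Note that once a host is skipped its delta stays at zero, so under this rule it would not be picked up again without some additional reset.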