Hi Markus,

I was wondering what happens if we launch one fetch job that fetches URLs
from HostA (amongst others), and then later run another, separate
fetch job that also fetches URLs from HostA.

i.e.

- add URLs including HostA
- inject -> generate -> fetch
- add more URLs including HostA
- inject -> generate -> fetch

Is there anything in place to enforce the politeness policy in this case?
Do we wait for the first job to finish, and ensure that this happens in a
timely manner (e.g. by limiting it to topN URLs)?
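
To make the concern concrete, here is a minimal sketch of what I mean by a
per-host politeness delay. It is not Nutch code: the class name
HostPolitenessGate, the reserve() method and the 5000 ms delay are all
hypothetical, made up purely for illustration. The point is that the
bookkeeping lives inside one process, so two independent jobs would each
hold their own copy and could hit HostA at the same time.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical illustration only; this class is not part of Nutch. */
class HostPolitenessGate {
    private final long minDelayMs;
    // Earliest time (in ms) at which each host may be contacted again.
    private final Map<String, Long> nextAllowed = new HashMap<>();

    HostPolitenessGate(long minDelayMs) {
        this.minDelayMs = minDelayMs;
    }

    /**
     * Reserves the next fetch slot for the host and returns how many
     * milliseconds the caller must wait, measured from nowMs.
     */
    synchronized long reserve(String host, long nowMs) {
        long earliest = nextAllowed.getOrDefault(host, nowMs);
        long start = Math.max(earliest, nowMs);
        nextAllowed.put(host, start + minDelayMs);
        return start - nowMs;
    }

    public static void main(String[] args) {
        // Two separate jobs, each with its own (unshared) gate.
        HostPolitenessGate jobA = new HostPolitenessGate(5000);
        HostPolitenessGate jobB = new HostPolitenessGate(5000);

        System.out.println(jobA.reserve("HostA", 0));    // 0: first request goes out at once
        System.out.println(jobA.reserve("HostA", 1000)); // 4000: same job waits out the delay
        System.out.println(jobB.reserve("HostA", 1000)); // 0: job B knows nothing about job A
    }
}
```

Within one job the second request to HostA is held back, but the second job
sees an empty table and goes straight ahead. That is the situation I am
asking about.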

Many thanks for your advice,
Dan


On Tue, Jan 31, 2012 at 7:33 AM, Markus Jelsma
<[email protected]> wrote:
> This is during fetching? Just create a new FetchItem and add it to the queue:
>
> FetchItem fit = FetchItem.create(new Text("http://url"), new
> CrawlDatum(CrawlDatum.STATUS_LINKED, interval), queueMode);
> fetchQueues.addFetchItem(fit);
>
>> Hi,
>>
>> Being new to Nutch, I'm unsure how Nutch deals with fetching more
>> URLs for a site that is currently being fetched.
>>
>> e.g. if we inject -> generate -> fetch, then whilst fetching we want
>> to add more URLs for potentially the same sites currently being
>> fetched, what's in place to ensure that we continue to adhere to the
>> politeness policy? As I understand it, with the initial fetch all URLs
>> for the same site are partitioned into the same segment, though I'm
>> unsure what might maintain this for the second.
>>
>> Perhaps it's as simple as only allowing one fetch process per cluster at a
>> time, and then limiting the number of URLs per host to ensure the second
>> can start in a timely manner?
>>
>> Thanks for your help,
>> Dan
