Hi Markus,

I was wondering what happens if we launch one fetch job that fetches URLs from HostA (amongst others), and then later run another, separate fetch job for URLs from HostA.
i.e.
- add URLs including HostA
- inject -> generate -> fetch
- add more URLs including HostA
- inject -> generate -> fetch

Is there anything in place to enforce the politeness policy in this case? Do we wait for the first to finish, and ensure this happens in a timely manner (e.g. topN URLs)?

Many thanks for your advice,
Dan

On Tue, Jan 31, 2012 at 7:33 AM, Markus Jelsma <[email protected]> wrote:
> This is during fetching? Just create a new FetchItem and add it to the queue:
>
> FetchItem fit = FetchItem.create(new Text("http://url"),
>     new CrawlDatum(CrawlDatum.STATUS_LINKED, interval), queueMode);
> fetchQueues.addFetchItem(fit);
>
>> Hi,
>>
>> Being new to Nutch, I'm unsure how Nutch deals with fetching more
>> URLs for a site that is currently being fetched.
>>
>> e.g. if we inject -> generate -> fetch, then whilst fetching we want
>> to add more URLs for potentially the same sites currently being
>> fetched, what's in place to ensure that we continue to adhere to the
>> politeness policy? As I understand it, with the initial fetch all URLs
>> for the same site are partitioned into the same segment, though I'm
>> unsure what might maintain this for the second.
>>
>> Perhaps it's as simple as only allowing 1 fetch process / cluster at a
>> time, and then limiting the number of URLs / host to ensure the second
>> can start in a timely manner?
>>
>> Thanks for your help,
>> Dan

