Ah, I see. You do not want to run generate twice in one cycle without doing an update in between. You can use the freegen tool to generate segments from seed files.
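For illustration, one cycle could be sketched as the shell functions below. This is only a rough sketch, assuming a Nutch 1.x install with the bin/nutch launcher. The crawl/crawldb and crawl/segments paths, the -topN value, the separate parse step and the function names are placeholders of mine, not something taken from your setup.

```shell
# Sketch only: NUTCH, CRAWLDB and SEGMENTS are assumed locations, adjust to your install.
NUTCH="${NUTCH:-bin/nutch}"
CRAWLDB="${CRAWLDB:-crawl/crawldb}"
SEGMENTS="${SEGMENTS:-crawl/segments}"

# Print the newest segment directory (segment names are timestamps, so they sort by age).
newest_segment() {
  ls -d "$SEGMENTS"/*/ 2>/dev/null | sort | tail -n 1 | sed 's:/$::'
}

# One cycle: generate -> fetch -> parse -> updatedb.
# The next generate should only run after updatedb has finished.
crawl_cycle() {
  "$NUTCH" generate "$CRAWLDB" "$SEGMENTS" -topN 1000 || return 1
  cycle_segment="$(newest_segment)"
  [ -n "$cycle_segment" ] || return 1
  "$NUTCH" fetch "$cycle_segment" || return 1
  # Assumes the fetcher is not configured to parse while fetching.
  "$NUTCH" parse "$cycle_segment" || return 1
  "$NUTCH" updatedb "$CRAWLDB" "$cycle_segment"
}

# Ad-hoc URLs: build a segment straight from seed files with freegen,
# without going through generate, then fetch it.
freegen_fetch() {
  "$NUTCH" freegen "$1" "$SEGMENTS" || return 1
  freegen_segment="$(newest_segment)"
  [ -n "$freegen_segment" ] || return 1
  "$NUTCH" fetch "$freegen_segment"
}
```

You would call crawl_cycle once per round, and something like freegen_fetch urls/extra (a hypothetical seed directory) for URLs added in between.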
You must maintain a clean cycle of generate, fetch, update. There is an option to rebuild the DB after generate, but it is another costly job and I would not encourage using it.

On Tuesday 31 January 2012 09:43:46 dan sutton wrote:
> Hi Markus,
>
> I was thinking what happens if we launch one fetch job fetching URLs
> from HostA (amongst others), then later we run another, separate,
> fetch job for URLs from HostA.
>
> i.e.
>
> -add URLs including HostA
> -inject -> generate -> fetch
> -add more URLs including HostA
> -inject -> generate -> fetch
>
> Is there anything in place to enforce the politeness policy in this case?
> Do we wait for the first to finish, and ensure this happens in a
> timely manner (e.g. topN URLs)?
>
> Many thanks for your advice,
> Dan
>
>
> On Tue, Jan 31, 2012 at 7:33 AM, Markus Jelsma
> <[email protected]> wrote:
> > This is during fetching? Just create a new FetchItem and add it to the
> > queue:
> >
> > FetchItem fit = FetchItem.create(new Text("http://url"), new
> > CrawlDatum(CrawlDatum.STATUS_LINKED, interval), queueMode);
> > fetchQueues.addFetchItem(fit);
> >
> >> Hi,
> >>
> >> Being new to Nutch, I'm unsure how Nutch deals with fetching more
> >> URLs for a site that is currently being fetched.
> >>
> >> e.g. if we inject -> generate -> fetch, then whilst fetching we want
> >> to add more URLs for potentially the same sites currently being
> >> fetched, what's in place to ensure that we continue to adhere to the
> >> politeness policy? As I understand with the initial fetch, all URLs
> >> for the same site are partitioned into the same segment, though I'm
> >> unsure what might maintain this for the second.
> >>
> >> Perhaps it's as simple as only allowing 1 fetch process / cluster at a
> >> time, and then limiting the number of URLs / host to ensure the second
> >> can start in a timely manner?
> >>
> >> Thanks for your help,
> >> Dan

--
Markus Jelsma - CTO - Openindex

