Re: Re-crawling and multiple fetchers

Markus Jelsma Tue, 31 Jan 2012 01:09:16 -0800

Ah, I see. You do not want to run generate twice in one cycle without doing an 
update. You can use the freegen tool to generate a segment from seed files.


You must maintain a clean cycle of generate, fetch, update. There's an option 
to rebuild the DB after generate, but it's another costly job and I would not 
encourage using it.
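
To illustrate why the update step matters, here is a minimal toy model in 
Java. It is only a sketch: the class CrawlCycleSketch, its methods and the 
hosta.example URLs are all made up for illustration and are not Nutch 
classes or APIs. It assumes that generate only reads the DB, so without an 
update in between, a second generate selects the URLs of the first again.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Toy model only: none of these names are Nutch classes or APIs.
final class CrawlCycleSketch {
    enum Status { UNFETCHED, FETCHED }

    // Stand-in for the crawl DB: URL -> status, kept in insertion order.
    private final Map<String, Status> db = new LinkedHashMap<>();

    // "inject": add a URL unless it is already known.
    void inject(String url) {
        db.putIfAbsent(url, Status.UNFETCHED);
    }

    // "generate": select up to topN unfetched URLs. The DB is not modified,
    // so it keeps no record of what has already been selected.
    List<String> generate(int topN) {
        List<String> segment = new ArrayList<>();
        for (Map.Entry<String, Status> e : db.entrySet()) {
            if (segment.size() >= topN) break;
            if (e.getValue() == Status.UNFETCHED) segment.add(e.getKey());
        }
        return segment;
    }

    // "update": record the fetched segment in the DB.
    void update(List<String> segment) {
        for (String url : segment) db.put(url, Status.FETCHED);
    }

    public static void main(String[] args) {
        CrawlCycleSketch crawl = new CrawlCycleSketch();
        crawl.inject("http://hosta.example/1");
        crawl.inject("http://hosta.example/2");

        List<String> first = crawl.generate(10);
        crawl.inject("http://hosta.example/3");
        // No update in between: the first segment is selected again.
        List<String> second = crawl.generate(10);
        System.out.println("second contains first: " + second.containsAll(first));

        // After an update, only the newly injected URL is left to select.
        crawl.update(first);
        System.out.println("after update: " + crawl.generate(10));
    }
}
```

In this model the second segment overlaps the first completely, so two 
fetch jobs would hit HostA with the same URLs. Running update between the 
two generate calls removes the overlap.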

On Tuesday 31 January 2012 09:43:46 dan sutton wrote:
> Hi Markus,
> 
> I was thinking what happens if we launch one fetch job fetching URL's
> from HostA (amongst others), then later we run another, separate,
> fetch job for URL's from HostA.
> 
> i.e.
> 
> -add URL's including HostA
> -inject -> generate -> fetch
> -add more URL's including HostA
> -inject -> generate -> fetch
> 
> Is there anything in place to enforce the politeness policy in this case?
> Do we wait for the first to finish, and ensure this happens in a
> timely manner (e.g. topN URL's)?
> 
> Many thanks for your advice,
> Dan
> 
> 
> On Tue, Jan 31, 2012 at 7:33 AM, Markus Jelsma
> 
> <[email protected]> wrote:
> > This is during fetching? Just create a new FetchItem and add it to the
> > queue:
> > 
> > FetchItem fit = FetchItem.create(new Text("http://url"), new
> > CrawlDatum(CrawlDatum.STATUS_LINKED, interval), queueMode);
> > fetchQueues.addFetchItem(fit);
> > 
> >> Hi,
> >> 
> >> Being new to Nutch, I'm unsure how Nutch deals with fetching more
> >> URL's for a site that is currently being fetched.
> >> 
> >> e.g. if we inject -> generate -> fetch, then whilst fetching we want
> >> to add more URL's for potentially the same sites currently being
> >> fetched, what's in place to ensure that we continue to adhere to the
> >> politeness policy. As I understand with the initial fetch, all URL's
> >> for the same site are partitioned into the same segment, though I'm
> >> unsure what might maintain this for the second.
> >> 
> >> Perhaps it's as simple as only allowing 1 fetch process / cluster at a
> >> time, and then limit the number of URL's / host to ensure the second
> >> can start in a timely manner?
> >> 
> >> Thanks for your help,
> >> Dan

-- 
Markus Jelsma - CTO - Openindex
