RE: The "topN" parameter in nutch crawl

Markus Jelsma Thu, 29 Nov 2012 13:58:55 -0800

Nutch does none of these. If scoring is used, the records to fetch are ordered
by score; if there is no score, they are simply sorted alphabetically. With
some tuning of a scoring filter you can do whatever you want, but in the end
everything is going to be crawled (if there are enough resources).
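
To make the selection described above concrete, here is a minimal sketch. It
is not Nutch's actual generator code: the Record class and the generate and
crawl_round functions are hypothetical names made up for illustration, and
"fetching" is reduced to setting a flag. It only shows the idea of picking up
to N pending records per invocation, highest score first, with ties falling
back to alphabetical URL order.

```python
from dataclasses import dataclass


@dataclass
class Record:
    # Hypothetical stand-in for a crawl database entry.
    url: str
    score: float = 0.0
    fetched: bool = False


def generate(crawl_db, top_n):
    """Select up to top_n unfetched records, best score first, ties by URL."""
    pending = [r for r in crawl_db if not r.fetched]
    pending.sort(key=lambda r: (-r.score, r.url))
    return pending[:top_n]


def crawl_round(crawl_db, top_n):
    """One invocation: generate a batch, then mark it fetched (no real I/O)."""
    batch = generate(crawl_db, top_n)
    for record in batch:
        record.fetched = True
    return [record.url for record in batch]


crawl_db = [
    Record("http://example.com/c", 0.2),
    Record("http://example.com/a", 0.9),
    Record("http://example.com/b", 0.2),
    Record("http://example.com/d", 0.5),
]

first = crawl_round(crawl_db, top_n=2)   # the two highest-scored records
second = crawl_round(crawl_db, top_n=2)  # equal scores, so URL order decides
third = crawl_round(crawl_db, top_n=2)   # nothing left to fetch
```

A real crawl also discovers new URLs in every round, which this sketch leaves
out; with a fixed set of records, repeated invocations simply drain the list.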


What are you trying to do? If you're not going to process many millions of 
records it doesn't really matter because all records will be fetched within a 
reasonable amount of time. 
 
-----Original message-----
> From:Joe Zhang <smartag...@gmail.com>
> Sent: Thu 29-Nov-2012 22:45
> To: user@nutch.apache.org
> Subject: Re: The "topN" parameter in nutch crawl
> 
> How would you characterize the crawling algorithm? Depth-first,
> breadth-first, or some heuristic-based?
> 
> On Thu, Nov 29, 2012 at 2:10 PM, Markus Jelsma
> <markus.jel...@openindex.io> wrote:
> 
> > Hi,
> >
> > None of the three. The topN parameter simply means that the generator will
> > select up to N records to fetch each time it is invoked. It's best to
> > forget the notion of depth in crawling; it has little meaning in most
> > cases. Usually one will just continuously crawl until there are no more
> > records to fetch.
> >
> > We continuously invoke the crawler and tell it to do something. If there's
> > nothing to do (but that never happens) we just invoke it again the next
> > time.
> >
> > Cheers,
> >
> >
> > -----Original message-----
> > > From:Joe Zhang <smartag...@gmail.com>
> > > Sent: Thu 29-Nov-2012 21:58
> > > To: user <user@nutch.apache.org>
> > > Subject: The "topN" parameter in nutch crawl
> > >
> > > Dear list,
> > >
> > > > This parameter is causing me some confusion. To me, there are at least
> > > > 3 possible meanings for "topN":
> > >
> > > 1. The branching factor at a given node
> > > 2. *"*the maximum number of pages that will be retrieved at each level up
> > > to the depth" (from the wiki), which seems to refer to the total of
> > > branching factors at any given level
> > > 3. The size of the entire frontier/queue
> > >
> > > To me, (1) makes the most sense, and (3) is the easiest to implement
> > > programming-wise.
> > >
> > > If (2) is the actual implementation in nutch, it means the effective
> > > branching factor would be lower at deeper levels, correct?
> > >
> > > In this sense, in order to conduct a "comprehensive" crawl, if we have to
> > > trade off between "depth" and "topN", we should probably favor larger
> > > "topN"? In other words, "-depth 5 -topN 1000" would make more sense than
> > > "-depth 10 -topN 100" for a comprehensive crawl, correct?
> > >
> > > Thanks!
> > >
> >
>
