Re: The "topN" parameter in nutch crawl

Joe Zhang Thu, 29 Nov 2012 13:39:31 -0800

How would you characterize the crawling algorithm? Depth-first,
breadth-first, or something heuristic-based?


On Thu, Nov 29, 2012 at 2:10 PM, Markus Jelsma
<markus.jel...@openindex.io> wrote:

> Hi,
>
> None of the three. The topN parameter simply means that the generator will
> select up to N records to fetch each time it is invoked. It's best to
> forget the notion of depth in crawling; it has little meaning in most
> cases. Usually one will just continuously crawl until there are no more
> records to fetch.
>
> We continuously invoke the crawler and tell it to do something. If there's
> nothing to do (but that never happens) we just invoke it again the next
> time.
>
> Cheers,
>
>
> -----Original message-----
> > From: Joe Zhang <smartag...@gmail.com>
> > Sent: Thu 29-Nov-2012 21:58
> > To: user <user@nutch.apache.org>
> > Subject: The "topN" parameter in nutch crawl
> >
> > Dear list,
> >
> > This parameter is causing me some confusion. To me, there are at least 3
> > possible meanings for "topN":
> >
> > 1. The branching factor at a given node
> > 2. "the maximum number of pages that will be retrieved at each level up
> > to the depth" (from the wiki), which seems to refer to the total of
> > branching factors at any given level
> > 3. The size of the entire frontier/queue
> >
> > To me, (1) makes the most sense, and (3) is the easiest to implement
> > programming-wise.
> >
> > If (2) is the actual implementation in nutch, it means the effective
> > branching factor would be lower at deeper levels, correct?
> >
> > In this sense, in order to conduct a "comprehensive" crawl, if we have to
> > trade off between "depth" and "topN", we should probably favor larger
> > "topN"? In other words, "-depth 5 -topN 1000" would make more sense than
> > "-depth 10 -topN 100" for a comprehensive crawl, correct?
> >
> > Thanks!
> >
>
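
To make the reply above concrete, here is a minimal sketch of a generate/fetch/update loop in which each round selects at most top_n unfetched records, as described in the thread. It is a toy model, not Nutch code: the names crawl and ternary_links, the dict-based crawl database, and the synthetic link graph are all assumptions made for illustration, and the batch is taken in insertion order, which is an arbitrary choice for the sketch.

```python
def ternary_links(page):
    """Hypothetical link graph: every page links to three new pages."""
    return [3 * page + 1, 3 * page + 2, 3 * page + 3]


def crawl(seeds, outlinks, top_n, rounds):
    """Toy generate/fetch/update loop; returns pages in fetch order."""
    # Stand-in for the crawl database: url -> status.
    crawldb = {url: "unfetched" for url in seeds}
    fetched = []
    for _ in range(rounds):
        # Generate: select up to top_n unfetched records (insertion order).
        batch = [u for u, s in crawldb.items() if s == "unfetched"][:top_n]
        if not batch:
            break  # nothing left to fetch
        # Fetch and update: mark pages fetched, add newly seen links.
        for url in batch:
            crawldb[url] = "fetched"
            fetched.append(url)
            for link in outlinks(url):
                crawldb.setdefault(link, "unfetched")
    return fetched


if __name__ == "__main__":
    print(crawl([0], ternary_links, top_n=2, rounds=3))
    print(len(crawl([0], ternary_links, top_n=1000, rounds=5)))
    print(len(crawl([0], ternary_links, top_n=100, rounds=10)))
```

In this model a run can never fetch more than rounds * top_n pages, so "-depth 5 -topN 1000" has a ceiling of 5000 and "-depth 10 -topN 100" a ceiling of 1000. How close a run gets to its ceiling depends on how quickly the pool of unfetched records grows: on this particular toy graph, starting from a single seed, the first setting fetches 121 pages and the second 621, because in the early rounds there are fewer than top_n records to select.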
