How would you characterize the crawling algorithm? Depth-first, breadth-first, or heuristic-based?
On Thu, Nov 29, 2012 at 2:10 PM, Markus Jelsma <markus.jel...@openindex.io> wrote:

> Hi,
>
> None of the three. The topN parameter simply means that the generator will
> select up to N records to fetch each time it is invoked. It's best to
> forget the notion of depth in crawling; it has little meaning in most
> cases. Usually one just crawls continuously until there are no more
> records to fetch.
>
> We continuously invoke the crawler and tell it to do something. If there's
> nothing to do (but that never happens) we just invoke it again the next
> time.
>
> Cheers,
>
> -----Original message-----
> > From: Joe Zhang <smartag...@gmail.com>
> > Sent: Thu 29-Nov-2012 21:58
> > To: user <user@nutch.apache.org>
> > Subject: The "topN" parameter in nutch crawl
> >
> > Dear list,
> >
> > This parameter is causing me some confusion. To me, there are at least
> > three possible meanings for "topN":
> >
> > 1. The branching factor at a given node.
> > 2. "The maximum number of pages that will be retrieved at each level up
> > to the depth" (from the wiki), which seems to refer to the total of the
> > branching factors at any given level.
> > 3. The size of the entire frontier/queue.
> >
> > To me, (1) makes the most sense, and (3) is the easiest to implement
> > programming-wise.
> >
> > If (2) is the actual implementation in Nutch, it means the effective
> > branching factor would be lower at deeper levels, correct?
> >
> > In this sense, in order to conduct a "comprehensive" crawl, if we have to
> > trade off between "depth" and "topN", we should probably favor a larger
> > "topN"? In other words, "-depth 5 -topN 1000" would make more sense than
> > "-depth 10 -topN 100" for a comprehensive crawl, correct?
> >
> > Thanks!
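The depth/topN trade-off raised in the quoted question can be checked with a toy model. This is not Nutch code: it is a minimal sketch of interpretation (2), assuming a single seed URL and a uniform, made-up branching factor of 10 per fetched page, purely to see which parameter choice reaches more pages.

```python
# Toy simulation of interpretation (2) of topN: at each level the crawler
# fetches at most topN pages from the frontier, and every fetched page
# yields `branching` new links. All numbers here are illustrative
# assumptions, not Nutch behavior.
def pages_fetched(depth, topn, branching=10, seeds=1):
    frontier = seeds
    total = 0
    for _ in range(depth):
        fetched = min(frontier, topn)  # topN caps each level's fetch
        total += fetched
        frontier = fetched * branching  # links discovered for the next level
    return total

print(pages_fetched(5, 1000))   # 2111  ("-depth 5 -topN 1000")
print(pages_fetched(10, 100))   # 811   ("-depth 10 -topN 100")
```

Under these assumptions the larger topN does fetch more pages overall, and it also shows the effect the question asks about: once the frontier outgrows topN, the effective branching factor at deeper levels is capped at 1.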