Rob, that helps a lot. Is this related to limiting urls/hosts in any way? https://issues.apache.org/jira/browse/NUTCH-272

Thanks,
Krish

On Wed, Oct 27, 2010 at 5:14 PM, Rob Hunter <[email protected]> wrote:
> Krish,
>
> I think what you're looking for is a depth of 2 - I believe a depth of
> 1 will only return foo.bar. Also, due to your depth change, I think you
> can reduce your topN to 50k. I'm unsure whether your results will be
> evenly distributed across your domains; hopefully someone else has an
> answer for that.
>
> -- Rob
>
> -----Original Message-----
> From: Krish Pan [mailto:[email protected]]
> Sent: Wednesday, October 27, 2010 2:29 PM
> To: [email protected]
> Subject: downloading exact number of pages from list of seed urls
>
> Hi,
>
> I am trying to use Nutch to download an exact number (say 5) of HTML
> pages from each seed page I provide.
>
> I was wondering if this is the right approach:
>
> Seed URLs = 10,000 total
>
> bin/nutch crawl urls/<list of domains> -dir <out> -depth 1 -topN 60000
>
> Here depth = 1 because I just want pages from the first level only,
>
> i.e. if the domain is foo.bar I want to download
>
> foo.bar/spam.htm
> foo.bar/ham.htm
> foo.bar/eggs.htm
>
> but NOT
> foo.bar/ham/spam.htm
>
> And -topN is 60,000 because there are 10,000 seed urls:
> 10,000 home pages plus 5 top pages per url.
>
> Any suggestions?
>
> Thanks,
> krish
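For reference, applying Rob's suggestion to Krish's original command would give an invocation roughly like the one below. This is only a sketch of the Nutch 1.x one-shot crawl command: the `urls` seed directory and `crawl-out` output directory are placeholder names standing in for Krish's actual paths.

```shell
# Sketch of Rob's suggested change (placeholder paths):
# -depth 2  -> fetch the seed pages AND their first-level outlinks
# -topN 50000 -> cap the second round at ~5 pages per seed (10,000 seeds x 5)
bin/nutch crawl urls -dir crawl-out -depth 2 -topN 50000
```

Note that `-topN` limits each fetch round globally, not per host, which is why Rob hedges on whether the 50k pages will be evenly distributed across the 10,000 domains.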

