In the tutorial on the wiki the depth is not specified and topN=1000. I ran those commands yesterday and the crawl is still running. Will it index all my urls? My seed file has about 20K urls.
Thanks.
Alex.

-----Original Message-----
From: Marko Bauhardt <m...@101tec.com>
To: nutch-user@lucene.apache.org
Sent: Thu, Aug 20, 2009 12:17 am
Subject: Re: topN value in crawl

On Aug 19, 2009, at 8:42 PM, alx...@aim.com wrote:

> hi
>
> Thanks. What if urls in my seed file do not have outlinks, let say .pdf
> files. Should I still specify topN variable? All I need is to index all
> urls in my seed file. And they are about 1 M.

topN means that your generated shards (segments) contain max. N popular urls from your crawldb which are not fetched.
popular urls means urls with highest score.

You can set the topN to "-1". if you do this then you generate and fetch all urls in one shard.
if you set topN=330.000 then you fetch 330.000 Urls in one shard.
if you specify the depth parameter then you generate depth shards

for example -topN=330.000 -depth=3
then you generate/fetch/parse/index 3 shards, every shard contains max. 330.000 urls, ~990.000 urls.

marko
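
The arithmetic in Marko's explanation can be sketched in a few lines of Python. This is only an illustration, not Nutch code: the function names `max_urls_fetched` and `rounds_needed` are made up here, and the sketch assumes that the crawldb does not grow (the seed urls have no outlinks, as in the .pdf case) and that every generated url is actually fetched.

```python
import math


def max_urls_fetched(top_n, depth, crawldb_size):
    """Upper bound on urls fetched after `depth` generate/fetch rounds.

    Hypothetical helper: assumes no new urls are discovered and
    that depth is at least 1.
    """
    if top_n < 0:
        # topN = -1: all unfetched urls go into a single shard
        return crawldb_size
    # Each round produces one shard holding at most top_n urls
    return min(top_n * depth, crawldb_size)


def rounds_needed(top_n, crawldb_size):
    """Number of shards (rounds) needed to cover the whole crawldb."""
    if top_n < 0:
        return 1
    return math.ceil(crawldb_size / top_n)


# Marko's example: topN=330.000, depth=3, about 1 M seed urls
print(max_urls_fetched(330_000, 3, 1_000_000))  # 990000
# Alex's case: topN=1000 with about 20K seed urls
print(rounds_needed(1000, 20_000))  # 20
```

Under these assumptions, a seed list of about 20K urls with topN=1000 is only covered completely if the depth is at least 20, or if topN is set to -1.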