Re: topN value in crawl

2009-08-20 Thread Marko Bauhardt


On Aug 19, 2009, at 8:42 PM, alx...@aim.com wrote:

hi




Thanks. What if the urls in my seed file do not have outlinks, let's
say .pdf files. Should I still specify the topN variable? All I need is
to index all the urls in my seed file, and there are about 1 M of them.


topN means that a generated shard (segment) contains at most the N
most popular urls from your crawldb that have not been fetched yet.

"popular urls" means the urls with the highest score.

You can set topN to -1. If you do this, you generate and fetch all
urls in one shard.

If you set topN=330,000, then you fetch 330,000 urls in one shard.
If you specify the depth parameter, then you generate depth shards
(one per round).

For example, -topN 330000 -depth 3
then generates/fetches/parses/indexes 3 shards, every shard containing
at most 330,000 urls, so ~990,000 urls in total.
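
(For reference, with the Nutch 1.x one-step crawl tool those options are
passed roughly like this; the seed directory "urls" and the output
directory "crawl" are just placeholder names, so treat this as a sketch
rather than the exact invocation for your setup:)

  # fetch at most 330,000 urls per round, for 3 rounds (~990,000 urls max)
  bin/nutch crawl urls -dir crawl -depth 3 -topN 330000

  # or, following the -1 note above (if your version accepts it),
  # generate and fetch everything pending in a single round
  bin/nutch crawl urls -dir crawl -depth 1 -topN -1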



marko



Re: topN value in crawl

2009-08-20 Thread alxsss

 In the tutorial on the wiki the depth is not specified and topN=1000. I ran
those commands yesterday and the crawl is still running. Will it index all my
urls? My seed file has about 20K urls.

Thanks.
Alex.

-Original Message-
From: Marko Bauhardt m...@101tec.com
To: nutch-user@lucene.apache.org
Sent: Thu, Aug 20, 2009 12:17 am
Subject: Re: topN value in crawl


topN value in crawl

2009-08-19 Thread alxsss

 Hi,

I have read a few tutorials on running Nutch to crawl the web. However, I
still do not understand the meaning of the topN variable in the crawl command.
In the tutorials it is suggested to create 3 segments and fetch them with
topN=1000. What if I create 100 segments, or only one? What would be the
difference? My goal is to index the urls I have in my seed file and nothing
more.

Thanks.
Alex.





Re: topN value in crawl

2009-08-19 Thread alxsss

 


 Thanks. What if the urls in my seed file do not have outlinks, let's say .pdf
files. Should I still specify the topN variable? All I need is to index all
the urls in my seed file, and there are about 1 M of them.

Alex.


 

-Original Message-
From: Kirby Bohling kirby.bohl...@gmail.com
To: nutch-user@lucene.apache.org
Sent: Wed, Aug 19, 2009 11:02 am
Subject: Re: topN value in crawl

On Wed, Aug 19, 2009 at 12:13 PM, alx...@aim.com wrote:

 Hi,

 I have read a few tutorials on running Nutch to crawl the web. However, I
 still do not understand the meaning of the topN variable in the crawl
 command. In the tutorials it is suggested to create 3 segments and fetch
 them with topN=1000. What if I create 100 segments, or only one? What would
 be the difference? My goal is to index the urls I have in my seed file and
 nothing more.


My understanding of topN is that it interacts with the depth to help
you keep crawling interesting areas.  Say you have a depth of 3 and a
topN of, let's say, 100 (just to keep the math easy); every page I go
to has 20 outlinks, and I have 10 pages listed in my seed list.

This is my understanding from reading the documentation and watching
what happens, not from reading the code, so I could be all wrong.
Hopefully someone will correct any details I have wrong:

depth 0:
10 pages fetched, 10 * 20 = 200 pending links to be fetched.

depth 1:
Because I have a topN of 100, of the 200 links I have, it will pick
the 100 most interesting (using whatever algorithm is configured, I
believe it is OPIC by default).

depth 2:
100 pages fetched, 100 + 100 * 20 = 2100 pages to fetch. (100
existing, 100 pages with 20 outlinks)

depth 3:
100 pages fetched, 2000 + 100 * 20 = 4000 pages to fetch. (2000
existing pages, 100 pages with 20 outlinks).

(NOTE: This analysis assumes all the links are unique, which is highly
unlikely).
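
(A quick back-of-the-envelope sketch of those rounds, just shell
arithmetic rather than anything Nutch-specific, assuming exactly 20
unique outlinks per page, 10 seeds and topN=100 as above; each loop pass
is one generate/fetch round, so the pending counts line up with the
200 / 2100 / 4000 figures, give or take how the rounds are labelled:)

  seeds=10; outlinks=20; topn=100; rounds=3

  fetched=$seeds
  pending=$(( seeds * outlinks ))
  echo "round 0: fetched=$fetched pending=$pending"

  for r in $(seq 1 $rounds); do
    # generate at most topN of the pending urls, fetch them,
    # then add their outlinks back onto the pending pile
    batch=$(( pending < topn ? pending : topn ))
    pending=$(( pending - batch + batch * outlinks ))
    fetched=$(( fetched + batch ))
    echo "round $r: fetched=$fetched pending=$pending"
  done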

I believe the point is to not force you to do a depth first search of
the web.  Note that the algorithm might still not have fetched all of
the pending links from depth 0 by depth 3 (or depth 100 for that
matter).  If they were deemed less interesting than other links, they
could sit in the queue effectively forever.

I view it as a latency vs. throughput trade-off: how much effort are
you willing to spend to always fetch _the most_ interesting page next?
Evaluating and managing the ordering of that list is expensive.  So you
queue the topN most interesting links you know about now and process
those without re-evaluating "interesting" as new information is
gathered that would change the ordering.

I also believe that topN * depth is an upper bound on the number of
pages you will fetch during a crawl.
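
(As a quick sanity check, that bound matches the figures from earlier in
the thread, e.g. in shell arithmetic:)

  echo $(( 330000 * 3 ))   # 990000 -- the ~990,000 urls from the -topN 330000 -depth 3 example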

However, take all this with a grain of salt.  I haven't read the code
closely; this was gleaned while tracking down why some pages I expected
to be fetched were not, reading the documentation, and modifying the
topN parameter to fix my issues.

Thanks,
   Kirby



 Thanks.
 Alex.