Thanks. What if the URLs in my seed file have no outlinks, say .pdf files.
Should I still specify the topN variable? All I need is to index every URL in
my seed file, and there are about 1 million of them.

Alex.


 

-----Original Message-----
From: Kirby Bohling <kirby.bohl...@gmail.com>
To: nutch-user@lucene.apache.org
Sent: Wed, Aug 19, 2009 11:02 am
Subject: Re: topN value in crawl

On Wed, Aug 19, 2009 at 12:13 PM, <alx...@aim.com> wrote:
>
> Hi,
>
> I have read a few tutorials on running Nutch to crawl the web. However, I
> still do not understand the meaning of the topN variable in the crawl
> command. The tutorials suggest creating 3 segments and fetching them with
> topN=1000. What if I create 100 segments, or only one? What would the
> difference be? My goal is to index the URLs I have in my seed file and
> nothing more.
>

My understanding of "topN" is that it interacts with the depth to help
you keep crawling the "interesting" areas.  Say you have a depth of 3
and a topN of, let's say, 100 (just to keep the math easy).  Every page
I fetch has 20 outlinks, and I have 10 pages listed in my seed list.

This is my understanding from reading the documentation and watching
what happens, not from reading the code, so I could be completely wrong.
Hopefully someone will correct any details I have wrong:

depth 0:
10 pages fetched, 10 * 20 = 200 pending links to be fetched.

depth 1:
Because I have a "topN" of 100, of the 200 pending links it will pick
the "100" most interesting to fetch next (using whatever scoring
algorithm is configured; I believe it is OPIC by default).

depth 2:
100 pages fetched, 100 + 100 * 20 = 2100 pending links. (100 links still
in the queue, plus the 100 newly fetched pages with 20 outlinks each.)

depth 3:
100 pages fetched, 2000 + 100 * 20 = 4000 pending links. (2000 links
still in the queue, plus the 100 newly fetched pages with 20 outlinks
each.)

(NOTE: This analysis assumes all the links are unique, which is highly
unlikely.)
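
If it helps to play with that arithmetic, here is a tiny
back-of-the-envelope simulation (plain Python, nothing Nutch-specific;
the 10 seeds, 20 outlinks per page, topN = 100, and "all links unique"
numbers are the same hypothetical ones as above).  It just reproduces
the 200 / 2100 / 4000 pending counts, one generate/fetch/update round
per loop iteration (so the round labels don't line up one-to-one with
the depth labels above):

    TOPN = 100
    OUTLINKS_PER_PAGE = 20

    pending = 10        # the seed list
    fetched_total = 0

    for rnd in range(3):
        # "generate": take at most topN of the pending links
        batch = min(pending, TOPN)
        pending -= batch
        # "fetch" them, then "update": each fetched page adds 20 new links
        fetched_total += batch
        pending += batch * OUTLINKS_PER_PAGE
        print("round %d: fetched %d, pending %d" % (rnd, batch, pending))

    print("total fetched: %d" % fetched_total)   # 210 here

(The total fetched stays under topN * number-of-rounds, which is why I
think of topN * depth as an upper bound, as mentioned below.)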

I believe the point is that you are not forced to do a full
breadth-first crawl of everything you have discovered.  Note that the
algorithm might still not have fetched all of the pending links from
depth 0 by depth 3 (or depth 100, for that matter).  If they were
deemed less interesting than other links, they could sit in the queue
effectively forever.

I view it as a latency vs. throughput trade-off: how much effort are
you willing to spend to always fetch _the most_ interesting page next?
Evaluating and managing the ordering of that list is expensive.  So you
queue the "topN" most interesting links you know about now, and process
that batch without re-evaluating "interesting" as new information is
gathered that would change the ordering.
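
As a rough sketch of that "queue the topN you know about now" idea
(this is just my mental model, not Nutch's actual code, and the URLs
and scores below are made up), the selection is basically a top-N cut
over the scores known at generate time, and the resulting segment is
then fixed:

    import heapq

    # made-up scores for the currently pending links
    pending_scores = {
        "http://example.com/a": 0.9,
        "http://example.com/b": 0.4,
        "http://example.com/c": 0.7,
        "http://example.com/d": 0.1,
    }

    TOPN = 2

    # "generate": freeze the topN highest-scoring links into a segment
    segment = heapq.nlargest(TOPN, pending_scores, key=pending_scores.get)
    print(segment)   # ['http://example.com/a', 'http://example.com/c']

    # later score changes to the other pending links don't reorder this
    # segment; it gets fetched as-is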

I also believe that "topN * depth" is an upper bound on the number of
pages you will fetch during a crawl.

However, take all this with a grain of salt.  I haven't read the code
closely; this was gleaned from tracking down why some pages I expected
to be fetched were not, reading the documentation, and adjusting the
topN parameter to fix my issues.

Thanks,
   Kirby



> Thanks.
> Alex.
>