I'm not positive about this, but I think this is what will happen: you give it 300 urls and tell it to go to a depth of 3 with topN 300. It generates a fetchlist of the 300 urls you gave it and comes back with 1000 new ones. Those 1000 new urls are inserted into the crawl database. For depth 2 it then selects the top 300 of those 1000 new urls, because none of the urls from the first depth are due for refetching yet. After fetching those 300, it has, say, another 1000 new urls, which are also inserted into the crawl database. The crawl database now contains 2300 urls: 300 from the first list, which have all been crawled; 300 from the 2nd list, which have been crawled; 700 from the 2nd list which have NOT been crawled; and 1000 from the 3rd list which have not been crawled. For depth 3 it again just selects the top 300 urls from the database. Since 700 of the urls from the 2nd depth have not been fetched yet, there's no way to know how many urls in the new fetchlist come from the 1000 depth 3 urls and how many from the 700 leftover depth 2 urls.
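
If it helps, here is a toy model of that accounting. This is not Nutch code, just the arithmetic from the example above; the 1000-new-urls-per-round discovery counts are the made-up numbers I used:

    // Toy model of the generate/fetch/updatedb accounting described above.
    // NOT Nutch code; the per-round discovery counts are assumed.
    public class TopNModel {
        public static void main(String[] args) {
            int topN = 300;
            int fetched = 0;
            int unfetched = 300;                    // the 300 injected seed urls
            int[] discovered = {1000, 1000, 1000};  // assumed new urls found per round

            for (int depth = 1; depth <= 3; depth++) {
                int batch = Math.min(topN, unfetched);  // generate: top unfetched urls
                fetched += batch;                       // fetch + updatedb marks them done
                unfetched -= batch;
                unfetched += discovered[depth - 1];     // updatedb inserts new outlinks
                System.out.printf("after depth %d: %d fetched, %d unfetched, %d total%n",
                        depth, fetched, unfetched, fetched + unfetched);
            }
            // After depth 2 the unfetched pool mixes the 700 urls not picked in
            // round 2 with the 1000 urls found during round 2, so the depth 3
            // batch is drawn from both; nothing separates them by depth.
        }
    }

Running it prints 2300 total urls after depth 2, matching the breakdown above.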
Again, I may be wrong about this since I haven't tried it myself, but I believe that's how it works. (See also the loop sketch after the quoted thread below.)

On 6/13/07, rashmin babaria <[EMAIL PROTECTED]> wrote:
> Still I have one confusion.
>
> If I set topN to 300, and suppose after one round (depth 1) the crawldb
> contains 1000 unfetched links which point to depth 2 pages, then for the
> second round the generator will select 300 links out of those 1000. Now
> suppose updatedb inserts 500 more urls, which point to depth 3 pages.
> For the third round the generator will then select 300 urls from the 700
> remaining depth 2 urls + the 500 depth 3 urls. Am I right? If so, how is
> it ensured that all 300 urls selected for the third round come from the
> 500 depth 3 urls?
>
> On 6/13/07, Tim Gautier <[EMAIL PROTECTED]> wrote:
> >
> > The tutorial is correct, it just uses a different definition of depth
> > than the one you are using. :)
> >
> > The depth is essentially the number of links that must be followed
> > before reaching a certain page. For instance:
> >
> > If you start with http://www.blabla.com/home.html, that page has a
> > depth of 1. If that page then contains a link to
> > http://www.blabla.com/a/b/c/d/e/a.html, then
> > http://www.blabla.com/a/b/c/d/e/a.html has a depth of 2.
> >
> > Remember, you're talking about a web here. Each page is a node in the
> > web. The first node is at a depth of 1. Following its links leads you
> > to nodes at a depth of 2. Following the links of those nodes takes
> > you to nodes at a depth of 3.
> >
> > On 6/12/07, Manoharam Reddy <[EMAIL PROTECTED]> wrote:
> > > The tutorial says that the depth value is the level of depth of a
> > > page from the root of a website. So, as per the tutorial, if I want
> > > to fetch a page such as http://www.blabla.com/a/b/c/d/e/a.html, I
> > > must set the value of depth >= 6.
> > >
> > > But I find in the source code that depth is simply a for loop: it
> > > runs the fetch cycle as many times as the depth value specifies, so
> > > it has no connection with the depth of a page from the root of the
> > > site.
> > >
> > > Please confirm whether my understanding is right, and if so,
> > > shouldn't the tutorial be corrected in order to prevent noobs like
> > > me from being misled?
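
On Manoharam's last point: that matches my reading too; depth is just the bound on the generate/fetch/update loop. Something like this, paraphrased from memory rather than copied from Crawl.java, so the tool names and signatures here are approximations:

    // Paraphrase of the shape of the driver loop in Crawl.java; NOT the
    // verbatim source. Names and signatures are approximations.
    for (int i = 0; i < depth; i++) {
        Path segment = generator.generate(crawlDb, segmentsDir, topN); // build fetchlist
        if (segment == null)
            break;                           // nothing is due for fetching, stop early
        fetcher.fetch(segment, threads);     // fetch the generated batch
        parser.parse(segment);               // extract outlinks from fetched pages
        updater.update(crawlDb, segment);    // merge statuses and new urls into crawldb
    }

So nothing in the loop tracks which "depth" a url was discovered at; each round just asks the crawldb for the best topN urls that are currently due.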
