I see,

I had thought of depth as the length of the "chain of links" to follow
from one site to other sites.
For example:
Let's say I have cnn.com as a URL in my root fetchlist,
and it has, e.g., a link to www.nbc.com, and www.nbc.com has a link
to www.news.com.

So if I had chosen depth 3, that would mean I would have crawled
www.news.com as well
(cnn --> nbc --> news). Do I understand now that I was mistaken?
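
To check my understanding, here is a toy Python sketch (not Nutch code) of
depth as a number of generate/fetch/updatedb rounds, as described in the
reply quoted below. The link graph and the crawl_rounds function are made up
for illustration, and the sketch ignores anything a real crawl might also
apply, such as URL filters or per-round limits.

```python
# Toy model of crawl "depth" as a number of generate/fetch/updatedb rounds.
# This is NOT Nutch code; the link graph and function name are hypothetical.

# Hypothetical link graph: page -> pages it links to.
LINKS = {
    "cnn.com": ["www.nbc.com"],
    "www.nbc.com": ["www.news.com"],
    "www.news.com": [],
}


def crawl_rounds(seeds, depth, links=LINKS):
    """Return the list of URLs fetched in each round, one list per round."""
    known = set(seeds)   # stands in for the crawl database
    fetched = set()      # URLs that have already been fetched
    rounds = []
    for _ in range(depth):
        # "generate": everything known but not yet fetched
        fetchlist = sorted(known - fetched)
        if not fetchlist:
            break  # nothing left to fetch, stop early
        # "fetch": mark the whole fetchlist as fetched
        fetched.update(fetchlist)
        # "updatedb": record the links discovered on the fetched pages
        for url in fetchlist:
            known.update(links.get(url, []))
        rounds.append(fetchlist)
    return rounds


if __name__ == "__main__":
    for i, urls in enumerate(crawl_rounds(["cnn.com"], depth=3), start=1):
        print(f"round {i}: {urls}")
```

In this toy model, depth 3 reaches www.news.com in the third round, while
depth 2 stops after www.nbc.com.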

So the only way to tell the crawler to keep "digging" inside a URL is via
the nutch-site.xml file, am I right?

thanks,

Eyal.

On 8/30/07, Gal Nitzan <[EMAIL PROTECTED]> wrote:
>
> Hey Eyal,
>
> Actually, in the mode you call "command mode" there is no depth value.
>
> To be more specific, the depth value is not "folder depth"; it means the
> number of times the crawler runs, starting from the seeds you gave it. So,
> for example, if you put one URL, www.sample.com, into your seeds and set
> "depth" to 3 in crawl mode, then the crawler would run 3 times, each time
> crawling the URLs found during the previous run. In the last stage, after
> the crawling is done, the data would be indexed.
>
> So, to achieve this in "command mode", you would need to write a small bash
> script that copies that behavior, which is:
>
> For each level of depth:
>     NewSegment = nutch generate   # generate the list of URLs to fetch
>     nutch fetch NewSegment        # fetch the list of URLs
>     nutch updatedb NewSegment     # update the status of crawled links and
>                                   # add newly found links
> Next
>
> HTH,
>
> Gal Nitzan.
>
>
>
> > -----Original Message-----
> > From: eyal edri [mailto:[EMAIL PROTECTED]
> > Sent: Thursday, August 30, 2007 10:49 AM
> > To: nutch-agent@lucene.apache.org
> > Subject: depth arg in non crawl mode (fetch)
> >
> > Hello,
> >
> > I'm testing Nutch 0.9 in the "Whole-Web" approach, where I use a set of
> > commands to run the engine instead of just running "crawl",
> > i.e. nutch inject
> >      nutch generate
> >      nutch fetch
> >      nutch updatedb... and so on.
> >
> > My question is: where can I define the depth arg (the same one that appears
> > in crawl mode) in the broken-down ('whole web') mode?
> >
> > thanks,
> >
> >
> > --
> > Eyal Edri
>
>
>


-- 
Eyal Edri
