----- Original Message -----
From: "Berlin Brown" <[EMAIL PROTECTED]>
Sent: Sunday, June 10, 2007 4:19 AM
> I am using the tutorial below (with Nutch 0.9) to crawl the web. I went
> through the steps, downloaded dmoz, ran the parser, etc., etc.
>
> bin/nutch inject crawl/crawldb dmoz
> etc.
> etc.
> bin/nutch fetch $s1
>
> Once I get to this step, is there a way to "crawl" the sites that are
> in the dmoz/url list? It seems like we are just fetching the URLs that
> are straight out of the dmoz list. Let's say I want to crawl those and
> give a particular depth?
>
> http://lucene.apache.org/nutch/tutorial8.html

You have to complete the fetch phase and, after that, the updatedb phase,
so that the URLs found in the pages fetched into segment $s1 get inserted
into the crawldb:

bin/nutch updatedb crawl/crawldb $s1

The next "generate" phase will then prepare a new fetch list using both the
unfetched original URLs and the newly added ones. If, in that next cycle,
you want to crawl only the new URLs and not the ones injected from the dmoz
list, you can use a different crawldb:

bin/nutch updatedb crawl/crawldb_new $s1
bin/nutch generate crawl/crawldb_new crawl/segments -topN 1000
s3=`ls -d crawl/segments/2* | tail -1`

etc., but I see little advantage in that.

Enzo
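
For reference, crawling the injected URLs to a given depth just means
repeating that generate / fetch / updatedb cycle. A rough sketch follows;
the depth of 3, the -topN value, and the crawl/* directory names are only
illustrative (they follow the tutorial's layout), so adjust them to your
setup:

# run $depth rounds of generate / fetch / updatedb; each round picks up
# the URLs discovered in the previous one
depth=3
for i in `seq 1 $depth`; do
  bin/nutch generate crawl/crawldb crawl/segments -topN 1000
  s=`ls -d crawl/segments/2* | tail -1`
  bin/nutch fetch $s
  bin/nutch updatedb crawl/crawldb $s
done

# then build the link database and the index, as in the tutorial
bin/nutch invertlinks crawl/linkdb -dir crawl/segments
bin/nutch index crawl/indexes crawl/crawldb crawl/linkdb crawl/segments/*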
