----- Original Message -----
From: "Berlin Brown" <[EMAIL PROTECTED]>
Sent: Sunday, June 10, 2007 4:19 AM
> I am using the tutorial below (with Nutch 0.9) to crawl the web. I went
> through the steps, downloaded dmoz, ran the parser, etc., etc.
>
> bin/nutch inject crawl/crawldb dmoz
> etc.
> etc.
> bin/nutch fetch $s1
>
> Once I get to this step, is there a way to "crawl" the sites that are
> in the dmoz/url list? It seems like we are just fetching the URLs that
> are straight out of the dmoz list. Let's say I want to crawl those and
> give a particular depth?
>
> http://lucene.apache.org/nutch/tutorial8.html

You have to complete the fetch phase and, after that, the updatedb phase,
so that the URLs found in the pages fetched into segment $s1 get inserted
into the crawldb:

bin/nutch updatedb crawl/crawldb $s1

The next "generate" phase will then prepare a new fetch list using both the
unfetched original URLs and the newly added ones. If, in that next cycle,
you want to crawl only the new URLs and not the ones injected from the dmoz
list, you can use a different crawldb:

bin/nutch updatedb crawl/crawldb_new $s1
bin/nutch generate crawl/crawldb_new crawl/segments -topN 1000
s3=`ls -d crawl/segments/2* | tail -1`

etc., but I see little advantage in that.

Enzo
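
For reference, crawling the injected URLs to a given depth just means
repeating that generate / fetch / updatedb cycle. A rough sketch follows;
the depth of 3, the -topN value, and the crawl/* directory names are only
illustrative (they follow the tutorial's layout), so adjust them to your
setup:

# run $depth rounds of generate / fetch / updatedb; each round picks up
# the URLs discovered in the previous one
depth=3
for i in `seq 1 $depth`; do
  bin/nutch generate crawl/crawldb crawl/segments -topN 1000
  s=`ls -d crawl/segments/2* | tail -1`
  bin/nutch fetch $s
  bin/nutch updatedb crawl/crawldb $s
done

# then build the link database and the index, as in the tutorial
bin/nutch invertlinks crawl/linkdb -dir crawl/segments
bin/nutch index crawl/indexes crawl/crawldb crawl/linkdb crawl/segments/*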
