Yeah, but how do I crawl the actual pages, the way I would in an intranet crawl? For example, let's say that I have 20 URLs in my set from the DmozParser, and that I want to go 3 levels deep into each of those 20 URLs. Is that possible?
For example, with the intranet crawl I would start with some seed URL and then crawl to some depth. How would I do that with URLs taken from, for example, dmoz?

On 6/9/07, Enzo Michelangeli <[EMAIL PROTECTED]> wrote:
> ----- Original Message -----
> From: "Berlin Brown" <[EMAIL PROTECTED]>
> Sent: Sunday, June 10, 2007 4:19 AM
>
> > I am using the tutorial below (with nutch 0.9) to crawl the
> > web. I went through the steps: downloaded dmoz, ran the parser, etc.
> >
> > bin/nutch inject crawl/crawldb dmoz
> > etc
> > etc.
> > bin/nutch fetch $s1
> >
> > Once I get to this step, is there a way to "crawl" the sites that are
> > in the dmoz/url list? It seems like we are just fetching the URLs
> > that are straight out of the dmoz list. Let's say I want to crawl
> > those to a particular depth?
> >
> > http://lucene.apache.org/nutch/tutorial8.html
>
> You have to complete the fetch phase and, after that, the updatedb phase, so
> that the URLs in the pages fetched and placed in the segment $s1 are inserted
> into the crawldb:
>
> bin/nutch updatedb crawl/crawldb $s1
>
> The next "generate" phase will prepare a new fetch list using both the unfetched
> original URLs and the newly added ones.
>
> If you want, in that next cycle, to crawl only the new URLs and not the
> ones injected from the dmoz list, you can use a different crawldb:
>
> bin/nutch updatedb crawl/crawldb_new $s1
> bin/nutch generate crawl/crawldb_new crawl/segments -topN 1000
> s3=`ls -d crawl/segments/2* | tail -1`
>
> etc., but I see little advantage in that.
>
> Enzo

--
Berlin Brown
http://www.newspiritcompany.com - newspirit technologies
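To spell out the cycle Enzo describes as a loop, here is a minimal sketch rather than a tested recipe. It assumes the crawldb has already been populated with bin/nutch inject, that segment directories are named so that the newest one sorts last (as the `ls -d crawl/segments/2* | tail -1` idiom in the reply implies), and that one generate/fetch/updatedb round roughly corresponds to one level of depth. The function name crawl_rounds and the NUTCH, CRAWLDB and SEGMENTS variables are made up for illustration; they are not part of Nutch.

```shell
#!/bin/sh
# Hypothetical settings; adjust to your own layout.
NUTCH=${NUTCH:-bin/nutch}
CRAWLDB=${CRAWLDB:-crawl/crawldb}
SEGMENTS=${SEGMENTS:-crawl/segments}

# crawl_rounds DEPTH TOPN
# Runs DEPTH rounds of generate -> fetch -> updatedb against an
# already-injected crawldb. Each round fetches pages discovered
# by the previous one, so 3 rounds reach links about 3 hops away
# from the injected URLs (limited to TOPN URLs per round).
crawl_rounds() {
  depth=$1
  topn=$2
  i=1
  while [ "$i" -le "$depth" ]; do
    # Build a fetch list from the URLs currently due in the crawldb.
    "$NUTCH" generate "$CRAWLDB" "$SEGMENTS" -topN "$topn" || return 1
    # Assumption: the newest segment sorts last by name.
    segment=$(ls -d "$SEGMENTS"/2* | tail -1)
    "$NUTCH" fetch "$segment" || return 1
    # Feed the links found in the fetched pages back into the crawldb.
    "$NUTCH" updatedb "$CRAWLDB" "$segment" || return 1
    i=$((i + 1))
  done
}
```

After the inject step, something like `crawl_rounds 3 1000` would run three rounds. Note that -topN caps how many URLs each round fetches, so with a small value a round may not cover every link found in the previous one.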
