Yeah, but how do I crawl the actual pages like you would in an intranet
crawl?  For example, let's say that I have 20 URLs in my set from the
DmozParser.  Let's also say that I want to go 3 levels deep into those
20 URLs.  Is that possible?

For example, with the intranet crawl I would start with some seed URL
and then go to some depth.  How would I do that with URLs fetched from,
for example, dmoz?

On 6/9/07, Enzo Michelangeli <[EMAIL PROTECTED]> wrote:
> ----- Original Message -----
> From: "Berlin Brown" <[EMAIL PROTECTED]>
> Sent: Sunday, June 10, 2007 4:19 AM
>
> > I am using the tutorial below (with Nutch 0.9) to crawl the
> > web.  I went through the steps: downloaded dmoz, ran the parser, etc.,
> > etc.
> >
> > bin/nutch inject crawl/crawldb dmoz
> > etc
> > etc.
> > bin/nutch fetch $s1
> >
> > Once I get to this step, is there a way to "crawl" the sites that are
> > in the dmoz/url list?  It seems like we are just fetching the URLs
> > that are straight out of the dmoz list.  Let's say I want to crawl
> > those to a particular depth?
> >
> > http://lucene.apache.org/nutch/tutorial8.html
>
> You have to complete the fetch phase and, after that, the updatedb phase, so
> that the URLs in the pages fetched and placed in the segment $s1 are inserted
> into the crawldb:
>
> bin/nutch updatedb crawl/crawldb $s1
>
> The next "generate" phase will prepare a new fetch list using both the
> unfetched original URLs and the newly added ones.
>
> If you want, in that next cycle, to crawl only the new URLs and not the
> ones injected from the dmoz list, you can use a different crawldb:
>
> bin/nutch updatedb crawl/crawldb_new $s1
> bin/nutch generate crawl/crawldb_new crawl/segments -topN 1000
> s3=`ls -d crawl/segments/2* | tail -1`
>
> etc., but I see little advantage in that.
>
> Enzo
>
>
>
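
Putting the reply above together, going N levels deep looks like it is
just the generate / fetch / updatedb cycle repeated N times.  Below is a
rough sketch of that loop, not a tested recipe.  The function name, the
NUTCH variable and the argument order are made up for illustration; the
three nutch commands are the ones already shown in this thread.  If a
segment has already been fetched (like $s1 above), run updatedb on it
first so its links are in the crawldb before the loop starts.

```shell
#!/bin/sh
# Sketch: one generate -> fetch -> updatedb round per level of depth.
# NUTCH and crawl_rounds are illustrative names, not part of Nutch itself.
NUTCH="${NUTCH:-bin/nutch}"

crawl_rounds() {
    crawldb=$1      # e.g. crawl/crawldb, already injected with the dmoz URLs
    segments=$2     # e.g. crawl/segments
    depth=$3        # number of rounds to run, e.g. 3
    topn=$4         # cap on the size of each fetch list, e.g. 1000

    round=1
    while [ "$round" -le "$depth" ]; do
        # Build a fetch list from the URLs in the crawldb that are due for fetching.
        "$NUTCH" generate "$crawldb" "$segments" -topN "$topn" || return 1
        # Pick the newest segment directory, i.e. the one generate just made.
        segment=$(ls -d "$segments"/2* | tail -1)
        "$NUTCH" fetch "$segment" || return 1
        # Feed the links found in this round back into the crawldb.
        "$NUTCH" updatedb "$crawldb" "$segment" || return 1
        round=$((round + 1))
    done
}
```

With the layout from the tutorial this would be called as
`crawl_rounds crawl/crawldb crawl/segments 3 1000`.  Each round can only
reach pages linked from pages fetched in earlier rounds, which is what
gives the depth; -topN limits how many of them are actually fetched.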


-- 
Berlin Brown
http://www.newspiritcompany.com - newspirit technologies
