The way Nutch keeps track of URLs in the crawldb is as follows.

Every URL that is found during a crawl (at parse time) is written into the
crawldb by updatedb. Of course, the URL must pass all filters and
normalizers before that.

When a URL is entered into the crawldb, an object (CrawlDatum) is created
with information about this link. One of the fields of the CrawlDatum is a
status. It indicates the state of the URL, and initially it is unfetched.
When generate is called, the generator selects the links whose status is
unfetched. The CrawlDatum also records when the URL was fetched. If you
didn't change the settings, the URL should be re-fetched after 30 days.
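
To make the bookkeeping above concrete, here is a rough sketch in Python. It
is only a toy model of the idea, not Nutch's actual code: the names Datum,
update_db, generate and mark_fetched are made up for illustration, and the
30-day interval is hard-coded as an assumption taken from the default
mentioned above.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Callable, Dict, Iterable, List, Optional

UNFETCHED = "unfetched"
FETCHED = "fetched"

# Assumed default re-fetch interval (the "30 days" mentioned above).
DEFAULT_FETCH_INTERVAL = timedelta(days=30)


@dataclass
class Datum:
    """Hypothetical stand-in for a crawldb entry: a status plus a fetch time."""
    url: str
    status: str = UNFETCHED
    fetch_time: Optional[datetime] = None
    interval: timedelta = DEFAULT_FETCH_INTERVAL

    def is_due(self, now: datetime) -> bool:
        # Never-fetched URLs are always candidates.
        if self.status == UNFETCHED:
            return True
        # Fetched URLs become candidates again once the interval has elapsed.
        return self.fetch_time is not None and now >= self.fetch_time + self.interval


def update_db(db: Dict[str, Datum], discovered: Iterable[str],
              url_filter: Callable[[str], bool] = lambda u: True) -> None:
    """Add newly discovered URLs that pass the filter; known URLs are kept as-is."""
    for url in discovered:
        if url_filter(url) and url not in db:
            db[url] = Datum(url)


def generate(db: Dict[str, Datum], now: datetime) -> List[str]:
    """Return the URLs that are due for fetching at time `now`."""
    return sorted(url for url, datum in db.items() if datum.is_due(now))


def mark_fetched(db: Dict[str, Datum], url: str, now: datetime) -> None:
    """Record a successful fetch so the URL is skipped until its interval expires."""
    db[url].status = FETCHED
    db[url].fetch_time = now
```

In this model a URL that was fetched today is left out of the next generate
and only comes back once 30 days have passed, whether or not its content
changed in the meantime. As far as I recall, in Nutch releases from around
the time of this thread the default interval is the db.default.fetch.interval
property (in days), which you can override in nutch-site.xml; please check
the nutch-default.xml of your version, since the name may differ.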

HTH

Gal Nitzan

------ Original Message ------
Received: Mon, 04 Dec 2006 01:26:54 PM IST
From: Yoni Amir <[EMAIL PROTECTED]>
To: [email protected]
Subject: Re: Re-crawl

> I am struggling with the same questions. I don't understand how nutch
> decides whether to re-fetch content that was not updated, and how/where
> to configure it?
> 
> Any help will be greatly appreciated :)
> 
> Yoni
> 
> On Mon, 2006-11-27 at 07:27 -0800, karthik085 wrote:
> > First time I let nutch crawl and if some urls are not fetched, nutch
> > reports an error in the log file. Is there a way, Nutch can re-crawl and
> > update the affected/non-fetched ones and do not do any operations on the
> > valid ones?
> > 
> > Also, If I wanted to recrawl again, say after few days/months on the same
> > website and some content of the website was updated and some not. What
> > does nutch do in this case? What operations does it do for the
> > 1. updated content
> > 2. not-updated content
> > in the current database (local database from the previous crawl)?
> > 
> > Does it just get the updated contents? Does it get all?
> > 
> > If nutch gets everything(updated and non-updated), is there a way, we can
> > ask nutch to get only the updated content?
> > 
> 




_______________________________________________
Nutch-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-general
