Here is how the crawl db keeps track of URLs. Every URL that is found during a crawl (at parse time) is written into the crawldb by updatedb. Of course, the URL must pass all filters and normalizers before that.
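To make the update step concrete, here is a minimal sketch in Python (not Nutch's own Java code). The names `normalize`, `passes_filters` and `update_db`, the dict used as a crawl db, and the filter and normalization rules themselves are all assumptions made up for illustration; real Nutch drives this through its configured URL filter and normalizer plugins.

```python
from urllib.parse import urlsplit, urlunsplit


def normalize(url):
    """Hypothetical normalizer: lowercase the host and drop the fragment."""
    parts = urlsplit(url)
    return urlunsplit(
        (parts.scheme.lower(), parts.netloc.lower(), parts.path or "/", parts.query, "")
    )


def passes_filters(url):
    """Hypothetical filter: accept only http and https URLs."""
    return urlsplit(url).scheme in ("http", "https")


def update_db(crawl_db, discovered_urls):
    """Add every discovered URL that survives normalization and filtering.

    crawl_db is a plain dict standing in for the crawl db (an assumption);
    URLs that are already known keep their existing entry.
    """
    for raw in discovered_urls:
        url = normalize(raw)
        if not passes_filters(url):
            continue
        # New entries start out as "unfetched".
        crawl_db.setdefault(url, {"status": "unfetched"})
    return crawl_db
```

Calling `update_db({}, ["HTTP://Example.COM/a#top", "ftp://example.com/x"])` would leave a single entry, for `http://example.com/a`, because the ftp URL is rejected by the illustrative filter.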
When a URL is entered into the crawl db, an object (CrawlDatum) is created with information about that link. One of the CrawlDatum fields is a status. It indicates the state of the URL, and initially it is unfetched. When generate is called, the generator selects the links whose status is unfetched. The CrawlDatum object also records when the URL was fetched. If you didn't change the settings, a fetched URL should be re-fetched after 30 days.

HTH

Gal Nitzan

------ Original Message ------
Received: Mon, 04 Dec 2006 01:26:54 PM IST
From: Yoni Amir <[EMAIL PROTECTED]>
To: [email protected]
Subject: Re: Re-crawl

> I am struggling with the same questions. I don't understand how nutch
> decides whether to re-fetch content that was not updated, and how/where
> to configure it?
>
> Any help will be greatly appreciated :)
>
> Yoni
>
> On Mon, 2006-11-27 at 07:27 -0800, karthik085 wrote:
> > First time I let nutch crawl and if some urls are not fetched, nutch reports
> > an error in the log file. Is there a way, Nutch can re-crawl and update the
> > affected/non-fetched ones and do not do any operations on the valid ones?
> >
> > Also, If I wanted to recrawl again, say after few days/months on the same
> > website and some content of the website was updated and some not. What does
> > nutch do in this case? What operations does it do for the
> > 1. updated content
> > 2. not-updated content
> > in the current database (local database from the previous crawl)?
> >
> > Does it just get the updated contents? Does it get all?
> >
> > If nutch gets everything(updated and non-updated), is there a way, we can
> > ask nutch to get only the updated content?
