Hey Gal,

Thanks for the excellent explanation. I am surprised that Nutch will re-fetch a page (assuming 30 days have passed) even if the page hasn't been updated on the server.
Yoni

On Tue, 2006-12-05 at 15:41 +0200, Gal Nitzan wrote:
> The concept of keeping track of the crawl db is as follows.
>
> Every URL that is found during the crawl (parse) is added to the crawldb by
> updatedb. Of course, the URL must pass all filters and normalizers before
> that.
>
> When a URL is entered into the crawl db, an object (crawldatum) is created
> with information about the link. One of the parameters of crawldatum is a
> status. It indicates the state of the URL and is initially unfetched.
> When generate is called, the generator adds the links whose status is
> unfetched. The crawldatum object also records when the URL was fetched.
> If you didn't change the settings, then it should be re-fetched in 30 days.
>
> HTH
>
> Gal Nitzan
>
> ------ Original Message ------
> Received: Mon, 04 Dec 2006 01:26:54 PM IST
> From: Yoni Amir <[EMAIL PROTECTED]>
> To: [email protected]
> Subject: Re: Re-crawl
>
> > I am struggling with the same questions. I don't understand how Nutch
> > decides whether to re-fetch content that was not updated, and how/where
> > to configure it?
> >
> > Any help will be greatly appreciated :)
> >
> > Yoni
> >
> > On Mon, 2006-11-27 at 07:27 -0800, karthik085 wrote:
> > > The first time I let Nutch crawl, if some URLs are not fetched, Nutch
> > > reports an error in the log file. Is there a way Nutch can re-crawl and
> > > update only the affected/non-fetched ones, without doing any operations
> > > on the valid ones?
> > >
> > > Also, if I wanted to recrawl the same website after a few days/months,
> > > and some content of the website was updated and some not, what does
> > > Nutch do in this case? What operations does it perform for the
> > > 1. updated content
> > > 2. not-updated content
> > > in the current database (local database from the previous crawl)?
> > >
> > > Does it just get the updated contents? Does it get all?
> > >
> > > If Nutch gets everything (updated and non-updated), is there a way we
> > > can ask Nutch to get only the updated content?

_______________________________________________
Nutch-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-general
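
The selection rule Gal describes in this thread can be sketched roughly as follows. This is a minimal illustration in Python, not Nutch's actual Java code: the `CrawlDatum` fields, the `is_due` and `generate` functions, and the status strings are all hypothetical names chosen for the example, and the 30-day interval is simply the default mentioned above.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List, Optional

# Assumed default from the thread: a fetched URL becomes due again after 30 days.
FETCH_INTERVAL = timedelta(days=30)


@dataclass
class CrawlDatum:
    """Hypothetical stand-in for the per-URL record kept in the crawl db."""
    url: str
    status: str = "unfetched"             # new entries start out unfetched
    fetch_time: Optional[datetime] = None  # when the URL was last fetched


def is_due(datum: CrawlDatum, now: datetime,
           interval: timedelta = FETCH_INTERVAL) -> bool:
    """Return True if the URL should be placed on the next fetch list."""
    if datum.status == "unfetched":
        return True
    # Only elapsed time is checked here; whether the page changed on the
    # server plays no part in the decision.
    return datum.fetch_time is not None and now - datum.fetch_time >= interval


def generate(crawldb: List[CrawlDatum], now: datetime) -> List[str]:
    """Collect the URLs that are due, in crawl db order."""
    return [d.url for d in crawldb if is_due(d, now)]


if __name__ == "__main__":
    crawldb = [
        CrawlDatum("http://example.com/new"),
        CrawlDatum("http://example.com/old", "fetched", datetime(2006, 11, 1)),
        CrawlDatum("http://example.com/recent", "fetched", datetime(2006, 12, 1)),
    ]
    print(generate(crawldb, now=datetime(2006, 12, 5)))
```

Under these assumptions the new URL and the one fetched 34 days earlier are selected, while the one fetched 4 days earlier is skipped. Because the check looks only at the time since the last fetch, an unchanged page is re-fetched once the interval has passed, which is the behaviour questioned at the top of this message.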
