I am struggling with the same questions. I don't understand how Nutch decides whether to re-fetch content that was not updated, or how/where to configure this behavior.
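From poking around, my guess is that it is driven by a fetch-interval property that can be overridden in conf/nutch-site.xml — but I'm not sure I have the property name right (it seems to differ between Nutch versions), so please correct me:

```xml
<!-- conf/nutch-site.xml: overrides nutch-default.xml.
     NOTE: the property name below is my guess from the 0.7.x
     nutch-default.xml (where the value is in days); check your
     own nutch-default.xml for the exact name in your version. -->
<property>
  <name>db.default.fetch.interval</name>
  <value>30</value>
  <description>Default number of days between re-fetches of a page.</description>
</property>
```

If that is the right knob, it would only control *when* a page becomes due for re-fetching, not whether Nutch skips unchanged content — which is exactly the part I don't understand.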
Any help will be greatly appreciated :)

Yoni

On Mon, 2006-11-27 at 07:27 -0800, karthik085 wrote:
> The first time I let Nutch crawl, if some URLs are not fetched, Nutch
> reports an error in the log file. Is there a way Nutch can re-crawl and
> update the affected/non-fetched ones, and not do any operations on the
> valid ones?
>
> Also, if I wanted to recrawl the same website again, say after a few
> days/months, and some content of the website was updated and some was
> not, what does Nutch do in this case? What operations does it do for the
> 1. updated content
> 2. non-updated content
> in the current database (the local database from the previous crawl)?
>
> Does it just get the updated content? Does it get everything?
>
> If Nutch gets everything (updated and non-updated), is there a way we
> can ask Nutch to get only the updated content?

_______________________________________________
Nutch-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-general
