The first time I let Nutch crawl, if some URLs are not fetched, Nutch reports an error in the log file. Is there a way to make Nutch re-crawl and update only the affected/non-fetched URLs, without repeating any operations on the ones that were fetched successfully?
Also, suppose I re-crawl the same website after a few days or months, and some of its content has been updated while some has not. What does Nutch do in this case? What operations does it perform on (1) the updated content and (2) the non-updated content already in the local database from the previous crawl? Does it fetch just the updated content, or does it fetch everything? If Nutch fetches everything (updated and non-updated), is there a way to ask it to fetch only the updated content?
--
View this message in context: http://www.nabble.com/Re-crawl-tf2712378.html#a7561830
Sent from the Nutch - User mailing list archive at Nabble.com.
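For context, the usual manual re-crawl loop in Nutch is generate → fetch → updatedb. The sketch below assumes a Nutch 0.8-style installation with `bin/nutch` available and an existing crawl directory named `crawl` (both names are illustrative, not from the original message); whether a URL is re-selected depends on its fetch interval in the crawldb.

```shell
#!/bin/sh
# Sketch of one re-crawl cycle (assumes bin/nutch from a Nutch 0.8-style
# install; the "crawl" directory name is an assumption for illustration).
CRAWL=crawl   # crawl directory left over from the previous run

# Select URLs that are due for (re-)fetching into a new segment;
# URLs whose fetch interval has not expired are skipped.
bin/nutch generate $CRAWL/crawldb $CRAWL/segments

# The newest entry under segments/ is the segment just generated.
SEGMENT=$CRAWL/segments/`ls -t $CRAWL/segments | head -1`

# Fetch only the URLs selected above.
bin/nutch fetch $SEGMENT

# Merge the fetch results back into the crawl database; entries that
# were not re-selected in this cycle are left as they are.
bin/nutch updatedb $CRAWL/crawldb $SEGMENT
```

This does not by itself limit fetching to pages whose content changed on the server; it only controls *when* each URL becomes due again via its fetch interval.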
