The first time I let Nutch crawl, if some URLs are not fetched, Nutch reports an error in the log file. Is there a way to make Nutch re-crawl and update only the affected/non-fetched URLs, without repeating any operations on the ones that were fetched successfully?
Also, suppose I re-crawl the same website after a few days or months, and some of its content has been updated while some has not. What does Nutch do in this case? What operations does it perform on (1) the updated content and (2) the non-updated content already in the local database from the previous crawl? Does it fetch just the updated content, or does it fetch everything? If Nutch fetches everything (updated and non-updated), is there a way to ask it to fetch only the updated content?
--
View this message in context: http://www.nabble.com/Re-crawl-tf2712378.html#a7561830
Sent from the Nutch - User mailing list archive at Nabble.com.
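For context, the usual manual re-crawl loop in Nutch is generate → fetch → updatedb. The sketch below assumes a Nutch 0.8-style installation with `bin/nutch` available and an existing crawl directory named `crawl` (both names are illustrative, not from the original message); whether a URL is re-selected depends on its fetch interval in the crawldb.

```shell
#!/bin/sh
# Sketch of one re-crawl cycle (assumes bin/nutch from a Nutch 0.8-style
# install; the "crawl" directory name is an assumption for illustration).
CRAWL=crawl   # crawl directory left over from the previous run

# Select URLs that are due for (re-)fetching into a new segment;
# URLs whose fetch interval has not expired are skipped.
bin/nutch generate $CRAWL/crawldb $CRAWL/segments

# The newest entry under segments/ is the segment just generated.
SEGMENT=$CRAWL/segments/`ls -t $CRAWL/segments | head -1`

# Fetch only the URLs selected above.
bin/nutch fetch $SEGMENT

# Merge the fetch results back into the crawl database; entries that
# were not re-selected in this cycle are left as they are.
bin/nutch updatedb $CRAWL/crawldb $SEGMENT
```

This does not by itself limit fetching to pages whose content changed on the server; it only controls *when* each URL becomes due again via its fetch interval.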
