I am struggling with the same questions. I don't understand how Nutch
decides whether to re-fetch content that was not updated, or how/where
that behavior can be configured.
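
The closest thing I've found so far is the re-fetch interval stored in
the crawl db: as far as I can tell, a page simply becomes eligible for
re-fetching once its interval expires, regardless of whether it actually
changed. A minimal sketch for conf/nutch-site.xml, assuming Nutch 0.8.x
(later versions rename this property to db.fetch.interval.default and
measure it in seconds):

  <?xml version="1.0"?>
  <configuration>
    <property>
      <!-- Re-fetch interval in days: how long Nutch waits before a
           page in the crawl db is considered due for fetching again. -->
      <name>db.default.fetch.interval</name>
      <value>30</value>
    </property>
  </configuration>

If I've understood it right, Nutch doesn't ask the server whether the
content changed before fetching; it just re-fetches on schedule once the
page is due. I'd be glad if someone could confirm or correct this.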

Any help will be greatly appreciated :)

Yoni

On Mon, 2006-11-27 at 07:27 -0800, karthik085 wrote:
> The first time I let Nutch crawl, if some URLs are not fetched, Nutch
> reports an error in the log file. Is there a way Nutch can re-crawl and
> update the affected/non-fetched ones without performing any operations
> on the valid ones?
> 
> Also, if I recrawl the same website after a few days/months, and some
> of its content was updated while some was not, what does Nutch do in
> this case? What operations does it perform on the
> 1. updated content
> 2. non-updated content
> in the current database (the local database from the previous crawl)?
> 
> Does it fetch just the updated content, or does it fetch everything?
> 
> If Nutch fetches everything (updated and non-updated), is there a way
> we can ask Nutch to fetch only the updated content?
> 
