Hey Gal,

Thanks for the excellent explanation. I am surprised that Nutch will
re-fetch a page (once 30 days have passed) even if the page hasn't
been updated on the server.

Yoni

On Tue, 2006-12-05 at 15:41 +0200, Gal Nitzan wrote:
> The concept of how the crawldb keeps track of URLs is as follows.
> 
> Every URL that is found during the crawl (parse) is added to the crawldb by
> the updatedb step. Of course, the URL must first pass all the filters and
> normalizers.
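> 
> As a concrete example (the crawl directory and segment name here are just
> placeholders), the update step in a script-driven crawl usually looks
> something like:
> 
>   bin/nutch updatedb crawl/crawldb crawl/segments/20061205120000
> 
> The filters live in conf/ (e.g. regex-urlfilter.txt for the URL filters);
> only URLs that pass them and the normalizers make it into the crawldb.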
> 
> When a URL is entered into the crawldb, an object (CrawlDatum) is created
> with information about that link. One of the fields of CrawlDatum is a
> status, which indicates the state of the URL; initially it is unfetched.
> When generate is called, the generator selects the links whose status is
> unfetched. The CrawlDatum also records when the URL was last fetched; if
> you didn't change the settings, it will be due for re-fetch after 30 days.
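> 
> The 30 days is just the default fetch interval from the configuration. If I
> remember right, in the 0.8.x line the property is db.default.fetch.interval
> (in days), and you can override it in conf/nutch-site.xml along these lines:
> 
>   <property>
>     <name>db.default.fetch.interval</name>
>     <value>30</value>
>     <description>Default number of days between re-fetches of a page.</description>
>   </property>
> 
> Lower it and pages become due for re-fetch sooner; raise it and Nutch will
> revisit them less often.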
> 
> HTH
> 
> Gal Nitzan
> 
> ------ Original Message ------
> Received: Mon, 04 Dec 2006 01:26:54 PM IST
> From: Yoni Amir <[EMAIL PROTECTED]>
> To: [email protected]
> Subject: Re: Re-crawl
> 
> > I am struggling with the same questions. I don't understand how Nutch
> > decides whether to re-fetch content that was not updated, or how/where
> > to configure this.
> > 
> > Any help will be greatly appreciated :)
> > 
> > Yoni
> > 
> > On Mon, 2006-11-27 at 07:27 -0800, karthik085 wrote:
> > > The first time I let Nutch crawl, if some URLs are not fetched, Nutch
> > > reports an error in the log file. Is there a way Nutch can re-crawl and
> > > update the affected/non-fetched ones, without doing any operations on
> > > the valid ones?
> > > 
> > > Also, if I recrawl the same website again, say after a few days/months,
> > > and some of the content was updated and some was not, what does Nutch do
> > > in this case? What operations does it do for
> > > 1. updated content
> > > 2. not-updated content
> > > in the current database (the local database from the previous crawl)?
> > > 
> > > Does it just get the updated content? Does it get everything?
> > > 
> > > If Nutch gets everything (updated and non-updated), is there a way we
> > > can ask Nutch to get only the updated content?
> > > 
> > 
> 
> 
> 
