Re: scheduled crawling in nutch

Thorsten Scherler Thu, 21 Aug 2008 06:00:07 -0700

On Thu, 2008-08-21 at 05:51 -0700, rameshgalla wrote:
> Thanks for the reply.
> If I use cron it solves half of my problem...
> It helps to do the schedule crawling...
> But how to do incremental crawling...
> If I use cron it runs the same command once in a week but it crawls 1
> million documents each time..
> Any ideas to do only incremental crawling.......

That is logically not possible.

To know which pages have been modified, you need to crawl them! A crawl
is nothing more than requesting the page and comparing the Last-Modified
header of the response with the one from the previous fetch. Requesting
the page again and again is the only way to know whether it has been
modified.
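
To make the comparison concrete, here is a minimal sketch in plain
Python. It is not Nutch code. The names has_changed and
fetch_last_modified are made up for illustration, and using a HEAD
request is my assumption; a server that does not send Last-Modified
leaves you no choice but to treat the page as changed.

```python
from email.utils import parsedate_to_datetime
from typing import Optional
from urllib.request import Request, urlopen


def has_changed(stored: Optional[str], current: Optional[str]) -> bool:
    """Compare two Last-Modified header values (HTTP date strings).

    Returns True when the page has to be treated as modified.
    """
    # Without both values there is nothing to compare: assume it changed.
    if not stored or not current:
        return True
    try:
        # A strictly newer date means the page was modified.
        return parsedate_to_datetime(current) > parsedate_to_datetime(stored)
    except (TypeError, ValueError):
        # Unparseable or incomparable dates: be safe and assume it changed.
        return True


def fetch_last_modified(url: str) -> Optional[str]:
    """Request the page (HEAD only) and return its Last-Modified header."""
    request = Request(url, method="HEAD")
    with urlopen(request, timeout=10) as response:
        return response.headers.get("Last-Modified")
```

Note that even in the best case every URL still costs one request per
run; the only thing you can save is downloading and re-parsing the body.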

Indexing is another thing.

salu2

> 
> 
> 
> rameshgalla wrote:
> > 
> > I want to do schedule crawling in nutch.....
> > Eg: I have crawled a site which has 1 million pages.
> > and want to crawl the same site for updates once per week
> > automatically(scheduled & incremental crawling).
> > It has to crawl only modified or newly added content.
> > 
> > Is it possible with nutch?
> > 
> > If possible how can I achieve it?
> > 
> 
-- 
Thorsten Scherler                                 thorsten.at.apache.org
Open Source Java                      consulting, training and solutions
