On Thu, 2008-08-21 at 05:51 -0700, rameshgalla wrote:
> Thanks for the reply.
> If I use cron it solves half of my problem...
> It helps to do the schedule crawling...
> But how to do incremental crawling...
> If I use cron it runs the same command once in a week but it crawls 1
> million documents each time..
> Any ideas to do only incremental crawling.......

That is logically not possible. To know which pages have been modified,
you need to crawl them. A crawl is nothing more than requesting the page
and comparing the Last-Modified header of the response with what you saw
last time. Requesting the page again and again is the only way to know
whether it has been modified.

Indexing is another matter.

salu2

> rameshgalla wrote:
> >
> > I want to do schedule crawling in nutch.....
> > Eg: I have crawled a site which has 1 million pages.
> > and want to crawl the same site for updates once per week
> > automatically(scheduled & incremental crawling).
> > It has to crawl only modified or newly added content.
> >
> > Is it possible with nutch?
> >
> > If possible how can I achieve it?
> >

-- 
Thorsten Scherler thorsten.at.apache.org
Open Source Java consulting, training and solutions
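
To illustrate the comparison described in the reply above, here is a minimal sketch of the "has this page changed?" check. It is not Nutch code. The function name is_modified, the stored timestamp, and the plain dict standing in for the response headers are all assumptions made for this example. The response headers still have to come from a fresh request to the page, which is the point of the reply.

```python
from email.utils import parsedate_to_datetime


def is_modified(stored_last_modified, response_headers):
    """Return True if the page should be treated as changed since the last crawl.

    stored_last_modified: Last-Modified value remembered from the previous
    crawl (hypothetical storage), or None if nothing was recorded.
    response_headers: plain dict of headers from the new response
    (an assumption for this sketch; keys are matched case-sensitively).
    """
    current = response_headers.get("Last-Modified")
    # Without a value on either side there is nothing to compare,
    # so fall back to treating the page as modified.
    if current is None or stored_last_modified is None:
        return True
    # Both values are HTTP dates; parse them and compare as datetimes.
    return parsedate_to_datetime(current) > parsedate_to_datetime(stored_last_modified)


# Hypothetical value remembered from last week's crawl.
previous = "Thu, 14 Aug 2008 10:00:00 GMT"

unchanged = is_modified(previous, {"Last-Modified": "Thu, 14 Aug 2008 10:00:00 GMT"})
changed = is_modified(previous, {"Last-Modified": "Wed, 20 Aug 2008 08:30:00 GMT"})
print(unchanged, changed)
```

A check like this only decides whether a fetched page needs to be processed again. It does not remove the need to request every page once per run.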
