Re: scheduled crawling in nutch

rameshgalla Thu, 21 Aug 2008 06:13:16 -0700

Correct. My previous question was not phrased properly.
What I want to ask about is indexing.
After crawling the content a second time, how is it indexed?
Does it delete the previous index and create a new one, or
does it replace only the modified content?




Thorsten Scherler-3 wrote:
> 
> On Thu, 2008-08-21 at 05:51 -0700, rameshgalla wrote:
>> Thanks for the reply.
>> Using cron solves half of my problem:
>> it takes care of the scheduled crawling.
>> But how do I do incremental crawling?
>> With cron, the same command runs once a week, but it crawls 1
>> million documents each time.
>> Any ideas on how to crawl only incrementally?
> 
> That is logically not possible.
> 
> To know which pages have been modified, you need to crawl
> them! A crawl is nothing more than requesting the page and comparing the
> Last-Modified header of the response. Requesting the page again
> and again is the only way to know whether it has been modified.
> 
> Indexing is another thing.
> 
> Regards
> 
>> 
>> 
>> 
>> rameshgalla wrote:
>> > 
>> > I want to do scheduled crawling in Nutch.
>> > E.g., I have crawled a site which has 1 million pages
>> > and want to crawl the same site for updates once per week,
>> > automatically (scheduled and incremental crawling).
>> > It should crawl only modified or newly added content.
>> > 
>> > Is it possible with nutch?
>> > 
>> > If possible how can I achieve it?
>> > 
>> 
> -- 
> Thorsten Scherler                                 thorsten.at.apache.org
> Open Source Java                      consulting, training and solutions
> 
> 
> 
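
The check Thorsten describes above (request the page, then compare the
Last-Modified header) can be sketched roughly as follows. This is only an
illustration in Python, not Nutch code: the function names
conditional_headers and is_modified are hypothetical, and it assumes the
crawler has stored the Last-Modified value from the previous fetch. Note
that the server still has to be contacted for every page; at best it
answers 304 Not Modified and the body does not need to be downloaded again.

```python
from email.utils import parsedate_to_datetime


def conditional_headers(stored_last_modified):
    """Build request headers for a re-fetch (hypothetical helper).

    If we remember the Last-Modified value from the previous crawl,
    send it back so the server can answer 304 instead of the full page.
    """
    if stored_last_modified is None:
        return {}
    return {"If-Modified-Since": stored_last_modified}


def is_modified(stored_last_modified, status_code, response_last_modified):
    """Decide whether a re-fetched page changed since the previous crawl.

    Assumption: when the dates are missing or unparsable, the page is
    treated as modified, since we cannot prove otherwise.
    """
    if status_code == 304:
        # The server itself says nothing changed.
        return False
    if stored_last_modified is None or response_last_modified is None:
        return True
    try:
        previous = parsedate_to_datetime(stored_last_modified)
        current = parsedate_to_datetime(response_last_modified)
        return current > previous
    except (TypeError, ValueError):
        # Malformed or incomparable dates: fall back to "modified".
        return True


if __name__ == "__main__":
    last_seen = "Thu, 14 Aug 2008 10:00:00 GMT"
    print(conditional_headers(last_seen))
    print(is_modified(last_seen, 200, "Thu, 21 Aug 2008 06:13:16 GMT"))
```

Either way, every one of the 1 million URLs is still requested each week;
the saving is only in transfer and in what has to be re-parsed afterwards.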

-- 
View this message in context: 
http://www.nabble.com/scheduled-crawling-in-nutch-tp19087524p19088491.html
Sent from the Nutch - User mailing list archive at Nabble.com.
