Hi guys,
I have a few questions regarding the way Nutch indexes and the best way to implement a re-crawl.

1. Why does Nutch have to create a new index every time it indexes, when it could just merge the new data into the existing index? I tried changing the value in the IndexMerger class to 'false' when the index is created, so that Lucene does not recreate the index each time it indexes. The problem is that I keep getting an exception when it tries to merge the indexes: the IndexMerger throws a lock timeout exception, and the index that gets created is affected as a result. Is it possible to let Nutch index by merging into an existing index? I have to crawl about 100 GB of data, and if only a few documents have changed, I don't want Nutch to recreate the whole index because of that; I want it to update the existing index by merging it with the new one. I need some light on this.

2. What is the best way to make Nutch re-crawl? I have implemented a class that loops the crawl process. It has a crawl interval, which is set in a property file, and a running status. The running status is a Boolean variable that is set to true while the re-crawl process is ongoing and to false when it should stop. With this approach, however, it seems that the index is not being fully generated: the values in the index cannot be queried. The re-crawl is written in Java and calls an underlying Ant script to run Nutch. I know most re-crawls are written as batch scripts, but which one do you recommend: a batch script or a loop-based Java program?

3. What is the best way of running Nutch as a Windows service or Unix daemon?

Thanks,
Armel
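
P.S. To make question 2 concrete, here is a minimal sketch of the kind of loop-based re-crawl described there. It is an illustration under assumptions, not the actual code: the class name RecrawlLoop, the property key crawl.interval.millis, the file name recrawl.properties, and the Ant target recrawl are all hypothetical. The sketch waits for each Ant run to exit before it sleeps, so two crawl cycles never overlap.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Properties;
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical loop-based re-crawl driver (illustrative names throughout).
final class RecrawlLoop {
    private final AtomicBoolean running = new AtomicBoolean(false);
    private final long intervalMillis;
    private final Runnable crawlCycle;

    RecrawlLoop(long intervalMillis, Runnable crawlCycle) {
        this.intervalMillis = intervalMillis;
        this.crawlCycle = crawlCycle;
    }

    // Reads the crawl interval from a hypothetical property key,
    // falling back to one hour when the key is absent.
    static long readIntervalMillis(Properties props) {
        return Long.parseLong(props.getProperty("crawl.interval.millis", "3600000"));
    }

    // Starts the given Ant target and blocks until the process exits.
    static int runAntTarget(String target) throws IOException, InterruptedException {
        Process process = new ProcessBuilder("ant", target).inheritIO().start();
        return process.waitFor();
    }

    // Runs one crawl cycle, then sleeps for the interval, until stop() is called.
    void run() {
        running.set(true);
        while (running.get()) {
            crawlCycle.run(); // returns only when the cycle has finished
            if (!running.get()) {
                break;
            }
            try {
                Thread.sleep(intervalMillis);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                running.set(false);
            }
        }
    }

    // Clears the running status; the loop exits before starting another cycle.
    void stop() {
        running.set(false);
    }

    public static void main(String[] args) throws IOException {
        Properties props = new Properties();
        try (InputStream in = Files.newInputStream(Paths.get("recrawl.properties"))) {
            props.load(in);
        }
        RecrawlLoop loop = new RecrawlLoop(readIntervalMillis(props), () -> {
            try {
                int exitCode = runAntTarget("recrawl"); // hypothetical target name
                if (exitCode != 0) {
                    System.err.println("Crawl cycle failed with exit code " + exitCode);
                }
            } catch (IOException e) {
                System.err.println("Could not start Ant: " + e.getMessage());
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        loop.run();
    }
}
```

The crawl cycle is passed in as a Runnable so that the loop itself does not depend on Ant; the same loop could call a batch script instead.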