Using Nutch nightly/0.9 for 'whole-internet' style crawling -- we've
got a process running a long (roughly 12-hour) generate, fetch,
update, mergesegs, invert, index, merge loop. This is all working
fine.
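For reference, the loop is roughly the following sequence of Nutch 0.9
commands (the crawl/ paths and the segment name are placeholders for
our actual layout, and each step is echoed rather than executed in
this sketch):

```shell
#!/bin/sh
# Dry-run sketch of the long crawl loop. All paths are illustrative;
# in the real script the newest segment is located with ls, and step()
# would invoke the command instead of echoing it.
set -e
NUTCH="bin/nutch"
CRAWL="crawl"
SEG="$CRAWL/segments/20070101000000"   # placeholder segment name

step() { echo "$NUTCH $*"; }           # swap echo for execution in a real run

step generate    "$CRAWL/crawldb" "$CRAWL/segments"
step fetch       "$SEG"
step updatedb    "$CRAWL/crawldb" "$SEG"
step mergesegs   "$CRAWL/segments_merged" -dir "$CRAWL/segments"
step invertlinks "$CRAWL/linkdb" -dir "$CRAWL/segments"
step index       "$CRAWL/indexes" "$CRAWL/crawldb" "$CRAWL/linkdb" "$SEG"
step merge       "$CRAWL/index" "$CRAWL/indexes"
```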

I want to add another nutch crawl on the same machine from a small  
set of high update-rate pages. Some of these pages may be in the  
larger crawl as well. The smaller crawl will happen a few times a day  
and will only take a few minutes to finish fetching.

I would like to merge the index from the smaller crawl into the main
index each time the smaller crawl completes -- so that results from
the small crawls reach the searcher immediately. However, I'm
concerned that doing so might corrupt the larger index if the larger
crawl happened to be indexing or merging at the same moment.

Are there protections against this? If it's not advised, is there a  
better way to have two separate crawls happening at once to the same  
index?
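The crude external guard I can think of is to wrap the index-writing
phases of both crawl scripts in the same flock(1) lock, so the small
crawl's merge waits until the big crawl's indexing/merging is done.
A minimal sketch (lock path is arbitrary, and the echoed strings stand
in for the real bin/nutch invocations):

```shell
#!/bin/sh
# Serialize index writes across two independent crawl scripts with a
# shared lock file. flock blocks until the lock is free, so only one
# process touches the merged index at a time.
LOCK=/tmp/nutch-index.lock   # any path both scripts can agree on

with_index_lock() {
  flock "$LOCK" -c "$1"
}

# The big crawl script would wrap its index + merge steps:
with_index_lock 'echo "bin/nutch index ... && bin/nutch merge ..."'

# The small crawl script would wrap its merge into the main index:
with_index_lock 'echo "bin/nutch merge main-index small-index"'
```

But I'd rather use something built into Nutch if such a protection
exists.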

-Brian




_______________________________________________
Nutch-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-general