We're using Nutch nightly/0.9 for a whole-internet-style crawl: a process runs a long (roughly 12-hour) generate, fetch, update, mergesegs, invert, index, merge loop. This is all working fine.
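For context, one iteration of the loop described above might look like the sketch below. This is a hedged reconstruction, not the poster's actual script: the crawl directory layout is assumed, and NUTCH defaults to an echo stub so the sketch runs as-is; point it at the real bin/nutch to actually crawl.

```shell
#!/bin/sh
# One iteration of the long crawl cycle (sketch; paths are assumptions).
# NUTCH defaults to an echo stub so this is runnable without Nutch installed.
NUTCH=${NUTCH:-echo nutch}
CRAWL=crawl   # assumed top-level crawl directory

$NUTCH generate   $CRAWL/crawldb $CRAWL/segments          # select URLs due for fetching
SEG=$CRAWL/segments/latest   # placeholder; a real run picks the newest segment dir
$NUTCH fetch      "$SEG"                                  # fetch the selected pages
$NUTCH updatedb   $CRAWL/crawldb "$SEG"                   # fold fetch results into the crawldb
$NUTCH mergesegs  $CRAWL/segments_merged -dir $CRAWL/segments   # consolidate segments
$NUTCH invertlinks $CRAWL/linkdb "$SEG"                   # build/refresh the link database
$NUTCH index      $CRAWL/indexes $CRAWL/crawldb $CRAWL/linkdb "$SEG"   # index the segment
$NUTCH merge      $CRAWL/index $CRAWL/indexes             # IndexMerger: one searchable index
```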
I want to add another Nutch crawl on the same machine over a small set of pages with a high update rate. Some of these pages may be in the larger crawl as well. The smaller crawl will run a few times a day and will take only a few minutes to finish fetching. Each time the smaller crawl completes, I would like to merge its index into the main, larger index, so that results from the small crawls reach the searcher immediately.

However, I'm concerned that doing so might corrupt the larger index if, by chance, the larger crawl is in its indexing or merging phase at the same time as the smaller crawl. Are there protections against this? If it's not advisable, is there a better way to run two separate crawls at once against the same index?

-Brian

_______________________________________________
Nutch-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-general
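One hedged way to avoid the two merges colliding (a sketch under assumptions, not something from the original thread): wrap the index-merge step of both crawl loops in a filesystem lock, so the IndexMerger is never running twice at once against the shared index. The lock path and the commented-out merge command are illustrative, not real paths from the post.

```shell
#!/bin/sh
# Hypothetical sketch: serialize the merge step of the big and small crawl
# loops with a mkdir-based lock. Lock path and nutch invocation are assumptions.
LOCK=${LOCK:-/tmp/nutch-index-merge.lock}

acquire_lock() {
  # mkdir is atomic on a local filesystem: exactly one process succeeds,
  # so only one crawl can enter its merge step at a time
  until mkdir "$LOCK" 2>/dev/null; do
    sleep 5   # the other crawl holds the lock; wait and retry
  done
}

release_lock() {
  rmdir "$LOCK"
}

acquire_lock
# ...run the merge into the shared index while holding the lock, e.g.:
# bin/nutch merge merged-index main/index small/index-new
echo "merging under lock"
release_lock
```

Both loops would call acquire_lock before their index/merge phase and release_lock afterwards; the fetch and updatedb phases touch separate crawldbs and need no coordination.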
