Your merged index will only reference the segments you choose to merge. In my case I'll have 200 segments of about 1 million URLs apiece. I generally index each one individually, then merge 10 of them, put that merged index on a query server, and work my way down.
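As a rough illustration of the batching described above, here is a minimal Python sketch that splits a list of segment directories into groups of 10, one group per merged index. The directory names and the `batch_segments` helper are made up for the example, and the actual index and merge steps (run through the nutch tools) are deliberately left out.

```python
from typing import List, Sequence


def batch_segments(segments: Sequence[str], batch_size: int = 10) -> List[List[str]]:
    """Split segment paths into consecutive groups of at most batch_size."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    return [
        list(segments[i:i + batch_size])
        for i in range(0, len(segments), batch_size)
    ]


# Hypothetical segment directory names; real segment names will differ.
segments = [f"segments/seg-{n:03d}" for n in range(200)]

# Each batch would be indexed segment by segment, merged into one index,
# and that merged index placed on a query server.
batches = batch_segments(segments)

for number, batch in enumerate(batches, start=1):
    print(f"merged index {number}: {batch[0]} .. {batch[-1]}")
```

With 200 segments and a batch size of 10 this gives 20 merged indexes, each referencing only the 10 segments that went into it.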
The nice thing is that with current svn, merging segments works fine and updating the scoring is easier. It takes some manual work, but it's doable :)

-----Original Message-----
From: Leonardo Barbosa <[EMAIL PROTECTED]>
To: [email protected]
Date: Thu, 7 Apr 2005 11:43:38 -0300
Subject: Merge question

> Hello,
>
> I configured nutch to crawl and index my intranet periodically, and
> now I'm trying to find the ideal merge process. I've looked in the
> list archive and found a discussion about it (please see below), but I
> still have one question: solution #2 seems to be the standard one, as
> I've noticed, but my problem is that when I have lots of segment dirs,
> I start to get "Too many open files" exceptions.
> So I need to merge them, and by doing that, do I need to index it
> again? Because it is an expensive process to index all the content,
> and I have it already indexed in the segment dirs.
> Can't I use the merged index created by the "./nutch merge" facility? The
> problem that I've found is that the merged index that I created
> (solution 2) is pointing to the old segments. Can't I "update" the
> index to point to the new, freshly merged segment?
> Shouldn't "./nutch mergesegs" create a merged index? I'm kind of
> confused by this.. :-)
>
> Best regards,
> Leonardo Barbosa.
>
> From
> nutch-user-return-53-apmail-incubator-nutch-user-archive=www.apache.org
> @incubator.apache.org
> Thu Mar 10 18:58:58 2005
>
> > Should I :
> >
> > 1) merge all the segments and then index them, or
> > 2) Should I index each segment individually and then merge the
> > indexes, keeping the segments separate. Or
> > 3) Should I index each segment separately, and keep both segments and
> > indexes separate, and search across multiple indexes (but I have
> > heard there are issues with the ranking)
>
> Option #3 is not really that great. You get better performance with a
> merged index. Option #1 would be more work with having to merge the
> segments, and I'm not sure that there is a real advantage to doing that
> over option #2. Option #2 is what most people do.
>
> Luke
>
