No problem, man, been there, done that :) What do you get when you run "ulimit -a"?
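If it is easier to check from a script than from the shell, here is a minimal sketch that reads the same open-files limit that "ulimit -a" reports, using Python's standard `resource` module (Unix only). The helper `max_open_segments` and its `files_per_segment` and `reserve` parameters are hypothetical, for a rough estimate only; they are not Nutch figures, so plug in whatever you actually see in your segment dirs.

```python
import resource


def open_file_limits():
    """Return the (soft, hard) limit on open file descriptors for this process."""
    return resource.getrlimit(resource.RLIMIT_NOFILE)


def describe(limit):
    """Render a limit value the way a human would expect to read it."""
    return "unlimited" if limit == resource.RLIM_INFINITY else str(limit)


def max_open_segments(soft_limit, files_per_segment, reserve=64):
    """Rough estimate of how many segments fit under the soft limit.

    files_per_segment and reserve are assumptions supplied by the caller,
    not values taken from Nutch. Returns None when the limit is unlimited.
    """
    if soft_limit == resource.RLIM_INFINITY:
        return None
    return max(0, (soft_limit - reserve) // files_per_segment)


if __name__ == "__main__":
    soft, hard = open_file_limits()
    print(f"open files: soft={describe(soft)} hard={describe(hard)}")
    # 20 files per segment is a made-up number for illustration only.
    print("segments that fit (estimate):", max_open_segments(soft, 20))
```

If the soft limit comes back low, that alone could explain the "Too many open files" errors with lots of segment dirs.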
What size do you keep your segments at, roughly? I typically didn't
merge segments before, but now that it is supported it makes managing
multiple crawlers pretty easy and allows for quick crawl lists and
merging to standard-size segments. I typically have 100 segments that
I have been merging into 10 large ones. I generate indexes from those
merged segments and then move them to 10 different servers. I used to
always just merge the indexes and move 10 segments to each server, but
I like having the 1-1 relationship, as you can now update the segments
from db data and re-index if you have to.

-----Original Message-----
From: Leonardo Barbosa <[EMAIL PROTECTED]>
To: [email protected]
Date: Thu, 7 Apr 2005 17:52:25 -0300
Subject: Re: Merge question

> Thanks for your help Byron... But I think I didn't make myself clear
> (or didn't understand your answer).
>
> I'll check the latest svn, but the merge of segments
> (SegmentMergeTool.java) that I have here is working fine for the
> content, but I can't find the index dir inside the new merged segment
> dir.
>
> So, if I merge the indexes (IndexMerger.java) of my current segments
> (a part of them or all of them) before merging the segments, it's
> pointing to the old segments.
>
> Sorry to bother you with this question, but I am re-indexing all my
> fetched urls every time I merge the segments, and if I don't merge the
> segs, I get "Too many open files"
> :-(
>
> Where can I find more documentation about it?
>
> Thanks again,
> Leonardo Barbosa.
>
> On Apr 7, 2005 4:48 PM, Byron Miller <[EMAIL PROTECTED]> wrote:
> > Your merged index will only reference the segments you choose to
> > merge.
> >
> > For me, I'll have 200 segments of about 1 million urls apiece. I
> > generally index each one individually and merge 10 and put that on
> > a query server and work my way down.
> >
> > The nice thing is that with svn current the merge of segments works
> > fine and update of scoring is easier to do.
> >
> > Takes some handy work, but is doable :)
> >
> > -----Original Message-----
> > From: Leonardo Barbosa <[EMAIL PROTECTED]>
> > To: [email protected]
> > Date: Thu, 7 Apr 2005 11:43:38 -0300
> > Subject: Merge question
> >
> > > Hello,
> > >
> > > I configured nutch to crawl and index my intranet periodically, and
> > > now I'm trying to find the ideal merge process. I've looked in the
> > > list archive and found a discussion about it (please see below), but
> > > I still have one question: solution #2 seems to be the standard, as
> > > I've noticed, but my problem is that when I have lots of segment
> > > dirs, I start to get "Too many open files" exceptions.
> > > So I need to merge them, and by doing that, do I need to index them
> > > again? Because it is an expensive process to index all the content,
> > > and I have it already indexed in the segment dirs.
> > > Can't I use the merged index created by the "./nutch merge"
> > > facility? The problem that I've found is that the merged index that
> > > I created (solution 2) is pointing to the old segments. Can't I
> > > "update" the index to point to the new fresh merged segment?
> > > Shouldn't "./nutch mergesegs" create a merged index? I'm kind of
> > > confused with this.. :-)
> > >
> > > Best regards,
> > > Leonardo Barbosa.
> > >
> > > From
> > > nutch-user-return-53-apmail-incubator-nutch-user-archive=www.apache.org@incubator.apache.org
> > > Thu Mar 10 18:58:58 2005
> > >
> > > > Should I:
> > > >
> > > > 1) merge all the segments and then index them, or
> > > > 2) index each segment individually and then merge the indexes,
> > > > keeping the segments separate, or
> > > > 3) index each segment separately, and keep both segments and
> > > > indexes separate, and search across multiple indexes (but I have
> > > > heard there are issues with the ranking)
> > >
> > > Option #3 is not really that great. You get better performance with
> > > a merged index.
> > > Option #1 would be more work with having to merge the
> > > segments, and I'm not sure that there is a real advantage to doing
> > > that over option #2. Option #2 is what most people do.
> > >
> > > Luke
> > >
> >
>
> --
> ------------------------------------------------------------------------
> Encumbered forever by desire and ambition
> There's a hunger still unsatisfied
> Our weary eyes still stray to the horizon
> Though down this road we've been so many times
>
> Pink Floyd (David Gilmour/Polly Samson) - High Hopes
> ------------------------------------------------------------------------
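
For the "100 segments into 10 large ones, one per server" setup from the top of this reply, the grouping step can be sketched roughly as below. This only plans which segments go into which merged segment; it does not call Nutch at all. The function name `plan_merges` and the `segment-NNN` names are made up for illustration, so substitute your real segment directory names.

```python
def plan_merges(segments, n_groups):
    """Split segment names into n_groups contiguous batches of near-equal size.

    Each batch is meant to become one merged segment (and one index),
    which keeps the 1-1 relationship between merged segment and server.
    """
    if n_groups <= 0:
        raise ValueError("n_groups must be positive")
    base, extra = divmod(len(segments), n_groups)
    groups, start = [], 0
    for i in range(n_groups):
        # The first `extra` batches take one more segment each.
        size = base + (1 if i < extra else 0)
        groups.append(segments[start:start + size])
        start += size
    return groups


if __name__ == "__main__":
    # Hypothetical segment names, for illustration only.
    segments = [f"segment-{i:03d}" for i in range(100)]
    for server, batch in enumerate(plan_merges(segments, 10), start=1):
        if batch:
            print(f"server {server}: {batch[0]} .. {batch[-1]} "
                  f"({len(batch)} segments)")
```

Keeping the batches contiguous means each server's merged segment covers a known range of crawls, which makes it easier to tell which one to re-merge and re-index later.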
