No problem man, been there done that :)

What do you get when you run "ulimit -a"?
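
If it is easier to check from a script, here is a rough Python sketch that
reads the per-process open-file limit, which is the same number "ulimit -n"
reports (one of the lines in "ulimit -a"). The function name is made up for
illustration, and this assumes a Unix-like system where Python's "resource"
module is available.

```python
import resource


def open_file_limits():
    """Return the (soft, hard) limit on open file descriptors for this process."""
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    return soft, hard


def describe(limit):
    """Render a limit value, treating RLIM_INFINITY as 'unlimited'."""
    return "unlimited" if limit == resource.RLIM_INFINITY else str(limit)


if __name__ == "__main__":
    soft, hard = open_file_limits()
    # The soft limit is the one a process actually hits when it reports
    # "Too many open files"; the hard limit is the ceiling it can be raised to.
    print("open files: soft=%s hard=%s" % (describe(soft), describe(hard)))
```

A low soft limit together with a lot of open segment dirs would be the first
thing to suspect.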

What size do you keep your segments at roughly?

I typically didn't merge segments before, but now that it is
supported it makes managing multiple crawlers pretty easy and allows for
quick crawl lists and merging to standard-size segments.

I typically have 100 segments that I have been merging into 10 large ones.
I generate indexes from those merged segments and then move them to 10
different servers.
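
As a rough illustration of the grouping step only, here is a small Python
sketch that splits a list of segment names into a fixed number of batches,
one batch per merged segment. It does not call any Nutch tool; the function
name and the segment names are hypothetical, and the actual merge and index
steps still have to be run on each batch separately.

```python
def plan_merge_groups(segments, num_groups):
    """Split segments into num_groups contiguous batches of near-equal size."""
    if num_groups <= 0:
        raise ValueError("num_groups must be positive")
    base, extra = divmod(len(segments), num_groups)
    groups = []
    start = 0
    for i in range(num_groups):
        # The first `extra` batches take one more segment each, so that
        # every segment lands in exactly one batch.
        size = base + (1 if i < extra else 0)
        groups.append(segments[start:start + size])
        start += size
    return groups


if __name__ == "__main__":
    # Hypothetical segment directory names, for illustration only.
    segments = ["segment-%03d" % i for i in range(100)]
    for n, group in enumerate(plan_merge_groups(segments, 10)):
        print("batch %d: %d segments (%s .. %s)" % (n, len(group), group[0], group[-1]))
```

With 100 segments and 10 batches, each batch holds 10 segments, which lines
up with one merged segment and one index per server.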

I used to always just merge the indexes and move 10 segments to each
server, but I like having the 1-1 relationship between segment and index,
as you can update the segments from db data now and re-index if you have to.


-----Original Message-----
From: Leonardo Barbosa <[EMAIL PROTECTED]>
To: [email protected]
Date: Thu, 7 Apr 2005 17:52:25 -0300
Subject: Re: Merge question

> Thanks for your help Byron... But I think I didn't make myself clear
> (or didn't understand your answer).
> 
> I'll check the latest svn, but the merge of segments
> (SegmentMergeTool.java) that I have here is working fine for the
> content, but I can't find the index dir inside the new merged segment
> dir.
> 
> So, if I merge the indexes (IndexMerger.java) of my current segments
> (some or all of them) before merging the segments, the merged index
> points to the old segments.
> 
> Sorry to bother you with this question, but I'm re-indexing all my
> fetched urls every time I merge the segments, and if I don't merge the
> segs, I run into "Too many open files"
> :-(
> 
> Where can I find more documentation about it?
> 
> Thanks again,
> Leonardo Barbosa.
> 
> On Apr 7, 2005 4:48 PM, Byron Miller <[EMAIL PROTECTED]> wrote:
> > Your merged index will only reference the segments you choose to
> > merge.
> > 
> > For me I'll have 200 segments of about 1 million urls apiece.  I
> > generally index each one individually, merge 10, put that on a
> > query server, and work my way down.
> > 
> > The nice thing is with svn current the merge of segments works fine
> > and update of scoring is easier to do.
> > 
> > Takes some handy work, but is doable :)
> > 
> > -----Original Message-----
> > From: Leonardo Barbosa <[EMAIL PROTECTED]>
> > To: [email protected]
> > Date: Thu, 7 Apr 2005 11:43:38 -0300
> > Subject: Merge question
> > 
> > > Hello,
> > >
> > > I configured nutch to crawl and index my intranet periodically, and
> > > now I'm trying to find the ideal merge process. I've looked in the
> > > list archive and found a discussion about it (please see below), but
> > > I still have one question: solution #2 seems to be the standard one,
> > > as far as I can tell, but my problem is that when I have lots of
> > > segment dirs, I start to get "Too many open files" exceptions.
> > > So I need to merge them, and by doing that, do I need to index it
> > > again? Because it is an expensive process to index all the content,
> > > and I have it already indexed in the segment dirs.
> > > Can't I use the merged index created by the "./nutch merge" facility?
> > > The problem that I've found is that the merged index that I created
> > > (solution 2) is pointing to the old segments. Can't I "update" the
> > > index to point to the new fresh merged segment?
> > > Shouldn't "./nutch mergesegs" create a merged index? I'm kind of
> > > confused with this.. :-)
> > >
> > > Best regards,
> > > Leonardo Barbosa.
> > >
> > > From nutch-user-return-53-apmail-incubator-nutch-user-archive=www.apache.org@incubator.apache.org Thu Mar 10 18:58:58 2005
> > >
> > > > Should I:
> > > >
> > > > 1) merge all the segments and then index them, or
> > > > 2) Should I index each segment individually and then merge the
> > > > indexes, keeping the segments separate. Or
> > > > 3) Should I index each segment separately, and keep both segments
> > > > and indexes separate, and search across multiple indexes (but I
> > > > have heard there are issues with the ranking)
> > >
> > > Option #3 is not really that great.  You get better performance
> > > with a merged index.  Option #1 would be more work with having to
> > > merge the segments, and I'm not sure that there is a real advantage
> > > to doing that over option #2.  Option #2 is what most people do.
> > >
> > > Luke
> > >
> > 
> > 
> 
> 
> -- 
> -----------------------------------------------------------------------
> -------------------
> Encumbered forever by desire and ambition
> There's a hunger still unsatisfied
> Our weary eyes still stray to the horizon
> Though down this road we've been so many times
> 
> Pink Floyd (David Gilmour/Polly Samson) - High Hopes
> -----------------------------------------------------------------------
> -------------------
> 
