Hi Uwe,

Ok, that's good to know.  I thought each segment operated independently, so
this might have been possible, but if there are inter-dependencies between the
segments then it's certainly not worth the trouble.  I will scratch that off
the list of possibilities.

As you pointed out, doing the merge on the search side is fine, and has worked 
out for me in the past.  You have to be a little more careful with GC causing 
problems in maintaining your SLA if you have a large number of segments to be 
added/merged, but with the design I have in mind that shouldn't be a problem.  
This is one of the avenues I am considering.

You're correct about the HDFS copy to local requiring IO, and I am not 
discounting this.  However, that IO is additional to the IO required for the 
addIndexes/merge phase, and it was simply the addIndexes/merge phase that I 
thought had potential room for improvement.  At this point, it seems that is 
not the case, so I will just move on.

Thanks for the help



Shaun


________________________________________
From: Uwe Schindler <u...@thetaphi.de>
Sent: December 29, 2014 9:44 AM
To: java-user@lucene.apache.org
Subject: RE: manually merging Directories

Hi,

Why not simply leave each index directory on the searcher nodes as is:
move all index directories (as you mentioned) to a local disk and access
them using a MultiReader - there is no need to merge them if you don't have
enough resources. If you have enough CPU and I/O power, just merge them as
usual with IndexWriter.addIndexes(). But I don't understand your argument
about I/O: if you are already copying the index files from HDFS to local
disks, how can this work without I/O? So you can merge them anyway.
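To make the MultiReader suggestion concrete, here is a minimal sketch (assuming a Lucene 5+ style API; the local paths are hypothetical) that opens each copied index directory and searches them as one logical index, with no merge at all:

```java
import java.io.IOException;
import java.nio.file.Paths;

import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.MultiReader;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.store.FSDirectory;

public class MultiReaderSearch {
    public static void main(String[] args) throws IOException {
        // Hypothetical local paths, one per index copied down from HDFS.
        String[] paths = { "/data/index/0", "/data/index/1" };

        DirectoryReader[] readers = new DirectoryReader[paths.length];
        for (int i = 0; i < paths.length; i++) {
            readers[i] = DirectoryReader.open(FSDirectory.open(Paths.get(paths[i])));
        }

        // MultiReader presents the sub-indexes as one logical index;
        // nothing is rewritten, so there is no merge I/O at all.
        MultiReader multi = new MultiReader(readers);
        IndexSearcher searcher = new IndexSearcher(multi);
        // ... run queries against searcher ...
        multi.close(); // this constructor's MultiReader also closes the sub-readers
    }
}
```

The trade-off is that every query fans out across all sub-indexes, so search-time cost grows with the number of directories you leave unmerged.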

Merging indexes simply by copying all of their files into one directory is
impossible, because the files reference each other by segment name (segments_N
refers to them, and the segment IDs are used throughout). So you would already
need to rewrite some index files just to make the SegmentInfos structures use
the correct names - at which point you might as well do a real merge.
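For the real merge, IndexWriter.addIndexes() takes care of exactly the renaming problem described above: it copies the source segments into the destination and rewrites the segment names and segments_N file so all references stay consistent. A minimal sketch (Lucene 5+ style API, hypothetical paths):

```java
import java.io.IOException;
import java.nio.file.Paths;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

public class MergeIndexes {
    public static void main(String[] args) throws IOException {
        // Hypothetical source directories, already copied down from HDFS.
        Directory src0 = FSDirectory.open(Paths.get("/data/index/0"));
        Directory src1 = FSDirectory.open(Paths.get("/data/index/1"));
        Directory dest = FSDirectory.open(Paths.get("/data/merged"));

        IndexWriterConfig config = new IndexWriterConfig(new StandardAnalyzer());
        try (IndexWriter writer = new IndexWriter(dest, config)) {
            // Copies the source segments into dest, assigning fresh segment
            // names and writing a new segments_N that references them.
            writer.addIndexes(src0, src1);
            // Optionally collapse everything into one segment (heavy I/O):
            // writer.forceMerge(1);
        }
    }
}
```

This copy is the I/O cost under discussion: addIndexes() rewrites the segment files rather than moving them in place.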

Uwe

-----
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: u...@thetaphi.de


> -----Original Message-----
> From: Shaun Senecal [mailto:shaun.sene...@lithium.com]
> Sent: Monday, December 29, 2014 6:34 PM
> To: java-user
> Subject: Re: manually merging Directories
>
> I'm not worried about the I/O right now; I'm "hoping I can do better", that's
> all.  It sounds like the only real complication here is building the
> segments_N file, which would list all of the newly renamed segments, so
> perhaps this isn't impossible.  That said, you're absolutely right about the
> possibility of complications, so it's debatable whether doing something like
> this would be worth it in the end.  Thanks for the info
>
>
>
> Shaun
>
>
> ________________________________________
> From: Erick Erickson <erickerick...@gmail.com>
> Sent: December 23, 2014 5:55 PM
> To: java-user
> Subject: Re: manually merging Directories
>
> I doubt this is going to work. I have to ask why you're worried about the I/O;
> this smacks of premature optimization. Not only do the files have to be
> moved, but the right control structures need to be in place to inform Solr
> (well, Lucene) exactly what files are current. There's a lot of room for
> programming errors here....
>
> segments_n is the file that tells Lucene which segments are active. There can
> only be one that's active so you'd have to somehow combine them all.
>
> I think this is a dubious proposition at best, all to avoid some I/O. How much
> I/O are we talking here? If it's a huge amount, I'm not at all sure you'll be 
> able
> to _use_ your merged index.
> How many docs are we talking about? 100M? 10B? I mean you used M/R on it
> in the first place for a reason....
>
> But this is what the --go-live option of the MapReduceIndexerTool already
> does for you. Admittedly, it copies things around the network to the final
> destination, but personally I'd just use that.
>
> As you can tell, I don't know all the details well enough to say it's
> impossible, but IMO this feels like wasted effort with lots of ways to go
> wrong for little demonstrated benefit. You'd spend a lot more time trying to
> figure out the correct thing to do and then fixing bugs than you'll spend
> waiting for the copy, HDFS or no.
>
> Best,
> Erick
>
> On Tue, Dec 23, 2014 at 2:55 PM, Shaun Senecal
> <shaun.sene...@lithium.com> wrote:
> > Hi
> >
> > I have a number of Directories which are stored in various paths on HDFS,
> and I would like to merge them into a single index.  The obvious way to do
> this is to use IndexWriter.addIndexes(...), however, I'm hoping I can do
> better.  Since I have created each of the separate indexes using
> Map/Reduce, I know that there are no deleted or duplicate documents and
> the codecs are the same.  Using addIndexes(...) will incur a lot of I/O as it
> copies from the source Directory into the dest Directory, and this is the bit 
> I
> would like to avoid.  Would it instead be possible to simply move each of the
> segments from each path into a single path on HDFS using a mv/rename
> operation instead?  Obviously I would need to take care of the naming to
> > ensure the files from one index don't overwrite another's, but it looks like
> this is done with a counter of some sort so that the latest segment can be
> found. A potential complication is the segments_1 file, as I'm not sure what
> that is for or if I can easily (re)construct it externally.
> >
> > The end goal here is to index using Map/Reduce and then spit out a single
> index in the end that has been merged down to a single segment, and to
> minimize IO while doing it.  Once I have the completed index in a single
> Directory, I can (optionally) perform the forced merge (which will incur a
> huge IO hit).  If the forced merge isn't performed on HDFS, it could be done
> on the search nodes before the active searcher is switched.  This may be
> better if, for example, you know all of your search nodes have SSDs and IO to
> spare.
> >
> > Just in case my explanation above wasn't clear enough, here is a
> > picture
> >
> > What I have:
> >
> > /user/username/MR_output/0
> >   _0.fdt
> >   _0.fdx
> >   _0.fnm
> >   _0.si
> >   ...
> >   segments_1
> >
> > /user/username/MR_output/1
> >   _0.fdt
> >   _0.fdx
> >   _0.fnm
> >   _0.si
> >   ...
> >   segments_1
> >
> >
> > What I want (using simple mv/rename):
> >
> > /user/username/merged
> >   _0.fdt
> >   _0.fdx
> >   _0.fnm
> >   _0.si
> >   ...
> >   _1.fdt
> >   _1.fdx
> >   _1.fnm
> >   _1.si
> >   ...
> >   segments_1
> >
> >
> >
> >
> > Thanks,
> >
> > Shaun
> >
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-user-h...@lucene.apache.org
>

