Hi Mike

That's actually what I was looking at doing; I was just hoping there was a way 
to avoid the "copySegmentAsIs" step and simply replace it with a "rename" 
operation on the file system.  It seemed like low-hanging fruit, but Uwe and 
Erick have now told me that the segments have dependencies embedded in them 
somehow, so a simple rename operation wouldn't accomplish the same thing.  In 
the end, it may not be a big deal anyway.


Thanks

Shaun


________________________________________
From: Michael McCandless <luc...@mikemccandless.com>
Sent: December 29, 2014 2:43 PM
To: Lucene Users
Subject: Re: manually merging Directories

Why not use IW.addIndexes(Directory[])?
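
If it helps, here is a minimal sketch of that approach (assuming Lucene 4.10.x
and local FSDirectory paths; the class name and argument handling are just
placeholders):

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.IndexWriterConfig;
    import org.apache.lucene.store.Directory;
    import org.apache.lucene.store.FSDirectory;
    import org.apache.lucene.util.Version;

    import java.io.File;

    public class MergeIndexes {
      public static void main(String[] args) throws Exception {
        // last argument is the destination index, the rest are sources
        Directory dest = FSDirectory.open(new File(args[args.length - 1]));
        Directory[] sources = new Directory[args.length - 1];
        for (int i = 0; i < sources.length; i++) {
          sources[i] = FSDirectory.open(new File(args[i]));
        }

        // the analyzer is not used by addIndexes, so any analyzer will do
        IndexWriterConfig iwc =
            new IndexWriterConfig(Version.LATEST, new StandardAnalyzer());
        try (IndexWriter writer = new IndexWriter(dest, iwc)) {
          // copies the source segments into dest without re-indexing documents
          writer.addIndexes(sources);
          writer.commit();
        }
      }
    }

This still copies the segment files, but it takes care of renaming them and
writing the new segments_N for you.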

Mike McCandless

http://blog.mikemccandless.com


On Mon, Dec 29, 2014 at 12:44 PM, Uwe Schindler <u...@thetaphi.de> wrote:
> Hi,
>
> Why not simply leave each index directory on the searcher nodes as is:
> move all index directories (as mentioned by you) to a local disk and access 
> them using a MultiReader - there is no need to merge them if you don't have 
> enough resources. If you have enough CPU and I/O power, just merge them as 
> usual with IndexWriter.addIndexes(). But I don't understand your argument 
> about I/O: if you copy the index files from HDFS to local disks already, how 
> can this work without I/O? So you can merge them anyway.
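>
> The MultiReader route looks roughly like this (just a sketch; the local
> paths below are placeholders for wherever you copy the indexes to):
>
>     import org.apache.lucene.index.DirectoryReader;
>     import org.apache.lucene.index.MultiReader;
>     import org.apache.lucene.search.IndexSearcher;
>     import org.apache.lucene.store.FSDirectory;
>
>     import java.io.File;
>
>     // open each copied index directory with its own reader
>     DirectoryReader r0 = DirectoryReader.open(FSDirectory.open(new File("/local/indexes/0")));
>     DirectoryReader r1 = DirectoryReader.open(FSDirectory.open(new File("/local/indexes/1")));
>
>     // MultiReader presents the sub-readers as one logical index:
>     // no merging, no extra copying
>     MultiReader multi = new MultiReader(r0, r1);
>     IndexSearcher searcher = new IndexSearcher(multi);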
>
> Merging index files simply by copying them all into one directory is 
> impossible, because the files reference each other by segment name 
> (segments_N refers to them by name, and the segment IDs are used all over). 
> So you would already need to change some index files to make the 
> SegmentInfos structures use the correct names, at which point you may as 
> well do a real merge anyway.
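>
> You can see those references by dumping the SegmentInfos of one of your
> directories (a rough sketch against the Lucene 4.x API - the exact calls
> differ between versions, and the local path is a placeholder):
>
>     import org.apache.lucene.index.SegmentCommitInfo;
>     import org.apache.lucene.index.SegmentInfos;
>     import org.apache.lucene.store.Directory;
>     import org.apache.lucene.store.FSDirectory;
>
>     import java.io.File;
>
>     Directory dir = FSDirectory.open(new File("/local/copy/of/one/index"));
>     SegmentInfos infos = new SegmentInfos();
>     infos.read(dir);  // loads the most recent segments_N file
>     for (SegmentCommitInfo sci : infos) {
>       // each entry records the segment name (_0, _1, ...) that the other
>       // files in the directory are keyed on
>       System.out.println(sci.info.name + " docCount=" + sci.info.getDocCount());
>     }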
>
> Uwe
>
> -----
> Uwe Schindler
> H.-H.-Meier-Allee 63, D-28213 Bremen
> http://www.thetaphi.de
> eMail: u...@thetaphi.de
>
>
>> -----Original Message-----
>> From: Shaun Senecal [mailto:shaun.sene...@lithium.com]
>> Sent: Monday, December 29, 2014 6:34 PM
>> To: java-user
>> Subject: Re: manually merging Directories
>>
>> I'm not worried about the I/O right now; I'm "hoping I can do better", that's
>> all.  It sounds like the only actual complication here is building the
>> segments_N file, which would list all of the newly renamed segments, so
>> perhaps this isn't impossible.  That said, you're absolutely right about the
>> possibility of complications, so it's debatable whether doing something like
>> this would be worth it in the end.  Thanks for the info
>>
>>
>>
>> Shaun
>>
>>
>> ________________________________________
>> From: Erick Erickson <erickerick...@gmail.com>
>> Sent: December 23, 2014 5:55 PM
>> To: java-user
>> Subject: Re: manually merging Directories
>>
>> I doubt this is going to work. I have to ask why you're worried about the 
>> I/O;
>> this smacks of premature optimization. Not only do the files have to be
>> moved, but the right control structures need to be in place to inform Solr
>> (well, Lucene) exactly what files are current. There's a lot of room for
>> programming errors here....
>>
>> segments_N is the file that tells Lucene which segments are active. There can
>> only be one active segments_N file, so you'd have to somehow combine them all.
>>
>> I think this is a dubious proposition at best, all to avoid some I/O. How 
>> much
>> I/O are we talking here? If it's a huge amount, I'm not at all sure you'll 
>> be able
>> to _use_ your merged index.
>> How many docs are we talking about? 100M? 10B? I mean you used M/R on it
>> in the first place for a reason....
>>
>> But this is what the --go-live option of the MapReduceIndexerTool already
>> does for you. Admittedly, it copies things around the network to the final
>> destination, but personally I'd just use that.
>>
>> As you can tell, I don't know all the details well enough to say it's
>> impossible, but IMO this feels like wasted effort with lots of possibilities
>> to get wrong for little demonstrated benefit. You'd spend a lot more time
>> trying to figure out the correct thing to do and then fixing bugs than you'll
>> spend waiting for the copy, HDFS or not.
>>
>> Best,
>> Erick
>>
>> On Tue, Dec 23, 2014 at 2:55 PM, Shaun Senecal
>> <shaun.sene...@lithium.com> wrote:
>> > Hi
>> >
>> > I have a number of Directories which are stored in various paths on HDFS,
>> > and I would like to merge them into a single index.  The obvious way to do
>> > this is to use IndexWriter.addIndexes(...); however, I'm hoping I can do
>> > better.  Since I have created each of the separate indexes using
>> > Map/Reduce, I know that there are no deleted or duplicate documents and
>> > the codecs are the same.  Using addIndexes(...) will incur a lot of I/O as
>> > it copies from the source Directory into the dest Directory, and this is
>> > the bit I would like to avoid.  Would it instead be possible to simply
>> > move each of the segments from each path into a single path on HDFS using
>> > a mv/rename operation?  Obviously I would need to take care of the naming
>> > to ensure the files from one index don't overwrite another's, but it looks
>> > like this is done with a counter of some sort so that the latest segment
>> > can be found.  A potential complication is the segments_1 file, as I'm not
>> > sure what that is for or whether I can easily (re)construct it externally.
>> >
>> > The end goal here is to index using Map/Reduce and then spit out a single
>> > index that has been merged down to a single segment, while minimizing I/O
>> > along the way.  Once I have the completed index in a single Directory, I
>> > can (optionally) perform the forced merge (which will incur a huge I/O
>> > hit).  If the forced merge isn't performed on HDFS, it could be done on
>> > the search nodes before the active searcher is switched.  This may be
>> > better if, for example, you know all of your search nodes have SSDs and
>> > I/O to spare.
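>> >
>> > For that optional forced merge on a search node, I was picturing something
>> > like this (only a sketch; the local path and config are placeholders):
>> >
>> >     import org.apache.lucene.analysis.standard.StandardAnalyzer;
>> >     import org.apache.lucene.index.IndexWriter;
>> >     import org.apache.lucene.index.IndexWriterConfig;
>> >     import org.apache.lucene.store.Directory;
>> >     import org.apache.lucene.store.FSDirectory;
>> >     import org.apache.lucene.util.Version;
>> >
>> >     import java.io.File;
>> >
>> >     Directory dir = FSDirectory.open(new File("/local/merged/index"));
>> >     IndexWriterConfig iwc =
>> >         new IndexWriterConfig(Version.LATEST, new StandardAnalyzer());
>> >     try (IndexWriter writer = new IndexWriter(dir, iwc)) {
>> >       // rewrites everything into a single segment - this is the big I/O hit
>> >       writer.forceMerge(1);
>> >       writer.commit();
>> >     }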
>> >
>> > Just in case my explanation above wasn't clear enough, here is a
>> > picture:
>> >
>> > What I have:
>> >
>> > /user/username/MR_output/0
>> >   _0.fdt
>> >   _0.fdx
>> >   _0.fnm
>> >   _0.si
>> >   ...
>> >   segments_1
>> >
>> > /user/username/MR_output/1
>> >   _0.fdt
>> >   _0.fdx
>> >   _0.fnm
>> >   _0.si
>> >   ...
>> >   segments_1
>> >
>> >
>> > What I want (using simple mv/rename):
>> >
>> > /user/username/merged
>> >   _0.fdt
>> >   _0.fdx
>> >   _0.fnm
>> >   _0.si
>> >   ...
>> >   _1.fdt
>> >   _1.fdx
>> >   _1.fnm
>> >   _1.si
>> >   ...
>> >   segments_1
>> >
>> >
>> >
>> >
>> > Thanks,
>> >
>> > Shaun
>> >
>>
