I doubt this is going to work. I have to ask why you're
worried about the I/O; this smacks of premature
optimization. Not only do the files have to be moved, but
the right control structures need to be in place to inform
Solr (well, Lucene) exactly what files are current. There's
a lot of room for programming errors here....

segments_N is the file that tells Lucene which segments make up the
current commit. There can only be one active segments_N per index, so
you'd have to somehow combine the ones from all of your source
directories into a single, consistent commit.
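
To make that concrete, here's a rough sketch (assuming a recent Lucene
release; the SegmentInfos API has moved around between versions) that
just prints what the latest segments_N commit references. This is the
bookkeeping you'd be reconstructing by hand if you mv'd segment files
together:

// Rough sketch, assuming a recent Lucene release: list the segments
// referenced by the latest segments_N commit in an index directory.
// The path is illustrative.
import org.apache.lucene.index.SegmentCommitInfo;
import org.apache.lucene.index.SegmentInfos;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import java.nio.file.Paths;

public class ShowCommit {
  public static void main(String[] args) throws Exception {
    try (Directory dir = FSDirectory.open(Paths.get(args[0]))) {
      SegmentInfos infos = SegmentInfos.readLatestCommit(dir);
      System.out.println("commit generation: " + infos.getGeneration());
      for (SegmentCommitInfo sci : infos) {
        System.out.println(sci.info.name + " maxDoc=" + sci.info.maxDoc());
      }
    }
  }
}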

I think this is a dubious proposition at best, all to avoid some
I/O. How much I/O are we actually talking about here? If it's a huge
amount, I'm not at all sure you'll be able to _use_ your merged index
anyway (an index that big is unlikely to be servable from a single
node). How many docs are we talking about? 100M? 10B? I mean, you
used M/R in the first place for a reason....

But this is what the --go-live option of the MapReduceIndexerTool
already does for you. Admittedly, it copies things around the
network to the final destination, but personally I'd just use it anyway.
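
For reference, a go-live run looks roughly like the sketch below.
MapReduceIndexerTool is normally launched with "hadoop jar", but it
follows the standard Hadoop Tool pattern, so it can also be driven
from Java. All of the hosts, paths and the morphline file here are
made-up placeholders, not something I've checked against your setup:

// Hedged sketch only: hosts, paths, and morphline.conf are illustrative
// assumptions. The tool ships in the Solr map-reduce contrib.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.util.ToolRunner;
import org.apache.solr.hadoop.MapReduceIndexerTool;

public class GoLiveExample {
  public static void main(String[] args) throws Exception {
    String[] toolArgs = {
        "--morphline-file", "morphline.conf",
        "--output-dir", "hdfs://namenode:8020/user/username/MR_output",
        "--zk-host", "zk1:2181/solr",
        "--collection", "collection1",
        "--go-live",                                // merge built shards into the live collection
        "hdfs://namenode:8020/user/username/input"  // input files to index
    };
    System.exit(ToolRunner.run(new Configuration(), new MapReduceIndexerTool(), toolArgs));
  }
}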

As you can tell, I don't know enough of the details to say it's
impossible, but IMO this feels like wasted effort with lots of ways
to get it wrong for little demonstrated benefit. You'd spend far more
time trying to figure out the correct thing to do and then fixing
bugs than you'd spend waiting for the copy, HDFS or no.
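
And for comparison, the supported addIndexes(...) route you mentioned
is only a few lines. A minimal sketch, assuming a recent Lucene
release and using local FSDirectory for brevity (on HDFS you'd
substitute an HDFS-backed Directory implementation); the paths are
placeholders:

// Minimal sketch of IndexWriter.addIndexes(...); paths and analyzer are
// illustrative, and forceMerge(1) is the optional (expensive) merge down
// to a single segment.
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import java.nio.file.Paths;

public class MergeIndexes {
  public static void main(String[] args) throws Exception {
    try (Directory dest = FSDirectory.open(Paths.get("/indexes/merged"));
         Directory src0 = FSDirectory.open(Paths.get("/indexes/part-0"));
         Directory src1 = FSDirectory.open(Paths.get("/indexes/part-1"));
         IndexWriter writer = new IndexWriter(dest,
             new IndexWriterConfig(new StandardAnalyzer()))) {
      writer.addIndexes(src0, src1);   // copies the source segments into dest
      // writer.forceMerge(1);         // optional: single segment, big I/O hit
      writer.commit();
    }
  }
}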

Best,
Erick

On Tue, Dec 23, 2014 at 2:55 PM, Shaun Senecal
<shaun.sene...@lithium.com> wrote:
> Hi
>
> I have a number of Directories which are stored in various paths on HDFS, and 
> I would like to merge them into a single index.  The obvious way to do this 
> is to use IndexWriter.addIndexes(...); however, I'm hoping I can do better.  
> Since I have created each of the separate indexes using Map/Reduce, I know 
> that there are no deleted or duplicate documents and the codecs are the same. 
>  Using addIndexes(...) will incur a lot of I/O as it copies from the source 
> Directory into the dest Directory, and this is the bit I would like to avoid. 
>  Would it instead be possible to simply move the segments from each 
> path into a single path on HDFS using a mv/rename operation?  
> Obviously I would need to take care of the naming to ensure the files from 
> one index don't overwrite another's, but it looks like this is done with a 
> counter of some sort so that the latest segment can be found. A potential 
> complication is the segments_1 file, as I'm not sure what that is for or if I 
> can easily (re)construct it externally.
>
> The end goal here is to index using Map/Reduce and then spit out a single 
> index in the end that has been merged down to a single segment, and to 
> minimize IO while doing it.  Once I have the completed index in a single 
> Directory, I can (optionally) perform the forced merge (which will incur a 
> huge IO hit).  If the forced merge isn't performed on HDFS, it could be done 
> on the search nodes before the active searcher is switched.  This may be 
> better if, for example, you know all of your search nodes have SSDs and IO to 
> spare.
>
> Just in case my explanation above wasn't clear enough, here is a picture
>
> What I have:
>
> /user/username/MR_output/0
>   _0.fdt
>   _0.fdx
>   _0.fnm
>   _0.si
>   ...
>   segments_1
>
> /user/username/MR_output/1
>   _0.fdt
>   _0.fdx
>   _0.fnm
>   _0.si
>   ...
>   segments_1
>
>
> What I want (using simple mv/rename):
>
> /user/username/merged
>   _0.fdt
>   _0.fdx
>   _0.fnm
>   _0.si
>   ...
>   _1.fdt
>   _1.fdx
>   _1.fnm
>   _1.si
>   ...
>   segments_1
>
>
>
>
> Thanks,
>
> Shaun
>

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org
