Re: manually merging Directories

Shaun Senecal Tue, 30 Dec 2014 09:56:30 -0800

Excellent, this is pretty much exactly what I was looking for.  I agree with 
you on the use of hard links as well.  Sadly, HDFS doesn't support hard links 
yet (https://issues.apache.org/jira/browse/HDFS-3370), so even if this feature 
is implemented, I won't be able to use it, but it's still good to keep this in 
mind for future reference.



Thanks!

Shaun

________________________________________
From: Robert Muir <[email protected]>
Sent: December 30, 2014 9:36 AM
To: java-user
Subject: Re: manually merging Directories

FYI there is more discussion on
https://issues.apache.org/jira/browse/LUCENE-4746

In general, I don't like the idea that if things go wrong (which they
will), the input Directories would be left in a trashed state.

To me, hard links would be the correct solution, but Files.createLink
is an optional operation for a reason (I think it may require special
privs on windows).
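
A minimal sketch of the hard-link idea with a copy fallback, assuming plain
local java.nio paths (not HDFS, and not wired into a Lucene Directory). The
class name LinkOrCopy and the method linkOrCopy are hypothetical, used only
for illustration:

```java
import java.io.IOException;
import java.nio.file.FileAlreadyExistsException;
import java.nio.file.FileSystemException;
import java.nio.file.Files;
import java.nio.file.NoSuchFileException;
import java.nio.file.Path;

public class LinkOrCopy {

    /**
     * Tries to hard-link target to source; falls back to a byte copy when
     * the file system cannot or will not create the link.
     * Returns true if a hard link was created, false if the file was copied.
     */
    static boolean linkOrCopy(Path source, Path target) throws IOException {
        try {
            // Argument order is createLink(newLink, existingFile).
            Files.createLink(target, source);
            return true;
        } catch (FileAlreadyExistsException | NoSuchFileException e) {
            // Genuine errors: a copy would fail the same way, so rethrow.
            throw e;
        } catch (UnsupportedOperationException | FileSystemException e) {
            // Links unsupported, not permitted, or source and target are on
            // different file stores: pay the I/O cost and copy instead.
            Files.copy(source, target);
            return false;
        }
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("link-or-copy");
        Path source = Files.write(dir.resolve("_0.fdt"), new byte[] {1, 2, 3});
        Path target = dir.resolve("_1.fdt");
        boolean linked = linkOrCopy(source, target);
        System.out.println(linked ? "hard-linked" : "copied");
    }
}
```

Either way the source file is left untouched, so a failure part-way through
does not leave the input in a broken state.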

On Tue, Dec 30, 2014 at 12:24 PM, Shaun Senecal
<[email protected]> wrote:
> Ya, I already have that set up.  Thanks for the heads-up though!
>
> ________________________________________
> From: Uwe Schindler <[email protected]>
> Sent: December 30, 2014 5:22 AM
> To: [email protected]
> Subject: RE: manually merging Directories
>
> In addition, use NoMergePolicy to prevent automatic merging once the segments 
> have been added. :-)
>
> -----
> Uwe Schindler
> H.-H.-Meier-Allee 63, D-28213 Bremen
> http://www.thetaphi.de
> eMail: [email protected]
>
>
>> -----Original Message-----
>> From: Uwe Schindler [mailto:[email protected]]
>> Sent: Tuesday, December 30, 2014 2:20 PM
>> To: '[email protected]'
>> Subject: RE: manually merging Directories
>>
>> Hi Shaun,
>>
>> you can actually do this relatively simply. In fact, most of the files are
>> indeed copied as-is, so you can theoretically change the logic to make a
>> simple rename. Files that cannot be copied unmodified and need to be changed
>> by IndexWriter will be handled as usual.
>>
>> You don't need to patch Lucene for this: IndexWriter calls
>> Directory#copy(Directory to, String src, String dest, IOContext context) for
>> those files that can be copied unmodified. What you need to do is: Just
>> create an oal.store.FilterDirectory that wraps the original FSDirectory and
>> override this copy method on it to just do a rename, like:
>>
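>> import java.io.IOException;
>> import java.nio.file.Files;
>> import org.apache.lucene.store.Directory;
>> import org.apache.lucene.store.FSDirectory;
>> import org.apache.lucene.store.FilterDirectory;
>> import org.apache.lucene.store.IOContext;
>>
>> public class RenameInsteadCopyFilterDirectory extends FilterDirectory {
>>   public RenameInsteadCopyFilterDirectory(FSDirectory dir) {
>>     super(dir);
>>   }
>>
>>   // Called by IndexWriter for files that can be copied unmodified.
>>   @Override
>>   public void copy(Directory to, String src, String dest, IOContext context)
>>       throws IOException {
>>     if (!(to instanceof FSDirectory)) {
>>       throw new IOException("This only works for target FSDirectories");
>>     }
>>     final FSDirectory fromFS = (FSDirectory) this.getDelegate();
>>     final FSDirectory toFS = (FSDirectory) to;
>>     // Move the file instead of copying its bytes.
>>     Files.move(fromFS.getDirectory().resolve(src),
>>         toFS.getDirectory().resolve(dest));
>>   }
>> }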
>>
>> Please be aware that you have to wrap the "source" directory, because
>> IndexWriter's copySegmentAsIs() calls this method on the directory that’s
>> passed to addIndexes(Directory). Something like:
>>
>> writer.addIndexes(new RenameInsteadCopyFilterDirectory(originalDir));
>>
>> After that, all files that were not copied unmodified remain in the source
>> directory, but all those that are copied as-is are moved and disappear from
>> the source directory.
>>
>> Uwe
>>
>> -----
>> Uwe Schindler
>> H.-H.-Meier-Allee 63, D-28213 Bremen
>> http://www.thetaphi.de
>> eMail: [email protected]
>>
>>
>> > -----Original Message-----
>> > From: Shaun Senecal [mailto:[email protected]]
>> > Sent: Tuesday, December 30, 2014 12:37 AM
>> > To: Lucene Users
>> > Subject: Re: manually merging Directories
>> >
>> > Hi Mike
>> >
>> > That's actually what I was looking at doing, I was just hoping there
>> > was a way to avoid the "copySegmentAsIs" step and simply replace it with a
>> > "rename" operation on the file system.  It seemed like low hanging fruit, but
>> > Uwe and Erick have now told me that the segments have dependencies
>> > embedded in them somehow, so a simple rename operation wouldn't
>> > accomplish the same thing.  In the end, it may not be a big deal anyway.
>> >
>> >
>> > Thanks
>> >
>> > Shaun
>> >
>> >
>> > ________________________________________
>> > From: Michael McCandless <[email protected]>
>> > Sent: December 29, 2014 2:43 PM
>> > To: Lucene Users
>> > Subject: Re: manually merging Directories
>> >
>> > Why not use IW.addIndexes(Directory[])?
>> >
>> > Mike McCandless
>> >
>> > http://blog.mikemccandless.com
>> >
>> >
>> > On Mon, Dec 29, 2014 at 12:44 PM, Uwe Schindler <[email protected]>
>> > wrote:
>> > > Hi,
>> > >
>> > > Why not simply leave each index directory on the searcher nodes as is:
>> > > Move all index directories (as mentioned by you) to a local disk and
>> > > access them using a MultiReader - there is no need to merge them if you
>> > > do not have enough resources. If you have enough CPU and IO power, just
>> > > merge them as usual with IndexWriter.addIndexes(). But I don't understand
>> > > your argument about I/O: If you copy the index files from HDFS to local
>> > > disks already, how can this work without I/O? So you can merge them
>> > > anyway.
>> > >
>> > > Merging index files simply by copying them all into one directory is
>> > > impossible, because the files reference each other by segment name
>> > > (segments_n refers to them, and the segment ids are used all over).
>> > > So you would need to change some index files already for the merge to make
>> > > the SegmentInfos structures use the correct names, so you can do a
>> > > real merge anyway.
>> > >
>> > > Uwe
>> > >
>> > > -----
>> > > Uwe Schindler
>> > > H.-H.-Meier-Allee 63, D-28213 Bremen http://www.thetaphi.de
>> > > eMail: [email protected]
>> > >
>> > >
>> > >> -----Original Message-----
>> > >> From: Shaun Senecal [mailto:[email protected]]
>> > >> Sent: Monday, December 29, 2014 6:34 PM
>> > >> To: java-user
>> > >> Subject: Re: manually merging Directories
>> > >>
>> > >> I'm not worried about the I/O right now, I'm "hoping I can do
>> > >> better", that's all.  It sounds like the only actual complication
>> > >> here is building the segments_N file, which would list all of the
>> > >> newly renamed segments, so perhaps this isn't impossible.  That said,
>> > >> you're absolutely right about the possibility of complications, so
>> > >> it's debatable if doing something like this would be worth it in the
>> > >> end.  Thanks for the info
>> > >>
>> > >>
>> > >>
>> > >> Shaun
>> > >>
>> > >>
>> > >> ________________________________________
>> > >> From: Erick Erickson <[email protected]>
>> > >> Sent: December 23, 2014 5:55 PM
>> > >> To: java-user
>> > >> Subject: Re: manually merging Directories
>> > >>
>> > >> I doubt this is going to work. I have to ask why you're worried about
>> > >> the I/O; this smacks of premature optimization. Not only do the files
>> > >> have to be moved, but the right control structures need to be in
>> > >> place to inform Solr (well, Lucene) exactly what files are current.
>> > >> There's a lot of room for programming errors here....
>> > >>
>> > >> segments_n is the file that tells Lucene which segments are active.
>> > >> There can only be one that's active so you'd have to somehow combine
>> > >> them all.
>> > >>
>> > >> I think this is a dubious proposition at best, all to avoid some I/O.
>> > >> How much I/O are we talking here? If it's a huge amount, I'm not at
>> > >> all sure you'll be able to _use_ your merged index.
>> > >> How many docs are we talking about? 100M? 10B? I mean you used M/R on
>> > >> it in the first place for a reason....
>> > >>
>> > >> But this is what the --go-live option of the MapReduceIndexerTool
>> > >> already does for you. Admittedly, it copies things around the network
>> > >> to the final destination, personally I'd just use that.
>> > >>
>> > >> As you can tell, I don't know all the details to say it's impossible,
>> > >> but IMO this feels like wasted effort with lots of possibilities to
>> > >> get wrong for little demonstrated benefit. You'd spend a lot more
>> > >> time trying to figure out the correct thing to do and then fixing
>> > >> bugs than you'll spend waiting for the copy, HDFS or no.
>> > >>
>> > >> Best,
>> > >> Erick
>> > >>
>> > >> On Tue, Dec 23, 2014 at 2:55 PM, Shaun Senecal
>> > >> <[email protected]> wrote:
>> > >> > Hi
>> > >> >
>> > >> > I have a number of Directories which are stored in various paths on
>> > >> > HDFS, and I would like to merge them into a single index.  The obvious
>> > >> > way to do this is to use IndexWriter.addIndexes(...), however, I'm
>> > >> > hoping I can do better.  Since I have created each of the separate
>> > >> > indexes using Map/Reduce, I know that there are no deleted or duplicate
>> > >> > documents and the codecs are the same.  Using addIndexes(...) will
>> > >> > incur a lot of I/O as it copies from the source Directory into the
>> > >> > dest Directory, and this is the bit I would like to avoid.  Would it
>> > >> > instead be possible to simply move each of the segments from each
>> > >> > path into a single path on HDFS using a mv/rename operation?
>> > >> > Obviously I would need to take care of the naming to ensure the files
>> > >> > from one index don't overwrite another's, but it looks like this is
>> > >> > done with a counter of some sort so that the latest segment can be
>> > >> > found. A potential complication is the segments_1 file, as I'm not
>> > >> > sure what that is for or if I can easily (re)construct it externally.
>> > >> >
>> > >> > The end goal here is to index using Map/Reduce and then spit out a
>> > >> > single index in the end that has been merged down to a single segment,
>> > >> > and to minimize IO while doing it.  Once I have the completed index in
>> > >> > a single Directory, I can (optionally) perform the forced merge (which
>> > >> > will incur a huge IO hit).  If the forced merge isn't performed on
>> > >> > HDFS, it could be done on the search nodes before the active searcher
>> > >> > is switched.  This may be better if, for example, you know all of
>> > >> > your search nodes have SSDs and IO to spare.
>> > >> >
>> > >> > Just in case my explanation above wasn't clear enough, here is a
>> > >> > picture
>> > >> >
>> > >> > What I have:
>> > >> >
>> > >> > /user/username/MR_output/0
>> > >> >   _0.fdt
>> > >> >   _0.fdx
>> > >> >   _0.fnm
>> > >> >   _0.si
>> > >> >   ...
>> > >> >   segments_1
>> > >> >
>> > >> > /user/username/MR_output/1
>> > >> >   _0.fdt
>> > >> >   _0.fdx
>> > >> >   _0.fnm
>> > >> >   _0.si
>> > >> >   ...
>> > >> >   segments_1
>> > >> >
>> > >> >
>> > >> > What I want (using simple mv/rename):
>> > >> >
>> > >> > /user/username/merged
>> > >> >   _0.fdt
>> > >> >   _0.fdx
>> > >> >   _0.fnm
>> > >> >   _0.si
>> > >> >   ...
>> > >> >   _1.fdt
>> > >> >   _1.fdx
>> > >> >   _1.fnm
>> > >> >   _1.si
>> > >> >   ...
>> > >> >   segments_1
>> > >> >
>> > >> >
>> > >> >
>> > >> >
>> > >> > Thanks,
>> > >> >
>> > >> > Shaun
>> > >> >
>> > >>
>> > >> ---------------------------------------------------------------------
>> > >> To unsubscribe, e-mail: [email protected]
>> > >> For additional commands, e-mail: [email protected]
