Re: manually merging Directories

Shaun Senecal Tue, 30 Dec 2014 10:49:09 -0800

No problem at all.  It's just good to hear that what I was thinking is actually
possible!


________________________________________
From: Robert Muir <rcm...@gmail.com>
Sent: December 30, 2014 10:27 AM
To: java-user
Subject: Re: manually merging Directories

I will revisit this and look at hardlinks as an optimization. If they
aren't supported, we can just fall back to copying.

Sorry it will not help your case, but it would improve the situation
and can be done safely.
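
A minimal sketch of the "hard link, else fall back to copying" idea described above, using only the JDK. The class and method names (LinkOrCopy, linkOrCopy) are made up for illustration and are not part of Lucene; this says nothing about how Lucene itself would implement it.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper, not a Lucene class.
final class LinkOrCopy {

    /**
     * Tries to create dest as a hard link to src. If the platform or file
     * system does not support (or permit) hard links, copies the file instead.
     * The source file is left untouched in both cases.
     *
     * @return true if a hard link was created, false if the file was copied
     */
    static boolean linkOrCopy(Path src, Path dest) throws IOException {
        try {
            // Files.createLink is an optional operation: it may throw
            // UnsupportedOperationException, or fail with an IOException or
            // SecurityException where links are not allowed.
            Files.createLink(dest, src);
            return true;
        } catch (UnsupportedOperationException | IOException | SecurityException e) {
            // Fall back to a plain copy. If dest already exists, this throws
            // FileAlreadyExistsException and the error reaches the caller.
            Files.copy(src, dest);
            return false;
        }
    }
}
```

Because neither branch removes or modifies the source file, a failure part-way through leaves the input directory intact, which is the safety property a rename does not give you.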

On Tue, Dec 30, 2014 at 12:55 PM, Shaun Senecal
<shaun.sene...@lithium.com> wrote:
> Excellent, this is pretty much exactly what I was looking for.  I agree with 
> you on the use of hard links as well.  Sadly, HDFS doesn't support hard links 
> yet (https://issues.apache.org/jira/browse/HDFS-3370), so even if this 
> feature is implemented, I won't be able to use it, but it's still good to keep
> this in mind for future reference.
>
>
> Thanks!
>
> Shaun
>
> ________________________________________
> From: Robert Muir <rcm...@gmail.com>
> Sent: December 30, 2014 9:36 AM
> To: java-user
> Subject: Re: manually merging Directories
>
> FYI there is more discussion on
> https://issues.apache.org/jira/browse/LUCENE-4746
>
> In general, I don't like the idea that if things go wrong (which they
> will), the input Directories would be left in a trashed state.
>
> To me, hard links would be the correct solution, but Files.createLink
> is an optional operation for a reason (I think it may require special
> privileges on Windows).
>
> On Tue, Dec 30, 2014 at 12:24 PM, Shaun Senecal
> <shaun.sene...@lithium.com> wrote:
>> Ya, I already have that set up.  Thanks for the heads-up though!
>>
>> ________________________________________
>> From: Uwe Schindler <u...@thetaphi.de>
>> Sent: December 30, 2014 5:22 AM
>> To: java-user@lucene.apache.org
>> Subject: RE: manually merging Directories
>>
>> In addition, use NoMergePolicy to prevent automatic merging once the
>> segments have been added. :-)
>>
>> -----
>> Uwe Schindler
>> H.-H.-Meier-Allee 63, D-28213 Bremen
>> http://www.thetaphi.de
>> eMail: u...@thetaphi.de
>>
>>
>>> -----Original Message-----
>>> From: Uwe Schindler [mailto:u...@thetaphi.de]
>>> Sent: Tuesday, December 30, 2014 2:20 PM
>>> To: 'java-user@lucene.apache.org'
>>> Subject: RE: manually merging Directories
>>>
>>> Hi Shaun,
>>>
>>> you can actually do this relatively simply. In fact, most of the files are
>>> indeed copied as-is, so you can theoretically change the logic to do a simple
>>> rename. Files that cannot be copied unmodified and need to be changed by
>>> IndexWriter will be handled as usual.
>>>
>>> You don't need to patch Lucene for this: IndexWriter calls
>>> Directory#copy(Directory to, String src, String dest, IOContext context) for
>>> those files that can be copied unmodified. What you need to do is: just
>>> create an org.apache.lucene.store.FilterDirectory that wraps the original
>>> FSDirectory and implement this copy method on it to just do a rename, like:
>>>
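>>> import java.io.IOException;
>>> import java.nio.file.Files;
>>> import org.apache.lucene.store.Directory;
>>> import org.apache.lucene.store.FSDirectory;
>>> import org.apache.lucene.store.FilterDirectory;
>>> import org.apache.lucene.store.IOContext;
>>>
>>> public class RenameInsteadCopyFilterDirectory extends FilterDirectory {
>>>   public RenameInsteadCopyFilterDirectory(FSDirectory dir) {
>>>     super(dir);
>>>   }
>>>
>>>   public void copy(Directory to, String src, String dest, IOContext context)
>>>       throws IOException {
>>>     if (!(to instanceof FSDirectory)) {
>>>       throw new IOException("This only works for target FSDirectories");
>>>     }
>>>     final FSDirectory fromFS = (FSDirectory) this.getDelegate();
>>>     final FSDirectory toFS = (FSDirectory) to;
>>>     // Move the file instead of copying it; it disappears from the source.
>>>     Files.move(fromFS.getDirectory().resolve(src),
>>>         toFS.getDirectory().resolve(dest));
>>>   }
>>> }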
>>>
>>> Please be aware that you have to wrap the "source" directory, because
>>> IndexWriter's copySegmentAsIs() calls this method on the directory that's
>>> passed to addIndexes(Directory). Something like:
>>>
>>> writer.addIndexes(new RenameInsteadCopyFilterDirectory(originalDir));
>>>
>>> After that, all files that were not copied unmodified remain in the source
>>> directory, but all those that are copied as-is will have been moved and
>>> will disappear from the source directory.
>>>
>>> Uwe
>>>
>>> -----
>>> Uwe Schindler
>>> H.-H.-Meier-Allee 63, D-28213 Bremen
>>> http://www.thetaphi.de
>>> eMail: u...@thetaphi.de
>>>
>>>
>>> > -----Original Message-----
>>> > From: Shaun Senecal [mailto:shaun.sene...@lithium.com]
>>> > Sent: Tuesday, December 30, 2014 12:37 AM
>>> > To: Lucene Users
>>> > Subject: Re: manually merging Directories
>>> >
>>> > Hi Mike
>>> >
>>> > That's actually what I was looking at doing, I was just hoping there
>>> > was a way to avoid the "copySegmentAsIs" step and simply replace it with a
>>> > "rename" operation on the file system.  It seemed like low hanging fruit, but
>>> > Uwe and Erick have now told me that the segments have dependencies
>>> > embedded in them somehow, so a simple rename operation wouldn't
>>> > accomplish the same thing.  In the end, it may not be a big deal anyway.
>>> >
>>> >
>>> > Thanks
>>> >
>>> > Shaun
>>> >
>>> >
>>> > ________________________________________
>>> > From: Michael McCandless <luc...@mikemccandless.com>
>>> > Sent: December 29, 2014 2:43 PM
>>> > To: Lucene Users
>>> > Subject: Re: manually merging Directories
>>> >
>>> > Why not use IW.addIndexes(Directory[])?
>>> >
>>> > Mike McCandless
>>> >
>>> > http://blog.mikemccandless.com
>>> >
>>> >
>>> > On Mon, Dec 29, 2014 at 12:44 PM, Uwe Schindler <u...@thetaphi.de>
>>> > wrote:
>>> > > Hi,
>>> > >
>>> > > Why not simply leave each index directory on the searcher nodes as is:
>>> > > Move all index directories (as mentioned by you) to a local disk and
>>> > > access them using a MultiReader - there is no need to merge them if you
>>> > > do not have enough resources. If you have enough CPU and IO power, just
>>> > > merge them as usual with IndexWriter.addIndexes(). But I don't understand
>>> > > your argument about I/O: If you copy the index files from HDFS to local
>>> > > disks already, how can this work without I/O? So you can merge them
>>> > > anyway.
>>> > >
>>> > > Merging index files simply by copying them all into one directory is
>>> > > impossible, because the files reference each other by segment name
>>> > > (segments_n refers to them, and the segment ids are used all over).
>>> > > So you would already need to change some index files for the merge to
>>> > > make the SegmentInfos structures use the correct names, so you can do a
>>> > > real merge anyway.
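
To illustrate the naming problem described above, here is a small JDK-only sketch that lists the file names two index directories have in common. The class and method names (SegmentNameClash, fileNames, clashes) are hypothetical and nothing here touches Lucene. It only shows that per-index names such as _0.fdt and segments_1 collide; it does not address the references stored inside the files.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical helper, not a Lucene class.
final class SegmentNameClash {

    /** Returns the sorted names of all entries directly inside dir. */
    static Set<String> fileNames(Path dir) throws IOException {
        Set<String> names = new TreeSet<>();
        try (DirectoryStream<Path> stream = Files.newDirectoryStream(dir)) {
            for (Path p : stream) {
                names.add(p.getFileName().toString());
            }
        }
        return names;
    }

    /** Returns the names that exist in both directories and would overwrite each other. */
    static Set<String> clashes(Path a, Path b) throws IOException {
        Set<String> common = fileNames(a);
        common.retainAll(fileNames(b));
        return common;
    }
}
```

For two freshly built single-segment indexes, every file name is likely to appear in this set, which is why a plain mv into one directory cannot work without renaming.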
>>> > >
>>> > > Uwe
>>> > >
>>> > > -----
>>> > > Uwe Schindler
>>> > > H.-H.-Meier-Allee 63, D-28213 Bremen http://www.thetaphi.de
>>> > > eMail: u...@thetaphi.de
>>> > >
>>> > >
>>> > >> -----Original Message-----
>>> > >> From: Shaun Senecal [mailto:shaun.sene...@lithium.com]
>>> > >> Sent: Monday, December 29, 2014 6:34 PM
>>> > >> To: java-user
>>> > >> Subject: Re: manually merging Directories
>>> > >>
>>> > >> I'm not worried about the I/O right now, I'm "hoping I can do
>>> > >> better", that's all.  It sounds like the only actual complication
>>> > >> here is building the segments_N file, which would list all of the
>>> > >> newly renamed segments, so perhaps this isn't impossible.  That said,
>>> > >> you're absolutely right about the possibility of complications, so
>>> > >> it's debatable whether doing something like this would be worth it in the
>>> > >> end.  Thanks for the info
>>> > >>
>>> > >>
>>> > >>
>>> > >> Shaun
>>> > >>
>>> > >>
>>> > >> ________________________________________
>>> > >> From: Erick Erickson <erickerick...@gmail.com>
>>> > >> Sent: December 23, 2014 5:55 PM
>>> > >> To: java-user
>>> > >> Subject: Re: manually merging Directories
>>> > >>
>>> > >> I doubt this is going to work. I have to ask why you're worried about
>>> > >> the I/O; this smacks of premature optimization. Not only do the files
>>> > >> have to be moved, but the right control structures need to be in
>>> > >> place to inform Solr (well, Lucene) exactly what files are current.
>>> > >> There's a lot of room for programming errors here....
>>> > >>
>>> > >> segments_n is the file that tells Lucene which segments are active.
>>> > >> There can only be one that's active so you'd have to somehow combine
>>> > >> them all.
>>> > >>
>>> > >> I think this is a dubious proposition at best, all to avoid some I/O.
>>> > >> How much I/O are we talking here? If it's a huge amount, I'm not at
>>> > >> all sure you'll be able to _use_ your merged index.
>>> > >> How many docs are we talking about? 100M? 10B? I mean you used M/R
>>> > >> on it in the first place for a reason....
>>> > >>
>>> > >> But this is what the --go-live option of the MapReduceIndexerTool
>>> > >> already does for you. Admittedly, it copies things around the network
>>> > >> to the final destination, but personally I'd just use that.
>>> > >>
>>> > >> As you can tell, I don't know all the details well enough to say it's
>>> > >> impossible, but IMO this feels like wasted effort with lots of
>>> > >> possibilities to get wrong for little demonstrated benefit. You'd spend
>>> > >> a lot more time trying to figure out the correct thing to do and then
>>> > >> fixing bugs than you'll spend waiting for the copy, HDFS or no.
>>> > >>
>>> > >> Best,
>>> > >> Erick
>>> > >>
>>> > >> On Tue, Dec 23, 2014 at 2:55 PM, Shaun Senecal
>>> > >> <shaun.sene...@lithium.com> wrote:
>>> > >> > Hi
>>> > >> >
>>> > >> > I have a number of Directories which are stored in various paths on
>>> > >> > HDFS, and I would like to merge them into a single index.  The obvious
>>> > >> > way to do this is to use IndexWriter.addIndexes(...), however, I'm
>>> > >> > hoping I can do better.  Since I have created each of the separate
>>> > >> > indexes using Map/Reduce, I know that there are no deleted or duplicate
>>> > >> > documents and the codecs are the same.  Using addIndexes(...) will
>>> > >> > incur a lot of I/O as it copies from the source Directory into the
>>> > >> > dest Directory, and this is the bit I would like to avoid.  Would it
>>> > >> > instead be possible to simply move each of the segments from each
>>> > >> > path into a single path on HDFS using a mv/rename operation?
>>> > >> > Obviously I would need to take care of the naming to ensure the files
>>> > >> > from one index don't overwrite another's, but it looks like this is
>>> > >> > done with a counter of some sort so that the latest segment can be
>>> > >> > found. A potential complication is the segments_1 file, as I'm not sure
>>> > >> > what that is for or if I can easily (re)construct it externally.
>>> > >> >
>>> > >> > The end goal here is to index using Map/Reduce and then spit out a
>>> > >> > single index in the end that has been merged down to a single segment,
>>> > >> > and to minimize IO while doing it.  Once I have the completed index in a
>>> > >> > single Directory, I can (optionally) perform the forced merge (which
>>> > >> > will incur a huge IO hit).  If the forced merge isn't performed on
>>> > >> > HDFS, it could be done on the search nodes before the active searcher
>>> > >> > is switched.  This may be better if, for example, you know all of
>>> > >> > your search nodes have SSDs and IO to spare.
>>> > >> >
>>> > >> > Just in case my explanation above wasn't clear enough, here is a
>>> > >> > picture:
>>> > >> >
>>> > >> > What I have:
>>> > >> >
>>> > >> > /user/username/MR_output/0
>>> > >> >   _0.fdt
>>> > >> >   _0.fdx
>>> > >> >   _0.fnm
>>> > >> >   _0.si
>>> > >> >   ...
>>> > >> >   segments_1
>>> > >> >
>>> > >> > /user/username/MR_output/1
>>> > >> >   _0.fdt
>>> > >> >   _0.fdx
>>> > >> >   _0.fnm
>>> > >> >   _0.si
>>> > >> >   ...
>>> > >> >   segments_1
>>> > >> >
>>> > >> >
>>> > >> > What I want (using simple mv/rename):
>>> > >> >
>>> > >> > /user/username/merged
>>> > >> >   _0.fdt
>>> > >> >   _0.fdx
>>> > >> >   _0.fnm
>>> > >> >   _0.si
>>> > >> >   ...
>>> > >> >   _1.fdt
>>> > >> >   _1.fdx
>>> > >> >   _1.fnm
>>> > >> >   _1.si
>>> > >> >   ...
>>> > >> >   segments_1
>>> > >> >
>>> > >> >
>>> > >> >
>>> > >> >
>>> > >> > Thanks,
>>> > >> >
>>> > >> > Shaun
>>> > >> >
>>> > >>
>>> > >> ---------------------------------------------------------------------
>>> > >> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
>>> > >> For additional commands, e-mail: java-user-h...@lucene.apache.org