Excellent, this is pretty much exactly what I was looking for. I agree with you on the use of hard links as well. Sadly, HDFS doesn't support hard links yet (https://issues.apache.org/jira/browse/HDFS-3370), so even if this feature is implemented, I won't be able to use it, but it's still good to keep this in mind for future reference.
Thanks!

Shaun

________________________________________
From: Robert Muir <rcm...@gmail.com>
Sent: December 30, 2014 9:36 AM
To: java-user
Subject: Re: manually merging Directories

FYI there is more discussion on
https://issues.apache.org/jira/browse/LUCENE-4746

In general, I don't like the idea that if things go wrong (which they
will), the input Directories would be left in a trashed state.

To me, hard links would be the correct solution, but Files.createLink
is an optional operation for a reason (I think it may require special
privs on Windows).

On Tue, Dec 30, 2014 at 12:24 PM, Shaun Senecal
<shaun.sene...@lithium.com> wrote:
> Ya, I already have that set up. Thanks for the heads-up though!
>
> ________________________________________
> From: Uwe Schindler <u...@thetaphi.de>
> Sent: December 30, 2014 5:22 AM
> To: java-user@lucene.apache.org
> Subject: RE: manually merging Directories
>
> In addition, use NoMergePolicy to prevent automatic merging once the
> segments were added. :-)
>
> -----
> Uwe Schindler
> H.-H.-Meier-Allee 63, D-28213 Bremen
> http://www.thetaphi.de
> eMail: u...@thetaphi.de
>
>
>> -----Original Message-----
>> From: Uwe Schindler [mailto:u...@thetaphi.de]
>> Sent: Tuesday, December 30, 2014 2:20 PM
>> To: 'java-user@lucene.apache.org'
>> Subject: RE: manually merging Directories
>>
>> Hi Shaun,
>>
>> you can actually do this relatively simply. In fact, most of the files
>> are indeed copied as-is, so you can theoretically change the logic to
>> make a simple rename. Files that cannot be copied unmodified and need
>> to be changed by IndexWriter will be handled as usual.
>>
>> You don't need to patch Lucene for this: IndexWriter calls
>> Directory#copy(Directory to, String src, String dest, IOContext context)
>> for those files that can be copied unmodified.
>> What you need to do is: Just create an
>> oal.store.FilterDirectory that wraps the original FSDirectory and
>> implement this copy method on it to just do a rename, like:
>>
>> import java.io.IOException;
>> import java.nio.file.Files;
>>
>> import org.apache.lucene.store.Directory;
>> import org.apache.lucene.store.FSDirectory;
>> import org.apache.lucene.store.FilterDirectory;
>> import org.apache.lucene.store.IOContext;
>>
>> public class RenameInsteadCopyFilterDirectory extends FilterDirectory {
>>   public RenameInsteadCopyFilterDirectory(FSDirectory dir) {
>>     super(dir);
>>   }
>>
>>   // IndexWriter calls this on the source directory for files it can
>>   // copy as-is.
>>   @Override
>>   public void copy(Directory to, String src, String dest, IOContext context)
>>       throws IOException {
>>     if (!(to instanceof FSDirectory)) {
>>       throw new IOException("This only works for target FSDirectories");
>>     }
>>     final FSDirectory fromFS = (FSDirectory) this.getDelegate(),
>>         toFS = (FSDirectory) to;
>>     // Move the file instead of copying its bytes.
>>     Files.move(fromFS.getDirectory().resolve(src),
>>         toFS.getDirectory().resolve(dest));
>>   }
>> }
>>
>> Please be aware that you have to wrap the "source" directory, because
>> IndexWriter's copySegmentAsIs() calls this method of the directory
>> that's passed to addIndexes(Directory). Something like:
>>
>> writer.addIndexes(new RenameInsteadCopyFilterDirectory(originalDir));
>>
>> After that, all files that could not be copied unmodified stay in the
>> source directory, but all those that are copied as-is will be moved
>> and disappear from the source directory.
>>
>> Uwe
>>
>> -----
>> Uwe Schindler
>> H.-H.-Meier-Allee 63, D-28213 Bremen
>> http://www.thetaphi.de
>> eMail: u...@thetaphi.de
>>
>>
>> > -----Original Message-----
>> > From: Shaun Senecal [mailto:shaun.sene...@lithium.com]
>> > Sent: Tuesday, December 30, 2014 12:37 AM
>> > To: Lucene Users
>> > Subject: Re: manually merging Directories
>> >
>> > Hi Mike
>> >
>> > That's actually what I was looking at doing, I was just hoping there
>> > was a way to avoid the "copySegmentAsIs" step and simply replace it
>> > with a "rename" operation on the file system. It seemed like low
>> > hanging fruit, but Uwe and Erick have now told me that the segments
>> > have dependencies embedded in them somehow, so a simple rename
>> > operation wouldn't accomplish the same thing.
>> > In the end, it may not be a big deal anyway.
>> >
>> >
>> > Thanks
>> >
>> > Shaun
>> >
>> >
>> > ________________________________________
>> > From: Michael McCandless <luc...@mikemccandless.com>
>> > Sent: December 29, 2014 2:43 PM
>> > To: Lucene Users
>> > Subject: Re: manually merging Directories
>> >
>> > Why not use IW.addIndexes(Directory[])?
>> >
>> > Mike McCandless
>> >
>> > http://blog.mikemccandless.com
>> >
>> >
>> > On Mon, Dec 29, 2014 at 12:44 PM, Uwe Schindler <u...@thetaphi.de>
>> > wrote:
>> > > Hi,
>> > >
>> > > Why not simply leave each index directory on the searcher nodes as
>> > > is: Move all index directories (as mentioned by you) to a local
>> > > disk and access them using a MultiReader - there is no need to
>> > > merge them if you do not have enough resources. If you have enough
>> > > CPU and IO power, just merge them as usual with
>> > > IndexWriter.addIndexes(). But I don't understand your argument with
>> > > I/O: If you copy the index files from HDFS to local disks already,
>> > > how can this work without I/O? So you can merge them anyways.
>> > >
>> > > Merging index files, simply by copying them all in one directory,
>> > > is impossible, because the files reference each other by segment
>> > > name (segments_n refers to them, also the segment ids are used all
>> > > over). So you would need to change some index files already for
>> > > merge to make the SegmentInfos structures use the correct names, so
>> > > you can do a real merge anyways.
>> > >
>> > > Uwe
>> > >
>> > > -----
>> > > Uwe Schindler
>> > > H.-H.-Meier-Allee 63, D-28213 Bremen
>> > > http://www.thetaphi.de
>> > > eMail: u...@thetaphi.de
>> > >
>> > >
>> > >> -----Original Message-----
>> > >> From: Shaun Senecal [mailto:shaun.sene...@lithium.com]
>> > >> Sent: Monday, December 29, 2014 6:34 PM
>> > >> To: java-user
>> > >> Subject: Re: manually merging Directories
>> > >>
>> > >> I'm not worried about the I/O right now, I'm "hoping I can do
>> > >> better", that's all.
>> > >> It sounds like the only actual complication
>> > >> here is building the segments_N file, which would list all of the
>> > >> newly renamed segments, so perhaps this isn't impossible. That said,
>> > >> you're absolutely right about the possibility of complications, so
>> > >> it's debatable if doing something like this would be worth it in the
>> > >> end. Thanks for the info
>> > >>
>> > >>
>> > >> Shaun
>> > >>
>> > >>
>> > >> ________________________________________
>> > >> From: Erick Erickson <erickerick...@gmail.com>
>> > >> Sent: December 23, 2014 5:55 PM
>> > >> To: java-user
>> > >> Subject: Re: manually merging Directories
>> > >>
>> > >> I doubt this is going to work. I have to ask why you're worried about
>> > >> the I/O; this smacks of premature optimization. Not only do the files
>> > >> have to be moved, but the right control structures need to be in
>> > >> place to inform Solr (well, Lucene) exactly what files are current.
>> > >> There's a lot of room for programming errors here....
>> > >>
>> > >> segments_n is the file that tells Lucene which segments are active.
>> > >> There can only be one that's active so you'd have to somehow combine
>> > >> them all.
>> > >>
>> > >> I think this is a dubious proposition at best, all to avoid some I/O.
>> > >> How much I/O are we talking here? If it's a huge amount, I'm not at
>> > >> all sure you'll be able to _use_ your merged index.
>> > >> How many docs are we talking about? 100M? 10B? I mean you used M/R
>> > >> on it in the first place for a reason....
>> > >>
>> > >> But this is what the --go-live option of the MapReduceIndexerTool
>> > >> already does for you. Admittedly, it copies things around the network
>> > >> to the final destination, personally I'd just use that.
>> > >>
>> > >> As you can tell, I don't know all the details to say it's impossible,
>> > >> IMO this feels like wasted effort with lots of possibilities to
>> > >> get wrong for little demonstrated benefit.
>> > >> You'd spend a lot more
>> > >> time trying to figure out the correct thing to do and then fixing
>> > >> bugs than you'll spend waiting for the copy, HDFS or no.
>> > >>
>> > >> Best,
>> > >> Erick
>> > >>
>> > >> On Tue, Dec 23, 2014 at 2:55 PM, Shaun Senecal
>> > >> <shaun.sene...@lithium.com> wrote:
>> > >> > Hi
>> > >> >
>> > >> > I have a number of Directories which are stored in various paths
>> > >> > on HDFS, and I would like to merge them into a single index. The
>> > >> > obvious way to do this is to use IndexWriter.addIndexes(...),
>> > >> > however, I'm hoping I can do better. Since I have created each of
>> > >> > the separate indexes using Map/Reduce, I know that there are no
>> > >> > deleted or duplicate documents and the codecs are the same. Using
>> > >> > addIndexes(...) will incur a lot of I/O as it copies from the
>> > >> > source Directory into the dest Directory, and this is the bit I
>> > >> > would like to avoid. Would it be possible to simply move each of
>> > >> > the segments from each path into a single path on HDFS using a
>> > >> > mv/rename operation instead? Obviously I would need to take care
>> > >> > of the naming to ensure the files from one index don't overwrite
>> > >> > another's, but it looks like this is done with a counter of some
>> > >> > sort so that the latest segment can be found. A potential
>> > >> > complication is the segments_1 file, as I'm not sure what that is
>> > >> > for or if I can easily (re)construct it externally.
>> > >> >
>> > >> > The end goal here is to index using Map/Reduce and then spit out a
>> > >> > single index in the end that has been merged down to a single
>> > >> > segment, and to minimize IO while doing it. Once I have the
>> > >> > completed index in a single Directory, I can (optionally) perform
>> > >> > the forced merge (which will incur a huge IO hit). If the forced
>> > >> > merge isn't performed on HDFS, it could be done on the search
>> > >> > nodes before the active searcher is switched.
>> > >> > This may be better if, for example, you know all of
>> > >> > your search nodes have SSDs and IO to spare.
>> > >> >
>> > >> > Just in case my explanation above wasn't clear enough, here is a
>> > >> > picture
>> > >> >
>> > >> > What I have:
>> > >> >
>> > >> > /user/username/MR_output/0
>> > >> >   _0.fdt
>> > >> >   _0.fdx
>> > >> >   _0.fnm
>> > >> >   _0.si
>> > >> >   ...
>> > >> >   segments_1
>> > >> >
>> > >> > /user/username/MR_output/1
>> > >> >   _0.fdt
>> > >> >   _0.fdx
>> > >> >   _0.fnm
>> > >> >   _0.si
>> > >> >   ...
>> > >> >   segments_1
>> > >> >
>> > >> >
>> > >> > What I want (using simple mv/rename):
>> > >> >
>> > >> > /user/username/merged
>> > >> >   _0.fdt
>> > >> >   _0.fdx
>> > >> >   _0.fnm
>> > >> >   _0.si
>> > >> >   ...
>> > >> >   _1.fdt
>> > >> >   _1.fdx
>> > >> >   _1.fnm
>> > >> >   _1.si
>> > >> >   ...
>> > >> >   segments_1
>> > >> >
>> > >> >
>> > >> > Thanks,
>> > >> >
>> > >> > Shaun
>> > >> >

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org
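
Robert's remark in this thread that Files.createLink is an optional operation suggests trying the hard link first and falling back to a plain copy when the file system refuses. Below is a minimal sketch of that idea using only java.nio.file. The class name LinkOrCopy and the method linkOrCopy are made up for illustration; they are not part of Lucene, and this is not what LUCENE-4746 implements. Either way the source file is left untouched, which is the property Robert was after.

```java
import java.io.IOException;
import java.nio.file.FileAlreadyExistsException;
import java.nio.file.FileSystemException;
import java.nio.file.Files;
import java.nio.file.Path;

/** Hypothetical helper for illustration only; not a Lucene class. */
final class LinkOrCopy {
    private LinkOrCopy() {}

    /**
     * Makes dest hold the same content as src without modifying src.
     *
     * @return true if a hard link was created, false if the file was copied
     */
    static boolean linkOrCopy(Path src, Path dest) throws IOException {
        try {
            // Note the argument order: createLink(newLink, existingFile).
            Files.createLink(dest, src);
            return true;
        } catch (FileAlreadyExistsException e) {
            // Never overwrite an existing target file.
            throw e;
        } catch (UnsupportedOperationException | FileSystemException e) {
            // Hard links unsupported, not permitted, or src and dest are on
            // different file stores: fall back to copying the bytes.
            Files.copy(src, dest);
            return false;
        }
    }
}
```

Inside a FilterDirectory like the one Uwe posted, a call such as LinkOrCopy.linkOrCopy(fromFS.getDirectory().resolve(src), toFS.getDirectory().resolve(dest)) could stand in for Files.move, assuming a Lucene version where FSDirectory.getDirectory() returns a java.nio.file.Path. On HDFS the link attempt would not help, since hard links are not supported there (HDFS-3370).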