Hi Shaun,

You can actually do this relatively simply. In fact, most of the files are copied as-is, so in theory you can change the logic to do a simple rename instead. Files that cannot be copied unmodified and have to be rewritten by IndexWriter will be handled as usual.
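As a rough illustration of what "a simple rename" means at the file level, independent of Lucene, here is a minimal sketch that uses only java.nio.file. The class name RenameInsteadOfCopyDemo and the method moveFile are made up for this example, and the file names are just placeholders in the style of your listing; in a real merge IndexWriter chooses the destination names.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical demo class, not part of Lucene.
final class RenameInsteadOfCopyDemo {

    /** Moves srcDir/src to destDir/dest; afterwards the file is gone from srcDir. */
    static void moveFile(Path srcDir, Path destDir, String src, String dest) throws IOException {
        // Without options, Files.move fails if the target already exists.
        Files.move(srcDir.resolve(src), destDir.resolve(dest));
    }

    public static void main(String[] args) throws IOException {
        Path from = Files.createTempDirectory("from");
        Path to = Files.createTempDirectory("to");
        Files.write(from.resolve("_0.fdt"), new byte[] {1, 2, 3});

        moveFile(from, to, "_0.fdt", "_1.fdt");

        System.out.println(Files.exists(from.resolve("_0.fdt"))); // false
        System.out.println(Files.exists(to.resolve("_1.fdt")));   // true
    }
}
```

When source and target are on the same file system, Files.move is typically a plain rename, so no file content has to be copied.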
You don't need to patch Lucene for this: IndexWriter calls Directory#copy(Directory to, String src, String dest, IOContext context) for those files that can be copied unmodified. What you need to do is: just create an oal.store.FilterDirectory that wraps the original FSDirectory and override this copy method so that it does a rename instead, like:

  import java.io.IOException;
  import java.nio.file.Files;
  import org.apache.lucene.store.Directory;
  import org.apache.lucene.store.FSDirectory;
  import org.apache.lucene.store.FilterDirectory;
  import org.apache.lucene.store.IOContext;

  public class RenameInsteadCopyFilterDirectory extends FilterDirectory {
    public RenameInsteadCopyFilterDirectory(FSDirectory dir) {
      super(dir);
    }

    @Override
    public void copy(Directory to, String src, String dest, IOContext context) throws IOException {
      if (!(to instanceof FSDirectory)) {
        throw new IOException("This only works for target FSDirectories");
      }
      final FSDirectory fromFS = (FSDirectory) this.getDelegate(), toFS = (FSDirectory) to;
      // Move the file instead of copying it; it disappears from the source directory.
      Files.move(fromFS.getDirectory().resolve(src), toFS.getDirectory().resolve(dest));
    }
  }

Please be aware that you have to wrap the "source" directory, because IndexWriter's copySegmentAsIs() calls this method on the directory that's passed to addIndexes(Directory). Something like:

  writer.addIndexes(new RenameInsteadCopyFilterDirectory(originalDir));

After that, all files that could not be copied unmodified stay in the source directory, but all those that are copied as-is will be moved and disappear from the source directory.

Uwe

-----
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: u...@thetaphi.de

> -----Original Message-----
> From: Shaun Senecal [mailto:shaun.sene...@lithium.com]
> Sent: Tuesday, December 30, 2014 12:37 AM
> To: Lucene Users
> Subject: Re: manually merging Directories
>
> Hi Mike
>
> That's actually what I was looking at doing, I was just hoping there was a way
> to avoid the "copySegmentAsIs" step and simply replace it with a "rename"
> operation on the file system.
> It seemed like low hanging fruit, but Uwe and
> Erick have now told me that the segments have dependencies embedded in
> them somehow, so a simple rename operation wouldn't accomplish the
> same thing. In the end, it may not be a big deal anyway.
>
>
> Thanks
>
> Shaun
>
>
> ________________________________________
> From: Michael McCandless <luc...@mikemccandless.com>
> Sent: December 29, 2014 2:43 PM
> To: Lucene Users
> Subject: Re: manually merging Directories
>
> Why not use IW.addIndexes(Directory[])?
>
> Mike McCandless
>
> http://blog.mikemccandless.com
>
>
> On Mon, Dec 29, 2014 at 12:44 PM, Uwe Schindler <u...@thetaphi.de>
> wrote:
> > Hi,
> >
> > Why not simply leave each index directory on the searcher nodes as is:
> > Move all index directories (as mentioned by you) to a local disk and
> > access them using a MultiReader - there is no need to merge them if you
> > do not have enough resources. If you have enough CPU and IO power, just
> > merge them as usual with IndexWriter.addIndexes(). But I don't understand
> > your argument with I/O: If you copy the index files from HDFS to local
> > disks already, how can this work without I/O? So you can merge them
> > anyways.
> >
> > Merging index files, simply by copying them all in one directory, is
> > impossible, because the files reference each other by segment name
> > (segments_n refers to them, also the segment ids are used all over). So
> > you would need to change some index files already for the merge to make
> > the SegmentInfos structures use the correct names, so you can do a real
> > merge anyways.
> >
> > Uwe
> >
> > -----
> > Uwe Schindler
> > H.-H.-Meier-Allee 63, D-28213 Bremen
> > http://www.thetaphi.de
> > eMail: u...@thetaphi.de
> >
> >
> >> -----Original Message-----
> >> From: Shaun Senecal [mailto:shaun.sene...@lithium.com]
> >> Sent: Monday, December 29, 2014 6:34 PM
> >> To: java-user
> >> Subject: Re: manually merging Directories
> >>
> >> I'm not worried about the I/O right now, I'm "hoping I can do
> >> better", that's all. It sounds like the only actual complication
> >> here is building the segments_N file, which would list all of the
> >> newly renamed segments, so perhaps this isn't impossible. That said,
> >> you're absolutely right about the possibility of complications, so
> >> it's debatable if doing something like this would be worth it in the
> >> end. Thanks for the info
> >>
> >>
> >>
> >> Shaun
> >>
> >>
> >> ________________________________________
> >> From: Erick Erickson <erickerick...@gmail.com>
> >> Sent: December 23, 2014 5:55 PM
> >> To: java-user
> >> Subject: Re: manually merging Directories
> >>
> >> I doubt this is going to work. I have to ask why you're worried about
> >> the I/O; this smacks of premature optimization. Not only do the files
> >> have to be moved, but the right control structures need to be in
> >> place to inform Solr (well, Lucene) exactly what files are current.
> >> There's a lot of room for programming errors here....
> >>
> >> segments_n is the file that tells Lucene which segments are active.
> >> There can only be one that's active so you'd have to somehow combine
> >> them all.
> >>
> >> I think this is a dubious proposition at best, all to avoid some I/O.
> >> How much I/O are we talking here? If it's a huge amount, I'm not at
> >> all sure you'll be able to _use_ your merged index.
> >> How many docs are we talking about? 100M? 10B? I mean you used M/R on
> >> it in the first place for a reason....
> >>
> >> But this is what the --go-live option of the MapReduceIndexerTool
> >> already does for you. Admittedly, it copies things around the network
> >> to the final destination, personally I'd just use that.
> >>
> >> As you can tell, I don't know all the details to say it's impossible,
> >> IMO this feels like wasted effort with lots of possibilities to
> >> get wrong for little demonstrated benefit. You'd spend a lot more
> >> time trying to figure out the correct thing to do and then fixing
> >> bugs than you'll spend waiting for the copy, HDFS or no.
> >>
> >> Best,
> >> Erick
> >>
> >> On Tue, Dec 23, 2014 at 2:55 PM, Shaun Senecal
> >> <shaun.sene...@lithium.com> wrote:
> >> > Hi
> >> >
> >> > I have a number of Directories which are stored in various paths on
> >> > HDFS, and I would like to merge them into a single index. The obvious
> >> > way to do this is to use IndexWriter.addIndexes(...), however, I'm
> >> > hoping I can do better. Since I have created each of the separate
> >> > indexes using Map/Reduce, I know that there are no deleted or
> >> > duplicate documents and the codecs are the same. Using
> >> > addIndexes(...) will incur a lot of I/O as it copies from the source
> >> > Directory into the dest Directory, and this is the bit I would like
> >> > to avoid. Would it instead be possible to simply move each of the
> >> > segments from each path into a single path on HDFS using a mv/rename
> >> > operation instead? Obviously I would need to take care of the naming
> >> > to ensure the files from one index don't overwrite another's, but it
> >> > looks like this is done with a counter of some sort so that the
> >> > latest segment can be found. A potential complication is the
> >> > segments_1 file, as I'm not sure what that is for or if I can easily
> >> > (re)construct it externally.
> >> >
> >> > The end goal here is to index using Map/Reduce and then spit out a
> >> > single index in the end that has been merged down to a single
> >> > segment, and to minimize IO while doing it. Once I have the completed
> >> > index in a single Directory, I can (optionally) perform the forced
> >> > merge (which will incur a huge IO hit). If the forced merge isn't
> >> > performed on HDFS, it could be done on the search nodes before the
> >> > active searcher is switched. This may be better if, for example, you
> >> > know all of your search nodes have SSDs and IO to spare.
> >> >
> >> > Just in case my explanation above wasn't clear enough, here is a
> >> > picture
> >> >
> >> > What I have:
> >> >
> >> > /user/username/MR_output/0
> >> >   _0.fdt
> >> >   _0.fdx
> >> >   _0.fnm
> >> >   _0.si
> >> >   ...
> >> >   segments_1
> >> >
> >> > /user/username/MR_output/1
> >> >   _0.fdt
> >> >   _0.fdx
> >> >   _0.fnm
> >> >   _0.si
> >> >   ...
> >> >   segments_1
> >> >
> >> >
> >> > What I want (using simple mv/rename):
> >> >
> >> > /user/username/merged
> >> >   _0.fdt
> >> >   _0.fdx
> >> >   _0.fnm
> >> >   _0.si
> >> >   ...
> >> >   _1.fdt
> >> >   _1.fdx
> >> >   _1.fnm
> >> >   _1.si
> >> >   ...
> >> >   segments_1
> >> >
> >> >
> >> > Thanks,
> >> >
> >> > Shaun
> >>
> >> ---------------------------------------------------------------------
> >> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> >> For additional commands, e-mail: java-user-h...@lucene.apache.org