Hi Mike,

That's actually what I was looking at doing. I was just hoping there was a way to avoid the "copySegmentAsIs" step and simply replace it with a "rename" operation on the file system. It seemed like low-hanging fruit, but Uwe and Erick have now told me that the segments have dependencies embedded in them, so a simple rename operation wouldn't accomplish the same thing. In the end, it may not be a big deal anyway.

Thanks

Shaun

________________________________________
From: Michael McCandless <luc...@mikemccandless.com>
Sent: December 29, 2014 2:43 PM
To: Lucene Users
Subject: Re: manually merging Directories

Why not use IW.addIndexes(Directory[])?

Mike McCandless

http://blog.mikemccandless.com

On Mon, Dec 29, 2014 at 12:44 PM, Uwe Schindler <u...@thetaphi.de> wrote:
> Hi,
>
> Why not simply leave each index directory on the searcher nodes as is:
> Move all index directories (as mentioned by you) to a local disk and access
> them using a MultiReader - there is no need to merge them if you have not
> enough resources. If you have enough CPU and IO power, just merge them as
> usual with IndexWriter.addIndexes(). But I don't understand your argument with
> I/O: If you copy the index files from HDFS to local disks already, how can
> this work without I/O? So you can merge them anyways.
>
> Merging index files, simply by copying them all in one directory, is
> impossible, because the files reference each other by segment name
> (segments_n refers to them, also the segment ids are used all over). So you
> would need to change some index files already for merge to make the
> SegmentInfos structures use the correct names, so you can do a real merge
> anyways.
>
> Uwe
>
> -----
> Uwe Schindler
> H.-H.-Meier-Allee 63, D-28213 Bremen
> http://www.thetaphi.de
> eMail: u...@thetaphi.de
>
>
>> -----Original Message-----
>> From: Shaun Senecal [mailto:shaun.sene...@lithium.com]
>> Sent: Monday, December 29, 2014 6:34 PM
>> To: java-user
>> Subject: Re: manually merging Directories
>>
>> I'm not worried about the I/O right now, I'm "hoping I can do better", that's
>> all. It sounds like the only actual complication here is building the
>> segments_N file, which would list all of the newly renamed segments, so
>> perhaps this isn't impossible.
>> That said, you're absolutely right about the possibility of
>> complications, so it's debatable if doing something like this would be
>> worth it in the end. Thanks for the info
>>
>>
>>
>> Shaun
>>
>>
>> ________________________________________
>> From: Erick Erickson <erickerick...@gmail.com>
>> Sent: December 23, 2014 5:55 PM
>> To: java-user
>> Subject: Re: manually merging Directories
>>
>> I doubt this is going to work. I have to ask why you're worried about the
>> I/O; this smacks of premature optimization. Not only do the files have to
>> be moved, but the right control structures need to be in place to inform
>> Solr (well, Lucene) exactly what files are current. There's a lot of room
>> for programming errors here....
>>
>> segments_n is the file that tells Lucene which segments are active. There
>> can only be one that's active so you'd have to somehow combine them all.
>>
>> I think this is a dubious proposition at best, all to avoid some I/O. How
>> much I/O are we talking here? If it's a huge amount, I'm not at all sure
>> you'll be able to _use_ your merged index.
>> How many docs are we talking about? 100M? 10B? I mean you used M/R on it
>> in the first place for a reason....
>>
>> But this is what the --go-live option of the MapReduceIndexerTool already
>> does for you. Admittedly, it copies things around the network to the final
>> destination; personally I'd just use that.
>>
>> As you can tell, I don't know all the details to say it's impossible, but
>> IMO this feels like wasted effort with lots of possibilities to get wrong
>> for little demonstrated benefit. You'd spend a lot more time trying to
>> figure out the correct thing to do and then fixing bugs than you'll spend
>> waiting for the copy, HDFS or no.
>>
>> Best,
>> Erick
>>
>> On Tue, Dec 23, 2014 at 2:55 PM, Shaun Senecal
>> <shaun.sene...@lithium.com> wrote:
>> > Hi
>> >
>> > I have a number of Directories which are stored in various paths on
>> > HDFS, and I would like to merge them into a single index. The obvious
>> > way to do this is to use IndexWriter.addIndexes(...), however, I'm
>> > hoping I can do better. Since I have created each of the separate
>> > indexes using Map/Reduce, I know that there are no deleted or duplicate
>> > documents and the codecs are the same. Using addIndexes(...) will incur
>> > a lot of I/O as it copies from the source Directory into the dest
>> > Directory, and this is the bit I would like to avoid. Would it instead
>> > be possible to simply move each of the segments from each path into a
>> > single path on HDFS using a mv/rename operation instead? Obviously I
>> > would need to take care of the naming to ensure the files from one index
>> > don't overwrite another's, but it looks like this is done with a counter
>> > of some sort so that the latest segment can be found. A potential
>> > complication is the segments_1 file, as I'm not sure what that is for or
>> > if I can easily (re)construct it externally.
>> >
>> > The end goal here is to index using Map/Reduce and then spit out a
>> > single index in the end that has been merged down to a single segment,
>> > and to minimize IO while doing it. Once I have the completed index in a
>> > single Directory, I can (optionally) perform the forced merge (which
>> > will incur a huge IO hit). If the forced merge isn't performed on HDFS,
>> > it could be done on the search nodes before the active searcher is
>> > switched. This may be better if, for example, you know all of your
>> > search nodes have SSDs and IO to spare.
>> >
>> > Just in case my explanation above wasn't clear enough, here is a
>> > picture
>> >
>> > What I have:
>> >
>> > /user/username/MR_output/0
>> >   _0.fdt
>> >   _0.fdx
>> >   _0.fnm
>> >   _0.si
>> >   ...
>> >   segments_1
>> >
>> > /user/username/MR_output/1
>> >   _0.fdt
>> >   _0.fdx
>> >   _0.fnm
>> >   _0.si
>> >   ...
>> >   segments_1
>> >
>> >
>> > What I want (using simple mv/rename):
>> >
>> > /user/username/merged
>> >   _0.fdt
>> >   _0.fdx
>> >   _0.fnm
>> >   _0.si
>> >   ...
>> >   _1.fdt
>> >   _1.fdx
>> >   _1.fnm
>> >   _1.si
>> >   ...
>> >   segments_1
>> >
>> >
>> > Thanks,
>> >
>> > Shaun
>> >
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
>> For additional commands, e-mail: java-user-h...@lucene.apache.org
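
The file-name bookkeeping behind the "What I want (using simple mv/rename)" picture can be sketched in a few lines of plain Java. This is only an illustration and not Lucene code: the class SegmentRenamePlan and its methods are hypothetical, and the one Lucene detail it relies on is that segment names are an underscore followed by a base-36 counter. It works out which target name each file would need so that the two _0 segments no longer collide. As Uwe explains, such a plan is not sufficient for a usable index, because segments_N and the segment files still refer to the old names; IndexWriter.addIndexes() remains the supported route.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper for illustration only; it looks at file names, never at file contents.
final class SegmentRenamePlan {

    // Segment names are "_" plus a base-36 counter: _0 ... _9, _a ... _z, _10, ...
    static String segmentName(long counter) {
        return "_" + Long.toString(counter, Character.MAX_RADIX);
    }

    // Extracts the segment prefix, e.g. "_0" from "_0.fdt" or from "_0_Lucene41_0.doc".
    static String segmentOf(String fileName) {
        int end = fileName.indexOf('_', 1);
        if (end == -1) {
            end = fileName.indexOf('.');
        }
        return end == -1 ? fileName : fileName.substring(0, end);
    }

    // Maps "sourceDir/oldFile" to a collision-free target file name.
    // Files that do not start with "_" (such as segments_1) are skipped:
    // they cannot simply be moved and would have to be rebuilt.
    static Map<String, String> plan(Map<String, List<String>> filesByDir) {
        Map<String, String> renames = new LinkedHashMap<>();
        long counter = 0;
        for (Map.Entry<String, List<String>> dir : filesByDir.entrySet()) {
            // One fresh name per distinct segment within this source directory.
            Map<String, String> newNameBySegment = new LinkedHashMap<>();
            for (String file : dir.getValue()) {
                if (!file.startsWith("_")) {
                    continue;
                }
                String oldSegment = segmentOf(file);
                String newSegment = newNameBySegment.get(oldSegment);
                if (newSegment == null) {
                    newSegment = segmentName(counter++);
                    newNameBySegment.put(oldSegment, newSegment);
                }
                String suffix = file.substring(oldSegment.length());
                renames.put(dir.getKey() + "/" + file, newSegment + suffix);
            }
        }
        return renames;
    }

    public static void main(String[] args) {
        Map<String, List<String>> filesByDir = new LinkedHashMap<>();
        filesByDir.put("MR_output/0", Arrays.asList("_0.fdt", "_0.fdx", "_0.fnm", "_0.si", "segments_1"));
        filesByDir.put("MR_output/1", Arrays.asList("_0.fdt", "_0.fdx", "_0.fnm", "_0.si", "segments_1"));
        plan(filesByDir).forEach((from, to) -> System.out.println(from + " -> " + to));
    }
}
```

Running main prints one line per segment file, with the files from MR_output/1 mapped onto _1.* names. The skipped segments_1 files are exactly the part Erick and Uwe flag as the real problem.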