On Thu, Apr 5, 2012 at 3:31 PM, Ivan Brusic <i...@brusic.com> wrote:
> On Thu, Apr 5, 2012 at 11:36 AM, Michael McCandless
> <luc...@mikemccandless.com> wrote:
>> I'm assuming this is a "build once and never change" index...? Else,
>> it sounds like you should never run forceMerge...
>
> Correct. The forceMerge was merely to preserve the previous 2.3
> behavior of using optimize.

OK. Avoid it, unless you can't...

>> To preserve insertion order you just need to use one of the
>> Log*MergePolicy (which you are already doing). Merge factor doesn't
>> affect this...
>
> I was never sure why the merge factor was set to 2. My experience in
> the past was to set a high merge factor when doing a batch index.

Well, it's not entirely clear... you'd have to test in your env to be
sure. My instinct is to use a large (maybe infinite) MF while indexing,
and then a big MF while forceMerge'ing.

>> For the fastest way to get to a single-segment index.... use
>> NoMergePolicy while indexing the documents, and set the largest RAM
>> buffer you can afford. This will create tons of segments in the index
>> dir, which is fine as long as you will not open a reader on it...
>> then:
>>
>> Open a new IW, with Log*MergePolicy, set a highish (maybe 30)
>> mergeFactor, and call forceMerge(1). You may need to cutover to
>> SerialMergeScheduler...
>
> NoMergePolicy? Never seen that class used before.

It's like Log*MP with infinite mergeFactor...

> RAM buffer size is
> not an issue. Is the limitation still 2048MB?

Yes.

> Is the fastest way also the best way? :) There will never be a reader
> open on the index. Your second solution is similar to the existing
> code with the exception of the mergeFactor. Will setting the merge
> factor to a more reasonable number help with the merge speed?

I think you'd have to test in your env. A non-infinite MF is good in
that it gets some merges out of the way before the end, i.e., you can
soak up some otherwise unused IO resources/concurrency while you are
indexing... making it less work/time to forceMerge in the end.

> What enforces the preservation of the insertion order? The
> MergePolicy?

MergePolicy does. Though, in 4.0, it's also important you use only 1
thread for indexing. Prior to 4.0, docIDs were assigned in arrival
order, across threads, but with 4.0, each thread gets a private segment,
so the docIDs are jumbled.

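To make the threading point concrete, here is a toy model in plain Java.
It uses no Lucene classes, and the names DocIdOrderSketch, arrivalOrder
and perThreadSegments are made up for illustration. It assumes each
thread's private segment is simply laid end to end in the order the
threads first appear; in a real index the segment order depends on when
each segment is flushed, so treat this as a sketch of the effect only.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class DocIdOrderSketch {

    // threadOfArrival[i] is the (hypothetical) indexing thread that added
    // the i-th document to arrive. Both methods return arrival indexes
    // listed in docID order.

    /** Pre-4.0 model: docIDs follow arrival order, whichever thread added the doc. */
    static List<Integer> arrivalOrder(int[] threadOfArrival) {
        List<Integer> byDocId = new ArrayList<>();
        for (int i = 0; i < threadOfArrival.length; i++) {
            byDocId.add(i);
        }
        return byDocId;
    }

    /** 4.0 model: each thread fills a private segment; segments are then concatenated. */
    static List<Integer> perThreadSegments(int[] threadOfArrival) {
        // LinkedHashMap keeps segments in order of each thread's first document.
        Map<Integer, List<Integer>> segments = new LinkedHashMap<>();
        for (int i = 0; i < threadOfArrival.length; i++) {
            segments.computeIfAbsent(threadOfArrival[i], t -> new ArrayList<>()).add(i);
        }
        List<Integer> byDocId = new ArrayList<>();
        for (List<Integer> segment : segments.values()) {
            byDocId.addAll(segment);
        }
        return byDocId;
    }

    public static void main(String[] args) {
        int[] twoThreads = {0, 1, 0, 1, 0}; // two threads taking turns
        System.out.println("arrival order:       " + arrivalOrder(twoThreads));
        System.out.println("per-thread segments: " + perThreadSegments(twoThreads));

        int[] oneThread = {0, 0, 0, 0, 0};  // a single indexing thread
        System.out.println("one thread:          " + perThreadSegments(oneThread));
    }
}
```

With two threads taking turns, the per-thread model yields 0, 2, 4, 1, 3
instead of 0, 1, 2, 3, 4. With a single thread the two models agree,
which is why one indexing thread keeps insertion order in 4.0.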
> How does the MergeScheduler affect things?

It shouldn't affect docID order.

> Used Lucene
> on a few projects over the years and I never had to tweak the index
> creation.

The defaults normally work well... but docID assignment is an impl
detail and is free to change across releases...

> I guess I need to reread the tuning chapter in LIA, it's
> been a few years. ;)

Mike McCandless

http://blog.mikemccandless.com