LogByteSizeMergePolicy (and LogDocMergePolicy) will preserve the total order.
Only TieredMergePolicy merges out-of-order segments.

I don't understand why you need to encourage merging of the more recent
(by your "time" field) segments...

Mike McCandless

http://blog.mikemccandless.com
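[Editor's note: for context, a minimal sketch (Lucene 4.2-era API, untested; the class name and settings are just for illustration) of pinning IndexWriter to a LogByteSizeMergePolicy, so only adjacent segments are merged and the flush order of segments is preserved:]

    import java.io.File;

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.IndexWriterConfig;
    import org.apache.lucene.index.LogByteSizeMergePolicy;
    import org.apache.lucene.store.Directory;
    import org.apache.lucene.store.FSDirectory;
    import org.apache.lucene.util.Version;

    public class OrderedMergeSetup {
      public static IndexWriter openWriter(File path) throws Exception {
        Directory dir = FSDirectory.open(path);
        IndexWriterConfig iwc = new IndexWriterConfig(Version.LUCENE_42,
            new StandardAnalyzer(Version.LUCENE_42));
        // Log* policies only merge adjacent segments, so for append-only
        // indexing the time order of segments is never shuffled.
        LogByteSizeMergePolicy mp = new LogByteSizeMergePolicy();
        mp.setMergeFactor(10);  // merge 10 adjacent segments per level
        iwc.setMergePolicy(mp); // the default TieredMergePolicy would merge out of order
        return new IndexWriter(dir, iwc);
      }
    }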
On Fri, Feb 7, 2014 at 8:18 AM, Ravikumar Govindarajan
<ravikumar.govindara...@gmail.com> wrote:
> Mike,
>
> Each of my flushed segments is fully ordered by time. But TieredMergePolicy
> or LogByteSizeMergePolicy is going to pick arbitrary time-segments and
> disturb this arrangement, and I wanted some kind of control over this.
>
> But like you pointed out, going by time-adjacent merges alone can be
> disastrous.
>
> Is there a way to mix both time and size to arrive at a somewhat
> [less-than-accurate] global order of segment merges?
>
> Like attempting a time-adjacent merge, provided the sizes of the segments
> are not extremely skewed, etc...
>
> --
> Ravi
>
> On Fri, Feb 7, 2014 at 4:17 PM, Michael McCandless <
> luc...@mikemccandless.com> wrote:
>
>> You want to focus merging on the segments containing newer documents?
>> Why? This seems somewhat dangerous...
>>
>> Not taking into account the "true" segment size can lead to very, very
>> poor merge decisions ... you should turn on IndexWriter's infoStream
>> and do a long-running test to convince yourself the merging is sane.
>>
>> Mike
>>
>> Mike McCandless
>>
>> http://blog.mikemccandless.com
>>
>> On Thu, Feb 6, 2014 at 11:24 PM, Ravikumar Govindarajan
>> <ravikumar.govindara...@gmail.com> wrote:
>> > Thanks Mike,
>> >
>> > Will try your suggestion. I will try to describe the actual use-case
>> > itself.
>> >
>> > There is a requirement for merging time-adjacent segments [append-only,
>> > rolling time-series data].
>> >
>> > All documents have a timestamp affixed, and during flush I need to note
>> > down the least timestamp across all documents, through the Codec.
>> >
>> > Then I define a TimeMergePolicy that extends LogMergePolicy and defines
>> > segment-size = Long.MAX_VALUE - SEG_LEAST_TIME [segment-diag].
>> >
>> > LogMergePolicy will auto-arrange levels of segments according to time
>> > and proceed with merges. The latest segments will be smaller in size
>> > and preferred during merges over older, bigger segments.
>> >
>> > Do you think such an approach will be fine, or are there better ways to
>> > solve this?
>> >
>> > --
>> > Ravi
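[Editor's note: a rough sketch of the TimeMergePolicy idea described above, against the Lucene 4.2-era API (class names shift slightly across 4.x releases). The "leastTime" diagnostics key is hypothetical, something the custom codec would have written at flush time. Note that ranking segments by recency instead of true size is exactly what Mike cautions against above:]

    import java.io.IOException;

    import org.apache.lucene.index.LogMergePolicy;
    import org.apache.lucene.index.SegmentInfoPerCommit;

    public class TimeMergePolicy extends LogMergePolicy {

      public TimeMergePolicy() {
        // No upper bound on merge "size" for this sketch.
        maxMergeSize = Long.MAX_VALUE;
        maxMergeSizeForForcedMerge = Long.MAX_VALUE;
      }

      @Override
      protected long size(SegmentInfoPerCommit info) throws IOException {
        // "leastTime" is a hypothetical key the codec stored in the
        // segment's diagnostics during flush.
        String least = info.info.getDiagnostics().get("leastTime");
        if (least == null) {
          return sizeBytes(info); // fall back to real size if missing
        }
        // Newer segments (larger timestamps) get smaller "sizes", so they
        // level together and are preferred for merging.
        return Long.MAX_VALUE - Long.parseLong(least);
      }
    }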
>> > On Thu, Feb 6, 2014 at 4:34 PM, Michael McCandless <
>> > luc...@mikemccandless.com> wrote:
>> >
>> >> Somewhere in those numeric trie terms are the exact integers from your
>> >> documents, encoded.
>> >>
>> >> You can use oal.util.NumericUtils.prefixCodedToInt to get the int
>> >> value back from the BytesRef term.
>> >>
>> >> But you need to filter out the "higher level" terms, e.g. using
>> >> NumericUtils.getPrefixCodedIntShift(term) == 0. Or use
>> >> NumericUtils.filterPrefixCodedInts to wrap a TermsEnum. I believe
>> >> all the terms you want come first, so once you hit a term where
>> >> .getPrefixCodedIntShift is > 0, you have already seen your max term
>> >> and can stop checking.
>> >>
>> >> BTW, in 5.0, the codec API for PostingsFormat has improved, so that
>> >> you can e.g. pull your own TermsEnum and iterate the terms yourself.
>> >>
>> >> Mike McCandless
>> >>
>> >> http://blog.mikemccandless.com
>> >>
>> >> On Thu, Feb 6, 2014 at 5:16 AM, Ravikumar Govindarajan
>> >> <ravikumar.govindara...@gmail.com> wrote:
>> >> > I use a Codec to flush data. All methods delegate to the actual
>> >> > Lucene42Codec, except for intercepting one single field. This field
>> >> > is indexed as an IntField [Numeric-Trie...], with precisionStep=4.
>> >> >
>> >> > The purpose of the Codec is as follows:
>> >> >
>> >> > 1. Note the first BytesRef for this field
>> >> > 2. During the finish() call [TermsConsumer.java], note the last
>> >> >    BytesRef for this field
>> >> > 3. Convert both the first/last BytesRef to their respective integers
>> >> > 4. Store these 2 ints in the segment-info diagnostics
>> >> >
>> >> > The problem with this approach is that the first/last BytesRef is
>> >> > totally different from the actual "int" values I try to index. I
>> >> > guess this is because Numeric-Trie explodes all the integers into
>> >> > its own format of BytesRefs. Hence my Codec stores the wrong values
>> >> > in the segment-diagnostics.
>> >> >
>> >> > Is there a way I can record the actual min/max int-values correctly
>> >> > in my codec and still support NumericRange search?
>> >> >
>> >> > --
>> >> > Ravi
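[Editor's note: putting Mike's suggestion into code, a minimal, untested sketch (Lucene 4.x API; the helper class and method names are just for illustration) of recovering the true min/max ints for the field from its terms:]

    import java.io.IOException;

    import org.apache.lucene.index.Terms;
    import org.apache.lucene.index.TermsEnum;
    import org.apache.lucene.util.BytesRef;
    import org.apache.lucene.util.NumericUtils;

    // Hypothetical helper: walks only the full-precision (shift == 0)
    // trie terms and decodes them back to the original int values.
    public class TrieMinMax {

      /** Returns {min, max} for the field, or null if it has no terms. */
      public static int[] minMax(Terms terms) throws IOException {
        // filterPrefixCodedInts hides the "higher level" trie terms,
        // leaving only terms where getPrefixCodedIntShift(term) == 0.
        TermsEnum te = NumericUtils.filterPrefixCodedInts(terms.iterator(null));
        Integer min = null;
        int max = 0;
        for (BytesRef term = te.next(); term != null; term = te.next()) {
          int value = NumericUtils.prefixCodedToInt(term);
          if (min == null) {
            min = value; // terms arrive in sorted order: the first is the min
          }
          max = value;   // ... and the last full-precision term is the max
        }
        return min == null ? null : new int[] { min, max };
      }
    }

[The two decoded ints could then be written into the segment-info diagnostics from the codec's finish() hook, in place of the raw trie-encoded BytesRefs, without affecting NumericRangeQuery at search time.]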