OK, I see (early termination). That's a challenge, because you really want the
docs sorted backwards from how they were added, right? And, e.g., merged and
then searched in "reverse segment order"?

I think you should be able to do this with SortingMergePolicy, and then use a
custom Collector that stops once you've gone back far enough in time for a
given search.

Mike McCandless
http://blog.mikemccandless.com
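A minimal sketch of what such a "stop once you've gone back far enough"
Collector might look like against the Lucene 4.x Collector API. It assumes each
segment is index-sorted newest-first (e.g. via SortingMergePolicy) and that the
timestamp is available as a numeric doc-values field; the field name
"timestamp" and the class name are illustrative, not from the thread:

import java.io.IOException;

import org.apache.lucene.index.AtomicReaderContext;
import org.apache.lucene.index.NumericDocValues;
import org.apache.lucene.search.CollectionTerminatedException;
import org.apache.lucene.search.Collector;
import org.apache.lucene.search.Scorer;

/** Stops collecting a segment once docs get older than a cutoff. */
public class TimeCutoffCollector extends Collector {

  private final Collector delegate;   // the real collector, e.g. a TopFieldCollector
  private final long cutoff;          // oldest timestamp this search cares about
  private NumericDocValues timestamps;

  public TimeCutoffCollector(Collector delegate, long cutoff) {
    this.delegate = delegate;
    this.cutoff = cutoff;
  }

  @Override
  public void setScorer(Scorer scorer) throws IOException {
    delegate.setScorer(scorer);
  }

  @Override
  public void setNextReader(AtomicReaderContext context) throws IOException {
    // May be null if this segment has no such field; we then never terminate early.
    timestamps = context.reader().getNumericDocValues("timestamp");
    delegate.setNextReader(context);
  }

  @Override
  public void collect(int doc) throws IOException {
    // Docs arrive newest-first within a sorted segment, so the first
    // too-old doc means the rest of the segment is too old as well.
    // IndexSearcher catches this exception and moves to the next segment.
    if (timestamps != null && timestamps.get(doc) < cutoff) {
      throw new CollectionTerminatedException();
    }
    delegate.collect(doc);
  }

  @Override
  public boolean acceptsDocsOutOfOrder() {
    return false; // we rely on docID order matching time order
  }
}

If the segments themselves are also visited newest-first (the "reverse segment
order" above), the whole search becomes an early-terminated walk; otherwise
this only saves work within each segment.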
On Wed, Feb 12, 2014 at 6:04 AM, Ravikumar Govindarajan
<ravikumar.govindara...@gmail.com> wrote:

> Mike,
>
> All our queries need to be sorted by the timestamp field, in descending
> order of time [latest first].
>
> Each segment is sorted in itself, but TieredMergePolicy picks arbitrary
> segments and merges them [even with SortingMergePolicy etc.]. I am trying
> to avoid this and see whether an approximate global ordering of segments
> [by the timestamp field] can be maintained across merges.
>
> Ex: a top-N search would examine only the 2-3 most recent, smaller
> segments [best case] and return, without examining older and bigger
> segments.
>
> I do not know the terminology; maybe "early query termination across
> segments"?
>
> --
> Ravi
>
> On Fri, Feb 7, 2014 at 10:42 PM, Michael McCandless
> <luc...@mikemccandless.com> wrote:
>
>> LogByteSizeMergePolicy (and LogDocMergePolicy) will preserve the total
>> order.
>>
>> Only TieredMergePolicy merges out-of-order segments.
>>
>> I don't understand why you need to encourage merging of the more
>> recent (by your "time" field) segments...
>>
>> Mike McCandless
>> http://blog.mikemccandless.com
>>
>> On Fri, Feb 7, 2014 at 8:18 AM, Ravikumar Govindarajan
>> <ravikumar.govindara...@gmail.com> wrote:
>>
>>> Mike,
>>>
>>> Each of my flushed segments is fully ordered by time, but
>>> TieredMergePolicy or LogByteSizeMergePolicy is going to pick arbitrary
>>> time-segments and disturb this arrangement, and I wanted some kind of
>>> control over this.
>>>
>>> But like you pointed out, doing only time-adjacent merges can be
>>> disastrous.
>>>
>>> Is there a way to mix both time and size to arrive at a somewhat
>>> [less-than-accurate] global order of segment merges? Like attempting a
>>> time-adjacent merge provided the sizes of the segments are not
>>> extremely skewed, etc.
>>>
>>> --
>>> Ravi
>>>
>>> On Fri, Feb 7, 2014 at 4:17 PM, Michael McCandless
>>> <luc...@mikemccandless.com> wrote:
>>>
>>>> You want to focus merging on the segments containing newer documents?
>>>> Why? This seems somewhat dangerous...
>>>>
>>>> Not taking into account the "true" segment size can lead to very, very
>>>> poor merge decisions... you should turn on IndexWriter's infoStream
>>>> and do a long-running test to convince yourself the merging is sane.
>>>>
>>>> Mike McCandless
>>>> http://blog.mikemccandless.com
>>>>
>>>> On Thu, Feb 6, 2014 at 11:24 PM, Ravikumar Govindarajan
>>>> <ravikumar.govindara...@gmail.com> wrote:
>>>>
>>>>> Thanks Mike,
>>>>>
>>>>> Will try your suggestion. Let me describe the actual use-case itself.
>>>>>
>>>>> There is a requirement for merging time-adjacent segments
>>>>> [append-only, rolling time-series data].
>>>>>
>>>>> All documents have a timestamp affixed, and during flush I note down
>>>>> the least timestamp across all documents, through a Codec.
>>>>>
>>>>> Then I define a TimeMergePolicy extends LogMergePolicy and define
>>>>> segment-size = Long.MAX_VALUE - SEG_LEAST_TIME [taking the least
>>>>> time from the segment diagnostics].
>>>>>
>>>>> LogMergePolicy will auto-arrange levels of segments according to time
>>>>> and proceed with merges. The latest segments will be smaller in size
>>>>> and preferred during merges over older and bigger segments.
>>>>>
>>>>> Do you think such an approach will be fine, or are there better ways
>>>>> to solve this?
>>>>>
>>>>> --
>>>>> Ravi
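For reference, a rough sketch of the TimeMergePolicy idea quoted above,
against the 4.x LogMergePolicy API (field and method names may differ across
versions). The "minTimestamp" diagnostics key is an assumed name for whatever
the custom codec records at flush time:

import java.io.IOException;

import org.apache.lucene.index.LogMergePolicy;
import org.apache.lucene.index.SegmentCommitInfo;

/** Ranks segments by recency instead of byte/doc size. */
public class TimeMergePolicy extends LogMergePolicy {

  public TimeMergePolicy() {
    minMergeSize = 0;              // never treat a segment as "too small"
    maxMergeSize = Long.MAX_VALUE; // never treat a segment as "too large"
  }

  @Override
  protected long size(SegmentCommitInfo info) throws IOException {
    // "minTimestamp" is assumed to be written into the segment
    // diagnostics at flush; fall back to "oldest" when it is missing.
    String min = info.info.getDiagnostics().get("minTimestamp");
    if (min == null) {
      return Long.MAX_VALUE;
    }
    // Newer segments (larger least-timestamp) get a smaller "size", so
    // LogMergePolicy's level logic groups time-adjacent segments together.
    return Long.MAX_VALUE - Long.parseLong(min);
  }
}

As Mike warns above, this ignores true segment size entirely, so badly skewed
merges are possible; watching IndexWriter's infoStream over a long run is the
way to validate it.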
>>>>> On Thu, Feb 6, 2014 at 4:34 PM, Michael McCandless
>>>>> <luc...@mikemccandless.com> wrote:
>>>>>
>>>>>> Somewhere in those numeric trie terms are the exact integers from
>>>>>> your documents, encoded.
>>>>>>
>>>>>> You can use oal.util.NumericUtils.prefixCodedToInt to get the int
>>>>>> value back from the BytesRef term.
>>>>>>
>>>>>> But you need to filter out the "higher level" terms, e.g. keeping
>>>>>> only terms where NumericUtils.getPrefixCodedIntShift(term) == 0, or
>>>>>> using NumericUtils.filterPrefixCodedInts to wrap a TermsEnum. I
>>>>>> believe all the terms you want come first, so once you hit a term
>>>>>> where getPrefixCodedIntShift is > 0, you have already seen your max
>>>>>> term and can stop checking.
>>>>>>
>>>>>> BTW, in 5.0 the codec API for PostingsFormat has improved, so that
>>>>>> you can e.g. pull your own TermsEnum and iterate the terms yourself.
>>>>>>
>>>>>> Mike McCandless
>>>>>> http://blog.mikemccandless.com
>>>>>>
>>>>>> On Thu, Feb 6, 2014 at 5:16 AM, Ravikumar Govindarajan
>>>>>> <ravikumar.govindara...@gmail.com> wrote:
>>>>>>
>>>>>>> I use a Codec to flush data. All methods delegate to the actual
>>>>>>> Lucene42Codec, except for intercepting one single field. This field
>>>>>>> is indexed as an IntField [numeric trie], with precisionStep=4.
>>>>>>>
>>>>>>> The purpose of the Codec is as follows:
>>>>>>>
>>>>>>> 1. Note the first BytesRef for this field.
>>>>>>> 2. During the finish() call [TermsConsumer.java], note the last
>>>>>>>    BytesRef for this field.
>>>>>>> 3. Convert both the first/last BytesRef to their respective
>>>>>>>    integers.
>>>>>>> 4. Store these 2 ints in the segment-info diagnostics.
>>>>>>>
>>>>>>> The problem with this approach is that the first/last BytesRef is
>>>>>>> totally different from the actual "int" values I try to index. I
>>>>>>> guess this is because the numeric trie explodes all the integers
>>>>>>> into its own format of BytesRefs. Hence my Codec stores the wrong
>>>>>>> values in the segment diagnostics.
>>>>>>>
>>>>>>> Is there a way I can record the actual min/max int values correctly
>>>>>>> in my codec and still support NumericRange search?
>>>>>>>
>>>>>>> --
>>>>>>> Ravi
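Following Mike's advice above, a small sketch (4.x NumericUtils API) of pulling
the true min/max back out of trie-encoded terms. The class name is illustrative;
the enum could come from, e.g., MultiFields.getTerms(reader, field).iterator(null):

import java.io.IOException;

import org.apache.lucene.index.TermsEnum;
import org.apache.lucene.util.BytesRef;
import org.apache.lucene.util.NumericUtils;

public final class TrieMinMax {

  /** Returns {min, max} over the full-precision terms, or null if none. */
  public static int[] minMax(TermsEnum terms) throws IOException {
    Integer min = null;
    int max = 0;
    for (BytesRef term = terms.next(); term != null; term = terms.next()) {
      // Full-precision terms (shift == 0) sort before the higher-level
      // prefix terms, so the first shifted term means the max was seen.
      if (NumericUtils.getPrefixCodedIntShift(term) != 0) {
        break;
      }
      int value = NumericUtils.prefixCodedToInt(term);
      if (min == null) {
        min = value; // first full-precision term is the min
      }
      max = value;   // terms arrive in ascending order
    }
    return min == null ? null : new int[] { min, max };
  }
}

Inside the codec, the same two calls can be applied to the BytesRefs passed to
the TermsConsumer: skip any term whose shift is non-zero before converting, and
the diagnostics end up holding the real document values while the indexed trie
terms, and hence NumericRange search, stay untouched.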