Why not use LogByteSizeMP in conjunction w/ SortingMP? LogMP picks adjacent segments and SortingMP ensures the merged segment is also sorted.
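Something along these lines (an untested sketch against the 4.x API; it
assumes lucene-misc is on the classpath, and the "timestamp" field name
is only an example):

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.index.IndexWriterConfig;
    import org.apache.lucene.index.LogByteSizeMergePolicy;
    import org.apache.lucene.index.sorter.SortingMergePolicy;
    import org.apache.lucene.search.Sort;
    import org.apache.lucene.search.SortField;
    import org.apache.lucene.util.Version;

    // Sort each merged segment by timestamp, newest first.
    Sort sort = new Sort(new SortField("timestamp", SortField.Type.LONG, true));

    // LogByteSizeMergePolicy only merges adjacent segments, preserving the
    // order in which segments were flushed; SortingMergePolicy re-sorts the
    // documents of the segment produced by each merge.
    IndexWriterConfig iwc = new IndexWriterConfig(Version.LUCENE_46,
        new StandardAnalyzer(Version.LUCENE_46));
    iwc.setMergePolicy(new SortingMergePolicy(new LogByteSizeMergePolicy(), sort));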
Shai

On Wed, Feb 12, 2014 at 3:16 PM, Ravikumar Govindarajan <
ravikumar.govindara...@gmail.com> wrote:

> Yes, exactly as you have described.
>
> Ex: Consider segments [S1, S2, S3 & S4] that are in reverse-chronological
> order and go for a merge.
>
> While SortingMergePolicy will correctly solve the merge part, it does not
> however play any role in picking the segments to merge, right?
>
> SMP internally delegates to TieredMergePolicy, which might pick S1 & S4 to
> merge, disturbing the global order. Ideally only "adjacent" segments
> should be picked up for a merge. Ex: {S1, S2} or {S2, S3, S4} etc...
>
> Can there be a better selection of segments to merge in this case, so as
> to maintain a semblance of global ordering?
>
> --
> Ravi
>
> On Wed, Feb 12, 2014 at 6:21 PM, Michael McCandless <
> luc...@mikemccandless.com> wrote:
>
> > OK, I see (early termination).
> >
> > That's a challenge, because you really want the docs sorted backwards
> > from how they were added, right? And, e.g., merged and then searched
> > in "reverse segment order"?
> >
> > I think you should be able to do this w/ SortingMergePolicy, and then
> > use a custom collector that stops after you've gone far enough back in
> > time for a given search.
> >
> > Mike McCandless
> >
> > http://blog.mikemccandless.com
> >
> > On Wed, Feb 12, 2014 at 6:04 AM, Ravikumar Govindarajan
> > <ravikumar.govindara...@gmail.com> wrote:
> > > Mike,
> > >
> > > All our queries need to be sorted by the timestamp field, in
> > > descending order of time [latest-first].
> > >
> > > Each segment is sorted in itself. But TieredMergePolicy picks
> > > arbitrary segments and merges them [even with SortingMergePolicy
> > > etc...]. I am trying to avoid this and see if an approximate global
> > > ordering of segments [by timestamp field] can be maintained via
> > > merges.
> > >
> > > Ex: TopN results will only examine the 2-3 most recent, smaller
> > > segments [best-case] and return, without examining older and bigger
> > > segments.
> > >
> > > I do not know the terminology; maybe "Early Query Termination Across
> > > Segments" etc...?
> > >
> > > --
> > > Ravi
> > >
> > > On Fri, Feb 7, 2014 at 10:42 PM, Michael McCandless <
> > > luc...@mikemccandless.com> wrote:
> > >
> > >> LogByteSizeMergePolicy (and LogDocMergePolicy) will preserve the
> > >> total order.
> > >>
> > >> Only TieredMergePolicy merges out-of-order segments.
> > >>
> > >> I don't understand why you need to encourage merging of the more
> > >> recent (by your "time" field) segments...
> > >>
> > >> Mike McCandless
> > >>
> > >> http://blog.mikemccandless.com
> > >>
> > >> On Fri, Feb 7, 2014 at 8:18 AM, Ravikumar Govindarajan
> > >> <ravikumar.govindara...@gmail.com> wrote:
> > >> > Mike,
> > >> >
> > >> > Each of my flushed segments is fully ordered by time. But
> > >> > TieredMergePolicy or LogByteSizeMergePolicy is going to pick
> > >> > arbitrary time-segments and disturb this arrangement, and I wanted
> > >> > some kind of control over this.
> > >> >
> > >> > But like you pointed out, going only by time-adjacent merges can
> > >> > be disastrous.
> > >> >
> > >> > Is there a way to mix both time and size to arrive at a somewhat
> > >> > [less-than-accurate] global order of segment merges?
> > >> >
> > >> > Like attempt a time-adjacent merge, provided the size of the
> > >> > segments is not extremely skewed etc...
> > >> >
> > >> > --
> > >> > Ravi
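Mike's "custom collector that stops" suggestion above could look roughly
like this (an untested sketch against the 4.x Collector API; the
per-segment limit and the delegation are assumptions, and lucene-misc's
EarlyTerminatingSortingCollector already packages the same idea):

    import java.io.IOException;

    import org.apache.lucene.index.AtomicReaderContext;
    import org.apache.lucene.search.CollectionTerminatedException;
    import org.apache.lucene.search.Collector;
    import org.apache.lucene.search.Scorer;

    // Stops collecting a segment once enough hits have been seen. Because
    // SortingMergePolicy keeps every segment sorted newest-first, the first
    // N docs of a segment are also its newest N.
    public class EarlyTerminatingCollector extends Collector {
      private final Collector in;       // the real collector, e.g. TopFieldCollector
      private final int docsPerSegment; // how far "back in time" to look per segment
      private int count;

      public EarlyTerminatingCollector(Collector in, int docsPerSegment) {
        this.in = in;
        this.docsPerSegment = docsPerSegment;
      }

      @Override
      public void setScorer(Scorer scorer) throws IOException {
        in.setScorer(scorer);
      }

      @Override
      public void collect(int doc) throws IOException {
        if (++count > docsPerSegment) {
          // IndexSearcher catches this and moves on to the next segment.
          throw new CollectionTerminatedException();
        }
        in.collect(doc);
      }

      @Override
      public void setNextReader(AtomicReaderContext context) throws IOException {
        count = 0; // reset the budget for each new segment
        in.setNextReader(context);
      }

      @Override
      public boolean acceptsDocsOutOfOrder() {
        return false; // docs must arrive in index (= sort) order
      }
    }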
> > >> > On Fri, Feb 7, 2014 at 4:17 PM, Michael McCandless <
> > >> > luc...@mikemccandless.com> wrote:
> > >> >
> > >> >> You want to focus merging on the segments containing newer
> > >> >> documents? Why? This seems somewhat dangerous...
> > >> >>
> > >> >> Not taking into account the "true" segment size can lead to very,
> > >> >> very poor merge decisions... you should turn on IndexWriter's
> > >> >> infoStream and do a long-running test to convince yourself the
> > >> >> merging is sane.
> > >> >>
> > >> >> Mike
> > >> >>
> > >> >> Mike McCandless
> > >> >>
> > >> >> http://blog.mikemccandless.com
> > >> >>
> > >> >> On Thu, Feb 6, 2014 at 11:24 PM, Ravikumar Govindarajan
> > >> >> <ravikumar.govindara...@gmail.com> wrote:
> > >> >> > Thanks Mike,
> > >> >> >
> > >> >> > Will try your suggestion. I will try to describe the actual
> > >> >> > use-case itself.
> > >> >> >
> > >> >> > There is a requirement for merging time-adjacent segments
> > >> >> > [append-only, rolling time-series data].
> > >> >> >
> > >> >> > All documents have a timestamp affixed, and during flush I need
> > >> >> > to note down the least timestamp across all documents, through a
> > >> >> > Codec.
> > >> >> >
> > >> >> > Then I define a TimeMergePolicy extends LogMergePolicy and set
> > >> >> > segment-size = Long.MAX_VALUE - SEG_LEAST_TIME [via
> > >> >> > segment-diag].
> > >> >> >
> > >> >> > LogMergePolicy will auto-arrange levels of segments according to
> > >> >> > time and proceed with merges. The latest segments will be smaller
> > >> >> > in size and preferred during merges over older and bigger
> > >> >> > segments.
> > >> >> >
> > >> >> > Do you think such an approach will be fine, or are there better
> > >> >> > ways to solve this?
> > >> >> >
> > >> >> > --
> > >> >> > Ravi
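The TimeMergePolicy idea above, roughly in code (an untested sketch; the
class, the diagnostics key, and the fallback are assumptions from this
thread, not an existing Lucene API, and as Mike warns above, ignoring
true byte sizes risks poor merge decisions):

    import java.io.IOException;
    import java.util.Map;

    import org.apache.lucene.index.LogMergePolicy;
    import org.apache.lucene.index.SegmentCommitInfo;

    // Ranks segments by the least timestamp recorded in their diagnostics,
    // so newer segments look "smaller" and LogMergePolicy levels and merges
    // them together, preferring them over older, bigger segments.
    public class TimeMergePolicy extends LogMergePolicy {

      // Hypothetical key; must match what the codec wrote at flush time.
      static final String LEAST_TIME_KEY = "leastTimestamp";

      @Override
      protected long size(SegmentCommitInfo info) throws IOException {
        Map<String, String> diag = info.info.getDiagnostics();
        String least = diag == null ? null : diag.get(LEAST_TIME_KEY);
        if (least == null) {
          // e.g. a merged segment without the diagnostic: fall back to bytes.
          return sizeBytes(info);
        }
        // Larger timestamps (newer segments) map to smaller "sizes".
        return Long.MAX_VALUE - Long.parseLong(least);
      }
    }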
> > >> >> > On Thu, Feb 6, 2014 at 4:34 PM, Michael McCandless <
> > >> >> > luc...@mikemccandless.com> wrote:
> > >> >> >
> > >> >> >> Somewhere in those numeric trie terms are the exact integers
> > >> >> >> from your documents, encoded.
> > >> >> >>
> > >> >> >> You can use oal.util.NumericUtils.prefixCodedToInt to get the
> > >> >> >> int value back from the BytesRef term.
> > >> >> >>
> > >> >> >> But you need to filter out the "higher level" terms, e.g. using
> > >> >> >> NumericUtils.getPrefixCodedIntShift(term) == 0. Or use
> > >> >> >> NumericUtils.filterPrefixCodedInts to wrap a TermsEnum. I
> > >> >> >> believe all the terms you want come first, so once you hit a
> > >> >> >> term where .getPrefixCodedIntShift is > 0, that's your max term
> > >> >> >> and you can stop checking.
> > >> >> >>
> > >> >> >> BTW, in 5.0 the codec API for PostingsFormat has improved, so
> > >> >> >> that you can e.g. pull your own TermsEnum and iterate the terms
> > >> >> >> yourself.
> > >> >> >>
> > >> >> >> Mike McCandless
> > >> >> >>
> > >> >> >> http://blog.mikemccandless.com
> > >> >> >>
> > >> >> >> On Thu, Feb 6, 2014 at 5:16 AM, Ravikumar Govindarajan
> > >> >> >> <ravikumar.govindara...@gmail.com> wrote:
> > >> >> >> > I use a Codec to flush data. All methods delegate to the
> > >> >> >> > actual Lucene42Codec, except for intercepting one single
> > >> >> >> > field. This field is indexed as an IntField
> > >> >> >> > [Numeric-Trie...], with precisionStep=4.
> > >> >> >> >
> > >> >> >> > The purpose of the Codec is as follows:
> > >> >> >> >
> > >> >> >> > 1. Note the first BytesRef for this field
> > >> >> >> > 2. During the finish() call [TermsConsumer.java], note the
> > >> >> >> >    last BytesRef for this field
> > >> >> >> > 3. Convert both the first/last BytesRef to their respective
> > >> >> >> >    integers
> > >> >> >> > 4. Store these 2 ints in segment-info diagnostics
> > >> >> >> >
> > >> >> >> > The problem with this approach is that the first/last BytesRef
> > >> >> >> > is totally different from the actual "int" values I try to
> > >> >> >> > index. I guess this is because Numeric-Trie explodes all the
> > >> >> >> > integers into its own format of BytesRefs. Hence my Codec
> > >> >> >> > stores the wrong values in segment-diagnostics.
> > >> >> >> >
> > >> >> >> > Is there a way I can record the actual min/max int-values
> > >> >> >> > correctly in my codec and still support NumericRange search?
> > >> >> >> >
> > >> >> >> > --
> > >> >> >> > Ravi
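Mike's decoding suggestion above, spelled out (an untested per-segment
sketch against the 4.x NumericUtils API; the class and method names are
made up):

    import java.io.IOException;

    import org.apache.lucene.index.AtomicReader;
    import org.apache.lucene.index.Terms;
    import org.apache.lucene.index.TermsEnum;
    import org.apache.lucene.util.BytesRef;
    import org.apache.lucene.util.NumericUtils;

    public final class NumericMinMax {
      private NumericMinMax() {}

      // Recovers the true min/max of an IntField from one segment's terms.
      // Full-precision values live in the shift==0 terms, which sort before
      // all higher trie levels and in increasing numeric order; so the first
      // shift==0 term is the min, and the last one (just before the first
      // shift > 0 term) is the max.
      public static int[] minAndMax(AtomicReader reader, String field)
          throws IOException {
        Terms terms = reader.terms(field);
        if (terms == null) {
          return null; // field not present in this segment
        }
        TermsEnum te = terms.iterator(null);
        int min = 0, max = 0;
        boolean seen = false;
        BytesRef term;
        while ((term = te.next()) != null) {
          if (NumericUtils.getPrefixCodedIntShift(term) > 0) {
            break; // first higher-level trie term: the max was already seen
          }
          int value = NumericUtils.prefixCodedToInt(term);
          if (!seen) {
            min = value;
            seen = true;
          }
          max = value;
        }
        return seen ? new int[] { min, max } : null;
      }
    }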