Re: Actual min and max-value of NumericField during codec flush

Ravikumar Govindarajan Wed, 12 Feb 2014 21:26:09 -0800

@Mike,

I had suggested the same approach in one of my previous mails, where-by
each segment records min/max timestamps in seg-info diagnostics and use it
for merging adjacent segments.


"Then, I define a TimeMergePolicy extends LogMergePolicy and define the
segment-size=Long.MAX_VALUE - SEG_LEAST_TIME [segment-diag]. "

But you have expressed reservations

"This seems somewhat dangerous...

Not taking into account the "true" segment size can lead to very very
poor merge decisions ... you should turn on IndexWriter's infoStream
and do a long running test to convince yourself the merging is being
sane."

Will merging be disastrous, if I choose a TimeMergePolicy? I will also test
and verify, but it's always great to hear finer points from experts.

@Shai,

LogByteSizeMP categorizes "adjacency" by "size", whereas it would be better
if "timestamp" is used in my case

Sure, I need to wrap this in an SMP to make sure that the newly-created
segment is also in sorted-order

--
Ravi



On Wed, Feb 12, 2014 at 8:29 PM, Shai Erera <[email protected]> wrote:

> Why not use LogByteSizeMP in conjunction w/ SortingMP? LogMP picks adjacent
> segments and SortingMP ensures the merged segment is also sorted.
>
> Shai
>
>
> On Wed, Feb 12, 2014 at 3:16 PM, Ravikumar Govindarajan <
> [email protected]> wrote:
>
> > Yes exactly as you have described.
> >
> > Ex: Consider Segment[S1,S2,S3 & S4] are in reverse-chronological order
> and
> > goes for a merge
> >
> > While SortingMergePolicy will correctly solve the merge-part, it does not
> > however play any role in picking segments to merge right?
> >
> > SMP internally delegates to TieredMergePolicy, which might pick S1&S4 to
> > merge disturbing the global-order. Ideally only "adjacent" segments
> should
> > be picked up for merge. Ex: {S1,S2} or {S2,S3,S4} etc...
> >
> > Can there be a better selection of segments to merge in this case, so as
> to
> > maintain a semblance of global-ordering?
> >
> > --
> > Ravi
> >
> >
> >
> > On Wed, Feb 12, 2014 at 6:21 PM, Michael McCandless <
> > [email protected]> wrote:
> >
> > > OK, I see (early termination).
> > >
> > > That's a challenge, because you really want the docs sorted backwards
> > > from how they were added right?  And, e.g., merged and then searched
> > > in "reverse segment order"?
> > >
> > > I think you should be able to do this w/ SortingMergePolicy?  And then
> > > use a custom collector that stops after you've gone back enough in
> > > time for a given search.
> > >
> > > Mike McCandless
> > >
> > > http://blog.mikemccandless.com
> > >
> > >
> > > On Wed, Feb 12, 2014 at 6:04 AM, Ravikumar Govindarajan
> > > <[email protected]> wrote:
> > > > Mike,
> > > >
> > > > All our queries need to be sorted by timestamp field, in descending
> > order
> > > > of time. [latest-first]
> > > >
> > > > Each segment is sorted in itself. But TieredMergePolicy picks
> arbitrary
> > > > segments and merges them [even with SortingMergePolicy etc...]. I am
> > > trying
> > > > to avoid this and see if an approximate global ordering of segments
> [by
> > > > time-stamp field] can be maintained via merge.
> > > >
> > > > Ex: TopN results will only examine recent 2-3 smaller segments
> > > [best-case]
> > > > and return, without examining older and bigger segments.
> > > >
> > > > I do not know the terminology, may be "Early Query Termination Across
> > > > Segments" etc...?
> > > >
> > > > --
> > > > Ravi
> > > >
> > > >
> > > > On Fri, Feb 7, 2014 at 10:42 PM, Michael McCandless <
> > > > [email protected]> wrote:
> > > >
> > > >> LogByteSizeMergePolicy (and LogDocMergePolicy) will preserve the
> total
> > > >> order.
> > > >>
> > > >> Only TieredMergePolicy merges out-of-order segments.
> > > >>
> > > >> I don't understand why you need to encouraging merging of the more
> > > >> recent (by your "time" field) segments...
> > > >>
> > > >> Mike McCandless
> > > >>
> > > >> http://blog.mikemccandless.com
> > > >>
> > > >>
> > > >> On Fri, Feb 7, 2014 at 8:18 AM, Ravikumar Govindarajan
> > > >> <[email protected]> wrote:
> > > >> > Mike,
> > > >> >
> > > >> > Each of my flushed segment is fully ordered by time. But
> > > >> TieredMergePolicy
> > > >> > or LogByteSizeMergePolicy is going to pick arbitrary time-segments
> > and
> > > >> > disturb this arrangement and I wanted some kind of control on
> this.
> > > >> >
> > > >> > But like you pointed-out, going by only be time-adjacent merges
> can
> > be
> > > >> > disastrous.
> > > >> >
> > > >> > Is there a way to mix both time and size to arrive at a somewhat
> > > >> > [less-than-accurate] global order of segment merges.
> > > >> >
> > > >> > Like attempt a time-adjacent merge, provided size of segments is
> not
> > > >> > extremely skewed etc...
> > > >> >
> > > >> > --
> > > >> > Ravi
> > > >> >
> > > >> >
> > > >> >
> > > >> >
> > > >> >
> > > >> >
> > > >> >
> > > >> > On Fri, Feb 7, 2014 at 4:17 PM, Michael McCandless <
> > > >> > [email protected]> wrote:
> > > >> >
> > > >> >> You want to focus merging on the segments containing newer
> > documents?
> > > >> >> Why?  This seems somewhat dangerous...
> > > >> >>
> > > >> >> Not taking into account the "true" segment size can lead to very
> > very
> > > >> >> poor merge decisions ... you should turn on IndexWriter's
> > infoStream
> > > >> >> and do a long running test to convince yourself the merging is
> > being
> > > >> >> sane.
> > > >> >>
> > > >> >> Mike
> > > >> >>
> > > >> >> Mike McCandless
> > > >> >>
> > > >> >> http://blog.mikemccandless.com
> > > >> >>
> > > >> >>
> > > >> >> On Thu, Feb 6, 2014 at 11:24 PM, Ravikumar Govindarajan
> > > >> >> <[email protected]> wrote:
> > > >> >> > Thanks Mike,
> > > >> >> >
> > > >> >> > Will try your suggestion. I will try to describe the actual
> > > use-case
> > > >> >> itself
> > > >> >> >
> > > >> >> > There is a requirement for merging time-adjacent segments
> > > >> [append-only,
> > > >> >> > rolling time-series data]
> > > >> >> >
> > > >> >> > All Documents have a timestamp affixed and during flush I need
> to
> > > note
> > > >> >> down
> > > >> >> > the least timestamp for all documents, through Codec.
> > > >> >> >
> > > >> >> > Then, I define a TimeMergePolicy extends LogMergePolicy and
> > define
> > > the
> > > >> >> > segment-size=Long.MAX_VALUE - SEG_LEAST_TIME [segment-diag].
> > > >> >> >
> > > >> >> > LogMergePolicy will auto-arrange levels of segments according
> > time
> > > and
> > > >> >> > proceed with merges. Latest segments will be lesser in size and
> > > >> preferred
> > > >> >> > during merges than older and bigger segments
> > > >> >> >
> > > >> >> > Do you think such an approach will be fine or there are better
> > > ways to
> > > >> >> > solve this?
> > > >> >> >
> > > >> >> > --
> > > >> >> > Ravi
> > > >> >> >
> > > >> >> >
> > > >> >> > On Thu, Feb 6, 2014 at 4:34 PM, Michael McCandless <
> > > >> >> > [email protected]> wrote:
> > > >> >> >
> > > >> >> >> Somewhere in those numeric trie terms are the exact integers
> > from
> > > >> your
> > > >> >> >> documents, encoded.
> > > >> >> >>
> > > >> >> >> You can use oal.util.NumericUtils.prefixCodecToInt to get the
> > int
> > > >> >> >> value back from the BytesRef term.
> > > >> >> >>
> > > >> >> >> But you need to filter out the "higher level" terms, e.g.
> using
> > > >> >> >> NumericUtils.getPrefixCodedLongShift(term) == 0.  Or use
> > > >> >> >> NumericUtils.filterPrefixCodedLongs to wrap a TermsEnum.  I
> > > believe
> > > >> >> >> all the terms you want come first, so once you hit a term
> where
> > > >> >> >> .getPrefixCodedLongShift is > 0, that's your max term and you
> > can
> > > >> stop
> > > >> >> >> checking.
> > > >> >> >>
> > > >> >> >> BTW, in 5.0, the codec API for PostingsFormat has improved, so
> > > that
> > > >> >> >> you can e.g. pull your own TermsEnum and iterate the terms
> > > yourself.
> > > >> >> >>
> > > >> >> >> Mike McCandless
> > > >> >> >>
> > > >> >> >> http://blog.mikemccandless.com
> > > >> >> >>
> > > >> >> >>
> > > >> >> >> On Thu, Feb 6, 2014 at 5:16 AM, Ravikumar Govindarajan
> > > >> >> >> <[email protected]> wrote:
> > > >> >> >> > I use a Codec to flush data. All methods delegate to actual
> > > >> >> >> Lucene42Codec,
> > > >> >> >> > except for intercepting one single-field. This field is
> > indexed
> > > as
> > > >> an
> > > >> >> >> > IntField [Numeric-Trie...], with precisionStep=4.
> > > >> >> >> >
> > > >> >> >> > The purpose of the Codec is as follows
> > > >> >> >> >
> > > >> >> >> > 1. Note the first BytesRef for this field
> > > >> >> >> > 2. During finish() call [TermsConsumer.java], note the last
> > > >> BytesRef
> > > >> >> for
> > > >> >> >> > this field
> > > >> >> >> > 3. Converts both the first/last BytesRef to respective
> > integers
> > > >> >> >> > 4. Store these 2 ints in segment-info diagnostics
> > > >> >> >> >
> > > >> >> >> > The problem with this approach is that, first/last BytesRef
> is
> > > >> totally
> > > >> >> >> > different from the actual "int" values I try to index. I
> > guess,
> > > >> this
> > > >> >> is
> > > >> >> >> > because Numeric-Trie explodes all the integers into it's own
> > > >> format of
> > > >> >> >> > BytesRefs. Hence my Codec stores the wrong values in
> > > >> >> segment-diagnostics
> > > >> >> >> >
> > > >> >> >> > Is there a way I can record actual min/max int-values
> > correctly
> > > in
> > > >> my
> > > >> >> >> codec
> > > >> >> >> > and still support NumericRange search?
> > > >> >> >> >
> > > >> >> >> > --
> > > >> >> >> > Ravi
> > > >> >> >>
> > > >> >> >>
> > > ---------------------------------------------------------------------
> > > >> >> >> To unsubscribe, e-mail:
> [email protected]
> > > >> >> >> For additional commands, e-mail:
> > [email protected]
> > > >> >> >>
> > > >> >> >>
> > > >> >>
> > > >> >>
> > ---------------------------------------------------------------------
> > > >> >> To unsubscribe, e-mail: [email protected]
> > > >> >> For additional commands, e-mail:
> [email protected]
> > > >> >>
> > > >> >>
> > > >>
> > > >>
> ---------------------------------------------------------------------
> > > >> To unsubscribe, e-mail: [email protected]
> > > >> For additional commands, e-mail: [email protected]
> > > >>
> > > >>
> > >
> > > ---------------------------------------------------------------------
> > > To unsubscribe, e-mail: [email protected]
> > > For additional commands, e-mail: [email protected]
> > >
> > >
> >
>

Re: Actual min and max-value of NumericField during codec flush

Reply via email to