Why not use LogByteSizeMP in conjunction w/ SortingMP? LogMP picks adjacent segments and SortingMP ensures the merged segment is also sorted.
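Something along these lines (an untested sketch against the 4.x API; it
assumes lucene-misc is on the classpath, and the "timestamp" field name
is only an example):

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.index.IndexWriterConfig;
    import org.apache.lucene.index.LogByteSizeMergePolicy;
    import org.apache.lucene.index.sorter.SortingMergePolicy;
    import org.apache.lucene.search.Sort;
    import org.apache.lucene.search.SortField;
    import org.apache.lucene.util.Version;

    // Sort each merged segment by timestamp, newest first.
    Sort sort = new Sort(new SortField("timestamp", SortField.Type.LONG, true));

    // LogByteSizeMergePolicy only merges adjacent segments, preserving the
    // order in which segments were flushed; SortingMergePolicy re-sorts the
    // documents of the segment produced by each merge.
    IndexWriterConfig iwc = new IndexWriterConfig(Version.LUCENE_46,
        new StandardAnalyzer(Version.LUCENE_46));
    iwc.setMergePolicy(new SortingMergePolicy(new LogByteSizeMergePolicy(), sort));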
Shai

On Wed, Feb 12, 2014 at 3:16 PM, Ravikumar Govindarajan <
ravikumar.govindara...@gmail.com> wrote:

> Yes, exactly as you have described.
>
> Ex: Consider segments [S1, S2, S3 & S4] that are in reverse-chronological
> order and go for a merge.
>
> While SortingMergePolicy will correctly solve the merge part, it does not
> however play any role in picking the segments to merge, right?
>
> SMP internally delegates to TieredMergePolicy, which might pick S1 & S4 to
> merge, disturbing the global order. Ideally only "adjacent" segments
> should be picked up for a merge. Ex: {S1, S2} or {S2, S3, S4} etc...
>
> Can there be a better selection of segments to merge in this case, so as
> to maintain a semblance of global ordering?
>
> --
> Ravi
>
> On Wed, Feb 12, 2014 at 6:21 PM, Michael McCandless <
> luc...@mikemccandless.com> wrote:
>
> > OK, I see (early termination).
> >
> > That's a challenge, because you really want the docs sorted backwards
> > from how they were added, right? And, e.g., merged and then searched
> > in "reverse segment order"?
> >
> > I think you should be able to do this w/ SortingMergePolicy, and then
> > use a custom collector that stops after you've gone far enough back in
> > time for a given search.
> >
> > Mike McCandless
> >
> > http://blog.mikemccandless.com
> >
> > On Wed, Feb 12, 2014 at 6:04 AM, Ravikumar Govindarajan
> > <ravikumar.govindara...@gmail.com> wrote:
> > > Mike,
> > >
> > > All our queries need to be sorted by the timestamp field, in
> > > descending order of time [latest-first].
> > >
> > > Each segment is sorted in itself. But TieredMergePolicy picks
> > > arbitrary segments and merges them [even with SortingMergePolicy
> > > etc...]. I am trying to avoid this and see if an approximate global
> > > ordering of segments [by timestamp field] can be maintained via
> > > merges.
> > >
> > > Ex: TopN results will only examine the 2-3 most recent, smaller
> > > segments [best-case] and return, without examining older and bigger
> > > segments.
> > >
> > > I do not know the terminology; maybe "Early Query Termination Across
> > > Segments" etc...?
> > >
> > > --
> > > Ravi
> > >
> > > On Fri, Feb 7, 2014 at 10:42 PM, Michael McCandless <
> > > luc...@mikemccandless.com> wrote:
> > >
> > >> LogByteSizeMergePolicy (and LogDocMergePolicy) will preserve the
> > >> total order.
> > >>
> > >> Only TieredMergePolicy merges out-of-order segments.
> > >>
> > >> I don't understand why you need to encourage merging of the more
> > >> recent (by your "time" field) segments...
> > >>
> > >> Mike McCandless
> > >>
> > >> http://blog.mikemccandless.com
> > >>
> > >> On Fri, Feb 7, 2014 at 8:18 AM, Ravikumar Govindarajan
> > >> <ravikumar.govindara...@gmail.com> wrote:
> > >> > Mike,
> > >> >
> > >> > Each of my flushed segments is fully ordered by time. But
> > >> > TieredMergePolicy or LogByteSizeMergePolicy is going to pick
> > >> > arbitrary time-segments and disturb this arrangement, and I wanted
> > >> > some kind of control over this.
> > >> >
> > >> > But like you pointed out, going only by time-adjacent merges can
> > >> > be disastrous.
> > >> >
> > >> > Is there a way to mix both time and size to arrive at a somewhat
> > >> > [less-than-accurate] global order of segment merges?
> > >> >
> > >> > Like attempt a time-adjacent merge, provided the size of the
> > >> > segments is not extremely skewed etc...
> > >> >
> > >> > --
> > >> > Ravi
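Mike's "custom collector that stops" suggestion above could look roughly
like this (an untested sketch against the 4.x Collector API; the
per-segment limit and the delegation are assumptions, and lucene-misc's
EarlyTerminatingSortingCollector already packages the same idea):

    import java.io.IOException;

    import org.apache.lucene.index.AtomicReaderContext;
    import org.apache.lucene.search.CollectionTerminatedException;
    import org.apache.lucene.search.Collector;
    import org.apache.lucene.search.Scorer;

    // Stops collecting a segment once enough hits have been seen. Because
    // SortingMergePolicy keeps every segment sorted newest-first, the first
    // N docs of a segment are also its newest N.
    public class EarlyTerminatingCollector extends Collector {
      private final Collector in;       // the real collector, e.g. TopFieldCollector
      private final int docsPerSegment; // how far "back in time" to look per segment
      private int count;

      public EarlyTerminatingCollector(Collector in, int docsPerSegment) {
        this.in = in;
        this.docsPerSegment = docsPerSegment;
      }

      @Override
      public void setScorer(Scorer scorer) throws IOException {
        in.setScorer(scorer);
      }

      @Override
      public void collect(int doc) throws IOException {
        if (++count > docsPerSegment) {
          // IndexSearcher catches this and moves on to the next segment.
          throw new CollectionTerminatedException();
        }
        in.collect(doc);
      }

      @Override
      public void setNextReader(AtomicReaderContext context) throws IOException {
        count = 0; // reset the budget for each new segment
        in.setNextReader(context);
      }

      @Override
      public boolean acceptsDocsOutOfOrder() {
        return false; // docs must arrive in index (= sort) order
      }
    }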
> > >> > On Fri, Feb 7, 2014 at 4:17 PM, Michael McCandless <
> > >> > luc...@mikemccandless.com> wrote:
> > >> >
> > >> >> You want to focus merging on the segments containing newer
> > >> >> documents? Why? This seems somewhat dangerous...
> > >> >>
> > >> >> Not taking into account the "true" segment size can lead to very,
> > >> >> very poor merge decisions... you should turn on IndexWriter's
> > >> >> infoStream and do a long-running test to convince yourself the
> > >> >> merging is sane.
> > >> >>
> > >> >> Mike
> > >> >>
> > >> >> Mike McCandless
> > >> >>
> > >> >> http://blog.mikemccandless.com
> > >> >>
> > >> >> On Thu, Feb 6, 2014 at 11:24 PM, Ravikumar Govindarajan
> > >> >> <ravikumar.govindara...@gmail.com> wrote:
> > >> >> > Thanks Mike,
> > >> >> >
> > >> >> > Will try your suggestion. I will try to describe the actual
> > >> >> > use-case itself.
> > >> >> >
> > >> >> > There is a requirement for merging time-adjacent segments
> > >> >> > [append-only, rolling time-series data].
> > >> >> >
> > >> >> > All documents have a timestamp affixed, and during flush I need
> > >> >> > to note down the least timestamp across all documents, through a
> > >> >> > Codec.
> > >> >> >
> > >> >> > Then I define a TimeMergePolicy extends LogMergePolicy and set
> > >> >> > segment-size = Long.MAX_VALUE - SEG_LEAST_TIME [via
> > >> >> > segment-diag].
> > >> >> >
> > >> >> > LogMergePolicy will auto-arrange levels of segments according to
> > >> >> > time and proceed with merges. The latest segments will be smaller
> > >> >> > in size and preferred during merges over older and bigger
> > >> >> > segments.
> > >> >> >
> > >> >> > Do you think such an approach will be fine, or are there better
> > >> >> > ways to solve this?
> > >> >> >
> > >> >> > --
> > >> >> > Ravi
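The TimeMergePolicy idea above, roughly in code (an untested sketch; the
class, the diagnostics key, and the fallback are assumptions from this
thread, not an existing Lucene API, and as Mike warns above, ignoring
true byte sizes risks poor merge decisions):

    import java.io.IOException;
    import java.util.Map;

    import org.apache.lucene.index.LogMergePolicy;
    import org.apache.lucene.index.SegmentCommitInfo;

    // Ranks segments by the least timestamp recorded in their diagnostics,
    // so newer segments look "smaller" and LogMergePolicy levels and merges
    // them together, preferring them over older, bigger segments.
    public class TimeMergePolicy extends LogMergePolicy {

      // Hypothetical key; must match what the codec wrote at flush time.
      static final String LEAST_TIME_KEY = "leastTimestamp";

      @Override
      protected long size(SegmentCommitInfo info) throws IOException {
        Map<String, String> diag = info.info.getDiagnostics();
        String least = diag == null ? null : diag.get(LEAST_TIME_KEY);
        if (least == null) {
          // e.g. a merged segment without the diagnostic: fall back to bytes.
          return sizeBytes(info);
        }
        // Larger timestamps (newer segments) map to smaller "sizes".
        return Long.MAX_VALUE - Long.parseLong(least);
      }
    }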
> > >> >> > On Thu, Feb 6, 2014 at 4:34 PM, Michael McCandless <
> > >> >> > luc...@mikemccandless.com> wrote:
> > >> >> >
> > >> >> >> Somewhere in those numeric trie terms are the exact integers
> > >> >> >> from your documents, encoded.
> > >> >> >>
> > >> >> >> You can use oal.util.NumericUtils.prefixCodedToInt to get the
> > >> >> >> int value back from the BytesRef term.
> > >> >> >>
> > >> >> >> But you need to filter out the "higher level" terms, e.g. using
> > >> >> >> NumericUtils.getPrefixCodedIntShift(term) == 0. Or use
> > >> >> >> NumericUtils.filterPrefixCodedInts to wrap a TermsEnum. I
> > >> >> >> believe all the terms you want come first, so once you hit a
> > >> >> >> term where .getPrefixCodedIntShift is > 0, that's your max term
> > >> >> >> and you can stop checking.
> > >> >> >>
> > >> >> >> BTW, in 5.0 the codec API for PostingsFormat has improved, so
> > >> >> >> that you can e.g. pull your own TermsEnum and iterate the terms
> > >> >> >> yourself.
> > >> >> >>
> > >> >> >> Mike McCandless
> > >> >> >>
> > >> >> >> http://blog.mikemccandless.com
> > >> >> >>
> > >> >> >> On Thu, Feb 6, 2014 at 5:16 AM, Ravikumar Govindarajan
> > >> >> >> <ravikumar.govindara...@gmail.com> wrote:
> > >> >> >> > I use a Codec to flush data. All methods delegate to the
> > >> >> >> > actual Lucene42Codec, except for intercepting one single
> > >> >> >> > field. This field is indexed as an IntField
> > >> >> >> > [Numeric-Trie...], with precisionStep=4.
> > >> >> >> >
> > >> >> >> > The purpose of the Codec is as follows:
> > >> >> >> >
> > >> >> >> > 1. Note the first BytesRef for this field
> > >> >> >> > 2. During the finish() call [TermsConsumer.java], note the
> > >> >> >> >    last BytesRef for this field
> > >> >> >> > 3. Convert both the first/last BytesRef to their respective
> > >> >> >> >    integers
> > >> >> >> > 4. Store these 2 ints in segment-info diagnostics
> > >> >> >> >
> > >> >> >> > The problem with this approach is that the first/last BytesRef
> > >> >> >> > is totally different from the actual "int" values I try to
> > >> >> >> > index. I guess this is because Numeric-Trie explodes all the
> > >> >> >> > integers into its own format of BytesRefs. Hence my Codec
> > >> >> >> > stores the wrong values in segment-diagnostics.
> > >> >> >> >
> > >> >> >> > Is there a way I can record the actual min/max int-values
> > >> >> >> > correctly in my codec and still support NumericRange search?
> > >> >> >> >
> > >> >> >> > --
> > >> >> >> > Ravi
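Mike's decoding suggestion above, spelled out (an untested per-segment
sketch against the 4.x NumericUtils API; the class and method names are
made up):

    import java.io.IOException;

    import org.apache.lucene.index.AtomicReader;
    import org.apache.lucene.index.Terms;
    import org.apache.lucene.index.TermsEnum;
    import org.apache.lucene.util.BytesRef;
    import org.apache.lucene.util.NumericUtils;

    public final class NumericMinMax {
      private NumericMinMax() {}

      // Recovers the true min/max of an IntField from one segment's terms.
      // Full-precision values live in the shift==0 terms, which sort before
      // all higher trie levels and in increasing numeric order; so the first
      // shift==0 term is the min, and the last one (just before the first
      // shift > 0 term) is the max.
      public static int[] minAndMax(AtomicReader reader, String field)
          throws IOException {
        Terms terms = reader.terms(field);
        if (terms == null) {
          return null; // field not present in this segment
        }
        TermsEnum te = terms.iterator(null);
        int min = 0, max = 0;
        boolean seen = false;
        BytesRef term;
        while ((term = te.next()) != null) {
          if (NumericUtils.getPrefixCodedIntShift(term) > 0) {
            break; // first higher-level trie term: the max was already seen
          }
          int value = NumericUtils.prefixCodedToInt(term);
          if (!seen) {
            min = value;
            seen = true;
          }
          max = value;
        }
        return seen ? new int[] { min, max } : null;
      }
    }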