Re: Actual min and max-value of NumericField during codec flush

Ravikumar Govindarajan Wed, 12 Feb 2014 05:17:53 -0800

Yes, exactly as you have described.

Ex: Consider segments [S1, S2, S3 & S4] that are in reverse-chronological
order and go for a merge.


While SortingMergePolicy will correctly solve the merge part, it does not,
however, play any role in picking segments to merge, right?

SMP internally delegates to TieredMergePolicy, which might pick S1 & S4 to
merge, disturbing the global order. Ideally only "adjacent" segments should
be picked up for merge. Ex: {S1,S2} or {S2,S3,S4} etc...
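
To make "adjacent" concrete, here is a minimal sketch in plain Java, with no
Lucene types involved. The class AdjacentMergeSketch, its Window type, the
maxMergeAtOnce and maxSkew parameters and the sample sizes are all
hypothetical. It only enumerates contiguous runs of a time-ordered segment
list and drops runs whose sizes are too skewed (the size concern raised
earlier in this thread). It is not how TieredMergePolicy or
SortingMergePolicy select merges.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Lucene: enumerates merge candidates made
// only of neighbouring segments in a time-ordered segment list.
final class AdjacentMergeSketch {

    /** A candidate merge covering segments [start, end) of the ordered list. */
    static final class Window {
        final int start;
        final int end;

        Window(int start, int end) {
            this.start = start;
            this.end = end;
        }

        @Override
        public String toString() {
            return "[" + start + "," + end + ")";
        }
    }

    /**
     * sizes[i] is the size of the i-th segment in time order.
     * A window is kept only if it has at least 2 and at most maxMergeAtOnce
     * segments, and its largest/smallest size ratio is <= maxSkew.
     */
    static List<Window> adjacentCandidates(long[] sizes, int maxMergeAtOnce, double maxSkew) {
        List<Window> out = new ArrayList<>();
        for (int start = 0; start < sizes.length; start++) {
            long min = sizes[start];
            long max = sizes[start];
            // Grow the window one neighbour at a time; never skip a segment.
            for (int end = start + 1; end < sizes.length && end - start < maxMergeAtOnce; end++) {
                min = Math.min(min, sizes[end]);
                max = Math.max(max, sizes[end]);
                if ((double) max / Math.max(1L, min) <= maxSkew) {
                    out.add(new Window(start, end + 1));
                }
            }
        }
        return out;
    }

    public static void main(String[] args) {
        long[] sizes = {10, 12, 40, 400}; // made-up sizes for S1..S4
        System.out.println(adjacentCandidates(sizes, 3, 5.0));
    }
}
```

With these made-up sizes the candidates are {S1,S2}, {S1,S2,S3} and {S2,S3};
S1 & S4 can never be paired because they are not neighbours, and any window
containing S4 is rejected as too skewed. A real policy would still have to
score the surviving windows.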

Can there be a better selection of segments to merge in this case, so as to
maintain a semblance of global-ordering?

--
Ravi



On Wed, Feb 12, 2014 at 6:21 PM, Michael McCandless <
[email protected]> wrote:

> OK, I see (early termination).
>
> That's a challenge, because you really want the docs sorted backwards
> from how they were added right?  And, e.g., merged and then searched
> in "reverse segment order"?
>
> I think you should be able to do this w/ SortingMergePolicy?  And then
> use a custom collector that stops after you've gone back enough in
> time for a given search.
>
> Mike McCandless
>
> http://blog.mikemccandless.com
>
>
> On Wed, Feb 12, 2014 at 6:04 AM, Ravikumar Govindarajan
> <[email protected]> wrote:
> > Mike,
> >
> > All our queries need to be sorted by timestamp field, in descending order
> > of time. [latest-first]
> >
> > Each segment is sorted in itself. But TieredMergePolicy picks arbitrary
> > segments and merges them [even with SortingMergePolicy etc...]. I am
> > trying to avoid this and see if an approximate global ordering of
> > segments [by time-stamp field] can be maintained via merge.
> >
> > Ex: TopN results will only examine the 2-3 most recent, smaller segments
> > [best-case] and return, without examining older and bigger segments.
> >
> > I do not know the terminology, maybe "Early Query Termination Across
> > Segments" etc...?
> >
> > --
> > Ravi
> >
> >
> > On Fri, Feb 7, 2014 at 10:42 PM, Michael McCandless <
> > [email protected]> wrote:
> >
> >> LogByteSizeMergePolicy (and LogDocMergePolicy) will preserve the total
> >> order.
> >>
> >> Only TieredMergePolicy merges out-of-order segments.
> >>
> >> I don't understand why you need to encourage merging of the more
> >> recent (by your "time" field) segments...
> >>
> >> Mike McCandless
> >>
> >> http://blog.mikemccandless.com
> >>
> >>
> >> On Fri, Feb 7, 2014 at 8:18 AM, Ravikumar Govindarajan
> >> <[email protected]> wrote:
> >> > Mike,
> >> >
> >> > Each of my flushed segments is fully ordered by time. But
> >> > TieredMergePolicy or LogByteSizeMergePolicy is going to pick arbitrary
> >> > time-segments and disturb this arrangement, and I wanted some kind of
> >> > control on this.
> >> >
> >> > But like you pointed out, going only by time-adjacent merges can be
> >> > disastrous.
> >> >
> >> > Is there a way to mix both time and size to arrive at a somewhat
> >> > [less-than-accurate] global order of segment merges?
> >> >
> >> > Like attempt a time-adjacent merge, provided size of segments is not
> >> > extremely skewed etc...
> >> >
> >> > --
> >> > Ravi
> >> >
> >> >
> >> >
> >> >
> >> >
> >> >
> >> >
> >> > On Fri, Feb 7, 2014 at 4:17 PM, Michael McCandless <
> >> > [email protected]> wrote:
> >> >
> >> >> You want to focus merging on the segments containing newer documents?
> >> >> Why?  This seems somewhat dangerous...
> >> >>
> >> >> Not taking into account the "true" segment size can lead to very very
> >> >> poor merge decisions ... you should turn on IndexWriter's infoStream
> >> >> and do a long running test to convince yourself the merging is being
> >> >> sane.
> >> >>
> >> >> Mike
> >> >>
> >> >> Mike McCandless
> >> >>
> >> >> http://blog.mikemccandless.com
> >> >>
> >> >>
> >> >> On Thu, Feb 6, 2014 at 11:24 PM, Ravikumar Govindarajan
> >> >> <[email protected]> wrote:
> >> >> > Thanks Mike,
> >> >> >
> >> >> > Will try your suggestion. I will try to describe the actual use-case
> >> >> > itself.
> >> >> >
> >> >> > There is a requirement for merging time-adjacent segments
> >> >> > [append-only, rolling time-series data].
> >> >> >
> >> >> > All documents have a timestamp affixed, and during flush I need to
> >> >> > note down the least timestamp across all documents, through the Codec.
> >> >> >
> >> >> > Then, I define a TimeMergePolicy that extends LogMergePolicy and
> >> >> > define the segment-size = Long.MAX_VALUE - SEG_LEAST_TIME
> >> >> > [segment-diag].
> >> >> >
> >> >> > LogMergePolicy will auto-arrange levels of segments according to time
> >> >> > and proceed with merges. The latest segments will be smaller in size
> >> >> > and preferred during merges over older and bigger segments.
> >> >> >
> >> >> > Do you think such an approach will be fine, or are there better
> >> >> > ways to solve this?
> >> >> >
> >> >> > --
> >> >> > Ravi
> >> >> >
> >> >> >
> >> >> > On Thu, Feb 6, 2014 at 4:34 PM, Michael McCandless <
> >> >> > [email protected]> wrote:
> >> >> >
> >> >> >> Somewhere in those numeric trie terms are the exact integers from
> >> >> >> your documents, encoded.
> >> >> >>
> >> >> >> You can use oal.util.NumericUtils.prefixCodedToInt to get the int
> >> >> >> value back from the BytesRef term.
> >> >> >>
> >> >> >> But you need to filter out the "higher level" terms, e.g. using
> >> >> >> NumericUtils.getPrefixCodedLongShift(term) == 0.  Or use
> >> >> >> NumericUtils.filterPrefixCodedLongs to wrap a TermsEnum.  I
> >> >> >> believe all the terms you want come first, so once you hit a term
> >> >> >> where .getPrefixCodedLongShift is > 0, that's your max term and you
> >> >> >> can stop checking.
> >> >> >>
> >> >> >> BTW, in 5.0, the codec API for PostingsFormat has improved, so
> >> >> >> that you can e.g. pull your own TermsEnum and iterate the terms
> >> >> >> yourself.
> >> >> >>
> >> >> >> Mike McCandless
> >> >> >>
> >> >> >> http://blog.mikemccandless.com
> >> >> >>
> >> >> >>
> >> >> >> On Thu, Feb 6, 2014 at 5:16 AM, Ravikumar Govindarajan
> >> >> >> <[email protected]> wrote:
> >> >> >> > I use a Codec to flush data. All methods delegate to the actual
> >> >> >> > Lucene42Codec, except for intercepting one single field. This field
> >> >> >> > is indexed as an IntField [Numeric-Trie...], with precisionStep=4.
> >> >> >> >
> >> >> >> > The purpose of the Codec is as follows
> >> >> >> >
> >> >> >> > 1. Note the first BytesRef for this field
> >> >> >> > 2. During the finish() call [TermsConsumer.java], note the last
> >> >> >> >    BytesRef for this field
> >> >> >> > 3. Convert both the first/last BytesRef to their respective integers
> >> >> >> > 4. Store these 2 ints in segment-info diagnostics
> >> >> >> >
> >> >> >> > The problem with this approach is that the first/last BytesRef is
> >> >> >> > totally different from the actual "int" values I try to index. I
> >> >> >> > guess this is because Numeric-Trie explodes all the integers into
> >> >> >> > its own format of BytesRefs. Hence my Codec stores the wrong values
> >> >> >> > in segment-diagnostics.
> >> >> >> >
> >> >> >> > Is there a way I can record the actual min/max int-values correctly
> >> >> >> > in my codec and still support NumericRange search?
> >> >> >> >
> >> >> >> > --
> >> >> >> > Ravi
> >> >> >>
> >> >> >>
> >> >> >> ---------------------------------------------------------------------
> >> >> >> To unsubscribe, e-mail: [email protected]
> >> >> >> For additional commands, e-mail: [email protected]
> >> >> >>
> >> >> >>
