Mike,

Each of my flushed segments is fully ordered by time. But TieredMergePolicy
or LogByteSizeMergePolicy will pick arbitrary time-segments and
disturb this arrangement, and I wanted some kind of control over this.

But as you pointed out, going only by time-adjacent merges can be
disastrous.

Is there a way to mix both time and size to arrive at a somewhat
[less-than-accurate] global order of segment merges?

For example, attempt a time-adjacent merge provided the sizes of the segments
are not extremely skewed, etc.
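
Roughly the kind of guard I have in mind, as an untested sketch (the
"leastTimestamp" diagnostics key and the skew threshold are just assumptions
on my side):

    import java.io.IOException;
    import java.util.Collections;
    import java.util.Comparator;
    import java.util.List;
    import org.apache.lucene.index.SegmentCommitInfo;

    final class TimeSizeMergeGuard {
      // Assumed threshold: largest/smallest byte-size ratio tolerated in a merge.
      private static final double MAX_SKEW = 10.0;

      // Least timestamp my codec stored in the segment diagnostics at flush.
      static long leastTime(SegmentCommitInfo si) {
        String v = si.info.getDiagnostics().get("leastTimestamp");
        return v == null ? Long.MAX_VALUE : Long.parseLong(v);
      }

      // Arrange segments in time order, oldest first.
      static void sortByTime(List<SegmentCommitInfo> segments) {
        Collections.sort(segments, new Comparator<SegmentCommitInfo>() {
          @Override
          public int compare(SegmentCommitInfo a, SegmentCommitInfo b) {
            return Long.compare(leastTime(a), leastTime(b));
          }
        });
      }

      // Permit a time-adjacent merge only when the byte sizes are comparable.
      static boolean sizesComparable(SegmentCommitInfo a, SegmentCommitInfo b)
          throws IOException {
        long sa = Math.max(1, a.sizeInBytes());
        long sb = Math.max(1, b.sizeInBytes());
        return (double) Math.max(sa, sb) / Math.min(sa, sb) <= MAX_SKEW;
      }
    }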

--
Ravi

On Fri, Feb 7, 2014 at 4:17 PM, Michael McCandless <
luc...@mikemccandless.com> wrote:

> You want to focus merging on the segments containing newer documents?
> Why?  This seems somewhat dangerous...
>
> Not taking into account the "true" segment size can lead to very very
> poor merge decisions ... you should turn on IndexWriter's infoStream
> and do a long running test to convince yourself the merging is being
> sane.
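>
> (For reference, one minimal way to do that in 4.x, as a sketch; "dir" and
> "analyzer" are whatever you already use:)
>
>     import java.io.IOException;
>     import org.apache.lucene.analysis.Analyzer;
>     import org.apache.lucene.index.IndexWriter;
>     import org.apache.lucene.index.IndexWriterConfig;
>     import org.apache.lucene.store.Directory;
>     import org.apache.lucene.util.PrintStreamInfoStream;
>     import org.apache.lucene.util.Version;
>
>     // Open a writer with infoStream enabled so every flush/merge decision is
>     // logged (to stdout here) during a long-running test.
>     static IndexWriter openWriterWithInfoStream(Directory dir, Analyzer analyzer)
>         throws IOException {
>       // Match the Version constant to your actual Lucene release.
>       IndexWriterConfig iwc = new IndexWriterConfig(Version.LUCENE_42, analyzer);
>       iwc.setInfoStream(new PrintStreamInfoStream(System.out));
>       return new IndexWriter(dir, iwc);
>     }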
>
> Mike
>
> Mike McCandless
>
> http://blog.mikemccandless.com
>
>
> On Thu, Feb 6, 2014 at 11:24 PM, Ravikumar Govindarajan
> <ravikumar.govindara...@gmail.com> wrote:
> > Thanks Mike,
> >
> > Will try your suggestion. I will try to describe the actual use-case
> > itself.
> >
> > There is a requirement for merging time-adjacent segments [append-only,
> > rolling time-series data]
> >
> > All documents have a timestamp affixed, and during flush I need to note
> > down the least timestamp across all documents, through the Codec.
> >
> > Then I define a TimeMergePolicy extending LogMergePolicy and define the
> > segment-size = Long.MAX_VALUE - SEG_LEAST_TIME [from segment diagnostics].
> >
> > LogMergePolicy will auto-arrange levels of segments according to time and
> > proceed with merges. The latest segments will be smaller in size and
> > preferred during merges over older, bigger segments.
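> >
> > A bare-bones, untested sketch of what I mean ("leastTimestamp" is whatever
> > key my codec writes into the segment diagnostics at flush):
> >
> >     import java.io.IOException;
> >     import org.apache.lucene.index.LogMergePolicy;
> >     import org.apache.lucene.index.SegmentCommitInfo;
> >
> >     // Make LogMergePolicy see "newer" as "smaller": report the segment size
> >     // as Long.MAX_VALUE minus its least timestamp, read back from the
> >     // diagnostics written at flush time.
> >     public class TimeMergePolicy extends LogMergePolicy {
> >       @Override
> >       protected long size(SegmentCommitInfo info) throws IOException {
> >         String least = info.info.getDiagnostics().get("leastTimestamp");
> >         long leastTime = (least == null) ? 0L : Long.parseLong(least);
> >         return Long.MAX_VALUE - leastTime;
> >       }
> >     }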
> >
> > Do you think such an approach will be fine or there are better ways to
> > solve this?
> >
> > --
> > Ravi
> >
> >
> > On Thu, Feb 6, 2014 at 4:34 PM, Michael McCandless <
> > luc...@mikemccandless.com> wrote:
> >
> >> Somewhere in those numeric trie terms are the exact integers from your
> >> documents, encoded.
> >>
> >> You can use oal.util.NumericUtils.prefixCodedToInt to get the int
> >> value back from the BytesRef term.
> >>
> >> But you need to filter out the "higher level" terms, e.g. using
> >> NumericUtils.getPrefixCodedLongShift(term) == 0.  Or use
> >> NumericUtils.filterPrefixCodedLongs to wrap a TermsEnum.  I believe
> >> all the terms you want come first, so once you hit a term where
> >> .getPrefixCodedLongShift is > 0, that's your max term and you can stop
> >> checking.
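> >>
> >> Something like this rough sketch (4.x API, using the int variants of those
> >> helpers since your field is an IntField):
> >>
> >>     import java.io.IOException;
> >>     import org.apache.lucene.index.Terms;
> >>     import org.apache.lucene.index.TermsEnum;
> >>     import org.apache.lucene.util.BytesRef;
> >>     import org.apache.lucene.util.NumericUtils;
> >>
> >>     // Walk only the full-precision (shift == 0) terms: the first decodes to
> >>     // the min value and, since terms are sorted, the last one to the max.
> >>     // Returns null for an empty field.
> >>     static int[] minMaxInts(Terms terms) throws IOException {
> >>       TermsEnum te = terms.iterator(null);
> >>       Integer min = null;
> >>       int max = 0;
> >>       for (BytesRef t = te.next();
> >>            t != null && NumericUtils.getPrefixCodedIntShift(t) == 0;
> >>            t = te.next()) {
> >>         int v = NumericUtils.prefixCodedToInt(t);
> >>         if (min == null) min = v;
> >>         max = v;
> >>       }
> >>       return (min == null) ? null : new int[] { min, max };
> >>     }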
> >>
> >> BTW, in 5.0, the codec API for PostingsFormat has improved, so that
> >> you can e.g. pull your own TermsEnum and iterate the terms yourself.
> >>
> >> Mike McCandless
> >>
> >> http://blog.mikemccandless.com
> >>
> >>
> >> On Thu, Feb 6, 2014 at 5:16 AM, Ravikumar Govindarajan
> >> <ravikumar.govindara...@gmail.com> wrote:
> >> > I use a Codec to flush data. All methods delegate to the actual
> >> > Lucene42Codec, except for intercepting one single field. This field is
> >> > indexed as an
> >> > IntField [Numeric-Trie...], with precisionStep=4.
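> >> >
> >> > Structurally it looks roughly like this (sketch only; "timestamp" and
> >> > TimestampPostingsFormat stand in for my actual field name and wrapping
> >> > postings format):
> >> >
> >> >     import org.apache.lucene.codecs.PostingsFormat;
> >> >     import org.apache.lucene.codecs.lucene42.Lucene42Codec;
> >> >
> >> >     // Everything delegates to Lucene42Codec; only the one field is routed
> >> >     // through an intercepting postings format (TimestampPostingsFormat,
> >> >     // not shown) that notes the first/last terms it sees.
> >> >     public final class TimestampCodec extends Lucene42Codec {
> >> >       private final PostingsFormat intercepting = new TimestampPostingsFormat();
> >> >
> >> >       @Override
> >> >       public PostingsFormat getPostingsFormatForField(String field) {
> >> >         return "timestamp".equals(field)
> >> >             ? intercepting
> >> >             : super.getPostingsFormatForField(field);
> >> >       }
> >> >     }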
> >> >
> >> > The purpose of the Codec is as follows
> >> >
> >> > 1. Note the first BytesRef for this field
> >> > 2. During the finish() call [TermsConsumer.java], note the last BytesRef
> >> > for this field
> >> > 3. Convert both the first/last BytesRef to their respective integers
> >> > 4. Store these 2 ints in the segment-info diagnostics
> >> >
> >> > The problem with this approach is that the first/last BytesRef is totally
> >> > different from the actual "int" values I try to index. I guess this is
> >> > because Numeric-Trie explodes all the integers into its own format of
> >> > BytesRefs. Hence my Codec stores the wrong values in the segment
> >> > diagnostics.
> >> >
> >> > Is there a way I can record the actual min/max int-values correctly in my
> >> > codec and still support NumericRange search?
> >> >
> >> > --
> >> > Ravi
> >>
>
