Yeah, now I understand a little better. Since LogMP always merges adjacent segments, that should pretty much serve my use-case when used with a SortingMP.
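Just to confirm my understanding, the wiring would be something like this (a minimal sketch against the Lucene 4.6 API; the "timestamp" field name and the analyzer are placeholders, and sorting by the field assumes it is indexed in a sortable form, e.g. as a NumericDocValuesField):

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.index.IndexWriterConfig;
    import org.apache.lucene.index.LogByteSizeMergePolicy;
    import org.apache.lucene.index.sorter.SortingMergePolicy;
    import org.apache.lucene.search.Sort;
    import org.apache.lucene.search.SortField;
    import org.apache.lucene.util.Version;

    // Newest-first sort on the timestamp field (placeholder name).
    Sort byTimeDesc = new Sort(new SortField("timestamp", SortField.Type.LONG, true));

    // LogByteSizeMP only ever merges adjacent segments, so the
    // segment-level time order produced by appending is never shuffled...
    LogByteSizeMergePolicy logMp = new LogByteSizeMergePolicy();

    // ...and SortingMP (lucene-misc module) re-sorts every merged segment by time.
    IndexWriterConfig iwc = new IndexWriterConfig(Version.LUCENE_46,
        new StandardAnalyzer(Version.LUCENE_46));
    iwc.setMergePolicy(new SortingMergePolicy(logMp, byTimeDesc));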
Early-query termination quits by throwing an exception, right? Is it OK to search each SegmentReader individually and then break off, instead of using a MultiReader, especially when the order is known before the search begins?

The reason why I insisted on timestamp-based merging is that there is a possibility of an out-of-order segment being added via an addIndexes(...) call. That segment can be of any older timestamp [a month ago, a year ago etc...], albeit extremely rare. Should I worry about it during merges, or just handle overlaps during search?

-- Ravi
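On the first question: a collector can throw CollectionTerminatedException, but IndexSearcher catches it and just moves on to the next segment, so it terminates one segment rather than the whole search. Breaking off completely means driving the loop yourself, roughly like this (untested sketch; maxTimestamp() is a hypothetical helper that reads the timestamp recorded in each segment's diagnostics, and dir/query are assumed to be in scope):

    import java.util.ArrayList;
    import java.util.Collections;
    import java.util.Comparator;
    import java.util.List;
    import org.apache.lucene.index.AtomicReaderContext;
    import org.apache.lucene.index.DirectoryReader;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.TopDocs;

    DirectoryReader reader = DirectoryReader.open(dir);
    // leaves() is unmodifiable, so copy before sorting
    List<AtomicReaderContext> leaves = new ArrayList<AtomicReaderContext>(reader.leaves());
    Collections.sort(leaves, new Comparator<AtomicReaderContext>() {
      public int compare(AtomicReaderContext a, AtomicReaderContext b) {
        long ta = maxTimestamp(a), tb = maxTimestamp(b); // hypothetical helper
        return ta > tb ? -1 : (ta < tb ? 1 : 0);         // newest segment first
      }
    });

    int needed = 20, found = 0;
    for (AtomicReaderContext leaf : leaves) {
      IndexSearcher perSegment = new IndexSearcher(leaf.reader()); // one segment at a time
      TopDocs hits = perSegment.search(query, needed - found);
      found += hits.scoreDocs.length;
      // doc ids are segment-local here; add leaf.docBase for index-wide ids
      if (found >= needed) {
        break; // enough recent hits -- older segments are never touched
      }
    }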
On Thu, Feb 13, 2014 at 1:21 PM, Shai Erera <ser...@gmail.com> wrote:
> Hi
>
> LogMP *always* picks adjacent segments together. Therefore, if you have segments S1, S2, S3, S4 where the date-wise sort order is S4>S3>S2>S1, then LogMP will pick either S1-S4, S2-S4, S2-S3 and so on. But always adjacent segments, and in a row (i.e. it doesn't skip segments).
>
> I guess what both Mike and I don't understand is why you insist on merging based on the timestamp of each segment. I.e. if the order, timestamp-wise, of the segments isn't as I described above, then merging them like so won't hurt - they will still be unsorted. No harm is done.
>
> Maybe MergePolicy isn't what you need here. If you can record somewhere the min/max timestamp of each segment, you can use a MultiReader to wrap the sorted list of IndexReaders (actually SegmentReaders). Then your "reader" always traverses segments from new to old.
>
> If this approach won't address your issue, then you can merge based on timestamps - there's nothing wrong with it. What Mike suggested is that you benchmark your application with this merge policy, for a long period of time (a few hours/days, depending on your indexing rate), because what might happen is that your merges are always unbalanced and your indexing performance will degrade because of the unbalanced amount of IO that happens during the merge.
>
> Shai
>
> On Thu, Feb 13, 2014 at 7:25 AM, Ravikumar Govindarajan <ravikumar.govindara...@gmail.com> wrote:
> > @Mike,
> >
> > I had suggested the same approach in one of my previous mails, whereby each segment records min/max timestamps in seg-info diagnostics and uses them for merging adjacent segments.
> >
> > "Then, I define a TimeMergePolicy extends LogMergePolicy and define the segment-size=Long.MAX_VALUE - SEG_LEAST_TIME [segment-diag]."
> >
> > But you have expressed reservations:
> >
> > "This seems somewhat dangerous... Not taking into account the "true" segment size can lead to very very poor merge decisions ... you should turn on IndexWriter's infoStream and do a long running test to convince yourself the merging is being sane."
> >
> > Will merging be disastrous if I choose a TimeMergePolicy? I will also test and verify, but it's always great to hear finer points from experts.
> >
> > @Shai,
> >
> > LogByteSizeMP categorizes "adjacency" by "size", whereas it would be better if "timestamp" were used in my case.
> >
> > Sure, I need to wrap this in an SMP to make sure that the newly-created segment is also in sorted order.
> >
> > -- Ravi
> >
> > On Wed, Feb 12, 2014 at 8:29 PM, Shai Erera <ser...@gmail.com> wrote:
> > > Why not use LogByteSizeMP in conjunction w/ SortingMP? LogMP picks adjacent segments and SortingMP ensures the merged segment is also sorted.
> > >
> > > Shai
> > >
> > > On Wed, Feb 12, 2014 at 3:16 PM, Ravikumar Govindarajan <ravikumar.govindara...@gmail.com> wrote:
> > > > Yes, exactly as you have described.
> > > >
> > > > Ex: Consider Segments [S1, S2, S3 & S4] are in reverse-chronological order and go for a merge.
> > > >
> > > > While SortingMergePolicy will correctly solve the merge part, it does not however play any role in picking the segments to merge, right?
> > > >
> > > > SMP internally delegates to TieredMergePolicy, which might pick S1 & S4 to merge, disturbing the global order. Ideally only "adjacent" segments should be picked up for merge. Ex: {S1,S2} or {S2,S3,S4} etc...
> > > >
> > > > Can there be a better selection of segments to merge in this case, so as to maintain a semblance of global ordering?
> > > >
> > > > -- Ravi
> > > >
> > > > On Wed, Feb 12, 2014 at 6:21 PM, Michael McCandless <luc...@mikemccandless.com> wrote:
> > > > > OK, I see (early termination).
> > > > >
> > > > > That's a challenge, because you really want the docs sorted backwards from how they were added, right? And, e.g., merged and then searched in "reverse segment order"?
> > > > >
> > > > > I think you should be able to do this w/ SortingMergePolicy? And then use a custom collector that stops after you've gone back enough in time for a given search.
> > > > >
> > > > > Mike McCandless
> > > > >
> > > > > http://blog.mikemccandless.com
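A cutoff collector along the lines Mike describes might look like this (untested sketch against the 4.x Collector API; it assumes the timestamp is also indexed as a NumericDocValuesField under the placeholder name "timestamp", and that each segment is internally sorted newest-first by SortingMP):

    import java.io.IOException;
    import org.apache.lucene.index.AtomicReaderContext;
    import org.apache.lucene.index.NumericDocValues;
    import org.apache.lucene.search.CollectionTerminatedException;
    import org.apache.lucene.search.Collector;
    import org.apache.lucene.search.Scorer;

    // Stops collecting a segment once docs get older than a cutoff; only
    // valid when each segment is internally sorted newest-first.
    class TimeCutoffCollector extends Collector {
      private final long cutoff; // oldest timestamp the query still cares about
      private NumericDocValues timestamps;

      TimeCutoffCollector(long cutoff) {
        this.cutoff = cutoff;
      }

      @Override
      public void setScorer(Scorer scorer) throws IOException {
        // not scoring in this sketch
      }

      @Override
      public void setNextReader(AtomicReaderContext context) throws IOException {
        // assumes every segment actually has these doc values (non-null)
        timestamps = context.reader().getNumericDocValues("timestamp");
      }

      @Override
      public void collect(int doc) throws IOException {
        if (timestamps.get(doc) < cutoff) {
          // the segment is sorted newest-first, so every later doc is older
          throw new CollectionTerminatedException();
        }
        // ... record the hit, or delegate to a wrapped collector
      }

      @Override
      public boolean acceptsDocsOutOfOrder() {
        return false; // the cutoff logic needs in-order collection
      }
    }

Note the exception only terminates the current segment: IndexSearcher catches CollectionTerminatedException and moves on to the next leaf, so breaking off the whole search still needs a per-segment loop like the one sketched earlier.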
> > > > > On Wed, Feb 12, 2014 at 6:04 AM, Ravikumar Govindarajan <ravikumar.govindara...@gmail.com> wrote:
> > > > > > Mike,
> > > > > >
> > > > > > All our queries need to be sorted by timestamp field, in descending order of time [latest-first].
> > > > > >
> > > > > > Each segment is sorted in itself. But TieredMergePolicy picks arbitrary segments and merges them [even with SortingMergePolicy etc...]. I am trying to avoid this and see if an approximate global ordering of segments [by timestamp field] can be maintained via merge.
> > > > > >
> > > > > > Ex: TopN results will only examine the recent 2-3 smaller segments [best-case] and return, without examining older and bigger segments.
> > > > > >
> > > > > > I do not know the terminology, maybe "Early Query Termination Across Segments" etc...?
> > > > > >
> > > > > > -- Ravi
> > > > > >
> > > > > > On Fri, Feb 7, 2014 at 10:42 PM, Michael McCandless <luc...@mikemccandless.com> wrote:
> > > > > >> LogByteSizeMergePolicy (and LogDocMergePolicy) will preserve the total order.
> > > > > >>
> > > > > >> Only TieredMergePolicy merges out-of-order segments.
> > > > > >>
> > > > > >> I don't understand why you need to encourage merging of the more recent (by your "time" field) segments...
> > > > > >>
> > > > > >> Mike McCandless
> > > > > >>
> > > > > >> http://blog.mikemccandless.com
> > > > > >>
> > > > > >> On Fri, Feb 7, 2014 at 8:18 AM, Ravikumar Govindarajan <ravikumar.govindara...@gmail.com> wrote:
> > > > > >> > Mike,
> > > > > >> >
> > > > > >> > Each of my flushed segments is fully ordered by time. But TieredMergePolicy or LogByteSizeMergePolicy is going to pick arbitrary time-segments and disturb this arrangement, and I wanted some kind of control on this.
> > > > > >> >
> > > > > >> > But like you pointed out, going by time-adjacent merges alone can be disastrous.
> > > > > >> >
> > > > > >> > Is there a way to mix both time and size to arrive at a somewhat [less-than-accurate] global order of segment merges?
> > > > > >> >
> > > > > >> > Like attempt a time-adjacent merge, provided the size of the segments is not extremely skewed etc...
> > > > > >> >
> > > > > >> > -- Ravi
> > > > > >> >
> > > > > >> > On Fri, Feb 7, 2014 at 4:17 PM, Michael McCandless <luc...@mikemccandless.com> wrote:
> > > > > >> >> You want to focus merging on the segments containing newer documents? Why? This seems somewhat dangerous...
> > > > > >> >>
> > > > > >> >> Not taking into account the "true" segment size can lead to very very poor merge decisions ... you should turn on IndexWriter's infoStream and do a long running test to convince yourself the merging is being sane.
> > > > > >> >>
> > > > > >> >> Mike
> > > > > >> >>
> > > > > >> >> Mike McCandless
> > > > > >> >>
> > > > > >> >> http://blog.mikemccandless.com
> > > > > >> >>
> > > > > >> >> On Thu, Feb 6, 2014 at 11:24 PM, Ravikumar Govindarajan <ravikumar.govindara...@gmail.com> wrote:
> > > > > >> >> > Thanks Mike,
> > > > > >> >> >
> > > > > >> >> > Will try your suggestion. I will try to describe the actual use-case itself.
> > > > > >> >> >
> > > > > >> >> > There is a requirement for merging time-adjacent segments [append-only, rolling time-series data].
> > > > > >> >> >
> > > > > >> >> > All documents have a timestamp affixed, and during flush I need to note down the least timestamp across all documents, through a Codec.
> > > > > >> >> >
> > > > > >> >> > Then, I define a TimeMergePolicy that extends LogMergePolicy and defines the segment-size = Long.MAX_VALUE - SEG_LEAST_TIME [segment-diag].
> > > > > >> >> >
> > > > > >> >> > LogMergePolicy will auto-arrange levels of segments according to time and proceed with merges. The latest segments will be smaller in size and preferred during merges over older and bigger segments.
> > > > > >> >> >
> > > > > >> >> > Do you think such an approach will be fine, or are there better ways to solve this?
> > > > > >> >> >
> > > > > >> >> > -- Ravi
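The TimeMergePolicy idea above, as a rough sketch (assuming the 4.x LogMergePolicy API, where subclasses override size(); "min.timestamp" is a hypothetical diagnostics key that the flush/codec code would have to write):

    import java.io.IOException;
    import org.apache.lucene.index.LogMergePolicy;
    import org.apache.lucene.index.SegmentCommitInfo;

    // Rank segments by recency instead of byte size: newer segments report a
    // smaller "size", so LogMergePolicy levels and merges them together.
    class TimeMergePolicy extends LogMergePolicy {
      @Override
      protected long size(SegmentCommitInfo info) throws IOException {
        // "min.timestamp" is a hypothetical key written into the segment
        // diagnostics at flush time
        String leastTime = info.info.getDiagnostics().get("min.timestamp");
        long least = leastTime == null ? 0L : Long.parseLong(leastTime);
        return Long.MAX_VALUE - least;
      }
    }

Per Mike's warning, the reported "sizes" no longer reflect real IO cost, so a long run with infoStream enabled is the only way to confirm the merges stay sane.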
> > > > > >> >> > On Thu, Feb 6, 2014 at 4:34 PM, Michael McCandless <luc...@mikemccandless.com> wrote:
> > > > > >> >> >> Somewhere in those numeric trie terms are the exact integers from your documents, encoded.
> > > > > >> >> >>
> > > > > >> >> >> You can use oal.util.NumericUtils.prefixCodedToInt to get the int value back from the BytesRef term.
> > > > > >> >> >>
> > > > > >> >> >> But you need to filter out the "higher level" terms, e.g. using NumericUtils.getPrefixCodedLongShift(term) == 0. Or use NumericUtils.filterPrefixCodedLongs to wrap a TermsEnum. I believe all the terms you want come first, so once you hit a term where .getPrefixCodedLongShift is > 0, that's your max term and you can stop checking.
> > > > > >> >> >>
> > > > > >> >> >> BTW, in 5.0, the codec API for PostingsFormat has improved, so that you can e.g. pull your own TermsEnum and iterate the terms yourself.
> > > > > >> >> >>
> > > > > >> >> >> Mike McCandless
> > > > > >> >> >>
> > > > > >> >> >> http://blog.mikemccandless.com
> > > > > >> >> >>
> > > > > >> >> >> On Thu, Feb 6, 2014 at 5:16 AM, Ravikumar Govindarajan <ravikumar.govindara...@gmail.com> wrote:
> > > > > >> >> >> > I use a Codec to flush data. All methods delegate to the actual Lucene42Codec, except for intercepting one single field. This field is indexed as an IntField [numeric trie...], with precisionStep=4.
> > > > > >> >> >> >
> > > > > >> >> >> > The purpose of the Codec is as follows:
> > > > > >> >> >> >
> > > > > >> >> >> > 1. Note the first BytesRef for this field
> > > > > >> >> >> > 2. During the finish() call [TermsConsumer.java], note the last BytesRef for this field
> > > > > >> >> >> > 3. Convert both the first/last BytesRef to their respective integers
> > > > > >> >> >> > 4. Store these 2 ints in segment-info diagnostics
> > > > > >> >> >> >
> > > > > >> >> >> > The problem with this approach is that the first/last BytesRef is totally different from the actual "int" values I try to index. I guess this is because the numeric trie explodes all the integers into its own format of BytesRefs. Hence my Codec stores the wrong values in segment diagnostics.
> > > > > >> >> >> >
> > > > > >> >> >> > Is there a way I can record the actual min/max int-values correctly in my codec and still support NumericRange search?
> > > > > >> >> >> >
> > > > > >> >> >> > -- Ravi
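Following Mike's pointers above, a sketch of recovering the true min/max of an IntField from its trie terms (the int-typed NumericUtils methods mirror the long-typed ones he names; "timestamp" is a placeholder field name, and reader is an AtomicReader assumed in scope):

    import org.apache.lucene.index.Terms;
    import org.apache.lucene.index.TermsEnum;
    import org.apache.lucene.util.BytesRef;
    import org.apache.lucene.util.NumericUtils;

    // The shift==0 trie terms are exactly the original int values, in sorted
    // order: the first one is the min, the last one before the higher-level
    // (shift > 0) terms is the max.
    Terms terms = reader.terms("timestamp");
    int min = 0, max = 0;
    if (terms != null) {
      TermsEnum te = terms.iterator(null);
      boolean first = true;
      for (BytesRef term = te.next(); term != null; term = te.next()) {
        if (NumericUtils.getPrefixCodedIntShift(term) > 0) {
          break; // higher-level trie terms start here; real values are done
        }
        int value = NumericUtils.prefixCodedToInt(term);
        if (first) {
          min = value;
          first = false;
        }
        max = value;
      }
    }

Because this only decodes the shift==0 terms and changes nothing about what is indexed, NumericRangeQuery keeps working as before; the two decoded ints are what the codec would write into the segment diagnostics.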