LogByteSizeMergePolicy (and LogDocMergePolicy) will preserve the total order.

Only TieredMergePolicy merges out-of-order segments.

I don't understand why you need to encourage merging of the more
recent (by your "time" field) segments...

Mike McCandless

http://blog.mikemccandless.com


On Fri, Feb 7, 2014 at 8:18 AM, Ravikumar Govindarajan
<ravikumar.govindara...@gmail.com> wrote:
> Mike,
>
> Each of my flushed segments is fully ordered by time. But TieredMergePolicy
> or LogByteSizeMergePolicy is going to pick arbitrary time-segments and
> disturb this arrangement, and I wanted some kind of control over this.
>
> But like you pointed out, going by only time-adjacent merges can be
> disastrous.
>
> Is there a way to mix both time and size to arrive at a somewhat
> [less-than-accurate] global order of segment merges?
>
> Like attempting a time-adjacent merge, provided the sizes of the segments
> are not extremely skewed, etc...
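That gate could be sketched roughly like this (plain Java, with illustrative names and thresholds, not a real MergePolicy subclass): only accept a time-adjacent pair of segments when their size ratio stays under a skew threshold.

```java
import java.util.*;

// Hypothetical sketch: pick time-adjacent merge candidates, but veto a
// merge when the two segment sizes are too skewed. Names and the MAX_SKEW
// value are illustrative only; a real implementation would subclass
// Lucene's MergePolicy and read sizes/timestamps from SegmentInfos.
public class TimeAwareMergeSketch {
    static final double MAX_SKEW = 10.0; // largest allowed size ratio

    public record Seg(String name, long sizeBytes, long minTime) {}

    // Return the first adjacent pair (by time) whose sizes are not skewed.
    public static List<Seg> pickMerge(List<Seg> segs) {
        List<Seg> byTime = new ArrayList<>(segs);
        byTime.sort(Comparator.comparingLong(Seg::minTime));
        for (int i = 0; i + 1 < byTime.size(); i++) {
            Seg a = byTime.get(i), b = byTime.get(i + 1);
            double ratio = (double) Math.max(a.sizeBytes(), b.sizeBytes())
                         / Math.max(1, Math.min(a.sizeBytes(), b.sizeBytes()));
            if (ratio <= MAX_SKEW) {
                return List.of(a, b); // time-adjacent and size-compatible
            }
        }
        return List.of(); // no safe time-adjacent merge found
    }

    public static void main(String[] args) {
        List<Seg> segs = List.of(
            new Seg("_0", 5_000_000_000L, 100), // huge old segment: skipped
            new Seg("_1", 10_000_000L, 200),
            new Seg("_2", 12_000_000L, 300));
        System.out.println(pickMerge(segs)); // picks _1 and _2
    }
}
```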
>
> --
> Ravi
>
> On Fri, Feb 7, 2014 at 4:17 PM, Michael McCandless <
> luc...@mikemccandless.com> wrote:
>
>> You want to focus merging on the segments containing newer documents?
>> Why?  This seems somewhat dangerous...
>>
>> Not taking into account the "true" segment size can lead to very very
>> poor merge decisions ... you should turn on IndexWriter's infoStream
>> and do a long running test to convince yourself the merging is being
>> sane.
>>
>> Mike
>>
>> Mike McCandless
>>
>> http://blog.mikemccandless.com
>>
>>
>> On Thu, Feb 6, 2014 at 11:24 PM, Ravikumar Govindarajan
>> <ravikumar.govindara...@gmail.com> wrote:
>> > Thanks Mike,
>> >
>> > Will try your suggestion. Let me describe the actual use-case itself.
>> >
>> > There is a requirement for merging time-adjacent segments [append-only,
>> > rolling time-series data]
>> >
>> > All documents have a timestamp affixed, and during flush I need to note
>> > down the least timestamp across all documents, through the Codec.
>> >
>> > Then I define a TimeMergePolicy extending LogMergePolicy and set the
>> > segment-size = Long.MAX_VALUE - SEG_LEAST_TIME [from segment-diag].
>> >
>> > LogMergePolicy will auto-arrange levels of segments according to time and
>> > proceed with merges. The latest segments will be smaller in size and
>> > preferred during merges over older, bigger segments.
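The arithmetic behind that trick can be sketched in a few lines (illustrative values only, not the real LogMergePolicy size hook): a newer segment, having a larger least-timestamp, reports a smaller synthetic "size", so a size-driven policy prefers it.

```java
// Sketch of the proposed synthetic segment size: newer segments (larger
// least-timestamp) map to smaller sizes, so a size-driven merge policy
// naturally prefers the most recent segments. Timestamps are illustrative.
public class TimeAsSizeSketch {
    public static long syntheticSize(long segLeastTime) {
        return Long.MAX_VALUE - segLeastTime;
    }

    public static void main(String[] args) {
        long oldSeg = syntheticSize(1_000L);  // flushed earlier
        long newSeg = syntheticSize(9_000L);  // flushed later
        // The newer segment reports the smaller "size"
        System.out.println(newSeg < oldSeg);  // true
    }
}
```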
>> >
>> > Do you think such an approach will be fine, or are there better ways to
>> > solve this?
>> >
>> > --
>> > Ravi
>> >
>> >
>> > On Thu, Feb 6, 2014 at 4:34 PM, Michael McCandless <
>> > luc...@mikemccandless.com> wrote:
>> >
>> >> Somewhere in those numeric trie terms are the exact integers from your
>> >> documents, encoded.
>> >>
>> >> You can use oal.util.NumericUtils.prefixCodedToInt to get the int
>> >> value back from the BytesRef term.
>> >>
>> >> But you need to filter out the "higher level" terms, e.g. using
>> >> NumericUtils.getPrefixCodedIntShift(term) == 0.  Or use
>> >> NumericUtils.filterPrefixCodedInts to wrap a TermsEnum.  I believe
>> >> all the terms you want come first, so once you hit a term where
>> >> .getPrefixCodedIntShift is > 0, the term just before it was your max,
>> >> and you can stop checking.
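That scan can be sketched in a simplified, self-contained form (this simulates the shift-prefixed term ordering with plain int pairs; the real terms are BytesRefs decoded via NumericUtils): keep the first shift-0 term as the min, and the last shift-0 term seen before the shift goes above 0 as the max.

```java
import java.util.*;

// Simplified simulation of scanning trie-encoded terms. Each "term" here is
// an int[] {shift, value}; real Lucene terms are BytesRefs whose leading
// byte encodes the shift. Shift-0 terms carry the exact values and sort
// first, so the scan can stop at the first term with shift > 0.
public class MinMaxScanSketch {
    // Returns {min, max} of the exact (shift == 0) values.
    public static int[] scan(List<int[]> sortedTerms) {
        Integer min = null, max = null;
        for (int[] term : sortedTerms) {
            if (term[0] > 0) break;          // left the shift-0 block: done
            if (min == null) min = term[1];  // first exact term is the min
            max = term[1];                   // last exact term seen so far
        }
        return new int[] { min, max };
    }

    public static void main(String[] args) {
        // shift-0 (exact) terms first, then higher-level (shift > 0) terms
        List<int[]> terms = List.of(
            new int[] {0, 3}, new int[] {0, 17}, new int[] {0, 42},
            new int[] {4, 2}, new int[] {8, 0});
        System.out.println(Arrays.toString(scan(terms))); // [3, 42]
    }
}
```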
>> >>
>> >> BTW, in 5.0, the codec API for PostingsFormat has improved, so that
>> >> you can e.g. pull your own TermsEnum and iterate the terms yourself.
>> >>
>> >> Mike McCandless
>> >>
>> >> http://blog.mikemccandless.com
>> >>
>> >>
>> >> On Thu, Feb 6, 2014 at 5:16 AM, Ravikumar Govindarajan
>> >> <ravikumar.govindara...@gmail.com> wrote:
>> >> > I use a Codec to flush data. All methods delegate to the actual
>> >> > Lucene42Codec, except for intercepting one single field. This field is
>> >> > indexed as an IntField [Numeric-Trie...], with precisionStep=4.
>> >> >
>> >> > The purpose of the Codec is as follows
>> >> >
>> >> > 1. Note the first BytesRef for this field
>> >> > 2. During the finish() call [TermsConsumer.java], note the last
>> >> > BytesRef for this field
>> >> > 3. Convert both the first/last BytesRef to their respective integers
>> >> > 4. Store these 2 ints in the segment-info diagnostics
>> >> >
>> >> > The problem with this approach is that the first/last BytesRef is
>> >> > totally different from the actual "int" values I try to index. I guess
>> >> > this is because Numeric-Trie explodes all the integers into its own
>> >> > format of BytesRefs. Hence my Codec stores the wrong values in the
>> >> > segment-diagnostics.
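A quick self-contained illustration of that explosion (it only counts the shift levels; it does not reproduce Lucene's exact byte encoding): with precisionStep=4, each 32-bit int is indexed as one term per shift level, so the field's raw first/last terms are trie prefixes rather than the original values.

```java
import java.util.*;

// Illustrative only: enumerate the shift levels a numeric-trie IntField
// produces for one value. Lucene indexes one term per shift (0, step,
// 2*step, ...), so a single int with precisionStep=4 becomes 8 terms, and
// the lexicographically first/last BytesRefs of the field need not match
// the original min/max int values.
public class TrieTermCountSketch {
    public static List<Integer> shiftLevels(int precisionStep) {
        List<Integer> shifts = new ArrayList<>();
        for (int shift = 0; shift < 32; shift += precisionStep) {
            shifts.add(shift); // one indexed term per shift level
        }
        return shifts;
    }

    public static void main(String[] args) {
        System.out.println(shiftLevels(4)); // [0, 4, 8, 12, 16, 20, 24, 28]
    }
}
```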
>> >> >
>> >> > Is there a way I can record the actual min/max int-values correctly in
>> >> > my codec and still support NumericRange search?
>> >> >
>> >> > --
>> >> > Ravi
>> >>

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org
