Some of the Streams integration tests also encounter this issue: the
timestamps we use in the tests are very small (e.g. 1, 2, 3...), which
causes the log to roll upon each append, and the old segment gets deleted
very soon. Arguably this can be resolved by enforcing the LogAppendTime
configuration on the embedded server.

+1 on the proposed change, makes sense to me.

Guozhang


On Tue, Aug 30, 2016 at 4:33 PM, Becket Qin <becket....@gmail.com> wrote:

> Hi folks,
>
> Here is another update on the change of time based log rolling.
>
> After the latest implementation, we encountered KAFKA-4099. The issue is
> that when users move replicas, the new replica will create one log segment
> for each message in the old segments. The root cause is that we compare
> the wall clock time with the message timestamp. A solution, also described
> in KAFKA-4099, is to base log rolling purely on the timestamps in the
> messages. More specifically, we roll the log segment if the timestamp of
> the current message is greater than the timestamp of the first message in
> the segment by more than log.roll.ms. This approach is wall clock
> independent and should solve the problem. With the
> message.timestamp.difference.max.ms configuration, we achieve 1) the log
> segment is rolled within a bounded time, and 2) no excessively large
> timestamp is accepted that would cause frequent log rolling.
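The rolling rule described above can be sketched as follows; the class and method names are illustrative, not the actual Kafka internals:

```java
// Sketch of the KAFKA-4099 rolling rule: roll the active segment when the
// incoming message's timestamp exceeds the timestamp of the segment's first
// message by more than log.roll.ms. The wall clock is never consulted, so
// replaying old messages (e.g. during a replica move) does not trigger a
// roll per message. Names are illustrative, not the actual Kafka code.
public class TimeBasedRoll {
    static final long LOG_ROLL_MS = 7L * 24 * 60 * 60 * 1000; // default: 7 days

    static boolean shouldRoll(long firstTimestampInSegment, long messageTimestamp) {
        return messageTimestamp - firstTimestampInSegment > LOG_ROLL_MS;
    }

    public static void main(String[] args) {
        // Old messages keep their original timestamps, so the difference to
        // the first message in the segment stays small: no roll.
        System.out.println(shouldRoll(1000L, 2000L)); // false
        // A message more than log.roll.ms newer than the first one: roll.
        System.out.println(shouldRoll(1000L, 1000L + LOG_ROLL_MS + 1)); // true
    }
}
```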
>
> Any concern regarding this change?
>
> Thanks,
>
> Jiangjie (Becket) Qin
>
> On Mon, Jun 13, 2016 at 2:30 PM, Guozhang Wang <wangg...@gmail.com> wrote:
>
> > Thanks Jiangjie,
> >
> > I see the need for purging sensitive data; the above proposed change
> > LGTM. One minor concern is that a wrongly marked timestamp on the first
> > record could cause the segment to roll much later / earlier, though that
> > may be rare.
> >
> > Guozhang
> >
> > On Fri, Jun 10, 2016 at 10:07 AM, Becket Qin <becket....@gmail.com>
> wrote:
> >
> > > Hi,
> > >
> > > During the implementation of KIP-33, we found it might be useful to
> > > have a more deterministic time-based log rolling than what was
> > > proposed in the KIP.
> > >
> > > The current KIP proposal uses the largest timestamp in the segment for
> > > time-based rolling. The active log segment only rolls when no message
> > > has been appended within max.roll.ms of the largest timestamp in the
> > > segment, i.e. the rolling time may keep moving if a user keeps
> > > appending messages to the segment. This may not be a desirable
> > > behavior for people who have sensitive data and want to make sure it
> > > is removed after some time.
> > >
> > > To solve the above issue, we want to modify the KIP proposal regarding
> > > time-based rolling to the following behavior. The time-based log
> > > rolling will be based on the first message with a timestamp in the log
> > > segment, if there is such a message. If no message in the segment has
> > > a timestamp, the time-based log rolling will still be based on the log
> > > segment create time, which is the same as we are doing now. The
> > > reasons we don't want to always roll based on file create time are
> > > that 1) the message timestamp may be assigned by clients and can
> > > differ from the create time of the log segment file, and 2) on some
> > > Linux systems the file create time is not available, so using the
> > > segment file create time may not always work.
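A minimal sketch of this proposal, under the assumption that the elapsed time is measured against the chosen base time (the names, signature, and sentinel value are hypothetical, not the actual Kafka implementation):

```java
// Sketch of the proposed rolling behavior: use the timestamp of the first
// timestamped message in the segment as the base time, falling back to the
// segment file create time when no message carries a timestamp. Names,
// signature, and the NO_TIMESTAMP sentinel are hypothetical.
public class RollBaseTime {
    static final long NO_TIMESTAMP = -1L;

    static boolean shouldRoll(long nowMs, long segmentCreateTimeMs,
                              long firstMessageTimestamp, long rollMs) {
        long base = (firstMessageTimestamp == NO_TIMESTAMP)
                ? segmentCreateTimeMs    // no timestamped message: old behavior
                : firstMessageTimestamp; // client-assigned timestamps respected
        return nowMs - base > rollMs;
    }
}
```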
> > >
> > > Do people have any concern for this change? I will update the KIP if
> > people
> > > think the change is OK.
> > >
> > > Thanks,
> > >
> > > Jiangjie (Becket) Qin
> > >
> > > On Tue, Apr 19, 2016 at 6:27 PM, Becket Qin <becket....@gmail.com>
> > wrote:
> > >
> > > > Thanks Joel and Ismael. I just updated the KIP based on your
> feedback.
> > > >
> > > > KIP-33 has passed with +4 (binding) and +2 (non-binding)
> > > >
> > > > Thanks everyone for the reading, feedback and voting!
> > > >
> > > > Jiangjie (Becket) Qin
> > > >
> > > > On Tue, Apr 19, 2016 at 5:25 PM, Ismael Juma <ism...@juma.me.uk>
> > wrote:
> > > >
> > > >> Thanks Becket. I think it would be nice to update the KIP with
> > > >> regard to points 3 and 4.
> > > >>
> > > >> In any case, +1 (non-binding)
> > > >>
> > > >> Ismael
> > > >>
> > > >> On Tue, Apr 19, 2016 at 2:03 AM, Becket Qin <becket....@gmail.com>
> > > wrote:
> > > >>
> > > >> > Thanks for the comments Ismael. Please see the replies inline.
> > > >> >
> > > >> > On Mon, Apr 18, 2016 at 6:50 AM, Ismael Juma <ism...@juma.me.uk>
> > > wrote:
> > > >> >
> > > >> > > Hi Jiangjie,
> > > >> > >
> > > >> > > Thanks for the KIP, it's a nice improvement. Since it seems like
> > we
> > > >> have
> > > >> > > been using the voting thread for discussion, I'll do the same.
> > > >> > >
> > > >> > > A few minor comments/questions:
> > > >> > >
> > > >> > > 1. The proposed name for the time index file is
> > > >> > > `SegmentBaseOffset.timeindex`. Would
> > > >> > > `SegmentBaseOffset.time-index` be a little better? It would
> > > >> > > clearly separate the type of index in case we add additional
> > > >> > > index types in the future.
> > > >> > >
> > > >> > I have no strong opinion on this; I did not add any separator
> > > >> > because the current name is more regex friendly. I am not sure
> > > >> > about other index types; time and space seem to be the two most
> > > >> > common dimensions.
> > > >> >
> > > >> > > 2. When describing the time index entry, we say "Offset - the
> > > >> > > next offset when the time index entry is inserted". I found the
> > > >> > > mention of `next` a bit confusing as it looks to me like the
> > > >> > > time index entry has the first offset in the message set.
> > > >> >
> > > >> > The semantics here are a little different from the offset index.
> > > >> > The offset index information is self-contained by nature, i.e. all
> > > >> > the offsets before it are smaller than the offset of this message
> > > >> > set, so we only need to say "the offset of this message set is
> > > >> > OFFSET". This does not quite apply to the time index, because the
> > > >> > max timestamp may or may not be in the message set being appended.
> > > >> > So we have to either say "the max timestamp before I append this
> > > >> > message set is T", or "the max timestamp after I appended this
> > > >> > message set is T". The former means that if we are looking for a
> > > >> > timestamp > T we can skip all the previous messages and start from
> > > >> > this offset. The latter means that if we are searching for a
> > > >> > timestamp > T we should start after this message set, which is
> > > >> > essentially the same as the former case but requires an additional
> > > >> > interpretation.
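To make the "max timestamp before this offset" reading concrete, a lookup over such entries could work roughly like this (illustrative types only, not Kafka's on-disk index format):

```java
import java.util.List;

// Sketch of a lookup over time index entries. Each entry (T, offset) is
// read as "the max timestamp before this offset is T", so for a target
// timestamp greater than T the search can skip everything before that
// offset. Illustrative types, not Kafka's actual index format.
public class TimeIndexLookup {
    record Entry(long maxTimestampBefore, long offset) {}

    // Returns the offset from which a finer, message-by-message scan for
    // targetTimestamp should start. Entries within one index file are
    // sorted by timestamp.
    static long searchStartOffset(List<Entry> index, long targetTimestamp) {
        long start = 0L; // no entry lets us skip anything: scan from segment start
        for (Entry e : index) {
            if (targetTimestamp > e.maxTimestampBefore())
                start = e.offset(); // everything before e.offset() is <= T < target
            else
                break;
        }
        return start;
    }

    public static void main(String[] args) {
        List<Entry> index = List.of(new Entry(10L, 100L), new Entry(25L, 250L));
        System.out.println(searchStartOffset(index, 30L)); // 250
        System.out.println(searchStartOffset(index, 5L));  // 0
    }
}
```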
> > > >> >
> > > >> > > 3. We say "The default initial / max size of the time index
> > > >> > > files is the same as the offset index files. (time index entry
> > > >> > > is 1.5x the size of an offset index entry, so users should set
> > > >> > > the configuration accordingly)". It may be worth elaborating a
> > > >> > > little on what a user should do with regard to this
> > > >> > > configuration when upgrading (i.e. maybe under "Compatibility,
> > > >> > > Deprecation, and Migration Plan").
> > > >> > >
> > > >> > Makes sense.
> > > >> >
> > > >> >
> > > >> > > 4. In a previous vote thread, Jun said "The simplest thing is
> > > >> > > probably to change the default index size to 2MB to match the
> > > >> > > default log segment size" and you seemed to agree. I couldn't
> > > >> > > find anything about this in the KIP. Are we still doing it?
> > > >> > >
> > > >> > Yes, we can still make that change for the default settings. Users
> > > >> > might want to set the index size a little larger if they have
> > > >> > customized it, but in reality it should not cause problems other
> > > >> > than rolling a few more log segments.
> > > >> >
> > > >> > > 5. We say "Instead, it is only monotonically increasing within
> > > >> > > each time index file. i.e. It is possible that the time index
> > > >> > > file for a later log segment contains a smaller timestamp than
> > > >> > > some timestamp in the time index file of an earlier segment.". I
> > > >> > > think it would be good to explain under which scenario a time
> > > >> > > index file for a later log segment contains a smaller timestamp
> > > >> > > (is this only when CreateTime is used?).
> > > >> > >
> > > >> > Yes, it only happens when CreateTime is used.
> > > >> >
> > > >> >
> > > >> > > 6. We say "When searching by timestamp, broker will start from
> > > >> > > the earliest log segment and check the last time index entry.".
> > > >> > > The existing logic searches from the newest segment backwards.
> > > >> > > Is there a reason why we are changing it?
> > > >> > >
> > > >> > Suppose segment 0 has max timestamp 100, segment 1 has max
> > > >> > timestamp 50 and segment 3 has max timestamp 90, and a user wants
> > > >> > to search for timestamp 80. If we search backwards, we have to
> > > >> > look at all the segments. If we search forward, we will stop at
> > > >> > the first segment whose max timestamp is greater than 80 (i.e. all
> > > >> > the previous segments have smaller timestamps) and start the finer
> > > >> > search on that segment.
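The forward search in this example can be sketched as follows (hypothetical helper, not the actual Kafka code):

```java
// Sketch of the forward search over segments: stop at the first segment
// whose max timestamp is >= the target, since all earlier segments are then
// guaranteed to contain only smaller timestamps. Hypothetical helper, not
// the actual Kafka implementation.
public class SegmentSearch {
    static int findSegment(long[] segmentMaxTimestamps, long target) {
        for (int i = 0; i < segmentMaxTimestamps.length; i++) {
            if (segmentMaxTimestamps[i] >= target)
                return i; // the finer search continues inside this segment
        }
        return -1; // target is newer than anything in the log
    }

    public static void main(String[] args) {
        long[] maxTimestamps = {100L, 50L, 90L}; // the example from this mail
        // Searching for 80 stops at the very first segment (max 100 >= 80);
        // a backward search would have had to inspect every segment.
        System.out.println(findSegment(maxTimestamps, 80L)); // 0
    }
}
```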
> > > >> >
> > > >> >
> > > >> > > 7. Do you mind if I fix typos and minor grammar issues directly
> > > >> > > in the wiki? It seems easier than doing that via email.
> > > >> > >
> > > >> > Not at all, thanks for the help.
> > > >> >
> > > >> >
> > > >> > > Thanks,
> > > >> > > Ismael
> > > >> > >
> > > >> > > On Thu, Apr 7, 2016 at 1:44 AM, Becket Qin <
> becket....@gmail.com>
> > > >> wrote:
> > > >> > >
> > > >> > > > Hi all,
> > > >> > > >
> > > >> > > > I updated KIP-33 based on the initial implementation. Per
> > > >> discussion on
> > > >> > > > yesterday's KIP hangout, I would like to initiate the new vote
> > > >> thread
> > > >> > for
> > > >> > > > KIP-33.
> > > >> > > >
> > > >> > > > The KIP wiki:
> > > >> > > >
> > > >> > > > https://cwiki.apache.org/confluence/display/KAFKA/KIP-33+-+Add+a+time+based+log+index
> > > >> > > >
> > > >> > > > Here is a brief summary of the KIP:
> > > >> > > > 1. We propose to add a time index for each log segment.
> > > >> > > > 2. The time indices are going to be used for log retention,
> > > >> > > > log rolling and message search by timestamp.
> > > >> > > >
> > > >> > > > There was an old voting thread which has some discussion on
> > > >> > > > this KIP. The mail thread link is the following:
> > > >> > > >
> > > >> > > > http://mail-archives.apache.org/mod_mbox/kafka-dev/201602.mbox/%3ccabtagwgoebukyapfpchmycjk2tepq3ngtuwnhtr2tjvsnc8...@mail.gmail.com%3E
> > > >> > > >
> > > >> > > > I have the following WIP patch for reference. It needs a few
> > > >> > > > more unit tests and documentation. Other than that it should
> > > >> > > > run fine.
> > > >> > > >
> > > >> > > > https://github.com/becketqin/kafka/commit/712357a3fbf1423e05f9eed7d2fed5b6fe6c37b7
> > > >> > > >
> > > >> > > > Thanks,
> > > >> > > >
> > > >> > > > Jiangjie (Becket) Qin
> > > >> > > >
> > > >> > >
> > > >> >
> > > >>
> > > >
> > > >
> > >
> >
> >
> >
> > --
> > -- Guozhang
> >
>



-- 
-- Guozhang
