Hi Jun,

I think index.interval.bytes is used to control the density of the offset
index. The counterpart of index.interval.bytes for the time index is
time.index.interval.ms. If we had not changed the semantics of log.roll.ms,
log.roll.ms/time.index.interval.ms and
log.segment.bytes/index.interval.bytes would be a perfect mapping from bytes
to time. However, because we changed the behavior of log.roll.ms, we need to
guard against a potentially excessively large time index. We can either
reuse index.interval.bytes or introduce time.index.interval.bytes, but I
cannot think of any use for time.index.interval.bytes other than
limiting the time index size.
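
To illustrate how reusing index.interval.bytes would bound the time index,
here is a rough Python sketch of the insertion rule as I understand it from
the wiki update. It is purely illustrative and not the broker code: the class
and method names are made up, and exempting the very first entry from the
time check is my own assumption.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class TimeIndexSketch:
    """Hypothetical in-memory stand-in for a segment's time index."""
    interval_ms: int     # plays the role of time.index.interval.ms
    interval_bytes: int  # plays the role of index.interval.bytes
    entries: List[Tuple[int, int]] = field(default_factory=list)  # (timestamp, offset)
    last_ts: Optional[int] = None
    bytes_since_last: int = 0

    def maybe_append(self, timestamp: int, offset: int, size: int) -> bool:
        """Record one appended message; return True if an index entry was added."""
        self.bytes_since_last += size
        # Assumption: the first entry is exempt from the time check.
        time_ok = self.last_ts is None or timestamp - self.last_ts >= self.interval_ms
        bytes_ok = self.bytes_since_last >= self.interval_bytes
        if time_ok and bytes_ok:
            self.entries.append((timestamp, offset))
            self.last_ts = timestamp
            self.bytes_since_last = 0
            return True
        return False


# One 1000-byte message per minute, with a 4096-byte byte interval.
idx = TimeIndexSketch(interval_ms=60_000, interval_bytes=4096)
for i in range(10):
    idx.maybe_append(timestamp=i * 60_000, offset=i, size=1000)
```

With these numbers only every fifth message gets an entry, even though each
message is a full time interval apart, so the byte constraint is what limits
the index size for a low volume topic.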

I agree that the memory mapped file is probably not a big issue here and we
can change the default index size to 2MB.
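
For reference, the 2MB figure follows from the upper bound discussed earlier
in this thread. A small Python sketch of that arithmetic, assuming 8-byte
index entries as in the earlier calculation (the function names are
hypothetical, not existing broker code):

```python
def offset_index_upper_bound_bytes(segment_bytes, index_interval_bytes, entry_bytes=8):
    """Upper bound on offset index size: at most one entry per index interval."""
    max_entries = segment_bytes // index_interval_bytes
    return max_entries * entry_bytes


def initial_index_size(segment_bytes, index_interval_bytes, max_index_bytes, entry_bytes=8):
    """Use the computed bound, capped by the configured maximum index size."""
    bound = offset_index_upper_bound_bytes(segment_bytes, index_interval_bytes, entry_bytes)
    return min(bound, max_index_bytes)


ONE_GB = 1024 ** 3
FOUR_KB = 4 * 1024
TEN_MB = 10 * 1024 * 1024

bound = offset_index_upper_bound_bytes(ONE_GB, FOUR_KB)   # 2097152 bytes, i.e. 2MB
initial = initial_index_size(ONE_GB, FOUR_KB, TEN_MB)     # capped value, still 2MB
```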

For the two cases you mentioned:
1. Because the message offset in the time index is also monotonically
increasing, truncating should be straightforward, i.e., keep only the
entries that point to offsets earlier than the truncation offset.
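
A minimal Python sketch of that truncation, assuming the time index is held
as a list of (timestamp, offset) pairs with increasing offsets (the function
name and the in-memory layout are hypothetical):

```python
import bisect


def truncate_time_index(entries, truncate_to_offset):
    """Keep only entries whose offset is strictly below truncate_to_offset.

    entries: list of (timestamp, offset) pairs with increasing offsets.
    """
    offsets = [offset for _, offset in entries]
    # Index of the first entry with offset >= truncate_to_offset.
    keep = bisect.bisect_left(offsets, truncate_to_offset)
    return entries[:keep]


entries = [(1000, 0), (2000, 50), (3000, 120)]
kept = truncate_time_index(entries, 120)
```

Because the offsets are sorted, a binary search is enough to find the cut
point; no timestamp comparison is needed.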

2. The current assumption is that if the time index of a segment is empty
and there is no previous time index entry, we will assume that the segment
should be removed, because all the older segments with even larger
timestamps have been removed. So in the case you mentioned, during startup
we will remove all the segments and roll out a new empty segment.

Thanks,

Jiangjie (Becket) Qin



On Mon, Feb 29, 2016 at 6:09 PM, Jun Rao <j...@confluent.io> wrote:

> Hi, Becket,
>
> I thought that your proposal to build time-based index just based off
> index.interval.bytes is reasonable. Is there a particular need to also add
> time.index.interval.bytes?
>
> Computing the pre-allocated index file size based on the log segment file
> size can be useful. However, the tricky thing is that the log segment size
> can be changed dynamically. Also, mmap files don't use heap space, just
> virtual memory, which will be paged in on demand. So, I am not sure if
> memory space is a big concern there. The simplest thing is probably to
> change the default index size to 2MB to match the default log segment size.
>
> A couple of other things to think through.
>
> 1. Currently, LogSegment supports truncating to an offset. How do we do
> that on a time-based index?
>
> 2. Since it's possible to have an empty time-based index (if all message
> timestamps are smaller than the largest timestamp in the previous segment),
> we need to figure out what timestamp to use for retaining such a log
> segment. In the extreme case, it can happen that after we delete an old log
> segment, all of the new log segments have an empty time-based index. In
> this case, how do we avoid losing track of the latest timestamp?
>
> Thanks,
>
> Jun
>
> On Sun, Feb 28, 2016 at 3:26 PM, Becket Qin <becket....@gmail.com> wrote:
>
> > Hi Guozhang,
> >
> > The size of the memory mapped index file was our concern as well. That is
> > why we are suggesting minute level time indexing instead of second level.
> > Here are a few thoughts on the extra memory cost of the time index.
> >
> > 1. Currently all the index files are loaded as memory mapped files. Notice
> > that only the index of the active segment is of the default size 10MB.
> > Typically the indexes of the old segments are much smaller than 10MB. So
> > if we use the same initial size for time index files, the total amount of
> > memory won't be doubled, but the memory cost of active segments will be
> > doubled. (However, the 10MB value itself seems problematic, see later
> > reasoning).
> >
> > 2. It is likely that the time index is much smaller than the offset index
> > because users would adjust time.index.interval.ms depending on the topic
> > volume, i.e., for a low volume topic time.index.interval.ms will be much
> > longer so that we can avoid inserting one time index entry for each
> > message in the extreme case.
> >
> > 3. To further guard against unnecessarily frequent insertion of time
> > index entries, we use index.interval.bytes as a restriction for time
> > index entries as well, so that even for a newly created topic with the
> > default time.index.interval.ms we don't need to worry about overly
> > aggressive time index entry insertion.
> >
> > Considering the above, the overall memory cost for the time index should
> > be much smaller than that of the offset index. However, as you pointed
> > out, (1) might still be an issue. I am actually not sure why we always
> > allocate 10 MB for the index file. This itself looks like a problem given
> > we actually have a pretty good way to know the upper bound of memory taken
> > by an offset index.
> >
> > Theoretically, the offset index file will have at most
> > (log.segment.bytes / index.interval.bytes) entries. In our default
> > configuration, log.segment.bytes=1GB and index.interval.bytes=4K. This
> > means we only need (1GB/4K)*8 Bytes = 2MB. Allocating 10 MB is really a
> > big waste of memory.
> >
> > I suggest we do the following:
> > 1. When creating the log index file, we always allocate memory using the
> > above calculation.
> > 2. If the memory calculated in (1) is greater than segment.index.bytes, we
> > use segment.index.bytes instead. Otherwise we simply use the result in (1).
> >
> > If we do this I believe the memory for index file will probably be
> > smaller even if we have the time index added. I will create a separate
> > ticket for the index file initial size.
> >
> > Thanks,
> >
> > Jiangjie (Becket) Qin
> >
> > On Thu, Feb 25, 2016 at 3:30 PM, Guozhang Wang <wangg...@gmail.com> wrote:
> >
> > > Jiangjie,
> > >
> > > I was originally only thinking about the "time.index.size.max.bytes"
> > > config in addition to the "offset.index.size.max.bytes". Since the
> > > latter's default size is 10MB, and for a memory mapped file we will
> > > allocate that much memory at the start, it could put pressure on RAM if
> > > we double it.
> > >
> > > Guozhang
> > >
> > > On Wed, Feb 24, 2016 at 4:56 PM, Becket Qin <becket....@gmail.com> wrote:
> > >
> > > > Hi Guozhang,
> > > >
> > > > I thought about this again and it seems we still need the
> > > > time.index.interval.ms configuration to avoid unnecessarily frequent
> > > > time index insertion.
> > > >
> > > > I just updated the wiki to add index.interval.bytes as an additional
> > > > constraint for time index entry insertion. Another slight change made
> > > > was that as long as a message timestamp shows time.index.interval.ms
> > > > has passed since the timestamp of the last time index entry, we will
> > > > insert another timestamp index entry. Previously we always inserted
> > > > time index entries at time.index.interval.ms bucket boundaries.
> > > >
> > > > Thanks,
> > > >
> > > > Jiangjie (Becket) Qin
> > > >
> > > > On Wed, Feb 24, 2016 at 2:40 PM, Becket Qin <becket....@gmail.com>
> > > > wrote:
> > > >
> > > > > Thanks for the comment Guozhang,
> > > > >
> > > > > I just changed the configuration name to "time.index.interval.ms".
> > > > >
> > > > > It seems the real question here is how big the offset indices will
> > > > > be. Theoretically we can have one time index entry for each message
> > > > > in a log segment. For example, if there is one event per minute
> > > > > appended, we might have to have a time index entry for each message
> > > > > until the segment size is reached. In that case, the number of index
> > > > > entries in the time index would be (segment size / avg message
> > > > > size). So the time index file size can potentially be big.
> > > > >
> > > > > I am wondering if we can simply reuse the "index.interval.bytes"
> > > > > configuration instead of having a separate time index interval ms,
> > > > > i.e., instead of inserting a new entry based on a time interval, we
> > > > > still insert it based on a bytes interval. This does not affect the
> > > > > granularity because we can search from the nearest index entry to
> > > > > find the message with the correct timestamp. The good thing is that
> > > > > this guarantees there will not be huge time indices. We also avoid
> > > > > adding a new configuration.
> > > > >
> > > > > What do you think?
> > > > >
> > > > > Thanks,
> > > > >
> > > > > Jiangjie (Becket) Qin
> > > > >
> > > > > On Wed, Feb 24, 2016 at 1:00 PM, Guozhang Wang <wangg...@gmail.com>
> > > > > wrote:
> > > > >
> > > > >> Thanks Jiangjie, a few comments on the wiki:
> > > > >>
> > > > >> 1. Config name "time.index.interval" to "time.index.interval.ms" to
> > > > >> be consistent. Also do we need a "time.index.size.max.bytes" as
> > > > >> well?
> > > > >>
> > > > >> 2. Will the memory mapped index file for timestamp have the same
> > > > >> default initial / max size (10485760) as the offset index?
> > > > >>
> > > > >> Otherwise LGTM.
> > > > >>
> > > > >> Guozhang
> > > > >>
> > > > >> On Tue, Feb 23, 2016 at 5:05 PM, Becket Qin <becket....@gmail.com>
> > > > >> wrote:
> > > > >>
> > > > >> > Bump.
> > > > >> >
> > > > >> > Per Jun's comments during the KIP hangout, I have updated the wiki
> > > > >> > with the upgrade plan for KIP-33.
> > > > >> >
> > > > >> > Let's vote!
> > > > >> >
> > > > >> > Thanks,
> > > > >> >
> > > > >> > Jiangjie (Becket) Qin
> > > > >> >
> > > > >> > On Wed, Feb 3, 2016 at 10:32 AM, Becket Qin <becket....@gmail.com>
> > > > >> > wrote:
> > > > >> >
> > > > >> > > Hi all,
> > > > >> > >
> > > > >> > > I would like to initiate the vote for KIP-33.
> > > > >> > >
> > > > >> > > https://cwiki.apache.org/confluence/display/KAFKA/KIP-33+-+Add+a+time+based+log+index
> > > > >> > >
> > > > >> > > A good amount of the KIP has been touched during the discussion
> > > > >> > > on KIP-32. So I also put the link to KIP-32 here for reference.
> > > > >> > > https://cwiki.apache.org/confluence/display/KAFKA/KIP-32+-+Add+timestamps+to+Kafka+message
> > > > >> > >
> > > > >> > > Thanks,
> > > > >> > >
> > > > >> > > Jiangjie (Becket) Qin
> > > > >> > >
> > > > >> >
> > > > >>
> > > > >>
> > > > >>
> > > > >> --
> > > > >> -- Guozhang
> > > > >>
> > > > >
> > > > >
> > > >
> > >
> > >
> > >
> > > --
> > > -- Guozhang
> > >
> >
>
