Hi, Becket,

I think your proposal to build the time-based index based only on
index.interval.bytes is reasonable. Is there a particular need to also add
time.index.interval.bytes?
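
To make the comparison concrete, here is a minimal Python sketch of the
insertion rule described in the quoted messages below: a time index entry is
added only when at least index.interval.bytes have been appended and at least
time.index.interval.ms has passed since the last entry. The class and its
fields are hypothetical, not Kafka code, and the (timestamp, offset) entry
layout is an assumption.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class TimeIndexSketch:
    """Hypothetical model of the insertion rule, not Kafka's implementation."""
    index_interval_bytes: int    # stands in for index.interval.bytes
    time_index_interval_ms: int  # stands in for time.index.interval.ms
    entries: List[Tuple[int, int]] = field(default_factory=list)  # assumed (timestamp, offset)
    bytes_since_last_entry: int = 0

    def maybe_append(self, timestamp_ms: int, offset: int, message_bytes: int) -> bool:
        """Account for one message; add an index entry only if both limits are met."""
        self.bytes_since_last_entry += message_bytes
        last_ts = self.entries[-1][0] if self.entries else None
        enough_bytes = self.bytes_since_last_entry >= self.index_interval_bytes
        enough_time = (last_ts is None
                       or timestamp_ms - last_ts >= self.time_index_interval_ms)
        if enough_bytes and enough_time:
            self.entries.append((timestamp_ms, offset))
            self.bytes_since_last_entry = 0
            return True
        return False
```

Setting time_index_interval_ms to 0 approximates the bytes-only variant, as
long as timestamps do not go backwards.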

Computing the pre-allocated index file size from the log segment file size
can be useful. However, the tricky part is that the log segment size can be
changed dynamically. Also, mmap'ed files don't use heap space, only
virtual memory, which is paged in on demand. So I am not sure that
memory space is a big concern there. The simplest thing is probably to
change the default index size to 2MB to match the default log segment size.
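
For reference, the sizing rule proposed in the quoted message below can be
sketched as follows. This is only an illustration in Python: the function and
parameter names are hypothetical, and the 8-byte entry size is taken from the
calculation in the quoted message.

```python
OFFSET_INDEX_ENTRY_BYTES = 8  # entry size used in the quoted calculation


def initial_index_bytes(segment_bytes: int,
                        index_interval_bytes: int,
                        max_index_bytes: int) -> int:
    """Upper bound on offset index size, capped at the configured maximum."""
    # At most one index entry per index_interval_bytes of log data.
    max_entries = segment_bytes // index_interval_bytes
    needed = max_entries * OFFSET_INDEX_ENTRY_BYTES
    return min(needed, max_index_bytes)


# Defaults mentioned in the thread: 1GB segment, 4K interval, 10MB index cap.
default_size = initial_index_bytes(1024 ** 3, 4096, 10 * 1024 * 1024)
```

With those defaults the result is 2MB (2097152 bytes), which is where the 2MB
figure above comes from.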

A couple of other things to think through.

1. Currently, LogSegment supports truncating to an offset. How do we do
that on a time-based index?

2. Since it's possible to have an empty time-based index (if all message
timestamps are smaller than the largest timestamp in the previous segment), we
need to figure out what timestamp to use for retaining such a log segment. In
the extreme case, after we delete an old log segment, all of the new log
segments may have an empty time-based index. In that case, how do we avoid
losing track of the latest timestamp?
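
On point 1, one possible approach, offered only as a sketch and not as a
settled design: if each time index entry is assumed to be a
(timestamp, offset) pair appended in offset order, then truncating to an
offset could drop every entry whose offset is at or beyond the target. The
function name and entry layout below are assumptions for illustration.

```python
from bisect import bisect_left
from typing import List, Tuple


def truncate_time_index(entries: List[Tuple[int, int]],
                        target_offset: int) -> List[Tuple[int, int]]:
    """Keep only entries whose offset is strictly below target_offset.

    Assumes entries are (timestamp_ms, offset) pairs sorted by offset.
    """
    offsets = [offset for _, offset in entries]
    keep = bisect_left(offsets, target_offset)  # index of first offset >= target
    return entries[:keep]
```

This sketch does not address point 2; an index left empty by truncation would
still need some other source for the segment's latest timestamp.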

Thanks,

Jun

On Sun, Feb 28, 2016 at 3:26 PM, Becket Qin <becket....@gmail.com> wrote:

> Hi Guozhang,
>
> The size of the memory-mapped index file was also our concern. That is
> why we are suggesting minute-level time indexing instead of second-level.
> Here are a few thoughts on the extra memory cost of the time index.
>
> 1. Currently all the index files are loaded as memory mapped files. Notice
> that only the index of the active segment is of the default size 10MB.
> Typically the indexes of the old segments are much smaller than 10MB. So if
> we use the same initial size for time index files, the total amount of
> memory won't be doubled, but the memory cost of active segments will be
> doubled. (However, the 10MB value itself seems problematic, see later
> reasoning).
>
> 2. It is likely that the time index is much smaller than the offset index
> because users would adjust time.index.interval.ms depending on the topic
> volume, i.e., for a low-volume topic time.index.interval.ms would be much
> longer so that we avoid inserting one time index entry for each message
> in the extreme case.
>
> 3. To further guard against unnecessarily frequent insertion of time
> index entries, we use index.interval.bytes as a restriction on time
> index entries as well, so that even for a newly created topic with the
> default time.index.interval.ms we don't need to worry about overly
> aggressive time index entry insertion.
>
> Considering the above, the overall memory cost of the time index should be
> much smaller than that of the offset index. However, as you pointed out,
> (1) might still be an issue. I am actually not sure why we always
> allocate 10 MB for the index file. This itself looks like a problem, given
> we actually have a pretty good way to know the upper bound of the memory
> taken by an offset index.
>
> Theoretically, the offset index file will have at most (log.segment.bytes /
> index.interval.bytes) entries. In our default configuration,
> log.segment.bytes=1GB and index.interval.bytes=4K. This means we only need
> (1GB/4K)*8 Bytes = 2MB. Allocating 10 MB is really a big waste of memory.
>
> I suggest we do the following:
> 1. When creating the log index file, we always allocate memory using the
> above calculation.
> 2. If the memory calculated in (1) is greater than segment.index.bytes, we
> use segment.index.bytes instead. Otherwise we simply use the result in (1).
>
> If we do this I believe the memory for index file will probably be smaller
> even if we have the time index added. I will create a separate ticket for
> the index file initial size.
>
> Thanks,
>
> Jiangjie (Becket) Qin
>
> On Thu, Feb 25, 2016 at 3:30 PM, Guozhang Wang <wangg...@gmail.com> wrote:
>
> > Jiangjie,
> >
> > I was originally only thinking about the "time.index.size.max.bytes" config
> > in addition to the "offset.index.size.max.bytes". Since the latter's
> > default size is 10MB, and for a memory-mapped file we will allocate that
> > much memory at the start, this could put pressure on RAM if we double
> > it.
> >
> > Guozhang
> >
> > On Wed, Feb 24, 2016 at 4:56 PM, Becket Qin <becket....@gmail.com> wrote:
> >
> > > Hi Guozhang,
> > >
> > > I thought about this again and it seems we still need the
> > > time.index.interval.ms configuration to avoid unnecessarily frequent
> > > time index insertion.
> > >
> > > I just updated the wiki to add index.interval.bytes as an additional
> > > constraint on time index entry insertion. Another slight change is
> > > that as long as a message timestamp shows that time.index.interval.ms
> > > has passed since the timestamp of the last time index entry, we will
> > > insert another time index entry. Previously we always inserted time
> > > index entries at time.index.interval.ms bucket boundaries.
> > >
> > > Thanks,
> > >
> > > Jiangjie (Becket) Qin
> > >
> > > On Wed, Feb 24, 2016 at 2:40 PM, Becket Qin <becket....@gmail.com> wrote:
> > >
> > > > Thanks for the comment Guozhang,
> > > >
> > > > I just changed the configuration name to "time.index.interval.ms".
> > > >
> > > > It seems the real question here is how big the offset indices will be.
> > > > Theoretically we can have one time index entry for each message in a
> > > > log segment. For example, if one event is appended per minute, we
> > > > might have a time index entry for each message until the segment size
> > > > is reached. In that case, the number of entries in the time index
> > > > would be (segment size / avg message size). So the time index file can
> > > > potentially be big.
> > > >
> > > > I am wondering if we can simply reuse the "index.interval.bytes"
> > > > configuration instead of having a separate time.index.interval.ms,
> > > > i.e., instead of inserting a new entry based on a time interval, we
> > > > still insert it based on a bytes interval. This does not affect the
> > > > granularity because we can search from the nearest index entry to find
> > > > the message with the correct timestamp. The good thing is that this
> > > > guarantees there will not be huge time indices. We also avoid adding a
> > > > new configuration.
> > > >
> > > > What do you think?
> > > >
> > > > Thanks,
> > > >
> > > > Jiangjie (Becket) Qin
> > > >
> > > > On Wed, Feb 24, 2016 at 1:00 PM, Guozhang Wang <wangg...@gmail.com> wrote:
> > > >
> > > >> Thanks Jiangjie, a few comments on the wiki:
> > > >>
> > > >> 1. Config name "time.index.interval" to "time.index.interval.ms" to be
> > > >> consistent. Also do we need a "time.index.size.max.bytes" as well?
> > > >>
> > > >> 2. Will the memory mapped index file for timestamp have the same
> > > >> default initial / max size (10485760) as the offset index?
> > > >>
> > > >> Otherwise LGTM.
> > > >>
> > > >> Guozhang
> > > >>
> > > >> On Tue, Feb 23, 2016 at 5:05 PM, Becket Qin <becket....@gmail.com> wrote:
> > > >>
> > > >> > Bump.
> > > >> >
> > > >> > Per Jun's comments during the KIP hangout, I have updated the wiki
> > > >> > with the upgrade plan for KIP-33.
> > > >> >
> > > >> > Let's vote!
> > > >> >
> > > >> > Thanks,
> > > >> >
> > > >> > Jiangjie (Becket) Qin
> > > >> >
> > > >> > On Wed, Feb 3, 2016 at 10:32 AM, Becket Qin <becket....@gmail.com> wrote:
> > > >> >
> > > >> > > Hi all,
> > > >> > >
> > > >> > > I would like to initiate the vote for KIP-33.
> > > >> > >
> > > >> > > https://cwiki.apache.org/confluence/display/KAFKA/KIP-33+-+Add+a+time+based+log+index
> > > >> > >
> > > >> > > A good amount of the KIP has been touched during the discussion on
> > > >> > > KIP-32. So I also put the link to KIP-32 here for reference.
> > > >> > > https://cwiki.apache.org/confluence/display/KAFKA/KIP-32+-+Add+timestamps+to+Kafka+message
> > > >> > >
> > > >> > > Thanks,
> > > >> > >
> > > >> > > Jiangjie (Becket) Qin
> > > >> > >
> > > >> >
> > > >>
> > > >>
> > > >>
> > > >> --
> > > >> -- Guozhang
> > > >>
> > > >
> > > >
> > >
> >
> >
> >
> > --
> > -- Guozhang
> >
>
