[VOTE] KIP-33 - Add a time based log index

2016-04-06 Thread Becket Qin
Hi all, I updated KIP-33 based on the initial implementation. Per discussion on yesterday's KIP hangout, I would like to initiate the new vote thread for KIP-33. The KIP wiki: https://cwiki.apache.org/confluence/display/KAFKA/KIP-33+-+Add+a+time+based+log+index Here is a brief summary of the KIP

Re: [VOTE] KIP-33 - Add a time based log index

2016-08-30 Thread Becket Qin
Hi folks, Here is another update on the change of time based log rolling. After the latest implementation, we encountered KAFKA-4099. The issue is that if users move replicas, for the messages in the old segments, the new replica will create one log segment for each message. The root cause of thi

Re: [VOTE] KIP-33 - Add a time based log index

2016-08-31 Thread Guozhang Wang
Some of the streams integration tests also encounters this issue where the timestamps we are using in the test are very small (e.g. 1,2,3...) which cause the log to roll upon each append, and the old segment gets deleted very soon. Arguably this can be resolved to enforce LogAppendTime configuratio

Re: [VOTE] KIP-33 - Add a time based log index

2016-08-31 Thread Becket Qin
Thanks for the feedback, Guozhang. I'll update the KIP wiki and submit a patch if no one have concerns on this. On Wed, Aug 31, 2016 at 11:59 AM, Guozhang Wang wrote: > Some of the streams integration tests also encounters this issue where the > timestamps we are using in the test are very small

Re: [VOTE] KIP-33 - Add a time based log index

2016-09-01 Thread Ismael Juma
Sounds good to me too. For completeness, one thing that is mentioned in the JIRA but not in the message below is: "If the first message in the segment does not have a timetamp, we will fall back to use the wall clock time for log rolling". Ismael On Wed, Aug 31, 2016 at 7:59 PM, Guozhang Wang w

Re: [VOTE] KIP-33 - Add a time based log index

2016-04-07 Thread Guozhang Wang
Great job Jiangjie. A few comments: 1. "if an offset index entry is inserted, it will also insert a time index entry" what is the motivation for co-inserting offset index and timestamp index? Is it just for simplicity or are there any other considerations? 2. "Search message by timestamp": you m

Re: [VOTE] KIP-33 - Add a time based log index

2016-04-10 Thread Jun Rao
Hi, Jiangjie, Thanks for the update. Looks good to me overall. Just a few minor comments below. 10. On broker startup, it's not clear to me why we need to scan the log segment to retrieve the largest timestamp since the time index always has an entry for the largest timestamp. Is that only for re

Re: [VOTE] KIP-33 - Add a time based log index

2016-04-11 Thread Becket Qin
Hi Guozhang and Jun, Thanks for the comments. Please see the responses below. Regarding to Guozhang's question #1 and Jun's question #12. I was inserting the time index and offset index entry together mostly for simplicity as Guozhang mentioned. The purpose of using index interval bytes for time

Re: [VOTE] KIP-33 - Add a time based log index

2016-04-13 Thread Becket Qin
Hi Jun and Guozhang, I have updated the KIP wiki to incorporate your comments. Please let me know if you prefer starting another discussion thread for further discussion. Thanks, Jiangjie (Becket) Qin On Mon, Apr 11, 2016 at 12:21 AM, Becket Qin wrote: > Hi Guozhang and Jun, > > Thanks for th

Re: [VOTE] KIP-33 - Add a time based log index

2016-04-14 Thread Jun Rao
Hi, Jiangjie, 11. Rebuilding all missing time indexes will make the upgrade process longer since the log segments pre 0.10.0 don't have the time index. Could we only rebuild the missing indexes after the last flush offset? For other segments missing the time index, we just assume lastTimestamp to

Re: [VOTE] KIP-33 - Add a time based log index

2016-04-14 Thread Becket Qin
Hi Jun, 11. Yes, that sounds a reasonable solution. In the latest patch I am doing the following in order: a. Create an empty time index for a log segment if there isn't one. b. For all non-active log segments, append an entry of (last_modification_time -> next_base_offset) into the index. The tim

Re: [VOTE] KIP-33 - Add a time based log index

2016-04-15 Thread Jun Rao
Hi, Jiangjie, Thanks for the latest update. +1 on the KIP. Jun On Thu, Apr 14, 2016 at 2:36 PM, Becket Qin wrote: > Hi Jun, > > 11. Yes, that sounds a reasonable solution. In the latest patch I am doing > the following in order: > a. Create an empty time index for a log segment if there isn't

Re: [VOTE] KIP-33 - Add a time based log index

2016-04-15 Thread Guozhang Wang
+1 from me. Thanks. On Fri, Apr 15, 2016 at 9:16 AM, Jun Rao wrote: > Hi, Jiangjie, > > Thanks for the latest update. +1 on the KIP. > > Jun > > On Thu, Apr 14, 2016 at 2:36 PM, Becket Qin wrote: > > > Hi Jun, > > > > 11. Yes, that sounds a reasonable solution. In the latest patch I am > doing

Re: [VOTE] KIP-33 - Add a time based log index

2016-04-16 Thread Gwen Shapira
+1 On Fri, Apr 15, 2016 at 9:37 AM, Guozhang Wang wrote: > +1 from me. Thanks. > > On Fri, Apr 15, 2016 at 9:16 AM, Jun Rao wrote: > >> Hi, Jiangjie, >> >> Thanks for the latest update. +1 on the KIP. >> >> Jun >> >> On Thu, Apr 14, 2016 at 2:36 PM, Becket Qin wrote: >> >> > Hi Jun, >> > >> > 1

Re: [VOTE] KIP-33 - Add a time based log index

2016-04-17 Thread Liquan Pei
+1 On Sat, Apr 16, 2016 at 10:25 PM, Gwen Shapira wrote: > +1 > > On Fri, Apr 15, 2016 at 9:37 AM, Guozhang Wang wrote: > > +1 from me. Thanks. > > > > On Fri, Apr 15, 2016 at 9:16 AM, Jun Rao wrote: > > > >> Hi, Jiangjie, > >> > >> Thanks for the latest update. +1 on the KIP. > >> > >> Jun >

Re: [VOTE] KIP-33 - Add a time based log index

2016-04-18 Thread Ismael Juma
Hi Jiangjie, Thanks for the KIP, it's a nice improvement. Since it seems like we have been using the voting thread for discussion, I'll do the same. A few minor comments/questions: 1. The proposed name for the time index file `SegmentBaseOffset.timeindex`. Would `SegmentBaseOffset.time-index` be

Re: [VOTE] KIP-33 - Add a time based log index

2016-04-18 Thread Grant Henke
Thanks for the detailed write up Jiangjie. Overall the proposal looks good to me. I have a few implementation detail questions that don't necessarily need to block progress: When searching by timestamp, broker will start from the earliest log > segment and check the last time index entry. If the t

Re: [VOTE] KIP-33 - Add a time based log index

2016-04-18 Thread Becket Qin
Thanks for the comments Ismael. Please see the replies inline. On Mon, Apr 18, 2016 at 6:50 AM, Ismael Juma wrote: > Hi Jiangjie, > > Thanks for the KIP, it's a nice improvement. Since it seems like we have > been using the voting thread for discussion, I'll do the same. > > A few minor comments

Re: [VOTE] KIP-33 - Add a time based log index

2016-04-18 Thread Becket Qin
Hi Grant, Thanks for the review. Please see the replies inline. On Mon, Apr 18, 2016 at 1:56 PM, Grant Henke wrote: > Thanks for the detailed write up Jiangjie. Overall the proposal looks good > to me. I have a few implementation detail questions that > don't necessarily need to block progress:

Re: [VOTE] KIP-33 - Add a time based log index

2016-04-19 Thread Joel Koshy
Hi Becket, +1 Just a few more minor comments that I don't think should result in any material change to the KIP: - Under "time index format": *A time index entry (timestamp, offset) means that any message whose timestamp is greater than timestamp come after offset. * - I think it may

Re: [VOTE] KIP-33 - Add a time based log index

2016-04-19 Thread Ismael Juma
Thanks Becket. I think it would be nice to update the KIP with regards to point 3 and 4. In any case, +1 (non-binding) Ismael On Tue, Apr 19, 2016 at 2:03 AM, Becket Qin wrote: > Thanks for the comments Ismael. Please see the replies inline. > > On Mon, Apr 18, 2016 at 6:50 AM, Ismael Juma wr

Re: [VOTE] KIP-33 - Add a time based log index

2016-04-19 Thread Becket Qin
Thanks Joel and Ismael. I just updated the KIP based on your feedback. KIP-33 has passed with +4 (binding) and +2 (non-binding) Thanks everyone for the reading, feedback and voting! Jiangjie (Becket) Qin On Tue, Apr 19, 2016 at 5:25 PM, Ismael Juma wrote: > Thanks Becket. I think it would be

Re: [VOTE] KIP-33 - Add a time based log index

2016-06-10 Thread Becket Qin
Hi, During the implementation of KIP-33, we found it might be useful to have a more deterministic time based log rolling than what proposed in the KIP. The current KIP proposal uses the largest timestamp in the segment for time based rolling. The active log segment only rolls when there is no mes

Re: [VOTE] KIP-33 - Add a time based log index

2016-06-13 Thread Guozhang Wang
Thanks Jiangjie, I see the need for sensitive data purging, the above proposed change LGTM. One minor concern is that a wrongly marked timestamp on the first record could cause the segment to roll much later / earlier, though it may be rare. Guozhang On Fri, Jun 10, 2016 at 10:07 AM, Becket Qin

[VOTE] KIP-33 - Add a time based log index to Kafka

2016-02-03 Thread Becket Qin
Hi all, I would like to initiate the vote for KIP-33. https://cwiki.apache.org/confluence/display/KAFKA/KIP-33 +-+Add+a+time+based+log+index A good amount of the KIP has been touched during the discussion on KIP-32. So I also put the link to KIP-32 here for reference. https://cwiki.apache.org/co

Re: [VOTE] KIP-33 - Add a time based log index to Kafka

2016-02-23 Thread Becket Qin
Bump. Per Jun's comments during KIP hangout, I have updated wiki with the upgrade plan or KIP-33. Let's vote! Thanks, Jiangjie (Becket) Qin On Wed, Feb 3, 2016 at 10:32 AM, Becket Qin wrote: > Hi all, > > I would like to initiate the vote for KIP-33. > > https://cwiki.apache.org/confluence/d

Re: [VOTE] KIP-33 - Add a time based log index to Kafka

2016-02-24 Thread Guozhang Wang
Thanks Jiangjie, a few comments on the wiki: 1. Config name "time.index.interval" to "time.index.interval.ms" to be consistent. Also do we need a "time.index.size.max.bytes" as well? 2. Will the memory mapped index file for timestamp have the same default initial / max size (10485760) as the offs

Re: [VOTE] KIP-33 - Add a time based log index to Kafka

2016-02-24 Thread Becket Qin
Thanks for the comment Guozhang, I just changed the configuration name to "time.index.interval.ms". It seems the real question here is how big the offset indices will be. Theoretically we can have one time index entry for each message in a log segment. For example, if there is one event per minut

Re: [VOTE] KIP-33 - Add a time based log index to Kafka

2016-02-24 Thread Becket Qin
Hi Guozhang, I thought about this again and it seems we stilll need the time.index.interval.ms configuration to avoid unnecessary frequent time index insertion. I just updated the wiki to add index.interval.bytes as an additional constraints for time index entry insertion. Another slight change m

Re: [VOTE] KIP-33 - Add a time based log index to Kafka

2016-02-25 Thread Guozhang Wang
Jiangjie, I was originally only thinking about the "time.index.size.max.bytes" config in addition to the "offset.index.size.max.bytes". Since the latter's default size is 10MB, and for memory mapped file, we will allocate that much of memory at the start which could be a pressure on RAM if we doub

Re: [VOTE] KIP-33 - Add a time based log index to Kafka

2016-02-28 Thread Becket Qin
Hi Guozhang, The size of memory mapped index file was also our concern as well. That is why we are suggesting minute level time indexing instead of second level. There are a few thoughts on the extra memory cost of time index. 1. Currently all the index files are loaded as memory mapped files. No

Re: [VOTE] KIP-33 - Add a time based log index to Kafka

2016-02-29 Thread Jun Rao
Hi, Becket, I thought that your proposal to build time-based index just based off index.interval.bytes is reasonable. Is there a particular need to also add time. index.interval.bytes? Compute the pre-allocated index file size based on log segment file size can be useful. However, the tricky thin

Re: [VOTE] KIP-33 - Add a time based log index to Kafka

2016-02-29 Thread Becket Qin
Hi Jun, I think index.interval.bytes is used to control the density of the offset index. The counterpart of index.interval.bytes for time index is time.index.interval.ms. If we did not change the semantic of log.roll.ms, log.roll.ms/time.index.interval.ms and log.segment.bytes/index.interval.bytes

Re: [VOTE] KIP-33 - Add a time based log index to Kafka

2016-03-01 Thread Jun Rao
Jiangjie, Currently, we roll a new log segment if the index is full. We can probably just do the same on the time index. This will bound the index size. 1. Sounds good. 2. I was wondering an edge case where the largest timestamp is in the oldest segment and the time index is empty is in all newe

Re: [VOTE] KIP-33 - Add a time based log index to Kafka

2016-03-01 Thread Becket Qin
Hi Jun, Rolling out a new segment when the time index is full sounds good. So both time index and offset index will be sharing the configuration of max index size. If we do that, do you think we still want to reuse index.interval.bytes? If we don't, the risk is that in some corner cases, we might

Re: [VOTE] KIP-33 - Add a time based log index to Kafka

2016-03-01 Thread Jun Rao
Hi, Jiangjie, I was thinking perhaps just reusing index.interval.bytes is enough. Not sure if there is much value in adding an additional time.index.interval.ms. For 1, the timestamp index has entries of timestamp -> file position. So, there is actually no offset in the index, right? For 2, what

Re: [VOTE] KIP-33 - Add a time based log index to Kafka

2016-03-01 Thread Becket Qin
Hi Jun, I see. If we only use index.interval.bytes, the time index entry will be inserted when (1) the largest timestamp is in this segment AND (2) at least index.interval.bytes have been appended since last time index entry insertion. In this case (1) becomes implicit instead of having an explici

Re: [VOTE] KIP-33 - Add a time based log index to Kafka

2016-03-07 Thread Becket Qin
Hi Jun, What do you think about the above solution? I am trying to include KIP-33 into 0.10.0 because the log retention has been a long pending issue. Thanks, Jiangjie (Becket) Qin On Tue, Mar 1, 2016 at 8:18 PM, Becket Qin wrote: > Hi Jun, > > I see. If we only use index.interval.bytes, the

Re: [VOTE] KIP-33 - Add a time based log index to Kafka

2016-03-08 Thread Becket Qin
I updated the wiki to make the following change: Instead of maintaining a globally monotonically increasing time index. We only make sure the time index for each log segment is monotonically increasing. By doing that everything seems much simpler. we avoid empty time index. Enforcing log retention