Hi, Jason,

Thanks for the reply. They sound good to me.

Jun

On Fri, Jan 27, 2017 at 4:42 PM, Jason Gustafson <ja...@confluent.io> wrote:

> A few more responses:
>
> > 101. Compatibility during upgrade: Suppose that the brokers are upgraded
> > to the new version, but the broker message format is still the old one.
> > If a new producer uses the transaction feature, should the producer get
> > an error in this case? A tricky case can be that the leader broker is on
> > the new message format, but the follower broker is still on the old
> > message format. In this case, the transactional info will be lost in the
> > follower due to down conversion. Should we fail the transactional
> > requests when the followers are still on the old message format?
>
> We've added some more details to the document about migration. Please take
> a look. Two points worth mentioning:
>
> 1. Replicas currently take the message format used by the leader. As long
> as users do the usual procedure of two rolling bounces, it should be safe
> to upgrade the message format.
>
> 2. There is no way to support idempotent or transactional features if we
> downgrade the message format in the produce request handler. We've modified
> the design document to only permit message downgrades if the producer has
> disabled idempotence. Otherwise, we will return an
> UNSUPPORTED_FOR_MESSAGE_FORMAT error.
>
> > 110. Transaction log:
> > 110.1 "Key => Version AppID Version" It seems that Version should really
> > be Type?
> > 110.2 "Value => Version Epoch Status ExpirationTime [Topic Partition]"
> > Should we store [Topic [Partition]] instead?
> > 110.3 To expire an AppID, do we need to insert a tombstone with the
> > expired AppID as the key to physically remove the existing AppID entries
> > in the transaction log?
>
> Fixed in the document. For 110.3, yes, we need to insert a tombstone after
> the AppID has expired. This will work in much the same way as the consumer
> coordinator expires offsets using a periodic task.
>
> > 116.
> > ProduceRequest: The existing format doesn't have "MessageSetSize" at
> > the partition level.
>
> This was intentional, but it is easy to overlook. The idea is to modify the
> ProduceRequest so that only one message set is included for each partition.
> Since the message set contains its own length field, it seemed unnecessary
> to have a separate field. The justification for this change was to make the
> produce request atomic. With only a single message set for each partition,
> either it will be written successfully or not, so an error in the response
> will be unambiguous. We are uncertain whether there are legitimate use
> cases that require producing smaller message sets in the ProduceRequest, so
> we would love to hear feedback on this.
>
> Thanks,
> Jason
>
> On Fri, Jan 27, 2017 at 4:21 PM, Apurva Mehta <apu...@confluent.io> wrote:
>
> > Hi again Jun,
> >
> > I have updated the document to address your comments below, but am
> > including the responses inline to make it easier for everyone to stay on
> > top of the conversation.
> >
> > > 106. Compacted topics.
> > > 106.1. When all messages in a transaction are removed, we could remove
> > > the commit/abort marker for that transaction too. However, we have to
> > > be a bit careful. If the marker is removed too quickly, it's possible
> > > for a consumer to see a message in that transaction, but not to see
> > > the marker, and therefore to be stuck in that transaction forever. We
> > > have a similar issue when dealing with tombstones. The solution is to
> > > preserve the tombstone for at least a preconfigured amount of time
> > > after the cleaning has passed the tombstone. Then, as long as a
> > > consumer can finish reading to the cleaning point within the
> > > configured amount of time, it's guaranteed not to miss the tombstone
> > > after it has seen a non-tombstone message on the same key. I am
> > > wondering if we should do something similar here.
> >
> > This is a good point. As we discussed offline, the solution for the
> > removal of control messages will be the same as the solution for the
> > problem of tombstone removal documented in
> > https://issues.apache.org/jira/browse/KAFKA-4545.
> >
> > > 106.2. "To address this problem, we propose to preserve the last epoch
> > > and sequence number written by each producer for a fixed amount of
> > > time as an empty message set. This is allowed by the new message
> > > format we are proposing in this document. The time to preserve the
> > > sequence number will be governed by the log retention settings."
> > > Could you be a bit more specific on what retention time will be used
> > > since by default, there is no retention time for a compacted (but not
> > > delete) topic?
> >
> > We discussed this offline, and the consensus is that it is reasonable to
> > use the broker's global log.retention.* settings for these messages.
> >
> > > 106.3 "As for control messages, if the broker does not have any
> > > corresponding transaction cached with the PID when encountering a
> > > control message, that message can be safely removed."
> > > Do control messages have keys? If not, do we need to relax the
> > > constraint that messages in a compacted topic must have keys?
> >
> > The key of a control message is the control message type. As such,
> > regular compaction logic based on key will not apply to control messages.
> > We will have to update the log cleaner to ignore messages which have the
> > control message bit set.
> >
> > Control messages can be removed at some point after the last messages of
> > the corresponding transaction are removed. As suggested in KAFKA-4545, we
> > can use the timestamp associated with the log segment to deduce the safe
> > expiration time for control messages in that segment.
> >
> > > 112. Control message: Will control messages be used for timestamp
> > > indexing?
> > > If so, what timestamp will we use if the timestamp type is creation
> > > time?
> >
> > Control messages will not be used for timestamp indexing. Each control
> > message will have the log append time for the timestamp, but these
> > messages will be ignored when building the timestamp index. Since control
> > messages are for system use only and will never be exposed to users, it
> > doesn't make sense to include them in the timestamp index.
> >
> > Further, as you mentioned, when a topic uses creation time, it is
> > impossible to ensure that control messages will not skew the time based
> > index, since these messages are sent by the transaction coordinator which
> > has no notion of the application level message creation time.
> >
> > Thanks,
> > Apurva
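
For anyone following along, the answer to 112 above could be pictured
roughly as follows. This is only an illustrative sketch, not broker code:
the Message class, the is_control flag (standing in for the control message
bit), and the simplified index that records an entry whenever the maximum
timestamp increases are all assumptions made up for the example.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Message:
    offset: int
    timestamp: int            # milliseconds
    is_control: bool = False  # hypothetical stand-in for the control message bit


def build_timestamp_index(messages: List[Message]) -> List[Tuple[int, int]]:
    """Return (timestamp, offset) entries, skipping control messages.

    Simplified on purpose: one entry is recorded each time a user message
    raises the maximum timestamp seen so far.
    """
    index: List[Tuple[int, int]] = []
    max_ts = -1
    for m in messages:
        if m.is_control:
            # Control messages carry the log append time and are ignored,
            # so they cannot skew an index built on creation time.
            continue
        if m.timestamp > max_ts:
            max_ts = m.timestamp
            index.append((m.timestamp, m.offset))
    return index


log = [
    Message(offset=0, timestamp=100),
    Message(offset=1, timestamp=9999, is_control=True),  # commit marker
    Message(offset=2, timestamp=150),
]
print(build_timestamp_index(log))
```

With the marker skipped, the entry for offset 2 is still recorded; had the
marker's append time been indexed, it would have pushed the maximum
timestamp past every creation time that follows it.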