Hey Jun,

It seems we have made considerable progress on the KIP-253 discussion since February. Do you think we should continue the discussion there, or can we resume the vote on this KIP? I am happy to submit the PR and move this KIP forward.
Thanks! Dong On Wed, Feb 7, 2018 at 11:42 PM, Dong Lin <lindon...@gmail.com> wrote: > Hey Jun, > > Sure, I will come up with a KIP this week. I think there is a way to allow > partition expansion to arbitrary number without introducing new concepts > such as read-only partition or repartition epoch. > > Thanks, > Dong > > On Wed, Feb 7, 2018 at 5:28 PM, Jun Rao <j...@confluent.io> wrote: > >> Hi, Dong, >> >> Thanks for the reply. The general idea that you had for adding partitions >> is similar to what we had in mind. It would be useful to make this more >> general, allowing adding an arbitrary number of partitions (instead of >> just >> doubling) and potentially removing partitions as well. The following is >> the >> high level idea from the discussion with Colin, Jason and Ismael. >> >> * To change the number of partitions from X to Y in a topic, the >> controller >> marks all existing X partitions as read-only and creates Y new partitions. >> The new partitions are writable and are tagged with a higher repartition >> epoch (RE). >> >> * The controller propagates the new metadata to every broker. Once the >> leader of a partition is marked as read-only, it rejects the produce >> requests on this partition. The producer will then refresh the metadata >> and >> start publishing to the new writable partitions. >> >> * The consumers will then be consuming messages in RE order. The consumer >> coordinator will only assign partitions in the same RE to consumers. Only >> after all messages in an RE are consumed, will partitions in a higher RE >> be >> assigned to consumers. >> >> As Colin mentioned, if we do the above, we could potentially (1) use a >> globally unique partition id, or (2) use a globally unique topic id to >> distinguish recreated partitions due to topic deletion. >> >> So, perhaps we can sketch out the re-partitioning KIP a bit more and see >> if >> there is any overlap with KIP-232. Would you be interested in doing that? >> If not, we can do that next week. >> >> Jun >> >> >> On Tue, Feb 6, 2018 at 11:30 AM, Dong Lin <lindon...@gmail.com> wrote: >> >> > Hey Jun, >> > >> > Interestingly I am also planning to sketch a KIP to allow partition >> > expansion for keyed topics after this KIP. Since you are already doing >> > that, I guess I will just share my high level idea here in case it is >> > helpful. >> > >> > The motivation for the KIP is that we currently lose order guarantee for >> > messages with the same key if we expand partitions of keyed topic. >> > >> > The solution can probably be built upon the following ideas: >> > >> > - Partition number of the keyed topic should always be doubled (or >> > multiplied by power of 2). Given that we select a partition based on >> > hash(key) % partitionNum, this should help us ensure that, a message >> > assigned to an existing partition will not be mapped to another existing >> > partition after partition expansion. >> > >> > - Producer includes in the ProduceRequest some information that helps >> > ensure that messages produced ti a partition will monotonically >> increase in >> > the partitionNum of the topic. In other words, if broker receives a >> > ProduceRequest and notices that the producer does not know the partition >> > number has increased, broker should reject this request. That >> "information" >> > maybe leaderEpoch, max partitionEpoch of the partitions of the topic, or >> > simply partitionNum of the topic. 
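To illustrate the doubling rule above: with the usual hash(key) % partitionNum assignment, growing a topic from N to 2N partitions sends each key either to its old partition p or to the new partition p + N, never to a different pre-existing partition. A minimal sketch of that property (helper names are hypothetical, and Objects.hashCode stands in for the producer's real key hash):

```java
import java.util.Objects;

public class PartitionDoublingSketch {
    // Conceptual stand-in for hash(key) % partitionNum; the real producer
    // hashes serialized key bytes, but the modular property is the same.
    static int partitionFor(String key, int partitionCount) {
        return (Objects.hashCode(key) & 0x7fffffff) % partitionCount;
    }

    public static void main(String[] args) {
        int oldCount = 8;
        int newCount = 2 * oldCount; // partition count is only ever doubled
        for (String key : new String[] {"user-1", "user-2", "order-42"}) {
            int before = partitionFor(key, oldCount);
            int after = partitionFor(key, newCount);
            // after is either the old partition or old partition + oldCount,
            // so a key never lands in a *different* pre-existing partition.
            System.out.printf("%s: %d -> %d%n", key, before, after);
            if (after != before && after != before + oldCount) {
                throw new AssertionError("doubling property violated");
            }
        }
    }
}
```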
The benefit of this property is that >> we >> > can keep the new logic for in-order message consumption entirely in how >> > consumer leader determines the partition -> consumer mapping. >> > >> > - When consumer leader determines partition -> consumer mapping, leader >> > first reads the start position for each partition using >> OffsetFetchRequest. >> > If start position are all non-zero, then assignment can be done in its >> > current manner. The assumption is that, a message in the new partition >> > should only be consumed after all messages with the same key produced >> > before it has been consumed. Since some messages in the new partition >> has >> > been consumed, we should not worry about consuming messages >> out-of-order. >> > This benefit of this approach is that we can avoid unnecessary overhead >> in >> > the common case. >> > >> > - If the consumer leader finds that the start position for some >> partition >> > is 0. Say the current partition number is 18 and the partition index is >> 12, >> > then consumer leader should ensure that messages produced to partition >> 12 - >> > 18/2 = 3 before the first message of partition 12 is consumed, before it >> > assigned partition 12 to any consumer in the consumer group. Since we >> have >> > a "information" that is monotonically increasing per partition, consumer >> > can read the value of this information from the first message in >> partition >> > 12, get the offset corresponding to this value in partition 3, assign >> > partition except for partition 12 (and probably other new partitions) to >> > the existing consumers, waiting for the committed offset to go beyond >> this >> > offset for partition 3, and trigger rebalance again so that partition 3 >> can >> > be reassigned to some consumer. >> > >> > >> > Thanks, >> > Dong >> > >> > >> > On Tue, Feb 6, 2018 at 10:10 AM, Jun Rao <j...@confluent.io> wrote: >> > >> > > Hi, Dong, >> > > >> > > Thanks for the KIP. It looks good overall. We are working on a >> separate >> > KIP >> > > for adding partitions while preserving the ordering guarantees. That >> may >> > > require another flavor of partition epoch. It's not very clear whether >> > that >> > > partition epoch can be merged with the partition epoch in this KIP. >> So, >> > > perhaps you can wait on this a bit until we post the other KIP in the >> > next >> > > few days. >> > > >> > > Jun >> > > >> > > >> > > >> > > On Mon, Feb 5, 2018 at 2:43 PM, Becket Qin <becket....@gmail.com> >> wrote: >> > > >> > > > +1 on the KIP. >> > > > >> > > > I think the KIP is mainly about adding the capability of tracking >> the >> > > > system state change lineage. It does not seem necessary to bundle >> this >> > > KIP >> > > > with replacing the topic partition with partition epoch in >> > produce/fetch. >> > > > Replacing topic-partition string with partition epoch is >> essentially a >> > > > performance improvement on top of this KIP. That can probably be >> done >> > > > separately. 
>> > > > >> > > > Thanks, >> > > > >> > > > Jiangjie (Becket) Qin >> > > > >> > > > On Mon, Jan 29, 2018 at 11:52 AM, Dong Lin <lindon...@gmail.com> >> > wrote: >> > > > >> > > > > Hey Colin, >> > > > > >> > > > > On Mon, Jan 29, 2018 at 11:23 AM, Colin McCabe < >> cmcc...@apache.org> >> > > > wrote: >> > > > > >> > > > > > > On Mon, Jan 29, 2018 at 10:35 AM, Dong Lin < >> lindon...@gmail.com> >> > > > > wrote: >> > > > > > > >> > > > > > > > Hey Colin, >> > > > > > > > >> > > > > > > > I understand that the KIP will adds overhead by introducing >> > > > > > per-partition >> > > > > > > > partitionEpoch. I am open to alternative solutions that does >> > not >> > > > > incur >> > > > > > > > additional overhead. But I don't see a better way now. >> > > > > > > > >> > > > > > > > IMO the overhead in the FetchResponse may not be that much. >> We >> > > > > probably >> > > > > > > > should discuss the percentage increase rather than the >> absolute >> > > > > number >> > > > > > > > increase. Currently after KIP-227, per-partition header has >> 23 >> > > > bytes. >> > > > > > This >> > > > > > > > KIP adds another 4 bytes. Assume the records size is 10KB, >> the >> > > > > > percentage >> > > > > > > > increase is 4 / (23 + 10000) = 0.03%. It seems negligible, >> > right? >> > > > > > >> > > > > > Hi Dong, >> > > > > > >> > > > > > Thanks for the response. I agree that the FetchRequest / >> > > FetchResponse >> > > > > > overhead should be OK, now that we have incremental fetch >> requests >> > > and >> > > > > > responses. However, there are a lot of cases where the >> percentage >> > > > > increase >> > > > > > is much greater. For example, if a client is doing full >> > > > > MetadataRequests / >> > > > > > Responses, we have some math kind of like this per partition: >> > > > > > >> > > > > > > UpdateMetadataRequestPartitionState => topic partition >> > > > > controller_epoch >> > > > > > leader leader_epoch partition_epoch isr zk_version replicas >> > > > > > offline_replicas >> > > > > > > 14 bytes: topic => string (assuming about 10 byte topic >> names) >> > > > > > > 4 bytes: partition => int32 >> > > > > > > 4 bytes: conroller_epoch => int32 >> > > > > > > 4 bytes: leader => int32 >> > > > > > > 4 bytes: leader_epoch => int32 >> > > > > > > +4 EXTRA bytes: partition_epoch => int32 <-- NEW >> > > > > > > 2+4+4+4 bytes: isr => [int32] (assuming 3 in the ISR) >> > > > > > > 4 bytes: zk_version => int32 >> > > > > > > 2+4+4+4 bytes: replicas => [int32] (assuming 3 replicas) >> > > > > > > 2 offline_replicas => [int32] (assuming no offline replicas) >> > > > > > >> > > > > > Assuming I added that up correctly, the per-partition overhead >> goes >> > > > from >> > > > > > 64 bytes per partition to 68, a 6.2% increase. >> > > > > > >> > > > > > We could do similar math for a lot of the other RPCs. And you >> will >> > > > have >> > > > > a >> > > > > > similar memory and garbage collection impact on the brokers >> since >> > you >> > > > > have >> > > > > > to store all this extra state as well. >> > > > > > >> > > > > >> > > > > That is correct. IMO the Metadata is only updated periodically >> and is >> > > > > probably not a big deal if we increase it by 6%. The FetchResponse >> > and >> > > > > ProduceRequest are probably the only requests that are bounded by >> the >> > > > > bandwidth throughput. 
>> > > > > >> > > > > >> > > > > > >> > > > > > > > >> > > > > > > > I agree that we can probably save more space by using >> partition >> > > ID >> > > > so >> > > > > > that >> > > > > > > > we no longer needs the string topic name. The similar idea >> has >> > > also >> > > > > > been >> > > > > > > > put in the Rejected Alternative section in KIP-227. While >> this >> > > idea >> > > > > is >> > > > > > > > promising, it seems orthogonal to the goal of this KIP. >> Given >> > > that >> > > > > > there is >> > > > > > > > already many work to do in this KIP, maybe we can do the >> > > partition >> > > > ID >> > > > > > in a >> > > > > > > > separate KIP? >> > > > > > >> > > > > > I guess my thinking is that the goal here is to replace an >> > identifier >> > > > > > which can be re-used (the tuple of topic name, partition ID) >> with >> > an >> > > > > > identifier that cannot be re-used (the tuple of topic name, >> > partition >> > > > ID, >> > > > > > partition epoch) in order to gain better semantics. As long as >> we >> > > are >> > > > > > replacing the identifier, why not replace it with an identifier >> > that >> > > > has >> > > > > > important performance advantages? The KIP freeze for the next >> > > release >> > > > > has >> > > > > > already passed, so there is time to do this. >> > > > > > >> > > > > >> > > > > In general it can be easier for discussion and implementation if >> we >> > can >> > > > > split a larger task into smaller and independent tasks. For >> example, >> > > > > KIP-112 and KIP-113 both deals with the JBOD support. KIP-31, >> KIP-32 >> > > and >> > > > > KIP-33 are about timestamp support. The option on this can be >> subject >> > > > > though. >> > > > > >> > > > > IMO the change to switch from (topic, partition ID) to >> partitionEpch >> > in >> > > > all >> > > > > request/response requires us to going through all request one by >> one. >> > > It >> > > > > may not be hard but it can be time consuming and tedious. At high >> > level >> > > > the >> > > > > goal and the change for that will be orthogonal to the changes >> > required >> > > > in >> > > > > this KIP. That is the main reason I think we can split them into >> two >> > > > KIPs. >> > > > > >> > > > > >> > > > > > On Mon, Jan 29, 2018, at 10:54, Dong Lin wrote: >> > > > > > > I think it is possible to move to entirely use partitionEpoch >> > > instead >> > > > > of >> > > > > > > (topic, partition) to identify a partition. Client can obtain >> the >> > > > > > > partitionEpoch -> (topic, partition) mapping from >> > MetadataResponse. >> > > > We >> > > > > > > probably need to figure out a way to assign partitionEpoch to >> > > > existing >> > > > > > > partitions in the cluster. But this should be doable. >> > > > > > > >> > > > > > > This is a good idea. I think it will save us some space in the >> > > > > > > request/response. The actual space saving in percentage >> probably >> > > > > depends >> > > > > > on >> > > > > > > the amount of data and the number of partitions of the same >> > topic. >> > > I >> > > > > just >> > > > > > > think we can do it in a separate KIP. >> > > > > > >> > > > > > Hmm. How much extra work would be required? It seems like we >> are >> > > > > already >> > > > > > changing almost every RPC that involves topics and partitions, >> > > already >> > > > > > adding new per-partition state to ZooKeeper, already changing >> how >> > > > clients >> > > > > > interact with partitions. 
Is there some other big piece of work >> > we'd >> > > > > have >> > > > > > to do to move to partition IDs that we wouldn't need for >> partition >> > > > > epochs? >> > > > > > I guess we'd have to find a way to support regular >> expression-based >> > > > topic >> > > > > > subscriptions. If we split this into multiple KIPs, wouldn't we >> > end >> > > up >> > > > > > changing all that RPCs and ZK state a second time? Also, I'm >> > curious >> > > > if >> > > > > > anyone has done any proof of concept GC, memory, and network >> usage >> > > > > > measurements on switching topic names for topic IDs. >> > > > > > >> > > > > >> > > > > >> > > > > We will need to go over all requests/responses to check how to >> > replace >> > > > > (topic, partition ID) with partition epoch. It requires >> non-trivial >> > > work >> > > > > and could take time. As you mentioned, we may want to see how much >> > > saving >> > > > > we can get by switching from topic names to partition epoch. That >> > > itself >> > > > > requires time and experiment. It seems that the new idea does not >> > > > rollback >> > > > > any change proposed in this KIP. So I am not sure we can get much >> by >> > > > > putting them into the same KIP. >> > > > > >> > > > > Anyway, if more people are interested in seeing the new idea in >> the >> > > same >> > > > > KIP, I can try that. >> > > > > >> > > > > >> > > > > >> > > > > > >> > > > > > best, >> > > > > > Colin >> > > > > > >> > > > > > > >> > > > > > > >> > > > > > > >> > > > > > > > >> > > > > > > > >> > > > > > > > >> > > > > > > > >> > > > > > > > >> > > > > > > > >> > > > > > > > >> > > > > > > > >> > > > > > > > >> > > > > > > > >> > > > > > > > On Mon, Jan 29, 2018 at 10:18 AM, Colin McCabe < >> > > cmcc...@apache.org >> > > > > >> > > > > > wrote: >> > > > > > > > >> > > > > > > >> On Fri, Jan 26, 2018, at 12:17, Dong Lin wrote: >> > > > > > > >> > Hey Colin, >> > > > > > > >> > >> > > > > > > >> > >> > > > > > > >> > On Fri, Jan 26, 2018 at 10:16 AM, Colin McCabe < >> > > > > cmcc...@apache.org> >> > > > > > > >> wrote: >> > > > > > > >> > >> > > > > > > >> > > On Thu, Jan 25, 2018, at 16:47, Dong Lin wrote: >> > > > > > > >> > > > Hey Colin, >> > > > > > > >> > > > >> > > > > > > >> > > > Thanks for the comment. >> > > > > > > >> > > > >> > > > > > > >> > > > On Thu, Jan 25, 2018 at 4:15 PM, Colin McCabe < >> > > > > > cmcc...@apache.org> >> > > > > > > >> > > wrote: >> > > > > > > >> > > > >> > > > > > > >> > > > > On Wed, Jan 24, 2018, at 21:07, Dong Lin wrote: >> > > > > > > >> > > > > > Hey Colin, >> > > > > > > >> > > > > > >> > > > > > > >> > > > > > Thanks for reviewing the KIP. >> > > > > > > >> > > > > > >> > > > > > > >> > > > > > If I understand you right, you maybe suggesting >> that >> > > we >> > > > > can >> > > > > > use >> > > > > > > >> a >> > > > > > > >> > > global >> > > > > > > >> > > > > > metadataEpoch that is incremented every time >> > > controller >> > > > > > updates >> > > > > > > >> > > metadata. >> > > > > > > >> > > > > > The problem with this solution is that, if a >> topic >> > is >> > > > > > deleted >> > > > > > > >> and >> > > > > > > >> > > created >> > > > > > > >> > > > > > again, user will not know whether that the offset >> > > which >> > > > is >> > > > > > > >> stored >> > > > > > > >> > > before >> > > > > > > >> > > > > > the topic deletion is no longer valid. This >> > motivates >> > > > the >> > > > > > idea >> > > > > > > >> to >> > > > > > > >> > > include >> > > > > > > >> > > > > > per-partition partitionEpoch. 
Does this sound >> > > > reasonable? >> > > > > > > >> > > > > >> > > > > > > >> > > > > Hi Dong, >> > > > > > > >> > > > > >> > > > > > > >> > > > > Perhaps we can store the last valid offset of each >> > > deleted >> > > > > > topic >> > > > > > > >> in >> > > > > > > >> > > > > ZooKeeper. Then, when a topic with one of those >> names >> > > > gets >> > > > > > > >> > > re-created, we >> > > > > > > >> > > > > can start the topic at the previous end offset >> rather >> > > than >> > > > > at >> > > > > > 0. >> > > > > > > >> This >> > > > > > > >> > > > > preserves immutability. It is no more burdensome >> than >> > > > > having >> > > > > > to >> > > > > > > >> > > preserve a >> > > > > > > >> > > > > "last epoch" for the deleted partition somewhere, >> > right? >> > > > > > > >> > > > > >> > > > > > > >> > > > >> > > > > > > >> > > > My concern with this solution is that the number of >> > > > zookeeper >> > > > > > nodes >> > > > > > > >> get >> > > > > > > >> > > > more and more over time if some users keep deleting >> and >> > > > > creating >> > > > > > > >> topics. >> > > > > > > >> > > Do >> > > > > > > >> > > > you think this can be a problem? >> > > > > > > >> > > >> > > > > > > >> > > Hi Dong, >> > > > > > > >> > > >> > > > > > > >> > > We could expire the "partition tombstones" after an >> hour >> > or >> > > > so. >> > > > > > In >> > > > > > > >> > > practice this would solve the issue for clients that >> like >> > to >> > > > > > destroy >> > > > > > > >> and >> > > > > > > >> > > re-create topics all the time. In any case, doesn't >> the >> > > > current >> > > > > > > >> proposal >> > > > > > > >> > > add per-partition znodes as well that we have to track >> > even >> > > > > after >> > > > > > the >> > > > > > > >> > > partition is deleted? Or did I misunderstand that? >> > > > > > > >> > > >> > > > > > > >> > >> > > > > > > >> > Actually the current KIP does not add per-partition >> znodes. >> > > > Could >> > > > > > you >> > > > > > > >> > double check? I can fix the KIP wiki if there is anything >> > > > > > misleading. >> > > > > > > >> >> > > > > > > >> Hi Dong, >> > > > > > > >> >> > > > > > > >> I double-checked the KIP, and I can see that you are in >> fact >> > > > using a >> > > > > > > >> global counter for initializing partition epochs. So, you >> are >> > > > > > correct, it >> > > > > > > >> doesn't add per-partition znodes for partitions that no >> longer >> > > > > exist. >> > > > > > > >> >> > > > > > > >> > >> > > > > > > >> > If we expire the "partition tomstones" after an hour, and >> > the >> > > > > topic >> > > > > > is >> > > > > > > >> > re-created after more than an hour since the topic >> deletion, >> > > > then >> > > > > > we are >> > > > > > > >> > back to the situation where user can not tell whether the >> > > topic >> > > > > has >> > > > > > been >> > > > > > > >> > re-created or not, right? >> > > > > > > >> >> > > > > > > >> Yes, with an expiration period, it would not ensure >> > > immutability-- >> > > > > you >> > > > > > > >> could effectively reuse partition names and they would look >> > the >> > > > > same. >> > > > > > > >> >> > > > > > > >> > >> > > > > > > >> > >> > > > > > > >> > > >> > > > > > > >> > > It's not really clear to me what should happen when a >> > topic >> > > is >> > > > > > > >> destroyed >> > > > > > > >> > > and re-created with new data. Should consumers >> continue >> > to >> > > be >> > > > > > able to >> > > > > > > >> > > consume? 
We don't know where they stopped consuming >> from >> > > the >> > > > > > previous >> > > > > > > >> > > incarnation of the topic, so messages may have been >> lost. >> > > > > > Certainly >> > > > > > > >> > > consuming data from offset X of the new incarnation of >> the >> > > > topic >> > > > > > may >> > > > > > > >> give >> > > > > > > >> > > something totally different from what you would have >> > gotten >> > > > from >> > > > > > > >> offset X >> > > > > > > >> > > of the previous incarnation of the topic. >> > > > > > > >> > > >> > > > > > > >> > >> > > > > > > >> > With the current KIP, if a consumer consumes a topic >> based >> > on >> > > > the >> > > > > > last >> > > > > > > >> > remembered (offset, partitionEpoch, leaderEpoch), and if >> the >> > > > topic >> > > > > > is >> > > > > > > >> > re-created, consume will throw >> > InvalidPartitionEpochException >> > > > > > because >> > > > > > > >> the >> > > > > > > >> > previous partitionEpoch will be different from the >> current >> > > > > > > >> partitionEpoch. >> > > > > > > >> > This is described in the Proposed Changes -> Consumption >> > after >> > > > > topic >> > > > > > > >> > deletion in the KIP. I can improve the KIP if there is >> > > anything >> > > > > not >> > > > > > > >> clear. >> > > > > > > >> >> > > > > > > >> Thanks for the clarification. It sounds like what you >> really >> > > want >> > > > > is >> > > > > > > >> immutability-- i.e., to never "really" reuse partition >> > > > identifiers. >> > > > > > And >> > > > > > > >> you do this by making the partition name no longer the >> "real" >> > > > > > identifier. >> > > > > > > >> >> > > > > > > >> My big concern about this KIP is that it seems like an >> > > > > > anti-scalability >> > > > > > > >> feature. Now we are adding 4 extra bytes for every >> partition >> > in >> > > > the >> > > > > > > >> FetchResponse and Request, for example. That could be 40 >> kb >> > per >> > > > > > request, >> > > > > > > >> if the user has 10,000 partitions. And of course, the KIP >> > also >> > > > > makes >> > > > > > > >> massive changes to UpdateMetadataRequest, MetadataResponse, >> > > > > > > >> OffsetCommitRequest, OffsetFetchResponse, >> LeaderAndIsrRequest, >> > > > > > > >> ListOffsetResponse, etc. which will also increase their >> size >> > on >> > > > the >> > > > > > wire >> > > > > > > >> and in memory. >> > > > > > > >> >> > > > > > > >> One thing that we talked a lot about in the past is >> replacing >> > > > > > partition >> > > > > > > >> names with IDs. IDs have a lot of really nice features. >> They >> > > > take >> > > > > > up much >> > > > > > > >> less space in memory than strings (especially 2-byte Java >> > > > strings). >> > > > > > They >> > > > > > > >> can often be allocated on the stack rather than the heap >> > > > (important >> > > > > > when >> > > > > > > >> you are dealing with hundreds of thousands of them). They >> can >> > > be >> > > > > > > >> efficiently deserialized and serialized. If we use 64-bit >> > ones, >> > > > we >> > > > > > will >> > > > > > > >> never run out of IDs, which means that they can always be >> > unique >> > > > per >> > > > > > > >> partition. >> > > > > > > >> >> > > > > > > >> Given that the partition name is no longer the "real" >> > identifier >> > > > for >> > > > > > > >> partitions in the current KIP-232 proposal, why not just >> move >> > to >> > > > > using >> > > > > > > >> partition IDs entirely instead of strings? 
You have to >> change >> > > all >> > > > > the >> > > > > > > >> messages anyway. There isn't much point any more to >> carrying >> > > > around >> > > > > > the >> > > > > > > >> partition name in every RPC, since you really need (name, >> > epoch) >> > > > to >> > > > > > > >> identify the partition. >> > > > > > > >> Probably the metadata response and a few other messages >> would >> > > have >> > > > > to >> > > > > > > >> still carry the partition name, to allow clients to go from >> > name >> > > > to >> > > > > > id. >> > > > > > > >> But we could mostly forget about the strings. And then >> this >> > > would >> > > > > be >> > > > > > a >> > > > > > > >> scalability improvement rather than a scalability problem. >> > > > > > > >> >> > > > > > > >> > >> > > > > > > >> > >> > > > > > > >> > > By choosing to reuse the same (topic, partition, >> offset) >> > > > > 3-tuple, >> > > > > > we >> > > > > > > >> have >> > > > > > > >> > >> > > > > > > >> > chosen to give up immutability. That was a really bad >> > > decision. >> > > > > > And >> > > > > > > >> now >> > > > > > > >> > > we have to worry about time dependencies, stale cached >> > data, >> > > > and >> > > > > > all >> > > > > > > >> the >> > > > > > > >> > > rest. We can't completely fix this inside Kafka no >> matter >> > > > what >> > > > > > we do, >> > > > > > > >> > > because not all that cached data is inside Kafka >> itself. >> > > Some >> > > > > of >> > > > > > it >> > > > > > > >> may be >> > > > > > > >> > > in systems that Kafka has sent data to, such as other >> > > daemons, >> > > > > SQL >> > > > > > > >> > > databases, streams, and so forth. >> > > > > > > >> > > >> > > > > > > >> > >> > > > > > > >> > The current KIP will uniquely identify a message using >> > (topic, >> > > > > > > >> partition, >> > > > > > > >> > offset, partitionEpoch) 4-tuple. This addresses the >> message >> > > > > > immutability >> > > > > > > >> > issue that you mentioned. Is there any corner case where >> the >> > > > > message >> > > > > > > >> > immutability is still not preserved with the current KIP? >> > > > > > > >> > >> > > > > > > >> > >> > > > > > > >> > > >> > > > > > > >> > > I guess the idea here is that mirror maker should work >> as >> > > > > expected >> > > > > > > >> when >> > > > > > > >> > > users destroy a topic and re-create it with the same >> name. >> > > > > That's >> > > > > > > >> kind of >> > > > > > > >> > > tough, though, since in that scenario, mirror maker >> > probably >> > > > > > should >> > > > > > > >> destroy >> > > > > > > >> > > and re-create the topic on the other end, too, right? >> > > > > Otherwise, >> > > > > > > >> what you >> > > > > > > >> > > end up with on the other end could be half of one >> > > incarnation >> > > > of >> > > > > > the >> > > > > > > >> topic, >> > > > > > > >> > > and half of another. >> > > > > > > >> > > >> > > > > > > >> > > What mirror maker really needs is to be able to follow >> a >> > > > stream >> > > > > of >> > > > > > > >> events >> > > > > > > >> > > about the kafka cluster itself. We could have some >> master >> > > > topic >> > > > > > > >> which is >> > > > > > > >> > > always present and which contains data about all topic >> > > > > deletions, >> > > > > > > >> > > creations, etc. Then MM can simply follow this topic >> and >> > do >> > > > > what >> > > > > > is >> > > > > > > >> needed. 
>> > > > > > > >> > > >> > > > > > > >> > > > >> > > > > > > >> > > > >> > > > > > > >> > > > > >> > > > > > > >> > > > > > >> > > > > > > >> > > > > > Then the next question maybe, should we use a >> global >> > > > > > > >> metadataEpoch + >> > > > > > > >> > > > > > per-partition partitionEpoch, instead of using >> > > > > per-partition >> > > > > > > >> > > leaderEpoch >> > > > > > > >> > > > > + >> > > > > > > >> > > > > > per-partition leaderEpoch. The former solution >> using >> > > > > > > >> metadataEpoch >> > > > > > > >> > > would >> > > > > > > >> > > > > > not work due to the following scenario (provided >> by >> > > > Jun): >> > > > > > > >> > > > > > >> > > > > > > >> > > > > > "Consider the following scenario. In metadata v1, >> > the >> > > > > leader >> > > > > > > >> for a >> > > > > > > >> > > > > > partition is at broker 1. In metadata v2, leader >> is >> > at >> > > > > > broker >> > > > > > > >> 2. In >> > > > > > > >> > > > > > metadata v3, leader is at broker 1 again. The >> last >> > > > > committed >> > > > > > > >> offset >> > > > > > > >> > > in >> > > > > > > >> > > > > v1, >> > > > > > > >> > > > > > v2 and v3 are 10, 20 and 30, respectively. A >> > consumer >> > > is >> > > > > > > >> started and >> > > > > > > >> > > > > reads >> > > > > > > >> > > > > > metadata v1 and reads messages from offset 0 to >> 25 >> > > from >> > > > > > broker >> > > > > > > >> 1. My >> > > > > > > >> > > > > > understanding is that in the current proposal, >> the >> > > > > metadata >> > > > > > > >> version >> > > > > > > >> > > > > > associated with offset 25 is v1. The consumer is >> > then >> > > > > > restarted >> > > > > > > >> and >> > > > > > > >> > > > > fetches >> > > > > > > >> > > > > > metadata v2. The consumer tries to read from >> broker >> > 2, >> > > > > > which is >> > > > > > > >> the >> > > > > > > >> > > old >> > > > > > > >> > > > > > leader with the last offset at 20. In this case, >> the >> > > > > > consumer >> > > > > > > >> will >> > > > > > > >> > > still >> > > > > > > >> > > > > > get OffsetOutOfRangeException incorrectly." >> > > > > > > >> > > > > > >> > > > > > > >> > > > > > Regarding your comment "For the second purpose, >> this >> > > is >> > > > > > "soft >> > > > > > > >> state" >> > > > > > > >> > > > > > anyway. If the client thinks X is the leader >> but Y >> > is >> > > > > > really >> > > > > > > >> the >> > > > > > > >> > > leader, >> > > > > > > >> > > > > > the client will talk to X, and X will point out >> its >> > > > > mistake >> > > > > > by >> > > > > > > >> > > sending >> > > > > > > >> > > > > back >> > > > > > > >> > > > > > a NOT_LEADER_FOR_PARTITION.", it is probably no >> > true. >> > > > The >> > > > > > > >> problem >> > > > > > > >> > > here is >> > > > > > > >> > > > > > that the old leader X may still think it is the >> > leader >> > > > of >> > > > > > the >> > > > > > > >> > > partition >> > > > > > > >> > > > > and >> > > > > > > >> > > > > > thus it will not send back >> NOT_LEADER_FOR_PARTITION. >> > > The >> > > > > > reason >> > > > > > > >> is >> > > > > > > >> > > > > provided >> > > > > > > >> > > > > > in KAFKA-6262. Can you check if that makes sense? >> > > > > > > >> > > > > >> > > > > > > >> > > > > This is solvable with a timeout, right? If the >> leader >> > > > can't >> > > > > > > >> > > communicate >> > > > > > > >> > > > > with the controller for a certain period of time, >> it >> > > > should >> > > > > > stop >> > > > > > > >> > > acting as >> > > > > > > >> > > > > the leader. 
We have to solve this problem, >> anyway, in >> > > > order >> > > > > > to >> > > > > > > >> fix >> > > > > > > >> > > all the >> > > > > > > >> > > > > corner cases. >> > > > > > > >> > > > > >> > > > > > > >> > > > >> > > > > > > >> > > > Not sure if I fully understand your proposal. The >> > proposal >> > > > > > seems to >> > > > > > > >> > > require >> > > > > > > >> > > > non-trivial changes to our existing leadership >> election >> > > > > > mechanism. >> > > > > > > >> Could >> > > > > > > >> > > > you provide more detail regarding how it works? For >> > > example, >> > > > > how >> > > > > > > >> should >> > > > > > > >> > > > user choose this timeout, how leader determines >> whether >> > it >> > > > can >> > > > > > still >> > > > > > > >> > > > communicate with controller, and how this triggers >> > > > controller >> > > > > to >> > > > > > > >> elect >> > > > > > > >> > > new >> > > > > > > >> > > > leader? >> > > > > > > >> > > >> > > > > > > >> > > Before I come up with any proposal, let me make sure I >> > > > > understand >> > > > > > the >> > > > > > > >> > > problem correctly. My big question was, what prevents >> > > > > split-brain >> > > > > > > >> here? >> > > > > > > >> > > >> > > > > > > >> > > Let's say I have a partition which is on nodes A, B, >> and >> > C, >> > > > with >> > > > > > > >> min-ISR >> > > > > > > >> > > 2. The controller is D. At some point, there is a >> > network >> > > > > > partition >> > > > > > > >> > > between A and B and the rest of the cluster. The >> > Controller >> > > > > > > >> re-assigns the >> > > > > > > >> > > partition to nodes C, D, and E. But A and B keep >> chugging >> > > > away, >> > > > > > even >> > > > > > > >> > > though they can no longer communicate with the >> controller. >> > > > > > > >> > > >> > > > > > > >> > > At some point, a client with stale metadata writes to >> the >> > > > > > partition. >> > > > > > > >> It >> > > > > > > >> > > still thinks the partition is on node A, B, and C, so >> > that's >> > > > > > where it >> > > > > > > >> sends >> > > > > > > >> > > the data. It's unable to talk to C, but A and B reply >> > back >> > > > that >> > > > > > all >> > > > > > > >> is >> > > > > > > >> > > well. >> > > > > > > >> > > >> > > > > > > >> > > Is this not a case where we could lose data due to >> split >> > > > brain? >> > > > > > Or is >> > > > > > > >> > > there a mechanism for preventing this that I missed? >> If >> > it >> > > > is, >> > > > > it >> > > > > > > >> seems >> > > > > > > >> > > like a pretty serious failure case that we should be >> > > handling >> > > > > > with our >> > > > > > > >> > > metadata rework. And I think epoch numbers and >> timeouts >> > > might >> > > > > be >> > > > > > > >> part of >> > > > > > > >> > > the solution. >> > > > > > > >> > > >> > > > > > > >> > >> > > > > > > >> > Right, split brain can happen if RF=4 and minIsr=2. >> > However, I >> > > > am >> > > > > > not >> > > > > > > >> sure >> > > > > > > >> > it is a pretty serious issue which we need to address >> today. >> > > > This >> > > > > > can be >> > > > > > > >> > prevented by configuring the Kafka topic so that minIsr > >> > > RF/2. >> > > > > > > >> Actually, >> > > > > > > >> > if user sets minIsr=2, is there anything reason that user >> > > wants >> > > > to >> > > > > > set >> > > > > > > >> RF=4 >> > > > > > > >> > instead of 4? >> > > > > > > >> > >> > > > > > > >> > Introducing timeout in leader election mechanism is >> > > > non-trivial. 
I >> > > > > > > >> think we >> > > > > > > >> > probably want to do that only if there is good use-case >> that >> > > can >> > > > > not >> > > > > > > >> > otherwise be addressed with the current mechanism. >> > > > > > > >> >> > > > > > > >> I still would like to think about these corner cases more. >> > But >> > > > > > perhaps >> > > > > > > >> it's not directly related to this KIP. >> > > > > > > >> >> > > > > > > >> regards, >> > > > > > > >> Colin >> > > > > > > >> >> > > > > > > >> >> > > > > > > >> > >> > > > > > > >> > >> > > > > > > >> > > best, >> > > > > > > >> > > Colin >> > > > > > > >> > > >> > > > > > > >> > > >> > > > > > > >> > > > >> > > > > > > >> > > > >> > > > > > > >> > > > > best, >> > > > > > > >> > > > > Colin >> > > > > > > >> > > > > >> > > > > > > >> > > > > > >> > > > > > > >> > > > > > Regards, >> > > > > > > >> > > > > > Dong >> > > > > > > >> > > > > > >> > > > > > > >> > > > > > >> > > > > > > >> > > > > > On Wed, Jan 24, 2018 at 10:39 AM, Colin McCabe < >> > > > > > > >> cmcc...@apache.org> >> > > > > > > >> > > > > wrote: >> > > > > > > >> > > > > > >> > > > > > > >> > > > > > > Hi Dong, >> > > > > > > >> > > > > > > >> > > > > > > >> > > > > > > Thanks for proposing this KIP. I think a >> metadata >> > > > epoch >> > > > > > is a >> > > > > > > >> > > really >> > > > > > > >> > > > > good >> > > > > > > >> > > > > > > idea. >> > > > > > > >> > > > > > > >> > > > > > > >> > > > > > > I read through the DISCUSS thread, but I still >> > don't >> > > > > have >> > > > > > a >> > > > > > > >> clear >> > > > > > > >> > > > > picture >> > > > > > > >> > > > > > > of why the proposal uses a metadata epoch per >> > > > partition >> > > > > > rather >> > > > > > > >> > > than a >> > > > > > > >> > > > > > > global metadata epoch. A metadata epoch per >> > > partition >> > > > > is >> > > > > > > >> kind of >> > > > > > > >> > > > > > > unpleasant-- it's at least 4 extra bytes per >> > > partition >> > > > > > that we >> > > > > > > >> > > have to >> > > > > > > >> > > > > send >> > > > > > > >> > > > > > > over the wire in every full metadata request, >> > which >> > > > > could >> > > > > > > >> become >> > > > > > > >> > > extra >> > > > > > > >> > > > > > > kilobytes on the wire when the number of >> > partitions >> > > > > > becomes >> > > > > > > >> large. >> > > > > > > >> > > > > Plus, >> > > > > > > >> > > > > > > we have to update all the auxillary classes to >> > > include >> > > > > an >> > > > > > > >> epoch. >> > > > > > > >> > > > > > > >> > > > > > > >> > > > > > > We need to have a global metadata epoch anyway >> to >> > > > handle >> > > > > > > >> partition >> > > > > > > >> > > > > > > addition and deletion. For example, if I give >> you >> > > > > > > >> > > > > > > MetadataResponse{part1,epoch 1, part2, epoch 1} >> > and >> > > > > > {part1, >> > > > > > > >> > > epoch1}, >> > > > > > > >> > > > > which >> > > > > > > >> > > > > > > MetadataResponse is newer? You have no way of >> > > > knowing. >> > > > > > It >> > > > > > > >> could >> > > > > > > >> > > be >> > > > > > > >> > > > > that >> > > > > > > >> > > > > > > part2 has just been created, and the response >> > with 2 >> > > > > > > >> partitions is >> > > > > > > >> > > > > newer. >> > > > > > > >> > > > > > > Or it coudl be that part2 has just been >> deleted, >> > and >> > > > > > > >> therefore the >> > > > > > > >> > > > > response >> > > > > > > >> > > > > > > with 1 partition is newer. 
You must have a >> global >> > > > epoch >> > > > > > to >> > > > > > > >> > > > > disambiguate >> > > > > > > >> > > > > > > these two cases. >> > > > > > > >> > > > > > > >> > > > > > > >> > > > > > > Previously, I worked on the Ceph distributed >> > > > filesystem. >> > > > > > > >> Ceph had >> > > > > > > >> > > the >> > > > > > > >> > > > > > > concept of a map of the whole cluster, >> maintained >> > > by a >> > > > > few >> > > > > > > >> servers >> > > > > > > >> > > > > doing >> > > > > > > >> > > > > > > paxos. This map was versioned by a single >> 64-bit >> > > > epoch >> > > > > > number >> > > > > > > >> > > which >> > > > > > > >> > > > > > > increased on every change. It was propagated >> to >> > > > clients >> > > > > > > >> through >> > > > > > > >> > > > > gossip. I >> > > > > > > >> > > > > > > wonder if something similar could work here? >> > > > > > > >> > > > > > > >> > > > > > > >> > > > > > > It seems like the the Kafka MetadataResponse >> > serves >> > > > two >> > > > > > > >> somewhat >> > > > > > > >> > > > > unrelated >> > > > > > > >> > > > > > > purposes. Firstly, it lets clients know what >> > > > partitions >> > > > > > > >> exist in >> > > > > > > >> > > the >> > > > > > > >> > > > > > > system and where they live. Secondly, it lets >> > > clients >> > > > > > know >> > > > > > > >> which >> > > > > > > >> > > nodes >> > > > > > > >> > > > > > > within the partition are in-sync (in the ISR) >> and >> > > > which >> > > > > > node >> > > > > > > >> is the >> > > > > > > >> > > > > leader. >> > > > > > > >> > > > > > > >> > > > > > > >> > > > > > > The first purpose is what you really need a >> > metadata >> > > > > epoch >> > > > > > > >> for, I >> > > > > > > >> > > > > think. >> > > > > > > >> > > > > > > You want to know whether a partition exists or >> > not, >> > > or >> > > > > you >> > > > > > > >> want to >> > > > > > > >> > > know >> > > > > > > >> > > > > > > which nodes you should talk to in order to >> write >> > to >> > > a >> > > > > > given >> > > > > > > >> > > > > partition. A >> > > > > > > >> > > > > > > single metadata epoch for the whole response >> > should >> > > be >> > > > > > > >> adequate >> > > > > > > >> > > here. >> > > > > > > >> > > > > We >> > > > > > > >> > > > > > > should not change the partition assignment >> without >> > > > going >> > > > > > > >> through >> > > > > > > >> > > > > zookeeper >> > > > > > > >> > > > > > > (or a similar system), and this inherently >> > > serializes >> > > > > > updates >> > > > > > > >> into >> > > > > > > >> > > a >> > > > > > > >> > > > > > > numbered stream. Brokers should also stop >> > > responding >> > > > to >> > > > > > > >> requests >> > > > > > > >> > > when >> > > > > > > >> > > > > they >> > > > > > > >> > > > > > > are unable to contact ZK for a certain time >> > period. >> > > > > This >> > > > > > > >> prevents >> > > > > > > >> > > the >> > > > > > > >> > > > > case >> > > > > > > >> > > > > > > where a given partition has been moved off some >> > set >> > > of >> > > > > > nodes, >> > > > > > > >> but a >> > > > > > > >> > > > > client >> > > > > > > >> > > > > > > still ends up talking to those nodes and >> writing >> > > data >> > > > > > there. >> > > > > > > >> > > > > > > >> > > > > > > >> > > > > > > For the second purpose, this is "soft state" >> > anyway. 
>> > > > If >> > > > > > the >> > > > > > > >> client >> > > > > > > >> > > > > thinks >> > > > > > > >> > > > > > > X is the leader but Y is really the leader, the >> > > client >> > > > > > will >> > > > > > > >> talk >> > > > > > > >> > > to X, >> > > > > > > >> > > > > and >> > > > > > > >> > > > > > > X will point out its mistake by sending back a >> > > > > > > >> > > > > NOT_LEADER_FOR_PARTITION. >> > > > > > > >> > > > > > > Then the client can update its metadata again >> and >> > > find >> > > > > > the new >> > > > > > > >> > > leader, >> > > > > > > >> > > > > if >> > > > > > > >> > > > > > > there is one. There is no need for an epoch to >> > > handle >> > > > > > this. >> > > > > > > >> > > > > Similarly, I >> > > > > > > >> > > > > > > can't think of a reason why changing the >> in-sync >> > > > replica >> > > > > > set >> > > > > > > >> needs >> > > > > > > >> > > to >> > > > > > > >> > > > > bump >> > > > > > > >> > > > > > > the epoch. >> > > > > > > >> > > > > > > >> > > > > > > >> > > > > > > best, >> > > > > > > >> > > > > > > Colin >> > > > > > > >> > > > > > > >> > > > > > > >> > > > > > > >> > > > > > > >> > > > > > > On Wed, Jan 24, 2018, at 09:45, Dong Lin wrote: >> > > > > > > >> > > > > > > > Thanks much for reviewing the KIP! >> > > > > > > >> > > > > > > > >> > > > > > > >> > > > > > > > Dong >> > > > > > > >> > > > > > > > >> > > > > > > >> > > > > > > > On Wed, Jan 24, 2018 at 7:10 AM, Guozhang >> Wang < >> > > > > > > >> > > wangg...@gmail.com> >> > > > > > > >> > > > > > > wrote: >> > > > > > > >> > > > > > > > >> > > > > > > >> > > > > > > > > Yeah that makes sense, again I'm just >> making >> > > sure >> > > > we >> > > > > > > >> understand >> > > > > > > >> > > > > all the >> > > > > > > >> > > > > > > > > scenarios and what to expect. >> > > > > > > >> > > > > > > > > >> > > > > > > >> > > > > > > > > I agree that if, more generally speaking, >> say >> > > > users >> > > > > > have >> > > > > > > >> only >> > > > > > > >> > > > > consumed >> > > > > > > >> > > > > > > to >> > > > > > > >> > > > > > > > > offset 8, and then call seek(16) to "jump" >> to >> > a >> > > > > > further >> > > > > > > >> > > position, >> > > > > > > >> > > > > then >> > > > > > > >> > > > > > > she >> > > > > > > >> > > > > > > > > needs to be aware that OORE maybe thrown >> and >> > she >> > > > > > needs to >> > > > > > > >> > > handle >> > > > > > > >> > > > > it or >> > > > > > > >> > > > > > > rely >> > > > > > > >> > > > > > > > > on reset policy which should not surprise >> her. >> > > > > > > >> > > > > > > > > >> > > > > > > >> > > > > > > > > >> > > > > > > >> > > > > > > > > I'm +1 on the KIP. >> > > > > > > >> > > > > > > > > >> > > > > > > >> > > > > > > > > Guozhang >> > > > > > > >> > > > > > > > > >> > > > > > > >> > > > > > > > > >> > > > > > > >> > > > > > > > > On Wed, Jan 24, 2018 at 12:31 AM, Dong Lin >> < >> > > > > > > >> > > lindon...@gmail.com> >> > > > > > > >> > > > > > > wrote: >> > > > > > > >> > > > > > > > > >> > > > > > > >> > > > > > > > > > Yes, in general we can not prevent >> > > > > > > >> OffsetOutOfRangeException >> > > > > > > >> > > if >> > > > > > > >> > > > > user >> > > > > > > >> > > > > > > > > seeks >> > > > > > > >> > > > > > > > > > to a wrong offset. The main goal is to >> > prevent >> > > > > > > >> > > > > > > OffsetOutOfRangeException >> > > > > > > >> > > > > > > > > if >> > > > > > > >> > > > > > > > > > user has done things in the right way, >> e.g. 
>> > > user >> > > > > > should >> > > > > > > >> know >> > > > > > > >> > > that >> > > > > > > >> > > > > > > there >> > > > > > > >> > > > > > > > > is >> > > > > > > >> > > > > > > > > > message with this offset. >> > > > > > > >> > > > > > > > > > >> > > > > > > >> > > > > > > > > > For example, if user calls seek(..) right >> > > after >> > > > > > > >> > > construction, the >> > > > > > > >> > > > > > > only >> > > > > > > >> > > > > > > > > > reason I can think of is that user stores >> > > offset >> > > > > > > >> externally. >> > > > > > > >> > > In >> > > > > > > >> > > > > this >> > > > > > > >> > > > > > > > > case, >> > > > > > > >> > > > > > > > > > user currently needs to use the offset >> which >> > > is >> > > > > > obtained >> > > > > > > >> > > using >> > > > > > > >> > > > > > > > > position(..) >> > > > > > > >> > > > > > > > > > from the last run. With this KIP, user >> needs >> > > to >> > > > > get >> > > > > > the >> > > > > > > >> > > offset >> > > > > > > >> > > > > and >> > > > > > > >> > > > > > > the >> > > > > > > >> > > > > > > > > > offsetEpoch using >> > positionAndOffsetEpoch(...) >> > > > and >> > > > > > stores >> > > > > > > >> > > these >> > > > > > > >> > > > > > > > > information >> > > > > > > >> > > > > > > > > > externally. The next time user starts >> > > consumer, >> > > > > > he/she >> > > > > > > >> needs >> > > > > > > >> > > to >> > > > > > > >> > > > > call >> > > > > > > >> > > > > > > > > > seek(..., offset, offsetEpoch) right >> after >> > > > > > construction. >> > > > > > > >> > > Then KIP >> > > > > > > >> > > > > > > should >> > > > > > > >> > > > > > > > > be >> > > > > > > >> > > > > > > > > > able to ensure that we don't throw >> > > > > > > >> OffsetOutOfRangeException >> > > > > > > >> > > if >> > > > > > > >> > > > > > > there is >> > > > > > > >> > > > > > > > > no >> > > > > > > >> > > > > > > > > > unclean leader election. >> > > > > > > >> > > > > > > > > > >> > > > > > > >> > > > > > > > > > Does this sound OK? >> > > > > > > >> > > > > > > > > > >> > > > > > > >> > > > > > > > > > Regards, >> > > > > > > >> > > > > > > > > > Dong >> > > > > > > >> > > > > > > > > > >> > > > > > > >> > > > > > > > > > >> > > > > > > >> > > > > > > > > > On Tue, Jan 23, 2018 at 11:44 PM, >> Guozhang >> > > Wang >> > > > < >> > > > > > > >> > > > > wangg...@gmail.com> >> > > > > > > >> > > > > > > > > > wrote: >> > > > > > > >> > > > > > > > > > >> > > > > > > >> > > > > > > > > > > "If consumer wants to consume message >> with >> > > > > offset >> > > > > > 16, >> > > > > > > >> then >> > > > > > > >> > > > > consumer >> > > > > > > >> > > > > > > > > must >> > > > > > > >> > > > > > > > > > > have >> > > > > > > >> > > > > > > > > > > already fetched message with offset 15" >> > > > > > > >> > > > > > > > > > > >> > > > > > > >> > > > > > > > > > > --> this may not be always true right? >> > What >> > > if >> > > > > > > >> consumer >> > > > > > > >> > > just >> > > > > > > >> > > > > call >> > > > > > > >> > > > > > > > > > seek(16) >> > > > > > > >> > > > > > > > > > > after construction and then poll >> without >> > > > > committed >> > > > > > > >> offset >> > > > > > > >> > > ever >> > > > > > > >> > > > > > > stored >> > > > > > > >> > > > > > > > > > > before? Admittedly it is rare but we do >> > not >> > > > > > > >> programmably >> > > > > > > >> > > > > disallow >> > > > > > > >> > > > > > > it. 
>> > > > > > > >> > > > > > > > > > > >> > > > > > > >> > > > > > > > > > > >> > > > > > > >> > > > > > > > > > > Guozhang >> > > > > > > >> > > > > > > > > > > >> > > > > > > >> > > > > > > > > > > On Tue, Jan 23, 2018 at 10:42 PM, Dong >> > Lin < >> > > > > > > >> > > > > lindon...@gmail.com> >> > > > > > > >> > > > > > > > > wrote: >> > > > > > > >> > > > > > > > > > > >> > > > > > > >> > > > > > > > > > > > Hey Guozhang, >> > > > > > > >> > > > > > > > > > > > >> > > > > > > >> > > > > > > > > > > > Thanks much for reviewing the KIP! >> > > > > > > >> > > > > > > > > > > > >> > > > > > > >> > > > > > > > > > > > In the scenario you described, let's >> > > assume >> > > > > that >> > > > > > > >> broker >> > > > > > > >> > > A has >> > > > > > > >> > > > > > > > > messages >> > > > > > > >> > > > > > > > > > > with >> > > > > > > >> > > > > > > > > > > > offset up to 10, and broker B has >> > messages >> > > > > with >> > > > > > > >> offset >> > > > > > > >> > > up to >> > > > > > > >> > > > > 20. >> > > > > > > >> > > > > > > If >> > > > > > > >> > > > > > > > > > > > consumer wants to consume message >> with >> > > > offset >> > > > > > 9, it >> > > > > > > >> will >> > > > > > > >> > > not >> > > > > > > >> > > > > > > receive >> > > > > > > >> > > > > > > > > > > > OffsetOutOfRangeException >> > > > > > > >> > > > > > > > > > > > from broker A. >> > > > > > > >> > > > > > > > > > > > >> > > > > > > >> > > > > > > > > > > > If consumer wants to consume message >> > with >> > > > > offset >> > > > > > > >> 16, then >> > > > > > > >> > > > > > > consumer >> > > > > > > >> > > > > > > > > must >> > > > > > > >> > > > > > > > > > > > have already fetched message with >> offset >> > > 15, >> > > > > > which >> > > > > > > >> can >> > > > > > > >> > > only >> > > > > > > >> > > > > come >> > > > > > > >> > > > > > > from >> > > > > > > >> > > > > > > > > > > > broker B. Because consumer will fetch >> > from >> > > > > > broker B >> > > > > > > >> only >> > > > > > > >> > > if >> > > > > > > >> > > > > > > > > leaderEpoch >> > > > > > > >> > > > > > > > > > > >= >> > > > > > > >> > > > > > > > > > > > 2, then the current consumer >> leaderEpoch >> > > can >> > > > > > not be >> > > > > > > >> 1 >> > > > > > > >> > > since >> > > > > > > >> > > > > this >> > > > > > > >> > > > > > > KIP >> > > > > > > >> > > > > > > > > > > > prevents leaderEpoch rewind. Thus we >> > will >> > > > not >> > > > > > have >> > > > > > > >> > > > > > > > > > > > OffsetOutOfRangeException >> > > > > > > >> > > > > > > > > > > > in this case. >> > > > > > > >> > > > > > > > > > > > >> > > > > > > >> > > > > > > > > > > > Does this address your question, or >> > maybe >> > > > > there >> > > > > > is >> > > > > > > >> more >> > > > > > > >> > > > > advanced >> > > > > > > >> > > > > > > > > > scenario >> > > > > > > >> > > > > > > > > > > > that the KIP does not handle? >> > > > > > > >> > > > > > > > > > > > >> > > > > > > >> > > > > > > > > > > > Thanks, >> > > > > > > >> > > > > > > > > > > > Dong >> > > > > > > >> > > > > > > > > > > > >> > > > > > > >> > > > > > > > > > > > On Tue, Jan 23, 2018 at 9:43 PM, >> > Guozhang >> > > > > Wang < >> > > > > > > >> > > > > > > wangg...@gmail.com> >> > > > > > > >> > > > > > > > > > > wrote: >> > > > > > > >> > > > > > > > > > > > >> > > > > > > >> > > > > > > > > > > > > Thanks Dong, I made a pass over the >> > wiki >> > > > and >> > > > > > it >> > > > > > > >> lgtm. 
>> > > > > > > >> > > > > > > > > > > > > >> > > > > > > >> > > > > > > > > > > > > Just a quick question: can we >> > completely >> > > > > > > >> eliminate the >> > > > > > > >> > > > > > > > > > > > > OffsetOutOfRangeException with this >> > > > > approach? >> > > > > > Say >> > > > > > > >> if >> > > > > > > >> > > there >> > > > > > > >> > > > > is >> > > > > > > >> > > > > > > > > > > consecutive >> > > > > > > >> > > > > > > > > > > > > leader changes such that the cached >> > > > > metadata's >> > > > > > > >> > > partition >> > > > > > > >> > > > > epoch >> > > > > > > >> > > > > > > is >> > > > > > > >> > > > > > > > > 1, >> > > > > > > >> > > > > > > > > > > and >> > > > > > > >> > > > > > > > > > > > > the metadata fetch response returns >> > > with >> > > > > > > >> partition >> > > > > > > >> > > epoch 2 >> > > > > > > >> > > > > > > > > pointing >> > > > > > > >> > > > > > > > > > to >> > > > > > > >> > > > > > > > > > > > > leader broker A, while the actual >> > > > up-to-date >> > > > > > > >> metadata >> > > > > > > >> > > has >> > > > > > > >> > > > > > > partition >> > > > > > > >> > > > > > > > > > > > epoch 3 >> > > > > > > >> > > > > > > > > > > > > whose leader is now broker B, the >> > > metadata >> > > > > > > >> refresh will >> > > > > > > >> > > > > still >> > > > > > > >> > > > > > > > > succeed >> > > > > > > >> > > > > > > > > > > and >> > > > > > > >> > > > > > > > > > > > > the follow-up fetch request may >> still >> > > see >> > > > > > OORE? >> > > > > > > >> > > > > > > > > > > > > >> > > > > > > >> > > > > > > > > > > > > >> > > > > > > >> > > > > > > > > > > > > Guozhang >> > > > > > > >> > > > > > > > > > > > > >> > > > > > > >> > > > > > > > > > > > > >> > > > > > > >> > > > > > > > > > > > > On Tue, Jan 23, 2018 at 3:47 PM, >> Dong >> > > Lin >> > > > < >> > > > > > > >> > > > > lindon...@gmail.com >> > > > > > > >> > > > > > > > >> > > > > > > >> > > > > > > > > > wrote: >> > > > > > > >> > > > > > > > > > > > > >> > > > > > > >> > > > > > > > > > > > > > Hi all, >> > > > > > > >> > > > > > > > > > > > > > >> > > > > > > >> > > > > > > > > > > > > > I would like to start the voting >> > > process >> > > > > for >> > > > > > > >> KIP-232: >> > > > > > > >> > > > > > > > > > > > > > >> > > > > > > >> > > > > > > > > > > > > > https://cwiki.apache.org/ >> > > > > > > >> > > confluence/display/KAFKA/KIP- >> > > > > > > >> > > > > > > > > > > > > > 232%3A+Detect+outdated+metadat >> > > > > > > >> a+using+leaderEpoch+ >> > > > > > > >> > > > > > > > > > and+partitionEpoch >> > > > > > > >> > > > > > > > > > > > > > >> > > > > > > >> > > > > > > > > > > > > > The KIP will help fix a >> concurrency >> > > > issue >> > > > > in >> > > > > > > >> Kafka >> > > > > > > >> > > which >> > > > > > > >> > > > > > > > > currently >> > > > > > > >> > > > > > > > > > > can >> > > > > > > >> > > > > > > > > > > > > > cause message loss or message >> > > > duplication >> > > > > in >> > > > > > > >> > > consumer. 
>> > > > > > > > > > > > > > Regards, >> > > > > > > > > > > > > > Dong
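To make the epoch checks discussed throughout the thread concrete: KIP-232 has the client remember a leaderEpoch and partitionEpoch alongside offsets and metadata, reject metadata older than what it has already seen, and detect a deleted-and-re-created topic by a changed partitionEpoch. A minimal client-side sketch with hypothetical names (the KIP's actual interfaces and exception types may differ):

```java
import java.util.HashMap;
import java.util.Map;

/**
 * Minimal sketch of the client-side epoch checks discussed in the thread.
 * Names are illustrative; they are not the KIP's actual API.
 */
public class EpochTracker {
    private static final class Epochs {
        int partitionEpoch;
        int leaderEpoch;
        Epochs(int partitionEpoch, int leaderEpoch) {
            this.partitionEpoch = partitionEpoch;
            this.leaderEpoch = leaderEpoch;
        }
    }

    // Last epochs observed per partition, keyed by "topic-partition".
    private final Map<String, Epochs> lastSeen = new HashMap<>();

    /**
     * Returns true if freshly fetched metadata for the partition may be used.
     * A changed partitionEpoch means the partition was deleted and re-created,
     * so previously stored offsets no longer refer to the same log; a lower
     * leaderEpoch means the metadata is older than what the client already
     * acted on and should be refreshed rather than used.
     */
    public boolean accept(String topicPartition, int partitionEpoch, int leaderEpoch) {
        Epochs prev = lastSeen.get(topicPartition);
        if (prev == null) {
            lastSeen.put(topicPartition, new Epochs(partitionEpoch, leaderEpoch));
            return true;
        }
        if (prev.partitionEpoch != partitionEpoch) {
            // Same topic-partition name, but a different incarnation of it.
            throw new IllegalStateException("Partition " + topicPartition
                    + " was re-created; stored offsets and epochs are invalid");
        }
        if (leaderEpoch < prev.leaderEpoch) {
            return false; // stale metadata; retry the metadata fetch
        }
        prev.leaderEpoch = leaderEpoch;
        return true;
    }
}
```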