On Fri, Jan 26, 2018, at 12:17, Dong Lin wrote:
> Hey Colin,
> 
> 
> On Fri, Jan 26, 2018 at 10:16 AM, Colin McCabe <cmcc...@apache.org> wrote:
> 
> > On Thu, Jan 25, 2018, at 16:47, Dong Lin wrote:
> > > Hey Colin,
> > >
> > > Thanks for the comment.
> > >
> > > On Thu, Jan 25, 2018 at 4:15 PM, Colin McCabe <cmcc...@apache.org> wrote:
> > >
> > > > On Wed, Jan 24, 2018, at 21:07, Dong Lin wrote:
> > > > > Hey Colin,
> > > > >
> > > > > Thanks for reviewing the KIP.
> > > > >
> > > > > If I understand you right, you may be suggesting that we can use a
> > > > > global metadataEpoch that is incremented every time the controller
> > > > > updates metadata. The problem with this solution is that, if a topic
> > > > > is deleted and created again, the user will not know whether the
> > > > > offset stored before the topic deletion is still valid. This
> > > > > motivates the idea to include a per-partition partitionEpoch. Does
> > > > > this sound reasonable?
> > > >
> > > > Hi Dong,
> > > >
> > > > Perhaps we can store the last valid offset of each deleted topic in
> > > > ZooKeeper.  Then, when a topic with one of those names gets re-created,
> > > > we can start the topic at the previous end offset rather than at 0.
> > > > This preserves immutability.  It is no more burdensome than having to
> > > > preserve a "last epoch" for the deleted partition somewhere, right?
> > > >
> > >
> > > My concern with this solution is that the number of ZooKeeper nodes
> > > grows over time if some users keep deleting and creating topics. Do you
> > > think this can be a problem?
> >
> > Hi Dong,
> >
> > We could expire the "partition tombstones" after an hour or so.  In
> > practice this would solve the issue for clients that like to destroy and
> > re-create topics all the time.  In any case, doesn't the current proposal
> > add per-partition znodes as well that we have to track even after the
> > partition is deleted?  Or did I misunderstand that?
> >
> 
> Actually the current KIP does not add per-partition znodes. Could you
> double check? I can fix the KIP wiki if there is anything misleading.

Hi Dong,

I double-checked the KIP, and I can see that you are in fact using a global 
counter for initializing partition epochs.  So, you are correct, it doesn't add 
per-partition znodes for partitions that no longer exist.

> 
> If we expire the "partition tombstones" after an hour, and the topic is
> re-created more than an hour after the topic deletion, then we are back to
> the situation where the user cannot tell whether the topic has been
> re-created or not, right?

Yes, with an expiration period, it would not ensure immutability-- you could 
effectively reuse partition names and they would look the same.

> 
> 
> >
> > It's not really clear to me what should happen when a topic is destroyed
> > and re-created with new data.  Should consumers continue to be able to
> > consume?  We don't know where they stopped consuming from the previous
> > incarnation of the topic, so messages may have been lost.  Certainly
> > consuming data from offset X of the new incarnation of the topic may give
> > something totally different from what you would have gotten from offset X
> > of the previous incarnation of the topic.
> >
> 
> With the current KIP, if a consumer consumes a topic based on the last
> remembered (offset, partitionEpoch, leaderEpoch), and the topic has been
> re-created, the consumer will throw InvalidPartitionEpochException because
> the previous partitionEpoch will differ from the current partitionEpoch.
> This is described in "Proposed Changes -> Consumption after topic deletion"
> in the KIP. I can improve the KIP if anything is unclear.

Thanks for the clarification.  It sounds like what you really want is 
immutability-- i.e., to never "really" reuse partition identifiers.  And you do 
this by making the partition name no longer the "real" identifier.

My big concern about this KIP is that it seems like an anti-scalability 
feature.  Now we are adding 4 extra bytes for every partition in the 
FetchRequest and FetchResponse, for example.  That could be 40 KB per request 
if the user has 10,000 partitions.  And of course, the KIP also makes massive 
changes to UpdateMetadataRequest, MetadataResponse, OffsetCommitRequest, 
OffsetFetchResponse, LeaderAndIsrRequest, ListOffsetResponse, etc., which will 
also increase their size on the wire and in memory.

One thing that we talked a lot about in the past is replacing partition names 
with IDs.  IDs have a lot of really nice features.  They take up much less 
space in memory than strings (especially 2-byte Java strings).  They can often 
be allocated on the stack rather than the heap (important when you are dealing 
with hundreds of thousands of them).  They can be efficiently deserialized and 
serialized.  If we use 64-bit ones, we will never run out of IDs, which means 
that they can always be unique per partition.

Given that the partition name is no longer the "real" identifier for partitions 
in the current KIP-232 proposal, why not just move to using partition IDs 
entirely instead of strings?  You have to change all the messages anyway.  
There isn't much point any more to carrying around the partition name in every 
RPC, since you really need (name, epoch) to identify the partition.
Probably the metadata response and a few other messages would still have to 
carry the partition name, to allow clients to go from name to ID.  But we 
could mostly forget about the strings.  And then this would be a scalability 
improvement rather than a scalability problem.
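
To make that concrete, here is a rough sketch of what a compact partition
identifier could look like (the class name and layout are hypothetical, not
something the KIP specifies):

    // Hypothetical sketch: a 64-bit partition ID as the "real" identifier.
    // Unlike a String, it is a fixed-size value with cheap equality and
    // hashing, and it serializes to exactly 8 bytes on the wire.
    public final class PartitionId {
        private final long id;   // globally unique, never reused

        public PartitionId(long id) { this.id = id; }

        public long value() { return id; }

        @Override
        public boolean equals(Object o) {
            return o instanceof PartitionId && ((PartitionId) o).id == id;
        }

        @Override
        public int hashCode() { return Long.hashCode(id); }
    }

A fetch request could then carry a fixed 8 bytes per partition instead of a
variable-length topic string plus a partition index plus an epoch.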

> 
> 
> > By choosing to reuse the same (topic, partition, offset) 3-tuple, we have
> > chosen to give up immutability.  That was a really bad decision.  And now
> > we have to worry about time dependencies, stale cached data, and all the
> > rest.  We can't completely fix this inside Kafka no matter what we do,
> > because not all that cached data is inside Kafka itself.  Some of it may be
> > in systems that Kafka has sent data to, such as other daemons, SQL
> > databases, streams, and so forth.
> >
> 
> The current KIP will uniquely identify a message using (topic, partition,
> offset, partitionEpoch) 4-tuple. This addresses the message immutability
> issue that you mentioned. Is there any corner case where the message
> immutability is still not preserved with the current KIP?
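> 
> To make the identity concrete, a minimal sketch (hypothetical names, not
> the KIP's actual classes): with a Java record, equality covers all four
> components, so two messages at the same (topic, partition, offset) still
> compare as different if the partition was re-created under a new
> partitionEpoch.
> 
>     // Hypothetical sketch: a message from a re-created partition never
>     // aliases a message from the old incarnation.
>     record MessageId(String topic, int partition, long offset,
>                      int partitionEpoch) {}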
> 
> 
> >
> > I guess the idea here is that mirror maker should work as expected when
> > users destroy a topic and re-create it with the same name.  That's kind of
> > tough, though, since in that scenario, mirror maker probably should destroy
> > and re-create the topic on the other end, too, right?  Otherwise, what you
> > end up with on the other end could be half of one incarnation of the topic,
> > and half of another.
> >
> > What mirror maker really needs is to be able to follow a stream of events
> > about the kafka cluster itself.  We could have some master topic which is
> > always present and which contains data about all topic deletions,
> > creations, etc.  Then MM can simply follow this topic and do what is needed.
> >
> > >
> > >
> > > >
> > > > >
> > > > > Then the next question may be: should we use a global metadataEpoch
> > > > > + per-partition partitionEpoch, instead of per-partition leaderEpoch
> > > > > + per-partition partitionEpoch? The former solution using
> > > > > metadataEpoch would not work due to the following scenario (provided
> > > > > by Jun):
> > > > >
> > > > > "Consider the following scenario. In metadata v1, the leader for a
> > > > > partition is at broker 1. In metadata v2, leader is at broker 2. In
> > > > > metadata v3, leader is at broker 1 again. The last committed offset
> > in
> > > > v1,
> > > > > v2 and v3 are 10, 20 and 30, respectively. A consumer is started and
> > > > reads
> > > > > metadata v1 and reads messages from offset 0 to 25 from broker 1. My
> > > > > understanding is that in the current proposal, the metadata version
> > > > > associated with offset 25 is v1. The consumer is then restarted and
> > > > fetches
> > > > > metadata v2. The consumer tries to read from broker 2, which is the
> > old
> > > > > leader with the last offset at 20. In this case, the consumer will
> > still
> > > > > get OffsetOutOfRangeException incorrectly."
> > > > >
> > > > > Regarding your comment "For the second purpose, this is "soft
> > > > > state" anyway.  If the client thinks X is the leader but Y is really
> > > > > the leader, the client will talk to X, and X will point out its
> > > > > mistake by sending back a NOT_LEADER_FOR_PARTITION.", it is probably
> > > > > not true. The problem here is that the old leader X may still think
> > > > > it is the leader of the partition and thus will not send back
> > > > > NOT_LEADER_FOR_PARTITION. The reason is provided in KAFKA-6262. Can
> > > > > you check if that makes sense?
> > > >
> > > > This is solvable with a timeout, right?  If the leader can't
> > > > communicate with the controller for a certain period of time, it
> > > > should stop acting as the leader.  We have to solve this problem
> > > > anyway, in order to fix all the corner cases.
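> > > > 
> > > > (A minimal sketch of the idea, with made-up names -- not a concrete
> > > > proposal:)
> > > > 
> > > >     // Hypothetical: a leader refuses to act if it has not heard
> > > >     // from the controller within a fencing timeout.
> > > >     class LeaderFence {
> > > >         private volatile long lastControllerContactMs;
> > > >         private final long fencingTimeoutMs;
> > > > 
> > > >         LeaderFence(long fencingTimeoutMs) {
> > > >             this.fencingTimeoutMs = fencingTimeoutMs;
> > > >         }
> > > > 
> > > >         void onControllerContact(long nowMs) {
> > > >             lastControllerContactMs = nowMs;
> > > >         }
> > > > 
> > > >         boolean mayActAsLeader(long nowMs) {
> > > >             return nowMs - lastControllerContactMs < fencingTimeoutMs;
> > > >         }
> > > >     }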
> > > >
> > >
> > > Not sure if I fully understand your proposal. The proposal seems to
> > > require non-trivial changes to our existing leadership election
> > > mechanism. Could you provide more detail on how it works? For example,
> > > how should the user choose this timeout, how does the leader determine
> > > whether it can still communicate with the controller, and how does this
> > > trigger the controller to elect a new leader?
> >
> > Before I come up with any proposal, let me make sure I understand the
> > problem correctly.  My big question was, what prevents split-brain here?
> >
> > Let's say I have a partition which is on nodes A, B, and C, with min-ISR
> > 2.  The controller is D.  At some point, there is a network partition
> > between A and B and the rest of the cluster.  The Controller re-assigns the
> > partition to nodes C, D, and E.  But A and B keep chugging away, even
> > though they can no longer communicate with the controller.
> >
> > At some point, a client with stale metadata writes to the partition.  It
> > still thinks the partition is on node A, B, and C, so that's where it sends
> > the data.  It's unable to talk to C, but A and B reply back that all is
> > well.
> >
> > Is this not a case where we could lose data due to split brain?  Or is
> > there a mechanism for preventing this that I missed?  If it is, it seems
> > like a pretty serious failure case that we should be handling with our
> > metadata rework.  And I think epoch numbers and timeouts might be part of
> > the solution.
> >
> 
> Right, split brain can happen if RF=4 and minIsr=2. However, I am not sure
> it is a serious issue that we need to address today. This can be prevented
> by configuring the Kafka topic so that minIsr > RF/2. Actually, if the user
> sets minIsr=2, is there any reason the user would want to set RF=4 instead
> of 3?
> 
> Introducing a timeout in the leader election mechanism is non-trivial. I
> think we probably want to do that only if there is a good use case that can
> not otherwise be addressed with the current mechanism.

I still would like to think about these corner cases more.  But perhaps it's 
not directly related to this KIP.

regards,
Colin


> 
> 
> > best,
> > Colin
> >
> >
> > >
> > >
> > > > best,
> > > > Colin
> > > >
> > > > >
> > > > > Regards,
> > > > > Dong
> > > > >
> > > > >
> > > > > On Wed, Jan 24, 2018 at 10:39 AM, Colin McCabe <cmcc...@apache.org> wrote:
> > > > >
> > > > > > Hi Dong,
> > > > > >
> > > > > > Thanks for proposing this KIP.  I think a metadata epoch is a
> > > > > > really good idea.
> > > > > >
> > > > > > I read through the DISCUSS thread, but I still don't have a clear
> > > > > > picture of why the proposal uses a metadata epoch per partition
> > > > > > rather than a global metadata epoch.  A metadata epoch per
> > > > > > partition is kind of unpleasant-- it's at least 4 extra bytes per
> > > > > > partition that we have to send over the wire in every full
> > > > > > metadata request, which could become extra kilobytes on the wire
> > > > > > when the number of partitions becomes large.  Plus, we have to
> > > > > > update all the auxiliary classes to include an epoch.
> > > > > >
> > > > > > We need to have a global metadata epoch anyway to handle partition
> > > > > > addition and deletion.  For example, if I give you
> > > > > > MetadataResponse{part1, epoch 1, part2, epoch 1} and {part1,
> > > > > > epoch 1}, which MetadataResponse is newer?  You have no way of
> > > > > > knowing.  It could be that part2 has just been created, and the
> > > > > > response with 2 partitions is newer.  Or it could be that part2
> > > > > > has just been deleted, and therefore the response with 1 partition
> > > > > > is newer.  You must have a global epoch to disambiguate these two
> > > > > > cases.
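> > > > > > 
> > > > > > (A sketch of the client-side rule; MetadataSnapshot and
> > > > > > globalEpoch are made-up names:)
> > > > > > 
> > > > > >     // With one global epoch, staleness is a single comparison:
> > > > > >     // the client keeps whichever snapshot carries the higher
> > > > > >     // epoch, even if that snapshot lists fewer partitions.
> > > > > >     void maybeUpdate(MetadataSnapshot incoming) {
> > > > > >         if (incoming.globalEpoch() > current.globalEpoch()) {
> > > > > >             current = incoming;
> > > > > >         }
> > > > > >     }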
> > > > > >
> > > > > > Previously, I worked on the Ceph distributed filesystem.  Ceph had
> > > > > > the concept of a map of the whole cluster, maintained by a few
> > > > > > servers doing Paxos.  This map was versioned by a single 64-bit
> > > > > > epoch number which increased on every change.  It was propagated to
> > > > > > clients through gossip.  I wonder if something similar could work
> > > > > > here?
> > > > > >
> > > > > > It seems like the Kafka MetadataResponse serves two somewhat
> > > > > > unrelated purposes.  Firstly, it lets clients know what partitions
> > > > > > exist in the system and where they live.  Secondly, it lets clients
> > > > > > know which nodes within the partition are in-sync (in the ISR) and
> > > > > > which node is the leader.
> > > > > >
> > > > > > The first purpose is what you really need a metadata epoch for, I
> > > > > > think.  You want to know whether a partition exists or not, or you
> > > > > > want to know which nodes you should talk to in order to write to a
> > > > > > given partition.  A single metadata epoch for the whole response
> > > > > > should be adequate here.  We should not change the partition
> > > > > > assignment without going through ZooKeeper (or a similar system),
> > > > > > and this inherently serializes updates into a numbered stream.
> > > > > > Brokers should also stop responding to requests when they are
> > > > > > unable to contact ZK for a certain time period.  This prevents the
> > > > > > case where a given partition has been moved off some set of nodes,
> > > > > > but a client still ends up talking to those nodes and writing data
> > > > > > there.
> > > > > >
> > > > > > For the second purpose, this is "soft state" anyway.  If the
> > > > > > client thinks X is the leader but Y is really the leader, the
> > > > > > client will talk to X, and X will point out its mistake by sending
> > > > > > back a NOT_LEADER_FOR_PARTITION.  Then the client can update its
> > > > > > metadata again and find the new leader, if there is one.  There is
> > > > > > no need for an epoch to handle this.  Similarly, I can't think of a
> > > > > > reason why changing the in-sync replica set needs to bump the
> > > > > > epoch.
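> > > > > > 
> > > > > > (Roughly, the client-side loop; the helper names here are
> > > > > > hypothetical:)
> > > > > > 
> > > > > >     // The client self-corrects on stale leadership without any
> > > > > >     // epoch: a wrong guess is answered with an error, after
> > > > > >     // which the client refreshes its metadata and retries.
> > > > > >     while (true) {
> > > > > >         Node leader = metadata.leaderFor(partition);
> > > > > >         FetchResponse r = send(leader, fetchRequest);
> > > > > >         if (r.error() != Errors.NOT_LEADER_FOR_PARTITION) break;
> > > > > >         metadata.requestUpdate();   // stale view; refresh, retry
> > > > > >     }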
> > > > > >
> > > > > > best,
> > > > > > Colin
> > > > > >
> > > > > >
> > > > > > On Wed, Jan 24, 2018, at 09:45, Dong Lin wrote:
> > > > > > > Thanks much for reviewing the KIP!
> > > > > > >
> > > > > > > Dong
> > > > > > >
> > > > > > > On Wed, Jan 24, 2018 at 7:10 AM, Guozhang Wang <wangg...@gmail.com> wrote:
> > > > > > >
> > > > > > > > Yeah that makes sense, again I'm just making sure we
> > > > > > > > understand all the scenarios and what to expect.
> > > > > > > > 
> > > > > > > > I agree that if, more generally speaking, a user has only
> > > > > > > > consumed to offset 8 and then calls seek(16) to "jump" to a
> > > > > > > > further position, then she needs to be aware that OORE may be
> > > > > > > > thrown, and she needs to handle it or rely on the reset policy,
> > > > > > > > which should not surprise her.
> > > > > > > >
> > > > > > > >
> > > > > > > > I'm +1 on the KIP.
> > > > > > > >
> > > > > > > > Guozhang
> > > > > > > >
> > > > > > > >
> > > > > > > > On Wed, Jan 24, 2018 at 12:31 AM, Dong Lin <lindon...@gmail.com> wrote:
> > > > > > > >
> > > > > > > > > Yes, in general we can not prevent
> > > > > > > > > OffsetOutOfRangeException if the user seeks to a wrong
> > > > > > > > > offset. The main goal is to prevent
> > > > > > > > > OffsetOutOfRangeException if the user has done things the
> > > > > > > > > right way, e.g. the user should know that there is a
> > > > > > > > > message with this offset.
> > > > > > > > >
> > > > > > > > > For example, if the user calls seek(..) right after
> > > > > > > > > construction, the only reason I can think of is that the
> > > > > > > > > user stores offsets externally. In this case, the user
> > > > > > > > > currently needs to use the offset obtained using
> > > > > > > > > position(..) from the last run. With this KIP, the user
> > > > > > > > > needs to get the offset and the offsetEpoch using
> > > > > > > > > positionAndOffsetEpoch(...) and store this information
> > > > > > > > > externally. The next time the user starts the consumer,
> > > > > > > > > he/she needs to call seek(..., offset, offsetEpoch) right
> > > > > > > > > after construction. Then the KIP should be able to ensure
> > > > > > > > > that we don't throw OffsetOutOfRangeException if there is
> > > > > > > > > no unclean leader election.
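> > > > > > > > > 
> > > > > > > > > (Sketched usage; the store helper is made up, and the
> > > > > > > > > method names are the KIP's proposed API as described
> > > > > > > > > above:)
> > > > > > > > > 
> > > > > > > > >     // First run: save the position together with its
> > > > > > > > >     // epoch in external (hypothetical) storage.
> > > > > > > > >     PositionAndOffsetEpoch p =
> > > > > > > > >         consumer.positionAndOffsetEpoch(tp);
> > > > > > > > >     store.save(tp, p.offset(), p.offsetEpoch());
> > > > > > > > > 
> > > > > > > > >     // Next run: restore both right after construction, so
> > > > > > > > >     // a re-created or stale partition is detected instead
> > > > > > > > >     // of causing OffsetOutOfRangeException.
> > > > > > > > >     consumer.seek(tp, store.offset(tp), store.offsetEpoch(tp));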
> > > > > > > > >
> > > > > > > > > Does this sound OK?
> > > > > > > > >
> > > > > > > > > Regards,
> > > > > > > > > Dong
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > On Tue, Jan 23, 2018 at 11:44 PM, Guozhang Wang <wangg...@gmail.com> wrote:
> > > > > > > > >
> > > > > > > > > > "If consumer wants to consume message with offset 16, then
> > > > consumer
> > > > > > > > must
> > > > > > > > > > have
> > > > > > > > > > already fetched message with offset 15"
> > > > > > > > > >
> > > > > > > > > > --> this may not be always true right? What if consumer
> > just
> > > > call
> > > > > > > > > seek(16)
> > > > > > > > > > after construction and then poll without committed offset
> > ever
> > > > > > stored
> > > > > > > > > > before? Admittedly it is rare but we do not programmably
> > > > disallow
> > > > > > it.
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > Guozhang
> > > > > > > > > >
> > > > > > > > > > On Tue, Jan 23, 2018 at 10:42 PM, Dong Lin <lindon...@gmail.com> wrote:
> > > > > > > > > >
> > > > > > > > > > > Hey Guozhang,
> > > > > > > > > > >
> > > > > > > > > > > Thanks much for reviewing the KIP!
> > > > > > > > > > >
> > > > > > > > > > > In the scenario you described, let's assume that broker
> > > > > > > > > > > A has messages with offsets up to 10, and broker B has
> > > > > > > > > > > messages with offsets up to 20. If the consumer wants to
> > > > > > > > > > > consume the message with offset 9, it will not receive
> > > > > > > > > > > OffsetOutOfRangeException from broker A.
> > > > > > > > > > >
> > > > > > > > > > > If the consumer wants to consume the message with
> > > > > > > > > > > offset 16, then the consumer must have already fetched
> > > > > > > > > > > the message with offset 15, which can only come from
> > > > > > > > > > > broker B. Because the consumer will fetch from broker B
> > > > > > > > > > > only if leaderEpoch >= 2, the current consumer
> > > > > > > > > > > leaderEpoch can not be 1, since this KIP prevents
> > > > > > > > > > > leaderEpoch rewind. Thus we will not have
> > > > > > > > > > > OffsetOutOfRangeException in this case.
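> > > > > > > > > > > 
> > > > > > > > > > > (Roughly, the consumer-side check -- the names here are
> > > > > > > > > > > hypothetical:)
> > > > > > > > > > > 
> > > > > > > > > > >     // The consumer only trusts metadata whose
> > > > > > > > > > >     // leaderEpoch is at least as new as its cached
> > > > > > > > > > >     // value, so it never goes back to stale leader A
> > > > > > > > > > >     // (epoch 1) after fetching offset 15 from B
> > > > > > > > > > >     // (epoch 2).
> > > > > > > > > > >     if (response.leaderEpoch() < cachedLeaderEpoch) {
> > > > > > > > > > >         metadata.requestUpdate();   // ignore stale view
> > > > > > > > > > >     }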
> > > > > > > > > > >
> > > > > > > > > > > Does this address your question, or is there maybe a
> > > > > > > > > > > more advanced scenario that the KIP does not handle?
> > > > > > > > > > >
> > > > > > > > > > > Thanks,
> > > > > > > > > > > Dong
> > > > > > > > > > >
> > > > > > > > > > > On Tue, Jan 23, 2018 at 9:43 PM, Guozhang Wang <wangg...@gmail.com> wrote:
> > > > > > > > > > >
> > > > > > > > > > > > Thanks Dong, I made a pass over the wiki and it lgtm.
> > > > > > > > > > > >
> > > > > > > > > > > > Just a quick question: can we completely eliminate
> > > > > > > > > > > > OffsetOutOfRangeException with this approach? Say
> > > > > > > > > > > > there are consecutive leader changes such that the
> > > > > > > > > > > > cached metadata's partition epoch is 1, and the
> > > > > > > > > > > > metadata fetch response returns with partition epoch
> > > > > > > > > > > > 2 pointing to leader broker A, while the actual
> > > > > > > > > > > > up-to-date metadata has partition epoch 3 whose
> > > > > > > > > > > > leader is now broker B. The metadata refresh will
> > > > > > > > > > > > still succeed, and the follow-up fetch request may
> > > > > > > > > > > > still see OORE?
> > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > > Guozhang
> > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > > On Tue, Jan 23, 2018 at 3:47 PM, Dong Lin <lindon...@gmail.com> wrote:
> > > > > > > > > > > >
> > > > > > > > > > > > > Hi all,
> > > > > > > > > > > > >
> > > > > > > > > > > > > I would like to start the voting process for
> > > > > > > > > > > > > KIP-232:
> > > > > > > > > > > > > 
> > > > > > > > > > > > > https://cwiki.apache.org/confluence/display/KAFKA/KIP-232%3A+Detect+outdated+metadata+using+leaderEpoch+and+partitionEpoch
> > > > > > > > > > > > > 
> > > > > > > > > > > > > The KIP will help fix a concurrency issue in Kafka
> > > > > > > > > > > > > which currently can cause message loss or message
> > > > > > > > > > > > > duplication in the consumer.
> > > > > > > > > > > > > 
> > > > > > > > > > > > > Regards,
> > > > > > > > > > > > > Dong
> > > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > > --
> > > > > > > > > > > > -- Guozhang
> > > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > --
> > > > > > > > > > -- Guozhang
> > > > > > > > > >
> > > > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > > > --
> > > > > > > > -- Guozhang
> > > > > > > >
> > > > > >
> > > >
> >
