On Thu, Jun 27, 2019, at 08:58, Jason Gustafson wrote:
> >
> > We'd remove nodes from targetReplicas just as soon as they entered the
> > ISR.  They would become regular replicas at that point.
> 
> 
> I think we can save a lot of back and forth by working through an example.
> Suppose we have the following initial state:
> 
> replicas: [1, 2, 3]
> isr: [1, 2, 3]
> targetReplicas: []
> 
> We want to reassign to [4, 5, 6]. My understanding is that leads to the
> following state:
> 
> replicas: [1, 2, 3]
> isr: [1, 2, 3]
> targetReplicas: [4, 5, 6]
> 
> Let's say replica 4 comes into the ISR first:
> 
> replicas: [1, 2, 3]
> isr: [1, 2, 3, 4]
> targetReplicas: [4, 5, 6]
> 
> What happens now?

The state transitions to:
replicas: [1, 2, 3, 4]
isr: [1, 2, 3, 4]
targetReplicas: [4, 5, 6]

The full sequence would be something like this:
R: [1, 2, 3], I: [1, 2, 3], T: [4, 5, 6]
R: [1, 2, 3], I: [1, 2, 3, 4], T: [4, 5, 6]
R: [1, 2, 3, 4], I: [1, 2, 3, 4], T: [4, 5, 6]
R: [1, 2, 3, 4], I: [1, 2, 3, 4, 5], T: [4, 5, 6]
R: [1, 2, 3, 4, 5], I: [1, 2, 3, 4, 5], T: [4, 5, 6]
R: [1, 2, 3, 4, 5], I: [1, 2, 3, 4, 5, 6], T: [4, 5, 6]
R: [1, 2, 3, 4, 5, 6], I: [1, 2, 3, 4, 5, 6], T: [4, 5, 6]
R: [4, 5, 6], I: [4, 5, 6], T: null

Here's another example:
R: [1, 2, 3], I: [1, 2, 3], T: [2, 3, 4]
R: [1, 2, 3], I: [1, 2, 3, 4], T: [2, 3, 4]
R: [1, 2, 3, 4], I: [1, 2, 3, 4], T: [2, 3, 4]
R: [2, 3, 4], I: [2, 3, 4], T: null
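
To make the transition rule behind those sequences concrete, here is a rough
sketch in Java.  The names (PartitionState, onIsrChange) are invented for the
example -- this is not the actual controller code, just the rule it would
follow each time it sees an ISR change for a reassigning partition:

    import java.util.ArrayList;
    import java.util.List;

    // Illustrative only: the class and method names here are made up for this
    // example and don't correspond to the real controller classes.
    class PartitionState {
        List<Integer> replicas;       // R: the current assignment
        List<Integer> isr;            // I: the in-sync replicas
        List<Integer> targetReplicas; // T: the desired assignment, or null

        PartitionState(List<Integer> replicas, List<Integer> isr,
                       List<Integer> targetReplicas) {
            this.replicas = new ArrayList<>(replicas);
            this.isr = new ArrayList<>(isr);
            this.targetReplicas =
                (targetReplicas == null) ? null : new ArrayList<>(targetReplicas);
        }

        // Called when the controller learns of an ISR change for this partition.
        void onIsrChange(List<Integer> newIsr) {
            isr = new ArrayList<>(newIsr);
            if (targetReplicas == null) {
                return;
            }
            // Promote any target replica that has caught up into the normal
            // replica set.  (In practice this is a separate ZK write, which is
            // why the sequences above show an intermediate state.)
            for (int replica : targetReplicas) {
                if (isr.contains(replica) && !replicas.contains(replica)) {
                    replicas.add(replica);
                }
            }
            // Once every target replica is in sync, the reassignment completes:
            // the original replicas are dropped and T is cleared.
            if (isr.containsAll(targetReplicas)) {
                replicas = new ArrayList<>(targetReplicas);
                isr.retainAll(targetReplicas);
                targetReplicas = null;
            }
        }
    }

The key point is that a node only moves from T into R after it is already in
the ISR, and the original replicas are only dropped once the whole target set
is in sync.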

Basically, "target replicas" represents what you want the partition assignment 
to be.  "Replicas" represents what it currently is.  In other words, "target 
replicas" has the role that was formerly played by the assignment shown in 
/admin/partition_reassignments.
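
For example, assuming the AlterPartitionReassignments API sketched in the KIP
(a map to Optional<NewPartitionReassignment>, with an empty Optional meaning
"cancel"), a tool would express "I want this partition to be on [4, 5, 6]"
roughly like this.  The bootstrap server, topic name, and class names are
placeholders and the final signatures may differ slightly:

    import java.util.Collections;
    import java.util.Map;
    import java.util.Optional;
    import java.util.Properties;
    import org.apache.kafka.clients.admin.AdminClient;
    import org.apache.kafka.clients.admin.NewPartitionReassignment;
    import org.apache.kafka.common.TopicPartition;

    import static java.util.Arrays.asList;

    public class ReassignExample {
        public static void main(String[] args) throws Exception {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092");      // placeholder
            try (AdminClient admin = AdminClient.create(props)) {
                TopicPartition tp = new TopicPartition("my-topic", 0);  // placeholder

                // Start a reassignment: the target replicas become [4, 5, 6].
                Map<TopicPartition, Optional<NewPartitionReassignment>> start =
                    Collections.singletonMap(tp,
                        Optional.of(new NewPartitionReassignment(asList(4, 5, 6))));
                admin.alterPartitionReassignments(start).all().get();

                // Cancel: an empty Optional means "give up the target replicas
                // and fall back to the original assignment."
                admin.alterPartitionReassignments(
                    Collections.singletonMap(tp,
                        Optional.<NewPartitionReassignment>empty()))
                    .all().get();
            }
        }
    }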

The main difference is that we don't automatically add all the target replicas 
to the normal replica set.  We don't want to do that because it would destroy the 
information we need for cancellation: namely, whether each replica was one of the 
original ones, or one of the ones we're reassigning to.  A secondary reason is 
that nothing useful comes of telling users they have a bunch of replicas that 
don't really exist yet.  We may not have copied even the first byte over to a 
target replica, so it doesn't make sense to treat it like a normal replica that's 
not in the ISR because it's lagging slightly.
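
Concretely, a client-side "under-replicated" check can keep looking only at
the normal replica set.  A minimal sketch of that check, assuming the
semantics above (the actual metric changes are deferred to a follow-up KIP):

    import java.util.List;

    class ReplicationChecks {
        // Illustrative sketch only, not the actual broker code.
        static boolean isUnderReplicated(List<Integer> replicas, List<Integer> isr) {
            // Only the normal replica set counts.  A target replica is not added
            // to `replicas` until it has joined the ISR, so an in-progress
            // reassignment never makes the partition look under-replicated by
            // itself.
            return !isr.containsAll(replicas);
        }
    }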

best,
Colin


> 
> (Sorry if I'm being dense, it's just not clear to me exactly what the
> expected transitions are from here.)
> 
> 
> Thanks,
> Jason
> 
> 
> On Wed, Jun 26, 2019 at 2:12 PM Colin McCabe <cmcc...@apache.org> wrote:
> 
> > On Wed, Jun 26, 2019, at 12:02, Jason Gustafson wrote:
> > > Hi Colin,
> > >
> > > Responses below and another question:
> > >
> > > > I guess the thought process here is that most reassignment tools want
> > to
> > > > know about all the reassignments that are going on.  If you don't know
> > all
> > > > the pending reassignments, then it's hard to say whether adding a new
> > one
> > > > is a good idea, or cancelling an existing one.  So I guess I can't
> > think of
> > > > a case where a reassignment tool would want a partial set rather than
> > the
> > > > full one.
> > >
> > > UIs often have "drill into" options. If there is a large ongoing
> > > reassignment, I can see wanting to limit the scope to a specific topic.
> > Any
> > > downside that I'm missing?
> > >
> >
> > We could add a mode that only lists a given set of partitions.  To be
> > consistent with how we handle topics, that could be a separate "describe"
> > method.  I don't think there's any downside, except some extra code to
> > write.
> >
> > > > Good question.  It will be the current behavior.  Basically, we
> > immediately
> > > > try to replicate to all the targetReplicas.  As they enter the ISR,
> > they
> > > > leave "targetReplicas" and enter "replicas."  Making this more
> > incremental
> > > > would be a good follow-on improvement.
> > >
> > > The current behavior is to wait until all target replicas are in the ISR
> > > before making a change. Are you saying that we will keep this behavior?
> >
> > We'd remove nodes from targetReplicas just as soon as they entered the
> > ISR.  They would become regular replicas at that point.
> >
> > >
> > > > When the /admin/reassign_partitions znode is empty, we'll listen for
> > > > updates.  When an update is made, we'll treat it like a call to
> > > > AlterPartitionReassignments.  Whenever we have zero pending
> > reassignments,
> > > > we'll delete the /admin/reassign_partitions znode.  If the znode
> > already
> > > > exists, we don't listen on it (same as now).
> > >
> > >
> > > So just to be a bit more explicit, what you are saying is that we will
> > keep
> > > the reassignment state under /admin/reassign_partitions as we do
> > currently,
> > > but we will update the target_replicas field in /partition/state
> > following
> > > this new logic. Then as soon as the current replica set matches the
> > target
> > > assignment, we will remove the /admin/reassign_partitions znode. Right?
> >
> > One clarification: I wasn't proposing that the controller should write to
> > /admin/reassign_partitions.  We will just remove the znode when we
> > transition to having no ongoing reassignments.  There's no guarantee that
> > what is in the znode reflects the current reassignments that are going on.
> > The only thing you can know is that if the znode exists, there is at least
> > one reassignment going on.  But if someone makes a new reassignment with
> > the AlterPartitionReassignments API, it won't appear in the znode.
> >
> > Another thing to note is that if the znode exists and you overwrite it,
> > your updates will be ignored.  This matches the current behavior of this
> > znode.  Apparently some applications don't know about this behavior and try
> > to update the znode while a reassignment is going on, but it has no
> > effect-- other than making what is in ZK misleading if someone checks.
> > This is, again, existing behavior :(
> >
> > It's not a good API by any means.  For example, what if someone else
> > overwrites your znode update before the controller has a chance to read
> > it?  Unfortunate, but there's no really good way to fix this without
> > transitioning away from direct ZooKeeper access.  We'll transition the
> > command line tools immediately, but there will be some external management
> > tools that will lag a bit.
> >
> > >
> > > Actually I'm still a bit confused about one aspect of this proposal. You
> > > are suggesting that we should leave the reassignment out of the Metadata.
> > > That is fine, but what does that mean as far as the consistency of the
> > > metadata we expose to clients while a reassignment is active? Currently
> > the
> > > Metadata includes the following:
> > >
> > > 1. Current leader
> > > 2. Current ISR
> > > 3. Current assigned replicas
> > >
> > > Can you explain how the reassignment will affect this state?  As the
> > target
> > > replicas are coming into sync, they will be added to the ISR. We wouldn't
> > > want it to be the case that the ISR includes replicas which are not
> > > currently assigned. So is the replica set the union of the current and
> > > target replicas?
> >
> > I think we should leave the targetReplicas out of the normal replica set
> > in the client metadata.  The client code that is checking the replica set
> > shouldn't be treating target replicas as if they were regular replicas.
> >  For example, you wouldn't look at a partition with 3 replicas and 2 target
> > replicas, and an ISR of size 3, and think that it was under-replicated.
> >
> > I agree that there is kind of an awkward edge case right when a
> > targetReplica enters the ISR.  Since the replica set and the ISR reside in
> > two separate znodes, they can't be updated atomically (well, maybe they
> > could, but I don't think we will in practice).  When the controller becomes
> > aware that the ISR has changed for the reassigning partition, it will
> > remove the relevant partition from targetReplicas and add it to replicas.
> > This is another ZK write, of course.  I don't think this is really a
> > problem in practice because when we're creating the client metadata
> > message, we can simply enforce that anything in the ISR must be in the
> > replica set and not in the targetReplica set.
> >
> > best,
> > Colin
> >
> > >
> > > Thanks,
> > > Jason
> > >
> > >
> > >
> > >
> > > On Wed, Jun 26, 2019 at 10:42 AM Colin McCabe <cmcc...@apache.org>
> > wrote:
> > >
> > > > On Tue, Jun 25, 2019, at 18:37, Jason Gustafson wrote:
> > > > > Hi Colin,
> > > > >
> > > > > Took another pass on the KIP. Looks good overall. A few questions
> > below:
> > > > >
> > > > > 1. I wasn't clear why `currentReplicas` is an optional field.
> > Wouldn't we
> > > > > always have a current set of replicas?
> > > >
> > > > Good point.  When I wrote that I was trying to use the same structure
> > for
> > > > both requesting a new reassignment, and describing an existing one.  I
> > just
> > > > posted a new version which uses separate structures for these.
> > > > (CurrentReplicas is only relevant when you are listing an existing
> > > > assignment.)
> > > >
> > > > > 2. Seems the only option is to list all active partition
> > reassignments? I
> > > > > think we have tended to regret these APIs. At least should we be
> > able to
> > > > > specify a subset of topics or partitions perhaps?
> > > >
> > > > I guess the thought process here is that most reassignment tools want
> > to
> > > > know about all the reassignments that are going on.  If you don't know
> > all
> > > > the pending reassignments, then it's hard to say whether adding a new
> > one
> > > > is a good idea, or cancelling an existing one.  So I guess I can't
> > think of
> > > > a case where a reassignment tool would want a partial set rather than
> > the
> > > > full one.
> > > >
> > > > It is kind of unfortunate to be passing all this information to some
> > > > external process, but that's kind of inherent in the idea of
> > reassignment
> > > > as separate from the controller.  There's normally only one or zero
> > > > processes doing and monitoring reassignments, so it's not as bad as
> > > > thousands of clients sending full metadata requests.  It's probably OK?
> > > >
> > > > >
> > > > > 3. Can you elaborate a bit on the handling of
> > /admin/reassign_partitions?
> > > > > Does this alter the target_replicas of the leader and ISR znode?
> > > >
> > > > When the /admin/reassign_partitions znode is empty, we'll listen for
> > > > updates.  When an update is made, we'll treat it like a call to
> > > > AlterPartitionReassignments.  Whenever we have zero pending
> > reassignments,
> > > > we'll delete the /admin/reassign_partitions znode.  If the znode
> > already
> > > > exists, we don't listen on it (same as now).
> > > >
> > > > One thing to note here is that if you upgrade from a pre-KIP-455
> > version
> > > > to a KIP-455 version, and you invoke AlterPartitionReassignments
> > before the
> > > > cluster roll is finished, the new controller won't write the
> > reassignment
> > > > to /admin/reassign_partitions.  So if the controller node bounces to an
> > > > un-upgraded broker during the roll, the reassignment will halt for that
> > > > time period.  This is OK because a cluster roll should be a relatively
> > > > short amount of time-- and also, probably not a time you want to be
> > doing
> > > > reassignments anyway because of the extra load on the cluster.
> > > >
> > > > >
> > > > > 4. I think it would be helpful to provide an example of the rebalance
> > > > > process for a given partition. Specifically I am wondering whether
> > the
> > > > > replica set is updated incrementally or if we follow the current
> > > > behavior.
> > > > > Possibly some decisions can be deferred to implementation, but it
> > would
> > > > be
> > > > > helpful to work through a case of changing the replication factor
> > just to
> > > > > make sure there are reasonable options.
> > > >
> > > > Good question.  It will be the current behavior.  Basically, we
> > > > immediately try to replicate to all the targetReplicas.  As they enter
> > the
> > > > ISR, they leave "targetReplicas" and enter "replicas."  Making this
> > more
> > > > incremental would be a good follow-on improvement.
> > > >
> > > > >
> > > > > 5. Are we changing the semantics of the URP and UnderMinIsr metrics
> > in
> > > > this
> > > > > KIP or in a follow-up?
> > > >
> > > > In a follow-up.
> > > >
> > > > >
> > > > > 6. We have both "TargetBrokers" and "PendingReplicas" as names in the
> > > > > proposal. Perhaps we should try to be consistent?
> > > >
> > > > Fixed.  I set it to TargetReplicas for now
> > > >
> > > > >
> > > > > 7. I am not sure specifying `targetReplicas` as empty is the
> > clearest way
> > > > > to cancel a reassignment. Whether we implement it this way or not in
> > the
> > > > > protocol is a separate issue, but perhaps we should have an explicit
> > > > > `cancelReassignment` method in AdminClient?
> > > >
> > > > I changed the API slightly to take an
> > Optional<NewPartitionReassignment>.
> > > > Does that seem cleaner?
> > > >
> > > > best,
> > > > Colin
> > > >
> > > > >
> > > > > Thanks,
> > > > > Jason
> > > > >
> > > > >
> > > > >
> > > > >
> > > > > On Wed, Jun 19, 2019 at 3:36 AM Stanislav Kozlovski <
> > > > stanis...@confluent.io>
> > > > > wrote:
> > > > >
> > > > > > Hey there Colin,
> > > > > >
> > > > > > Thanks for the work on this KIP. It is a much-needed improvement
> > and
> > > > I'm
> > > > > > excited to see it. Sorry for coming in so late to the discussion, I
> > > > have
> > > > > > one question to better understand the change and a small
> > suggestion.
> > > > > >
> > > > > > I see we allow reassignment cancellation at the partition level -
> > what
> > > > is
> > > > > > the motivation behind that? I think that having the back-end
> > structures
> > > > > > support it is a good choice since it allows us more flexibility in
> > the
> > > > > > future but what are the reasons for allowing a user to cancel at a
> > > > > > partition level? I think allowing it might let users shoot
> > themselves
> > > > in
> > > > > > the foot easier and make tools harder to implement (needing to
> > guard
> > > > > > against it).
> > > > > >
> > > > > > In all regards, what do you think about an ease of use improvement
> > > > where we
> > > > > > allow a user to cancel all reassignments for a topic without
> > > > specifying its
> > > > > > partitions? Essentially, we could cancel all reassignments for a
> > topic
> > > > if
> > > > > > the Partitions field in AlterPartitionAssignmentsRequest is null.
> > > > > >
> > > > > > Best,
> > > > > > Stanislav
> > > > > >
> > > > > > On Mon, May 6, 2019 at 5:42 PM Colin McCabe <cmcc...@apache.org>
> > > > wrote:
> > > > > >
> > > > > > > On Mon, May 6, 2019, at 07:39, Ismael Juma wrote:
> > > > > > > > Hi Colin,
> > > > > > > >
> > > > > > > > A quick comment.
> > > > > > > >
> > > > > > > > On Sat, May 4, 2019 at 11:18 PM Colin McCabe <
> > cmcc...@apache.org>
> > > > > > wrote:
> > > > > > > >
> > > > > > > > > The big advantage of doing batching on the controller is
> > that the
> > > > > > > > > controller has more information about what is going on in the
> > > > > > > cluster.  So
> > > > > > > > > it can schedule reassignments in a more optimal way.  For
> > > > instance,
> > > > > > it
> > > > > > > can
> > > > > > > > > schedule reassignments so that the load is distributed evenly
> > > > across
> > > > > > > > > nodes.  This advantage is lost if we have to adhere to a
> > rigid
> > > > > > ordering
> > > > > > > > > that is set up in advance.  We don't know exactly when
> > anything
> > > > will
> > > > > > > > > complete in any case.  Just because one partition
> > reassignment
> > > > was
> > > > > > > started
> > > > > > > > > before another doesn't mean it will finish before another.
> > > > > > > >
> > > > > > > >
> > > > > > > > This is not quite true, right? The Controller doesn't know
> > about
> > > > > > > partition
> > > > > > > > sizes, throughput per partition and other such information that
> > > > > > external
> > > > > > > > tools like Cruise Control track.
> > > > > > >
> > > > > > > Hi Ismael,
> > > > > > >
> > > > > > > That's a good point, and one I should have included.
> > > > > > >
> > > > > > > I guess when I think about "do batching in the controller"
> > versus "do
> > > > > > > batching in an external system" I tend to think about the
> > > > information the
> > > > > > > controller could theoretically collect, rather than what it
> > actually
> > > > does
> > > > > > > :)  But certainly, adding this information to the controller
> > would
> > > > be a
> > > > > > > significant change, and maybe one we don't want to do if the
> > external
> > > > > > > systems work well enough.
> > > > > > >
> > > > > > > Thinking about this a little bit more, I can see three
> > advantages to
> > > > > > > controller-side batching.  Firstly, doing batching in the
> > controller
> > > > > > saves
> > > > > > > memory because we don't use a separate JVM, and don't duplicate
> > the
> > > > > > > in-memory map of all the partitions.  Secondly, the information
> > we're
> > > > > > > acting on would also be more up-to-date.  (I'm not sure how
> > important
> > > > > > this
> > > > > > > would be.)  Finally, it's one less thing to deploy.  I don't
> > know if
> > > > > > those
> > > > > > > are really enough to motivate switching now, but in a greenfield
> > > > system I
> > > > > > > would probably choose controller-side rebalancing.
> > > > > > >
> > > > > > > In any case, this KIP is orthogonal to controller-side
> > rebalancing
> > > > versus
> > > > > > > external rebalancing.  That's why the KIP states that we will
> > > > continue to
> > > > > > > perform all the given partition rebalances immediately.  I was
> > just
> > > > > > > responding to the idea that maybe we should have an "ordering" of
> > > > > > > rebalancing partitions.  I don't think we want that, for
> > > > controller-side
> > > > > > > rebalancing or externally batched rebalancing.
> > > > > > >
> > > > > > > best,
> > > > > > > Colin
> > > > > > >
> > > > > >
> > > > > >
> > > > > > --
> > > > > > Best,
> > > > > > Stanislav
> > > > > >
> > > > >
> > > >
> > >
> >
>
