> We'd remove nodes from targetReplicas just as soon as they entered the ISR. They would become regular replicas at that point.

I think we can save a lot of back and forth by working through an example. Suppose we have the following initial state:

replicas: [1, 2, 3]
isr: [1, 2, 3]
targetReplicas: []

We want to reassign to [4, 5, 6]. My understanding is that leads to the following state:

replicas: [1, 2, 3]
isr: [1, 2, 3]
targetReplicas: [4, 5, 6]

Let's say replica 4 comes into the ISR first:

replicas: [1, 2, 3]
isr: [1, 2, 3, 4]
targetReplicas: [4, 5, 6]

What happens now? (Sorry if I'm being dense, it's just not clear to me exactly what the expected transitions are from here.)

Thanks,
Jason
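For illustration, here is a rough sketch of how the reassignment in this example might be submitted through the Admin API the KIP proposes. The class and method names (Admin, alterPartitionReassignments, NewPartitionReassignment) reflect the shape the proposal was converging on and should be read as assumptions rather than a finalized interface; the topic name is hypothetical.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.Optional;
import java.util.Properties;
import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.NewPartitionReassignment;
import org.apache.kafka.common.TopicPartition;

public class ReassignExample {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        try (Admin admin = Admin.create(props)) {
            // foo-0 currently has replicas [1, 2, 3]; ask for [4, 5, 6].
            // Brokers that have not yet caught up are tracked as target replicas.
            TopicPartition foo0 = new TopicPartition("foo", 0);
            NewPartitionReassignment target =
                new NewPartitionReassignment(Arrays.asList(4, 5, 6));
            admin.alterPartitionReassignments(
                Collections.singletonMap(foo0, Optional.of(target))).all().get();
        }
    }
}
```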
On Wed, Jun 26, 2019 at 2:12 PM Colin McCabe <cmcc...@apache.org> wrote:

> On Wed, Jun 26, 2019, at 12:02, Jason Gustafson wrote:
> > Hi Colin,
> >
> > Responses below and another question:
> >
> > > I guess the thought process here is that most reassignment tools want to know about all the reassignments that are going on. If you don't know all the pending reassignments, then it's hard to say whether adding a new one is a good idea, or cancelling an existing one. So I guess I can't think of a case where a reassignment tool would want a partial set rather than the full one.
> >
> > UIs often have "drill into" options. If there is a large ongoing reassignment, I can see wanting to limit the scope to a specific topic. Any downside that I'm missing?
>
> We could add a mode that only lists a given set of partitions. To be consistent with how we handle topics, that could be a separate "describe" method. I don't think there's any downside, except some extra code to write.
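A hedged sketch of what listing the ongoing reassignments could look like on the client side, assuming a listPartitionReassignments method along the lines of the KIP draft. The per-partition fields shown (replicas, adding, removing) are assumptions about what such a result object might expose, and the partition-scoped "describe" variant discussed above would presumably be an overload taking a set of partitions.

```java
import java.util.Map;
import java.util.concurrent.ExecutionException;
import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.PartitionReassignment;
import org.apache.kafka.common.TopicPartition;

public class ListReassignmentsExample {
    // Print every reassignment the controller currently knows about.
    static void printOngoingReassignments(Admin admin)
            throws InterruptedException, ExecutionException {
        Map<TopicPartition, PartitionReassignment> ongoing =
            admin.listPartitionReassignments().reassignments().get();
        ongoing.forEach((tp, r) ->
            System.out.printf("%s: replicas=%s adding=%s removing=%s%n",
                tp, r.replicas(), r.addingReplicas(), r.removingReplicas()));
    }
}
```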
> > > Good question. It will be the current behavior. Basically, we immediately try to replicate to all the targetReplicas. As they enter the ISR, they leave "targetReplicas" and enter "replicas." Making this more incremental would be a good follow-on improvement.
> >
> > The current behavior is to wait until all target replicas are in the ISR before making a change. Are you saying that we will keep this behavior?
>
> We'd remove nodes from targetReplicas just as soon as they entered the ISR. They would become regular replicas at that point.
>
> > > When the /admin/reassign_partitions znode is empty, we'll listen for updates. When an update is made, we'll treat it like a call to AlterPartitionReassignments. Whenever we have zero pending reassignments, we'll delete the /admin/reassign_partitions znode. If the znode already exists, we don't listen on it (same as now).
> >
> > So just to be a bit more explicit, what you are saying is that we will keep the reassignment state under /admin/reassign_partitions as we do currently, but we will update the target_replicas field in /partition/state following this new logic. Then as soon as the current replica set matches the target assignment, we will remove the /admin/reassign_partitions znode. Right?
>
> One clarification: I wasn't proposing that the controller should write to /admin/reassign_partitions. We will just remove the znode when we transition to having no ongoing reassignments. There's no guarantee that what is in the znode reflects the current reassignments that are going on. The only thing you can know is that if the znode exists, there is at least one reassignment going on. But if someone makes a new reassignment with the AlterPartitionReassignments API, it won't appear in the znode.
>
> Another thing to note is that if the znode exists and you overwrite it, your updates will be ignored. This matches the current behavior of this znode. Apparently some applications don't know about this behavior and try to update the znode while a reassignment is going on, but it has no effect -- other than making what is in ZK misleading if someone checks. This is, again, existing behavior :(
>
> It's not a good API by any means. For example, what if someone else overwrites your znode update before the controller has a chance to read it? Unfortunate, but there's no really good way to fix this without transitioning away from direct ZooKeeper access. We'll transition the command line tools immediately, but there will be some external management tools that will lag a bit.
>
> > Actually I'm still a bit confused about one aspect of this proposal. You are suggesting that we should leave the reassignment out of the Metadata. That is fine, but what does that mean as far as the consistency of the metadata we expose to clients while a reassignment is active? Currently the Metadata includes the following:
> >
> > 1. Current leader
> > 2. Current ISR
> > 3. Current assigned replicas
> >
> > Can you explain how the reassignment will affect this state? As the target replicas are coming into sync, they will be added to the ISR. We wouldn't want it to be the case that the ISR includes replicas which are not currently assigned. So is the replica set the union of the current and target replicas?
>
> I think we should leave the targetReplicas out of the normal replica set in the client metadata. The client code that is checking the replica set shouldn't be treating target replicas as if they were regular replicas. For example, you wouldn't look at a partition with 3 replicas and 2 target replicas, and an ISR of size 3, and think that it was under-replicated.
>
> I agree that there is kind of an awkward edge case right when a targetReplica enters the ISR. Since the replica set and the ISR reside in two separate znodes, they can't be updated atomically (well, maybe they could, but I don't think we will in practice). When the controller becomes aware that the ISR has changed for the reassigning partition, it will remove the relevant replica from targetReplicas and add it to replicas. This is another ZK write, of course. I don't think this is really a problem in practice because when we're creating the client metadata message, we can simply enforce that anything in the ISR must be in the replica set and not in the targetReplica set.
>
> best,
> Colin
>
> > Thanks,
> > Jason
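The enforcement Colin describes (every ISR member must appear in the replica set and never in the target replica set when client metadata is built) could look roughly like the following. This is a hedged sketch with made-up helper names, not code from the KIP or the controller.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helpers: derive the replica lists exposed in client metadata so
// that an in-sync target replica is reported as a normal replica, even if the
// controller has not yet persisted that move to ZooKeeper.
final class MetadataReplicaSets {

    // Replicas shown to clients: the assigned replicas plus anything in the ISR.
    static List<Integer> effectiveReplicas(List<Integer> replicas, List<Integer> isr) {
        Set<Integer> result = new LinkedHashSet<>(replicas);
        result.addAll(isr);
        return new ArrayList<>(result);
    }

    // Target replicas shown to clients: drop anything that already joined the ISR.
    static List<Integer> effectiveTargetReplicas(List<Integer> targetReplicas, List<Integer> isr) {
        List<Integer> result = new ArrayList<>(targetReplicas);
        result.removeAll(isr);
        return result;
    }
}
```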
> > On Wed, Jun 26, 2019 at 10:42 AM Colin McCabe <cmcc...@apache.org> wrote:
> >
> > > On Tue, Jun 25, 2019, at 18:37, Jason Gustafson wrote:
> > > > Hi Colin,
> > > >
> > > > Took another pass on the KIP. Looks good overall. A few questions below:
> > > >
> > > > 1. I wasn't clear why `currentReplicas` is an optional field. Wouldn't we always have a current set of replicas?
> > >
> > > Good point. When I wrote that I was trying to use the same structure for both requesting a new reassignment, and describing an existing one. I just posted a new version which uses separate structures for these. (CurrentReplicas is only relevant when you are listing an existing assignment.)
> > >
> > > > 2. Seems the only option is to list all active partition reassignments? I think we have tended to regret these APIs. At least should we be able to specify a subset of topics or partitions perhaps?
> > >
> > > I guess the thought process here is that most reassignment tools want to know about all the reassignments that are going on. If you don't know all the pending reassignments, then it's hard to say whether adding a new one is a good idea, or cancelling an existing one. So I guess I can't think of a case where a reassignment tool would want a partial set rather than the full one.
> > >
> > > It is kind of unfortunate to be passing all this information to some external process, but that's kind of inherent in the idea of reassignment as separate from the controller. There's normally only one or zero processes doing and monitoring reassignments, so it's not as bad as thousands of clients sending full metadata requests. It's probably OK?
> > >
> > > > 3. Can you elaborate a bit on the handling of /admin/reassign_partitions? Does this alter the target_replicas of the leader and ISR znode?
> > >
> > > When the /admin/reassign_partitions znode is empty, we'll listen for updates. When an update is made, we'll treat it like a call to AlterPartitionReassignments. Whenever we have zero pending reassignments, we'll delete the /admin/reassign_partitions znode. If the znode already exists, we don't listen on it (same as now).
> > >
> > > One thing to note here is that if you upgrade from a pre-KIP-455 version to a KIP-455 version, and you invoke AlterPartitionReassignments before the cluster roll is finished, the new controller won't write the reassignment to /admin/reassign_partitions. So if the controller node bounces to an un-upgraded broker during the roll, the reassignment will halt for that time period. This is OK because a cluster roll should be a relatively short amount of time -- and also, probably not a time you want to be doing reassignments anyway because of the extra load on the cluster.
> > >
> > > > 4. I think it would be helpful to provide an example of the rebalance process for a given partition. Specifically I am wondering whether the replica set is updated incrementally or if we follow the current behavior. Possibly some decisions can be deferred to implementation, but it would be helpful to work through a case of changing the replication factor just to make sure there are reasonable options.
> > >
> > > Good question. It will be the current behavior. Basically, we immediately try to replicate to all the targetReplicas. As they enter the ISR, they leave "targetReplicas" and enter "replicas." Making this more incremental would be a good follow-on improvement.
> > >
> > > > 5. Are we changing the semantics of the URP and UnderMinIsr metrics in this KIP or in a follow-up?
> > >
> > > In a follow-up.
> > >
> > > > 6. We have both "TargetBrokers" and "PendingReplicas" as names in the proposal. Perhaps we should try to be consistent?
> > >
> > > Fixed. I set it to TargetReplicas for now.
> > >
> > > > 7. I am not sure specifying `targetReplicas` as empty is the clearest way to cancel a reassignment. Whether we implement it this way or not in the protocol is a separate issue, but perhaps we should have an explicit `cancelReassignment` method in AdminClient?
> > >
> > > I changed the API slightly to take an Optional<NewPartitionReassignment>. Does that seem cleaner?
> > >
> > > best,
> > > Colin
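To make the Optional-based cancellation concrete: under the same assumed Admin API as in the sketches above, mapping a partition to an empty Optional would cancel its pending reassignment. Names and the topic are hypothetical.

```java
import java.util.Collections;
import java.util.Optional;
import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.NewPartitionReassignment;
import org.apache.kafka.common.TopicPartition;

public class CancelReassignmentExample {
    // Cancel the pending reassignment of foo-0 by mapping it to Optional.empty().
    static void cancel(Admin admin) throws Exception {
        admin.alterPartitionReassignments(
                Collections.singletonMap(
                    new TopicPartition("foo", 0),
                    Optional.<NewPartitionReassignment>empty()))
            .all()
            .get();
    }
}
```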
> > > > Thanks,
> > > > Jason
> > > >
> > > > On Wed, Jun 19, 2019 at 3:36 AM Stanislav Kozlovski <stanis...@confluent.io> wrote:
> > > >
> > > > > Hey there Colin,
> > > > >
> > > > > Thanks for the work on this KIP. It is a much-needed improvement and I'm excited to see it. Sorry for coming in so late to the discussion, I have one question to better understand the change and a small suggestion.
> > > > >
> > > > > I see we allow reassignment cancellation at the partition level - what is the motivation behind that? I think that having the back-end structures support it is a good choice since it allows us more flexibility in the future but what are the reasons for allowing a user to cancel at a partition level? I think allowing it might let users shoot themselves in the foot easier and make tools harder to implement (needing to guard against it).
> > > > >
> > > > > In all regards, what do you think about an ease of use improvement where we allow a user to cancel all reassignments for a topic without specifying its partitions? Essentially, we could cancel all reassignments for a topic if the Partitions field in AlterPartitionAssignmentsRequest is null.
> > > > >
> > > > > Best,
> > > > > Stanislav
> > > > >
> > > > > On Mon, May 6, 2019 at 5:42 PM Colin McCabe <cmcc...@apache.org> wrote:
> > > > >
> > > > > > On Mon, May 6, 2019, at 07:39, Ismael Juma wrote:
> > > > > > > Hi Colin,
> > > > > > >
> > > > > > > A quick comment.
> > > > > > >
> > > > > > > On Sat, May 4, 2019 at 11:18 PM Colin McCabe <cmcc...@apache.org> wrote:
> > > > > > >
> > > > > > > > The big advantage of doing batching on the controller is that the controller has more information about what is going on in the cluster. So it can schedule reassignments in a more optimal way. For instance, it can schedule reassignments so that the load is distributed evenly across nodes. This advantage is lost if we have to adhere to a rigid ordering that is set up in advance. We don't know exactly when anything will complete in any case. Just because one partition reassignment was started before another doesn't mean it will finish before another.
> > > > > > >
> > > > > > > This is not quite true, right? The Controller doesn't know about partition sizes, throughput per partition and other such information that external tools like Cruise Control track.
> > > > > >
> > > > > > Hi Ismael,
> > > > > >
> > > > > > That's a good point, and one I should have included.
> > > > > > I guess when I think about "do batching in the controller" versus "do batching in an external system" I tend to think about the information the controller could theoretically collect, rather than what it actually does :) But certainly, adding this information to the controller would be a significant change, and maybe one we don't want to do if the external systems work well enough.
> > > > > >
> > > > > > Thinking about this a little bit more, I can see three advantages to controller-side batching. Firstly, doing batching in the controller saves memory because we don't use a separate JVM, and don't duplicate the in-memory map of all the partitions. Secondly, the information we're acting on would also be more up-to-date. (I'm not sure how important this would be.) Finally, it's one less thing to deploy. I don't know if those are really enough to motivate switching now, but in a greenfield system I would probably choose controller-side rebalancing.
> > > > > >
> > > > > > In any case, this KIP is orthogonal to controller-side rebalancing versus external rebalancing. That's why the KIP states that we will continue to perform all the given partition rebalances immediately. I was just responding to the idea that maybe we should have an "ordering" of rebalancing partitions. I don't think we want that, for controller-side rebalancing or externally batched rebalancing.
> > > > > >
> > > > > > best,
> > > > > > Colin
> > > > >
> > > > > --
> > > > > Best,
> > > > > Stanislav