On Tue, Jun 25, 2019, at 18:37, Jason Gustafson wrote:
> Hi Colin,
> 
> Took another pass on the KIP. Looks good overall. A few questions below:
> 
> 1. I wasn't clear why `currentReplicas` is an optional field. Wouldn't we
> always have a current set of replicas?

Good point.  When I wrote that, I was trying to use the same structure both for 
requesting a new reassignment and for describing an existing one.  I just posted 
a new version which uses separate structures for these.  (CurrentReplicas is 
only relevant when you are listing an existing reassignment.)
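
To make this concrete, here's a rough sketch of the two shapes in Java.  The 
class and field names are just my shorthand for this thread, not the final 
KIP-455 schema:

import java.util.List;

// Used when requesting a new reassignment: only the goal state is needed.
class NewPartitionReassignment {
    final List<Integer> targetReplicas;

    NewPartitionReassignment(List<Integer> targetReplicas) {
        this.targetReplicas = targetReplicas;
    }
}

// Used when listing an existing reassignment: here the current replica set
// is meaningful, so it appears alongside the target.
class OngoingPartitionReassignment {
    final List<Integer> currentReplicas;
    final List<Integer> targetReplicas;

    OngoingPartitionReassignment(List<Integer> currentReplicas,
                                 List<Integer> targetReplicas) {
        this.currentReplicas = currentReplicas;
        this.targetReplicas = targetReplicas;
    }
}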

> 2. Seems the only option is to list all active partition reassignments? I
> think we have tended to regret these APIs. At least should we be able to
> specify a subset of topics or partitions perhaps?

I guess the thought process here is that most reassignment tools want to know 
about all the reassignments that are going on.  If you don't know all the 
pending reassignments, it's hard to say whether adding a new one, or cancelling 
an existing one, is a good idea.  So I can't think of a case where a 
reassignment tool would want a partial set rather than the full one.

It is somewhat unfortunate to be passing all this information to an external 
process, but that's inherent in the idea of handling reassignment outside the 
controller.  There are normally zero or one processes performing and monitoring 
reassignments, so it's not as bad as thousands of clients sending full metadata 
requests.  It's probably OK.

> 
> 3. Can you elaborate a bit on the handling of /admin/reassign_partitions?
> Does this alter the target_replicas of the leader and ISR znode?

When the /admin/reassign_partitions znode is absent, we'll listen for its 
creation.  When an update is made, we'll treat it like a call to 
AlterPartitionReassignments.  Whenever we have zero pending reassignments, 
we'll delete the /admin/reassign_partitions znode.  If the znode already 
exists, we don't listen on it (same as now).
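
Roughly, the controller-side handling would behave like this sketch (every 
name here is illustrative, not the actual controller code):

import java.util.HashSet;
import java.util.Set;
import org.apache.kafka.common.TopicPartition;

class LegacyReassignZnodeHandler {
    private final Set<TopicPartition> pending = new HashSet<>();

    // Fires when a watch triggers on a newly created
    // /admin/reassign_partitions znode.
    void onZnodeCreated(Set<TopicPartition> requested) {
        // Treat the znode contents exactly like an incoming
        // AlterPartitionReassignments call.
        pending.addAll(requested);
    }

    // Called whenever a single reassignment completes or is cancelled.
    void onReassignmentDone(TopicPartition tp) {
        pending.remove(tp);
        if (pending.isEmpty()) {
            // With zero pending reassignments, remove the znode so that
            // legacy tools can submit a new batch.
            System.out.println("deleting /admin/reassign_partitions");
        }
    }
}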

One thing to note here is that if you upgrade from a pre-KIP-455 version to a 
KIP-455 version, and you invoke AlterPartitionReassignments before the cluster 
roll is finished, the new controller won't write the reassignment to 
/admin/reassign_partitions.  So if the controllership bounces to an un-upgraded 
broker during the roll, the reassignment will stall for that period.  This is 
OK because a cluster roll should be relatively short, and a roll is probably 
not a time you want to be doing reassignments anyway, given the extra load on 
the cluster.

> 
> 4. I think it would be helpful to provide an example of the rebalance
> process for a given partition. Specifically I am wondering whether the
> replica set is updated incrementally or if we follow the current behavior.
> Possibly some decisions can be deferred to implementation, but it would be
> helpful to work through a case of changing the replication factor just to
> make sure there are reasonable options.

Good question.  It will be the current behavior: we immediately try to 
replicate to all the targetReplicas.  As each one enters the ISR, it leaves 
"targetReplicas" and enters "replicas."  Making this more incremental would be 
a good follow-on improvement.
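
As a toy example of what happens to a single partition (all names 
illustrative):

import java.util.ArrayList;
import java.util.List;

// Toy model of the current, non-incremental behavior: we replicate to every
// target at once, and a target graduates into "replicas" when it joins the
// ISR.
class PartitionState {
    List<Integer> replicas = new ArrayList<>(List.of(1, 2, 3));
    List<Integer> targetReplicas = new ArrayList<>(List.of(4, 5, 6));

    void onReplicaEnteredIsr(int brokerId) {
        if (targetReplicas.remove(Integer.valueOf(brokerId))) {
            replicas.add(brokerId);
        }
        // Once targetReplicas is empty, the reassignment is complete and any
        // replicas that are no longer wanted can be dropped (elided here).
    }
}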

> 
> 5. Are we changing the semantics of the URP and UnderMinIsr metrics in this
> KIP or in a follow-up?

In a follow-up.

> 
> 6. We have both "TargetBrokers" and "PendingReplicas" as names in the
> proposal. Perhaps we should try to be consistent?

Fixed.  I set it to TargetReplicas for now.

> 
> 7. I am not sure specifying `targetReplicas` as empty is the clearest way
> to cancel a reassignment. Whether we implement it this way or not in the
> protocol is a separate issue, but perhaps we should have an explicit
> `cancelReassignment` method in AdminClient?

I changed the API slightly to take an Optional<NewPartitionReassignment>.  Does 
that seem cleaner?
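
For example, assuming the call shape sketched in the KIP (treat the exact 
class and method names as provisional):

import java.util.List;
import java.util.Map;
import java.util.Optional;
import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.NewPartitionReassignment;
import org.apache.kafka.common.TopicPartition;

class ReassignExample {
    static void startThenCancel(Admin admin) throws Exception {
        TopicPartition tp = new TopicPartition("my-topic", 0);

        // A present Optional sets a new target replica list...
        admin.alterPartitionReassignments(
            Map.of(tp, Optional.of(new NewPartitionReassignment(List.of(4, 5, 6))))
        ).all().get();

        // ...and an empty Optional cancels the pending reassignment.
        admin.alterPartitionReassignments(
            Map.of(tp, Optional.<NewPartitionReassignment>empty())
        ).all().get();
    }
}

The empty Optional keeps cancellation in the same method as creation, while 
still making it an explicit, typed action rather than a magic empty list.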

best,
Colin

> 
> Thanks,
> Jason
> 
> 
> 
> 
> On Wed, Jun 19, 2019 at 3:36 AM Stanislav Kozlovski <stanis...@confluent.io>
> wrote:
> 
> > Hey there Colin,
> >
> > Thanks for the work on this KIP. It is a much-needed improvement and I'm
> > excited to see it. Sorry for coming in so late to the discussion, I have
> > one question to better understand the change and a small suggestion.
> >
> > I see we allow reassignment cancellation at the partition level - what is
> > the motivation behind that? I think that having the back-end structures
> > support it is a good choice since it allows us more flexibility in the
> > future but what are the reasons for allowing a user to cancel at a
> > partition level? I think allowing it might let users shoot themselves in
> > the foot easier and make tools harder to implement (needing to guard
> > against it).
> >
> > In any case, what do you think about an ease-of-use improvement where we
> > allow a user to cancel all reassignments for a topic without specifying its
> > partitions? Essentially, we could cancel all reassignments for a topic if
> > the Partitions field in AlterPartitionReassignmentsRequest is null.
> >
> > Best,
> > Stanislav
> >
> > On Mon, May 6, 2019 at 5:42 PM Colin McCabe <cmcc...@apache.org> wrote:
> >
> > > On Mon, May 6, 2019, at 07:39, Ismael Juma wrote:
> > > > Hi Colin,
> > > >
> > > > A quick comment.
> > > >
> > > > On Sat, May 4, 2019 at 11:18 PM Colin McCabe <cmcc...@apache.org> wrote:
> > > >
> > > > > The big advantage of doing batching on the controller is that the
> > > > > controller has more information about what is going on in the cluster.  So
> > > > > it can schedule reassignments in a more optimal way.  For instance, it can
> > > > > schedule reassignments so that the load is distributed evenly across
> > > > > nodes.  This advantage is lost if we have to adhere to a rigid ordering
> > > > > that is set up in advance.  We don't know exactly when anything will
> > > > > complete in any case.  Just because one partition reassignment was started
> > > > > before another doesn't mean it will finish before another.
> > > >
> > > > This is not quite true, right? The Controller doesn't know about partition
> > > > sizes, throughput per partition and other such information that external
> > > > tools like Cruise Control track.
> > >
> > > Hi Ismael,
> > >
> > > That's a good point, and one I should have included.
> > >
> > > I guess when I think about "do batching in the controller" versus "do
> > > batching in an external system" I tend to think about the information the
> > > controller could theoretically collect, rather than what it actually does
> > > :)  But certainly, adding this information to the controller would be a
> > > significant change, and maybe one we don't want to do if the external
> > > systems work well enough.
> > >
> > > Thinking about this a little bit more, I can see three advantages to
> > > controller-side batching.  Firstly, doing batching in the controller saves
> > > memory because we don't use a separate JVM, and don't duplicate the
> > > in-memory map of all the partitions.  Secondly, the information we're
> > > acting on would also be more up-to-date.  (I'm not sure how important this
> > > would be.)  Finally, it's one less thing to deploy.  I don't know if those
> > > are really enough to motivate switching now, but in a greenfield system I
> > > would probably choose controller-side rebalancing.
> > >
> > > In any case, this KIP is orthogonal to controller-side rebalancing versus
> > > external rebalancing.  That's why the KIP states that we will continue to
> > > perform all the given partition rebalances immediately.  I was just
> > > responding to the idea that maybe we should have an "ordering" of
> > > rebalancing partitions.  I don't think we want that, for controller-side
> > > rebalancing or externally batched rebalancing.
> > >
> > > best,
> > > Colin
> > >
> >
> >
> > --
> > Best,
> > Stanislav
> >
>
