> We'd remove nodes from targetReplicas just as soon as they entered the ISR. They would become regular replicas at that point.

I think we can save a lot of back and forth by working through an example. Suppose we have the following initial state:

replicas: [1, 2, 3]
isr: [1, 2, 3]
targetReplicas: []

We want to reassign to [4, 5, 6]. My understanding is that leads to the following state:

replicas: [1, 2, 3]
isr: [1, 2, 3]
targetReplicas: [4, 5, 6]

Let's say replica 4 comes into the ISR first:

replicas: [1, 2, 3]
isr: [1, 2, 3, 4]
targetReplicas: [4, 5, 6]

What happens now? (Sorry if I'm being dense, it's just not clear to me exactly what the expected transitions are from here.)

Thanks,
Jason
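For illustration, here is a rough sketch of how the reassignment in this example might be submitted through the Admin API the KIP proposes. The class and method names (Admin, alterPartitionReassignments, NewPartitionReassignment) reflect the shape the proposal was converging on and should be read as assumptions rather than a finalized interface; the topic name is hypothetical.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.Optional;
import java.util.Properties;
import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.NewPartitionReassignment;
import org.apache.kafka.common.TopicPartition;

public class ReassignExample {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        try (Admin admin = Admin.create(props)) {
            // foo-0 currently has replicas [1, 2, 3]; ask for [4, 5, 6].
            // Brokers that have not yet caught up are tracked as target replicas.
            TopicPartition foo0 = new TopicPartition("foo", 0);
            NewPartitionReassignment target =
                new NewPartitionReassignment(Arrays.asList(4, 5, 6));
            admin.alterPartitionReassignments(
                Collections.singletonMap(foo0, Optional.of(target))).all().get();
        }
    }
}
```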
On Wed, Jun 26, 2019 at 2:12 PM Colin McCabe <cmcc...@apache.org> wrote:

> On Wed, Jun 26, 2019, at 12:02, Jason Gustafson wrote:
> > Hi Colin,
> >
> > Responses below and another question:
> >
> > > I guess the thought process here is that most reassignment tools want to know about all the reassignments that are going on. If you don't know all the pending reassignments, then it's hard to say whether adding a new one is a good idea, or cancelling an existing one. So I guess I can't think of a case where a reassignment tool would want a partial set rather than the full one.
> >
> > UIs often have "drill into" options. If there is a large ongoing reassignment, I can see wanting to limit the scope to a specific topic. Any downside that I'm missing?
>
> We could add a mode that only lists a given set of partitions. To be consistent with how we handle topics, that could be a separate "describe" method. I don't think there's any downside, except some extra code to write.
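A hedged sketch of what listing the ongoing reassignments could look like on the client side, assuming a listPartitionReassignments method along the lines of the KIP draft. The per-partition fields shown (replicas, adding, removing) are assumptions about what such a result object might expose, and the partition-scoped "describe" variant discussed above would presumably be an overload taking a set of partitions.

```java
import java.util.Map;
import java.util.concurrent.ExecutionException;
import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.PartitionReassignment;
import org.apache.kafka.common.TopicPartition;

public class ListReassignmentsExample {
    // Print every reassignment the controller currently knows about.
    static void printOngoingReassignments(Admin admin)
            throws InterruptedException, ExecutionException {
        Map<TopicPartition, PartitionReassignment> ongoing =
            admin.listPartitionReassignments().reassignments().get();
        ongoing.forEach((tp, r) ->
            System.out.printf("%s: replicas=%s adding=%s removing=%s%n",
                tp, r.replicas(), r.addingReplicas(), r.removingReplicas()));
    }
}
```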
> > > Good question. It will be the current behavior. Basically, we immediately try to replicate to all the targetReplicas. As they enter the ISR, they leave "targetReplicas" and enter "replicas." Making this more incremental would be a good follow-on improvement.
> >
> > The current behavior is to wait until all target replicas are in the ISR before making a change. Are you saying that we will keep this behavior?
>
> We'd remove nodes from targetReplicas just as soon as they entered the ISR. They would become regular replicas at that point.
>
> > > When the /admin/reassign_partitions znode is empty, we'll listen for updates. When an update is made, we'll treat it like a call to AlterPartitionReassignments. Whenever we have zero pending reassignments, we'll delete the /admin/reassign_partitions znode. If the znode already exists, we don't listen on it (same as now).
> >
> > So just to be a bit more explicit, what you are saying is that we will keep the reassignment state under /admin/reassign_partitions as we do currently, but we will update the target_replicas field in /partition/state following this new logic. Then as soon as the current replica set matches the target assignment, we will remove the /admin/reassign_partitions znode. Right?
>
> One clarification: I wasn't proposing that the controller should write to /admin/reassign_partitions. We will just remove the znode when we transition to having no ongoing reassignments. There's no guarantee that what is in the znode reflects the current reassignments that are going on. The only thing you can know is that if the znode exists, there is at least one reassignment going on. But if someone makes a new reassignment with the AlterPartitionReassignments API, it won't appear in the znode.
>
> Another thing to note is that if the znode exists and you overwrite it, your updates will be ignored. This matches the current behavior of this znode. Apparently some applications don't know about this behavior and try to update the znode while a reassignment is going on, but it has no effect -- other than making what is in ZK misleading if someone checks. This is, again, existing behavior :(
>
> It's not a good API by any means. For example, what if someone else overwrites your znode update before the controller has a chance to read it? Unfortunate, but there's no really good way to fix this without transitioning away from direct ZooKeeper access. We'll transition the command line tools immediately, but there will be some external management tools that will lag a bit.
>
> > Actually I'm still a bit confused about one aspect of this proposal. You are suggesting that we should leave the reassignment out of the Metadata. That is fine, but what does that mean as far as the consistency of the metadata we expose to clients while a reassignment is active? Currently the Metadata includes the following:
> >
> > 1. Current leader
> > 2. Current ISR
> > 3. Current assigned replicas
> >
> > Can you explain how the reassignment will affect this state? As the target replicas are coming into sync, they will be added to the ISR. We wouldn't want it to be the case that the ISR includes replicas which are not currently assigned. So is the replica set the union of the current and target replicas?
>
> I think we should leave the targetReplicas out of the normal replica set in the client metadata. The client code that is checking the replica set shouldn't be treating target replicas as if they were regular replicas. For example, you wouldn't look at a partition with 3 replicas and 2 target replicas, and an ISR of size 3, and think that it was under-replicated.
>
> I agree that there is kind of an awkward edge case right when a targetReplica enters the ISR. Since the replica set and the ISR reside in two separate znodes, they can't be updated atomically (well, maybe they could, but I don't think we will in practice). When the controller becomes aware that the ISR has changed for the reassigning partition, it will remove the relevant replica from targetReplicas and add it to replicas. This is another ZK write, of course. I don't think this is really a problem in practice because when we're creating the client metadata message, we can simply enforce that anything in the ISR must be in the replica set and not in the targetReplica set.
>
> best,
> Colin
>
> > Thanks,
> > Jason
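The enforcement Colin describes (every ISR member must appear in the replica set and never in the target replica set when client metadata is built) could look roughly like the following. This is a hedged sketch with made-up helper names, not code from the KIP or the controller.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helpers: derive the replica lists exposed in client metadata so
// that an in-sync target replica is reported as a normal replica, even if the
// controller has not yet persisted that move to ZooKeeper.
final class MetadataReplicaSets {

    // Replicas shown to clients: the assigned replicas plus anything in the ISR.
    static List<Integer> effectiveReplicas(List<Integer> replicas, List<Integer> isr) {
        Set<Integer> result = new LinkedHashSet<>(replicas);
        result.addAll(isr);
        return new ArrayList<>(result);
    }

    // Target replicas shown to clients: drop anything that already joined the ISR.
    static List<Integer> effectiveTargetReplicas(List<Integer> targetReplicas, List<Integer> isr) {
        List<Integer> result = new ArrayList<>(targetReplicas);
        result.removeAll(isr);
        return result;
    }
}
```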
> > On Wed, Jun 26, 2019 at 10:42 AM Colin McCabe <cmcc...@apache.org> wrote:
> >
> > > On Tue, Jun 25, 2019, at 18:37, Jason Gustafson wrote:
> > > > Hi Colin,
> > > >
> > > > Took another pass on the KIP. Looks good overall. A few questions below:
> > > >
> > > > 1. I wasn't clear why `currentReplicas` is an optional field. Wouldn't we always have a current set of replicas?
> > >
> > > Good point. When I wrote that I was trying to use the same structure for both requesting a new reassignment, and describing an existing one. I just posted a new version which uses separate structures for these. (CurrentReplicas is only relevant when you are listing an existing assignment.)
> > >
> > > > 2. Seems the only option is to list all active partition reassignments? I think we have tended to regret these APIs. At least should we be able to specify a subset of topics or partitions perhaps?
> > >
> > > I guess the thought process here is that most reassignment tools want to know about all the reassignments that are going on. If you don't know all the pending reassignments, then it's hard to say whether adding a new one is a good idea, or cancelling an existing one. So I guess I can't think of a case where a reassignment tool would want a partial set rather than the full one.
> > >
> > > It is kind of unfortunate to be passing all this information to some external process, but that's kind of inherent in the idea of reassignment as separate from the controller. There's normally only one or zero processes doing and monitoring reassignments, so it's not as bad as thousands of clients sending full metadata requests. It's probably OK?
> > >
> > > > 3. Can you elaborate a bit on the handling of /admin/reassign_partitions? Does this alter the target_replicas of the leader and ISR znode?
> > >
> > > When the /admin/reassign_partitions znode is empty, we'll listen for updates. When an update is made, we'll treat it like a call to AlterPartitionReassignments. Whenever we have zero pending reassignments, we'll delete the /admin/reassign_partitions znode. If the znode already exists, we don't listen on it (same as now).
> > >
> > > One thing to note here is that if you upgrade from a pre-KIP-455 version to a KIP-455 version, and you invoke AlterPartitionReassignments before the cluster roll is finished, the new controller won't write the reassignment to /admin/reassign_partitions. So if the controller node bounces to an un-upgraded broker during the roll, the reassignment will halt for that time period. This is OK because a cluster roll should be a relatively short amount of time -- and also, probably not a time you want to be doing reassignments anyway because of the extra load on the cluster.
> > >
> > > > 4. I think it would be helpful to provide an example of the rebalance process for a given partition. Specifically I am wondering whether the replica set is updated incrementally or if we follow the current behavior. Possibly some decisions can be deferred to implementation, but it would be helpful to work through a case of changing the replication factor just to make sure there are reasonable options.
> > >
> > > Good question. It will be the current behavior. Basically, we immediately try to replicate to all the targetReplicas. As they enter the ISR, they leave "targetReplicas" and enter "replicas." Making this more incremental would be a good follow-on improvement.
> > >
> > > > 5. Are we changing the semantics of the URP and UnderMinIsr metrics in this KIP or in a follow-up?
> > >
> > > In a follow-up.
> > >
> > > > 6. We have both "TargetBrokers" and "PendingReplicas" as names in the proposal. Perhaps we should try to be consistent?
> > >
> > > Fixed. I set it to TargetReplicas for now.
> > >
> > > > 7. I am not sure specifying `targetReplicas` as empty is the clearest way to cancel a reassignment. Whether we implement it this way or not in the protocol is a separate issue, but perhaps we should have an explicit `cancelReassignment` method in AdminClient?
> > >
> > > I changed the API slightly to take an Optional<NewPartitionReassignment>. Does that seem cleaner?
> > >
> > > best,
> > > Colin
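To make the Optional-based cancellation concrete: under the same assumed Admin API as in the sketches above, mapping a partition to an empty Optional would cancel its pending reassignment. Names and the topic are hypothetical.

```java
import java.util.Collections;
import java.util.Optional;
import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.NewPartitionReassignment;
import org.apache.kafka.common.TopicPartition;

public class CancelReassignmentExample {
    // Cancel the pending reassignment of foo-0 by mapping it to Optional.empty().
    static void cancel(Admin admin) throws Exception {
        admin.alterPartitionReassignments(
                Collections.singletonMap(
                    new TopicPartition("foo", 0),
                    Optional.<NewPartitionReassignment>empty()))
            .all()
            .get();
    }
}
```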
> > > > Thanks,
> > > > Jason
> > > >
> > > > On Wed, Jun 19, 2019 at 3:36 AM Stanislav Kozlovski <stanis...@confluent.io> wrote:
> > > >
> > > > > Hey there Colin,
> > > > >
> > > > > Thanks for the work on this KIP. It is a much-needed improvement and I'm excited to see it. Sorry for coming in so late to the discussion, I have one question to better understand the change and a small suggestion.
> > > > >
> > > > > I see we allow reassignment cancellation at the partition level - what is the motivation behind that? I think that having the back-end structures support it is a good choice since it allows us more flexibility in the future but what are the reasons for allowing a user to cancel at a partition level? I think allowing it might let users shoot themselves in the foot easier and make tools harder to implement (needing to guard against it).
> > > > >
> > > > > In all regards, what do you think about an ease of use improvement where we allow a user to cancel all reassignments for a topic without specifying its partitions? Essentially, we could cancel all reassignments for a topic if the Partitions field in AlterPartitionAssignmentsRequest is null.
> > > > >
> > > > > Best,
> > > > > Stanislav
> > > > >
> > > > > On Mon, May 6, 2019 at 5:42 PM Colin McCabe <cmcc...@apache.org> wrote:
> > > > >
> > > > > > On Mon, May 6, 2019, at 07:39, Ismael Juma wrote:
> > > > > > > Hi Colin,
> > > > > > >
> > > > > > > A quick comment.
> > > > > > >
> > > > > > > On Sat, May 4, 2019 at 11:18 PM Colin McCabe <cmcc...@apache.org> wrote:
> > > > > > >
> > > > > > > > The big advantage of doing batching on the controller is that the controller has more information about what is going on in the cluster. So it can schedule reassignments in a more optimal way. For instance, it can schedule reassignments so that the load is distributed evenly across nodes. This advantage is lost if we have to adhere to a rigid ordering that is set up in advance. We don't know exactly when anything will complete in any case. Just because one partition reassignment was started before another doesn't mean it will finish before another.
> > > > > > >
> > > > > > > This is not quite true, right? The Controller doesn't know about partition sizes, throughput per partition and other such information that external tools like Cruise Control track.
> > > > > >
> > > > > > Hi Ismael,
> > > > > >
> > > > > > That's a good point, and one I should have included.
> > > > > > I guess when I think about "do batching in the controller" versus "do batching in an external system" I tend to think about the information the controller could theoretically collect, rather than what it actually does :) But certainly, adding this information to the controller would be a significant change, and maybe one we don't want to do if the external systems work well enough.
> > > > > >
> > > > > > Thinking about this a little bit more, I can see three advantages to controller-side batching. Firstly, doing batching in the controller saves memory because we don't use a separate JVM, and don't duplicate the in-memory map of all the partitions. Secondly, the information we're acting on would also be more up-to-date. (I'm not sure how important this would be.) Finally, it's one less thing to deploy. I don't know if those are really enough to motivate switching now, but in a greenfield system I would probably choose controller-side rebalancing.
> > > > > >
> > > > > > In any case, this KIP is orthogonal to controller-side rebalancing versus external rebalancing. That's why the KIP states that we will continue to perform all the given partition rebalances immediately. I was just responding to the idea that maybe we should have an "ordering" of rebalancing partitions. I don't think we want that, for controller-side rebalancing or externally batched rebalancing.
> > > > > >
> > > > > > best,
> > > > > > Colin
> > > > >
> > > > > --
> > > > > Best,
> > > > > Stanislav