Hey there everybody,

I've edited the KIP. Here is a diff of the changes:
https://cwiki.apache.org/confluence/pages/diffpagesbyversion.action?pageId=112820260&selectedPageVersions=13&selectedPageVersions=11

Specifically:
- a new Admin API for listing ongoing reassignments for _a given list of topics_
- made AlterPartitionReassignmentsRequest#Topics and AlterPartitionReassignmentsRequest#Partitions nullable fields, in order to make the API more flexible for large cancellations
- made ListPartitionReassignmentsRequest#PartitionIds nullable as well, to support listing every reassigning partition of a given topic without needing to specify each partition
- explicitly defined what a reassignment cancellation means
- laid out the algorithm for changing the state in ZK (this was spelled out by Colin in the previous thread)
- mentioned the expected behavior when reassigning during software upgrades/downgrades
- mentioned the known edge case of using both APIs at once during a controller failover
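Since the cancellation semantics come up repeatedly in this thread, here is a plain-Java sketch of the idea (no Kafka dependencies; only `targetReplicas` is the KIP's term, every other name here is illustrative, not actual Kafka code): because the target replicas are tracked separately from the original replica set until the reassignment completes, a cancellation can simply drop the target and leave the partition on its original replicas.

```java
import java.util.*;

// Plain-Java sketch of the per-partition reassignment state discussed in the
// KIP. Only `targetReplicas` follows the KIP's terminology; the class and
// method names are illustrative assumptions, not Kafka code.
public class PartitionReassignmentSketch {
    List<Integer> replicas;              // original replica set, kept intact
    List<Integer> targetReplicas = null; // non-null while a reassignment is ongoing

    PartitionReassignmentSketch(List<Integer> replicas) {
        this.replicas = new ArrayList<>(replicas);
    }

    void startReassignment(List<Integer> target) {
        targetReplicas = new ArrayList<>(target);
    }

    // Because the original replicas are preserved until completion, a
    // cancellation just drops the target and restores the pre-reassignment state.
    void cancel() {
        targetReplicas = null;
    }

    // Completion: the target set becomes the replica set.
    void complete() {
        replicas = targetReplicas;
        targetReplicas = null;
    }

    public static void main(String[] args) {
        PartitionReassignmentSketch p = new PartitionReassignmentSketch(List.of(1, 2, 3));
        p.startReassignment(List.of(4, 5, 6));
        p.cancel();
        System.out.println(p.replicas); // [1, 2, 3] -- original set intact after cancel
    }
}
```

This also illustrates George's rollback point below: as long as the original replicas are kept somewhere until completion, cancelling mid-flight is always well-defined.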
I look forward to feedback.

Hey George,

> Regardless of KIP-236 or KIP-455, I would like to stress the importance of
> keeping the original replicas info before reassignments are kicked off. This
> original replicas info will allow us to distinguish what replicas are
> currently being reassigned, so we can roll back to its original state.
> Also, this opens up the possibility to separate the reassignment
> ReplicaFetcher traffic from the normal follower traffic, as well as the
> metrics reporting URP, MaxLag, TotalLag, etc. Right now, reassignment
> traffic and normal follower traffic share the same ReplicaFetcher thread pool.

Thanks for the reminder. A lot of your suggestions are outlined in the "Future Work" section of KIP-455.

The pointer towards different ReplicaFetcher thread pools is interesting -- do you think there's much value in that? My intuition is that having appropriate quotas for the reassignment traffic is the better way to separate the two, whereas a separate thread pool might provide less of a benefit.

With regards to keeping the original replicas info before reassignments are kicked off - this KIP proposes that we store the `targetReplicas` in a separate collection and thus preserve the original replicas info until the reassignment is fully complete. That should allow you to implement rollback functionality. Please take a look at the KIP and confirm whether that is the case. It would be good to synergize both KIPs.

Thanks,
Stanislav

On Tue, Jul 9, 2019 at 12:43 AM George Li <sql_consult...@yahoo.com.invalid> wrote:

> > Now that we support multiple reassignment requests, users may execute
> > them incrementally. Suppose something goes horribly wrong and they want to
> > revert as quickly as possible - they would need to run the tool with
> > multiple rollback JSONs. I think that it would be useful to have an easy
> > way to stop all ongoing reassignments for emergency situations.
> KIP-236: Interruptible Partition Reassignment is exactly trying to cancel
> the pending reassignments cleanly/safely in a timely fashion. It's possible
> to cancel/roll back the reassignments not yet completed if the original
> replicas before the reassignment are saved somewhere, e.g. the
> /admin/reassign_partitions znode or the Controller's ReassignmentContext
> memory struct.
>
> I think a command line option like "kafka-reassign-partitions.sh --cancel"
> would make it easier for the user to cancel whatever pending reassignments
> are going on right now, with no need to find the rollback JSON files and
> re-submit them as reassignments.
>
> Regardless of KIP-236 or KIP-455, I would like to stress the importance of
> keeping the original replicas info before reassignments are kicked off. This
> original replicas info will allow us to distinguish what replicas are
> currently being reassigned, so we can roll back to its original state.
> Also, this opens up the possibility to separate the reassignment
> ReplicaFetcher traffic from the normal follower traffic, as well as the
> metrics reporting URP, MaxLag, TotalLag, etc. Right now, reassignment
> traffic and normal follower traffic share the same ReplicaFetcher thread pool.
>
> Thanks,
> George
>
> On Tuesday, July 2, 2019, 10:47:55 AM PDT, Stanislav Kozlovski
> <stanis...@confluent.io> wrote:
>
> Hey there, I need to start a new thread on KIP-455. I think there might be
> an issue with the mailing server. For some reason, my replies to the
> previous discussion thread could not be seen by others. After numerous
> attempts, Colin suggested I start a new thread.
>
> Original Discussion Thread:
> https://sematext.com/opensee/m/Kafka/uyzND1Yl7Er128CQu1?subj=+DISCUSS+KIP+455+Create+an+Administrative+API+for+Replica+Reassignment
> KIP:
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-455%3A+Create+an+Administrative+API+for+Replica+Reassignment
> Last Reply of Previous Thread:
> http://mail-archives.apache.org/mod_mbox/kafka-dev/201906.mbox/%3C679a4c5b-3da6-4556-bb89-e680d8cbb705%40www.fastmail.com%3E
>
> The following is my reply:
> ----
> Hi again,
>
> This has been a great discussion on a tricky KIP. I appreciate everybody's
> involvement in improving this crucial API. That being said, I wanted to
> apologize for my first comment; it was a bit rushed and not thought out.
>
> I've got a few questions now that I've dug into this more deeply:
>
> 1. Does it make sense to have an easy way to cancel all ongoing
> reassignments? To cancel all ongoing reassignments, users previously had
> the crude option of deleting the znode, bouncing the controller and running
> the rollback JSON assignment that kafka-reassign-partitions.sh gave them
> (KAFKA-6304). Now that we support multiple reassignment requests, users may
> execute them incrementally. Suppose something goes horribly wrong and they
> want to revert as quickly as possible - they would need to run the tool
> with multiple rollback JSONs. I think that it would be useful to have an
> easy way to stop all ongoing reassignments for emergency situations.
>
> ---------
>
> 2. Our kafka-reassign-partitions.sh tool doesn't currently let you figure
> out the ongoing reassignments - I guess we expect people to use
> kafka-topics.sh for that. I am not sure how well that would continue to
> work now that we update the replica set only after the new replica joins
> the ISR. Do you think it makes sense to add an option for listing the
> current reassignments to the reassign tool as part of this KIP?
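As a sketch of what such a listing option could build on, here is a plain-Java illustration of topic-level filtering over ongoing reassignments. The `Ongoing` type, its fields, and `forTopics` are all assumptions made for illustration, not the KIP's actual schema or any Kafka API:

```java
import java.util.*;
import java.util.stream.*;

// Sketch of how a tool like kafka-reassign-partitions.sh might consume a
// topic-filtered listing of ongoing reassignments. All names here are
// illustrative assumptions, not the KIP's final ListPartitionReassignments schema.
public class ListReassignmentsSketch {
    static final class Ongoing {
        final String topic;
        final int partition;
        final List<Integer> targetReplicas;
        Ongoing(String topic, int partition, List<Integer> targetReplicas) {
            this.topic = topic;
            this.partition = partition;
            this.targetReplicas = targetReplicas;
        }
    }

    // Topic-level filtering: return only the ongoing reassignments for the
    // topics the user asked the tool about.
    static List<Ongoing> forTopics(List<Ongoing> all, Set<String> topics) {
        return all.stream()
                  .filter(o -> topics.contains(o.topic))
                  .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<Ongoing> all = List.of(
                new Ongoing("foo", 0, List.of(4, 5, 6)),
                new Ongoing("bar", 0, List.of(7, 8, 9)));
        List<Ongoing> filtered = forTopics(all, Set.of("foo"));
        System.out.println(filtered.size());       // 1
        System.out.println(filtered.get(0).topic); // foo
    }
}
```

The point is only that a tool asked about a list of topics never needs the cluster-wide set, which is why topic-level filtering in the listing API seems worth supporting.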
>
> We might want to think about whether we want to show the TargetReplicas
> information in the kafka-topics command for completeness as well. That
> might involve updating the DescribeTopicsResponse. Personally I can't see
> a downside, but I haven't given it too much thought. I fully agree that we
> don't want to add the target replicas to the full replica set - nothing
> useful comes out of telling users they have a replica that might not have
> copied a single byte. Yet, telling them that we have the intention of
> copying bytes sounds useful, so maybe having a separate column in
> kafka-topics.sh would provide better clarity?
>
> ---------
>
> 3. What happens if we do another reassignment to a partition while one is
> in progress? Do we overwrite the TargetReplicas?
> In the example sequence you gave:
> R: [1, 2, 3, 4, 5, 6], I: [1, 2, 3, 4, 5, 6], T: [4, 5, 6]
> What would the behavior be if a new reassign request came in with
> TargetReplicas of [7, 8, 9] for that partition?
>
> To avoid complexity and potential race conditions, would it make sense to
> reject a reassign request once one is in progress for the specific
> partition, essentially forcing the user to cancel it first? Forcing the
> user to cancel has the benefit of being explicit and guarding against
> human mistakes. The downside I can think of is that in some scenarios it
> might be inefficient, e.g.:
> R: [1, 2, 3, 4, 5, 6], I: [1, 2, 3, 4, 5, 6], T: [4, 5, 6]
> A cancel request is sent out, followed by a new reassign request with
> TargetReplicas of [5, 6, 7] (note that 5 and 6 have already fully copied
> the partition). It becomes a bit of a race condition whether we deleted
> the partitions in between the requests or not - I assume in practice this
> won't be an issue. I still feel like I prefer the explicit cancellation
> step.
>
> ---------
>
> 4.
> My biggest concern - I want to better understand the interaction between
> the new API and the current /admin/reassign_partitions znode, the
> compatibility story and our strategy there. The KIP says:
>
> > For compatibility purposes, we will continue to allow assignments to be
> > submitted through the /admin/reassign_partitions node. Just as with the
> > current code, this will only be possible if there are no current
> > assignments. In other words, the znode has two states: empty and waiting
> > for a write, and non-empty because there are assignments in progress.
> > Once the znode is non-empty, further writes to it will be ignored.
>
> Given the current proposal, I can think of 5 scenarios I want to get a
> better understanding of:
>
> *(i, ii, iii, iiii talk about the reassignment of the same single
> partition only - partitionA)*
>
> i. The znode is empty, a new reassignment is triggered via the API, the
> znode is updated. When the new reassignment is triggered via the API, do
> we create the znode, or do we allow a separate tool to trigger another
> reassignment through it?
>
> ii. (assuming we allow creating the znode as in scenario "i"): the znode
> is empty, a new reassignment is triggered via the API, the znode is
> updated, the znode is DELETED. My understanding is that deleting the znode
> does not do anything until the Controller is bounced - is that correct?
> If so, this means that nothing will happen. If the Controller is bounced,
> the reassignment state will still be live in the [partitionId]/state znode.
>
> iii. The znode is updated, a new reassignment is triggered via the API.
> We override the reassignment for partitionA. The reassign_partitions znode
> is then showing stale data, correct?
>
> iiii. The znode is updated, a new reassignment is triggered via the API,
> then a controller failover happens. What does the controller believe - the
> [partitionId]/state znode or /reassign_partitions? I would assume the
> [partitionId]/state znode, since in this case we want the reassignment API
> call to be the correct one. I think that opens up the possibility of
> missing a freshly-set /reassign_partitions though (e.g. if it was empty
> and was set right during the controller failover).
>
> iiiii. The znode is updated to move partitionA, a new reassignment is
> triggered via the API for partitionB, and the partitionA move finishes.
> At this point, do we delete the znode, or do we wait until the partitionB
> move finishes as well?
>
> From the discussion here:
>
> > There's no guarantee that what is in the znode reflects the current
> > reassignments that are going on. The only thing you can know is that if
> > the znode exists, there is at least one reassignment going on.
>
> This is changing the expected behavior of a tool that obeys Kafka's
> current behavior, though. It is true that updating the znode while a
> reassignment is in progress has no effect other than making ZK misleading,
> but tools might have grown to follow that rule and only update the znode
> once it is empty. I think we might want to be more explicit when making
> such changes - I have seen discontent in the community over the fact that
> we changed the znode updating behavior in a MINOR pull request.
>
> I feel it is complex to support both APIs and make sure we don't have
> unhandled edge cases. I liked Bob's suggestion of potentially allowing
> only one via a feature flag:
>
> > Could we temporarily support both, with a config enabling the new
> > behavior to prevent users from trying to use both mechanisms (if the
> > config is true, the old znode is ignored; if the config is false, the
> > Admin Client API returns an error indicating that it is not enabled)?
>
> Perhaps it makes sense to discuss that possibility a bit more?
>
> ---------
>
> 5. ListPartitionReassignments filtering
>
> > I guess the thought process here is that most reassignment tools want to
> > know about all the reassignments that are going on.
> > If you don't know all the pending reassignments, then it's hard to say
> > whether adding a new one is a good idea, or cancelling an existing one.
> > So I guess I can't think of a case where a reassignment tool would want
> > a partial set rather than the full one.
>
> I agree with Jason about the UIs having "drill into" options. I believe we
> should support some sort of ongoing-reassignment filtering at the topic
> level (that's the administrative concept people care about).
> An example of a tool that might leverage it is our own
> kafka-reassign-partitions.sh. You can ask that tool to generate a
> reassignment for you from a given list of topics. It currently uses
> `KafkaZkClient#getReplicaAssignmentForTopics()` to get the current
> assignment for the given topics. It would be better if it could use the
> new ListPartitionReassignments API to both figure out the current replica
> assignments and whether or not those topics are being reassigned (it could
> log a warning that a reassignment is in progress for those topics).
>
> ---------
>
> And a small nit: we also need to update the
> ListPartitionReassignmentsResponse with the decided current/targetReplicas
> naming.
>
> Thanks,
> Stanislav

--
Best,
Stanislav