Hey there everybody,

I've edited the KIP. Here is a diff of the changes:
https://cwiki.apache.org/confluence/pages/diffpagesbyversion.action?pageId=112820260&selectedPageVersions=13&selectedPageVersions=11

Specifically:
- a new Admin API for listing ongoing reassignments for _a given list of topics_
- made AlterPartitionReassignmentsRequest#Topics and AlterPartitionReassignmentsRequest#Partitions nullable fields, in order to make the API more flexible for large cancellations
- made ListPartitionReassignmentsRequest#PartitionIds nullable as well, to support listing every reassigning partition of a given topic without needing to specify each partition
- explicitly defined what a reassignment cancellation means
- laid out the algorithm for changing the state in ZK (this was spelled out by Colin in the previous thread)
- mentioned the expected behavior when reassigning during software upgrades/downgrades
- mentioned the known edge case of using both APIs at once during a controller failover
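Since the cancellation semantics come up repeatedly in this thread, here is a plain-Java sketch of the idea (no Kafka dependencies; only `targetReplicas` is the KIP's term, every other name here is illustrative, not actual Kafka code): because the target replicas are tracked separately from the original replica set until the reassignment completes, a cancellation can simply drop the target and leave the partition on its original replicas.

```java
import java.util.*;

// Plain-Java sketch of the per-partition reassignment state discussed in the
// KIP. Only `targetReplicas` follows the KIP's terminology; the class and
// method names are illustrative assumptions, not Kafka code.
public class PartitionReassignmentSketch {
    List<Integer> replicas;              // original replica set, kept intact
    List<Integer> targetReplicas = null; // non-null while a reassignment is ongoing

    PartitionReassignmentSketch(List<Integer> replicas) {
        this.replicas = new ArrayList<>(replicas);
    }

    void startReassignment(List<Integer> target) {
        targetReplicas = new ArrayList<>(target);
    }

    // Because the original replicas are preserved until completion, a
    // cancellation just drops the target and restores the pre-reassignment state.
    void cancel() {
        targetReplicas = null;
    }

    // Completion: the target set becomes the replica set.
    void complete() {
        replicas = targetReplicas;
        targetReplicas = null;
    }

    public static void main(String[] args) {
        PartitionReassignmentSketch p = new PartitionReassignmentSketch(List.of(1, 2, 3));
        p.startReassignment(List.of(4, 5, 6));
        p.cancel();
        System.out.println(p.replicas); // [1, 2, 3] -- original set intact after cancel
    }
}
```

This also illustrates George's rollback point below: as long as the original replicas are kept somewhere until completion, cancelling mid-flight is always well-defined.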
I look forward to feedback.

Hey George,

> Regardless of KIP-236 or KIP-455, I would like to stress the importance of
> keeping the original replicas info before reassignments are kicked off. This
> original replicas info will allow us to distinguish what replicas are
> currently being reassigned, so we can roll back to its original state.
> Also, this opens up the possibility to separate the reassignment
> ReplicaFetcher traffic from the normal follower traffic, as well as the
> metrics reporting URP, MaxLag, TotalLag, etc. Right now, reassignment
> traffic and normal follower traffic share the same ReplicaFetcher thread pool.

Thanks for the reminder. A lot of your suggestions are outlined in the "Future Work" section of KIP-455.

The pointer towards different ReplicaFetcher thread pools is interesting -- do you think there's much value in that? My intuition is that having appropriate quotas for the reassignment traffic is the better way to separate the two, whereas a separate thread pool might provide less of a benefit.

With regards to keeping the original replicas info before reassignments are kicked off - this KIP proposes that we store the `targetReplicas` in a separate collection and thus preserve the original replicas info until the reassignment is fully complete. That should allow you to implement rollback functionality. Please take a look at the KIP and confirm whether that is the case. It would be good to synergize both KIPs.

Thanks,
Stanislav

On Tue, Jul 9, 2019 at 12:43 AM George Li <sql_consult...@yahoo.com.invalid> wrote:

> > Now that we support multiple reassignment requests, users may execute
> > them incrementally. Suppose something goes horribly wrong and they want to
> > revert as quickly as possible - they would need to run the tool with
> > multiple rollback JSONs. I think that it would be useful to have an easy
> > way to stop all ongoing reassignments for emergency situations.
> KIP-236: Interruptible Partition Reassignment is exactly trying to cancel
> the pending reassignments cleanly/safely in a timely fashion. It's possible
> to cancel/roll back the reassignments not yet completed if the original
> replicas before the reassignment are saved somewhere, e.g. the
> /admin/reassign_partitions znode or the Controller's ReassignmentContext
> memory struct.
>
> I think a command line option like "kafka-reassign-partitions.sh --cancel"
> would make it easier for the user to cancel whatever pending reassignments
> are going on right now, with no need to find the rollback JSON files and
> re-submit them as reassignments.
>
> Regardless of KIP-236 or KIP-455, I would like to stress the importance of
> keeping the original replicas info before reassignments are kicked off. This
> original replicas info will allow us to distinguish what replicas are
> currently being reassigned, so we can roll back to its original state.
> Also, this opens up the possibility to separate the reassignment
> ReplicaFetcher traffic from the normal follower traffic, as well as the
> metrics reporting URP, MaxLag, TotalLag, etc. Right now, reassignment
> traffic and normal follower traffic share the same ReplicaFetcher thread pool.
>
> Thanks,
> George
>
> On Tuesday, July 2, 2019, 10:47:55 AM PDT, Stanislav Kozlovski
> <stanis...@confluent.io> wrote:
>
> Hey there, I need to start a new thread on KIP-455. I think there might be
> an issue with the mailing server. For some reason, my replies to the
> previous discussion thread could not be seen by others. After numerous
> attempts, Colin suggested I start a new thread.
>
> Original Discussion Thread:
> https://sematext.com/opensee/m/Kafka/uyzND1Yl7Er128CQu1?subj=+DISCUSS+KIP+455+Create+an+Administrative+API+for+Replica+Reassignment
> KIP:
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-455%3A+Create+an+Administrative+API+for+Replica+Reassignment
> Last Reply of Previous Thread:
> http://mail-archives.apache.org/mod_mbox/kafka-dev/201906.mbox/%3C679a4c5b-3da6-4556-bb89-e680d8cbb705%40www.fastmail.com%3E
>
> The following is my reply:
> ----
> Hi again,
>
> This has been a great discussion on a tricky KIP. I appreciate everybody's
> involvement in improving this crucial API. That being said, I wanted to
> apologize for my first comment; it was a bit rushed and not thought out.
>
> I've got a few questions now that I've dug into this more deeply:
>
> 1. Does it make sense to have an easy way to cancel all ongoing
> reassignments? To cancel all ongoing reassignments, users previously had
> the crude option of deleting the znode, bouncing the controller and running
> the rollback JSON assignment that kafka-reassign-partitions.sh gave them
> (KAFKA-6304). Now that we support multiple reassignment requests, users may
> execute them incrementally. Suppose something goes horribly wrong and they
> want to revert as quickly as possible - they would need to run the tool
> with multiple rollback JSONs. I think that it would be useful to have an
> easy way to stop all ongoing reassignments for emergency situations.
>
> ---------
>
> 2. Our kafka-reassign-partitions.sh tool doesn't currently let you figure
> out the ongoing reassignments - I guess we expect people to use
> kafka-topics.sh for that. I am not sure how well that would continue to
> work now that we update the replica set only after the new replica joins
> the ISR. Do you think it makes sense to add an option for listing the
> current reassignments to the reassign tool as part of this KIP?
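As a sketch of what such a listing option could build on, here is a plain-Java illustration of topic-level filtering over ongoing reassignments. The `Ongoing` type, its fields, and `forTopics` are all assumptions made for illustration, not the KIP's actual schema or any Kafka API:

```java
import java.util.*;
import java.util.stream.*;

// Sketch of how a tool like kafka-reassign-partitions.sh might consume a
// topic-filtered listing of ongoing reassignments. All names here are
// illustrative assumptions, not the KIP's final ListPartitionReassignments schema.
public class ListReassignmentsSketch {
    static final class Ongoing {
        final String topic;
        final int partition;
        final List<Integer> targetReplicas;
        Ongoing(String topic, int partition, List<Integer> targetReplicas) {
            this.topic = topic;
            this.partition = partition;
            this.targetReplicas = targetReplicas;
        }
    }

    // Topic-level filtering: return only the ongoing reassignments for the
    // topics the user asked the tool about.
    static List<Ongoing> forTopics(List<Ongoing> all, Set<String> topics) {
        return all.stream()
                  .filter(o -> topics.contains(o.topic))
                  .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<Ongoing> all = List.of(
                new Ongoing("foo", 0, List.of(4, 5, 6)),
                new Ongoing("bar", 0, List.of(7, 8, 9)));
        List<Ongoing> filtered = forTopics(all, Set.of("foo"));
        System.out.println(filtered.size());       // 1
        System.out.println(filtered.get(0).topic); // foo
    }
}
```

The point is only that a tool asked about a list of topics never needs the cluster-wide set, which is why topic-level filtering in the listing API seems worth supporting.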
>
> We might want to think about whether we want to show the TargetReplicas
> information in the kafka-topics command for completeness as well. That
> might involve updating the DescribeTopicsResponse. Personally I can't see
> a downside, but I haven't given it too much thought. I fully agree that we
> don't want to add the target replicas to the full replica set - nothing
> useful comes out of telling users they have a replica that might not have
> copied a single byte. Yet, telling them that we have the intention of
> copying bytes sounds useful, so maybe having a separate column in
> kafka-topics.sh would provide better clarity?
>
> ---------
>
> 3. What happens if we do another reassignment to a partition while one is
> in progress? Do we overwrite the TargetReplicas?
> In the example sequence you gave:
> R: [1, 2, 3, 4, 5, 6], I: [1, 2, 3, 4, 5, 6], T: [4, 5, 6]
> What would the behavior be if a new reassign request came in with
> TargetReplicas of [7, 8, 9] for that partition?
>
> To avoid complexity and potential race conditions, would it make sense to
> reject a reassign request once one is in progress for the specific
> partition, essentially forcing the user to cancel it first? Forcing the
> user to cancel has the benefit of being explicit and guarding against
> human mistakes. The downside I can think of is that in some scenarios it
> might be inefficient, e.g.:
> R: [1, 2, 3, 4, 5, 6], I: [1, 2, 3, 4, 5, 6], T: [4, 5, 6]
> A cancel request is sent out, followed by a new reassign request with
> TargetReplicas of [5, 6, 7] (note that 5 and 6 have already fully copied
> the partition). It becomes a bit of a race condition whether we deleted
> the partitions in between the requests or not - I assume in practice this
> won't be an issue. I still feel like I prefer the explicit cancellation
> step.
>
> ---------
>
> 4.
> My biggest concern - I want to better understand the interaction between
> the new API and the current /admin/reassign_partitions znode, the
> compatibility story and our strategy there. The KIP says:
>
> > For compatibility purposes, we will continue to allow assignments to be
> > submitted through the /admin/reassign_partitions node. Just as with the
> > current code, this will only be possible if there are no current
> > assignments. In other words, the znode has two states: empty and waiting
> > for a write, and non-empty because there are assignments in progress.
> > Once the znode is non-empty, further writes to it will be ignored.
>
> Given the current proposal, I can think of 5 scenarios I want to get a
> better understanding of:
>
> *(i, ii, iii, iiii talk about the reassignment of the same single
> partition only - partitionA)*
>
> i. The znode is empty, a new reassignment is triggered via the API, the
> znode is updated. When the new reassignment is triggered via the API, do
> we create the znode, or do we allow a separate tool to trigger another
> reassignment through it?
>
> ii. (assuming we allow creating the znode as in scenario "i"): the znode
> is empty, a new reassignment is triggered via the API, the znode is
> updated, the znode is DELETED. My understanding is that deleting the znode
> does not do anything until the Controller is bounced - is that correct?
> If so, this means that nothing will happen. If the Controller is bounced,
> the reassignment state will still be live in the [partitionId]/state znode.
>
> iii. The znode is updated, a new reassignment is triggered via the API.
> We override the reassignment for partitionA. The reassign_partitions znode
> is then showing stale data, correct?
>
> iiii. The znode is updated, a new reassignment is triggered via the API,
> then a controller failover happens. What does the controller believe - the
> [partitionId]/state znode or /reassign_partitions? I would assume the
> [partitionId]/state znode, since in this case we want the reassignment API
> call to be the correct one. I think that opens up the possibility of
> missing a freshly-set /reassign_partitions though (e.g. if it was empty
> and was set right during the controller failover).
>
> iiiii. The znode is updated to move partitionA, a new reassignment is
> triggered via the API for partitionB, and the partitionA move finishes.
> At this point, do we delete the znode, or do we wait until the partitionB
> move finishes as well?
>
> From the discussion here:
>
> > There's no guarantee that what is in the znode reflects the current
> > reassignments that are going on. The only thing you can know is that if
> > the znode exists, there is at least one reassignment going on.
>
> This is changing the expected behavior of a tool that obeys Kafka's
> current behavior, though. It is true that updating the znode while a
> reassignment is in progress has no effect other than making ZK misleading,
> but tools might have grown to follow that rule and only update the znode
> once it is empty. I think we might want to be more explicit when making
> such changes - I have seen discontent in the community over the fact that
> we changed the znode updating behavior in a MINOR pull request.
>
> I feel it is complex to support both APIs and make sure we don't have
> unhandled edge cases. I liked Bob's suggestion of potentially allowing
> only one via a feature flag:
>
> > Could we temporarily support both, with a config enabling the new
> > behavior to prevent users from trying to use both mechanisms (if the
> > config is true, the old znode is ignored; if the config is false, the
> > Admin Client API returns an error indicating that it is not enabled)?
>
> Perhaps it makes sense to discuss that possibility a bit more?
>
> ---------
>
> 5. ListPartitionReassignments filtering
>
> > I guess the thought process here is that most reassignment tools want to
> > know about all the reassignments that are going on.
> > If you don't know all the pending reassignments, then it's hard to say
> > whether adding a new one is a good idea, or cancelling an existing one.
> > So I guess I can't think of a case where a reassignment tool would want
> > a partial set rather than the full one.
>
> I agree with Jason about the UIs having "drill into" options. I believe we
> should support some sort of ongoing-reassignment filtering at the topic
> level (that's the administrative concept people care about).
> An example of a tool that might leverage it is our own
> kafka-reassign-partitions.sh. You can ask that tool to generate a
> reassignment for you from a given list of topics. It currently uses
> `KafkaZkClient#getReplicaAssignmentForTopics()` to get the current
> assignment for the given topics. It would be better if it could use the
> new ListPartitionReassignments API to both figure out the current replica
> assignments and whether or not those topics are being reassigned (it could
> log a warning that a reassignment is in progress for those topics).
>
> ---------
>
> And a small nit: we also need to update the
> ListPartitionReassignmentsResponse with the decided current/targetReplicas
> naming.
>
> Thanks,
> Stanislav

--
Best,
Stanislav