[ https://issues.apache.org/jira/browse/KUDU-3487?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Alexey Serbin resolved KUDU-3487.
---------------------------------
    Fix Version/s: 1.17.0
       Resolution: Fixed

Thank you for reporting and addressing the issue, [~Song Jiacheng]!

> Rebalancer: Balance for 1 replication factor tablet might stuck for leader step down too early
> ----------------------------------------------------------------------------------------------
>
>                 Key: KUDU-3487
>                 URL: https://issues.apache.org/jira/browse/KUDU-3487
>             Project: Kudu
>          Issue Type: Bug
>    Affects Versions: 1.14.0
>            Reporter: Song Jiacheng
>            Priority: Major
>             Fix For: 1.17.0
>
>         Attachments: Fix_a_bug_that_replace_balance_for_1_replication_factor_tablet_might_stuck_for_leader_step.patch, image-2023-07-25-15-04-37-930.png, image-2023-07-25-15-11-16-505.png, image-2023-07-25-15-11-55-381.png
>
>
> The function CheckCompleteReplace in the replace rebalancing logic tries to
> make the leader step down if the replica that should be removed is the
> leader. This can get stuck for a while if the replication factor of the
> table is 1, since there is no other voter to transfer leadership to.
> So the fix is to make sure the number of voters of the tablet is greater
> than 1 before sending the LeaderStepDown request.
> Here's an example:
> I executed the following commands to move all the tablets off a tablet
> server:
> kudu tserver state enter_maintenance ta1 f853d8ab20344c23826716c67fb13ebe
> kudu cluster rebalance master1,master2,master3 -ignored_tservers f853d8ab20344c23826716c67fb13ebe -move_replicas_from_ignored_tservers
> The rebalancer then gets stuck on a certain tablet for a while. In this
> case it was stuck for more than 10 minutes.
> !image-2023-07-25-15-04-37-930.png!
> The reason is that the tablet's leader steps down too early and stays in
> the leader_transfer_in_progress_ status. The master then tries to send a
> change config request to add a peer, but the tablet server refuses it
> because of the leader_transfer_in_progress_ status.
> !image-2023-07-25-15-11-16-505.png!
> !image-2023-07-25-15-11-55-381.png!
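The guard described in the report (only ask the leader to step down when the tablet has more than one voter) can be sketched as follows. This is a minimal illustration in Python, not the Kudu C++ code or the attached patch: the names Replica, MemberType and should_request_leader_step_down are hypothetical and exist only for this example.

```python
from dataclasses import dataclass
from enum import Enum


class MemberType(Enum):
    """Hypothetical stand-in for a Raft member type (not Kudu's actual enum)."""
    VOTER = "VOTER"
    NON_VOTER = "NON_VOTER"


@dataclass(frozen=True)
class Replica:
    """Hypothetical, simplified view of one tablet replica."""
    uuid: str
    member_type: MemberType
    is_leader: bool = False


def should_request_leader_step_down(replicas, replica_to_remove_uuid):
    """Return True only if a step-down request makes sense.

    A step-down is requested when the replica slated for removal is the
    current leader AND the tablet has more than one voter, so that there
    is another voter to take over leadership.
    """
    target = next((r for r in replicas if r.uuid == replica_to_remove_uuid), None)
    if target is None or not target.is_leader:
        # Nothing to do: the replica is unknown or is not the leader.
        return False
    voter_count = sum(1 for r in replicas if r.member_type is MemberType.VOTER)
    # With a single voter (replication factor 1) there is nobody to hand
    # leadership to, so the request would only leave the tablet in a
    # transfer-in-progress state and block the following config change.
    return voter_count > 1
```

With this check, a replication factor 1 tablet skips the step-down entirely, so the later request to add a peer is not refused because of a pending leadership transfer. A tablet with three voters still steps down as before.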
-- This message was sent by Atlassian Jira (v8.20.10#820010)