[ 
https://issues.apache.org/jira/browse/KUDU-3487?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17746820#comment-17746820
 ] 

Song Jiacheng edited comment on KUDU-3487 at 7/25/23 8:51 AM:
--------------------------------------------------------------

Hi, [~aserbin].

Could you please help me review this?

Thanks!


was (Author: song jiacheng):
Hi, [~aserbin].

Could you please help me review this?

> Rebalancer: Balance for 1 replication factor tablet might stuck for leader 
> step down too early
> ----------------------------------------------------------------------------------------------
>
>                 Key: KUDU-3487
>                 URL: https://issues.apache.org/jira/browse/KUDU-3487
>             Project: Kudu
>          Issue Type: Bug
>    Affects Versions: 1.14.0
>            Reporter: Song Jiacheng
>            Priority: Major
>         Attachments: 
> Fix_a_bug_that_replace_balance_for_1_replication_factor_tablet_might_stuck_for_leader_step.patch,
>  image-2023-07-25-15-04-37-930.png, image-2023-07-25-15-11-16-505.png, 
> image-2023-07-25-15-11-55-381.png
>
>
> The CheckCompleteReplace function in the replace rebalance logic tries to 
> make the leader step down if the replica to be removed is the leader. 
> However, this can get stuck for a while if the table's replication factor 
> is 1, since there is no other voter to transfer leadership to.
> The problem goes away if we make sure the tablet has more than one voter 
> before sending the LeaderStepDown request.
> Here's an example:
> I ran the following commands to move all the tablets off a tablet server:
> kudu tserver state enter_maintenance ta1 f853d8ab20344c23826716c67fb13ebe
> kudu cluster rebalance master1,master2,master3  -ignored_tservers 
> f853d8ab20344c23826716c67fb13ebe -move_replicas_from_ignored_tservers
> The rebalancer then gets stuck on a certain tablet. In this case it was 
> stuck for more than 10 minutes.
> !image-2023-07-25-15-04-37-930.png!
> The reason is that the tablet leader steps down too early and stays in the 
> leader_transfer_in_progress_ state. The master then tries to send a change 
> config request to add a peer, but the tablet server refuses it because of 
> the leader_transfer_in_progress_ state.
> !image-2023-07-25-15-11-16-505.png!
> !image-2023-07-25-15-11-55-381.png!
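
The guard proposed in the description (only request a step-down when the tablet has more than one voter) can be sketched as below. This is a simplified Python model for illustration, not Kudu's C++ code and not the attached patch: the names Replica and should_request_step_down, and the is_voter/is_leader fields, are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Replica:
    """Hypothetical, simplified view of one tablet replica."""
    uuid: str
    is_voter: bool
    is_leader: bool = False


def should_request_step_down(replicas: List[Replica], to_remove_uuid: str) -> bool:
    """Return True only if asking the leader to step down can succeed.

    A step-down request makes sense only when the replica being removed
    is the current leader AND at least one other voter exists that can
    take over leadership.
    """
    target = next((r for r in replicas if r.uuid == to_remove_uuid), None)
    if target is None or not target.is_leader:
        # Nothing to step down: the replica is unknown or is not the leader.
        return False
    voter_count = sum(1 for r in replicas if r.is_voter)
    return voter_count > 1


# Replication factor 1: the only voter is the leader that should be moved,
# so a step-down would have no one to transfer leadership to.
rf1 = [Replica("old", is_voter=True, is_leader=True)]
rf1_decision = should_request_step_down(rf1, "old")

# Once a second replica has become a voter, stepping down is safe to request.
rf1_after_add = [
    Replica("old", is_voter=True, is_leader=True),
    Replica("new", is_voter=True),
]
rf1_after_add_decision = should_request_step_down(rf1_after_add, "old")

print(rf1_decision, rf1_after_add_decision)
```

In this model the single-voter case skips the step-down, so the tablet never enters a leadership-transfer state that would block the later change config request.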



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
