[ 
https://issues.apache.org/jira/browse/KUDU-2354?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16414832#comment-16414832
 ] 

Alexey Serbin commented on KUDU-2354:
-------------------------------------

And another issue to look at: do follower masters continue to retry those tasks 
once they have switched from the leader to the follower role?

> In case of 3-4-3 scheme and 3 tablet servers, catalog manager endlessly 
> retries operations to add a replacement replica even if replacement is no 
> longer needed
> ---------------------------------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: KUDU-2354
>                 URL: https://issues.apache.org/jira/browse/KUDU-2354
>             Project: Kudu
>          Issue Type: Bug
>          Components: master
>    Affects Versions: 1.7.0
>         Environment: 3 tservers in the cluster, single master (?)
>            Reporter: Alexey Serbin
>            Priority: Major
>
> In a scenario reported by [~adar], 100 iterations of the following command 
> were run:
> {noformat}
> kudu perf loadgen --keep-auto-table --table-num-buckets=40 
> --num-rows-per-thread=1 --table-num-replicas=3
> {noformat}
> That took about 10-15 minutes to complete, and for some reason ksck reported 
> UNAVAILABLE tablets for 5-10 minutes after that.  Most likely, due to the 
> spike of IO activity, tablet leaders didn't receive heartbeats from some 
> replicas and tried to replace those.  After some time, the cluster 
> stabilized (no problems reported by ksck), but in the master's log the 
> following messages continued to appear:
> {noformat}
> I0315 13:52:00.871310 106157 catalog_manager.cc:3234] Sending 
> ChangeConfig:ADD_PEER:NON_VOTER on tablet 2776eb10c241426e90ddf7354260ee04 
> (attempt 22)
> I0315 13:52:00.871354 106157 catalog_manager.cc:2700] Scheduling retry of 
> ChangeConfig:ADD_PEER:NON_VOTER RPC for tablet 
> 2776eb10c241426e90ddf7354260ee04 with cas_config_opid_index -1 with a delay 
> of 60018 ms (attempt = 22)
> {noformat}
> Of course, with just 3 tservers in the cluster, not a single attempt to 
> add a replacement non-voter replica would succeed, but it would make sense to 
> stop retrying those operations when a tablet's OpId index is far ahead of the 
> cas_config_opid_index of the operation being retried.
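
The stop condition suggested above, together with the leader-role question, 
could be sketched roughly as follows. This is only an illustration in Python, 
not the catalog manager's code: RetryTask, should_keep_retrying, and 
max_index_lag are hypothetical names, and treating -1 as "no CAS constraint, 
keep retrying" is an assumption based on the log lines quoted above.

```python
from dataclasses import dataclass

# Assumption: -1 means the task carries no CAS constraint, as in the quoted log.
NO_CAS_INDEX = -1


@dataclass
class RetryTask:
    """Hypothetical stand-in for a pending ChangeConfig retry task."""
    tablet_id: str
    cas_config_opid_index: int
    attempt: int


def should_keep_retrying(task: RetryTask,
                         tablet_opid_index: int,
                         is_leader: bool,
                         max_index_lag: int = 100) -> bool:
    """Decide whether a retry of the config change is still worthwhile.

    max_index_lag is an illustrative threshold, not a real Kudu setting.
    """
    # A master that is no longer the leader should not drive config changes.
    if not is_leader:
        return False
    # Without a CAS index there is nothing to compare against, so this
    # sketch keeps retrying (an assumption, not confirmed behavior).
    if task.cas_config_opid_index == NO_CAS_INDEX:
        return True
    # Give up once the tablet's OpId index has moved far past the index
    # the operation was originally issued against.
    lag = tablet_opid_index - task.cas_config_opid_index
    return lag <= max_index_lag
```

With such a check, a task issued against index 10 would be dropped once the 
tablet's index reached, say, 500, while a task on a master that lost 
leadership would be dropped regardless of the indexes.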



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
