[ https://issues.apache.org/jira/browse/KUDU-2354?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Grant Henke updated KUDU-2354:
------------------------------
    Target Version/s:   (was: 1.8.0)

> In case of 3-4-3 scheme and 3 tablet servers, catalog manager endlessly retries operations to add a replacement replica even if replacement is no longer needed
> ---------------------------------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: KUDU-2354
>                 URL: https://issues.apache.org/jira/browse/KUDU-2354
>             Project: Kudu
>          Issue Type: Bug
>          Components: master
>    Affects Versions: 1.7.0
>         Environment: 3 tservers in the cluster, single master (?)
>            Reporter: Alexey Serbin
>            Priority: Major
>
> In a scenario reported by [~adar], 100 iterations of the following command
> were run:
> {noformat}
> kudu perf loadgen --keep-auto-table --table-num-buckets=40 --num-rows-per-thread=1 --table-num-replicas=3
> {noformat}
> That took about 10-15 minutes to complete, and for some reason ksck reported
> UNAVAILABLE tablets for 5-10 minutes after that. Most likely, due to the
> spike of IO activity, tablet leaders didn't receive heartbeats from some
> replicas and tried to replace those. After some time, the cluster stabilized
> (no problems reported by ksck), but in the master's log the following
> messages continued to appear:
> {noformat}
> I0315 13:52:00.871310 106157 catalog_manager.cc:3234] Sending ChangeConfig:ADD_PEER:NON_VOTER on tablet 2776eb10c241426e90ddf7354260ee04 (attempt 22)
> I0315 13:52:00.871354 106157 catalog_manager.cc:2700] Scheduling retry of ChangeConfig:ADD_PEER:NON_VOTER RPC for tablet 2776eb10c241426e90ddf7354260ee04 with cas_config_opid_index -1 with a delay of 60018 ms (attempt = 22)
> {noformat}
> Of course, with just 3 tservers in the cluster, not a single attempt to add
> a replacement non-voter replica would succeed, but it would make sense to
> stop retrying those operations when a tablet's OpId index is far ahead of
> the cas_config_opid_index of the operation being retried.
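
The stop condition proposed in the last paragraph could be sketched roughly as follows. This is only an illustration in Python, not Kudu's catalog manager code: `PendingChangeConfig`, `should_retry`, and the `max_index_lag` threshold are hypothetical names, and the report does not say how far ahead counts as "far ahead".

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PendingChangeConfig:
    """Hypothetical record of a ChangeConfig operation awaiting retry."""
    tablet_id: str
    cas_config_opid_index: int  # index the operation was issued against
    attempt: int


def should_retry(op: PendingChangeConfig,
                 tablet_opid_index: int,
                 max_index_lag: int) -> bool:
    """Return False once the tablet has moved far past the operation.

    'Far' is modeled as an assumed threshold, max_index_lag: the retry is
    dropped when the tablet's current OpId index exceeds the operation's
    cas_config_opid_index by more than that amount.
    """
    lag = tablet_opid_index - op.cas_config_opid_index
    return lag <= max_index_lag


# Values taken from the log excerpt; the tablet indexes are made up.
op = PendingChangeConfig(
    tablet_id="2776eb10c241426e90ddf7354260ee04",
    cas_config_opid_index=-1,
    attempt=22,
)
keep_going = should_retry(op, tablet_opid_index=5, max_index_lag=100)
give_up = not should_retry(op, tablet_opid_index=5000, max_index_lag=100)
```

With a check like this in the retry path, the operation from the log would be dropped once the tablet's index runs more than `max_index_lag` entries ahead, instead of being rescheduled every 60 seconds indefinitely.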