Alexey Serbin created KUDU-2354:
-----------------------------------

             Summary: In case of 3-4-3 scheme and 3 tablet servers, catalog 
manager endlessly retries operations to add a replacement replica even if 
replacement is no longer needed
                 Key: KUDU-2354
                 URL: https://issues.apache.org/jira/browse/KUDU-2354
             Project: Kudu
          Issue Type: Bug
          Components: master
    Affects Versions: 1.7.0
         Environment: 3 tservers in the cluster, single master (?)
            Reporter: Alexey Serbin


In a scenario reported by [~adar], 100 iterations of the following command were 
run:

{noformat}
kudu perf loadgen --keep-auto-table --tablet-num-buckets 40 --num-rows-per-thread=1 --tablet-num-replicas=3
{noformat}

That took about 10-15 minutes to complete, and for some reason ksck reported 
UNAVAILABLE tablets for 5-10 minutes after that.  Most likely, due to the spike 
of IO activity, tablet leaders didn't receive heartbeat responses from some 
replicas in time and tried to replace them.  After some time, the cluster 
stabilized (no problems reported by ksck), but the following messages 
continued to appear in the master's log:

{noformat}
I0315 13:52:00.871310 106157 catalog_manager.cc:3234] Sending ChangeConfig:ADD_PEER:NON_VOTER on tablet 2776eb10c241426e90ddf7354260ee04 (attempt 22)
I0315 13:52:00.871354 106157 catalog_manager.cc:2700] Scheduling retry of ChangeConfig:ADD_PEER:NON_VOTER RPC for tablet 2776eb10c241426e90ddf7354260ee04 with cas_config_opid_index -1 with a delay of 60018 ms (attempt = 22)
{noformat}

Of course, with just 3 tservers in the cluster, not a single attempt to add a 
replacement non-voter replica can succeed: all 3 tservers already host a 
replica of the tablet, so there is no tserver left to place a 4th replica on.  
Still, it would make sense to stop retrying those operations once the tablet's 
OpId index is far ahead of the cas_config_opid_index of the operation being 
retried.
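
A minimal sketch of the stop condition suggested above, to make the idea 
concrete.  Everything named here (PendingChangeConfig, kMaxConfigIndexLag, 
ShouldKeepRetrying, and the threshold value) is hypothetical and is not taken 
from the Kudu code base; it only illustrates comparing the tablet's current 
OpId index with the cas_config_opid_index of the queued operation.  The sketch 
does not special-case the -1 value seen in the log above; how that should be 
treated is left open.

```cpp
#include <cstdint>

// Hypothetical summary of a queued ChangeConfig retry; not a Kudu type.
struct PendingChangeConfig {
  int64_t cas_config_opid_index;  // index the operation was built against
  int attempt;                    // number of attempts made so far
};

// Hypothetical threshold: how far the tablet's OpId index may run ahead of
// the operation's cas_config_opid_index before the retry is considered stale.
constexpr int64_t kMaxConfigIndexLag = 100;

// Returns true if the operation is still worth retrying, i.e. the tablet's
// current OpId index is not "far ahead" of the index the operation targets.
constexpr bool ShouldKeepRetrying(const PendingChangeConfig& op,
                                  int64_t tablet_opid_index) {
  return tablet_opid_index - op.cas_config_opid_index <= kMaxConfigIndexLag;
}
```

Under these assumptions, the retry scheduler would consult 
ShouldKeepRetrying() before scheduling the next attempt and drop the 
operation when it returns false, instead of rescheduling it indefinitely.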



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
