YifanZhang created KUDU-3082:
--------------------------------

             Summary: tablets in "CONSENSUS_MISMATCH" state for a long time
                 Key: KUDU-3082
                 URL: https://issues.apache.org/jira/browse/KUDU-3082
             Project: Kudu
          Issue Type: Bug
            Reporter: YifanZhang


We recently found that a few tablets in one of our clusters are unhealthy. The
ksck output looks like this:

 
{code:java}
Tablet Summary
Tablet 7404240f458f462d92b6588d07583a52 of table '' is conflicted: 3 replicas' active configs disagree with the leader master's
  7380d797d2ea49e88d71091802fb1c81 (kudu-ts26): RUNNING
  d1952499f94a4e6087bee28466fcb09f (kudu-ts25): RUNNING
  47af52df1adc47e1903eb097e9c88f2e (kudu-ts27): RUNNING [LEADER]
All reported replicas are:
  A = 7380d797d2ea49e88d71091802fb1c81
  B = d1952499f94a4e6087bee28466fcb09f
  C = 47af52df1adc47e1903eb097e9c88f2e
  D = 08beca5ed4d04003b6979bf8bac378d2
The consensus matrix is:
 Config source |     Replicas     | Current term | Config index | Committed?
---------------+------------------+--------------+--------------+------------
 master        | A   B   C*       |              |              | Yes
 A             | A   B   C*       | 5            | -1           | Yes
 B             | A   B   C        | 5            | -1           | Yes
 C             | A   B   C*  D~   | 5            | 54649        | No
Tablet 6d9d3fb034314fa7bee9cfbf602bcdc8 of table '' is conflicted: 2 replicas' active configs disagree with the leader master's
  d1952499f94a4e6087bee28466fcb09f (kudu-ts25): RUNNING
  47af52df1adc47e1903eb097e9c88f2e (kudu-ts27): RUNNING [LEADER]
  5a8aeadabdd140c29a09dabcae919b31 (kudu-ts21): RUNNING
All reported replicas are:
  A = d1952499f94a4e6087bee28466fcb09f
  B = 47af52df1adc47e1903eb097e9c88f2e
  C = 5a8aeadabdd140c29a09dabcae919b31
  D = 14632cdbb0d04279bc772f64e06389f9
The consensus matrix is:
 Config source |     Replicas     | Current term | Config index | Committed?
---------------+------------------+--------------+--------------+------------
 master        | A   B*  C        |              |              | Yes
 A             | A   B*  C        | 5            | 5            | Yes
 B             | A   B*  C   D~   | 5            | 96176        | No
 C             | A   B*  C        | 5            | 5            | Yes
Tablet bf1ec7d693b94632b099dc0928e76363 of table '' is conflicted: 1 replicas' active configs disagree with the leader master's
  a9eaff3cf1ed483aae849549999d649a (kudu-ts23): RUNNING
  f75df4a6b5ce404884313af5f906b392 (kudu-ts19): RUNNING
  47af52df1adc47e1903eb097e9c88f2e (kudu-ts27): RUNNING [LEADER]
All reported replicas are:
  A = a9eaff3cf1ed483aae849549999d649a
  B = f75df4a6b5ce404884313af5f906b392
  C = 47af52df1adc47e1903eb097e9c88f2e
  D = d1952499f94a4e6087bee28466fcb09f
The consensus matrix is:
 Config source |     Replicas     | Current term | Config index | Committed?
---------------+------------------+--------------+--------------+------------
 master        | A   B   C*       |              |              | Yes
 A             | A   B   C*       | 1            | -1           | Yes
 B             | A   B   C*       | 1            | -1           | Yes
 C             | A   B   C*  D~   | 1            | 2            | No
Tablet 3190a310857e4c64997adb477131488a of table '' is conflicted: 3 replicas' active configs disagree with the leader master's
  47af52df1adc47e1903eb097e9c88f2e (kudu-ts27): RUNNING [LEADER]
  f0f7b2f4b9d344e6929105f48365f38e (kudu-ts24): RUNNING
  f75df4a6b5ce404884313af5f906b392 (kudu-ts19): RUNNING
All reported replicas are:
  A = 47af52df1adc47e1903eb097e9c88f2e
  B = f0f7b2f4b9d344e6929105f48365f38e
  C = f75df4a6b5ce404884313af5f906b392
  D = d1952499f94a4e6087bee28466fcb09f
The consensus matrix is:
 Config source |     Replicas     | Current term | Config index | Committed?
---------------+------------------+--------------+--------------+------------
 master        | A*  B   C        |              |              | Yes
 A             | A*  B   C   D~   | 1            | 1991         | No
 B             | A*  B   C        | 1            | 4            | Yes
 C             | A*  B   C        | 1            | 4            | Yes{code}
These tablets did not recover for a couple of days, until we restarted kudu-ts27.
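
To pick out the stuck replicas from a larger ksck report, the consensus matrix can be scanned for rows whose config is not committed. The following is only a minimal sketch, not part of Kudu: the function name `uncommitted_configs` is hypothetical, it assumes the matrix rows are `|`-separated exactly as in the output above, and it assumes the `~` suffix marks a non-voter replica.

```python
def uncommitted_configs(matrix_text):
    """Return (config source, config index, non-voter labels) for each
    matrix row whose 'Committed?' column is 'No'."""
    pending = []
    for line in matrix_text.splitlines():
        fields = [f.strip() for f in line.split("|")]
        # Data rows have exactly five columns; skip the header row and
        # the '---+---' separator, which contains no '|' characters.
        if len(fields) != 5 or fields[0] == "Config source":
            continue
        source, replicas, _term, index, committed = fields
        if committed == "No":
            # Assumption: a trailing '~' marks a non-voter replica.
            non_voters = [r.rstrip("~") for r in replicas.split() if r.endswith("~")]
            pending.append((source, int(index), non_voters))
    return pending
```

Applied to the first matrix above, this would report config source C with pending config index 54649 and non-voter D.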

I found many repeated log messages on kudu-ts27 like these:
{code:java}
I0314 04:38:41.511279 65731 raft_consensus.cc:937] T 7404240f458f462d92b6588d07583a52 P 47af52df1adc47e1903eb097e9c88f2e [term 3 LEADER]: attempt to promote peer 08beca5ed4d04003b6979bf8bac378d2: there is already a config change operation in progress. Unable to promote follower until it completes. Doing nothing.
I0314 04:38:41.751009 65453 raft_consensus.cc:937] T 6d9d3fb034314fa7bee9cfbf602bcdc8 P 47af52df1adc47e1903eb097e9c88f2e [term 5 LEADER]: attempt to promote peer 14632cdbb0d04279bc772f64e06389f9: there is already a config change operation in progress. Unable to promote follower until it completes. Doing nothing.
{code}
There seem to be some RaftConfig change operations that, for some reason, never
complete.
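
To see which tablets and peers are affected, the repeated "attempt to promote peer" messages can be counted per (tablet, peer) pair. This is a rough sketch under stated assumptions, not a Kudu tool: the function name `count_blocked_promotions` is hypothetical, and the regular expression is derived only from the two log lines quoted above.

```python
import re
from collections import Counter

# Matches the promotion message quoted above. Using \s+ between tokens
# tolerates line wrapping introduced by mail clients.
PROMOTE_BLOCKED = re.compile(
    r"T\s+(?P<tablet>[0-9a-f]{32})\s+P\s+(?P<leader>[0-9a-f]{32})\s+"
    r"\[term\s+\d+\s+LEADER\]:\s+attempt\s+to\s+promote\s+peer\s+"
    r"(?P<peer>[0-9a-f]{32}):\s+there\s+is\s+already\s+a\s+config\s+change"
)

def count_blocked_promotions(log_text):
    """Count blocked promotion attempts per (tablet id, peer id) pair."""
    return Counter(
        (m.group("tablet"), m.group("peer"))
        for m in PROMOTE_BLOCKED.finditer(log_text)
    )
```

A pair with a steadily growing count over time would suggest that the pending config change on that tablet is not making progress.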

 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
