[ https://issues.apache.org/jira/browse/KUDU-3082?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17069389#comment-17069389 ]

YifanZhang commented on KUDU-3082:
----------------------------------

Unfortunately, most of the logs had been cleaned up due to expiration before I 
could analyze them. I now have partial logs for tablet 
7404240f458f462d92b6588d07583a52 (full logs on ts26 and partial logs on ts25); 
I'll attach them in a moment. The logs on ts27, and on the leader master from 
before the ts27 restart, are completely cleaned up :( I also kept some 
fragmented logs on the master, though I'm not sure whether they will be helpful.

I think the state of ts27 was abnormal when the problem occurred, because some 
replicas couldn't communicate with their leader on ts27.
{code:java}
I0313 03:50:14.118202 99494 raft_consensus.cc:1149] T 
7404240f458f462d92b6588d07583a52 P 7380d797d2ea49e88d71091802fb1c81 [term 2 
LEADER]: Rejecting Update request from peer 47af52df1adc47e1903eb097e9c88f2e 
for earlier term 1. Current term is 2. Ops: []
I0313 03:50:14.250483 56182 consensus_queue.cc:984] T 
7404240f458f462d92b6588d07583a52 P 7380d797d2ea49e88d71091802fb1c81 [LEADER]: 
Connected to new peer: Peer: permanent_uuid: "47af52df1adc47e1903eb097e9c88f2e" 
member_type: VOTER last_known_addr { host: "kudu-ts27" port: 14100 }, Status: 
LMP_MISMATCH, Last received: 0.0, Next index: 55446, Last known committed idx: 
55445, Time since last communication: 0.000s
I0313 03:50:14.327806 56430 consensus_queue.cc:984] T 
7404240f458f462d92b6588d07583a52 P 7380d797d2ea49e88d71091802fb1c81 [LEADER]: 
Connected to new peer: Peer: permanent_uuid: "d1952499f94a4e6087bee28466fcb09f" 
member_type: VOTER last_known_addr { host: "kudu-ts25" port: 14100 }, Status: 
LMP_MISMATCH, Last received: 0.0, Next index: 55446, Last known committed idx: 
54648, Time since last communication: 0.000s
I0313 03:50:14.330118 56430 consensus_queue.cc:689] T 
7404240f458f462d92b6588d07583a52 P 7380d797d2ea49e88d71091802fb1c81 [LEADER]: 
The logs necessary to catch up peer d1952499f94a4e6087bee28466fcb09f have been 
garbage collected. The follower will never be able to catch up (Not found: 
Failed to read ops 54649..55444: Segment 157 which contained index 54649 has 
been GCed)
I0313 03:50:14.330137 56430 consensus_queue.cc:544] T 
7404240f458f462d92b6588d07583a52 P 7380d797d2ea49e88d71091802fb1c81 [LEADER]: 
The logs necessary to catch up peer d1952499f94a4e6087bee28466fcb09f have been 
garbage collected. The replica will never be able to catch up
I0313 03:50:14.335949 99494 consensus_queue.cc:206] T 
7404240f458f462d92b6588d07583a52 P 7380d797d2ea49e88d71091802fb1c81 [LEADER]: 
Queue going to LEADER mode. State: All replicated index: 0, Majority replicated 
index: 55446, Committed index: 55446, Last appended: 2.55446, Last appended by 
leader: 55445, Current term: 2, Majority size: 2, State: 0, Mode: LEADER, 
active raft config: opid_index: 55447 OBSOLETE_local: false peers { 
permanent_uuid: "7380d797d2ea49e88d71091802fb1c81" member_type: VOTER 
last_known_addr { host: "kudu-ts26" port: 14100 } } peers { permanent_uuid: 
"47af52df1adc47e1903eb097e9c88f2e" member_type: VOTER last_known_addr { host: 
"kudu-ts27" port: 14100 } }
I0313 03:50:14.336225 56182 consensus_queue.cc:984] T 
7404240f458f462d92b6588d07583a52 P 7380d797d2ea49e88d71091802fb1c81 [LEADER]: 
Connected to new peer: Peer: permanent_uuid: "47af52df1adc47e1903eb097e9c88f2e" 
member_type: VOTER last_known_addr { host: "kudu-ts27" port: 14100 }, Status: 
LMP_MISMATCH, Last received: 0.0, Next index: 55447, Last known committed idx: 
55446, Time since last communication: 0.000s
W0313 03:50:14.336508 98349 consensus_peers.cc:458] T 
7404240f458f462d92b6588d07583a52 P 7380d797d2ea49e88d71091802fb1c81 -> Peer 
47af52df1adc47e1903eb097e9c88f2e (kudu-ts27:14100): Couldn't send request to 
peer 47af52df1adc47e1903eb097e9c88f2e. Status: Illegal state: Rejecting Update 
request from peer 7380d797d2ea49e88d71091802fb1c81 for term 2. Could not 
prepare a single transaction due to: Illegal state: RaftConfig change currently 
pending. Only one is allowed at a time.
{code}
Judging from the above logs on ts26, it rejected the Update request from peer 
47af52d, and its own Update requests to that peer also failed. This may mean 
that the config change operation for replica 47af52d failed but the pending 
config wasn't cleared. This case may be similar to KUDU-1338.
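The repeated rejections are consistent with Raft's rule that a leader may have 
at most one uncommitted config change in flight at a time. As a rough 
illustration (a toy model, not Kudu's actual implementation; the class and 
method names below are made up), a pending change that is never committed or 
aborted blocks every subsequent change attempt indefinitely:

{code:python}
# Toy model of the "one pending config change at a time" rule.
# Not Kudu's real code: it only sketches why a stuck pending config
# produces endless "change already in progress" rejections.

class ToyLeader:
    def __init__(self):
        self.pending_config = None  # uncommitted config change, if any

    def propose_config_change(self, new_config):
        if self.pending_config is not None:
            # Mirrors "RaftConfig change currently pending. Only one is
            # allowed at a time." in the logs above.
            return "rejected: config change already in progress"
        self.pending_config = new_config
        return "accepted"

    def commit_pending(self):
        # Clears the pending change. If this step is never reached
        # (e.g. the change fails but is never aborted), the leader
        # rejects all further config changes forever.
        self.pending_config = None


leader = ToyLeader()
print(leader.propose_config_change({"peers": ["A", "B", "C", "D"]}))  # accepted
# The pending change is never committed or aborted, so:
print(leader.propose_config_change({"peers": ["A", "B", "C"]}))  # rejected
{code}
If the pending entry on ts27 could never be committed (it needed acks from the 
very peers it was failing to talk to) and was never aborted, that would explain 
why the "Doing nothing" messages repeated for days until the restart discarded 
the in-memory pending state.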

 

> tablets in "CONSENSUS_MISMATCH" state for a long time
> -----------------------------------------------------
>
>                 Key: KUDU-3082
>                 URL: https://issues.apache.org/jira/browse/KUDU-3082
>             Project: Kudu
>          Issue Type: Bug
>          Components: consensus
>    Affects Versions: 1.10.1
>            Reporter: YifanZhang
>            Priority: Major
>
> Lately we found that a few tablets in one of our clusters are unhealthy; the 
> ksck output looks like:
>  
> {code:java}
> Tablet Summary
> Tablet 7404240f458f462d92b6588d07583a52 of table '' is conflicted: 3 
> replicas' active configs disagree with the leader master's
>   7380d797d2ea49e88d71091802fb1c81 (kudu-ts26): RUNNING
>   d1952499f94a4e6087bee28466fcb09f (kudu-ts25): RUNNING
>   47af52df1adc47e1903eb097e9c88f2e (kudu-ts27): RUNNING [LEADER]
> All reported replicas are:
>   A = 7380d797d2ea49e88d71091802fb1c81
>   B = d1952499f94a4e6087bee28466fcb09f
>   C = 47af52df1adc47e1903eb097e9c88f2e
>   D = 08beca5ed4d04003b6979bf8bac378d2
> The consensus matrix is:
>  Config source |     Replicas     | Current term | Config index | Committed?
> ---------------+------------------+--------------+--------------+------------
>  master        | A   B   C*       |              |              | Yes
>  A             | A   B   C*       | 5            | -1           | Yes
>  B             | A   B   C        | 5            | -1           | Yes
>  C             | A   B   C*  D~   | 5            | 54649        | No
> Tablet 6d9d3fb034314fa7bee9cfbf602bcdc8 of table '' is conflicted: 2 
> replicas' active configs disagree with the leader master's
>   d1952499f94a4e6087bee28466fcb09f (kudu-ts25): RUNNING
>   47af52df1adc47e1903eb097e9c88f2e (kudu-ts27): RUNNING [LEADER]
>   5a8aeadabdd140c29a09dabcae919b31 (kudu-ts21): RUNNING
> All reported replicas are:
>   A = d1952499f94a4e6087bee28466fcb09f
>   B = 47af52df1adc47e1903eb097e9c88f2e
>   C = 5a8aeadabdd140c29a09dabcae919b31
>   D = 14632cdbb0d04279bc772f64e06389f9
> The consensus matrix is:
>  Config source |     Replicas     | Current term | Config index | Committed?
> ---------------+------------------+--------------+--------------+------------
>  master        | A   B*  C        |              |              | Yes
>  A             | A   B*  C        | 5            | 5            | Yes
>  B             | A   B*  C   D~   | 5            | 96176        | No
>  C             | A   B*  C        | 5            | 5            | Yes
> Tablet bf1ec7d693b94632b099dc0928e76363 of table '' is conflicted: 1 
> replicas' active configs disagree with the leader master's
>   a9eaff3cf1ed483aae849549999d649a (kudu-ts23): RUNNING
>   f75df4a6b5ce404884313af5f906b392 (kudu-ts19): RUNNING
>   47af52df1adc47e1903eb097e9c88f2e (kudu-ts27): RUNNING [LEADER]
> All reported replicas are:
>   A = a9eaff3cf1ed483aae849549999d649a
>   B = f75df4a6b5ce404884313af5f906b392
>   C = 47af52df1adc47e1903eb097e9c88f2e
>   D = d1952499f94a4e6087bee28466fcb09f
> The consensus matrix is:
>  Config source |     Replicas     | Current term | Config index | Committed?
> ---------------+------------------+--------------+--------------+------------
>  master        | A   B   C*       |              |              | Yes
>  A             | A   B   C*       | 1            | -1           | Yes
>  B             | A   B   C*       | 1            | -1           | Yes
>  C             | A   B   C*  D~   | 1            | 2            | No
> Tablet 3190a310857e4c64997adb477131488a of table '' is conflicted: 3 
> replicas' active configs disagree with the leader master's
>   47af52df1adc47e1903eb097e9c88f2e (kudu-ts27): RUNNING [LEADER]
>   f0f7b2f4b9d344e6929105f48365f38e (kudu-ts24): RUNNING
>   f75df4a6b5ce404884313af5f906b392 (kudu-ts19): RUNNING
> All reported replicas are:
>   A = 47af52df1adc47e1903eb097e9c88f2e
>   B = f0f7b2f4b9d344e6929105f48365f38e
>   C = f75df4a6b5ce404884313af5f906b392
>   D = d1952499f94a4e6087bee28466fcb09f
> The consensus matrix is:
>  Config source |     Replicas     | Current term | Config index | Committed?
> ---------------+------------------+--------------+--------------+------------
>  master        | A*  B   C        |              |              | Yes
>  A             | A*  B   C   D~   | 1            | 1991         | No
>  B             | A*  B   C        | 1            | 4            | Yes
>  C             | A*  B   C        | 1            | 4            | Yes{code}
> These tablets couldn't recover for a couple of days until we restarted 
> kudu-ts27.
> I found many duplicated log messages on kudu-ts27 like:
> {code:java}
> I0314 04:38:41.511279 65731 raft_consensus.cc:937] T 
> 7404240f458f462d92b6588d07583a52 P 47af52df1adc47e1903eb097e9c88f2e [term 3 
> LEADER]: attempt to promote peer 08beca5ed4d04003b6979bf8bac378d2: there is 
> already a config change operation in progress. Unable to promote follower 
> until it completes. Doing nothing.
> I0314 04:38:41.751009 65453 raft_consensus.cc:937] T 
> 6d9d3fb034314fa7bee9cfbf602bcdc8 P 47af52df1adc47e1903eb097e9c88f2e [term 5 
> LEADER]: attempt to promote peer 14632cdbb0d04279bc772f64e06389f9: there is 
> already a config change operation in progress. Unable to promote follower 
> until it completes. Doing nothing.
> {code}
> There seem to be some RaftConfig change operations that somehow cannot 
> complete.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
