[ https://issues.apache.org/jira/browse/KUDU-3082?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17069389#comment-17069389 ]
YifanZhang commented on KUDU-3082:
----------------------------------
Unfortunately, most logs had already been cleaned up due to expiration by the time I wanted to analyze them. Now I have partial logs for tablet 7404240f458f462d92b6588d07583a52 (full logs on ts26 and partial logs on ts25); I'll attach them in a moment. The logs on ts27, and the leader master's logs from before the ts27 restart, are completely cleaned up :( I also kept some fragmented logs from the master, though I'm not sure whether they are helpful. I think the state of ts27 was abnormal when the problem occurred, because some replicas couldn't communicate with their leader on ts27.
{code:java}
I0313 03:50:14.118202 99494 raft_consensus.cc:1149] T 7404240f458f462d92b6588d07583a52 P 7380d797d2ea49e88d71091802fb1c81 [term 2 LEADER]: Rejecting Update request from peer 47af52df1adc47e1903eb097e9c88f2e for earlier term 1. Current term is 2. Ops: []
I0313 03:50:14.250483 56182 consensus_queue.cc:984] T 7404240f458f462d92b6588d07583a52 P 7380d797d2ea49e88d71091802fb1c81 [LEADER]: Connected to new peer: Peer: permanent_uuid: "47af52df1adc47e1903eb097e9c88f2e" member_type: VOTER last_known_addr { host: "kudu-ts27" port: 14100 }, Status: LMP_MISMATCH, Last received: 0.0, Next index: 55446, Last known committed idx: 55445, Time since last communication: 0.000s
I0313 03:50:14.327806 56430 consensus_queue.cc:984] T 7404240f458f462d92b6588d07583a52 P 7380d797d2ea49e88d71091802fb1c81 [LEADER]: Connected to new peer: Peer: permanent_uuid: "d1952499f94a4e6087bee28466fcb09f" member_type: VOTER last_known_addr { host: "kudu-ts25" port: 14100 }, Status: LMP_MISMATCH, Last received: 0.0, Next index: 55446, Last known committed idx: 54648, Time since last communication: 0.000s
I0313 03:50:14.330118 56430 consensus_queue.cc:689] T 7404240f458f462d92b6588d07583a52 P 7380d797d2ea49e88d71091802fb1c81 [LEADER]: The logs necessary to catch up peer d1952499f94a4e6087bee28466fcb09f have been garbage collected. The follower will never be able to catch up (Not found: Failed to read ops 54649..55444: Segment 157 which contained index 54649 has been GCed)
I0313 03:50:14.330137 56430 consensus_queue.cc:544] T 7404240f458f462d92b6588d07583a52 P 7380d797d2ea49e88d71091802fb1c81 [LEADER]: The logs necessary to catch up peer d1952499f94a4e6087bee28466fcb09f have been garbage collected. The replica will never be able to catch up
I0313 03:50:14.335949 99494 consensus_queue.cc:206] T 7404240f458f462d92b6588d07583a52 P 7380d797d2ea49e88d71091802fb1c81 [LEADER]: Queue going to LEADER mode. State: All replicated index: 0, Majority replicated index: 55446, Committed index: 55446, Last appended: 2.55446, Last appended by leader: 55445, Current term: 2, Majority size: 2, State: 0, Mode: LEADER, active raft config: opid_index: 55447 OBSOLETE_local: false peers { permanent_uuid: "7380d797d2ea49e88d71091802fb1c81" member_type: VOTER last_known_addr { host: "kudu-ts26" port: 14100 } } peers { permanent_uuid: "47af52df1adc47e1903eb097e9c88f2e" member_type: VOTER last_known_addr { host: "kudu-ts27" port: 14100 } }
I0313 03:50:14.336225 56182 consensus_queue.cc:984] T 7404240f458f462d92b6588d07583a52 P 7380d797d2ea49e88d71091802fb1c81 [LEADER]: Connected to new peer: Peer: permanent_uuid: "47af52df1adc47e1903eb097e9c88f2e" member_type: VOTER last_known_addr { host: "kudu-ts27" port: 14100 }, Status: LMP_MISMATCH, Last received: 0.0, Next index: 55447, Last known committed idx: 55446, Time since last communication: 0.000s
W0313 03:50:14.336508 98349 consensus_peers.cc:458] T 7404240f458f462d92b6588d07583a52 P 7380d797d2ea49e88d71091802fb1c81 -> Peer 47af52df1adc47e1903eb097e9c88f2e (kudu-ts27:14100): Couldn't send request to peer 47af52df1adc47e1903eb097e9c88f2e. Status: Illegal state: Rejecting Update request from peer 7380d797d2ea49e88d71091802fb1c81 for term 2. Could not prepare a single transaction due to: Illegal state: RaftConfig change currently pending. Only one is allowed at a time.
{code}
Judging from the above logs on ts26, it rejected the Update request from peer 47af52d, and its own Update requests to that peer also failed. This may mean that the config change operation for replica 47af52d failed but the pending config was never cleared. This case may be similar to KUDU-1338.

> tablets in "CONSENSUS_MISMATCH" state for a long time
> -----------------------------------------------------
>
>                 Key: KUDU-3082
>                 URL: https://issues.apache.org/jira/browse/KUDU-3082
>             Project: Kudu
>          Issue Type: Bug
>          Components: consensus
>    Affects Versions: 1.10.1
>            Reporter: YifanZhang
>            Priority: Major
>
> Lately we found a few tablets in one of our clusters are unhealthy, the ksck output looks like:
>
> {code:java}
> Tablet Summary
> Tablet 7404240f458f462d92b6588d07583a52 of table '' is conflicted: 3 replicas' active configs disagree with the leader master's
>   7380d797d2ea49e88d71091802fb1c81 (kudu-ts26): RUNNING
>   d1952499f94a4e6087bee28466fcb09f (kudu-ts25): RUNNING
>   47af52df1adc47e1903eb097e9c88f2e (kudu-ts27): RUNNING [LEADER]
> All reported replicas are:
>   A = 7380d797d2ea49e88d71091802fb1c81
>   B = d1952499f94a4e6087bee28466fcb09f
>   C = 47af52df1adc47e1903eb097e9c88f2e
>   D = 08beca5ed4d04003b6979bf8bac378d2
> The consensus matrix is:
>  Config source |     Replicas     | Current term | Config index | Committed?
> ---------------+------------------+--------------+--------------+------------
>  master        | A   B   C*       |              |              | Yes
>  A             | A   B   C*       | 5            | -1           | Yes
>  B             | A   B   C        | 5            | -1           | Yes
>  C             | A   B   C*  D~   | 5            | 54649        | No
>
> Tablet 6d9d3fb034314fa7bee9cfbf602bcdc8 of table '' is conflicted: 2 replicas' active configs disagree with the leader master's
>   d1952499f94a4e6087bee28466fcb09f (kudu-ts25): RUNNING
>   47af52df1adc47e1903eb097e9c88f2e (kudu-ts27): RUNNING [LEADER]
>   5a8aeadabdd140c29a09dabcae919b31 (kudu-ts21): RUNNING
> All reported replicas are:
>   A = d1952499f94a4e6087bee28466fcb09f
>   B = 47af52df1adc47e1903eb097e9c88f2e
>   C = 5a8aeadabdd140c29a09dabcae919b31
>   D = 14632cdbb0d04279bc772f64e06389f9
> The consensus matrix is:
>  Config source |     Replicas     | Current term | Config index | Committed?
> ---------------+------------------+--------------+--------------+------------
>  master        | A   B*  C        |              |              | Yes
>  A             | A   B*  C        | 5            | 5            | Yes
>  B             | A   B*  C   D~   | 5            | 96176        | No
>  C             | A   B*  C        | 5            | 5            | Yes
>
> Tablet bf1ec7d693b94632b099dc0928e76363 of table '' is conflicted: 1 replica's active config disagrees with the leader master's
>   a9eaff3cf1ed483aae849549999d649a (kudu-ts23): RUNNING
>   f75df4a6b5ce404884313af5f906b392 (kudu-ts19): RUNNING
>   47af52df1adc47e1903eb097e9c88f2e (kudu-ts27): RUNNING [LEADER]
> All reported replicas are:
>   A = a9eaff3cf1ed483aae849549999d649a
>   B = f75df4a6b5ce404884313af5f906b392
>   C = 47af52df1adc47e1903eb097e9c88f2e
>   D = d1952499f94a4e6087bee28466fcb09f
> The consensus matrix is:
>  Config source |     Replicas     | Current term | Config index | Committed?
> ---------------+------------------+--------------+--------------+------------
>  master        | A   B   C*       |              |              | Yes
>  A             | A   B   C*       | 1            | -1           | Yes
>  B             | A   B   C*       | 1            | -1           | Yes
>  C             | A   B   C*  D~   | 1            | 2            | No
>
> Tablet 3190a310857e4c64997adb477131488a of table '' is conflicted: 3 replicas' active configs disagree with the leader master's
>   47af52df1adc47e1903eb097e9c88f2e (kudu-ts27): RUNNING [LEADER]
>   f0f7b2f4b9d344e6929105f48365f38e (kudu-ts24): RUNNING
>   f75df4a6b5ce404884313af5f906b392 (kudu-ts19): RUNNING
> All reported replicas are:
>   A = 47af52df1adc47e1903eb097e9c88f2e
>   B = f0f7b2f4b9d344e6929105f48365f38e
>   C = f75df4a6b5ce404884313af5f906b392
>   D = d1952499f94a4e6087bee28466fcb09f
> The consensus matrix is:
>  Config source |     Replicas     | Current term | Config index | Committed?
> ---------------+------------------+--------------+--------------+------------
>  master        | A*  B   C        |              |              | Yes
>  A             | A*  B   C   D~   | 1            | 1991         | No
>  B             | A*  B   C        | 1            | 4            | Yes
>  C             | A*  B   C        | 1            | 4            | Yes{code}
>
> These tablets couldn't recover for a couple of days until we restarted kudu-ts27.
> I found many duplicated logs on kudu-ts27 like:
> {code:java}
> I0314 04:38:41.511279 65731 raft_consensus.cc:937] T 7404240f458f462d92b6588d07583a52 P 47af52df1adc47e1903eb097e9c88f2e [term 3 LEADER]: attempt to promote peer 08beca5ed4d04003b6979bf8bac378d2: there is already a config change operation in progress. Unable to promote follower until it completes. Doing nothing.
> I0314 04:38:41.751009 65453 raft_consensus.cc:937] T 6d9d3fb034314fa7bee9cfbf602bcdc8 P 47af52df1adc47e1903eb097e9c88f2e [term 5 LEADER]: attempt to promote peer 14632cdbb0d04279bc772f64e06389f9: there is already a config change operation in progress. Unable to promote follower until it completes. Doing nothing.
> {code}
> There seem to be some RaftConfig change operations that somehow cannot complete.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
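The repeated "already a config change operation in progress" messages above suggest a pending Raft config change that never commits and is never cleared, so every later promotion attempt is rejected. A minimal sketch of that invariant, as an illustrative model only (this is not Kudu's actual code; all names here are hypothetical):

```python
# Illustrative model of the "at most one pending config change" invariant
# suggested by the logs. Not Kudu's implementation; names are hypothetical.

class ReplicaState:
    def __init__(self, committed_config):
        self.committed_config = committed_config
        self.pending_config = None  # at most one uncommitted config change

    def start_config_change(self, new_config):
        # Mirrors the ts26 error: "RaftConfig change currently pending.
        # Only one is allowed at a time."
        if self.pending_config is not None:
            raise RuntimeError("RaftConfig change currently pending. "
                               "Only one is allowed at a time.")
        self.pending_config = new_config

    def commit_config_change(self):
        # Normally called once the config-change op is majority-replicated.
        self.committed_config = self.pending_config
        self.pending_config = None

    def abort_config_change(self):
        # If neither commit nor abort ever runs after a failed change,
        # the replica is wedged: all later promotion attempts are rejected.
        self.pending_config = None


state = ReplicaState(committed_config=["A", "B", "C"])
state.start_config_change(["A", "B", "C", "D"])  # e.g. promoting non-voter D

# The change never commits and is never aborted, so the next attempt fails,
# like the repeated "already a config change in progress" messages:
try:
    state.start_config_change(["A", "B", "C", "D"])
except RuntimeError as e:
    print(e)
```

In this model the wedge persists until something resets the in-memory state, which would be consistent with the observation that the tablets only recovered after kudu-ts27 was restarted.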