Mike Percy created KUDU-2740: -------------------------------- Summary: TabletCopyITest.TestTabletCopyingDeletedTabletFails flaky due to lack of leader election retries Key: KUDU-2740 URL: https://issues.apache.org/jira/browse/KUDU-2740 Project: Kudu Issue Type: Bug Affects Versions: 1.9.0 Reporter: Mike Percy
This test can be flaky because we disable failure detection and neglect to retry the leader election. An example error looks like this: {code:java} I0307 01:24:56.238428 5333 tablet_service.cc:1239] Received Run Leader Election RPC: tablet_id: "e75f819cfb0a45c483899e2396b3a07a" dest_uuid: "89a95beac49a43d0b02b662e1a228337" from {username='slave'} at 127.0.0.1:48832 I0307 01:24:56.238809 5333 raft_consensus.cc:472] T e75f819cfb0a45c483899e2396b3a07a P 89a95beac49a43d0b02b662e1a228337 [term 0 FOLLOWER]: Starting forced leader election (received explicit request) I0307 01:24:56.238982 5333 raft_consensus.cc:2886] T e75f819cfb0a45c483899e2396b3a07a P 89a95beac49a43d0b02b662e1a228337 [term 0 FOLLOWER]: Advancing to term 1 W0307 01:24:56.255393 5500 tablet.cc:1786] T e75f819cfb0a45c483899e2396b3a07a P 7509a1eca3f14f45903715fdb6a20f77: Can't schedule compaction. Clean time has not been advanced past its initial value. W0307 01:24:56.261915 5377 tablet.cc:1786] T e75f819cfb0a45c483899e2396b3a07a P 89a95beac49a43d0b02b662e1a228337: Can't schedule compaction. Clean time has not been advanced past its initial value. W0307 01:24:56.291085 5622 tablet.cc:1786] T e75f819cfb0a45c483899e2396b3a07a P 0caf13c7f5a64af781811ca30ab3656d: Can't schedule compaction. Clean time has not been advanced past its initial value. W0307 01:24:58.477632 5333 consensus_meta.cc:220] T e75f819cfb0a45c483899e2396b3a07a P 89a95beac49a43d0b02b662e1a228337: Time spent flushing consensus metadata: real 2.238s user 0.003s sys 0.000s I0307 01:24:58.477829 5333 raft_consensus.cc:494] T e75f819cfb0a45c483899e2396b3a07a P 89a95beac49a43d0b02b662e1a228337 [term 1 FOLLOWER]: Starting forced leader election with config: opid_index: -1 OBSOLETE_local: false peers { permanent_uuid: "89a95beac49a43d0b02b662e1a228337" member_type: VOTER last_known_addr { host: "127.4.141.65" port: 44695 } } peers { permanent_uuid: "0caf13c7f5a64af781811ca30ab3656d" member_type: VOTER last_known_addr { host: "127.4.141.67" port: 32845 } } peers { permanent_uuid: "7509a1eca3f14f45903715fdb6a20f77" member_type: VOTER last_known_addr { host: "127.4.141.66" port: 35595 } } I0307 01:24:58.479737 5333 leader_election.cc:296] T e75f819cfb0a45c483899e2396b3a07a P 89a95beac49a43d0b02b662e1a228337 [CANDIDATE]: Term 1 election: Requested vote from peers 0caf13c7f5a64af781811ca30ab3656d (127.4.141.67:32845), 7509a1eca3f14f45903715fdb6a20f77 (127.4.141.66:35595) I0307 01:24:58.480012 5333 rpcz_store.cc:269] Call kudu.consensus.ConsensusService.RunLeaderElection from 127.0.0.1:48832 (request call id 3) took 2241ms. Request Metrics: {"dns_us":93} I0307 01:24:58.487798 4661 cluster_itest_util.cc:249] Not converged past 1 yet: 0.0 0.0 0.0 I0307 01:24:58.493844 5578 tablet_service.cc:1122] Received RequestConsensusVote() RPC: tablet_id: "e75f819cfb0a45c483899e2396b3a07a" candidate_uuid: "89a95beac49a43d0b02b662e1a228337" candidate_term: 1 candidate_status { last_received { term: 0 index: 0 } } ignore_live_leader: true dest_uuid: "0caf13c7f5a64af781811ca30ab3656d" I0307 01:24:58.494168 5578 raft_consensus.cc:2886] T e75f819cfb0a45c483899e2396b3a07a P 0caf13c7f5a64af781811ca30ab3656d [term 0 FOLLOWER]: Advancing to term 1 I0307 01:24:58.494354 5456 tablet_service.cc:1122] Received RequestConsensusVote() RPC: tablet_id: "e75f819cfb0a45c483899e2396b3a07a" candidate_uuid: "89a95beac49a43d0b02b662e1a228337" candidate_term: 1 candidate_status { last_received { term: 0 index: 0 } } ignore_live_leader: true dest_uuid: "7509a1eca3f14f45903715fdb6a20f77" I0307 01:24:58.494655 5456 raft_consensus.cc:2886] T e75f819cfb0a45c483899e2396b3a07a P 7509a1eca3f14f45903715fdb6a20f77 [term 0 FOLLOWER]: Advancing to term 1 W0307 01:24:59.988574 5267 leader_election.cc:341] T e75f819cfb0a45c483899e2396b3a07a P 89a95beac49a43d0b02b662e1a228337 [CANDIDATE]: Term 1 election: RPC error from VoteRequest() call to peer 7509a1eca3f14f45903715fdb6a20f77 (127.4.141.66:35595): Timed out: RequestConsensusVote RPC to 127.4.141.66:35595 timed out after 1.507s (SENT) W0307 01:24:59.988920 5266 leader_election.cc:341] T e75f819cfb0a45c483899e2396b3a07a P 89a95beac49a43d0b02b662e1a228337 [CANDIDATE]: Term 1 election: RPC error from VoteRequest() call to peer 0caf13c7f5a64af781811ca30ab3656d (127.4.141.67:32845): Timed out: RequestConsensusVote RPC to 127.4.141.67:32845 timed out after 1.507s (SENT) I0307 01:24:59.989068 5266 leader_election.cc:310] T e75f819cfb0a45c483899e2396b3a07a P 89a95beac49a43d0b02b662e1a228337 [CANDIDATE]: Term 1 election: Election decided. Result: candidate lost. Election summary: received 3 responses out of 3 voters: 1 yes votes; 2 no votes. yes voters: 89a95beac49a43d0b02b662e1a228337; no voters: 0caf13c7f5a64af781811ca30ab3656d, 7509a1eca3f14f45903715fdb6a20f77 I0307 01:24:59.990028 5672 raft_consensus.cc:2588] T e75f819cfb0a45c483899e2396b3a07a P 89a95beac49a43d0b02b662e1a228337 [term 1 FOLLOWER]: Leader election lost for term 1. Reason: could not achieve majority W0307 01:25:01.850855 5578 consensus_meta.cc:220] T e75f819cfb0a45c483899e2396b3a07a P 0caf13c7f5a64af781811ca30ab3656d: Time spent flushing consensus metadata: real 3.356s user 0.003s sys 0.000s W0307 01:25:01.850855 5456 consensus_meta.cc:220] T e75f819cfb0a45c483899e2396b3a07a P 7509a1eca3f14f45903715fdb6a20f77: Time spent flushing consensus metadata: real 3.356s user 0.003s sys 0.000s I0307 01:25:01.851020 5578 raft_consensus.cc:2346] T e75f819cfb0a45c483899e2396b3a07a P 0caf13c7f5a64af781811ca30ab3656d [term 1 FOLLOWER]: Leader election vote request: Granting yes vote for candidate 89a95beac49a43d0b02b662e1a228337 in term 1. I0307 01:25:01.851022 5456 raft_consensus.cc:2346] T e75f819cfb0a45c483899e2396b3a07a P 7509a1eca3f14f45903715fdb6a20f77 [term 1 FOLLOWER]: Leader election vote request: Granting yes vote for candidate 89a95beac49a43d0b02b662e1a228337 in term 1. W0307 01:25:01.851320 5578 rpcz_store.cc:253] Call kudu.consensus.ConsensusService.RequestConsensusVote from 127.4.141.65:60535 (request call id 1) took 3357 ms (3.36 s). Client timeout 1507 ms (1.51 s) W0307 01:25:01.851326 5456 rpcz_store.cc:253] Call kudu.consensus.ConsensusService.RequestConsensusVote from 127.4.141.65:40719 (request call id 1) took 3357 ms (3.36 s). Client timeout 1507 ms (1.51 s) W0307 01:25:01.852047 5578 rpcz_store.cc:259] Trace: 0307 01:24:58.493384 (+ 0us) service_pool.cc:162] Inserting onto call queue 0307 01:24:58.493470 (+ 86us) service_pool.cc:221] Handling call 0307 01:25:01.851251 (+3357781us) inbound_call.cc:162] Queueing success response Metrics: {} W0307 01:25:01.852052 5456 rpcz_store.cc:259] Trace: 0307 01:24:58.493945 (+ 0us) service_pool.cc:162] Inserting onto call queue 0307 01:24:58.494010 (+ 65us) service_pool.cc:221] Handling call 0307 01:25:01.851251 (+3357241us) inbound_call.cc:162] Queueing success response Metrics: {} I0307 01:25:01.851759 5455 rpcz_store.cc:269] Call kudu.consensus.ConsensusService.GetLastOpId from 127.0.0.1:54392 (request call id 3) took 3261ms. Request Metrics: {"spinlock_wait_cycles":7499205888} I0307 01:25:01.853729 4661 cluster_itest_util.cc:249] Not converged past 1 yet: 0.0 0.0 0.0 I0307 01:25:02.057730 4661 cluster_itest_util.cc:249] Not converged past 1 yet: 0.0 0.0 0.0 I0307 01:25:02.362155 4661 cluster_itest_util.cc:249] Not converged past 1 yet: 0.0 0.0 0.0 I0307 01:25:02.766373 4661 cluster_itest_util.cc:249] Not converged past 1 yet: 0.0 0.0 0.0 I0307 01:25:03.270622 4661 cluster_itest_util.cc:249] Not converged past 1 yet: 0.0 0.0 0.0 I0307 01:25:03.875089 4661 cluster_itest_util.cc:249] Not converged past 1 yet: 0.0 0.0 0.0 I0307 01:25:04.579949 4661 cluster_itest_util.cc:249] Not converged past 1 yet: 0.0 0.0 0.0 I0307 01:25:05.384346 4661 cluster_itest_util.cc:249] Not converged past 1 yet: 0.0 0.0 0.0 I0307 01:25:06.288928 4661 cluster_itest_util.cc:249] Not converged past 1 yet: 0.0 0.0 0.0 I0307 01:25:07.293898 4661 cluster_itest_util.cc:249] Not converged past 1 yet: 0.0 0.0 0.0 I0307 01:25:08.298326 4661 cluster_itest_util.cc:249] Not converged past 1 yet: 0.0 0.0 0.0 I0307 01:25:09.302745 4661 cluster_itest_util.cc:249] Not converged past 1 yet: 0.0 0.0 0.0 I0307 01:25:10.307309 4661 cluster_itest_util.cc:249] Not converged past 1 yet: 0.0 0.0 0.0 I0307 01:25:11.311887 4661 cluster_itest_util.cc:249] Not converged past 1 yet: 0.0 0.0 0.0 I0307 01:25:12.316773 4661 cluster_itest_util.cc:249] Not converged past 1 yet: 0.0 0.0 0.0 I0307 01:25:13.321633 4661 cluster_itest_util.cc:249] Not converged past 1 yet: 0.0 0.0 0.0 I0307 01:25:14.327412 4661 cluster_itest_util.cc:249] Not converged past 1 yet: 0.0 0.0 0.0 I0307 01:25:15.333164 4661 cluster_itest_util.cc:249] Not converged past 1 yet: 0.0 0.0 0.0 I0307 01:25:16.337683 4661 cluster_itest_util.cc:249] Not converged past 1 yet: 0.0 0.0 0.0 I0307 01:25:17.342846 4661 cluster_itest_util.cc:249] Not converged past 1 yet: 0.0 0.0 0.0 I0307 01:25:18.347681 4661 cluster_itest_util.cc:249] Not converged past 1 yet: 0.0 0.0 0.0 I0307 01:25:19.351661 4661 cluster_itest_util.cc:249] Not converged past 1 yet: 0.0 0.0 0.0 I0307 01:25:20.355283 4661 cluster_itest_util.cc:249] Not converged past 1 yet: 0.0 0.0 0.0 I0307 01:25:21.359144 4661 cluster_itest_util.cc:249] Not converged past 1 yet: 0.0 0.0 0.0 I0307 01:25:22.363147 4661 cluster_itest_util.cc:249] Not converged past 1 yet: 0.0 0.0 0.0 I0307 01:25:23.366828 4661 cluster_itest_util.cc:249] Not converged past 1 yet: 0.0 0.0 0.0 I0307 01:25:24.371155 4661 cluster_itest_util.cc:249] Not converged past 1 yet: 0.0 0.0 0.0 I0307 01:25:25.375557 4661 cluster_itest_util.cc:249] Not converged past 1 yet: 0.0 0.0 0.0 I0307 01:25:26.379287 4661 cluster_itest_util.cc:249] Not converged past 1 yet: 0.0 0.0 0.0 I0307 01:25:27.383133 4661 cluster_itest_util.cc:249] Not converged past 1 yet: 0.0 0.0 0.0 I0307 01:25:28.386693 4661 cluster_itest_util.cc:249] Not converged past 1 yet: 0.0 0.0 0.0 /data/somelongdirectorytoavoidrpathissues/src/kudu/src/kudu/integration-tests/tablet_copy-itest.cc:1019: Failure Failed Bad status: Timed out: index 1 not available on all replicas after 30.000s {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005)