[ https://issues.apache.org/jira/browse/KUDU-3460?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17703120#comment-17703120 ]
daicheng commented on KUDU-3460:
--------------------------------

I found that the errors occur when the Kudu cluster is under heavy load, and I resolved them by setting --[{{raft_heartbeat_interval_ms}}|https://kudu.apache.org/docs/configuration_reference.html#kudu-master_raft_heartbeat_interval_ms] to 2000.

> RPC error from VoteRequest()call to peer **:Timed out: RequestConsensusVote RPC to ** time out after 1.713s [SENT]
> ------------------------------------------------------------------------------------------------------------------
>
> Key: KUDU-3460
> URL: https://issues.apache.org/jira/browse/KUDU-3460
> Project: Kudu
> Issue Type: Bug
> Affects Versions: 1.16.0
> Reporter: daicheng
> Priority: Major
> Attachments: image-2023-03-17-15-27-45-755.png, image-2023-03-17-15-28-13-480.png, image-2023-03-17-15-28-40-361.png, image-2023-03-17-15-38-51-218.png
>
>
> We have 3 kudu_master and 6 kudu_tserver instances. When I created 20,000 (2W) tables in Kudu, we got some errors and could no longer read any data from Kudu; it throws many errors.
> Here are the errors from the client:
> {code:java}
> Job aborted due to stage failure: Task 0 in stage 35.0 failed 4 times, most recent failure: Lost task 0.3 in stage 35.0 (TID 9601) (prod-bigdata-mw-159 executor 3): java.lang.RuntimeException: org.apache.kudu.client.NonRecoverableException: tablet hasn't heard from leader or there hasn't been a stable leader fo..
> 2023-03-08 09:59:49,198 INFO org.apache.kudu.client.AsyncKuduClient [] - Invalidating location master-10.0.2.33:7051(10.0.2.33:7051) for tablet Kudu Master: Service unavailable: ListTables request on kudu.master.MasterService from 10.0.3.82:8764 dropped due to backpressure. The service queue is full; it has 100 items.
> {code}
> I also found that the Kudu tserver logs many errors like these:
> {code:java}
> W0307 14:36:57.368008 14759 leader_election.cc:334] T fa2a3b405a87466da7a6b1a962f35d99 P 5ac35cfccaf84228bf6d589501ec533e [CANDIDATE]: Term 1640 pre-election: RPC error from VoteRequest() call to peer d7b4384df45549a891f444d1a1f36a38 (10.0.2.19:7050): Timed out: RequestConsensusVote RPC to 10.0.2.19:7050 timed out after 2.206s (SENT)
> W0307 14:36:57.368801 14759 leader_election.cc:334] T 5f8d377660aa46f29e3f1595a33d086c P 5ac35cfccaf84228bf6d589501ec533e [CANDIDATE]: Term 2 pre-election: RPC error from VoteRequest() call to peer dfff3b43d48a41d5b8f2e5cbb9880454 (10.0.2.21:7050): Timed out: RequestConsensusVote RPC to 10.0.2.21:7050 timed out after 1.725s (SENT)
> W0307 14:36:57.368917 14759 leader_election.cc:334] T a32af7dd8af44b47b4b26d7a222c2f6b P 5ac35cfccaf84228bf6d589501ec533e [CANDIDATE]: Term 344 pre-election: RPC error from VoteRequest() call to peer dfff3b43d48a41d5b8f2e5cbb9880454 (10.0.2.21:7050): Timed out: RequestConsensusVote RPC to 10.0.2.21:7050 timed out after 1.713s (SENT)
> W0307 14:36:57.369045 14759 leader_election.cc:334] T 15e9b550c3274243a5ee923ceda67dc5 P 5ac35cfccaf84228bf6d589501ec533e [CANDIDATE]: Term 1509 pre-election: RPC error from VoteRequest() call to peer d7b4384df45549a891f444d1a1f36a38 (10.0.2.19:7050): Timed out: RequestConsensusVote RPC to 10.0.2.19:7050 timed out after 3.056s (SENT)
> W0307 14:36:57.369563 14759 leader_election.cc:334] T e5e49b443f71478984162a2eb65d3607 P 5ac35cfccaf84228bf6d589501ec533e [CANDIDATE]: Term 1575 pre-election: RPC error from VoteRequest() call to peer d7b4384df45549a891f444d1a1f36a38 (10.0.2.19:7050): Timed out: RequestConsensusVote RPC to 10.0.2.19:7050 timed out after 1.553s (SENT)
> W0307 14:36:57.371872 14759 leader_election.cc:334] T 2ec17c9dd68e47ceb7f572efb9f18fe3 P 5ac35cfccaf84228bf6d589501ec533e [CANDIDATE]: Term 1633 pre-election: RPC error from VoteRequest() call to peer d7b4384df45549a891f444d1a1f36a38 (10.0.2.19:7050): Timed out: RequestConsensusVote RPC to 10.0.2.19:7050 timed out after 2.010s (SENT)
> W0307 14:36:57.372673 14759 leader_election.cc:334] T a91cf24cc4c943cbbd041c7e6726d7aa P 5ac35cfccaf84228bf6d589501ec533e [CANDIDATE]: Term 1610 pre-election: RPC error from VoteRequest() call to peer d7b4384df45549a891f444d1a1f36a38 (10.0.2.19:7050): Timed out: RequestConsensusVote RPC to 10.0.2.19:7050 timed out after 1.970s (SENT)
> W0307 14:36:57.372789 14759 leader_election.cc:334] T cd667f33abb74afba4b9c510b8f6dfaa P 5ac35cfccaf84228bf6d589501ec533e [CANDIDATE]: Term 3 pre-election: RPC error from VoteRequest() call to peer dfff3b43d48a41d5b8f2e5cbb9880454 (10.0.2.21:7050): Timed out: RequestConsensusVote RPC to 10.0.2.21:7050 timed out after 1.674s (SENT)
> W0307 14:36:57.373358 14759 leader_election.cc:334] T 39709b52ffe34f81b08d0562e45a7a13 P 5ac35cfccaf84228bf6d589501ec533e [CANDIDATE]: Term 44 pre-election: RPC error from VoteRequest() call to peer d7b4384df45549a891f444d1a1f36a38 (10.0.2.19:7050): Timed out: RequestConsensusVote RPC to 10.0.2.19:7050 timed out after 1.636s (SENT)
> W0307 14:36:57.373525 14759 leader_election.cc:334] T 00da9e2c20814ac88e18f7d7220f01c9 P 5ac35cfccaf84228bf6d589501ec533e [CANDIDATE]: Term 2 pre-election: RPC error from VoteRequest() call to peer dfff3b43d48a41d5b8f2e5cbb9880454 (10.0.2.21:7050): Timed out: RequestConsensusVote RPC to 10.0.2.21:7050 timed out after 1.524s (SENT)
> {code}
> The disk where the WAL directory is located also looks abnormal:
> !image-2023-03-17-15-27-45-755.png|width=314,height=166!!image-2023-03-17-15-28-40-361.png|width=309,height=135!
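
To see which peers the vote requests are timing out against, the warnings can be tallied by peer address. The following is only a minimal sketch and not part of Kudu's tooling: the function name `tally_vote_timeouts` and the regular expression are assumptions, derived solely from the log format quoted in this issue.

```python
import re
from collections import Counter

# Assumed pattern, based only on the leader_election.cc warnings quoted in this issue.
VOTE_TIMEOUT = re.compile(
    r"RequestConsensusVote\s+RPC\s+to\s+(\S+)\s+timed\s+out\s+after\s+(\d+(?:\.\d+)?)s"
)


def tally_vote_timeouts(log_text):
    """Return (timeouts per peer, longest timeout in seconds per peer)."""
    counts = Counter()
    longest = {}
    # finditer scans the whole text, so entries fused onto one line still match.
    for match in VOTE_TIMEOUT.finditer(log_text):
        peer, seconds = match.group(1), float(match.group(2))
        counts[peer] += 1
        longest[peer] = max(longest.get(peer, 0.0), seconds)
    return counts, longest


if __name__ == "__main__":
    # Two message tails copied from the tserver warnings in this issue.
    sample = (
        "Timed out: RequestConsensusVote RPC to 10.0.2.19:7050 timed out after 2.206s (SENT)\n"
        "Timed out: RequestConsensusVote RPC to 10.0.2.21:7050 timed out after 1.725s (SENT)\n"
    )
    counts, longest = tally_vote_timeouts(sample)
    for peer, n in counts.most_common():
        print(f"{peer}: {n} timeouts, longest {longest[peer]:.3f}s")
```

Applied to the ten warnings quoted in this issue, the tally is 6 timeouts against 10.0.2.19:7050 and 4 against 10.0.2.21:7050, so in this excerpt the candidate on 10.0.2.20 is failing to reach two specific peers rather than the whole cluster. Whether that points at those hosts' disks or at general overload would need more of the log than is shown here.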
> Here is what the WAL file looks like:
> {code:java}
> schema_version: 0
> compression_codec: LZ4
> 1.1@6873507535186497536 REPLICATE NO_OP
>   id { term: 1 index: 1 } timestamp: 6873507535186497536 op_type: NO_OP noop_request { }
> COMMIT 1.1
>   op_type: NO_OP commited_op_id { term: 1 index: 1 }
> 1.2@6873839930165628928 REPLICATE CHANGE_CONFIG_OP
>   id { term: 1 index: 2 } timestamp: 6873839930165628928 op_type: CHANGE_CONFIG_OP
>   change_config_record {
>     tablet_id: "68d1c87651f442189f4d6c642b6ea7e6"
>     old_config {
>       opid_index: -1 OBSOLETE_local: false
>       peers { permanent_uuid: "448ba75af48e4ffdb740f7f8fe244a28" member_type: VOTER last_known_addr { host: "10.0.2.14" port: 7050 } }
>       peers { permanent_uuid: "d88293d7f919446ea14855ac8887a648" member_type: VOTER last_known_addr { host: "10.0.2.15" port: 7050 } }
>       peers { permanent_uuid: "5ac35cfccaf84228bf6d589501ec533e" member_type: VOTER last_known_addr { host: "10.0.2.20" port: 7050 } }
>     }
>     new_config {
>       opid_index: 2 OBSOLETE_local: false
>       peers { permanent_uuid: "448ba75af48e4ffdb740f7f8fe244a28" member_type: VOTER last_known_addr { host: "10.0.2.14" port: 7050 } }
>       peers { permanent_uuid: "d88293d7f919446ea14855ac8887a648" member_type: VOTER last_known_addr { host: "10.0.2.15" port: 7050 } }
>       peers { permanent_uuid: "5ac35cfccaf84228bf6d589501ec533e" member_type: VOTER last_known_addr { host: "10.0.2.20" port: 7050 } }
>       peers { permanent_uuid: "d7b4384df45549a891f444d1a1f36a38" member_type: NON_VOTER last_known_addr { host: "10.0.2.19" port: 7050 } attrs { promote: true } }
>     }
>   }
> COMMIT 1.2
>   op_type: CHANGE_CONFIG_OP commited_op_id { term: 1 index: 2 }
> 1.3@6873841023495979008 REPLICATE CHANGE_CONFIG_OP
>   id { term: 1 index: 3 } timestamp: 6873841023495979008 op_type: CHANGE_CONFIG_OP
>   change_config_record {
>     tablet_id: "68d1c87651f442189f4d6c642b6ea7e6"
>     old_config {
>       opid_index: 2 OBSOLETE_local: false
>       peers { permanent_uuid: "448ba75af48e4ffdb740f7f8fe244a28" member_type: VOTER last_known_addr { host: "10.0.2.14" port: 7050 } }
>       peers { permanent_uuid: "d88293d7f919446ea14855ac8887a648" member_type: VOTER last_known_addr { host: "10.0.2.15" port: 7050 } }
>       peers { permanent_uuid: "5ac35cfccaf84228bf6d589501ec533e" member_type: VOTER last_known_addr { host: "10.0.2.20" port: 7050 } }
>       peers { permanent_uuid: "d7b4384df45549a891f444d1a1f36a38" member_type: NON_VOTER last_known_addr { host: "10.0.2.19" port: 7050 } attrs { promote: true } }
>     }
>     new_config {
>       opid_index: 3 OBSOLETE_local: false
>       peers { permanent_uuid: "448ba75af48e4ffdb740f7f8fe244a28" member_type: VOTER last_known_addr { host: "10.0.2.14" port: 7050 } }
>       peers { permanent_uuid: "d88293d7f919446ea14855ac8887a648" member_type: VOTER last_known_addr { host: "10.0.2.15" port: 7050 } }
>       peers { permanent_uuid: "5ac35cfccaf84228bf6d589501ec533e" member_type: VOTER last_known_addr { host: "10.0.2.20" port: 7050 } }
>       peers { permanent_uuid: "d7b4384df45549a891f444d1a1f36a38" member_type: VOTER last_known_addr { host: "10.0.2.19" port: 7050 } attrs { promote: false } }
>     }
>   }
> COMMIT 1.3
>   op_type: CHANGE_CONFIG_OP commited_op_id { term: 1 index: 3 }
> 1.4@6873841038243381248 REPLICATE CHANGE_CONFIG_OP
>   id { term: 1 index: 4 } timestamp: 6873841038243381248 op_type: CHANGE_CONFIG_OP
>   change_config_record {
>     tablet_id: "68d1c87651f442189f4d6c642b6ea7e6"
>     old_config {
>       opid_index: 3 OBSOLETE_local: false
>       peers { permanent_uuid: "448ba75af48e4ffdb740f7f8fe244a28" member_type: VOTER last_known_addr { host: "10.0.2.14" port: 7050 } }
>       peers { permanent_uuid: "d88293d7f919446ea14855ac8887a648" member_type: VOTER last_known_addr { host: "10.0.2.15" port: 7050 } }
>       peers { permanent_uuid: "5ac35cfccaf84228bf6d589501ec533e" member_type: VOTER last_known_addr { host: "10.0.2.20" port: 7050 } }
>       peers { permanent_uuid: "d7b4384df45549a891f444d1a1f36a38" member_type: VOTER last_known_addr { host: "10.0.2.19" port: 7050 } attrs { promote: false } }
>     }
>     new_config {
>       opid_index: 4 OBSOLETE_local: false
>       peers { permanent_uuid: "448ba75af48e4ffdb740f7f8fe244a28" member_type: VOTER last_known_addr { host: "10.0.2.14" port: 7050 } }
>       peers { permanent_uuid: "d88293d7f919446ea14855ac8887a648" member_type: VOTER last_known_addr { host: "10.0.2.15" port: 7050 } }
>       peers { permanent_uuid: "d7b4384df45549a891f444d1a1f36a38" member_type: VOTER last_known_addr { host: "10.0.2.19" port: 7050 } attrs { promote: false } }
>     }
>   }
> COMMIT 1.4
> {code}
> There are also many Raft worker threads running:
> !image-2023-03-17-15-38-51-218.png|width=704,height=576!
> It seems the system is busy handling consensus votes, and I did not find any more helpful error logs in Kudu. Can anyone explain what happened?



--
This message was sent by Atlassian Jira
(v8.20.10#820010)