[ https://issues.apache.org/jira/browse/KUDU-3275?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Grant Henke updated KUDU-3275: ------------------------------ Attachment: raft_consensus-itest.2.txt.gz > RaftConsensusITest.TestSlowFollower is Flaky > -------------------------------------------- > > Key: KUDU-3275 > URL: https://issues.apache.org/jira/browse/KUDU-3275 > Project: Kudu > Issue Type: Bug > Affects Versions: 1.14.0 > Reporter: Grant Henke > Priority: Major > Attachments: raft_consensus-itest.2.txt.gz > > > I have seen RaftConsensusITest.TestSlowFollower fail quite a few times with > the following: > {code:java} > I0411 07:15:34.772778 3404 raft_consensus.cc:1200] T > a82c506675e74d62a75e53acd05e86e8 P 8110dc17824943a68d3051e0df96ae5d [term 1 > FOLLOWER]: Deduplicated request from leader. Original: > 1.18320->[1.18321-1.18746] Dedup: 1.18746->[]I0411 07:15:34.772778 3404 > raft_consensus.cc:1200] T a82c506675e74d62a75e53acd05e86e8 P > 8110dc17824943a68d3051e0df96ae5d [term 1 FOLLOWER]: Deduplicated request from > leader. Original: 1.18320->[1.18321-1.18746] Dedup: 1.18746->[]W0411 > 07:15:34.798368 3632 log.cc:502] T a82c506675e74d62a75e53acd05e86e8 P > 8110dc17824943a68d3051e0df96ae5d: Injecting 1096ms of latency in > SegmentAllocator::Sync()W0411 07:15:34.834003 3206 consensus_peers.cc:480] T > a82c506675e74d62a75e53acd05e86e8 P 5698d4de566c4adbae64b6f234c0561d -> Peer > 8110dc17824943a68d3051e0df96ae5d (127.28.159.66:42271): Couldn't send request > to peer 8110dc17824943a68d3051e0df96ae5d. Status: Timed out: UpdateConsensus > RPC to 127.28.159.66:42271 timed out after 0.050s (SENT). This is attempt 1: > this message will repeat every 5th retry.W0411 07:15:35.333277 3366 > tablet_service.cc:2874] Error setting up scanner with request: Service > unavailable: Timed out: could not wait for desired snapshot timestamp to be > consistent: Tablet is lagging too much to be able to serve snapshot scan. > Lagging by: 45366 ms, (max is 30000 ms):: new_scan_request { tablet_id: > "a82c506675e74d62a75e53acd05e86e8" projected_columns { name: "key" type: > INT32 is_key: true is_nullable: false } projected_columns { name: "int_val" > type: INT32 is_key: false is_nullable: false } projected_columns { name: > "string_val" type: STRING is_key: false is_nullable: true } read_mode: > READ_AT_SNAPSHOT propagated_timestamp: 6627841138429693952 cache_blocks: true > order_mode: ORDERED row_format_flags: 0 authz_token { token_data: > "\010\210\306\312\203\006\"3\n\005slave\022*\n > 4b6d5a12476d4dfcb5a4ab8a6f1190cb\020\001\030\001 \001(\001" signature: > "<redacted>""\036\312x4\323\300\350\177k\357\226a<?\375\366\272\373\220\211\031p\036@9,\364j\2608q\031\331\270\355\214\373k8tT\224\3666\000RM\033UbQ\316\247\236#\364\302VW\302\026fJ\'" > signing_key_seq_num: 0 } } call_seq_id: 0W0411 07:15:35.635779 3366 > tablet_service.cc:2874] Error setting up scanner with request: Service > unavailable: Timed out: could not wait for desired snapshot timestamp to be > consistent: Tablet is lagging too much to be able to serve snapshot scan. > Lagging by: 45668 ms, (max is 30000 ms):: new_scan_request { tablet_id: > "a82c506675e74d62a75e53acd05e86e8" projected_columns { name: "key" type: > INT32 is_key: true is_nullable: false } projected_columns { name: "int_val" > type: INT32 is_key: false is_nullable: false } projected_columns { name: > "string_val" type: STRING is_key: false is_nullable: true } read_mode: > READ_AT_SNAPSHOT propagated_timestamp: 6627841138425311232 cache_blocks: true > order_mode: ORDERED row_format_flags: 0 authz_token { token_data: > "\010\210\306\312\203\006\"3\n\005slave\022*\n > 4b6d5a12476d4dfcb5a4ab8a6f1190cb\020\001\030\001 \001(\001" signature: > "<redacted>""\036\312x4\323\300\350\177k\357\226a<?\375\366\272\373\220\211\031p\036@9,\364j\2608q\031\331\270\355\214\373k8tT\224\3666\000RM\033UbQ\316\247\236#\364\302VW\302\026fJ\'" > signing_key_seq_num: 0 } } call_seq_id: 0W0411 07:15:35.636209 3626 > scanner-internal.cc:406] Time spent opening tablet: real 57.698s user 0.002s > sys 0.000sF0411 07:15:35.636267 3626 test_workload.cc:255] Check failed: > _s.ok() Bad status: Timed out: exceeded configured scan timeout of 60.000s: > after 20 scan attempts: unable to retry before timeout: Remote error: Service > unavailable: Timed out: could not wait for desired snapshot timestamp to be > consistent: Timed out waiting for ts: P: 1618125277942117 usec, L: 0 to be > safe (mode: NON-LEADER). Current safe time: P: 1618125277252906 usec, L: 0 > Physical time difference: 0.689s*** Check failure stack trace: ****** Aborted > at 1618125335 (unix time) try "date -d @1618125335" if you are using GNU date > ***PC: @ 0x7fe2e72bafb7 gsignal*** SIGABRT (@0x3e80000727d) received by > PID 29309 (TID 0x7fe2d979a700) from PID 29309; stack trace: *** @ > 0x7fe2e97881f1 google::(anonymous namespace)::FailureSignalHandler() at ??:0 > @ 0x7fe2eb20e980 (unknown) at ??:0 @ 0x7fe2e72bafb7 gsignal at > ??:0 @ 0x7fe2e72bc921 abort at ??:0 @ 0x7fe2e9778439 > google::logging_fail() at ??:0 @ 0x7fe2e977bebd > google::LogMessage::Fail() at ??:0 @ 0x7fe2e977dace > google::LogMessage::SendToLog() at ??:0 @ 0x7fe2e977ba40 > google::LogMessage::Flush() at ??:0 @ 0x7fe2e977e3b9 > google::LogMessageFatal::~LogMessageFatal() at ??:0 @ 0x7fe2efae81f0 > kudu::TestWorkload::ReadThread() at ??:0 @ 0x7fe2eaa838df > execute_native_thread_routine at ??:0 @ 0x7fe2eb2036db start_thread at > ??:0 @ 0x7fe2e739d71f clone at ??:0I0411 07:15:35.762290 4692 > consensus_queue.cc:778] T a82c506675e74d62a75e53acd05e86e8 P > 5698d4de566c4adbae64b6f234c0561d [LEADER]: Peer > 8110dc17824943a68d3051e0df96ae5d is lagging by at least 66546 ops behind the > committed index [suppressed 8 similar messages]{code} -- This message was sent by Atlassian Jira (v8.3.4#803005)