Grant Henke created KUDU-3275:
---------------------------------

             Summary: RaftConsensusITest.TestSlowFollower is Flaky
                 Key: KUDU-3275
                 URL: https://issues.apache.org/jira/browse/KUDU-3275
             Project: Kudu
          Issue Type: Bug
    Affects Versions: 1.14.0
            Reporter: Grant Henke


I have seen RaftConsensusITest.TestSlowFollower fail quite a few times with the following check failure in the TestWorkload read thread:
{code:java}
I0411 07:15:34.772778  3404 raft_consensus.cc:1200] T a82c506675e74d62a75e53acd05e86e8 P 8110dc17824943a68d3051e0df96ae5d [term 1 FOLLOWER]: Deduplicated request from leader. Original: 1.18320->[1.18321-1.18746]   Dedup: 1.18746->[]
I0411 07:15:34.772778  3404 raft_consensus.cc:1200] T a82c506675e74d62a75e53acd05e86e8 P 8110dc17824943a68d3051e0df96ae5d [term 1 FOLLOWER]: Deduplicated request from leader. Original: 1.18320->[1.18321-1.18746]   Dedup: 1.18746->[]
W0411 07:15:34.798368  3632 log.cc:502] T a82c506675e74d62a75e53acd05e86e8 P 8110dc17824943a68d3051e0df96ae5d: Injecting 1096ms of latency in SegmentAllocator::Sync()
W0411 07:15:34.834003  3206 consensus_peers.cc:480] T a82c506675e74d62a75e53acd05e86e8 P 5698d4de566c4adbae64b6f234c0561d -> Peer 8110dc17824943a68d3051e0df96ae5d (127.28.159.66:42271): Couldn't send request to peer 8110dc17824943a68d3051e0df96ae5d. Status: Timed out: UpdateConsensus RPC to 127.28.159.66:42271 timed out after 0.050s (SENT). This is attempt 1: this message will repeat every 5th retry.
W0411 07:15:35.333277  3366 tablet_service.cc:2874] Error setting up scanner with request: Service unavailable: Timed out: could not wait for desired snapshot timestamp to be consistent: Tablet is lagging too much to be able to serve snapshot scan. Lagging by: 45366 ms, (max is 30000 ms):: new_scan_request { tablet_id: "a82c506675e74d62a75e53acd05e86e8" projected_columns { name: "key" type: INT32 is_key: true is_nullable: false } projected_columns { name: "int_val" type: INT32 is_key: false is_nullable: false } projected_columns { name: "string_val" type: STRING is_key: false is_nullable: true } read_mode: READ_AT_SNAPSHOT propagated_timestamp: 6627841138429693952 cache_blocks: true order_mode: ORDERED row_format_flags: 0 authz_token { token_data: "\010\210\306\312\203\006\"3\n\005slave\022*\n 4b6d5a12476d4dfcb5a4ab8a6f1190cb\020\001\030\001 \001(\001" signature: "<redacted>""\036\312x4\323\300\350\177k\357\226a<?\375\366\272\373\220\211\031p\036@9,\364j\2608q\031\331\270\355\214\373k8tT\224\3666\000RM\033UbQ\316\247\236#\364\302VW\302\026fJ\'" signing_key_seq_num: 0 } } call_seq_id: 0
W0411 07:15:35.635779  3366 tablet_service.cc:2874] Error setting up scanner with request: Service unavailable: Timed out: could not wait for desired snapshot timestamp to be consistent: Tablet is lagging too much to be able to serve snapshot scan. Lagging by: 45668 ms, (max is 30000 ms):: new_scan_request { tablet_id: "a82c506675e74d62a75e53acd05e86e8" projected_columns { name: "key" type: INT32 is_key: true is_nullable: false } projected_columns { name: "int_val" type: INT32 is_key: false is_nullable: false } projected_columns { name: "string_val" type: STRING is_key: false is_nullable: true } read_mode: READ_AT_SNAPSHOT propagated_timestamp: 6627841138425311232 cache_blocks: true order_mode: ORDERED row_format_flags: 0 authz_token { token_data: "\010\210\306\312\203\006\"3\n\005slave\022*\n 4b6d5a12476d4dfcb5a4ab8a6f1190cb\020\001\030\001 \001(\001" signature: "<redacted>""\036\312x4\323\300\350\177k\357\226a<?\375\366\272\373\220\211\031p\036@9,\364j\2608q\031\331\270\355\214\373k8tT\224\3666\000RM\033UbQ\316\247\236#\364\302VW\302\026fJ\'" signing_key_seq_num: 0 } } call_seq_id: 0
W0411 07:15:35.636209  3626 scanner-internal.cc:406] Time spent opening tablet: real 57.698s user 0.002s sys 0.000s
F0411 07:15:35.636267  3626 test_workload.cc:255] Check failed: _s.ok() Bad status: Timed out: exceeded configured scan timeout of 60.000s: after 20 scan attempts: unable to retry before timeout: Remote error: Service unavailable: Timed out: could not wait for desired snapshot timestamp to be consistent: Timed out waiting for ts: P: 1618125277942117 usec, L: 0 to be safe (mode: NON-LEADER). Current safe time: P: 1618125277252906 usec, L: 0 Physical time difference: 0.689s
*** Check failure stack trace: ***
*** Aborted at 1618125335 (unix time) try "date -d @1618125335" if you are using GNU date ***
PC: @     0x7fe2e72bafb7 gsignal
*** SIGABRT (@0x3e80000727d) received by PID 29309 (TID 0x7fe2d979a700) from PID 29309; stack trace: ***
    @     0x7fe2e97881f1 google::(anonymous namespace)::FailureSignalHandler() at ??:0
    @     0x7fe2eb20e980 (unknown) at ??:0
    @     0x7fe2e72bafb7 gsignal at ??:0
    @     0x7fe2e72bc921 abort at ??:0
    @     0x7fe2e9778439 google::logging_fail() at ??:0
    @     0x7fe2e977bebd google::LogMessage::Fail() at ??:0
    @     0x7fe2e977dace google::LogMessage::SendToLog() at ??:0
    @     0x7fe2e977ba40 google::LogMessage::Flush() at ??:0
    @     0x7fe2e977e3b9 google::LogMessageFatal::~LogMessageFatal() at ??:0
    @     0x7fe2efae81f0 kudu::TestWorkload::ReadThread() at ??:0
    @     0x7fe2eaa838df execute_native_thread_routine at ??:0
    @     0x7fe2eb2036db start_thread at ??:0
    @     0x7fe2e739d71f clone at ??:0
I0411 07:15:35.762290  4692 consensus_queue.cc:778] T a82c506675e74d62a75e53acd05e86e8 P 5698d4de566c4adbae64b6f234c0561d [LEADER]: Peer 8110dc17824943a68d3051e0df96ae5d is lagging by at least 66546 ops behind the committed index  [suppressed 8 similar messages]
{code}


