[ 
https://issues.apache.org/jira/browse/KUDU-3275?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Grant Henke updated KUDU-3275:
------------------------------
    Attachment: raft_consensus-itest.2.txt.gz

> RaftConsensusITest.TestSlowFollower is Flaky
> --------------------------------------------
>
>                 Key: KUDU-3275
>                 URL: https://issues.apache.org/jira/browse/KUDU-3275
>             Project: Kudu
>          Issue Type: Bug
>    Affects Versions: 1.14.0
>            Reporter: Grant Henke
>            Priority: Major
>         Attachments: raft_consensus-itest.2.txt.gz
>
>
> I have seen RaftConsensusITest.TestSlowFollower fail quite a few times with 
> the following: 
> {code:java}
> I0411 07:15:34.772778  3404 raft_consensus.cc:1200] T 
> a82c506675e74d62a75e53acd05e86e8 P 8110dc17824943a68d3051e0df96ae5d [term 1 
> FOLLOWER]: Deduplicated request from leader. Original: 
> 1.18320->[1.18321-1.18746]   Dedup: 1.18746->[]I0411 07:15:34.772778  3404 
> raft_consensus.cc:1200] T a82c506675e74d62a75e53acd05e86e8 P 
> 8110dc17824943a68d3051e0df96ae5d [term 1 FOLLOWER]: Deduplicated request from 
> leader. Original: 1.18320->[1.18321-1.18746]   Dedup: 1.18746->[]W0411 
> 07:15:34.798368  3632 log.cc:502] T a82c506675e74d62a75e53acd05e86e8 P 
> 8110dc17824943a68d3051e0df96ae5d: Injecting 1096ms of latency in 
> SegmentAllocator::Sync()W0411 07:15:34.834003  3206 consensus_peers.cc:480] T 
> a82c506675e74d62a75e53acd05e86e8 P 5698d4de566c4adbae64b6f234c0561d -> Peer 
> 8110dc17824943a68d3051e0df96ae5d (127.28.159.66:42271): Couldn't send request 
> to peer 8110dc17824943a68d3051e0df96ae5d. Status: Timed out: UpdateConsensus 
> RPC to 127.28.159.66:42271 timed out after 0.050s (SENT). This is attempt 1: 
> this message will repeat every 5th retry.W0411 07:15:35.333277  3366 
> tablet_service.cc:2874] Error setting up scanner with request: Service 
> unavailable: Timed out: could not wait for desired snapshot timestamp to be 
> consistent: Tablet is lagging too much to be able to serve snapshot scan. 
> Lagging by: 45366 ms, (max is 30000 ms):: new_scan_request { tablet_id: 
> "a82c506675e74d62a75e53acd05e86e8" projected_columns { name: "key" type: 
> INT32 is_key: true is_nullable: false } projected_columns { name: "int_val" 
> type: INT32 is_key: false is_nullable: false } projected_columns { name: 
> "string_val" type: STRING is_key: false is_nullable: true } read_mode: 
> READ_AT_SNAPSHOT propagated_timestamp: 6627841138429693952 cache_blocks: true 
> order_mode: ORDERED row_format_flags: 0 authz_token { token_data: 
> "\010\210\306\312\203\006\"3\n\005slave\022*\n 
> 4b6d5a12476d4dfcb5a4ab8a6f1190cb\020\001\030\001 \001(\001" signature: 
> "<redacted>""\036\312x4\323\300\350\177k\357\226a<?\375\366\272\373\220\211\031p\036@9,\364j\2608q\031\331\270\355\214\373k8tT\224\3666\000RM\033UbQ\316\247\236#\364\302VW\302\026fJ\'"
>  signing_key_seq_num: 0 } } call_seq_id: 0W0411 07:15:35.635779  3366 
> tablet_service.cc:2874] Error setting up scanner with request: Service 
> unavailable: Timed out: could not wait for desired snapshot timestamp to be 
> consistent: Tablet is lagging too much to be able to serve snapshot scan. 
> Lagging by: 45668 ms, (max is 30000 ms):: new_scan_request { tablet_id: 
> "a82c506675e74d62a75e53acd05e86e8" projected_columns { name: "key" type: 
> INT32 is_key: true is_nullable: false } projected_columns { name: "int_val" 
> type: INT32 is_key: false is_nullable: false } projected_columns { name: 
> "string_val" type: STRING is_key: false is_nullable: true } read_mode: 
> READ_AT_SNAPSHOT propagated_timestamp: 6627841138425311232 cache_blocks: true 
> order_mode: ORDERED row_format_flags: 0 authz_token { token_data: 
> "\010\210\306\312\203\006\"3\n\005slave\022*\n 
> 4b6d5a12476d4dfcb5a4ab8a6f1190cb\020\001\030\001 \001(\001" signature: 
> "<redacted>""\036\312x4\323\300\350\177k\357\226a<?\375\366\272\373\220\211\031p\036@9,\364j\2608q\031\331\270\355\214\373k8tT\224\3666\000RM\033UbQ\316\247\236#\364\302VW\302\026fJ\'"
>  signing_key_seq_num: 0 } } call_seq_id: 0W0411 07:15:35.636209  3626 
> scanner-internal.cc:406] Time spent opening tablet: real 57.698s user 0.002s 
> sys 0.000sF0411 07:15:35.636267  3626 test_workload.cc:255] Check failed: 
> _s.ok() Bad status: Timed out: exceeded configured scan timeout of 60.000s: 
> after 20 scan attempts: unable to retry before timeout: Remote error: Service 
> unavailable: Timed out: could not wait for desired snapshot timestamp to be 
> consistent: Timed out waiting for ts: P: 1618125277942117 usec, L: 0 to be 
> safe (mode: NON-LEADER). Current safe time: P: 1618125277252906 usec, L: 0 
> Physical time difference: 0.689s*** Check failure stack trace: ****** Aborted 
> at 1618125335 (unix time) try "date -d @1618125335" if you are using GNU date 
> ***PC: @     0x7fe2e72bafb7 gsignal*** SIGABRT (@0x3e80000727d) received by 
> PID 29309 (TID 0x7fe2d979a700) from PID 29309; stack trace: ***    @     
> 0x7fe2e97881f1 google::(anonymous namespace)::FailureSignalHandler() at ??:0  
>   @     0x7fe2eb20e980 (unknown) at ??:0    @     0x7fe2e72bafb7 gsignal at 
> ??:0    @     0x7fe2e72bc921 abort at ??:0    @     0x7fe2e9778439 
> google::logging_fail() at ??:0    @     0x7fe2e977bebd 
> google::LogMessage::Fail() at ??:0    @     0x7fe2e977dace 
> google::LogMessage::SendToLog() at ??:0    @     0x7fe2e977ba40 
> google::LogMessage::Flush() at ??:0    @     0x7fe2e977e3b9 
> google::LogMessageFatal::~LogMessageFatal() at ??:0    @     0x7fe2efae81f0 
> kudu::TestWorkload::ReadThread() at ??:0    @     0x7fe2eaa838df 
> execute_native_thread_routine at ??:0    @     0x7fe2eb2036db start_thread at 
> ??:0    @     0x7fe2e739d71f clone at ??:0I0411 07:15:35.762290  4692 
> consensus_queue.cc:778] T a82c506675e74d62a75e53acd05e86e8 P 
> 5698d4de566c4adbae64b6f234c0561d [LEADER]: Peer 
> 8110dc17824943a68d3051e0df96ae5d is lagging by at least 66546 ops behind the 
> committed index  [suppressed 8 similar messages]{code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

Reply via email to