Andrew Wong created KUDU-2906:
---------------------------------

             Summary: Don't allow elections when server clocks are too out of sync
                 Key: KUDU-2906
                 URL: https://issues.apache.org/jira/browse/KUDU-2906
             Project: Kudu
          Issue Type: Bug
          Components: consensus
    Affects Versions: 1.10.0
            Reporter: Andrew Wong

In cases where machine clocks are not properly synchronized, a tablet replica whose clock is very far in the future (more than --max_clock_sync_error_usec=10 sec ahead) may be elected leader. When that happens, it's possible that any writes that go to that tablet will be rejected by the followers but still persisted to the leader's WAL.

Then, once the clock on that machine is fixed, the replica may try to replay the future op and fail because the op timestamp is too far in the future, with errors like:

{code:java}
F0715 12:03:09.369819 3500 tablet_bootstrap.cc:904] Check failed: _s.ok() Bad status: Invalid argument: Tried to update clock beyond the max. error.
{code}

Dumping a recovery WAL, I could see:

{code:java}
130.138@6400743143334211584 REPLICATE NO_OP
    id { term: 130 index: 138 } timestamp: 6400743143334211584 op_type: NO_OP noop_request { }
COMMIT 130.138
    op_type: NO_OP commited_op_id { term: 130 index: 138 }
131.139@6400743925559676928 REPLICATE NO_OP
    id { term: 131 index: 139 } timestamp: 6400743925559676928 op_type: NO_OP noop_request { }
COMMIT 131.139
    op_type: NO_OP commited_op_id { term: 131 index: 139 }
132.140@11589864471731939930 REPLICATE NO_OP
    id { term: 132 index: 140 } timestamp: 11589864471731939930 op_type: NO_OP noop_request { }
{code}

Note the drastic jump in timestamp between ops 131.139 and 132.140, and that 132.140 has no matching COMMIT.

In this specific case, we verified that the replayed WAL wasn't far behind the recovery WAL (the one with the future timestamps), so we could just delete the recovery WAL and bootstrap from the replayed WAL.

It would have been nice if those bad ops had never been written at all, perhaps by preventing an election between such mismatched servers in the first place.
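
To make the size of the jump concrete, here is a minimal Python sketch that decodes the three timestamps from the WAL dump. It assumes Kudu's hybrid timestamp layout is physical microseconds since the Unix epoch shifted left by 12 bits, with a logical counter in the low 12 bits. The helper names (`decode`, `to_utc`, `find_future_ops`) are made up for illustration and are not part of any Kudu tool. The reference "now" is also a stand-in: it reuses the physical time of op 131.139 rather than a real clock reading.

```python
from datetime import datetime, timedelta, timezone

# Assumption: physical microseconds live in the high bits and a 12-bit
# logical counter lives in the low bits of each hybrid timestamp.
LOGICAL_BITS = 12
LOGICAL_MASK = (1 << LOGICAL_BITS) - 1

# Mirrors --max_clock_sync_error_usec=10 sec from the report.
MAX_CLOCK_SYNC_ERROR_USEC = 10_000_000

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)

# (op id, raw timestamp) pairs copied from the recovery WAL dump.
WAL_OPS = [
    ("130.138", 6400743143334211584),
    ("131.139", 6400743925559676928),
    ("132.140", 11589864471731939930),
]


def decode(ts):
    """Split a raw hybrid timestamp into (physical_usec, logical)."""
    return ts >> LOGICAL_BITS, ts & LOGICAL_MASK


def to_utc(physical_usec):
    """Convert physical microseconds since the epoch to a UTC datetime."""
    return EPOCH + timedelta(microseconds=physical_usec)


def find_future_ops(ops, now_usec, max_error_usec=MAX_CLOCK_SYNC_ERROR_USEC):
    """Return ids of ops whose physical time is more than max_error_usec ahead of now_usec."""
    return [
        op_id
        for op_id, ts in ops
        if decode(ts)[0] - now_usec > max_error_usec
    ]


if __name__ == "__main__":
    for op_id, ts in WAL_OPS:
        physical, logical = decode(ts)
        print(op_id, to_utc(physical).isoformat(), "logical:", logical)

    # Stand-in for a corrected local clock: physical time of op 131.139.
    now_usec = decode(WAL_OPS[1][1])[0]
    print("ops too far in the future:", find_future_ops(WAL_OPS, now_usec))
```

If the layout assumption holds, the first two ops decode to physical times in July 2019 about 191 seconds apart, while op 132.140 decodes to a physical time in 2059, roughly 40 years ahead. That is far beyond the 10 second bound, and only that op is flagged.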