[ 
https://issues.apache.org/jira/browse/KUDU-2906?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16894190#comment-16894190
 ] 

Todd Lipcon commented on KUDU-2906:
-----------------------------------

So in this case, the server's NTP reported that it was in TIME_OK status, but 
gave a really incorrect error, with an incorrect maxerror?

Perhaps we could also do something to detect this kind of skew by making sure 
the monotonic clock and the wall clock advance at roughly the same rate (e.g., 
if the monotonic clock advanced by 1 second and the NTP clock advanced by 3, 
that's a problem!)
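
A rough sketch of that check, in Python for brevity rather than Kudu's C++. 
The function names and the 0.5 second tolerance are hypothetical, picked only 
for illustration; nothing here is existing Kudu code.

```python
import time


def sample_clocks():
    """Read the monotonic clock and the wall clock back to back (seconds)."""
    return time.monotonic(), time.time()


def clock_rate_skew(mono_start, wall_start, mono_end, wall_end):
    """Seconds by which the wall clock advanced more than the monotonic clock.

    A positive value means the wall clock ran ahead; negative means it fell
    behind or was stepped backwards.
    """
    return (wall_end - wall_start) - (mono_end - mono_start)


def clocks_agree(mono_start, wall_start, mono_end, wall_end, max_skew_secs=0.5):
    """True if both clocks advanced by roughly the same amount.

    max_skew_secs is an assumed tolerance, not a Kudu flag.
    """
    skew = clock_rate_skew(mono_start, wall_start, mono_end, wall_end)
    return abs(skew) <= max_skew_secs
```

A server could call sample_clocks() periodically, keep the previous pair, and 
treat a failed clocks_agree() as a sign that the wall clock was stepped or is 
drifting, even while NTP still reports TIME_OK.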

> Don't allow elections when server clocks are too out of sync
> ------------------------------------------------------------
>
>                 Key: KUDU-2906
>                 URL: https://issues.apache.org/jira/browse/KUDU-2906
>             Project: Kudu
>          Issue Type: Bug
>          Components: consensus
>    Affects Versions: 1.10.0
>            Reporter: Andrew Wong
>            Priority: Major
>
> In cases where machine clocks are not properly synchronized, if a tablet 
> replica is elected leader whose clock happens to be very far in the future 
> (greater than --max_clock_sync_error_usec=10 sec), it's possible that any 
> writes that go to that tablet will be rejected by the followers, but 
> persisted to the leader's WAL.
> Then, upon fixing the clock on that machine, the replica may try to replay 
> the future op, but fail to replay it because the op timestamp is too far in 
> the future, with errors like:
> {code:java}
> F0715 12:03:09.369819  3500 tablet_bootstrap.cc:904] Check failed: _s.ok() 
> Bad status: Invalid argument: Tried to update clock beyond the max. 
> error.{code}
> Dumping a recovery WAL, I could see:
> {code:java}
> 130.138@6400743143334211584 REPLICATE NO_OP
> id { term: 130 index: 138 } timestamp: 6400743143334211584 op_type: NO_OP 
> noop_request { }
> COMMIT 130.138
> op_type: NO_OP commited_op_id { term: 130 index: 138 }
> 131.139@6400743925559676928 REPLICATE NO_OP
> id { term: 131 index: 139 } timestamp: 6400743925559676928 op_type: NO_OP 
> noop_request { }
> COMMIT 131.139
> op_type: NO_OP commited_op_id { term: 131 index: 139 }
> 132.140@11589864471731939930 REPLICATE NO_OP
> id { term: 132 index: 140 } timestamp: 11589864471731939930 op_type: NO_OP 
> noop_request { }{code}
> Note the drastic jump in timestamp.
> In this specific case, we verified that the replayed WAL wasn't that far 
> behind the recovery WAL, which had the future timestamps, so we could just 
> delete the recovery WAL and bootstrap from the replayed WAL.
> It would have been nice had those bad ops not been written at all, maybe by 
> preventing an election between such mismatched servers in the first place.
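
To put a number on the timestamp jump in the WAL dump above, here is a small 
sketch that decodes the three timestamps. It assumes Kudu's hybrid timestamps 
hold microseconds since the Unix epoch in the bits above a 12-bit logical 
counter; the helper names are made up for this example.

```python
from datetime import datetime, timezone

# Assumption: the low 12 bits are the logical counter and the remaining
# high bits are physical microseconds since the Unix epoch.
LOGICAL_BITS = 12


def physical_micros(hybrid_ts):
    """Drop the logical bits and return the physical part in microseconds."""
    return hybrid_ts >> LOGICAL_BITS


def to_utc(hybrid_ts):
    """Convert a hybrid timestamp to a timezone-aware UTC datetime."""
    return datetime.fromtimestamp(physical_micros(hybrid_ts) / 1_000_000,
                                  tz=timezone.utc)


def jump_seconds(prev_ts, next_ts):
    """Physical time between two hybrid timestamps, in seconds."""
    return (physical_micros(next_ts) - physical_micros(prev_ts)) / 1_000_000


if __name__ == "__main__":
    # Timestamps of ops 130.138, 131.139 and 132.140 from the WAL dump.
    ops = [6400743143334211584, 6400743925559676928, 11589864471731939930]
    for ts in ops:
        print(ts, to_utc(ts).isoformat())
    print("jump before 132.140 (s):", jump_seconds(ops[1], ops[2]))
```

Under that assumption the first two ops fall in July 2019, about 191 seconds 
apart, while op 132.140 decodes to a date in 2059, far beyond the 10 second 
--max_clock_sync_error_usec limit.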



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)
