Hi,

Recently, on one of our 3-node clusters, the system clock on one server instance jumped ahead by 7.5 hours. The cluster is set up on Oak 1.0.22, so it has the feature that stalls background reads (OAK-3388 [0]) if the repository seems to be ahead in time. But it doesn't have the lease check feature in place (OAK-2739 [1] and friends).

The issue:
Once the rogue instance committed with a future timestamp, the other instances reacted by pausing background reads for 7.5 hours. This led to (expected) CommitFailedException entries in the logs and, more importantly (and not visible in the logs), to the instances being out of sync.

Generally speaking, this is an operational issue and clocks should remain in sync. But I think it's fair to say that once the instance is restarted after synchronizing its clock, things should work out. That's not the case currently, as the other instances would still want to wait 7.5 hours.

There's a bit of respite on trunk (due to [1] and friends): a sequence of events like {{t1 -> update lease}}, {{t2=t1+7.5hrs -> commit}} would lead to a shutdown, as a big jump like 7.5 hours would be beyond the lease end time. BUT, if there's a lease update before a commit after a clock jump, then even on trunk there's no safeguard. If we change the lease update logic to shut down as well when the lease could not be updated for a long time, then an instance with a forward-jumping clock would self-destruct.

BTW, the approach mentioned above only protects against a clock that jumps ahead abruptly. We'd still have a similar issue with a clock that slowly skews forward. Even so, I think it's worth solving this particular case. Note that this doesn't really add any inconsistency to the repository - it just makes the instances with a correct clock not operate well.

Thanks,
Vikas

[0]: https://issues.apache.org/jira/browse/OAK-3388
[1]: https://issues.apache.org/jira/browse/OAK-2739
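
PS: To make the proposed lease-update check a bit more concrete, here is a rough sketch in Java. It is not Oak code: the class `LeaseGuardSketch`, its methods `renew` and `mayCommit`, and the 60-second lease duration are all made up for illustration. The only idea it shows is that a lease update which finds the local clock already past the lease end refuses to renew and signals a shutdown, instead of silently extending the lease.

```java
import java.util.concurrent.TimeUnit;

/**
 * Hypothetical sketch, not Oak code: a lease that refuses to be renewed
 * once the local clock is already past the current lease end.
 */
final class LeaseGuardSketch {

    /** Outcome of a lease renewal attempt. */
    enum Result { RENEWED, SHUTDOWN }

    private final long leaseDurationMillis; // assumed value, set by the caller
    private long leaseEndMillis;
    private boolean stopped;

    LeaseGuardSketch(long nowMillis, long leaseDurationMillis) {
        this.leaseDurationMillis = leaseDurationMillis;
        this.leaseEndMillis = nowMillis + leaseDurationMillis;
    }

    /** Would be called periodically by the (assumed) lease update thread. */
    synchronized Result renew(long nowMillis) {
        if (stopped || nowMillis > leaseEndMillis) {
            // The lease lapsed before it could be renewed, e.g. because the
            // clock jumped forward. Do not extend it; ask for a shutdown.
            stopped = true;
            return Result.SHUTDOWN;
        }
        leaseEndMillis = nowMillis + leaseDurationMillis;
        return Result.RENEWED;
    }

    /** The check on the commit path: no commits beyond the lease end. */
    synchronized boolean mayCommit(long nowMillis) {
        if (stopped || nowMillis > leaseEndMillis) {
            stopped = true;
            return false;
        }
        return true;
    }

    public static void main(String[] args) {
        long lease = TimeUnit.SECONDS.toMillis(60);  // assumed lease duration
        long jump = TimeUnit.MINUTES.toMillis(450);  // 7.5 hours
        LeaseGuardSketch guard = new LeaseGuardSketch(0L, lease);

        System.out.println(guard.renew(10_000L));            // RENEWED
        System.out.println(guard.renew(10_000L + jump));     // SHUTDOWN
        System.out.println(guard.mayCommit(10_000L + jump)); // false
    }
}
```

The current time is passed in as a parameter only so that the clock jump can be simulated. As with the proposal itself, a clock that drifts forward slowly would keep renewing in time and would not be caught by this check.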