Hi,

Recently, on one of our 3-node clusters, the system clock on one server
instance jumped ahead by 7.5 hours. The cluster is set up on OAK-1.0.22,
so it has the feature that stalls background reads (OAK-3388 [0]) if
the repository seems to be ahead in time. But it doesn't have the lease
check feature in place (OAK-2739 [1] and friends).

The issue:
Once the rogue instance committed with a future timestamp, the other
instances reacted by pausing background reads for 7.5 hours. This led
to (expected) CommitFailedException entries in the logs and, more
importantly (and not visible in the logs), to the instances being out
of sync.

Generally speaking, this is an operational issue and clocks should
remain in sync. Still, I think it's fair to expect that once the
instance is restarted after synchronizing its clock, things should work
out. That's not the case currently, as the instances would still want
to wait 7.5 hours.

There's a bit of respite on trunk (due to [1] and friends): a
sequence of events like {{t1 -> update lease}}, {{t2=t1+7.5hrs ->
commit}} would lead to a shutdown, as a big jump like 7.5 hours would
be beyond the lease end time. BUT, if there's a lease update before a
commit after a clock jump, then even on trunk there's no safeguard.
If we change the lease update logic to also shut down when it hasn't
been able to update the lease for a long time, then an instance with a
forward-jumping clock would self-destruct.
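
To make the proposal concrete, here is a rough sketch of the kind of
check I have in mind. This is NOT existing Oak code: the class
LeaseUpdateGuard, its method names and the 60-second threshold are all
hypothetical and only illustrate the idea of refusing a lease update
when too much local time has passed since the last successful one.

```java
import java.time.Duration;

/**
 * Hypothetical helper (not part of Oak): tracks the local time of the
 * last successful lease update and rejects an update that arrives
 * after an implausibly long gap, e.g. after a forward clock jump.
 */
final class LeaseUpdateGuard {

    private final long maxGapMillis;
    private long lastUpdateMillis;

    LeaseUpdateGuard(long initialMillis, Duration maxGap) {
        this.lastUpdateMillis = initialMillis;
        this.maxGapMillis = maxGap.toMillis();
    }

    /**
     * Returns true if the lease may be updated at nowMillis. Returns
     * false if the gap since the last update exceeds the allowed
     * maximum, in which case the caller should shut the instance down
     * instead of writing a lease (or a commit) with that timestamp.
     */
    boolean tryUpdate(long nowMillis) {
        long gap = nowMillis - lastUpdateMillis;
        if (gap > maxGapMillis) {
            return false; // too long since last update: self-destruct
        }
        lastUpdateMillis = Math.max(lastUpdateMillis, nowMillis);
        return true;
    }
}
```

With a guard like this, the sequence {{t1 -> update lease}},
{{t2=t1+7.5hrs -> update lease}} would be rejected at t2, because the
gap is far larger than the allowed maximum. The threshold would
presumably be tied to the lease duration; 60 seconds is just a
placeholder for the example.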

BTW, the approach mentioned above protects only against a clock that
jumps ahead abruptly. We'd still have a similar issue with a clock that
slowly skews forward. But even then, I think it's worth solving this
particular case.

Note, this doesn't really add any inconsistency in the repository -
it just keeps instances with a correct clock from operating well.

Thanks,
Vikas

[0]: https://issues.apache.org/jira/browse/OAK-3388
[1]: https://issues.apache.org/jira/browse/OAK-2739
