[ https://issues.apache.org/jira/browse/CASSANDRA-8732?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14306171#comment-14306171 ]
sankalp kohli commented on CASSANDRA-8732:
------------------------------------------

I like [~benedict]'s suggestion. We don't use cross_node_timeout for the same reason: it is too dangerous when there is clock skew.

> Make inter-node timeouts tolerate clock skew and drift
> ------------------------------------------------------
>
>                 Key: CASSANDRA-8732
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-8732
>             Project: Cassandra
>          Issue Type: Improvement
>            Reporter: Ariel Weisberg
>
> Right now inter-node timeouts rely on currentTimeMillis() (and NTP) to make sure that tasks don't expire before they arrive.
>
> Every receiver needs to deduce the offset between its nanoTime and the remote nanoTime. I don't think currentTimeMillis is a good choice because it is designed to be manipulated by operators and NTP. I would probably be comfortable assuming that nanoTime isn't going to move in significant ways without something happening that could be classified as operator error.
>
> I suspect the one timing method you can rely on being accurate is nanoTime within a node (on average), and that a node can report on its own scheduling jitter (on average).
>
> Finding the offset requires knowing what the network latency is in one direction.
>
> One way to do this would be to periodically send a ping request which generates a series of ping responses at fixed intervals (maybe by UDP?). The responses should be corrected for scheduling jitter, since the fixed intervals may not be exactly achieved by the sender. By measuring the time deviation between ping responses and their expected arrival time (based on the interval), and correcting for the remotely reported scheduling jitter, you should be able to measure latency in one direction.
>
> A weighted moving average (which only corrects for drift, not readjustment) of these measurements would eventually converge on a close answer and would not be impacted by outlier measurements. It may also make sense to drop the largest N samples to improve accuracy.
>
> Once you know the network latency, you can add it to the timestamp of each ping, compare to the local clock, and know what the offset is.
>
> These measurements won't calculate the offset to be too small (timeouts fire early), but could calculate the offset to be too large (timeouts fire late). The conditions where the offset won't be accurate are the conditions where you also want timeouts firing reliably. This, and bootstrapping in bad conditions, is what I am most uncertain of.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
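The scheme in the description (estimate one-way latency from fixed-interval pings, smooth it with a weighted moving average, then derive the nanoTime offset from a timestamped ping) can be sketched as follows. This is a minimal illustration under the ticket's assumptions, not Cassandra code: the class name, method names, and the idea that latency samples arrive pre-corrected for scheduling jitter are all hypothetical, and a real implementation would also drop the largest N samples before averaging, as the description suggests.

```java
/**
 * Hypothetical sketch of the offset estimation described above.
 * Not a Cassandra API; all names are illustrative.
 */
public final class ClockOffsetEstimator
{
    // Small EWMA weight: tracks slow clock drift while resisting outlier samples.
    private static final double ALPHA = 0.05;

    private double oneWayLatencyNanos = Double.NaN; // converging one-way latency estimate
    private double offsetNanos = Double.NaN;        // local nanoTime minus remote nanoTime

    /**
     * Fold in one latency measurement, assumed to be derived elsewhere from the
     * arrival-time deviation of fixed-interval ping responses, already corrected
     * for the sender's self-reported scheduling jitter.
     */
    public void onLatencySample(double latencySampleNanos)
    {
        oneWayLatencyNanos = Double.isNaN(oneWayLatencyNanos)
                           ? latencySampleNanos
                           : ALPHA * latencySampleNanos + (1 - ALPHA) * oneWayLatencyNanos;
    }

    /**
     * Once latency is known, a single timestamped ping gives the offset:
     * localArrival ~= remoteSend + offset + latency.
     */
    public void onTimestampedPing(long remoteSendNanos, long localArrivalNanos)
    {
        if (Double.isNaN(oneWayLatencyNanos))
            return; // latency not yet estimated; cannot separate offset from latency
        offsetNanos = localArrivalNanos - remoteSendNanos - oneWayLatencyNanos;
    }

    /** Translate a remote nanoTime deadline into the local nanoTime domain. */
    public long toLocalNanos(long remoteDeadlineNanos)
    {
        return (long) (remoteDeadlineNanos + offsetNanos);
    }
}
```

For example, with a converged one-way latency of 1 ms, a ping stamped 5 ms (remote clock) that arrives at 16 ms (local clock) implies a 10 ms offset, so a remote deadline of 20 ms maps to 30 ms locally; a timeout checked against the local clock then fires neither early nor late, within the latency estimate's error.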