[ 
https://issues.apache.org/jira/browse/CASSANDRA-8732?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14325803#comment-14325803
 ] 

Benedict edited comment on CASSANDRA-8732 at 2/18/15 12:02 PM:
---------------------------------------------------------------

Another possibility is to apply a "maximal skew correction": take the minimal 
latency calculated for any message over the past minute and apply it as a 
correction to all messages. So let's say messages M[0..N) arrive over a minute, 
each with its associated S, T and R. We can calculate the maximal skew as 
!maximalskew.png!, which we set to _MaxSkew_, and we then, for instance, simply 
take S+_MaxSkew_+T as the timeout.

This would bound the network (and GC) delay impact to the minimal such delay in 
the measured horizon, which should make it very manageable (and if not there 
are bigger problems).
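
A minimal sketch of the correction described above, under stated assumptions: 
the attached formula is not reproduced here, so this reads S as the send 
timestamp on the sender's clock, R as the receive timestamp on the receiver's 
clock and T as the timeout duration, and takes _MaxSkew_ to be the smallest 
R - S seen in the horizon. The class and method names are hypothetical and are 
not Cassandra code.

```java
import java.util.ArrayDeque;
import java.util.Deque;

/** Hypothetical sketch: tracks the minimal apparent latency (R - S) over a sliding horizon. */
final class MaxSkewTracker {
    private static final class Sample {
        final long receivedAt;       // R, on the receiver's clock
        final long apparentLatency;  // R - S, i.e. clock offset plus network/GC delay

        Sample(long receivedAt, long apparentLatency) {
            this.receivedAt = receivedAt;
            this.apparentLatency = apparentLatency;
        }
    }

    private final long horizon;
    // Apparent latencies increase from head to tail, so the head is the window minimum.
    private final Deque<Sample> window = new ArrayDeque<>();

    MaxSkewTracker(long horizon) {
        this.horizon = horizon;
    }

    /** Record a message sent at S (sender clock) and received at R (receiver clock). */
    void record(long sentS, long receivedR) {
        long apparent = receivedR - sentS;
        // Older samples with a larger-or-equal value can never be the minimum again.
        while (!window.isEmpty() && window.peekLast().apparentLatency >= apparent) {
            window.pollLast();
        }
        window.addLast(new Sample(receivedR, apparent));
        expire(receivedR);
    }

    /** Assumed reading of MaxSkew: the minimal R - S within the horizon ending at 'now'. */
    long maxSkew(long now) {
        expire(now);
        if (window.isEmpty()) {
            throw new IllegalStateException("no samples within the horizon");
        }
        return window.peekFirst().apparentLatency;
    }

    /** S + MaxSkew + T: the expiry instant expressed on the receiver's clock. */
    long expiresAt(long sentS, long timeoutT, long now) {
        return sentS + maxSkew(now) + timeoutT;
    }

    private void expire(long now) {
        while (!window.isEmpty() && now - window.peekFirst().receivedAt > horizon) {
            window.pollFirst();
        }
    }
}
```

Under this reading, every sample is the true clock offset plus some delay, so 
the minimum over the horizon overstates the offset only by the smallest delay 
observed, which is the bound mentioned above.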

The important thing here is we can play with these approaches however we like 
once we start sending the data over the wire, which we should aim to do in 3.0 
IMO.



> Make inter-node timeouts tolerate clock skew and drift
> ------------------------------------------------------
>
>                 Key: CASSANDRA-8732
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-8732
>             Project: Cassandra
>          Issue Type: Improvement
>            Reporter: Ariel Weisberg
>         Attachments: maximalskew.png
>
>
> Right now internode timeouts rely on currentTimeMillis() (and NTP) to make 
> sure that tasks don't expire before they arrive.
> Every receiver needs to deduce the offset between its nanoTime and the remote 
> nanoTime. I don't think currentTimeMillis is a good choice because it is 
> designed to be manipulated by operators and NTP. I would probably be 
> comfortable assuming that nanoTime isn't going to move in significant ways 
> without something that could be classified as operator error happening.
> I suspect the one timing method you can rely on being accurate is nanoTime 
> within a node (on average) and that a node can report on its own scheduling 
> jitter (on average).
> Finding the offset requires knowing what the network latency is in one 
> direction.
> One way to do this would be to periodically send a ping request which 
> generates a series of ping responses at fixed intervals (maybe by UDP?). The 
> responses should be corrected for scheduling jitter since the fixed intervals 
> may not be exactly achieved by the sender. By measuring the time deviation 
> between ping responses and their expected arrival time (based on the 
> interval) and correcting for the remotely reported scheduling jitter, you 
> should be able to measure latency in one direction.
> A weighted moving average (only correct for drift, not readjustment) of these 
> measurements would eventually converge on a close answer and would not be 
> impacted by outlier measurements. It may also make sense to drop the largest 
> N samples to improve accuracy.
> Once you know network latency you can add that to the timestamp of each ping 
> and compare to the local clock and know what the offset is.
> These measurements won't calculate the offset to be too small (timeouts fire 
> early), but could calculate the offset to be too large (timeouts fire late). 
> The conditions where the offset won't be accurate are the conditions 
> where you also want timeouts firing reliably. This and bootstrapping in bad 
> conditions are what I am most uncertain of.
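
The last step of the description (add the one-way latency to each ping's 
timestamp, compare with the local clock, and smooth the result) could be 
sketched roughly as below. This assumes the one-way latency has already been 
estimated by some other means, and it uses an exponentially weighted moving 
average as one possible form of the "weighted moving average"; the class name, 
method names and the alpha parameter are hypothetical.

```java
/** Hypothetical sketch: smooths per-ping estimates of (local nanoTime - remote nanoTime). */
final class ClockOffsetEstimator {
    private final double alpha;   // weight given to the newest sample, in (0, 1]
    private boolean initialized;
    private double offsetNanos;   // smoothed offset estimate

    ClockOffsetEstimator(double alpha) {
        if (!(alpha > 0.0 && alpha <= 1.0)) {
            throw new IllegalArgumentException("alpha must be in (0, 1]");
        }
        this.alpha = alpha;
    }

    /**
     * @param remoteNanos        the sender's nanoTime carried in the ping
     * @param oneWayLatencyNanos the estimated one-way network latency (assumed known)
     * @param localNanos         the receiver's nanoTime when the ping arrived
     * @return the updated smoothed offset
     */
    double addSample(long remoteNanos, long oneWayLatencyNanos, long localNanos) {
        // Offset sample: local clock minus (remote timestamp + time spent on the wire).
        double sample = (double) (localNanos - (remoteNanos + oneWayLatencyNanos));
        if (!initialized) {
            offsetNanos = sample;
            initialized = true;
        } else {
            offsetNanos += alpha * (sample - offsetNanos);
        }
        return offsetNanos;
    }

    /** Translate a remote nanoTime into the local nanoTime domain. */
    long toLocal(long remoteNanos) {
        if (!initialized) {
            throw new IllegalStateException("no samples recorded yet");
        }
        return remoteNanos + Math.round(offsetNanos);
    }
}
```

A small alpha makes the estimate follow slow drift while damping outlier 
samples; it does not by itself handle the bootstrapping concern raised above.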



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
