[jira] [Commented] (CASSANDRA-8732) Make inter-node timeouts tolerate clock skew and drift

2015-02-18 Thread Benedict (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-8732?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14325803#comment-14325803
 ] 

Benedict commented on CASSANDRA-8732:
-

Another possibility is to apply a maximal skew correction: take the minimal 
latency calculated for any message over the past minute, and apply this 
correction to all messages. So let's say we have messages M[0..N) arriving over 
a minute, each with its associated S, T and R. We can calculate the maximal 
skew as !maximalskew.png!, which we set to _MaxSkew_, and then, for instance, 
simply take S+_MaxSkew_+T as the timeout.

This would bound the impact of network (and GC) delay to the minimal such delay 
in the measured horizon, which should make it very manageable (and if it isn't, 
there are bigger problems).
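
A hand-wavy sketch of the sliding-window bookkeeping this implies, assuming (as 
one plausible reading of the attached formula) that _MaxSkew_ is the minimum 
observed R - S over the horizon, with S, T and R as defined in the 2015-02-04 
comment further down (source wallclock, timeout delta at send, recipient 
wallclock at receipt). Class and method names are illustrative, not Cassandra 
APIs:

{code:java}
// Illustrative sliding-window tracker, not a Cassandra API. Units: millis.
// Assumes MaxSkew = min(R - S) over the last minute.
import java.util.ArrayDeque;
import java.util.Deque;

public class MaxSkewTracker
{
    private static final long HORIZON_MILLIS = 60_000;

    // pairs of (receipt wallclock R, observed R - S) for the last minute
    private final Deque<long[]> samples = new ArrayDeque<>();

    /** Record one message's apparent transit time: clock skew plus one-way latency. */
    public synchronized void onMessage(long s, long r)
    {
        samples.addLast(new long[]{ r, r - s });
        while (!samples.isEmpty() && samples.peekFirst()[0] < r - HORIZON_MILLIS)
            samples.removeFirst();
    }

    /** The most optimistic transit seen in the horizon: bounds skew + minimal latency. */
    public synchronized long maxSkewMillis()
    {
        long min = Long.MAX_VALUE;
        for (long[] sample : samples)
            min = Math.min(min, sample[1]);
        return min == Long.MAX_VALUE ? 0 : min;
    }

    /** Expiry on the local clock: S + MaxSkew + T, as suggested above. */
    public synchronized long expiryMillis(long s, long t)
    {
        return s + maxSkewMillis() + t;
    }
}
{code}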

 Make inter-node timeouts tolerate clock skew and drift
 --

 Key: CASSANDRA-8732
 URL: https://issues.apache.org/jira/browse/CASSANDRA-8732
 Project: Cassandra
  Issue Type: Improvement
Reporter: Ariel Weisberg
 Attachments: maximalskew.png


 Right now internode timeouts rely on currentTimeMillis() (and NTP) to make 
 sure that tasks don't expire before they arrive.
 Every receiver needs to deduce the offset between its nanoTime and the remote 
 nanoTime. I don't think currentTimeMillis is a good choice because it is 
 designed to be manipulated by operators and NTP. I would probably be 
 comfortable assuming that nanoTime isn't going to move in significant ways 
 without something that could be classified as operator error happening.
 I suspect the one timing method you can rely on being accurate is nanoTime 
 within a node (on average), and that a node can report on its own scheduling 
 jitter (on average).
 Finding the offset requires knowing what the network latency is in one 
 direction.
 One way to do this would be to periodically send a ping request which 
 generates a series of ping responses at fixed intervals (maybe by UDP?). The 
 responses should be corrected for scheduling jitter, since the fixed intervals 
 may not be exactly achieved by the sender. By measuring the time deviation 
 between ping responses and their expected arrival time (based on the 
 interval), and correcting for the remotely reported scheduling jitter, you 
 should be able to measure latency in one direction.
 A weighted moving average (which only corrects for drift, not readjustment) of 
 these measurements would eventually converge on a close answer and would not 
 be impacted by outlier measurements. It may also make sense to drop the 
 largest N samples to improve accuracy.
 Once you know the network latency you can add it to the timestamp of each 
 ping, compare to the local clock, and know what the offset is.
 These measurements won't calculate the offset to be too small (timeouts fire 
 early), but could calculate the offset to be too large (timeouts fire late). 
 The conditions where the offset won't be accurate are the conditions where 
 you also want timeouts firing reliably. This, and bootstrapping in bad 
 conditions, is what I am most uncertain of.
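
A minimal sketch of the receiver-side bookkeeping this describes, assuming the 
one-way latency is estimated separately (e.g. via the interval-deviation 
technique above) and fed in; class and method names are illustrative, not 
actual Cassandra APIs:

{code:java}
// Illustrative only: keeps a window of per-ping samples, drops the largest few
// (likely GC / network spikes), smooths the rest with an EWMA, and derives the
// clock offset from a separately supplied one-way latency estimate.
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class ClockOffsetEstimator
{
    private static final int WINDOW = 16;
    private static final int DROP_LARGEST = 2; // the largest samples are likely GC / network spikes
    private static final double ALPHA = 0.1;   // EWMA: tracks slow drift, resists outliers

    // Each sample is (local receipt nanoTime - sender's stamped nanoTime),
    // i.e. one-way latency minus the remote-vs-local clock offset.
    private final List<Long> window = new ArrayList<>();
    private double smoothedApparentLatency = Double.NaN;

    public void onPingResponse(long senderNanoTime, long localReceiptNanoTime)
    {
        window.add(localReceiptNanoTime - senderNanoTime);
        if (window.size() >= WINDOW)
            flush();
    }

    private void flush()
    {
        Collections.sort(window);
        for (long sample : window.subList(0, WINDOW - DROP_LARGEST))
            smoothedApparentLatency = Double.isNaN(smoothedApparentLatency)
                                    ? sample
                                    : ALPHA * sample + (1 - ALPHA) * smoothedApparentLatency;
        window.clear();
    }

    /**
     * @param oneWayLatencyNanos the separately measured one-way latency
     * @return estimated offset such that remote nanoTime ~= local nanoTime + offset
     */
    public double offsetNanos(long oneWayLatencyNanos)
    {
        return Double.isNaN(smoothedApparentLatency)
             ? 0
             : oneWayLatencyNanos - smoothedApparentLatency;
    }
}
{code}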





[jira] [Commented] (CASSANDRA-8732) Make inter-node timeouts tolerate clock skew and drift

2015-02-04 Thread Sylvain Lebresne (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-8732?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14305564#comment-14305564
 ] 

Sylvain Lebresne commented on CASSANDRA-8732:
-

bq. It is a common error and it is a nice-to-have for the database to handle it 
as well as possible.

Fair enough, but dropping messages sooner or later than you should doesn't feel 
like the biggest problem if you've made that error. Maybe adding a system that 
tries to detect clock skew/drift and warn the operator if it thinks it has 
detected some would be sensible, and once we have that, maybe using it to 
improve message-dropping timeouts will be trivial. But let's say I'm not 
convinced that respecting timeouts is important enough in itself to justify the 
complexity.



[jira] [Commented] (CASSANDRA-8732) Make inter-node timeouts tolerate clock skew and drift

2015-02-04 Thread Benedict (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-8732?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14305581#comment-14305581
 ] 

Benedict commented on CASSANDRA-8732:
-

bq. but let's say I'm not convinced that respecting timeouts is important enough 
in itself to justify the complexity.

Is it really that complex though? What I'm suggesting is pretty simple. I won't 
fight for it beyond this final question, since as already stated it isn't super 
pressing. But it couldn't be easier: serialize the delta and the wall clock, 
and let the other end pick the more conservative of the two to use.



[jira] [Commented] (CASSANDRA-8732) Make inter-node timeouts tolerate clock skew and drift

2015-02-04 Thread Ariel Weisberg (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-8732?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14305527#comment-14305527
 ] 

Ariel Weisberg commented on CASSANDRA-8732:
---

[~benedict] asked me to log this. I don't see it as a high priority either, 
because there is an element of operator error in not having clocks synced. It 
is a common error and it is a nice-to-have for the database to handle it as 
well as possible.

I think the issue is that when one clock is far enough off, timeouts can fire 
sooner/later than they should. According to my reading of the code, messages 
are dropped by receiving nodes when they are proxied intra-cluster, so that is 
why it is sensitive to drift and skew.

An append-only workload, or a workload that only updates from a single writer, 
might not notice the clock skew? Just throwing that out there. It's also 
possible that only one node has a problematic clock.




[jira] [Commented] (CASSANDRA-8732) Make inter-node timeouts tolerate clock skew and drift

2015-02-04 Thread Sylvain Lebresne (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-8732?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14305595#comment-14305595
 ] 

Sylvain Lebresne commented on CASSANDRA-8732:
-

bq. What I'm suggesting is pretty simple.

Pardon me for not having anticipated your future suggestion. I thought it was 
obvious I was referring to Ariel's periodic ping through UDP and whatnot. 
Simply sending a tiny bit more info so the other end can minimize its potential 
error is fine complexity-wise.



[jira] [Commented] (CASSANDRA-8732) Make inter-node timeouts tolerate clock skew and drift

2015-02-04 Thread Benedict (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-8732?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14305547#comment-14305547
 ] 

Benedict commented on CASSANDRA-8732:
-

The simplest approach I was thinking of to bound this is to send the time 
remaining, as well as the expected wall clock expiry. These can both be used on 
the remote node to do something sensible, e.g. pick the one closest to half the 
timeout interval, so that we're conservative in both directions (i.e. never 
keeping the message too long, nor expiring it too aggressively).

My biggest concern here is nodes being seen as down because clock skew 
temporarily got large enough to have messages dropped much too aggressively 
for the response to be returned.

I also agree it's not super duper pressing; I just wanted to log the ticket for 
discussion. But it's also pretty easy to introduce: just send a delta along 
with the wall clock, and have some simple machinery on the other end to select 
which one to use.
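
A minimal sketch of the two pieces of information this would serialize 
(hypothetical names, not the actual wire format); a receiver-side selection 
sketch follows the 2015-02-04 comment further down that defines S, T, R and D:

{code:java}
// Hypothetical message stamp, not Cassandra's actual serialization.
public final class MessageTimeoutStamp
{
    public final long wallClockExpiryMillis;   // S + T: sender's currentTimeMillis() plus timeout
    public final long remainingTimeoutMillis;  // T: the timeout delta still left at send time

    public MessageTimeoutStamp(long senderNowMillis, long timeoutMillis)
    {
        this.wallClockExpiryMillis = senderNowMillis + timeoutMillis;
        this.remainingTimeoutMillis = timeoutMillis;
    }
}
{code}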



[jira] [Commented] (CASSANDRA-8732) Make inter-node timeouts tolerate clock skew and drift

2015-02-04 Thread Benedict (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-8732?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14305602#comment-14305602
 ] 

Benedict commented on CASSANDRA-8732:
-

I made that suggestion in a comment 10m before yours, but it is possible we had 
a JIRA race, so I apologise if it came across negatively. I just wanted to be 
sure you had seen it, and were responding to that as well (as it seemed you 
were).



[jira] [Commented] (CASSANDRA-8732) Make inter-node timeouts tolerate clock skew and drift

2015-02-04 Thread Benedict (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-8732?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14305569#comment-14305569
 ] 

Benedict commented on CASSANDRA-8732:
-

An interesting question, since this is quite important to us, is actually how 
well even a properly managed cluster manages to keep its nodes in sync. Most 
users will not deploy a local GPS-based ntpd; most will probably just use ntpd 
from an internet source. I recall that at my previous place of work (admittedly 
this was several years ago) NTP on Windows made really dramatic skew 
corrections and could be very out of sync. Also, being out by only a few 
hundred milliseconds is quite achievable with Linux ntpd, depending on the 
prevailing conditions, and while we default to 10s timeouts, 1s timeouts are 
probably not uncommon.



[jira] [Commented] (CASSANDRA-8732) Make inter-node timeouts tolerate clock skew and drift

2015-02-04 Thread Sylvain Lebresne (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-8732?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14305179#comment-14305179
 ] 

Sylvain Lebresne commented on CASSANDRA-8732:
-

Are we talking about the timeout after which we drop messages? Because those 
don't really have to be terribly precise, so I'm not entirely sure it's worth 
adding a whole bunch of complexity for that. Unless of course there is actual 
evidence that it's a problem in practice.



[jira] [Commented] (CASSANDRA-8732) Make inter-node timeouts tolerate clock skew and drift

2015-02-04 Thread Benedict (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-8732?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14306306#comment-14306306
 ] 

Benedict commented on CASSANDRA-8732:
-

[~aweisberg] basically, yes. Since mostly we're dealing with application-induced 
delay (receiving or sending server being overloaded, in GC, etc.), this seems a 
pretty reasonable tradeoff. Its only goal is to avoid wasting work, and in the 
event of a major network blip we're probably not starved for local resources. 
Of course a GC pause slowing down receipt would be perceived the same, and is 
likely exactly the kind of scenario we want to shed timed-out messages for. I'm 
sure there's a simple further tweak to help guard against this.

Let's assume on the recipient we have:

source node wallclock: S
message timeout delta at send time: T
recipient node wallclock at receipt: R
recipient node default timeout: D

Then let's say we calculate S+T and min(R+T, S+2T), and take whichever is 
closest to R+(D/2).

This helps guard against significant network delay or GC pauses being 
undercounted, especially on queries that were close to timeout anyway (e.g. due 
to slow processing on the source node), by capping our forgiveness of clock 
skew to twice the message's remaining timeout when sent.

This is just a quick hand-wavy suggestion; it's quite possible there's another, 
better approach along the same lines. It retains the simplicity, which is the 
important thing. We could perhaps make the cap S+xT, and have x be a 
configurable parameter for power users.
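
As a concrete illustration of that rule (purely a sketch; the numbers below are 
made up):

{code:java}
// S = source wallclock, T = timeout delta at send time, R = recipient wallclock
// at receipt, D = recipient default timeout (all millis). Illustrative only.
public final class ExpirySelection
{
    /** Pick whichever of S+T and min(R+T, S+2T) is closest to R + D/2. */
    public static long selectExpiry(long s, long t, long r, long d)
    {
        long bySenderClock = s + t;                        // trusts the sender's wall clock
        long byDeltaCapped = Math.min(r + t, s + 2 * t);   // delta-based, forgiveness capped at S+2T
        long target = r + d / 2;
        return Math.abs(bySenderClock - target) <= Math.abs(byDeltaCapped - target)
             ? bySenderClock
             : byDeltaCapped;
    }

    public static void main(String[] args)
    {
        // Small skew: S=1000, T=500, R=1050 (sender ~50ms behind), D=10000
        // -> candidates 1500 and min(1550, 2000) = 1550; 1550 wins, so the message
        //    gets its full 500ms from receipt despite the skew.
        System.out.println(selectExpiry(1_000, 500, 1_050, 10_000));

        // Large skew: R=2000 (sender ~1s behind)
        // -> candidates 1500 and min(2500, 2000) = 2000; the S+2T cap limits how
        //    much clock skew we are willing to forgive.
        System.out.println(selectExpiry(1_000, 500, 2_000, 10_000));
    }
}
{code}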



[jira] [Commented] (CASSANDRA-8732) Make inter-node timeouts tolerate clock skew and drift

2015-02-04 Thread Ariel Weisberg (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-8732?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14306234#comment-14306234
 ] 

Ariel Weisberg commented on CASSANDRA-8732:
---

Benedict, that sounds like we would be weighting against accurately timing out 
due to network delay, in favor of timing out more accurately when there is 
clock skew?

It's a good tradeoff if it means not failing closed the way it does currently. 
It preserves the value of the timeout better, given that people aren't using it 
today because of clock skew.



[jira] [Commented] (CASSANDRA-8732) Make inter-node timeouts tolerate clock skew and drift

2015-02-04 Thread sankalp kohli (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-8732?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14306171#comment-14306171
 ] 

sankalp kohli commented on CASSANDRA-8732:
--

I like [~benedict]'s suggestion. We don't use cross_node_timeout for this same 
reason: it is too dangerous when there is clock skew.





[jira] [Commented] (CASSANDRA-8732) Make inter-node timeouts tolerate clock skew and drift

2015-02-03 Thread Jon Haddad (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-8732?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14304517#comment-14304517
 ] 

Jon Haddad commented on CASSANDRA-8732:
---

How much drift are you attempting to correct? 



[jira] [Commented] (CASSANDRA-8732) Make inter-node timeouts tolerate clock skew and drift

2015-02-03 Thread Ariel Weisberg (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-8732?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14304188#comment-14304188
 ] 

Ariel Weisberg commented on CASSANDRA-8732:
---

Another thought: why not have the timeout sent to the remote node contain the 
time remaining? It still has to be in addition to some form of wall or relative 
clock, since timeouts due to messaging latency need to be taken into account.



[jira] [Commented] (CASSANDRA-8732) Make inter-node timeouts tolerate clock skew and drift

2015-02-03 Thread Robert Stupp (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-8732?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14304007#comment-14304007
 ] 

Robert Stupp commented on CASSANDRA-8732:
-

Sounds reasonable. It could be possible to take network topology into account - 
e.g. directly "ping" nodes in the same rack, "ping" one node per rack in the 
same DC, "ping" one node in each other DC - ending up with a map containing the 
approximate latencies across the whole cluster.
Timeouts could then be specified using relative time instead of absolute time.

For example: one-way latency for DC "Europe" from "US East" is 25ms, one-way 
latency for "US East" from "Europe" is 20ms. A request has a timeout of 100ms.
The sending node would send the request to "Europe" with 100ms - 25ms = 75ms. 
The receiving node subtracts the latency it knows for the other direction 
(75ms - 20ms), resulting in an effective timeout of 55ms.
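
The same arithmetic as a toy sketch (hypothetical names; the per-direction 
latencies would come from the latency map above):

{code:java}
// Toy illustration of deducting per-direction one-way latency from a relative timeout.
public final class RelativeTimeoutBudget
{
    /** Sender: deduct the one-way latency it has measured towards the target DC. */
    public static long budgetAtSend(long timeoutMillis, long oneWayToTargetMillis)
    {
        return timeoutMillis - oneWayToTargetMillis;      // 100ms - 25ms = 75ms
    }

    /** Receiver: deduct the one-way latency it has measured for the return direction. */
    public static long budgetAtReceipt(long receivedBudgetMillis, long oneWayBackMillis)
    {
        return receivedBudgetMillis - oneWayBackMillis;   // 75ms - 20ms = 55ms
    }

    public static void main(String[] args)
    {
        System.out.println(budgetAtReceipt(budgetAtSend(100, 25), 20)); // prints 55
    }
}
{code}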

AFAIK NTP provides an accuracy of approx. 10ms on WAN connections and a few 
hundred µs on LAN connections. We should be able to calculate network latencies 
with that accuracy.

Removing the top-N outliers is a good idea, given GC pauses etc.
Naturally we should also only ping those DCs/racks/nodes that are known to be 
reachable (gossip).

But I'm not sure how to handle flapping network connections - i.e. TCP/gossip 
remains stable, but UDP packets get lost.
Also, network maintenance (e.g. temporarily pulling network cables) could cause 
wrong results.
BGP route changes can drastically increase or decrease latency, too.
I don't want to say that we need to detect all these things - but maybe add 
some "smoothing" to the overall measurement and completely ignore large 
timeouts or measurement packets arriving very late.
