[jira] [Commented] (CASSANDRA-6465) DES scores fluctuate too much for cache pinning
[ https://issues.apache.org/jira/browse/CASSANDRA-6465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13871089#comment-13871089 ]

Brandon Williams commented on CASSANDRA-6465:
----------------------------------------------

Can we get some numbers on score fluctuation with the time penalty removed, to be certain this fixes it?

> DES scores fluctuate too much for cache pinning
> -----------------------------------------------
>
>                 Key: CASSANDRA-6465
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-6465
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Core
>         Environment: 1.2.11, 2 DC cluster
>            Reporter: Chris Burroughs
>            Assignee: Tyler Hobbs
>            Priority: Minor
>              Labels: gossip
>             Fix For: 2.0.5
>
>         Attachments: 6465-v1.patch, 99th_latency.png, des-score-graph.png, des.sample.15min.csv, get-scores.py, throughput.png
>
>
> To quote the conf:
> {noformat}
> # if set greater than zero and read_repair_chance is < 1.0, this will allow
> # 'pinning' of replicas to hosts in order to increase cache capacity.
> # The badness threshold will control how much worse the pinned host has to be
> # before the dynamic snitch will prefer other replicas over it. This is
> # expressed as a double which represents a percentage. Thus, a value of
> # 0.2 means Cassandra would continue to prefer the static snitch values
> # until the pinned host was 20% worse than the fastest.
> dynamic_snitch_badness_threshold: 0.1
> {noformat}
> An assumption of this feature is that scores will vary by less than
> dynamic_snitch_badness_threshold during normal operations. Attached is the
> result of polling a node for the scores of 6 different endpoints at 1 Hz for
> 15 minutes. The endpoints to sample were chosen with `nodetool getendpoints`
> for a row that is known to get reads. The node was acting as a coordinator for
> a few hundred req/second, so it should have sufficient data to work with.
> Other traces on a second cluster have produced similar results.
> * The scores vary by far more than I would expect, as shown by the difficulty
> of seeing anything useful in that graph.
> * The difference between the best and next-best score is usually > 10%
> (the default dynamic_snitch_badness_threshold).
> Neither ClientRequest nor ColumnFamily metrics showed wild changes during the
> data gathering period.
> Attachments:
> * jython script cobbled together to gather the data (based on work from Maki
> Watanabe on the mailing list a while back)
> * csv of DES scores for 6 endpoints, polled about once a second
> * attempt at making a graph
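For anyone reproducing the sampling described above: the attached get-scores.py is a jython script, but the same polling can be done with a small standalone Java client. The sketch below assumes the node exposes JMX on the default port 7199 and registers the snitch under the org.apache.cassandra.db:type=DynamicEndpointSnitch mbean name with a Scores attribute; verify both against your version before relying on it.

{code:java}
import java.util.Map;
import javax.management.MBeanServerConnection;
import javax.management.ObjectName;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

// Polls the dynamic snitch scores roughly once per second, mirroring the
// 1 Hz sampling methodology described in the issue description.
public class ScorePoller
{
    public static void main(String[] args) throws Exception
    {
        String host = args.length > 0 ? args[0] : "localhost";
        JMXServiceURL url = new JMXServiceURL(
            "service:jmx:rmi:///jndi/rmi://" + host + ":7199/jmxrmi");
        try (JMXConnector connector = JMXConnectorFactory.connect(url))
        {
            MBeanServerConnection mbs = connector.getMBeanServerConnection();
            ObjectName snitch = new ObjectName(
                "org.apache.cassandra.db:type=DynamicEndpointSnitch");
            while (true)
            {
                // The mbean's Scores attribute is a Map of endpoint -> score.
                Map<?, ?> scores = (Map<?, ?>) mbs.getAttribute(snitch, "Scores");
                System.out.println(System.currentTimeMillis() + " " + scores);
                Thread.sleep(1000); // ~1 Hz, matching the attached csv
            }
        }
    }
}
{code}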
[jira] [Commented] (CASSANDRA-6465) DES scores fluctuate too much for cache pinning
[ https://issues.apache.org/jira/browse/CASSANDRA-6465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13868251#comment-13868251 ]

Brandon Williams commented on CASSANDRA-6465:
----------------------------------------------

The best way to test #1 is to run in foreground mode and then suspend (^Z) the JVM.
[jira] [Commented] (CASSANDRA-6465) DES scores fluctuate too much for cache pinning
[ https://issues.apache.org/jira/browse/CASSANDRA-6465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13868234#comment-13868234 ]

Tyler Hobbs commented on CASSANDRA-6465:
-----------------------------------------

[~ianbarfield] thanks for the analysis; you make some excellent observations.

From the discussion in CASSANDRA-3722, it seems like the two motivations for the time penalty were these:
# When a node dies, the FD will not mark it down for a while; in the meantime, we'd like to stop sending queries to it.
# In a multi-DC setup, we would like to penalize the remote DC, but not so much that we won't ever use it when local nodes become very slow.

I suspect that rapid read protection (CASSANDRA-4705) does a good job of mitigating the #1 case until the FD marks the node down. I'll do some testing to confirm this.

I don't feel like the #2 case needs special treatment from the dynamic snitch, especially with the badness_threshold in effect. Latency to the remote DC should prevent it from being used under normal circumstances, and if users really want to guarantee that, the LOCAL consistency levels are always available.
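For concreteness, here is a minimal sketch of the badness_threshold gating quoted in the issue description, under the assumption that lower scores are better; the class and method names are invented for illustration and this is not the actual DynamicEndpointSnitch source.

{code:java}
import java.net.InetAddress;
import java.util.List;
import java.util.Map;

// Hypothetical illustration of dynamic_snitch_badness_threshold semantics,
// paraphrasing the cassandra.yaml comment quoted in this ticket.
public class BadnessThresholdSketch
{
    // dynamic_snitch_badness_threshold from cassandra.yaml (0.1 = 10%)
    private static final double BADNESS_THRESHOLD = 0.1;

    /**
     * Stay "pinned" to the static (subsnitch) replica ordering unless the
     * pinned replica's score is more than BADNESS_THRESHOLD worse than the
     * best dynamic score. Lower scores are better.
     */
    public static boolean shouldReorder(List<InetAddress> staticOrder,
                                        Map<InetAddress, Double> scores)
    {
        InetAddress pinned = staticOrder.get(0);
        double pinnedScore = scores.getOrDefault(pinned, 0.0);
        double bestScore = scores.values().stream()
                                 .mapToDouble(Double::doubleValue)
                                 .min().orElse(0.0);
        // With the default threshold, reorder only once the pinned host's
        // score exceeds the best score by more than 10%.
        return pinnedScore > bestScore * (1.0 + BADNESS_THRESHOLD);
    }
}
{code}

This also shows why the fluctuation in the attached traces matters: if the best and next-best scores routinely differ by more than 10%, the gate above trips constantly and pinning never takes hold.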
[jira] [Commented] (CASSANDRA-6465) DES scores fluctuate too much for cache pinning
[ https://issues.apache.org/jira/browse/CASSANDRA-6465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13863309#comment-13863309 ]

Ian Barfield commented on CASSANDRA-6465:
------------------------------------------

I believe the purpose of the time penalty was to detect problematic nodes more quickly. If a node were suddenly suffering severe issues, that wouldn't be reflected in its latency metric until the current outstanding queries resolved. That might take until the maximum duration timeout, which can be arbitrarily long, and in many cases is a lot longer than you'd like. By using timeDelay, the snitch can penalize problem nodes almost immediately, since the queries do not have to time out first.

That said, it has numerous flaws, both conceptually and in its implementation. I was working on this problem a couple of weeks ago but have been distracted since, so I might not be able to give the best summary. Here are a couple of issues off the top of my head, though:
- If the time delay values are low, then high jitter throws the scores way off. It isn't unreasonable to expect situations where the time delay shifts semi-randomly between 0 and 1 ms. This means very little in terms of whether a node is a suitable target, but it can cause a drastic difference in scores if there is no slow node to anchor the scores.
- If the node response periods aren't low (say they average around 50 ms), then the penalty is by definition highly random, since the score could be calculated at any point between 0 and 50 ms.
- It has a lot of complex interactions outside of its original scope of detecting bad nodes.
- When calculating scores, if there is no lastReceived value for a node (e.g. the node has just been added to the cluster), then the logic defaults to using the current time (essentially 0, or maximally 'good'). You might instead take the view that an unproven, cache-cold node would be a bad selection.
- It is sensitive to local noise. Each time the score is calculated, the timePenalty is calculated fresh. Since there is no concept of persistence or scope, events that corrupt the scoring process are extra harmful: e.g. GC, CPU load / thread scheduling, and concurrency shenanigans occurring between the lastReceived.get() and System.currentTimeMillis() calls.

Some of these issues are somewhat alleviated by the switch to using nanos, and I've been tempted to backport that for this class, at least for testing, but this logic fails in complex ways. I think at some point I was able to confirm some wildly fluctuating values of the subcomponents of the scores (specifically timePenalty) by checking the mbeans and working under the assumption that timePenalty was likely the only component producing well-rounded scores: if you have at least one node with a very large timePenalty, it gets cut off to UPDATE_INTERVAL_IN_MS, which as a divisor makes for nicely formed floating-point numbers.

There are also a lot of issues with the other score components, and some of the overall logic, but... some other time. Apologies if I've gotten something quite wrong; I've never really used Cassandra.
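To make the flaws above easier to follow, here is a reconstruction of the time-penalty computation being described; it paraphrases the 1.2-era behavior rather than quoting DynamicEndpointSnitch verbatim, and the constant value is only the documented default for dynamic_snitch_update_interval_in_ms.

{code:java}
import java.net.InetAddress;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Reconstruction of the time-penalty behavior described in this comment:
// the penalty is recomputed from wall-clock time on every score calculation,
// which is what makes it sensitive to jitter, GC pauses, and scheduling noise.
public class TimePenaltySketch
{
    // Illustrative value; the real constant lives in DynamicEndpointSnitch.
    private static final long UPDATE_INTERVAL_IN_MS = 100;

    // Timestamp of the last response received from each endpoint.
    private final Map<InetAddress, Long> lastReceived = new ConcurrentHashMap<>();

    public void receiveTiming(InetAddress host)
    {
        lastReceived.put(host, System.currentTimeMillis());
    }

    /**
     * Penalty in [0, 1]: 0 if we heard from the host just now, 1 if we have
     * not heard from it for a full update interval or longer.
     */
    public double timePenalty(InetAddress host)
    {
        Long last = lastReceived.get(host);
        // Flaw noted above: an unknown (e.g. freshly added, cache-cold) host
        // defaults to "now", i.e. the maximally good penalty of 0.
        long sinceLast = (last == null) ? 0 : System.currentTimeMillis() - last;
        // Flaw noted above: recomputed fresh on every call with no smoothing,
        // so GC pauses and scheduling noise between the map read and the
        // currentTimeMillis() call feed straight into the score; values are
        // clamped at the update interval.
        sinceLast = Math.min(sinceLast, UPDATE_INTERVAL_IN_MS);
        return (double) sinceLast / UPDATE_INTERVAL_IN_MS;
    }
}
{code}

Under this sketch, response gaps jittering between 0 and 1 ms barely move the ratio, while ~50 ms gaps make it an effectively random draw, consistent with the fluctuation in the attached traces.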
[jira] [Commented] (CASSANDRA-6465) DES scores fluctuate too much for cache pinning
[ https://issues.apache.org/jira/browse/CASSANDRA-6465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13861012#comment-13861012 ]

Tyler Hobbs commented on CASSANDRA-6465:
-----------------------------------------

I can reproduce Chris's results, and in my experimentation it looks like almost all of the variation is due to the "timePenalty", which is basically how long it has been since the last latency entry was recorded for an endpoint. I can see why something like the time penalty might be useful for the phi FD, which expects messages on a periodic basis, but it doesn't make sense to me to use it in a load-balancing measure. My suggestion would be to remove the time penalty.

bq. Are we sure that this mechanism of producing cache pinning is worth the complexity here, especially given speculative execution?

Effective cache utilization is extremely important, so I would say it's well worth the additional complexity. I don't think speculative execution should affect this greatly, but I might be missing something; care to expand on that?
[jira] [Commented] (CASSANDRA-6465) DES scores fluctuate too much for cache pinning
[ https://issues.apache.org/jira/browse/CASSANDRA-6465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13849949#comment-13849949 ]

Robert Coli commented on CASSANDRA-6465:
-----------------------------------------

Are we sure that this mechanism of producing cache pinning is worth the complexity here, especially given speculative retry?