[jira] [Commented] (CASSANDRA-6465) DES scores fluctuate too much for cache pinning

2014-01-14 Thread Brandon Williams (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-6465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13871089#comment-13871089
 ] 

Brandon Williams commented on CASSANDRA-6465:
-

Can we get some numbers on score fluctuation with the time penalty removed to 
be certain this fixes it?
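
For reference, scores like these can be sampled over JMX from the dynamic
snitch MBean, which is what the attached get-scores.py does in Jython. Below
is a minimal Java sketch of the same idea; it assumes the usual
org.apache.cassandra.db:type=DynamicEndpointSnitch object name with its Scores
attribute, and a hypothetical coordinator at 127.0.0.1:7199.
{noformat}
import java.util.Map;
import javax.management.MBeanServerConnection;
import javax.management.ObjectName;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

public class SnitchScorePoller
{
    public static void main(String[] args) throws Exception
    {
        // Hypothetical coordinator address; 7199 is Cassandra's default JMX port.
        JMXServiceURL url = new JMXServiceURL(
            "service:jmx:rmi:///jndi/rmi://127.0.0.1:7199/jmxrmi");
        try (JMXConnector connector = JMXConnectorFactory.connect(url))
        {
            MBeanServerConnection mbs = connector.getMBeanServerConnection();
            ObjectName snitch =
                new ObjectName("org.apache.cassandra.db:type=DynamicEndpointSnitch");

            // Poll the per-endpoint scores roughly once a second, as in the
            // attached 15-minute sample.
            for (int i = 0; i < 900; i++)
            {
                @SuppressWarnings("unchecked")
                Map<java.net.InetAddress, Double> scores =
                    (Map<java.net.InetAddress, Double>) mbs.getAttribute(snitch, "Scores");
                System.out.println(System.currentTimeMillis() + " " + scores);
                Thread.sleep(1000);
            }
        }
    }
}
{noformat}
Polling that attribute for a few minutes before and after removing the time
penalty would give the comparison asked for here.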

 DES scores fluctuate too much for cache pinning
 ---

 Key: CASSANDRA-6465
 URL: https://issues.apache.org/jira/browse/CASSANDRA-6465
 Project: Cassandra
  Issue Type: Bug
  Components: Core
 Environment: 1.2.11, 2 DC cluster
Reporter: Chris Burroughs
Assignee: Tyler Hobbs
Priority: Minor
  Labels: gossip
 Fix For: 2.0.5

 Attachments: 6465-v1.patch, 99th_latency.png, des-score-graph.png, 
 des.sample.15min.csv, get-scores.py, throughput.png


 To quote the conf:
 {noformat}
 # if set greater than zero and read_repair_chance is < 1.0, this will allow
 # 'pinning' of replicas to hosts in order to increase cache capacity.
 # The badness threshold will control how much worse the pinned host has to be
 # before the dynamic snitch will prefer other replicas over it.  This is
 # expressed as a double which represents a percentage.  Thus, a value of
 # 0.2 means Cassandra would continue to prefer the static snitch values
 # until the pinned host was 20% worse than the fastest.
 dynamic_snitch_badness_threshold: 0.1
 {noformat}
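 To make the threshold concrete, here is a small illustrative sketch (not the
 actual DynamicEndpointSnitch code) of the comparison described in the config
 comment above: with the default of 0.1, the statically preferred host keeps
 winning until its score is more than 10% worse than the best. Lower scores
 are better.
 {noformat}
 // Illustrative only -- a sketch of the documented badness_threshold
 // behaviour, not the actual snitch code.
 public final class BadnessThresholdSketch
 {
     /** Keep the statically preferred ("pinned") replica unless its score is
      *  more than badnessThreshold worse than the best-scoring replica. */
     static boolean keepPinnedReplica(double pinnedScore, double bestScore,
                                      double badnessThreshold)
     {
         return pinnedScore <= bestScore * (1.0 + badnessThreshold);
     }

     public static void main(String[] args)
     {
         // With the default threshold of 0.1: a pinned score within 10% of the
         // best keeps the static ordering; one more than 10% worse does not.
         System.out.println(keepPinnedReplica(0.105, 0.100, 0.1)); // true
         System.out.println(keepPinnedReplica(0.115, 0.100, 0.1)); // false
     }
 }
 {noformat}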
 An assumption of this feature is that scores will vary by less than 
 dynamic_snitch_badness_threshold during normal operations.  Attached is the 
 result of polling a node for the scores of 6 different endpoints at 1 Hz for 
 15 minutes.  The endpoints to sample were chosen with `nodetool getendpoints` 
 for a row that is known to get reads.  The node was acting as a coordinator for 
 a few hundred req/second, so it should have sufficient data to work with.  
 Other traces on a second cluster have produced similar results.
  * The scores vary by far more than I would expect, as shown by the difficulty 
 of seeing anything useful in that graph.
  * The difference between the best and next-best score is usually < 10% 
 (default dynamic_snitch_badness_threshold).
 Neither ClientRequest nor ColumnFamily metrics showed wild changes during the 
 data gathering period.
 Attachments:
  * jython script cobbled together to gather the data (based on work on the 
 mailing list from Maki Watanabe a while back)
  * csv of DES scores for 6 endpoints, polled about once a second
  * Attempt at making a graph





[jira] [Commented] (CASSANDRA-6465) DES scores fluctuate too much for cache pinning

2014-01-10 Thread Tyler Hobbs (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-6465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13868234#comment-13868234
 ] 

Tyler Hobbs commented on CASSANDRA-6465:


[~ianbarfield] thanks for the analysis; you make some excellent observations.

From the discussion in CASSANDRA-3722, it seems like the two motivations for 
the time penalty were these:
# When a node dies, the FD will not mark it down for a while; in the meantime, 
we'd like to stop sending queries to it
# In a multi-DC setup, we would like to penalize the remote DC, but not so much 
that we won't ever use it when local nodes become very slow

I suspect that rapid read protection (CASSANDRA-4705) does a good job of 
mitigating the #1 case until the FD marks the node down.  I'll do some testing 
to confirm this.

I don't feel like the #2 case needs special treatment from the dynamic snitch, 
especially with the badness_threshold in effect.  Latency to the remote DC 
should prevent it from being used under normal circumstances.  If users really 
want to guarantee that, the LOCAL consistency levels are always available.



[jira] [Commented] (CASSANDRA-6465) DES scores fluctuate too much for cache pinning

2014-01-10 Thread Brandon Williams (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-6465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13868251#comment-13868251
 ] 

Brandon Williams commented on CASSANDRA-6465:
-

The best way to test #1 is to run in foreground mode and then suspend (^Z) the 
JVM.



[jira] [Commented] (CASSANDRA-6465) DES scores fluctuate too much for cache pinning

2014-01-06 Thread Ian Barfield (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-6465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13863309#comment-13863309
 ] 

Ian Barfield commented on CASSANDRA-6465:
-

I believe the purpose of the time penalty was to more quickly detect problematic 
nodes. If a node was suddenly suffering severe issues, that wouldn't be 
reflected in its latency metric until the current outstanding queries resolved. 
That might take until the maximum duration timeout, which can be arbitrarily 
long, and in many cases is a lot longer than you'd like. By using timeDelay, 
the snitch can penalize problem nodes almost immediately, since the queries do 
not have to time out first. That said, it has numerous flaws, both conceptually 
and in its implementation.

I was working on this problem a couple of weeks ago, but have been distracted 
since, so I might not be able to give the best summary. Here are a couple of 
issues off the top of my head, though:
- if the time delay values are low, then high jitter throws the scores way off. 
It isn't unreasonable to expect situations where the time delay shifts 
semi-randomly between 0 and 1 ms. This means very little in terms of whether a 
node is a suitable target but can cause a drastic difference in scores if there 
is no slow node to anchor the scores.
- if the node response periods aren't low, say they average around 50 ms, then 
by definition the time delays are highly random, since the score could be 
calculated at any point along that 0 to 50 ms interval.
- it has a lot of complex interactions outside of its original scope of 
detecting bad nodes
- when calculating scores, if there is no lastReceived value for a node (e.g. 
the node has just been added to the cluster), then the logic defaults to using 
the current time (essentially 0, or maximum 'good'). You might instead take the 
view that an unproven, cache-cold node would be a bad selection.
- sensitive to local noise. Each time the score is calculated, the timePenalty 
is calculated fresh. Since there is no concept of persistence or scope, events 
that corrupt the scoring process are extra harmful, e.g. GC, CPU load / thread 
scheduling, and concurrency shenanigans occurring between the lastReceived.get() 
and System.currentTimeMillis() calls.

Some of these issues are somewhat alleviated by the switch to using nanos, and 
I've been tempted to backport that for this class, at least for testing, but 
this logic fails in complex ways. I think at some point I was able to confirm 
some wildly fluctuating values of the subcomponents of the scores (specifically 
timePenalty) by checking the MBeans and working under the assumption that 
timePenalty was likely the only component behind well-rounded scores -- if at 
least one node's timePenalty exceeds the cap, it gets cut off to 
UPDATE_INTERVAL_IN_MS, which as a divisor makes for nicely formed floating 
point numbers.
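
For readers following along, here is a rough sketch of the kind of timePenalty
calculation described in this thread -- pieced together from the comments
above, not copied from the DynamicEndpointSnitch source, so details may differ.
{noformat}
import java.net.InetAddress;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Rough sketch of the timePenalty behaviour described in this thread; the real
// logic lives in DynamicEndpointSnitch and may differ in detail.
class TimePenaltySketch
{
    // 100 ms is the default dynamic_snitch_update_interval_in_ms.
    static final long UPDATE_INTERVAL_IN_MS = 100;

    final Map<InetAddress, Long> lastReceived = new ConcurrentHashMap<>();

    double timePenalty(InetAddress host)
    {
        // No lastReceived entry (e.g. a brand new node) defaults to "now",
        // i.e. zero penalty -- the "maximum good" default criticised above.
        long last = lastReceived.getOrDefault(host, System.currentTimeMillis());
        long sinceLast = System.currentTimeMillis() - last;

        // Cap at the update interval, then normalise by it, so the penalty
        // falls in [0, 1].  A host that answered within the last millisecond
        // or two contributes ~0.00-0.01 purely on jitter, which is enough to
        // reorder replicas whose latency-based score components are nearly
        // identical.
        long capped = Math.min(sinceLast, UPDATE_INTERVAL_IN_MS);
        return (double) capped / UPDATE_INTERVAL_IN_MS;
    }
}
{noformat}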

There are also a lot of issues with the other score components, and some of the 
overall logic, but... some other time. Apologies if I've gotten something quite 
wrong; I've never really used Cassandra.


[jira] [Commented] (CASSANDRA-6465) DES scores fluctuate too much for cache pinning

2014-01-02 Thread Tyler Hobbs (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-6465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13861012#comment-13861012
 ] 

Tyler Hobbs commented on CASSANDRA-6465:


I can reproduce Chris's results, and in my experimentation it looks like almost 
all of the variation is due to the timePenalty, which is basically how long 
it has been since the last entry from an endpoint.  I can see why something 
like the time penalty might be useful for the phi FD, which expects messages on 
a periodic basis, but it doesn't make sense to me to use it in a load balancing 
measure.  My suggestion would be to remove the time penalty.

bq. Are we sure that this mechanism of producing cache pinning is worth the 
complexity here, especially given speculative execution?

Effective cache utilization is extremely important, so I would say it's well 
worth the additional complexity.  I don't think speculative execution should 
affect this greatly, but I might be missing something; care to expand on that?



[jira] [Commented] (CASSANDRA-6465) DES scores fluctuate too much for cache pinning

2013-12-16 Thread Robert Coli (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-6465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13849949#comment-13849949
 ] 

Robert Coli commented on CASSANDRA-6465:


Are we sure that this mechanism of producing cache pinning is worth the 
complexity here, especially given speculative retry? 




--
This message was sent by Atlassian JIRA
(v6.1.4#6159)