[ 
https://issues.apache.org/jira/browse/CASSANDRA-6465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13863309#comment-13863309
 ] 

Ian Barfield commented on CASSANDRA-6465:
-----------------------------------------

I believe the purpose of the time penalty was to detect problematic nodes more 
quickly. If a node was suddenly suffering severe issues, that wouldn't be 
reflected in its latency metric until the current outstanding queries resolved. 
That might take until the maximum duration timeout, which can be arbitrarily 
long, and in many cases is a lot longer than you'd like. By using timeDelay, 
the snitch can penalize problem nodes almost immediately, since the queries do 
not have to time out first. That said, it has numerous flaws, both conceptually 
and in its implementation.

I was working on this problem a couple weeks ago, but have been distracted 
since, so I might not be able to give the best summary. Here are a couple of 
issues off the top of my head, though:
- if the time delay values are low, then high jitter throws the scores way off. 
It isn't unreasonable to expect situations where the time delay shifts 
semi-randomly between 0 and 1 ms. This means very little in terms of whether a 
node is a suitable target but can cause a drastic difference in scores if there 
is no slow node to anchor the scores.
- if the node response periods aren't low, say they average around 50 ms, then 
the time delay values are by definition highly random, since the score could be 
calculated at any point between 0 and 50 ms.
- it has a lot of complex interactions outside of its original scope of 
detecting bad nodes
- when calculating scores, if there is no lastReceived value for a node (e.g. 
the node has just been added to the cluster), then the logic defaults to using 
the current time (essentially a penalty of 0, or maximum 'good'). You might 
instead take the view that an unproven, cache-cold node would be a bad selection.
- sensitive to local noise. Each time the score is calculated, the timePenalty 
is calculated fresh. Since there is no concept of persistence or scope, events 
that corrupt the scoring process are extra harmful, e.g. GC, CPU load / thread 
scheduling, and concurrency shenanigans occurring between the lastReceived.get() 
and System.currentTimeMillis() calls
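
To make the jitter and missing-lastReceived points above concrete, here is a 
rough sketch of the kind of normalization being described. It is a simplified, 
hypothetical model and not the Cassandra source: the class name, the method 
names, the String node keys, and the 100 ms value used for 
UPDATE_INTERVAL_IN_MS are all assumptions made for illustration.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical, simplified model of the time-penalty component (not Cassandra code).
class TimePenaltySketch {
    // Assumed cap; the thread only names the constant, the value here is illustrative.
    static final long UPDATE_INTERVAL_IN_MS = 100;

    /** Milliseconds since the last reply, capped. A node with no record gets 0 (best). */
    static long timePenalty(Long lastReceivedMs, long nowMs) {
        long last = (lastReceivedMs != null) ? lastReceivedMs : nowMs;
        return Math.min(nowMs - last, UPDATE_INTERVAL_IN_MS);
    }

    /** Divides each penalty by the largest penalty seen (at least 1). */
    static Map<String, Double> normalize(Map<String, Long> penalties) {
        long max = 1;
        for (long p : penalties.values()) {
            max = Math.max(max, p);
        }
        Map<String, Double> out = new LinkedHashMap<>();
        for (Map.Entry<String, Long> e : penalties.entrySet()) {
            out.put(e.getKey(), e.getValue() / (double) max);
        }
        return out;
    }

    public static void main(String[] args) {
        // Two healthy nodes whose delays differ by 1 ms of jitter.
        Map<String, Long> healthy = new LinkedHashMap<>();
        healthy.put("a", 0L);
        healthy.put("b", 1L);
        System.out.println(normalize(healthy));  // {a=0.0, b=1.0}

        // The same two nodes plus a slow node whose penalty hits the cap.
        Map<String, Long> anchored = new LinkedHashMap<>(healthy);
        anchored.put("slow", timePenalty(0L, 5_000L));
        System.out.println(normalize(anchored)); // {a=0.0, b=0.01, slow=1.0}
    }
}
```

Without a slow node to anchor the divisor, 1 ms of jitter moves node b across 
the whole range of the component; with one, the same jitter is worth 0.01. 
Passing null for lastReceivedMs yields a penalty of 0, which is the "maximum 
good" default for an unproven node.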

Some of these issues are somewhat alleviated by the switch to using nanos, and 
I've been tempted to backport that for this class, at least for testing, but 
this logic fails in complex ways. I think at some point I was able to confirm 
some wildly fluctuating values of the subcomponents of the scores (specifically 
timePenalty) by checking the mbeans and working under the assumption that 
timePenalty was likely the only component producing well-rounded scores: if you 
have at least one node with a very large timePenalty, it gets cut off at 
UPDATE_INTERVAL_IN_MS, which as a divisor makes for nicely formed floating point 
numbers.
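
As a small illustration of that "well-rounded" observation, the sketch below 
assumes a cap of 100 ms (an assumed value for UPDATE_INTERVAL_IN_MS) and 
whole-millisecond penalties. The class and method names are hypothetical and 
exist only for this example.

```java
// Hypothetical illustration of why a capped divisor yields round-looking values.
class RoundScoreSketch {
    // Assumed cap; the value is illustrative.
    static final long UPDATE_INTERVAL_IN_MS = 100;

    /** Penalty divided by the largest penalty seen (at least 1). */
    static double normalizedPenalty(long penaltyMs, long maxPenaltyMs) {
        return penaltyMs / (double) Math.max(1, maxPenaltyMs);
    }

    /** True if value is, within rounding error, a whole multiple of 1 / cap. */
    static boolean isMultipleOfCapFraction(double value) {
        double scaled = value * UPDATE_INTERVAL_IN_MS;
        return Math.abs(scaled - Math.rint(scaled)) < 1e-9;
    }

    public static void main(String[] args) {
        // Some node hit the cap, so the divisor is exactly 100.
        double capped = normalizedPenalty(37, UPDATE_INTERVAL_IN_MS);
        // No node hit the cap; the divisor is whatever the largest delay was.
        double uncapped = normalizedPenalty(37, 73);
        System.out.println(capped + " " + isMultipleOfCapFraction(capped));
        System.out.println(uncapped + " " + isMultipleOfCapFraction(uncapped));
    }
}
```

When the divisor is the cap, every whole-millisecond penalty lands on a 
two-decimal value, which is what makes this component recognizable in the 
exposed scores.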

There are also a lot of issues with the other score components, and some of the 
overall logic, but... some other time. Apologies if I've gotten something quite 
wrong; I've never really used Cassandra.

> DES scores fluctuate too much for cache pinning
> -----------------------------------------------
>
>                 Key: CASSANDRA-6465
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-6465
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Core
>         Environment: 1.2.11, 2 DC cluster
>            Reporter: Chris Burroughs
>            Assignee: Tyler Hobbs
>            Priority: Minor
>              Labels: gossip
>             Fix For: 2.0.5
>
>         Attachments: des-score-graph.png, des.sample.15min.csv, get-scores.py
>
>
> To quote the conf:
> {noformat}
> # if set greater than zero and read_repair_chance is < 1.0, this will allow
> # 'pinning' of replicas to hosts in order to increase cache capacity.
> # The badness threshold will control how much worse the pinned host has to be
> # before the dynamic snitch will prefer other replicas over it.  This is
> # expressed as a double which represents a percentage.  Thus, a value of
> # 0.2 means Cassandra would continue to prefer the static snitch values
> # until the pinned host was 20% worse than the fastest.
> dynamic_snitch_badness_threshold: 0.1
> {noformat}
> An assumption of this feature is that scores will vary by less than 
> dynamic_snitch_badness_threshold during normal operations.  Attached is the 
> result of polling a node for the scores of 6 different endpoints at 1 Hz for 
> 15 minutes.  The endpoints to sample were chosen with `nodetool getendpoints` 
> for a row that is known to get reads.  The node was acting as a coordinator for 
> a few hundred req/second, so it should have sufficient data to work with.  
> Other traces on a second cluster have produced similar results.
>  * The scores vary by far more than I would expect, as shown by the difficulty 
> of seeing anything useful in that graph.
>  * The difference between the best and next-best score is usually > 10% 
> (default dynamic_snitch_badness_threshold).
> Neither ClientRequest nor ColumnFamily metrics showed wild changes during the 
> data gathering period.
> Attachments:
>  * jython script cobbled together to gather the data (based on work on the 
> mailing list from Maki Watanabe a while back)
>  * csv of DES scores for 6 endpoints, polled about once a second
>  * Attempt at making a graph
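
For reference, the badness-threshold comparison described in the quoted 
configuration comment can be sketched roughly as follows. This is an 
illustration under stated assumptions and not the snitch's actual code: it 
assumes lower scores are better, and the class and method names are 
hypothetical.

```java
// Hypothetical sketch of a badness-threshold check (assumes lower score = better).
class BadnessThresholdSketch {
    /**
     * True when the pinned replica's score exceeds the best score by more than
     * the threshold, expressed as a fraction (0.1 means 10%).
     */
    static boolean shouldAbandonPinned(double pinnedScore, double bestScore, double threshold) {
        return pinnedScore > bestScore * (1.0 + threshold);
    }

    public static void main(String[] args) {
        double threshold = 0.1; // the dynamic_snitch_badness_threshold value from the quoted conf
        System.out.println(shouldAbandonPinned(1.05, 1.0, threshold)); // false
        System.out.println(shouldAbandonPinned(1.20, 1.0, threshold)); // true
    }
}
```

If the best and next-best scores routinely differ by more than 10%, a check of 
this shape would flip often, which is the pinning instability the issue 
describes.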



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)
