[ https://issues.apache.org/jira/browse/CASSANDRA-11738?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15278065#comment-15278065 ]

Jonathan Ellis commented on CASSANDRA-11738:
--------------------------------------------

The attractive thing about using iowait was that it's a latency metric, so 
adding it into the dsnitch measurements sort of makes sense.  But only sort 
of, because if dsnitch already has a direct latency number then iowait is 
getting double-counted.

It seems to me that the goal for "severity" ought to be deriving a synthetic 
latency number, so that when we route traffic away from a node and thus don't 
have any real latency measurements available, we still have a reasonable guess 
at what latency WOULD be and don't route traffic back to it as soon as the old 
numbers age out.

Is there a way we can turn CPU load information into a pseudo-latency number?  
If not, maybe we can scale the latency estimate by CPU utilization.
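
A minimal sketch of what that scaling could look like, assuming we read recent 
CPU load from the JVM's OperatingSystemMXBean; the class, method names, and the 
(1 + cpu) weighting are purely illustrative and not anything in the current code:

{code:java}
import java.lang.management.ManagementFactory;
import com.sun.management.OperatingSystemMXBean;

// Illustration only: fold CPU utilization into a pseudo-latency number.
// None of these names exist in DynamicEndpointSnitch today.
public class CpuSeveritySketch
{
    // Cast works on HotSpot JVMs; hedged here, not guaranteed everywhere.
    private static final OperatingSystemMXBean OS =
        (OperatingSystemMXBean) ManagementFactory.getOperatingSystemMXBean();

    /**
     * Scale a measured (or synthetic) latency by CPU utilization, so a node
     * near CPU saturation looks "slower" than its raw numbers suggest.
     */
    public static double scaleByCpu(double latencyMillis)
    {
        double cpu = OS.getSystemCpuLoad();   // 0.0 - 1.0, negative if unavailable
        if (cpu < 0)
            return latencyMillis;             // no CPU info; leave the score alone
        // Hypothetical weighting: latency estimate grows with CPU load.
        return latencyMillis * (1.0 + cpu);
    }
}
{code}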

Other improvements include:

# Use either actual latency measurements or synthetic ones ("severity"); adding 
both together isn't really valid.  We could either stick the synthetic numbers 
directly into the windowing and let them age out like the others, or add a 
cutoff where we switch to synthetic if we don't have enough real ones.
# We can probably improve our latency guess for io-bound workloads by 
multiplying the iowait number by sstables-per-read.  (A rough sketch of both 
ideas follows below.)
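
A rough sketch of the cutoff approach in 1 combined with the iowait * 
sstables-per-read guess in 2.  The sample-count threshold, the NodeStats holder, 
and all method names here are made up for illustration, not the actual 
DynamicEndpointSnitch API:

{code:java}
// Illustration only: choose real vs. synthetic latency per node, and build the
// synthetic number from iowait * sstables-per-read for io-bound workloads.
public class ScoreSketch
{
    // Hypothetical cutoff: below this many fresh samples, fall back to synthetic.
    private static final int MIN_REAL_SAMPLES = 10;

    public static double scoreFor(NodeStats stats)
    {
        if (stats.realSampleCount() >= MIN_REAL_SAMPLES)
            return stats.medianRealLatencyMillis();   // enough real measurements

        // Not enough real samples (e.g. traffic was routed away): use a synthetic
        // latency guess instead of adding severity on top of stale numbers.
        return stats.iowaitMillis() * stats.sstablesPerRead();
    }

    // Hypothetical holder for the per-node inputs referenced above.
    public interface NodeStats
    {
        int realSampleCount();
        double medianRealLatencyMillis();
        double iowaitMillis();
        double sstablesPerRead();
    }
}
{code}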

> Re-think the use of Severity in the DynamicEndpointSnitch calculation
> ---------------------------------------------------------------------
>
>                 Key: CASSANDRA-11738
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-11738
>             Project: Cassandra
>          Issue Type: Bug
>            Reporter: Jeremiah Jordan
>             Fix For: 3.x
>
>
> CASSANDRA-11737 was opened to allow completely disabling the use of severity 
> in the DynamicEndpointSnitch calculation, but that is a pretty big hammer.  
> There is probably something we can do to make better use of the score.
> The issue seems to be that severity is given equal weight with latency in the 
> current code, and that severity is only based on disk IO.  If you have a 
> node that is CPU bound on something (say, catching up on LCS compactions 
> because of a bootstrap/repair/replace), the IO wait can be low, but the 
> latency to the node is high.
> Some ideas I had are:
> 1. Allowing a yaml parameter to tune how much impact the severity score has 
> in the calculation.
> 2. Taking CPU load into account as well as IO wait (this would probably help 
> in the cases where I have seen things go sideways).
> 3. Moving the -D from CASSANDRA-11737 to a yaml-level setting.
> 4. Going back to just relying on latency and getting rid of severity 
> altogether.  Now that we have rapid read protection, maybe just using latency 
> is enough, as it can help where the predictive nature of IO wait would have 
> been useful.
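
As a rough illustration of idea 1 above (not actual Cassandra code), the 
combined score could weight the severity term by a configurable factor instead 
of giving it equal weight with latency; "severityWeight" is a hypothetical 
setting, not an existing cassandra.yaml option:

{code:java}
// Illustration of idea 1 only; the weight and its source are hypothetical.
public class WeightedScoreSketch
{
    /** 0.0 disables severity entirely; 1.0 matches today's equal weighting. */
    private final double severityWeight;

    public WeightedScoreSketch(double severityWeight)
    {
        this.severityWeight = severityWeight;
    }

    public double combinedScore(double normalizedLatency, double severity)
    {
        return normalizedLatency + severityWeight * severity;
    }
}
{code}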



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
