[ 
https://issues.apache.org/jira/browse/CASSANDRA-14252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16379131#comment-16379131
 ] 

Dikang Gu commented on CASSANDRA-14252:
---------------------------------------

[~szhou], yes, it's the warm-up phase. We have to know the distance/latency 
differences between replicas; otherwise we have no way to fall 
back to remote replicas. One idea to limit unnecessary requests to remote 
replicas is to fall back only when the local node is really bad. Something like this:

if (subsnitchScore > 0.5 &&
    subsnitchScore > (sortedScoreIterator.next() * (1.0 + dynamicBadnessThreshold)))
{
    sortByProximityWithScore(address, addresses);
    return;
}

Of course, the 0.5 threshold could be made tunable.
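
To make the idea concrete, here is a minimal standalone sketch of the guarded check. It is not the actual DynamicEndpointSnitch code: the class name, the method shouldFallBack, the parameter minBadScore (standing in for the 0.5 above) and the use of plain strings for replicas are all assumptions made for illustration. Missing scores default to zero, as this ticket proposes. The sketch reads the sorted score before testing the guard so that the iterator advances once per replica even when the guard fails.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Iterator;
import java.util.List;
import java.util.Map;

// Illustrative sketch only; names and types here are hypothetical.
final class BadnessGuardSketch
{
    /**
     * Returns true if the proximity (subsnitch) ordering should be replaced
     * by a score-based ordering.
     *
     * @param proximityOrder   replicas in subsnitch order, closest first
     * @param scores           known latency scores; a replica may be absent
     * @param badnessThreshold corresponds to dynamicBadnessThreshold, e.g. 0.1
     * @param minBadScore      hypothetical tunable for the 0.5 guard
     */
    static boolean shouldFallBack(List<String> proximityOrder,
                                  Map<String, Double> scores,
                                  double badnessThreshold,
                                  double minBadScore)
    {
        // Scores in subsnitch order; an unpopulated score is treated as zero.
        List<Double> subsnitchOrderedScores = new ArrayList<>();
        for (String replica : proximityOrder)
            subsnitchOrderedScores.add(scores.getOrDefault(replica, 0.0));

        // The same scores, best (lowest) first.
        List<Double> sortedScores = new ArrayList<>(subsnitchOrderedScores);
        Collections.sort(sortedScores);

        Iterator<Double> sortedScoreIterator = sortedScores.iterator();
        for (double subsnitchScore : subsnitchOrderedScores)
        {
            // Advance unconditionally so the positional comparison stays aligned.
            double sortedScore = sortedScoreIterator.next();
            if (subsnitchScore > minBadScore
                && subsnitchScore > sortedScore * (1.0 + badnessThreshold))
                return true;
        }
        return false;
    }
}
```

With proximity order [ip1, ip2], badnessThreshold 0.1 and only ip1 scored, a score of 1.0 for ip1 triggers the fallback, while a score of 0.3 does not when minBadScore is 0.5. With minBadScore set to 0.0 the same 0.3 would trigger it, which is the unnecessary remote traffic during warm-up that the guard is meant to avoid.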


> Use zero as default score in DynamicEndpointSnitch
> --------------------------------------------------
>
>                 Key: CASSANDRA-14252
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-14252
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Coordination
>            Reporter: Dikang Gu
>            Assignee: Dikang Gu
>            Priority: Major
>             Fix For: 4.0, 3.0.17, 3.11.3
>
>
> The problem I want to solve is that, in our deployment, one slow but 
> alive data node can slow down the whole cluster and even cause our requests 
> to time out. 
> We are using DynamicEndpointSnitch with badness_threshold 0.1. I expect 
> DynamicEndpointSnitch to switch to sortByProximityWithScore if the local data 
> node's latency is too high.
> I added some debug logging and found that, in a lot of cases, the score 
> for the remote data node was not populated, so the fallback to 
> sortByProximityWithScore never happened. That's why a single slow data node 
> can cause huge problems for the whole cluster.
> In this jira, I'd like to use zero as the default score, so that we get a 
> chance to try a remote data node if the local one is slow. 
> I tested it in our test cluster, and it significantly improved client latency 
> in the single slow data node case.  
> I flagged this as a Bug because it has caused problems for our use cases 
> multiple times.
>  ==== logs ===
> _2018-02-21_23:08:57.54145 WARN 23:08:57 [RPC-Thread:978]: 
> sortByProximityWithBadness: after sorting by proximity, addresses order 
> change to [ip1, ip2], with scores [1.0]_
>  _2018-02-21_23:08:57.54319 WARN 23:08:57 [RPC-Thread:967]: 
> sortByProximityWithBadness: after sorting by proximity, addresses order 
> change to [ip1, ip2], with scores [0.0]_
>  _2018-02-21_23:08:57.55111 WARN 23:08:57 [RPC-Thread:453]: 
> sortByProximityWithBadness: after sorting by proximity, addresses order 
> change to [ip1, ip2], with scores [1.0]_
>  _2018-02-21_23:08:57.55687 WARN 23:08:57 [RPC-Thread:753]: 
> sortByProximityWithBadness: after sorting by proximity, addresses order 
> change to [ip1, ip2], with scores [1.0]_


