[ https://issues.apache.org/jira/browse/CASSANDRA-14459?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16649549#comment-16649549 ]

Joseph Lynch commented on CASSANDRA-14459:
------------------------------------------

[~aweisberg] Alright, I think I've accomplished all of the bullet points you listed. Along the way I made the DES 100% pluggable by class name, as well as significantly more testable, because I know various folks want to experiment with further improvements, and having that be as easy as implementing a new subclass and dropping in a jar will make experimentation much easier going forward. I've also factored the "measurement" parts of the DES (default Histogram and a new EMA option) out into subclasses, separate from the "ranking" parts (ranking, probing, etc.) that stay in the abstract base class.
||trunk||
|[Pull Request for Review|https://github.com/apache/cassandra/pull/283]|
|[!https://circleci.com/gh/jolynch/cassandra/tree/CASSANDRA-14459.png?circle-token=1102a59698d04899ec971dd36e925928f7b521f5!|https://circleci.com/gh/jolynch/cassandra/tree/CASSANDRA-14459]|
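
Roughly, the measurement/ranking split looks like the sketch below. The names here are hypothetical and simplified for illustration only, not the exact classes in the PR:
{code:java}
import java.net.InetSocketAddress;
import java.util.Map;

// Illustrative sketch only: hypothetical, simplified names, not the classes in
// the pull request. It just shows the shape of the "measurement" vs "ranking"
// split, with the measurement half pluggable by class name.
abstract class LatencyMeasurement
{
    // Measurement half (subclass responsibility): fold in a latency sample and
    // summarise samples into a per-host score for the ranking half to sort by.
    abstract void sample(InetSocketAddress host, double latencyMillis);
    abstract Map<InetSocketAddress, Double> scores();

    // Pluggable by class name: an operator-supplied implementation can be
    // loaded reflectively from configuration and a jar dropped on the classpath.
    static LatencyMeasurement forName(String className) throws Exception
    {
        return (LatencyMeasurement) Class.forName(className)
                                         .getDeclaredConstructor()
                                         .newInstance();
    }
}
{code}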

There are a couple of points that I'm unsure about:
 * I was able to send both small and large probe messages, but picking the maximum of the two results is hard. I think that should be OK: if a stage has a serious problem, we'll essentially average the "good" measurement and the "bad" measurement and most likely still end up with a pretty bad number.
 * Right now I've removed the reset interval, but I actually think it's a good idea to keep it and just redefine what it does: specifically, it would be the interval at which we get to determine the probe list, versus the probe interval, which actually sends a probe. I also think it's nice because we could keep the old Histogram implementation as the default and users could _opt in_ to the new Histogram or the EMA approach as they want. This would also remove the (imo) somewhat awkward use of the ping timeout to rate limit how often we get to generate probe lists.
 * The current algorithm only sends probes to nodes that were requested recently, but at the default interval of 1s that is still potentially a lot of messages, which I know you think is OK but e.g. Jason above didn't like (e.g. for a 300 * 3 dc, 256 vnode cluster operating at LOCAL_ONE CL, we'd essentially be sending out a probe per second at all times, since there are so many ranked replicas, including remote dcs). We could mitigate this somewhat by changing the local filtering step in {{ReplicaPlans::forRead}} to happen before we ask the DES to rank rather than after (this seems like a good idea regardless; see the sketch after this list).
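
For the last bullet, here is a rough sketch of what I mean by filtering before ranking. The types below are simple placeholders ({{String}} endpoints, a {{Ranker}} interface), not the real {{Replica}} / {{ReplicaPlans}} / snitch types in trunk:
{code:java}
import java.util.List;
import java.util.function.Predicate;
import java.util.stream.Collectors;

// Hypothetical sketch of the "filter before ranking" idea; placeholder types only.
final class FilterBeforeRank
{
    interface Ranker
    {
        List<String> sortedByProximity(List<String> endpoints);
    }

    // Restrict to the endpoints the read is actually allowed to use first
    // (e.g. alive + local DC for a LOCAL_ONE read), then rank only those.
    // The DES then never has to score -- or consider probing -- replicas that
    // would have been filtered out anyway.
    static List<String> forRead(List<String> allReplicas,
                                Predicate<String> eligibleForThisRead,
                                Ranker ranker)
    {
        List<String> candidates = allReplicas.stream()
                                             .filter(eligibleForThisRead)
                                             .collect(Collectors.toList());
        return ranker.sortedByProximity(candidates);
    }
}
{code}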

While I was making it much more testable I ran a quick microbenchmark to show the CPU and memory benefits of the EMA approach. The tradeoff, of course, is that the EMA reacts less quickly when the sample size is small, but much faster when there is a significant outlier (e.g. given a single 10s timing, the EMA will de-prefer that host almost immediately, whereas the histogram ... won't).
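To make the outlier point concrete, here's a back-of-the-envelope example; the smoothing factor is a made-up illustrative value, not what the patch uses:
{code:java}
// Illustrative numbers only; alpha is a made-up smoothing factor.
double alpha = 0.75;
double ema = 1.0;                          // host has been steady at ~1ms
ema = alpha * 10_000 + (1 - alpha) * ema;  // one 10s (10,000ms) sample lands
// ema is now ~7500ms, so the host is heavily de-preferred on the very next
// ranking, whereas a histogram median over hundreds of ~1ms samples barely moves.
{code}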
{noformat}
Benchmark                                                                    Mode     Cnt       Score       Error   Units
SnitchMetricsBench.emaSnitchThreadedUpdate                                   sample  190609       1.573 ±     0.016   ms/op
SnitchMetricsBench.emaSnitchThreadedUpdate:emaSnitchThreadedUpdate·p0.00     sample               0.283               ms/op
SnitchMetricsBench.emaSnitchThreadedUpdate:emaSnitchThreadedUpdate·p0.50     sample               1.288               ms/op
SnitchMetricsBench.emaSnitchThreadedUpdate:emaSnitchThreadedUpdate·p0.90     sample               1.456               ms/op
SnitchMetricsBench.emaSnitchThreadedUpdate:emaSnitchThreadedUpdate·p0.95     sample               1.597               ms/op
SnitchMetricsBench.emaSnitchThreadedUpdate:emaSnitchThreadedUpdate·p0.99     sample              15.286               ms/op
SnitchMetricsBench.emaSnitchThreadedUpdate:emaSnitchThreadedUpdate·p0.999    sample              21.594               ms/op
SnitchMetricsBench.emaSnitchThreadedUpdate:emaSnitchThreadedUpdate·p0.9999   sample              31.319               ms/op
SnitchMetricsBench.emaSnitchThreadedUpdate:emaSnitchThreadedUpdate·p1.00     sample              49.086               ms/op
SnitchMetricsBench.emaSnitchThreadedUpdate:·gc.alloc.rate                    sample       3       4.158 ±    10.217  MB/sec
SnitchMetricsBench.emaSnitchThreadedUpdate:·gc.alloc.rate.norm               sample       3     722.283 ±    55.616    B/op
SnitchMetricsBench.emaSnitchThreadedUpdate:·gc.churn.PS_Eden_Space           sample       3       3.949 ±    62.392  MB/sec
SnitchMetricsBench.emaSnitchThreadedUpdate:·gc.churn.PS_Eden_Space.norm      sample       3     745.116 ± 11782.387    B/op
SnitchMetricsBench.emaSnitchThreadedUpdate:·gc.churn.PS_Survivor_Space       sample       3       0.108 ±     3.405  MB/sec
SnitchMetricsBench.emaSnitchThreadedUpdate:·gc.churn.PS_Survivor_Space.norm  sample       3      19.861 ±   627.604    B/op
SnitchMetricsBench.emaSnitchThreadedUpdate:·gc.count                         sample       3       2.000              counts
SnitchMetricsBench.emaSnitchThreadedUpdate:·gc.time                          sample       3       6.000                  ms
SnitchMetricsBench.emaSnitchUpdate                                           sample  101392       0.296 ±     0.001   ms/op
SnitchMetricsBench.emaSnitchUpdate:emaSnitchUpdate·p0.00                     sample               0.263               ms/op
SnitchMetricsBench.emaSnitchUpdate:emaSnitchUpdate·p0.50                     sample               0.297               ms/op
SnitchMetricsBench.emaSnitchUpdate:emaSnitchUpdate·p0.90                     sample               0.326               ms/op
SnitchMetricsBench.emaSnitchUpdate:emaSnitchUpdate·p0.95                     sample               0.351               ms/op
SnitchMetricsBench.emaSnitchUpdate:emaSnitchUpdate·p0.99                     sample               0.417               ms/op
SnitchMetricsBench.emaSnitchUpdate:emaSnitchUpdate·p0.999                    sample               0.523               ms/op
SnitchMetricsBench.emaSnitchUpdate:emaSnitchUpdate·p0.9999                   sample               0.635               ms/op
SnitchMetricsBench.emaSnitchUpdate:emaSnitchUpdate·p1.00                     sample               5.300               ms/op
SnitchMetricsBench.emaSnitchUpdate:·gc.alloc.rate                            sample       3       2.120 ±     2.248  MB/sec
SnitchMetricsBench.emaSnitchUpdate:·gc.alloc.rate.norm                       sample       3     690.546 ±     5.915    B/op
SnitchMetricsBench.emaSnitchUpdate:·gc.churn.PS_Eden_Space                   sample       3       1.984 ±    62.688  MB/sec
SnitchMetricsBench.emaSnitchUpdate:·gc.churn.PS_Eden_Space.norm              sample       3     606.680 ± 19170.514    B/op
SnitchMetricsBench.emaSnitchUpdate:·gc.count                                 sample       3       1.000              counts
SnitchMetricsBench.emaSnitchUpdate:·gc.time                                  sample       3       4.000                  ms
SnitchMetricsBench.histogramSnitchThreaded                                   sample   31134       9.627 ±     0.097   ms/op
SnitchMetricsBench.histogramSnitchThreaded:histogramSnitchThreaded·p0.00     sample               4.030               ms/op
SnitchMetricsBench.histogramSnitchThreaded:histogramSnitchThreaded·p0.50     sample               7.545               ms/op
SnitchMetricsBench.histogramSnitchThreaded:histogramSnitchThreaded·p0.90     sample              17.662               ms/op
SnitchMetricsBench.histogramSnitchThreaded:histogramSnitchThreaded·p0.95     sample              23.003               ms/op
SnitchMetricsBench.histogramSnitchThreaded:histogramSnitchThreaded·p0.99     sample              29.119               ms/op
SnitchMetricsBench.histogramSnitchThreaded:histogramSnitchThreaded·p0.999    sample              39.697               ms/op
SnitchMetricsBench.histogramSnitchThreaded:histogramSnitchThreaded·p0.9999   sample              53.506               ms/op
SnitchMetricsBench.histogramSnitchThreaded:histogramSnitchThreaded·p1.00     sample              55.575               ms/op
SnitchMetricsBench.histogramSnitchThreaded:·gc.alloc.rate                    sample       3     668.096 ±    61.306  MB/sec
SnitchMetricsBench.histogramSnitchThreaded:·gc.alloc.rate.norm               sample       3  708597.630 ± 32119.534    B/op
SnitchMetricsBench.histogramSnitchThreaded:·gc.churn.PS_Eden_Space           sample       3     673.780 ±   134.265  MB/sec
SnitchMetricsBench.histogramSnitchThreaded:·gc.churn.PS_Eden_Space.norm      sample       3  714612.589 ± 88018.271    B/op
SnitchMetricsBench.histogramSnitchThreaded:·gc.churn.PS_Survivor_Space       sample       3       0.154 ±     0.009  MB/sec
SnitchMetricsBench.histogramSnitchThreaded:·gc.churn.PS_Survivor_Space.norm  sample       3     163.122 ±    19.745    B/op
SnitchMetricsBench.histogramSnitchThreaded:·gc.count                         sample       3     273.000              counts
SnitchMetricsBench.histogramSnitchThreaded:·gc.time                          sample       3     303.000                  ms
SnitchMetricsBench.histogramSnitchUpdate                                     sample   23007       1.303 ±     0.003   ms/op
SnitchMetricsBench.histogramSnitchUpdate:histogramSnitchUpdate·p0.00         sample               1.204               ms/op
SnitchMetricsBench.histogramSnitchUpdate:histogramSnitchUpdate·p0.50         sample               1.268               ms/op
SnitchMetricsBench.histogramSnitchUpdate:histogramSnitchUpdate·p0.90         sample               1.368               ms/op
SnitchMetricsBench.histogramSnitchUpdate:histogramSnitchUpdate·p0.95         sample               1.466               ms/op
SnitchMetricsBench.histogramSnitchUpdate:histogramSnitchUpdate·p0.99         sample               2.121               ms/op
SnitchMetricsBench.histogramSnitchUpdate:histogramSnitchUpdate·p0.999        sample               2.830               ms/op
SnitchMetricsBench.histogramSnitchUpdate:histogramSnitchUpdate·p0.9999       sample               3.926               ms/op
SnitchMetricsBench.histogramSnitchUpdate:histogramSnitchUpdate·p1.00         sample               5.218               ms/op
SnitchMetricsBench.histogramSnitchUpdate:·gc.alloc.rate                      sample       3     236.317 ±     3.365  MB/sec
SnitchMetricsBench.histogramSnitchUpdate:·gc.alloc.rate.norm                 sample       3  339325.737 ±    47.087    B/op
SnitchMetricsBench.histogramSnitchUpdate:·gc.churn.PS_Eden_Space             sample       3     236.374 ±    15.878  MB/sec
SnitchMetricsBench.histogramSnitchUpdate:·gc.churn.PS_Eden_Space.norm        sample       3  339407.281 ± 21325.135    B/op
SnitchMetricsBench.histogramSnitchUpdate:·gc.churn.PS_Survivor_Space         sample       3       0.142 ±     0.462  MB/sec
SnitchMetricsBench.histogramSnitchUpdate:·gc.churn.PS_Survivor_Space.norm    sample       3     203.675 ±   663.495    B/op
SnitchMetricsBench.histogramSnitchUpdate:·gc.count                           sample       3     223.000              counts
SnitchMetricsBench.histogramSnitchUpdate:·gc.time                            sample       3     232.000                  ms
{noformat}
As you can see, the traditional histogram snitch generates ~670 MB/s of garbage in the threaded update case versus ~4 MB/s for the EMA, and the histogram snitch was slower at 9.6ms average versus 1.6ms for the EMA. So while the ~6x speed improvement is nice, the reduced garbage is going to be the really nice part.

> DynamicEndpointSnitch should never prefer latent nodes
> ------------------------------------------------------
>
>                 Key: CASSANDRA-14459
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-14459
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: Coordination
>            Reporter: Joseph Lynch
>            Assignee: Joseph Lynch
>            Priority: Minor
>              Labels: 4.0-feature-freeze-review-requested, 
> pull-request-available
>             Fix For: 4.x
>
>          Time Spent: 10m
>  Remaining Estimate: 0h
>
> The DynamicEndpointSnitch has two unfortunate behaviors that allow it to 
> provide latent hosts as replicas:
>  # Loses all latency information when Cassandra restarts
>  # Clears latency information entirely every ten minutes (by default), 
> allowing global queries to be routed to _other datacenters_ (and local 
> queries cross racks/azs)
> This means that the first few queries after restart/reset could be quite slow 
> compared to average latencies. I propose we solve this by resetting to the 
> minimum observed latency instead of completely clearing the samples and 
> extending the {{isLatencyForSnitch}} idea to a three state variable instead 
> of two, in particular {{YES}}, {{NO}}, {{MAYBE}}. This extension allows 
> {{EchoMessages}} and {{PingMessages}} to send {{MAYBE}} indicating that the 
> DS should use those measurements if it only has one or fewer samples for a 
> host. This fixes both problems because on process restart we send out 
> {{PingMessages}} / {{EchoMessages}} as part of startup, and we would reset to 
> effectively the RTT of the hosts (also at that point normal gossip 
> {{EchoMessages}} have an opportunity to add an additional latency 
> measurement).
> This strategy also nicely deals with the "a host got slow but now it's fine" 
> problem that the DS resets were (afaik) designed to stop because the 
> {{EchoMessage}} ping latency will count only after the reset for that host. 
> Ping latency is a more reasonable lower bound on host latency (as opposed to 
> status quo of zero).


