[ https://issues.apache.org/jira/browse/CASSANDRA-14459?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16649549#comment-16649549 ]
Joseph Lynch commented on CASSANDRA-14459:
------------------------------------------

[~aweisberg] Alright, I think I've accomplished all of the bullet points you listed. Along the way I made the DES 100% pluggable by class name, as well as significantly more testable, because I know various folks want to experiment with further improvements; having it be as easy as implementing a new subclass and dropping in a jar will make experimentation much easier going forward. I've also factored the "measurement" parts of the DES out into subclasses (the default Histogram and a new EMA option), leaving the "ranking" parts (ranking, probing, etc.) in the abstract base class (a rough sketch of the split is below, just before the benchmark numbers).

||trunk||
|[Pull Request for Review|https://github.com/apache/cassandra/pull/283]|
|[!https://circleci.com/gh/jolynch/cassandra/tree/CASSANDRA-14459.png?circle-token=1102a59698d04899ec971dd36e925928f7b521f5!|https://circleci.com/gh/jolynch/cassandra/tree/CASSANDRA-14459]|

There are a couple of points I'm unsure about:
* I was able to send small + large messages, but picking the maximum of the two results is hard. I think it should be ok, because if there is a serious problem with a stage we'll essentially average the "good" stage and the "bad" stage and most likely still end up with a pretty bad number.
* Right now I've removed the reset interval, but I actually think it's a good idea to keep it and just redefine what it does: it would become the interval at which we determine the probe list, while the probe interval controls how often we actually send a probe. I also like this because we could keep the old Histogram implementation as the default and let users _opt in_ to the new Histogram or the EMA approach as they want. This would also remove the (imo) somewhat awkward use of the ping timeout to rate limit how often we generate probe lists.
* The current algorithm only sends probes to nodes that were requested recently, but that is still potentially a lot of messages at the default interval of 1s. I know you think that's ok, but e.g. Jason above didn't like it (for example, a 300 node * 3 dc, 256 vnode cluster operating at LOCAL_ONE CL would essentially be sending out a probe per second at all times, since there are so many ranked replicas, including remote dcs). We could mitigate this somewhat by changing the local filtering step in {{ReplicaPlans::forRead}} to happen before we ask the DES to rank rather than after (this seems like a good idea regardless).

While I was making it more testable I ran a quick microbenchmark to show the CPU and memory benefits of the EMA approach. The tradeoff, of course, is that the EMA reacts less quickly when the sample size is small, and much faster when there is a significant outlier (e.g. given a single 10s timing, the EMA will deprefer that host almost immediately, whereas the histogram ... won't).
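As promised above, here is a rough sketch of the measurement/ranking split; the class and method names are illustrative, not necessarily what ended up in the pull request:

{code:java}
// Illustrative sketch only -- the class and method names here are NOT the ones
// in the pull request; they just show the shape of the measurement/ranking split.
import java.net.InetAddress;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

public abstract class AbstractDynamicSnitch
{
    // "Measurement" half: a subclass decides how each latency sample is folded
    // into a per-endpoint score (e.g. a histogram percentile, or an EMA).
    protected abstract void receiveTiming(InetAddress host, double latencyMillis);
    protected abstract double scoreOf(InetAddress host);

    // "Ranking" half stays in the abstract base class (along with probing and
    // badness-threshold logic): order replicas by their measured score.
    public List<InetAddress> sortedByProximity(List<InetAddress> replicas)
    {
        List<InetAddress> sorted = new ArrayList<>(replicas);
        sorted.sort(Comparator.comparingDouble(this::scoreOf));
        return sorted;
    }
}
{code}

With that shape, experimenting with a new measurement strategy really is just subclassing and overriding the two measurement methods; the ranking and probing machinery is shared.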
{noformat}
Benchmark Mode Cnt Score Error Units
SnitchMetricsBench.emaSnitchThreadedUpdate sample 190609 1.573 ± 0.016 ms/op
SnitchMetricsBench.emaSnitchThreadedUpdate:emaSnitchThreadedUpdate·p0.00 sample 0.283 ms/op
SnitchMetricsBench.emaSnitchThreadedUpdate:emaSnitchThreadedUpdate·p0.50 sample 1.288 ms/op
SnitchMetricsBench.emaSnitchThreadedUpdate:emaSnitchThreadedUpdate·p0.90 sample 1.456 ms/op
SnitchMetricsBench.emaSnitchThreadedUpdate:emaSnitchThreadedUpdate·p0.95 sample 1.597 ms/op
SnitchMetricsBench.emaSnitchThreadedUpdate:emaSnitchThreadedUpdate·p0.99 sample 15.286 ms/op
SnitchMetricsBench.emaSnitchThreadedUpdate:emaSnitchThreadedUpdate·p0.999 sample 21.594 ms/op
SnitchMetricsBench.emaSnitchThreadedUpdate:emaSnitchThreadedUpdate·p0.9999 sample 31.319 ms/op
SnitchMetricsBench.emaSnitchThreadedUpdate:emaSnitchThreadedUpdate·p1.00 sample 49.086 ms/op
SnitchMetricsBench.emaSnitchThreadedUpdate:·gc.alloc.rate sample 3 4.158 ± 10.217 MB/sec
SnitchMetricsBench.emaSnitchThreadedUpdate:·gc.alloc.rate.norm sample 3 722.283 ± 55.616 B/op
SnitchMetricsBench.emaSnitchThreadedUpdate:·gc.churn.PS_Eden_Space sample 3 3.949 ± 62.392 MB/sec
SnitchMetricsBench.emaSnitchThreadedUpdate:·gc.churn.PS_Eden_Space.norm sample 3 745.116 ± 11782.387 B/op
SnitchMetricsBench.emaSnitchThreadedUpdate:·gc.churn.PS_Survivor_Space sample 3 0.108 ± 3.405 MB/sec
SnitchMetricsBench.emaSnitchThreadedUpdate:·gc.churn.PS_Survivor_Space.norm sample 3 19.861 ± 627.604 B/op
SnitchMetricsBench.emaSnitchThreadedUpdate:·gc.count sample 3 2.000 counts
SnitchMetricsBench.emaSnitchThreadedUpdate:·gc.time sample 3 6.000 ms
SnitchMetricsBench.emaSnitchUpdate sample 101392 0.296 ± 0.001 ms/op
SnitchMetricsBench.emaSnitchUpdate:emaSnitchUpdate·p0.00 sample 0.263 ms/op
SnitchMetricsBench.emaSnitchUpdate:emaSnitchUpdate·p0.50 sample 0.297 ms/op
SnitchMetricsBench.emaSnitchUpdate:emaSnitchUpdate·p0.90 sample 0.326 ms/op
SnitchMetricsBench.emaSnitchUpdate:emaSnitchUpdate·p0.95 sample 0.351 ms/op
SnitchMetricsBench.emaSnitchUpdate:emaSnitchUpdate·p0.99 sample 0.417 ms/op
SnitchMetricsBench.emaSnitchUpdate:emaSnitchUpdate·p0.999 sample 0.523 ms/op
SnitchMetricsBench.emaSnitchUpdate:emaSnitchUpdate·p0.9999 sample 0.635 ms/op
SnitchMetricsBench.emaSnitchUpdate:emaSnitchUpdate·p1.00 sample 5.300 ms/op
SnitchMetricsBench.emaSnitchUpdate:·gc.alloc.rate sample 3 2.120 ± 2.248 MB/sec
SnitchMetricsBench.emaSnitchUpdate:·gc.alloc.rate.norm sample 3 690.546 ± 5.915 B/op
SnitchMetricsBench.emaSnitchUpdate:·gc.churn.PS_Eden_Space sample 3 1.984 ± 62.688 MB/sec
SnitchMetricsBench.emaSnitchUpdate:·gc.churn.PS_Eden_Space.norm sample 3 606.680 ± 19170.514 B/op
SnitchMetricsBench.emaSnitchUpdate:·gc.count sample 3 1.000 counts
SnitchMetricsBench.emaSnitchUpdate:·gc.time sample 3 4.000 ms
SnitchMetricsBench.histogramSnitchThreaded sample 31134 9.627 ± 0.097 ms/op
SnitchMetricsBench.histogramSnitchThreaded:histogramSnitchThreaded·p0.00 sample 4.030 ms/op
SnitchMetricsBench.histogramSnitchThreaded:histogramSnitchThreaded·p0.50 sample 7.545 ms/op
SnitchMetricsBench.histogramSnitchThreaded:histogramSnitchThreaded·p0.90 sample 17.662 ms/op
SnitchMetricsBench.histogramSnitchThreaded:histogramSnitchThreaded·p0.95 sample 23.003 ms/op
SnitchMetricsBench.histogramSnitchThreaded:histogramSnitchThreaded·p0.99 sample 29.119 ms/op
SnitchMetricsBench.histogramSnitchThreaded:histogramSnitchThreaded·p0.999 sample 39.697 ms/op
SnitchMetricsBench.histogramSnitchThreaded:histogramSnitchThreaded·p0.9999 sample 53.506 ms/op
SnitchMetricsBench.histogramSnitchThreaded:histogramSnitchThreaded·p1.00 sample 55.575 ms/op
SnitchMetricsBench.histogramSnitchThreaded:·gc.alloc.rate sample 3 668.096 ± 61.306 MB/sec
SnitchMetricsBench.histogramSnitchThreaded:·gc.alloc.rate.norm sample 3 708597.630 ± 32119.534 B/op
SnitchMetricsBench.histogramSnitchThreaded:·gc.churn.PS_Eden_Space sample 3 673.780 ± 134.265 MB/sec
SnitchMetricsBench.histogramSnitchThreaded:·gc.churn.PS_Eden_Space.norm sample 3 714612.589 ± 88018.271 B/op
SnitchMetricsBench.histogramSnitchThreaded:·gc.churn.PS_Survivor_Space sample 3 0.154 ± 0.009 MB/sec
SnitchMetricsBench.histogramSnitchThreaded:·gc.churn.PS_Survivor_Space.norm sample 3 163.122 ± 19.745 B/op
SnitchMetricsBench.histogramSnitchThreaded:·gc.count sample 3 273.000 counts
SnitchMetricsBench.histogramSnitchThreaded:·gc.time sample 3 303.000 ms
SnitchMetricsBench.histogramSnitchUpdate sample 23007 1.303 ± 0.003 ms/op
SnitchMetricsBench.histogramSnitchUpdate:histogramSnitchUpdate·p0.00 sample 1.204 ms/op
SnitchMetricsBench.histogramSnitchUpdate:histogramSnitchUpdate·p0.50 sample 1.268 ms/op
SnitchMetricsBench.histogramSnitchUpdate:histogramSnitchUpdate·p0.90 sample 1.368 ms/op
SnitchMetricsBench.histogramSnitchUpdate:histogramSnitchUpdate·p0.95 sample 1.466 ms/op
SnitchMetricsBench.histogramSnitchUpdate:histogramSnitchUpdate·p0.99 sample 2.121 ms/op
SnitchMetricsBench.histogramSnitchUpdate:histogramSnitchUpdate·p0.999 sample 2.830 ms/op
SnitchMetricsBench.histogramSnitchUpdate:histogramSnitchUpdate·p0.9999 sample 3.926 ms/op
SnitchMetricsBench.histogramSnitchUpdate:histogramSnitchUpdate·p1.00 sample 5.218 ms/op
SnitchMetricsBench.histogramSnitchUpdate:·gc.alloc.rate sample 3 236.317 ± 3.365 MB/sec
SnitchMetricsBench.histogramSnitchUpdate:·gc.alloc.rate.norm sample 3 339325.737 ± 47.087 B/op
SnitchMetricsBench.histogramSnitchUpdate:·gc.churn.PS_Eden_Space sample 3 236.374 ± 15.878 MB/sec
SnitchMetricsBench.histogramSnitchUpdate:·gc.churn.PS_Eden_Space.norm sample 3 339407.281 ± 21325.135 B/op
SnitchMetricsBench.histogramSnitchUpdate:·gc.churn.PS_Survivor_Space sample 3 0.142 ± 0.462 MB/sec
SnitchMetricsBench.histogramSnitchUpdate:·gc.churn.PS_Survivor_Space.norm sample 3 203.675 ± 663.495 B/op
SnitchMetricsBench.histogramSnitchUpdate:·gc.count sample 3 223.000 counts
SnitchMetricsBench.histogramSnitchUpdate:·gc.time sample 3 232.000 ms
{noformat}

As you can see, the traditional histogram snitch generates roughly 700 MB/sec of garbage versus roughly 4 MB/sec for the EMA, and the histogram snitch was slower at 9.6ms average versus 1.5ms average for the EMA. So while the ~6x speed improvement is nice, the reduced garbage is going to be the really nice part.
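For intuition on where the allocation difference comes from: the EMA update can be a single lock-free read-modify-write of one double per endpoint, whereas a histogram snitch has to record every sample into a reservoir. A minimal sketch of such an update follows; it shows the general technique, not the exact code in the pull request, and the alpha constant is illustrative:

{code:java}
// Hedged sketch: keep one EMA per endpoint as a double stored in an AtomicLong,
// so the steady-state sample path allocates nothing and takes no locks.
import java.util.concurrent.atomic.AtomicLong;

public class EmaLatency
{
    private static final double ALPHA = 0.75; // weight of the newest sample (illustrative)

    // Store the double's bit pattern in an AtomicLong for lock-free CAS updates.
    private final AtomicLong emaBits = new AtomicLong(Double.doubleToLongBits(Double.NaN));

    public void sample(double latencyMillis)
    {
        long prev, next;
        do
        {
            prev = emaBits.get();
            double old = Double.longBitsToDouble(prev);
            double updated = Double.isNaN(old)
                           ? latencyMillis // the first sample seeds the average
                           : ALPHA * latencyMillis + (1 - ALPHA) * old;
            next = Double.doubleToLongBits(updated);
        }
        while (!emaBits.compareAndSet(prev, next));
    }

    public double get()
    {
        return Double.longBitsToDouble(emaBits.get());
    }
}
{code}

With a fairly aggressive alpha like this, a single 10s sample immediately dominates the average, which is the "deprefers that host almost immediately" behavior described above; a percentile over a histogram needs many more samples before it moves.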
> DynamicEndpointSnitch should never prefer latent nodes
> ------------------------------------------------------
>
>                 Key: CASSANDRA-14459
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-14459
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: Coordination
>            Reporter: Joseph Lynch
>            Assignee: Joseph Lynch
>            Priority: Minor
>              Labels: 4.0-feature-freeze-review-requested, pull-request-available
>             Fix For: 4.x
>
>          Time Spent: 10m
>  Remaining Estimate: 0h
>
> The DynamicEndpointSnitch has two unfortunate behaviors that allow it to provide latent hosts as replicas:
> # Loses all latency information when Cassandra restarts
> # Clears latency information entirely every ten minutes (by default), allowing global queries to be routed to _other datacenters_ (and local queries cross racks/azs)
>
> This means that the first few queries after restart/reset could be quite slow compared to average latencies. I propose we solve this by resetting to the minimum observed latency instead of completely clearing the samples and extending the {{isLatencyForSnitch}} idea to a three state variable instead of two, in particular {{YES}}, {{NO}}, {{MAYBE}}. This extension allows {{EchoMessages}} and {{PingMessages}} to send {{MAYBE}} indicating that the DS should use those measurements if it only has one or fewer samples for a host. This fixes both problems because on process restart we send out {{PingMessages}} / {{EchoMessages}} as part of startup, and we would reset to effectively the RTT of the hosts (also at that point normal gossip {{EchoMessages}} have an opportunity to add an additional latency measurement).
>
> This strategy also nicely deals with the "a host got slow but now it's fine" problem that the DS resets were (afaik) designed to stop, because the {{EchoMessage}} ping latency will count only after the reset for that host. Ping latency is a more reasonable lower bound on host latency (as opposed to the status quo of zero).
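As a footnote on the description above, the three-state {{isLatencyForSnitch}} idea is easy to picture as an enum. Only {{YES}}/{{NO}}/{{MAYBE}} and the "one or fewer samples" rule come from the description; the surrounding names in this sketch are hypothetical:

{code:java}
// Hypothetical sketch of the three-state idea; only YES/NO/MAYBE and the
// "one or fewer samples" rule come from the ticket description above.
public enum LatencyMeasurementType
{
    YES,   // always feed this measurement to the dynamic snitch
    NO,    // never feed it to the dynamic snitch
    MAYBE  // use it only when we have almost no other samples for the host
}

class SampleFilter
{
    // MAYBE samples (e.g. ping/echo RTTs at startup) only count while a host
    // has one or fewer real samples, per the proposal above.
    static boolean shouldRecord(LatencyMeasurementType type, int existingSamples)
    {
        switch (type)
        {
            case YES:   return true;
            case MAYBE: return existingSamples <= 1;
            default:    return false;
        }
    }
}
{code}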