[ https://issues.apache.org/jira/browse/CASSANDRA-4261?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13456871#comment-13456871 ]
Matt Blair commented on CASSANDRA-4261:
---------------------------------------

So now that CASSANDRA-1123 has been resolved, will this get merged in time for 1.2?

[patch] Support consistency-latency prediction in nodetool
----------------------------------------------------------

                 Key: CASSANDRA-4261
                 URL: https://issues.apache.org/jira/browse/CASSANDRA-4261
             Project: Cassandra
          Issue Type: New Feature
          Components: Tools
    Affects Versions: 1.2.0 beta 1
            Reporter: Peter Bailis
         Attachments: demo-pbs-v3.sh, pbs-nodetool-v3.patch

h3. Introduction

Cassandra supports a variety of replication configurations: ReplicationFactor is set per-ColumnFamily, and ConsistencyLevel is set per-request. Setting {{ConsistencyLevel}} to {{QUORUM}} for reads and writes ensures strong consistency, but {{QUORUM}} is often slower than {{ONE}}, {{TWO}}, or {{THREE}}. What should users choose?

This patch provides a latency-consistency analysis within {{nodetool}}. Users can accurately predict Cassandra's behavior in their production environments without interfering with performance.

What's the probability that we'll read a write t seconds after it completes? What about reading one of the last k writes? This patch provides answers via {{nodetool predictconsistency}}:

{{nodetool predictconsistency ReplicationFactor TimeAfterWrite Versions}}

{code:title=Example output|borderStyle=solid}
//N == ReplicationFactor
//R == read ConsistencyLevel
//W == write ConsistencyLevel

user@test:$ nodetool predictconsistency 3 100 1
Performing consistency prediction
100ms after a given write, with maximum version staleness of k=1

N=3, R=1, W=1
Probability of consistent reads: 0.678900
Average read latency: 5.377900ms (99.900th %ile 40ms)
Average write latency: 36.971298ms (99.900th %ile 294ms)

N=3, R=1, W=2
Probability of consistent reads: 0.791600
Average read latency: 5.372500ms (99.900th %ile 39ms)
Average write latency: 303.630890ms (99.900th %ile 357ms)

N=3, R=1, W=3
Probability of consistent reads: 1.000000
Average read latency: 5.426600ms (99.900th %ile 42ms)
Average write latency: 1382.650879ms (99.900th %ile 629ms)

N=3, R=2, W=1
Probability of consistent reads: 0.915800
Average read latency: 11.091000ms (99.900th %ile 348ms)
Average write latency: 42.663101ms (99.900th %ile 284ms)

N=3, R=2, W=2
Probability of consistent reads: 1.000000
Average read latency: 10.606800ms (99.900th %ile 263ms)
Average write latency: 310.117615ms (99.900th %ile 335ms)

N=3, R=3, W=1
Probability of consistent reads: 1.000000
Average read latency: 52.657501ms (99.900th %ile 565ms)
Average write latency: 39.949799ms (99.900th %ile 237ms)
{code}

h3. Demo

Here's an example scenario you can run using [ccm|https://github.com/pcmanus/ccm]. The prediction is fast:

{code:borderStyle=solid}
cd <cassandra-source-dir with patch applied>
ant
ccm create consistencytest --cassandra-dir=.
ccm populate -n 5
ccm start

# if start fails, you might need to initialize more loopback interfaces
# e.g., sudo ifconfig lo0 alias 127.0.0.2

# use stress to get some sample latency data
tools/bin/stress -d 127.0.0.1 -l 3 -n 10000 -o insert
tools/bin/stress -d 127.0.0.1 -l 3 -n 10000 -o read

bin/nodetool -h 127.0.0.1 -p 7100 predictconsistency 3 100 1
{code}

h3. What and Why

We've implemented [Probabilistically Bounded Staleness|http://pbs.cs.berkeley.edu/#demo], a new technique for predicting consistency-latency trade-offs within Cassandra. Our [paper|http://arxiv.org/pdf/1204.6082.pdf] will appear in [VLDB 2012|http://www.vldb2012.org/], and, in it, we've used PBS to profile a range of Dynamo-style data store deployments at places like LinkedIn and Yammer, in addition to profiling our own Cassandra deployments. In our experience, prediction is both accurate and much more lightweight than profiling and manually testing each possible replication configuration (especially in production!).

This analysis is important for the many users we've talked to and heard about who use "partial quorum" operation (e.g., non-{{QUORUM}} {{ConsistencyLevel}}). Should they use CL={{ONE}}? CL={{TWO}}? It likely depends on their runtime environment and, short of profiling in production, there's no existing way to answer these questions. (Keep in mind, Cassandra defaults to CL={{ONE}}, meaning users don't know how stale their data will be.)

We outline limitations of the current approach after describing how it's done. We believe that this is a useful feature that can provide guidance and fairly accurate estimation for most users.

h3. Interface

This patch allows users to perform this prediction in production using {{nodetool}}.

Users enable tracing of latency data by calling {{enableConsistencyPredictionLogging()}} on the {{PBSPredictorMBean}}. Cassandra logs a configurable number of latencies (settable via JMX with {{setMaxLoggedLatenciesForConsistencyPrediction(int maxLogged)}}; default: 10000). Each latency is 8 bytes, and there are 4 distributions we require, so the space overhead is {{32*logged_latencies}} bytes of memory for the predicting node.

{{nodetool predictconsistency}} predicts the latency and consistency for each possible {{ConsistencyLevel}} setting (reads and writes) by running a configurable number of Monte Carlo trials per configuration ({{setNumberTrialsForConsistencyPrediction(int numTrials)}}; default: 10000).

Users shouldn't have to touch these parameters; the defaults work well. The more latencies they log, the better the predictions will be.
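For readers who script cluster operations, here is a rough, hypothetical sketch (not part of the patch) of driving these toggles from a standalone JMX client. The service URL, the port 7100 (taken from the ccm demo above; Cassandra's stock default is 7199), the {{ObjectName}} {{org.apache.cassandra.service:type=PBSPredictor}}, and the treatment of the setters as invokable operations are all assumptions for illustration; check the MBean registration in the patch for the real names.

{code:title=Hypothetical JMX client sketch|borderStyle=solid}
import javax.management.MBeanServerConnection;
import javax.management.ObjectName;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

public class EnablePbsTracing
{
    public static void main(String[] args) throws Exception
    {
        // Connect to the node's JMX port (7100 in the ccm demo above).
        JMXServiceURL url = new JMXServiceURL("service:jmx:rmi:///jndi/rmi://127.0.0.1:7100/jmxrmi");
        JMXConnector connector = JMXConnectorFactory.connect(url);
        try
        {
            MBeanServerConnection connection = connector.getMBeanServerConnection();

            // Hypothetical ObjectName; the patch's actual registration may differ.
            ObjectName name = new ObjectName("org.apache.cassandra.service:type=PBSPredictor");

            // Turn on latency logging and (optionally) tune the knobs described above.
            connection.invoke(name, "enableConsistencyPredictionLogging", new Object[0], new String[0]);
            connection.invoke(name,
                              "setMaxLoggedLatenciesForConsistencyPrediction",
                              new Object[]{ 10000 },
                              new String[]{ "int" });
            connection.invoke(name,
                              "setNumberTrialsForConsistencyPrediction",
                              new Object[]{ 10000 },
                              new String[]{ "int" });
        }
        finally
        {
            connector.close();
        }
    }
}
{code}

jconsole or any other JMX client works just as well; the point is simply that latency logging is switched on explicitly before stress or production traffic populates the four distributions described next.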
h3. Implementation

This patch is fairly lightweight, requiring minimal changes to existing code. The high-level overview is that we gather empirical latency distributions, then perform Monte Carlo analysis using the gathered data.

h4. Latency Data

We log latency data in {{service.PBSPredictor}}, recording four relevant distributions:

* *W*: time from when the coordinator sends a mutation to the time that a replica begins to serve the new value(s)
* *A*: time from when a replica accepting a mutation sends an acknowledgment to the time that the coordinator receives it
* *R*: time from when the coordinator sends a read request to the time that the replica performs the read
* *S*: time from when the replica sends a read response to the time when the coordinator receives it

We augment {{net.MessageIn}} and {{net.MessageOut}} to store timestamps along with every message (8 bytes of overhead required for a millisecond {{long}}). In {{net.MessagingService}}, we log the start of every mutation and read, and, in {{net.ResponseVerbHandler}}, we log the end of every mutation and read. Jonathan Ellis mentioned that [1123|https://issues.apache.org/jira/browse/CASSANDRA-1123] had similar latency tracing, but, as far as we can tell, these latencies aren't in that patch.

We use an LRU policy to bound the number of latencies we track for each distribution.

h4. Prediction

When prompted by {{nodetool}}, we call {{service.PBSPredictor.doPrediction}}, which performs the actual Monte Carlo analysis based on the provided data. It's straightforward, and we've commented this analysis pretty well, but we can elaborate more here if required.
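To make the Monte Carlo step concrete, here is a simplified, self-contained sketch of a single (N, R, W) experiment in the spirit of {{doPrediction}}; it is not the patch's actual code. The class and method names, the uniform resampling of the logged latencies, and the restriction to k=1 staleness are assumptions for illustration, and it ignores read repair, hinted handoff, and node failure (see the Caveats below). Each trial draws one W and A latency per replica for the write, draws one R and S latency per replica for a read issued {{t}} ms after the write commits, and checks whether any of the R fastest responders had already received the write.

{code:title=Illustrative Monte Carlo trial (sketch, not the patch)|borderStyle=solid}
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

// Simplified sketch of a PBS-style consistency prediction for k=1.
// The four lists stand in for the logged W/A/R/S distributions described above.
public class PbsMonteCarloSketch
{
    private final Random random = new Random();
    private final List<Long> wLatencies, aLatencies, rLatencies, sLatencies;

    public PbsMonteCarloSketch(List<Long> w, List<Long> a, List<Long> r, List<Long> s)
    {
        wLatencies = w; aLatencies = a; rLatencies = r; sLatencies = s;
    }

    private long sample(List<Long> latencies)
    {
        // Resample uniformly from the (non-empty) logged latencies; the real
        // predictor may treat the data differently.
        return latencies.get(random.nextInt(latencies.size()));
    }

    /** Fraction of trials in which a read issued tMillis after the write commits sees that write. */
    public double consistentReadProbability(int n, int r, int w, long tMillis, int trials)
    {
        int consistent = 0;
        for (int trial = 0; trial < trials; trial++)
        {
            long[] writeArrival = new long[n];            // when replica i first serves the new value
            List<Long> ackArrival = new ArrayList<>(n);   // when each ack reaches the coordinator
            for (int i = 0; i < n; i++)
            {
                writeArrival[i] = sample(wLatencies);
                ackArrival.add(writeArrival[i] + sample(aLatencies));
            }
            Collections.sort(ackArrival);
            long commitTime = ackArrival.get(w - 1);      // write returns after the w-th ack
            long readStart = commitTime + tMillis;

            // Read goes to all n replicas; the coordinator uses the r fastest responses.
            List<long[]> responses = new ArrayList<>(n);  // {response arrival, time read was served, replica}
            for (int i = 0; i < n; i++)
            {
                long served = readStart + sample(rLatencies);
                long arrival = served + sample(sLatencies);
                responses.add(new long[]{ arrival, served, i });
            }
            responses.sort((x, y) -> Long.compare(x[0], y[0]));

            boolean sawWrite = false;
            for (int j = 0; j < r; j++)
            {
                long[] resp = responses.get(j);
                int replica = (int) resp[2];
                // Consistent if this replica had received the write before it served the read.
                if (writeArrival[replica] <= resp[1])
                    sawWrite = true;
            }
            if (sawWrite)
                consistent++;
        }
        return consistent / (double) trials;
    }
}
{code}

With the four logged distributions plugged in, {{consistentReadProbability(3, 1, 1, 100, 10000)}} should roughly correspond to the first row of the example output above, modulo the simplifications just listed.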
h4. Testing

We've modified {{SerializationsTest}} and provided a new unit test for {{PBSPredictor}} ({{PBSPredictorTest}}). You can run the {{PBSPredictor}} test with {{ant pbs-test}}.

h4. Overhead

This patch introduces 8 bytes of overhead per message. We could reduce this overhead by adding timestamps only on demand, but that would require changing {{net.MessageIn}} and {{net.MessageOut}} serialization at runtime, which is messy.

When enabled, consistency tracing requires {{32*logged_latencies}} bytes of memory on the node where it runs (with the default of 10000 logged latencies, roughly 320 KB).

h3. Caveats

The predictions are conservative, or worst-case, meaning we may predict more staleness than occurs in practice, in the following ways:

* We do not account for read repair.
* We do not account for Merkle tree exchange.
* Multi-version staleness is particularly conservative.

The predictions are optimistic in the following ways:

* We do not predict the impact of node failure.
* We do not model hinted handoff.

We simulate non-local reads and writes: we assume that the coordinating Cassandra node is not itself a replica for a given key. (See discussion below.)

Predictions are only as good as the collected latencies. Generally, the more latencies that are collected, the better, but if the environment or workload changes, the predictions may drift. Also, we currently don't distinguish between column families or value sizes. This is doable, but it adds complexity to the interface and possibly more storage overhead.

Finally, for accurate results, we require replicas to have synchronized clocks (Cassandra requires this from clients anyway). If clocks are skewed or out of sync, this will bias predictions by the magnitude of the skew.

We can potentially improve these if there's interest, but this is an area of active research.

----

Peter Bailis and Shivaram Venkataraman
[pbai...@cs.berkeley.edu|mailto:pbai...@cs.berkeley.edu]
[shiva...@cs.berkeley.edu|mailto:shiva...@cs.berkeley.edu]

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators.
For more information on JIRA, see: http://www.atlassian.com/software/jira