[ https://issues.apache.org/jira/browse/CASSANDRA-4261?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13413152#comment-13413152 ]
Shivaram Venkataraman commented on CASSANDRA-4261:
--------------------------------------------------

Local latency counts, but whether this is a bug or a feature depends on how often we are local. If we have a small cluster (say, ReplicationFactor=3 with 3 nodes), it's possible that many requests can be fulfilled locally. If we have a large cluster (say, ReplicationFactor=3 with 100 nodes), a much smaller number of requests will be fulfilled locally; each replica that is contacted will, on average, have to make a larger number of remote requests. We expect that remote operations are more common for larger clusters, where consistency is a problem. One of our caveats is that we only simulate non-local reads and writes and assume that the coordinating Cassandra node is not a replica. Without knowing the client-to-replica request routing, it's difficult to make better predictions. As an alternative, we could use a heuristic (say, 30% of operations are local, occurring randomly on different nodes) or even try to profile this.

> [patch] Support consistency-latency prediction in nodetool
> ----------------------------------------------------------
>
>                 Key: CASSANDRA-4261
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-4261
>             Project: Cassandra
>          Issue Type: New Feature
>          Components: Tools
>    Affects Versions: 1.2
>            Reporter: Peter Bailis
>         Attachments: demo-pbs-v3.sh, pbs-nodetool-v3.patch
>
>
> h3. Introduction
> Cassandra supports a variety of replication configurations: ReplicationFactor
> is set per-ColumnFamily and ConsistencyLevel is set per-request. Setting
> {{ConsistencyLevel}} to {{QUORUM}} for reads and writes ensures strong
> consistency, but {{QUORUM}} is often slower than {{ONE}}, {{TWO}}, or
> {{THREE}}. What should users choose?
> This patch provides a latency-consistency analysis within {{nodetool}}. Users
> can accurately predict Cassandra's behavior in their production environments
> without interfering with performance.
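To put rough numbers on the locality point above (an illustrative sketch, not part of the patch; it assumes token-unaware clients that pick a coordinator uniformly at random):

```python
# Illustrative sketch: with a uniformly random choice of coordinator, the
# chance that the coordinator for a key is itself one of that key's
# ReplicationFactor replicas is ReplicationFactor / cluster_size.

def local_coordination_probability(replication_factor, num_nodes):
    """Probability that a randomly chosen coordinator holds a replica of the key."""
    return min(replication_factor / num_nodes, 1.0)

print(local_coordination_probability(3, 3))    # 3-node cluster: 1.0
print(local_coordination_probability(3, 100))  # 100-node cluster: 0.03
```

This is why the "coordinator is not a replica" assumption is nearly exact for a 100-node cluster but badly wrong for a 3-node one.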
> What's the probability that we'll read a write t seconds after it completes?
> What about reading one of the last k writes? This patch provides answers via
> {{nodetool predictconsistency}}:
> {{nodetool predictconsistency ReplicationFactor TimeAfterWrite Versions}}
> \\ \\
> {code:title=Example output|borderStyle=solid}
> //N == ReplicationFactor
> //R == read ConsistencyLevel
> //W == write ConsistencyLevel
> user@test:$ nodetool predictconsistency 3 100 1
> Performing consistency prediction
> 100ms after a given write, with maximum version staleness of k=1
>
> N=3, R=1, W=1
> Probability of consistent reads: 0.678900
> Average read latency: 5.377900ms (99.900th %ile 40ms)
> Average write latency: 36.971298ms (99.900th %ile 294ms)
>
> N=3, R=1, W=2
> Probability of consistent reads: 0.791600
> Average read latency: 5.372500ms (99.900th %ile 39ms)
> Average write latency: 303.630890ms (99.900th %ile 357ms)
>
> N=3, R=1, W=3
> Probability of consistent reads: 1.000000
> Average read latency: 5.426600ms (99.900th %ile 42ms)
> Average write latency: 1382.650879ms (99.900th %ile 629ms)
>
> N=3, R=2, W=1
> Probability of consistent reads: 0.915800
> Average read latency: 11.091000ms (99.900th %ile 348ms)
> Average write latency: 42.663101ms (99.900th %ile 284ms)
>
> N=3, R=2, W=2
> Probability of consistent reads: 1.000000
> Average read latency: 10.606800ms (99.900th %ile 263ms)
> Average write latency: 310.117615ms (99.900th %ile 335ms)
>
> N=3, R=3, W=1
> Probability of consistent reads: 1.000000
> Average read latency: 52.657501ms (99.900th %ile 565ms)
> Average write latency: 39.949799ms (99.900th %ile 237ms)
> {code}
> h3. Demo
> Here's an example scenario you can run using
> [ccm|https://github.com/pcmanus/ccm].
> The prediction is fast:
> {code:borderStyle=solid}
> cd <cassandra-source-dir with patch applied>
> ant
> # turn on consistency logging
> sed -i .bak 's/log_latencies_for_consistency_prediction: false/log_latencies_for_consistency_prediction: true/' conf/cassandra.yaml
> ccm create consistencytest --cassandra-dir=.
> ccm populate -n 5
> ccm start
> # if start fails, you might need to initialize more loopback interfaces
> # e.g., sudo ifconfig lo0 alias 127.0.0.2
> # use stress to get some sample latency data
> tools/bin/stress -d 127.0.0.1 -l 3 -n 10000 -o insert
> tools/bin/stress -d 127.0.0.1 -l 3 -n 10000 -o read
> bin/nodetool -h 127.0.0.1 -p 7100 predictconsistency 3 100 1
> {code}
> h3. What and Why
> We've implemented [Probabilistically Bounded
> Staleness|http://pbs.cs.berkeley.edu/#demo], a new technique for predicting
> consistency-latency trade-offs within Cassandra. Our
> [paper|http://arxiv.org/pdf/1204.6082.pdf] will appear in [VLDB
> 2012|http://www.vldb2012.org/], and, in it, we've used PBS to profile a range
> of Dynamo-style data store deployments at places like LinkedIn and Yammer, in
> addition to profiling our own Cassandra deployments. In our experience,
> prediction is both accurate and much more lightweight than profiling and
> manually testing each possible replication configuration (especially in
> production!).
> This analysis is important for the many users we've talked to and heard about
> who use "partial quorum" operation (e.g., non-{{QUORUM}}
> {{ConsistencyLevel}}). Should they use CL={{ONE}}? CL={{TWO}}? It likely
> depends on their runtime environment and, short of profiling in production,
> there's no existing way to answer these questions. (Keep in mind, Cassandra
> defaults to CL={{ONE}}, meaning users don't know how stale their data will
> be.)
> We outline limitations of the current approach after describing how it's
> done.
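For intuition on why partial quorums can return stale data, here is a back-of-the-envelope lower bound (an editorial sketch, not the patch's analysis): if a read contacts R of N replicas chosen uniformly at random, and at read time only the W replicas that acknowledged the write hold it (i.e., ignoring any propagation during the time window), the chance of quorum overlap is hypergeometric.

```python
from math import comb

def worst_case_consistency(n, r, w):
    """P(a uniformly random read quorum of size r overlaps the w replicas
    holding the write) -- a lower bound that ignores write propagation."""
    if r + w > n:
        return 1.0  # strict quorums always intersect
    return 1.0 - comb(n - w, r) / comb(n, r)

print(worst_case_consistency(3, 1, 1))  # only a 1-in-3 chance of overlap
print(worst_case_consistency(3, 2, 2))  # QUORUM reads and writes: always 1.0
```

The patch's predictions (e.g., 0.6789 for N=3, R=1, W=1 in the example output above) are higher than this bound because writes keep propagating to the remaining replicas during the 100ms window.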
> We believe that this is a useful feature that can provide guidance and
> fairly accurate estimation for most users.
> h3. Interface
> This patch allows users to perform this prediction in production using
> {{nodetool}}.
> Users enable tracing of latency data by setting
> {{log_latencies_for_consistency_prediction: true}} in {{cassandra.yaml}}.
> Cassandra logs {{max_logged_latencies_for_consistency_prediction}} latencies.
> Each latency is 8 bytes, and there are 4 distributions we require, so the
> space overhead is {{32*logged_latencies}} bytes of memory for the predicting
> node.
> {{nodetool predictconsistency}} predicts the latency and consistency for each
> possible {{ConsistencyLevel}} setting (reads and writes) by running
> {{number_trials_for_consistency_prediction}} Monte Carlo trials per
> configuration.
> Users shouldn't have to touch these parameters; the defaults work well. The
> more latencies they log, the better the predictions will be.
> h3. Implementation
> This patch is fairly lightweight, requiring minimal changes to existing code.
> The high-level overview is that we gather empirical latency distributions,
> then perform a Monte Carlo analysis using the gathered data.
> h4. Latency Data
> We log latency data in {{service.PBSPredictor}}, recording four relevant
> distributions:
> * *W*: time from when the coordinator sends a mutation to the time that a
> replica begins to serve the new value(s)
> * *A*: time from when a replica accepting a mutation sends an acknowledgment
> to the time that the coordinator receives it
> * *R*: time from when the coordinator sends a read request to the time that
> the replica performs the read
> * *S*: time from when the replica sends a read response to the time when the
> coordinator receives it
> We augment {{net.MessageIn}} and {{net.MessageOut}} to store timestamps along
> with every message (8 bytes of overhead required for a millisecond {{long}}).
> In {{net.MessagingService}}, we log the start of every mutation and read,
> and, in {{net.ResponseVerbHandler}}, we log the end of every mutation and
> read. Jonathan Ellis mentioned that
> [1123|https://issues.apache.org/jira/browse/CASSANDRA-1123] had similar
> latency tracing, but, as far as we can tell, these latencies aren't in that
> patch. We use an LRU policy to bound the number of latencies we track for
> each distribution.
> h4. Prediction
> When prompted by {{nodetool}}, we call {{service.PBSPredictor.doPrediction}},
> which performs the actual Monte Carlo analysis based on the provided data.
> It's straightforward, and we've commented this analysis pretty well, but we
> can elaborate more here if required.
> h4. Testing
> We've modified the unit test for {{SerializationsTest}} and provided a new
> unit test for {{PBSPredictor}} ({{PBSPredictorTest}}). You can run the
> {{PBSPredictor}} test with {{ant pbs-test}}.
> h4. Overhead
> This patch introduces 8 bytes of overhead per message. We could reduce this
> overhead and add timestamps on demand, but this would require changing
> {{net.MessageIn}} and {{net.MessageOut}} serialization at runtime, which is
> messy.
> If enabled, consistency tracing requires {{32*logged_latencies}} bytes of
> memory on the node on which tracing is enabled.
> h3. Caveats
> The predictions are conservative, or worst-case, meaning we may predict more
> staleness than occurs in practice, in the following ways:
> * We do not account for read repair.
> * We do not account for Merkle tree exchange.
> * Multi-version staleness is particularly conservative.
> * We simulate non-local reads and writes. We assume that the coordinating
> Cassandra node is not itself a replica for a given key.
> The predictions are optimistic in the following ways:
> * We do not predict the impact of node failure.
> * We do not model hinted handoff.
> Predictions are only as good as the collected latencies.
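For intuition, the Monte Carlo analysis described under "Prediction" can be sketched roughly as follows. This is an illustrative Python approximation, not the patch's Java {{PBSPredictor}}: the exponential message latencies with a 10ms mean are made up, whereas the real tool resamples the logged W/A/R/S distributions.

```python
import random

def trial_is_consistent(n, r, w, t_ms, sample_latency, rng):
    """One Monte Carlo trial: does a read issued t_ms after the write
    commits observe that write?"""
    # W: coordinator -> replica mutation delivery; A: replica -> coordinator ack
    w_lat = [sample_latency(rng) for _ in range(n)]
    a_lat = [sample_latency(rng) for _ in range(n)]
    # The write returns to the client once the w-th fastest ack arrives
    commit_time = sorted(wi + ai for wi, ai in zip(w_lat, a_lat))[w - 1]
    read_start = commit_time + t_ms
    # R: coordinator -> replica read request; S: replica -> coordinator response
    responses = []
    for i in range(n):
        req, resp = sample_latency(rng), sample_latency(rng)
        # Replica i serves the new value iff the mutation reached it before
        # it handled the read request
        has_value = w_lat[i] <= read_start + req
        responses.append((req + resp, has_value))
    # The coordinator returns the newest value among the r fastest responses
    fastest_r = sorted(responses)[:r]
    return any(has_value for _, has_value in fastest_r)

def predict_consistency(n, r, w, t_ms, trials=10_000, seed=0):
    """Estimate P(consistent read) for given N, R, W and time-after-write."""
    rng = random.Random(seed)
    # Made-up stand-in distribution; the patch samples logged latencies instead
    sample = lambda rng: rng.expovariate(1 / 10.0)
    hits = sum(trial_is_consistent(n, r, w, t_ms, sample, rng)
               for _ in range(trials))
    return hits / trials
```

For example, `predict_consistency(3, 1, 1, 100)` estimates the probability that a CL={{ONE}} read sees a CL={{ONE}} write 100ms later, under these assumed latencies; longer windows and larger quorums both push the estimate toward 1.0.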
> Generally, the more latencies that are collected, the better, but if the
> environment or workload changes, the predictions may change as well. Also,
> we currently don't distinguish between column families or value sizes. This
> is doable, but it adds complexity to the interface and possibly more storage
> overhead.
> Finally, for accurate results, we require replicas to have synchronized
> clocks (Cassandra requires this from clients anyway). If clocks are skewed
> or out of sync, this will bias predictions by the magnitude of the skew.
> We can potentially improve these if there's interest, but this is an area of
> active research.
> ----
> Peter Bailis and Shivaram Venkataraman
> [pbai...@cs.berkeley.edu|mailto:pbai...@cs.berkeley.edu]
> [shiva...@cs.berkeley.edu|mailto:shiva...@cs.berkeley.edu]

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators:
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira