[ https://issues.apache.org/jira/browse/CASSANDRA-4261?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Peter Bailis updated CASSANDRA-4261:
------------------------------------

    Description: 
h3. Introduction

Cassandra supports a variety of replication configurations: ReplicationFactor 
is set per-ColumnFamily and ConsistencyLevel is set per-request. Setting 
ConsistencyLevel to QUORUM for reads and writes ensures strong consistency, but 
QUORUM is often slower than ONE, TWO, or THREE. What should users choose?

This patch provides a latency-consistency analysis within nodetool. Users can 
accurately predict Cassandra behavior in their production environments without 
interfering with performance.

What's the probability that we'll read a write t seconds after it completes? 
What about reading one of the last k writes? nodetool predictconsistency 
provides this:

{{nodetool predictconsistency ReplicationFactor TimeAfterWrite Versions}}

{code:title=Example output|borderStyle=solid}

//N == ReplicationFactor
//R == read ConsistencyLevel
//W == write ConsistencyLevel

user@test:$ nodetool predictconsistency 3 100 1
100ms after a given write, with maximum version staleness of k=1
N=3, R=1, W=1
Probability of consistent reads: 0.811700
Average read latency: 6.896300ms (99.900th %ile 174ms)
Average write latency: 8.788000ms (99.900th %ile 252ms)

N=3, R=1, W=2
Probability of consistent reads: 0.867200
Average read latency: 6.818200ms (99.900th %ile 152ms)
Average write latency: 33.226101ms (99.900th %ile 420ms)

N=3, R=1, W=3
Probability of consistent reads: 1.000000
Average read latency: 6.766800ms (99.900th %ile 111ms)
Average write latency: 153.764999ms (99.900th %ile 969ms)

N=3, R=2, W=1
Probability of consistent reads: 0.951500
Average read latency: 18.065800ms (99.900th %ile 414ms)
Average write latency: 8.322600ms (99.900th %ile 232ms)

N=3, R=2, W=2
Probability of consistent reads: 0.983000
Average read latency: 18.009001ms (99.900th %ile 387ms)
Average write latency: 35.797100ms (99.900th %ile 478ms)

N=3, R=3, W=1
Probability of consistent reads: 0.993900
Average read latency: 101.959702ms (99.900th %ile 1094ms)
Average write latency: 8.518600ms (99.900th %ile 236ms)
{code}

h3. Demo

Here's an example scenario you can run using 
[ccm|https://github.com/pcmanus/ccm]. The prediction is fast:

{code:borderStyle=solid}
cd <cassandra-source-dir with patch applied>
ant

# turn on consistency logging
sed -i .bak 's/log_latencies_for_consistency_prediction: false/log_latencies_for_consistency_prediction: true/' conf/cassandra.yaml

ccm create consistencytest --cassandra-dir=. 
ccm populate -n 5
ccm start

# if start fails, you might need to initialize more loopback interfaces
# e.g., sudo ifconfig lo0 alias 127.0.0.2

# use stress to get some sample latency data
tools/bin/stress -d 127.0.0.1 -l 3 -n 10000 -o insert
tools/bin/stress -d 127.0.0.1 -l 3 -n 10000 -o read

bin/nodetool -h 127.0.0.1 -p 7100 predictconsistency 3 100 1
{code}

h3. What and Why

We've implemented [Probabilistically Bounded 
Staleness|http://pbs.cs.berkeley.edu/#demo], a new technique for predicting 
consistency-latency trade-offs within Cassandra. Our 
[paper|http://arxiv.org/pdf/1204.6082.pdf] will appear in [VLDB 
2012|http://www.vldb2012.org/], and, in it, we've used PBS to profile a range 
of Dynamo-style data store deployments at places like LinkedIn and Yammer in 
addition to profiling our own Cassandra deployments. In our experience, 
prediction is both accurate and much more lightweight than trying out different 
configurations (especially in production!).

This analysis is important for the many users we've talked to and heard about 
who use "partial quorum" operation (e.g., non-QUORUM ConsistencyLevels). Should 
they use CL=ONE? CL=TWO? It likely depends on their runtime environment and, 
short of profiling in production, there's no existing way to answer these 
questions. (Keep in mind, Cassandra defaults to CL=ONE, meaning users don't 
know how stale their data will be.)

We outline limitations of the current approach after describing how it's done. 
We believe that this is a useful feature that can provide guidance and fairly 
accurate estimation for most users.

h3. Interface

This patch allows users to perform this prediction in production using 
{{nodetool}}.

Users enable tracing of latency data by setting 
{{log_latencies_for_consistency_prediction: true}} in {{cassandra.yaml}}.

Cassandra logs {{max_logged_latencies_for_consistency_prediction}} latencies. 
Each latency is 8 bytes, and there are 4 distributions we require, so the space 
overhead is {{32*logged_latencies}} bytes of memory for the predicting node.
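For example, with {{logged_latencies}} = 10,000, that is 32 * 10,000 = 320,000 
bytes, or roughly 320 KB.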

{{nodetool predictconsistency}} predicts the latency and consistency for each 
possible {{ConsistencyLevel}} setting (reads and writes) by running 
{{number_trials_for_consistency_prediction}} Monte Carlo trials per 
configuration.

Users shouldn't have to touch these parameters, and the defaults work well. The 
more latencies they log, the better the predictions will be.

h3. Implementation

This patch is fairly lightweight, requiring minimal changes to existing code. 
At a high level, we gather empirical latency distributions and then perform 
Monte Carlo analysis using the gathered data.

h4. Latency Data

We log latency data in {{service.PBSPredictor}}, recording four relevant 
distributions (see the sketch after this list):
 * *W*: time from when the coordinator sends a mutation to the time that a 
replica begins to serve the new value(s)
 * *A*: time from when a replica accepting a mutation sends an acknowledgment 
to the time that the coordinator receives it
 * *R*: time from when the coordinator sends a read request to the time that 
the replica performs the read
 * *S*: time from when the replica sends a read response to the time when the 
coordinator receives it
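
As a concrete, purely illustrative example, the following Java sketch shows one 
way a bounded per-distribution latency log could look; it is not the patch's 
actual code, and {{service.PBSPredictor}}'s internal representation may differ:

{code:borderStyle=solid}
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.EnumMap;
import java.util.Map;

// Illustrative sketch only: a bounded log of sampled latencies for each of the
// four distributions (W, A, R, S) described above.
public class LatencyLogSketch
{
    public enum Type { W, A, R, S }

    private final int maxLogged; // cf. max_logged_latencies_for_consistency_prediction
    private final Map<Type, Deque<Long>> latencies =
        new EnumMap<Type, Deque<Long>>(Type.class);

    public LatencyLogSketch(int maxLogged)
    {
        this.maxLogged = maxLogged;
        for (Type t : Type.values())
            latencies.put(t, new ArrayDeque<Long>());
    }

    // Record one observed latency in milliseconds, evicting the oldest sample
    // once full. Budgeting 8 bytes per logged latency across 4 distributions
    // gives the 32 * logged_latencies figure quoted above.
    public void log(Type type, long latencyMs)
    {
        Deque<Long> log = latencies.get(type);
        if (log.size() >= maxLogged)
            log.removeFirst();
        log.addLast(latencyMs);
    }
}
{code}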

We augment {{net.MessageIn}} and {{net.MessageOut}} to store timestamps along 
with every message (8 bytes of overhead for a millisecond {{long}}). In 
{{net.MessagingService}}, we log the start of every mutation and read, and, in 
{{net.ResponseVerbHandler}}, we log the end of every mutation and read. 
Jonathan Ellis mentioned that 
[1123|https://issues.apache.org/jira/browse/CASSANDRA-1123] had similar latency 
tracing, but, as far as we can tell, these latencies aren't in that patch.
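
To make the per-message overhead concrete, here is a minimal, hypothetical 
sketch of carrying a creation timestamp with each message and serializing it as 
a single {{long}}; it is not the actual {{net.MessageIn}}/{{net.MessageOut}} 
change:

{code:borderStyle=solid}
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;

// Hypothetical sketch only: attach a millisecond timestamp to each message and
// serialize it as one long, i.e., the 8 bytes of overhead described above.
public class TimestampedMessageSketch
{
    private final long createdAtMillis = System.currentTimeMillis();

    public void serializeTimestamp(DataOutputStream out) throws IOException
    {
        out.writeLong(createdAtMillis); // 8 bytes per message
    }

    public static long deserializeTimestamp(DataInputStream in) throws IOException
    {
        return in.readLong();
    }

    // The one-way latency for a leg (e.g., W or R) is then simply
    // receiveTimeMillis - createdAtMillis on the receiving node.
}
{code}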

h4. Prediction

When prompted by nodetool, we call {{service.PBSPredictor.doPrediction}}, which 
performs the actual Monte Carlo analysis based on the provided data. It's 
straightforward, and we've commented this analysis pretty well but can 
elaborate more here if required.
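
For readers who want the gist without reading the patch, here is a simplified, 
self-contained sketch of the analysis under the W/A/R/S model above. Class and 
method names are illustrative; the patch's actual logic lives in 
{{service.PBSPredictor.doPrediction}} and differs in its details:

{code:borderStyle=solid}
import java.util.Arrays;
import java.util.Comparator;
import java.util.Random;

// Simplified, illustrative sketch of the PBS Monte Carlo analysis.
// Not the patch's actual code; see service.PBSPredictor.doPrediction.
public class PBSTrialSketch
{
    private static final Random RNG = new Random();

    // Estimate P(consistent read) for one (N, R, W) configuration by running
    // many independent trials against the logged latency samples.
    public static double predictConsistency(int n, int r, int w, long tAfterWriteMs,
                                            int trials, long[] wLat, long[] aLat,
                                            long[] rLat, long[] sLat)
    {
        int consistent = 0;
        for (int i = 0; i < trials; i++)
            if (trialIsConsistent(n, r, w, tAfterWriteMs, wLat, aLat, rLat, sLat))
                consistent++;
        return (double) consistent / trials;
    }

    private static boolean trialIsConsistent(int n, int r, int w, long tAfterWriteMs,
                                             long[] wLat, long[] aLat,
                                             long[] rLat, long[] sLat)
    {
        long[] writeApplied = new long[n]; // when each replica applies the write (W)
        long[] ackReceived = new long[n];  // when the coordinator sees each ack (W + A)
        for (int i = 0; i < n; i++)
        {
            writeApplied[i] = sample(wLat);
            ackReceived[i] = writeApplied[i] + sample(aLat);
        }

        // The write completes once the coordinator has w acknowledgments.
        long[] sortedAcks = ackReceived.clone();
        Arrays.sort(sortedAcks);
        long writeCompletes = sortedAcks[w - 1];

        // The read starts tAfterWriteMs after the write completes; each replica
        // serves it after R, and the coordinator hears back after R + S.
        long readStart = writeCompletes + tAfterWriteMs;
        final long[] readServed = new long[n];
        final long[] readReturned = new long[n];
        for (int i = 0; i < n; i++)
        {
            readServed[i] = readStart + sample(rLat);
            readReturned[i] = readServed[i] + sample(sLat);
        }

        // The coordinator uses the r fastest responses; the read is consistent
        // if any of them came from a replica that had already applied the write
        // when it served the read.
        Integer[] order = new Integer[n];
        for (int i = 0; i < n; i++)
            order[i] = i;
        Arrays.sort(order, new Comparator<Integer>()
        {
            public int compare(Integer a, Integer b)
            {
                return Long.compare(readReturned[a], readReturned[b]);
            }
        });
        for (int k = 0; k < r; k++)
        {
            int replica = order[k];
            if (writeApplied[replica] <= readServed[replica])
                return true;
        }
        return false;
    }

    // Draw one sample from an empirical latency distribution (milliseconds).
    private static long sample(long[] latenciesMs)
    {
        return latenciesMs[RNG.nextInt(latenciesMs.length)];
    }
}
{code}

For example, calling such a method with the logged samples and arguments 
(3, 1, 2, 100, trials) would estimate the N=3, R=1, W=2 row of the output above; 
per-trial read and write latencies (the time to collect r read responses or w 
acks) can be tallied similarly to produce the average and percentile figures.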

h4. Testing

We've modified the unit test for {{SerializationsTest}} and provided a new unit 
test for {{PBSPredictor}} ({{PBSPredictorTest}}). You can run the 
{{PBSPredictor}} test with {{ant pbs-test}}.

h4. Overhead

This patch introduces 8 bytes of overhead per message. We could reduce this 
overhead and add timestamps on-demand, but this would require changing 
{{net.MessageIn}} and {{net.MessageOut}} serialization at runtime, which is 
messy.

If enabled, consistency tracing requires {{32*logged_latencies}} bytes of 
memory on the node on which tracing is enabled.

h3. Caveats

The predictions are conservative (i.e., worst-case): we may predict more 
staleness than will actually occur, in the following ways:
 * We do not account for read repair. 
 * We do not account for Merkle tree exchange.
 * Multi-version staleness is particularly conservative.
 * We simulate non-local reads and writes. We assume that the coordinating 
Cassandra node is not itself a replica for a given key.

The predictions are optimistic in the following ways:
 * We do not predict the impact of node failure.
 * We do not model hinted handoff.

We can talk about how to improve these if you're interested. This is an area of 
active research.

> [Patch] Support consistency-latency prediction in nodetool
> ----------------------------------------------------------
>
>                 Key: CASSANDRA-4261
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-4261
>             Project: Cassandra
>          Issue Type: New Feature
>          Components: Tools
>    Affects Versions: 1.2
>            Reporter: Peter Bailis
>         Attachments: pbs-nodetool-v1.patch
