[ https://issues.apache.org/jira/browse/CASSANDRA-5932?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13782218#comment-13782218 ]

Aleksey Yeschenko commented on CASSANDRA-5932:
----------------------------------------------

First, let me thank you for your continued digging. Some of it helped. That 
said, you should probably look at the current cassandra-2.0 branch, and not the 
2.0.0 tarball/branches here in the comments.

bq. Issue 1 – When handling DigestMismatchException in 
StorageProxy.fetchRows(), all data read requests are sent out using sendRR 
without distinguishing remote nodes from the local node.

This is not an issue, and it is not related to speculative retry. Using LRR 
(LocalReadRunnable) for local read requests is merely an optimisation - there 
is nothing wrong with sendRR (not that it isn't worth optimising here - just 
noting that it's not an issue). This is also the answer to "How do we handle 
the case for local node? Does the sendRR() and the corresponding receive part 
can handle the case for local node? If not, then this may block for 10 
seconds." The same goes for Issue 2 and Issue 3.
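To illustrate the optimisation being discussed, here is a hedged sketch (class and method names are invented for illustration, not Cassandra's actual API): a read addressed to the local node can be executed in-process, skipping the serialization and network round trip that sendRR implies.

```java
// Hypothetical sketch of the LRR-vs-sendRR dispatch decision; the class
// and method names here are invented and are not Cassandra's real code.
public class LocalReadDispatch
{
    static final String LOCAL_ADDRESS = "127.0.0.1";

    // Decide how a read destined for 'endpoint' would be dispatched:
    // reads for the local node can run in-process (analogous to
    // LocalReadRunnable); everything else goes through messaging
    // (analogous to MessagingService.sendRR).
    static String dispatch(String endpoint)
    {
        if (endpoint.equals(LOCAL_ADDRESS))
            return "local-read-runnable"; // no serialization, no network hop
        return "sendRR";                  // remote round trip via messaging
    }

    public static void main(String[] args)
    {
        System.out.println(dispatch("127.0.0.1")); // local-read-runnable
        System.out.println(dispatch("10.0.0.2"));  // sendRR
    }
}
```

Either path is correct; the local short-circuit only saves the messaging overhead, which is why using sendRR for the local node is a missed optimisation rather than a bug.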

bq. The data read request for local node may never sent out. As one of the 
nodes is down (which triggered the Speculative Retry) will cause one missing 
response.

The former is not true, and the latter won't happen, since the current 
cassandra-2.0 code sends the data read requests to all of the contacted 
replicas. So if a node triggered spec retry, the extra speculated replica 
will get the request as well, and we can still satisfy the CL.

{noformat}
for (InetAddress endpoint : exec.getContactedReplicas())
{
    Tracing.trace("Enqueuing full data read to {}", endpoint);
    MessagingService.instance().sendRR(message, endpoint, repairHandler);
}
{noformat}
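To make the counting argument concrete, here is a small hypothetical sketch (names invented, not Cassandra code) of why contacting one extra replica under speculative retry tolerates one dead node without missing the CL:

```java
// Hypothetical sketch: blockFor is how many responses the consistency
// level requires; speculative retry contacts one extra replica, so a
// single dead replica still leaves enough responses to satisfy the CL.
public class SpecRetryQuorum
{
    // True if the surviving contacted replicas can still satisfy the CL.
    static boolean clSatisfied(int blockFor, int contacted, int deadNodes)
    {
        return contacted - deadNodes >= blockFor;
    }

    public static void main(String[] args)
    {
        int rf = 3;
        int blockFor = rf / 2 + 1;    // CL.QUORUM with RF=3 -> 2
        int contacted = blockFor + 1; // speculative retry adds one replica
        System.out.println(clSatisfied(blockFor, contacted, 1)); // true
        // Without the extra replica, one dead node would miss the CL:
        System.out.println(clSatisfied(blockFor, blockFor, 1));  // false
    }
}
```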


bq. Question for the Randomized approach – Since the end points are randomized, 
the first node in the list is no likely the local node. This may cause a higher 
possibility of data repair.

I don't see how the possibility of data repair is correlated with the locality 
of the target node, but it doesn't matter: the 'randomised approach' was an 
experiment, and it wasn't committed as part of the fix. See the latest 
cassandra-2.0 branch code.

bq. In the Randomized Approach, the end points are reshuffled. Then, the first 
node in the list used for data read request is not likely the local node. If 
this node happens to be the DOWN node, then, we end with all digest responses 
without the data, which will block and eventually timed out.

See the above reply.

TLDR: None of these seem to be issues, but we could optimise RR to use LRR for 
local reads, to get slightly better performance for local requests (and to be 
consistent with the regular reads code path).

> Speculative read performance data show unexpected results
> ---------------------------------------------------------
>
>                 Key: CASSANDRA-5932
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-5932
>             Project: Cassandra
>          Issue Type: Bug
>            Reporter: Ryan McGuire
>            Assignee: Aleksey Yeschenko
>             Fix For: 2.0.2
>
>         Attachments: 5932-6692c50412ef7d.png, 5932.ded39c7e1c2fa.logs.tar.gz, 
> 5932.txt, 5933-128_and_200rc1.png, 5933-7a87fc11.png, 5933-logs.tar.gz, 
> 5933-randomized-dsnitch-replica.2.png, 5933-randomized-dsnitch-replica.3.png, 
> 5933-randomized-dsnitch-replica.png, compaction-makes-slow.png, 
> compaction-makes-slow-stats.png, eager-read-looks-promising.png, 
> eager-read-looks-promising-stats.png, eager-read-not-consistent.png, 
> eager-read-not-consistent-stats.png, node-down-increase-performance.png
>
>
> I've done a series of stress tests with eager retries enabled that show 
> undesirable behavior. I'm grouping these behaviours into one ticket as they 
> are most likely related.
> 1) Killing off a node in a 4 node cluster actually increases performance.
> 2) Compactions make nodes slow, even after the compaction is done.
> 3) Eager Reads tend to lessen the *immediate* performance impact of a node 
> going down, but not consistently.
> My Environment:
> 1 stress machine: node0
> 4 C* nodes: node4, node5, node6, node7
> My script:
> node0 writes some data: stress -d node4 -F 30000000 -n 30000000 -i 5 -l 2 -K 20
> node0 reads some data: stress -d node4 -n 30000000 -o read -i 5 -K 20
> h3. Examples:
> h5. A node going down increases performance:
> !node-down-increase-performance.png!
> [Data for this test here|http://ryanmcguire.info/ds/graph/graph.html?stats=stats.eager_retry.node_killed.just_20.json&metric=interval_op_rate&operation=stress-read&smoothing=1]
> At 450s, I kill -9 one of the nodes. There is a brief decrease in performance 
> as the snitch adapts, but then it recovers... to even higher performance than 
> before.
> h5. Compactions make nodes permanently slow:
> !compaction-makes-slow.png!
> !compaction-makes-slow-stats.png!
> The green and orange lines represent trials with eager retry enabled, they 
> never recover their op-rate from before the compaction as the red and blue 
> lines do.
> [Data for this test here|http://ryanmcguire.info/ds/graph/graph.html?stats=stats.eager_retry.compaction.2.json&metric=interval_op_rate&operation=stress-read&smoothing=1]
> h5. Speculative Read tends to lessen the *immediate* impact:
> !eager-read-looks-promising.png!
> !eager-read-looks-promising-stats.png!
> This graph looked the most promising to me, the two trials with eager retry, 
> the green and orange line, at 450s showed the smallest dip in performance. 
> [Data for this test here|http://ryanmcguire.info/ds/graph/graph.html?stats=stats.eager_retry.node_killed.json&metric=interval_op_rate&operation=stress-read&smoothing=1]
> h5. But not always:
> !eager-read-not-consistent.png!
> !eager-read-not-consistent-stats.png!
> This is a retrial with the same settings as above, yet the 95percentile eager 
> retry (red line) did poorly this time at 450s.
> [Data for this test here|http://ryanmcguire.info/ds/graph/graph.html?stats=stats.eager_retry.node_killed.just_20.rc1.try2.json&metric=interval_op_rate&operation=stress-read&smoothing=1]



--
This message was sent by Atlassian JIRA
(v6.1#6144)
