[jira] [Commented] (CASSANDRA-14503) Internode connection management is race-prone

Vinay Chella (JIRA) Tue, 06 Nov 2018 12:52:52 -0800


    [ 
https://issues.apache.org/jira/browse/CASSANDRA-14503?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16677289#comment-16677289
 ]


Vinay Chella commented on CASSANDRA-14503:
------------------------------------------

Thank you [~jasobrown] for the patch.

[~jolynch] and I benchmarked Jason's 14503-v2 branch. Our results show that 
trunk ([Jason's 
branch|https://github.com/jasobrown/cassandra/tree/14503-v2]) significantly 
outperforms 3.0.17 on mean, 95th, and 99th percentile latency during a pure 
write benchmark. Under heavy load, coordinator mean latencies are ~14x better, 
99th percentile latencies are ~4x better, and 95th percentile latencies are 
~6x better on trunk.

When 67k write QPS was applied to both trunk and 3.0.17, throughput held 
steady on trunk while 3.0.17 fell over. Note that we have only tested writes 
in this benchmark. However, compared to 3.0.17, trunk is accumulating more 
hints and dropping messages; we have yet to troubleshoot these issues. For a 
detailed analysis of this benchmarking, see the attached document [Cassandra 
4.0 testing with CASSANDRA-14503 fixes].

> Internode connection management is race-prone
> ---------------------------------------------
>
>                 Key: CASSANDRA-14503
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-14503
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Streaming and Messaging
>            Reporter: Sergio Bossa
>            Assignee: Jason Brown
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 4.0
>
>          Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> Following CASSANDRA-8457, internode connection management has been rewritten 
> to rely on Netty, but the new implementation in 
> {{OutboundMessagingConnection}} seems quite race-prone to me, in particular 
> in these two cases:
> * {{#finishHandshake()}} racing with {{#close()}}: in this case the 
> former could run into an NPE if the latter nulls the {{channelWriter}} (but 
> this is just an example; other conflicts might happen).
> * Connection timeout and retry racing with state changing methods: 
> {{connectionRetryFuture}} and {{connectionTimeoutFuture}} are cancelled when 
> handshaking or closing, but there's no guarantee they will actually be 
> cancelled (as they might already be running), so they might end up changing 
> the connection state concurrently with other methods (e.g. by unexpectedly 
> closing the channel or clearing the backlog).
> Overall, the thread safety of {{OutboundMessagingConnection}} is very 
> difficult to assess given the current implementation: I would suggest 
> refactoring it into a single-threaded model, where all connection 
> state-changing actions are enqueued on a single-threaded scheduler, so that 
> state transitions can be clearly defined and checked.
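
The single-threaded model suggested in the description above can be sketched 
roughly as follows. This is a minimal illustration only, not the code in the 
14503-v2 branch or in Cassandra: the class SerializedConnection, its State 
enum, and the String stand-in for the channel writer are all hypothetical 
names introduced here. The only real APIs used are java.util.concurrent 
executors and CompletableFuture. The point it tries to show is that when 
every state change runs on one thread, a handshake completion and a close 
cannot interleave, so the handshake can simply check whether close already 
happened.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

/** Hypothetical sketch; not Cassandra's OutboundMessagingConnection. */
public class SerializedConnection {
    public enum State { NOT_READY, READY, CLOSED }

    // Every state-changing action is enqueued on this single thread, so
    // actions run one at a time, in submission order. The thread is a daemon
    // so that it does not keep the JVM alive on its own.
    private final ExecutorService stateExecutor =
        Executors.newSingleThreadExecutor(r -> {
            Thread t = new Thread(r, "connection-state");
            t.setDaemon(true);
            return t;
        });

    // Read and written only from stateExecutor, so no locking is needed.
    private State state = State.NOT_READY;
    private String channelWriter; // assumption: stand-in for the real writer

    /** Enqueue the handshake completion; it is ignored if already closed. */
    public CompletableFuture<State> finishHandshake(String writer) {
        return CompletableFuture.supplyAsync(() -> {
            if (state == State.CLOSED) {
                return state; // close() ran first: do not touch the writer
            }
            channelWriter = writer;
            state = State.READY;
            return state;
        }, stateExecutor);
    }

    /** Enqueue the close; it cannot interleave with a running handshake. */
    public CompletableFuture<State> close() {
        return CompletableFuture.supplyAsync(() -> {
            channelWriter = null;
            state = State.CLOSED;
            return state;
        }, stateExecutor);
    }

    /** Stop accepting new actions; already queued actions still run. */
    public void shutdown() {
        stateExecutor.shutdown();
    }
}
```

In this sketch, calling close() and then finishHandshake("writer") leaves the 
connection CLOSED, because the handshake action observes the CLOSED state 
instead of racing with the nulling of the writer. Timeout and retry tasks 
could be enqueued on the same executor under the same assumption, so a task 
that could not be cancelled in time would still see the current state before 
acting.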



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org
