[ https://issues.apache.org/jira/browse/CASSANDRA-14503?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16662265#comment-16662265 ]
Jason Brown commented on CASSANDRA-14503:
-----------------------------------------

Based on testing conducted with [~jolynch] and [~vinaykumarcse], here's an updated branch with performance fixes and code improvements:

||v2||
|[branch|https://github.com/jasobrown/cassandra/tree/14503-v2]|
|[utests & dtests|https://circleci.com/gh/jasobrown/workflows/cassandra/tree/14503-v2]|
|[PR|https://github.com/apache/cassandra/pull/289]|

The major change in this branch is that I experimented with aggregating the messages to send into a single ByteBuf, instead of sending the messages individually to the netty pipeline. Since we can send up to (the current hard-coded size of) 64 messages per iteration of OMC.dequeueMessages(), that's 64 invocations of the pipeline mechanics, 64 ByteBuf allocations (and releases), and 64 fulfillments of the promise corresponding to each message. If, instead, we send one ByteBuf (with all the data serialized into it), it's just one message into the pipeline, one allocation, and one promise fulfillment. The primary trade-off is that the single buffer will, of course, be large; perhaps large enough to not be efficient with the netty allocator. To that end, I wrote a JMH benchmark, and the results are compelling: TL;DR a single buffer is significantly faster than multiple smaller buffers. In the closest case the single buffer is twice as fast, with the typical difference across percentiles being about 10-20 times faster for the single buffer (1.5 micros vs. 23 micros).

To make this work, I needed the allocation and serialization code to be moved outside of the pipeline handler (as it now needs to be invoked from OMC). I had already done this work with CASSANDRA-13630, so I pulled that patch into this branch. That patch also greatly reduced the need for the ChannelWriter abstraction, and combined with the outstanding work in this branch, I was able to eliminate ChannelWriter and the confusion it added.
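To illustrate the aggregation idea, here's a rough sketch (not the actual patch; the class and method names are hypothetical, and plain java.nio is used in place of netty's ByteBuf) of draining the backlog into one length-prefixed buffer so the channel sees a single write instead of up to 64:

```java
import java.nio.ByteBuffer;
import java.util.ArrayDeque;
import java.util.Queue;

// Hypothetical sketch: drain up to MAX_DEQUEUE messages and serialize them
// into one buffer, so the pipeline sees one write instead of 64.
public class AggregatingDequeue
{
    static final int MAX_DEQUEUE = 64; // mirrors the hard-coded limit in OMC.dequeueMessages()

    // Serialize each queued payload (length-prefixed) into a single ByteBuffer.
    static ByteBuffer drainToSingleBuffer(Queue<byte[]> backlog)
    {
        int count = 0, size = 0;
        ArrayDeque<byte[]> drained = new ArrayDeque<>();
        byte[] msg;
        while (count < MAX_DEQUEUE && (msg = backlog.poll()) != null)
        {
            drained.add(msg);
            size += 4 + msg.length; // 4-byte length prefix per message
            count++;
        }
        ByteBuffer buf = ByteBuffer.allocate(size);
        for (byte[] m : drained)
        {
            buf.putInt(m.length);
            buf.put(m);
        }
        buf.flip();
        return buf; // one allocation, one downstream write, one promise to fulfill
    }
}
```

The real patch serializes into a pooled ByteBuf sized from the messages' serialized sizes, but the shape of the loop is the same: one pass to drain and measure, one buffer, one write.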
However, I still need to handle large messages separately (as we don't want to run our blocking serializers on the event loop), so I've preserved the "move large message serialization onto a separate thread" behavior from CASSANDRA-13630 by adding a new abstraction to OMC: the (not cleverly named) MessageDequeuer interface, with implementations for large messages and "small messages" (basically the current behavior of the patch that we've been riffing on).

One feature that we've been debating again is whether to include the message coalescing feature. The current branch does not include it, mostly because we've been iterating quite quickly over this code, and I broke it when incorporating the CASSANDRA-13630 patch (and killing off ChannelWriter). There is some testing happening to reevaluate the efficacy of message coalescing with the netty internode messaging.

Some other points of interest:
- switched OMC#backlog from ConcurrentLinkedQueue to MpscLinkedQueue from jctools. MpscLinkedQueue is dramatically better, and ConcurrentLinkedQueue#isEmpty was a CPU drain
- improved scheduling of the consumerTask in OutboundMessagingConnection, though it still needs a bit more refinement
- ditched the OMC.State from the last branch
- added [~jolynch]'s fixes wrt not setting a default SO_SNDBUF value
- OMC: introduced a consumerTaskThread member field vs. the eventLoop member field
- ditched the auto-read in RebufferingByteBufDataInputPlus; I still need to document this

In general I have a small bit of documentation to add, but the branch is ready for review.
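A minimal sketch of the small/large split described above (interface and class names beyond MessageDequeuer are hypothetical; this is the shape of the abstraction, not the actual code):

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Hypothetical sketch of the MessageDequeuer split: small messages are
// serialized inline on the calling (event loop) thread, while large messages
// are handed to a dedicated thread so blocking serializers never run on the loop.
interface MessageDequeuer
{
    void dequeueAndProcess(Runnable serializeAndWrite);
}

class SmallMessageDequeuer implements MessageDequeuer
{
    public void dequeueAndProcess(Runnable serializeAndWrite)
    {
        serializeAndWrite.run(); // cheap: run directly on the event loop
    }
}

class LargeMessageDequeuer implements MessageDequeuer
{
    private final ExecutorService serializationThread = Executors.newSingleThreadExecutor();

    public void dequeueAndProcess(Runnable serializeAndWrite)
    {
        serializationThread.submit(serializeAndWrite); // blocking serialization off the event loop
    }
}
```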
> Internode connection management is race-prone
> ---------------------------------------------
>
>                 Key: CASSANDRA-14503
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-14503
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Streaming and Messaging
>            Reporter: Sergio Bossa
>            Assignee: Jason Brown
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 4.0
>
>          Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> Following CASSANDRA-8457, internode connection management has been rewritten to rely on Netty, but the new implementation in {{OutboundMessagingConnection}} seems quite race prone to me, in particular in these two cases:
> * {{#finishHandshake()}} racing with {{#close()}}: in such a case the former could run into an NPE if the latter nulls the {{channelWriter}} (but this is just an example; other conflicts might happen).
> * Connection timeout and retry racing with state-changing methods: {{connectionRetryFuture}} and {{connectionTimeoutFuture}} are cancelled when handshaking or closing, but there's no guarantee those will actually be cancelled (as they might already be running), so they might end up changing the connection state concurrently with other methods (i.e. by unexpectedly closing the channel or clearing the backlog).
> Overall, the thread safety of {{OutboundMessagingConnection}} is very difficult to assess given the current implementation: I would suggest refactoring it into a single-threaded model, where all connection state-changing actions are enqueued on a single-threaded scheduler, so that state transitions can be clearly defined and checked.
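The single-threaded model suggested in the issue description could be sketched roughly as follows (illustrative only; the class, states, and method bodies are hypothetical, not the actual patch): every state-changing action is enqueued on one executor, so finishHandshake() and close() can never interleave.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Illustrative sketch of a single-threaded connection state machine:
// all state transitions run on one executor, so they cannot race.
class ConnectionStateMachine
{
    enum State { CREATING_CHANNEL, READY, CLOSED }

    private final ExecutorService stateExecutor = Executors.newSingleThreadExecutor();
    private State state = State.CREATING_CHANNEL; // only ever touched on stateExecutor

    CompletableFuture<State> finishHandshake()
    {
        return CompletableFuture.supplyAsync(() -> {
            if (state == State.CLOSED)  // a racing close() already won; nothing to do
                return state;
            state = State.READY;
            return state;
        }, stateExecutor);
    }

    CompletableFuture<State> close()
    {
        return CompletableFuture.supplyAsync(() -> {
            state = State.CLOSED;
            return state;
        }, stateExecutor);
    }
}
```

Because both methods serialize through the same executor, the NPE scenario from the description (close() nulling a field under finishHandshake()'s feet) cannot occur: whichever action runs first establishes the state the other observes.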