[jira] [Commented] (CASSANDRA-8457) nio MessagingService
[ https://issues.apache.org/jira/browse/CASSANDRA-8457?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16153528#comment-16153528 ] Jason Brown commented on CASSANDRA-8457: [~spo...@gmail.com] ahh! you are correct. ninja-fixed as sha {{e38015ed1fcf7d95694683601b121c918c5f6976}} > nio MessagingService > > > Key: CASSANDRA-8457 > URL: https://issues.apache.org/jira/browse/CASSANDRA-8457 > Project: Cassandra > Issue Type: New Feature >Reporter: Jonathan Ellis >Assignee: Jason Brown >Priority: Minor > Labels: netty, performance > Fix For: 4.0 > > Attachments: 8457-load.tgz > > > Thread-per-peer (actually two each incoming and outbound) is a big > contributor to context switching, especially for larger clusters. Let's look > at switching to nio, possibly via Netty. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
[jira] [Commented] (CASSANDRA-8457) nio MessagingService
[ https://issues.apache.org/jira/browse/CASSANDRA-8457?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16152574#comment-16152574 ] Stefan Podkowinski commented on CASSANDRA-8457: --- [~jasobrown], did you forget to bump netty to 4.1.14 in build.xml?
[jira] [Commented] (CASSANDRA-8457) nio MessagingService
[ https://issues.apache.org/jira/browse/CASSANDRA-8457?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16136520#comment-16136520 ] Sylvain Lebresne commented on CASSANDRA-8457: - Yes, I'm +1 on the patch on this specific ticket, and great job on that btw! But I do want to clarify that imo CASSANDRA-13630 should kind of be finished as well before calling this complete/committable (I've already mentioned that to Jason offline and we seem to be on the same page, just mentioning it for the record).
[jira] [Commented] (CASSANDRA-8457) nio MessagingService
[ https://issues.apache.org/jira/browse/CASSANDRA-8457?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16135790#comment-16135790 ] sankalp kohli commented on CASSANDRA-8457: -- [~slebresne] I know you and Jason chatted on IRC. Are you +1 on this? The streaming patch has a +1 from Ariel.
[jira] [Commented] (CASSANDRA-8457) nio MessagingService
[ https://issues.apache.org/jira/browse/CASSANDRA-8457?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16102128#comment-16102128 ] Jason Brown commented on CASSANDRA-8457: dtests running locally, including upgrade and no_vnodes tests, are passing (with the exception of those not working on trunk, and one other that is related to CASSANDRA-12229). [~slebresne] and I discussed offline, and he agreed the {{OutboundConnectionParams}} taking a {{Consumer}} was reasonable. Hence, that was implemented. The current branch also has a minor fix for reconnecting when a connection is rejected due to protocol version mismatch (discovered via the dtest upgrade tests).
[jira] [Commented] (CASSANDRA-8457) nio MessagingService
[ https://issues.apache.org/jira/browse/CASSANDRA-8457?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16062157#comment-16062157 ] Jason Brown commented on CASSANDRA-8457: bq. Nit: there have been a few Netty minor releases since the one in the branch, maybe worth upgrading (since we change it anyway). There were a few unstable netty releases, so I skipped those. 4.1.12 is good, but I need a fix in {{Lz4FrameEncoder}} which should be in 4.1.13. bq. MessageOutHandler.AUTO_FLUSH_THRESHOLD feels like a magic constant. At least a comment on why it's a reasonable value would be nice. I've added a comment to the constant, but the value (16k) was the one appropriate for the {{Lz4FrameEncoder}}. The value in the [existing code|https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/net/OutboundTcpConnection.java#L85] is 64k, so I've reinstated that. Note that the previous {{OTC.BUFFER_SIZE}} can be seen as a flush threshold as well, since it is a [parameter to the output stream|https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/net/OutboundTcpConnection.java#L471]. Also discovered that {{OTC.BUFFER_SIZE}} is configurable, and made the new version configurable as well. bq. OutboundHandshakeHandler.WRITE_IDLE_MS: what's the rationale behind 10 seconds? Not saying it feels wrong per se, but wondering if there is more than a gut feeling to it, and I'd kind of suggest exposing a system property to change that default. I believe 10 seconds was a value [~aweisberg] and I came up with, but I've made it configurable now. bq. There are still a few TODO:JEB in the code: would be good to resolve/clear them if we're going to a more final branch. Done; most of them I had already done on the CASSANDRA-12229 branch, but they are now backported to this one. bq. Largely a nit, but it's a tad confusing that OutboundConnectionParams contains an OutboundMessagingConnection. 
It feels like saying "The parameters you need for an outbound connection are ... an outbound connection". Maybe a simple renaming would make this clearer, though it feels like this could be cleaned up a bit further. Hmm, yeah. I need the {{OutboundMessagingConnection}} for {{ChannelWriter#handleMessageFuture()}} so it can react to the outcome of attempting to send the message. wrt a cleanup, the least-worst idea I have is to move the error handling functionality of {{ChannelWriter#handleMessageFuture()}} to {{OutboundMessagingConnection}}, as that's the only thing in {{ChannelWriter}} that requires {{OutboundMessagingConnection}}. In {{OutboundConnectionParams}} we can pass a reference to an "error consumer" (not fully thought out, but something like a {{BiConsumer}} that's a function on {{OutboundMessagingConnection}}), and {{ChannelWriter#handleMessageFuture()}} can invoke that instead of holding a reference to an {{OutboundMessagingConnection}}. wdyt? bq. The handling of "backlogged" messages and channel writability feels a bit complex. For instance, it looks like MessageOutHandler.channelWritabilityChanged can potentially silently drop a message every time it's called. Good catch, fixed. bq. The CI links listed with the branch a bunch of comments ago are completely out of date. Updated the ticket with a pointer to the circleci utests. bq. Some basic benchmark results wouldn't hurt either ... Attaching to this ticket, as it's a non-trivial matrix to test all the combinations of compression, coalescing, jdk-based TLS, and openssl-based TLS. bq. Maybe you could create a parent ticket of which this and CASSANDRA-12229 would be sub-tasks where we could focus the testing/benchmarking/final discussions on the whole thing? 
done as CASSANDRA-13628
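The "error consumer" idea discussed above can be sketched in plain Java. This is a hypothetical model of the decoupling, not the committed code: the class and method names ({{OutboundConnectionParams}}, {{handleMessageFuture}}-style handling, the {{BiConsumer}} signature) are assumptions drawn from the discussion.

```java
import java.util.function.BiConsumer;

// Hypothetical sketch: instead of handing ChannelWriter a reference to the
// whole OutboundMessagingConnection, the params carry only an error callback
// that the connection registers about itself.
class QueuedMessage {
    final String payload;
    QueuedMessage(String payload) { this.payload = payload; }
}

class OutboundConnectionParams {
    // (message, cause) -> how to react to a failed send
    final BiConsumer<QueuedMessage, Throwable> errorConsumer;
    OutboundConnectionParams(BiConsumer<QueuedMessage, Throwable> errorConsumer) {
        this.errorConsumer = errorConsumer;
    }
}

class ChannelWriter {
    private final BiConsumer<QueuedMessage, Throwable> errorConsumer;
    ChannelWriter(OutboundConnectionParams params) { this.errorConsumer = params.errorConsumer; }

    // analogous to reacting to a message future: on failure, invoke the
    // callback rather than calling back into a connection reference
    void handleSendResult(QueuedMessage msg, Throwable failure) {
        if (failure != null)
            errorConsumer.accept(msg, failure);
    }
}

class OutboundMessagingConnection {
    int errors = 0;
    OutboundConnectionParams params() {
        // the connection exposes only its error-handling function
        return new OutboundConnectionParams(this::onSendError);
    }
    private void onSendError(QueuedMessage msg, Throwable cause) { errors++; }
}
```

With this shape, {{ChannelWriter}} never sees the connection type at all, which addresses the "params contain a connection" circularity the review pointed out.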
[jira] [Commented] (CASSANDRA-8457) nio MessagingService
[ https://issues.apache.org/jira/browse/CASSANDRA-8457?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16028159#comment-16028159 ] Sylvain Lebresne commented on CASSANDRA-8457: - Had a quick look at the last changes. A few remaining questions/remarks: * There are still a few {{TODO:JEB}} in the code: would be good to resolve/clear them if we're going to a more final branch. * MessageOutHandler.AUTO_FLUSH_THRESHOLD feels like a magic constant. At least a comment on why it's a reasonable value would be nice. * Largely a nit, but it's a tad confusing that {{OutboundConnectionParams}} contains an {{OutboundMessagingConnection}}. It feels like saying "The parameters you need for an outbound connection are ... an outbound connection". Maybe a simple renaming would make this clearer, though it feels like this could be cleaned up a bit further. * {{OutboundHandshakeHandler.WRITE_IDLE_MS}}: what's the rationale behind 10 seconds? Not saying it feels wrong per se, but wondering if there is more than a gut feeling to it, and I'd kind of suggest exposing a system property to change that default. * The handling of "backlogged" messages and channel writability feels a bit complex. For instance, it looks like {{MessageOutHandler.channelWritabilityChanged}} can potentially silently drop a message every time it's called, and I'll admit not being sure if this can have unintended consequences in overload situations. In general, it would be really great to have some testing of this patch under a fair amount of load (and overload) to make sure things work as intended. * Nit: there have been a few Netty minor releases since the one in the branch, maybe worth upgrading (since we change it anyway). Other than that, and related to one comment above, it would be nice to see some tests run for this (including upgrade tests, as this is particularly sensitive to any MessagingService changes). The CI links listed with the branch a bunch of comments ago are completely out of date. 
Some basic benchmark results wouldn't hurt either, since this completely changes a fairly sensitive part of the code. Of course, it's not that meaningful without CASSANDRA-12229, which makes this a bit awkward. Maybe you could create a parent ticket of which this and CASSANDRA-12229 would be sub-tasks, where we could focus the testing/benchmarking/final discussions on the whole thing?
[jira] [Commented] (CASSANDRA-8457) nio MessagingService
[ https://issues.apache.org/jira/browse/CASSANDRA-8457?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15996536#comment-15996536 ] Jason Brown commented on CASSANDRA-8457: latest changes: - add netty's {{IdleStateHandler}} to the outbound pipeline. This allows us to monitor for a socket not being able to make progress sending bytes (or, at least, putting them into the socket buffer). When we don't make progress, we can close the channel and purge the backlog of messages. - make better use of netty's high/low water marks. Instead of naively using them as a flush indicator (the flush is now, properly, at the end of {{MessageOutHandler#write()}}), I'm using the high/low water marks to decide when to stop writing to the channel ({{OutboundMessagingConnection#sendMessage()}}). I then add those messages to the backlog queue in {{OutboundMessagingConnection}}. When the channel becomes writable again (because bytes were shipped out), we can purge some of the backlogged messages from {{MessageOutHandler#channelWritabilityChanged()}}. - add support for CASSANDRA-9748 (somehow got missed before) [~slebresne] I'm adding a few more unit tests and minor doc updates, but please review the latest set of changes.
[jira] [Commented] (CASSANDRA-8457) nio MessagingService
[ https://issues.apache.org/jira/browse/CASSANDRA-8457?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15968472#comment-15968472 ] Jason Brown commented on CASSANDRA-8457: So, [~aweisberg] and I spent some time talking offline about expiring messages on the outbound side, and came up with the following: 1. run a periodic, scheduled task in each channel that checks to make sure the channel is making progress wrt sending bytes. If we fail to see any progress being made after some number of seconds, we should close the connection/socket and throw away the messages. 2. repurpose the high/low water mark (and arguably use it more correctly) to indicate when we should stop writing messages to the channel (at the {{ChannelWriter}} layer). Currently, I'm just using the water mark to indicate when we should flush, but a simple check elsewhere would accomplish the same thing. Instead, the water marks should indicate when we really shouldn't write to the channel anymore, and either queue up those messages in something like {{OutboundMessagingConnection#backlog}} or perhaps drop them (I'd prefer to queue). 3. When we've exceeded the high water mark, we can disable reading incoming messages from the same peer (achievable by disabling auto read for the channel). This would prevent the current node from executing more work on behalf of a peer to which we cannot send any data. Then, when the channel drops below the low water mark (and the channel is 'writable'), we re-enable netty auto read on the read channels for the peer. 1 and 2 are reasonably easy to do (and I'll do them asap), but I'd prefer to defer 3 until later, as it has a lot of races and other complexities/subtleties I'd like to put off for the scope of this ticket (especially as sockets are not bidirectional yet). Thoughts? Note: items 1 & 2 are significantly simpler than my earlier comments wrt message expiration, so please disregard them for now. 
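The progress check in item 1 above can be modeled without Netty. This is a hypothetical sketch of the bookkeeping only (the actual patch uses Netty's {{IdleStateHandler}} and event loop scheduling rather than a hand-rolled class; all names here are invented):

```java
import java.util.ArrayDeque;
import java.util.Queue;

// Hypothetical model of item 1: a periodic task compares the number of bytes
// flushed since its last run; if no progress was made while messages are
// still pending, the connection is closed and the backlog purged.
class ProgressMonitor {
    private final Queue<byte[]> backlog = new ArrayDeque<>();
    private long bytesFlushed = 0;
    private long bytesFlushedAtLastCheck = 0;
    private boolean open = true;

    void enqueue(byte[] msg) { if (open) backlog.add(msg); }

    // called when bytes actually make it into the socket buffer
    void onBytesFlushed(int count) { bytesFlushed += count; }

    // the body of the scheduled task; returns true if the connection survives
    boolean checkProgress() {
        boolean stalled = !backlog.isEmpty() && bytesFlushed == bytesFlushedAtLastCheck;
        bytesFlushedAtLastCheck = bytesFlushed;
        if (stalled) {
            open = false;
            backlog.clear();   // throw away the undeliverable messages
        }
        return open;
    }

    boolean isOpen() { return open; }
    int backlogSize() { return backlog.size(); }
}
```

The point of comparing a counter snapshot, rather than checking emptiness, is that a slow-but-moving channel keeps making progress and is never killed; only a fully stalled socket trips the check.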
[jira] [Commented] (CASSANDRA-8457) nio MessagingService
[ https://issues.apache.org/jira/browse/CASSANDRA-8457?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15967621#comment-15967621 ] Jason Brown commented on CASSANDRA-8457: bq. doesn't Netty already have a standard way to deal with that problem of messages piling up in its queues? Netty has a high/low water mark mechanism that looks at the number of bytes in the channel and sends a "writability changed" event through the channel once one of those thresholds has been crossed. I'm currently using that feature in {{MessageOutHandler#channelWritabilityChanged()}} to know when we've hit a decent amount of buffered data before we explicitly call flush. Beyond this, I do not think netty has any other explicit back pressure mechanism built in (like a handler or something similar). We could expand our use of the high/low water mark to say "if there's greater than some number of bytes in the channel, drop 'some' messages". If we want to drop older messages for which we feel the client has (or reasonably will have) timed out, we'll have to do something like what I've proposed in my most recent comments (and which I'm working on right now). This message expiration behavior can then become not only about timeouts (I still think there's reasonable use in that), but also about protecting the size of data in the channel. One other thing we can do is bound the number of tasks that can be queued into the channel ([{{SingleThreadEventLoop#DEFAULT_MAX_PENDING_TASKS}}|https://github.com/netty/netty/blob/4.1/transport/src/main/java/io/netty/channel/SingleThreadEventLoop.java#L35]). I quickly traced the netty code, and I *think* a {{RejectedExecutionException}} is thrown when you try to add a message to a channel which is filled to its capacity. 
I'm not sure we want to use this as the only backpressure mechanism, but, as unbounded queues are awful (the default queue size is {{Integer.MAX_VALUE}}), it might not be a bad idea to bound this to at least something sane (16k-32k as an upper bound can't be unreasonable for a single channel). This, of course, would require us to be more resilient to dropped messages on the enqueuing side, which is probably a good idea anyway.
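The water-mark gating described in this thread (stop writing past the high mark, backlog the overflow, drain once the channel dips below the low mark) can be modeled in plain Java. The thresholds and names below are illustrative assumptions; the real code relies on Netty's channel writability events rather than this standalone class:

```java
import java.util.ArrayDeque;
import java.util.Queue;

// Hypothetical model of the high/low water mark scheme: past the high mark
// the channel reports "not writable" and new messages go to a backlog; once
// pending bytes drain below the low mark, the backlog is purged into the
// channel instead of being dropped.
class WaterMarkedChannel {
    static final int HIGH_WATER_MARK = 64 * 1024;  // illustrative values
    static final int LOW_WATER_MARK  = 32 * 1024;

    private final Queue<byte[]> backlog = new ArrayDeque<>();
    private long pendingBytes = 0;
    private boolean writable = true;

    void send(byte[] msg) {
        if (!writable) { backlog.add(msg); return; }  // over the high mark: queue, don't write
        write(msg);
    }

    private void write(byte[] msg) {
        pendingBytes += msg.length;
        if (pendingBytes > HIGH_WATER_MARK)
            writable = false;
    }

    // called as the socket drains; mirrors a writability-changed callback
    void onBytesDrained(long count) {
        pendingBytes = Math.max(0, pendingBytes - count);
        if (!writable && pendingBytes < LOW_WATER_MARK) {
            writable = true;
            while (writable && !backlog.isEmpty())  // purge backlog without dropping
                write(backlog.poll());
        }
    }

    boolean isWritable() { return writable; }
    int backlogSize() { return backlog.size(); }
}
```

The two-threshold hysteresis is the key design point: a single threshold would flap between writable and not-writable on every message, while the gap between the marks lets a burst drain before writes resume.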
[jira] [Commented] (CASSANDRA-8457) nio MessagingService
[ https://issues.apache.org/jira/browse/CASSANDRA-8457?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15967586#comment-15967586 ] Jason Brown commented on CASSANDRA-8457: bq. doesn't Netty already have a standard way to deal with that problem of messages piling up in its queues? Netty has a high/low water mark mechanism that looks at the number of bytes in the channel and sends a "writability changed" event through the channel once one of those thresholds has been crossed. I'm currently using that feature to know when we've hit a decent amount of buffered data before we explicitly call flush. We could expand this to say "if there's greater than some number of bytes in the channel, drop older messages"
[jira] [Commented] (CASSANDRA-8457) nio MessagingService
[ https://issues.apache.org/jira/browse/CASSANDRA-8457?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15967568#comment-15967568 ] Sylvain Lebresne commented on CASSANDRA-8457: - bq. I think it's important that a single slow node or network issue resulting in a socket that isn't writable shouldn't allow an arbitrary amount of data to collect on the heap. Right now there is nothing that can drop the data in that scenario. I don't necessarily disagree with that somewhat general statement, but I'm far from convinced that checking for expired messages is the right tool for the job in the first place. The fact is that expiration is time-based, and that default timeouts are multiples of seconds, so there is plenty of time for messages to accumulate and blow the heap without any of them being droppable. On top of that, not all messages have timeouts, which actually makes sense, because message timeout isn't a back-pressure mechanism; it's about how long we're willing to wait for an answer to a request message, and hence one-way messages have no reason to have such a timeout. And that's part of the point: I dislike using a concept that isn't meant to be related to back-pressure to do back-pressure, especially when it's as flawed as this one. Users shouldn't have to worry that nodes could OOM because they set their write timeout high; it's just not intuitive. Don't get me wrong, I don't disagree that some back-pressure mechanism should be added for that problem, but it should be based more on the amount of message data (or, at the very least, the number of such messages) in the Netty queue. Surely we're not the only ones facing this problem though; doesn't Netty already have a standard way to deal with that problem of messages piling up in its queues? 
[jira] [Commented] (CASSANDRA-8457) nio MessagingService
[ https://issues.apache.org/jira/browse/CASSANDRA-8457?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15967513#comment-15967513 ] Jason Brown commented on CASSANDRA-8457: [~aweisberg] and I spoke with Scott Mitchell of the netty team, and we have a potential solution better than what I put in the last comment. The TL;DR is to add an iterator to netty's {{ChannelOutboundBuffer}} and walk the "queue" looking for expired items. It requires a patch to netty (which I'm working on), but should be straightforward. More details soon...
[jira] [Commented] (CASSANDRA-8457) nio MessagingService
[ https://issues.apache.org/jira/browse/CASSANDRA-8457?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15966390#comment-15966390 ] Jason Brown commented on CASSANDRA-8457: [~aweisberg] So, I've kicked this idea around a bit (haven't coded anything up yet), but I thought about using something like {{ExpiringMap}} for the {{QueuedMessage}}s, and when the message times out, either updating a (possibly new) field in the {{QueuedMessage}} or failing the {{ChannelPromise}}. Failing the {{ChannelPromise}} is a bit racy, as the netty event loop may be in the process of actually serializing and sending the message, and failing the promise (from another thread) might not be great while we're actually doing the work. Meanwhile, updating some new {{volatile}} field in the {{QueuedMessage}} requires yet another volatile update on the send path (better than a synchronized call for sure, but ...). {{ExpiringMap}}, of course, creates more garbage ({{CacheableObject}}), and has its own internal {{ConcurrentHashMap}}, with more volatiles and such. Not sure if this is a great idea, but it's the current state of my thinking. Thoughts?
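The "new volatile field" alternative above reads roughly like this. It is a hypothetical sketch only: neither the field name nor the check site comes from the actual branch, and the reaper side is stubbed down to a single method call.

```java
import java.util.concurrent.TimeUnit;

// Hypothetical sketch: a reaper thread flips a volatile flag on the queued
// message when it times out; the event-loop send path reads the flag and
// skips serialization, instead of racing to fail a ChannelPromise from
// another thread mid-send.
class ExpirableQueuedMessage {
    private final long enqueuedNanos = System.nanoTime();
    private final long timeoutNanos;
    private volatile boolean expired = false;   // one volatile read per send

    ExpirableQueuedMessage(long timeoutMillis) {
        this.timeoutNanos = TimeUnit.MILLISECONDS.toNanos(timeoutMillis);
    }

    // invoked by an ExpiringMap-style reaper, off the event loop
    void markExpiredIfTimedOut(long nowNanos) {
        if (nowNanos - enqueuedNanos >= timeoutNanos)
            expired = true;
    }

    // checked on the event loop just before serializing the message
    boolean shouldSend() { return !expired; }
}
```

The trade-off discussed in the comment is visible here: the flag costs a volatile read on every send, but the drop always happens safely on the event loop rather than concurrently with serialization.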
[jira] [Commented] (CASSANDRA-8457) nio MessagingService
[ https://issues.apache.org/jira/browse/CASSANDRA-8457?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15966293#comment-15966293 ] Ariel Weisberg commented on CASSANDRA-8457: --- bq. But was that even a good idea? I'm not convinced. The chance of a message expiring on the sending node is low to start with, but having it expire on the producer thread is kind of remote, and it doesn't feel like it's worth bothering. Especially since we'll expire the message a tiny bit later when processing it anyway. Overall, the benefit to code clarity of not checking for expiration in too many places outweighs imo any benefit we'd have here. In the context of the entire system, maybe there are other mechanisms in play that prevent messages from piling up. But without any mechanism, what will happen is that the messages can queue on the heap. With the old blocking implementation the thread would eventually block because the TCP buffer is full, and with the NIO implementation the queued messages will not be pulled and discarded because the socket will not signal ready to write. I think it's important that a single slow node or network issue resulting in a socket that isn't writable shouldn't allow an arbitrary amount of data to collect on the heap. Right now there is nothing that can drop the data in that scenario. The producer doesn't have to do the dropping in Netty land, but maybe it makes more sense to make producers the victims for that work rather than the network threads, which have other responsibilities.
[jira] [Commented] (CASSANDRA-8457) nio MessagingService
[ https://issues.apache.org/jira/browse/CASSANDRA-8457?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15965767#comment-15965767 ] Jason Brown commented on CASSANDRA-8457: bq. you seem to have squashed everything ... Yes, that's correct, as it makes rebasing a more manageable task with these longer-lived projects. However, I always make a backup of a branch before squashing and rebasing, and here's the last one: [8457_pre-2017Mar16-rebase|https://github.com/jasobrown/cassandra/tree/8457_pre-2017Mar16-rebase]. It has everything the squashed version does except for the commit to fix ssl hostname verification (sha {{5ccdd50624ef166eb46ea559aca3c4b264d8b124}}). I'll address your other comments ASAP.
[jira] [Commented] (CASSANDRA-8457) nio MessagingService
[ https://issues.apache.org/jira/browse/CASSANDRA-8457?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15963989#comment-15963989 ] Sylvain Lebresne commented on CASSANDRA-8457: - bq. One remaining open issue is the current patch does not expire messages on the producer thread, like what we used to do in {{OutboundTcpConnection#enqueue}} But was that even a good idea? I'm not convinced. The chance of a message expiring on the sending node is low to start with, but having it expire _on the producer thread_ is kind of remote, and it doesn't feel like it's worth bothering. Especially since we'll expire the message a tiny bit later when processing it anyway. Overall, the benefit to code clarity of not checking for expiration in too many places outweighs imo any benefit we'd have here. {quote} bq. In {{connectionTimeout()}}, what happens when a connection attempt times out? This is what {{OutboundTcpConnection}} does, and I've replicated it here. {quote} Fair enough, but can't we at least log a warning? Sounds to me this would be an improvement. bq. TBH, I think both of these behaviors are incorrect I agree, and never fully understood the reasoning here; it's just not something that seems to have created problems. Imo, there are just 2 cases: # we have a version match: success, just use the connection. # we have a version mismatch: disconnect from that connection and reconnect with what we know will be a proper version. Don't do anything to the backlog; we're still just negotiating our connection, and there is no reason to do anything with outstanding messages. Other than that, I had a quick look over the branch and it looks good, but unfortunately you seem to have squashed everything, which makes it really hard to check the changes made since my last review, and I don't have the time right now to redo a full careful read of the whole patch. 
[jira] [Commented] (CASSANDRA-8457) nio MessagingService
[ https://issues.apache.org/jira/browse/CASSANDRA-8457?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15956759#comment-15956759 ] Jason Brown commented on CASSANDRA-8457: belatedly adding support for ssl hostname verification (CASSANDRA-9220). thanks to the sslnodetonode_test.py for finding this (that dtest will need an update, as well)
[jira] [Commented] (CASSANDRA-8457) nio MessagingService
[ https://issues.apache.org/jira/browse/CASSANDRA-8457?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15950929#comment-15950929 ] Jason Brown commented on CASSANDRA-8457: rebased my branch on trunk as of 2017-Mar-28, and added in CASSANDRA-13018 and CASSANDRA-13324
[jira] [Commented] (CASSANDRA-8457) nio MessagingService
[ https://issues.apache.org/jira/browse/CASSANDRA-8457?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15944223#comment-15944223 ] Jason Brown commented on CASSANDRA-8457: [~norman] if you want to give it another shot, I'm always open for more input
[jira] [Commented] (CASSANDRA-8457) nio MessagingService
[ https://issues.apache.org/jira/browse/CASSANDRA-8457?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15943172#comment-15943172 ] Norman Maurer commented on CASSANDRA-8457: -- [~jasobrown] let me know if I should review the code again.
[jira] [Commented] (CASSANDRA-8457) nio MessagingService
[ https://issues.apache.org/jira/browse/CASSANDRA-8457?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15943160#comment-15943160 ] Jason Brown commented on CASSANDRA-8457: - I really like what you've done with {{HandshakeProtocol}}. It's a nice clarification/centralization of the handshake data. - I'm cool with ditching the {{DisabledCoalescingStrategy}}, but didn't like passing {{null}} around. Thus, I switched to passing {{Optional}} as the intent is a bit more explicit. - With the refactorings, it made it easier to remove {{OutboundConnection}}, as its reduced functionality can now live simply in {{OutboundMessagingConnection}} without complicating it. - addressed the comments about timeouts on the outbound handshake side, and reduced them to a single timeout (the one that was already in {{OutboundMessagingConnection}}) - upgraded to netty 4.1.9 - updated a lot of comments and documentation. - added in a lot more tests. code coverage is now > 90% for the {{org.apache.cassandra.net.async}} package. One remaining open issue is that the current patch does not expire messages on the producer thread, like what we used to do in {{OutboundTcpConnection#enqueue}}. I'm awaiting the outcome of CASSANDRA-13324, and will apply the result here. bq. In {{connectionTimeout()}}, what happens when a connection attempt times out? This is what [{{OutboundTcpConnection}}|https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/net/OutboundTcpConnection.java#L248] does, and I've replicated it here. Note that there could be several connection attempts before the connection timeout triggers. bq. in the DISCONNECT/NEGOTIATION_FAILURE case, we don't seem to be actively trying to reconnect The {{DISCONNECT}} case handles the case where the protocol versions do not match. 
In {{OutboundTcpConnection}}, if the [peer's version is lower than|https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/net/OutboundTcpConnection.java#L465] what we expected it to be, we clear the backlog and do not attempt to reconnect until more messages are {{#enqueue()}}'d. On the contrary, if the [peer's version is greater than|https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/net/OutboundTcpConnection.java#L488] what we expected it to be, we'll finish the handshake, send what's currently in the backlog, and then disconnect. TBH, I think both of these behaviors are incorrect. If the peer's version is lower than what we expect, we know how to send messages (both handshake and message serialization), so we could send what we have in the backlog without throwing it away. If the peer's version is greater than what we expect, and we try to send messages, it might fail on the receiver side due to a change in the handshake sequence (unlikely) or message framing/format (likely). WDYT? Either way, I'll need to update {{OutboundMessagingConnection}} as it's not quite correct rn. {{NEGOTIATION_FAILURE}} does what [{{OutboundTcpConnection}}|https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/net/OutboundTcpConnection.java#L453] does: if we can't get a version from the peer, either we couldn't connect or the handshake failed. I think the current behavior (throw away the backlog) is desirable. bq. In {{InboundHandshakeHandler.handleMessagingStartResponse()}}, maybe we should check the return of handshakeTimeout.cancel() As the {{handshakeTimeout}} executes in the netty event loop, it won't compete with the {{#decode}} method. As {{#failHandshake()}} closes the channel, we won't even get the {{ThirdHandshakeMessage}} into {{handleMessagingStartResponse}} as {{#decode()}} would never be triggered. bq. 
Regarding the sending and receiving buffer size handling, I agree with all your points and have implemented them.
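The current {{OutboundTcpConnection}} mismatch handling described in the comment above can be modeled roughly as follows; {{VersionMismatchSketch}} and {{BacklogAction}} are hypothetical names used only for illustration:

```java
// Sketch of OutboundTcpConnection's current handshake-mismatch handling as
// described above; VersionMismatchSketch and BacklogAction are hypothetical.
public class VersionMismatchSketch
{
    public enum BacklogAction { CLEAR_AND_WAIT, SEND_THEN_DISCONNECT, PROCEED }

    // Peer lower than expected: clear the backlog and wait for the next
    // enqueue to trigger a reconnect. Peer higher: finish the handshake,
    // drain the backlog, then disconnect. Equal: just proceed.
    public static BacklogAction onHandshake(int expected, int peer)
    {
        if (peer < expected)
            return BacklogAction.CLEAR_AND_WAIT;
        if (peer > expected)
            return BacklogAction.SEND_THEN_DISCONNECT;
        return BacklogAction.PROCEED;
    }
}
```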
[jira] [Commented] (CASSANDRA-8457) nio MessagingService
[ https://issues.apache.org/jira/browse/CASSANDRA-8457?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15925803#comment-15925803 ] Sylvain Lebresne commented on CASSANDRA-8457: - Finally got time to look at the new version, sorry it wasn't sooner. In general, that looks pretty good, and while I have a number of remarks and suggestions, those are somewhat minor. First, I took a stab at addressing some of my comments/suggestions when it felt simpler than writing a wordy comment, and pushed the resulting branch [here|https://github.com/pcmanus/cassandra/commits/8457-review]. I won't go into too much detail as I tried to put the rationale for each commit in the commit messages, but I want to mention the 3rd commit as it's the only substantial one. Basically, I felt {{OutboundMessagingConnection}} was a bit dense, and that having the exact details around flushing behavior split across different classes ({{OutboundMessagingConnection}} and {{MessageOutHandler}} mostly, but still) was making things slightly less easily understandable. So that commit is mostly my attempt at grouping flush behavior in a single place. Hopefully, I didn't break anything. Anyway, I tried to break those suggested commits up as much as I could so it's easy to skip something if you don't agree with it. Outside of that, a few additional remarks: * In {{SSLFactory}}, it's confusing as to what the difference is between {{createSslContext}} and {{getSslContext}} and when one is supposed to use one or the other. * In {{OutboundMessagingConnection}}: ** I'd prefer using an {{AtomicReference}} for {{state}} (and {{AtomicBoolean}} for {{scheduledFlushState}}, but my branch above already does this one). 
We're not supposed to create new connections all the time, and we're not creating that many of them overall, so I doubt saving a few small allocations makes a meaningful difference here, but it makes things slightly less readable (don't get me wrong, using field updaters is a good trick, and we do already use it in other places, but I prefer keeping it for where it demonstrably makes a difference; otherwise it's premature optimization). ** At the beginning of {{sendMessage()}}, shouldn't we be checking that people aren't using a closed connection? ** The naming of {{MIN_MESSAGES_FOR_COALESCE}} and more generally the phrasing of the comments on {{otc_coalescing_enough_coalesced_messages}} in {{cassandra.yaml}} wasn't all that clear to me on first read. Not necessarily new to this patch though. ** We set a connection timeout in {{connect()}}, but we also have the exact same timeout in {{OutboundHandshakeHandler}}. Can't we simplify a bit and remove the latter one? ** In {{connectionTimeout()}}, what happens when a connection attempt times out? It looks like we drop the backlog on the floor without logging any message or attempting to reconnect. Am I missing something? ** In {{finishHandshake}}, in the DISCONNECT/NEGOTIATION_FAILURE case, we don't seem to be actively trying to reconnect. I suppose we rely on the next message triggering that connection? Shouldn't we reconnect more actively? We may have a message in the backlog, and we don't know when the next message will come. * In {{InboundHandshakeHandler.handleMessagingStartResponse()}}, maybe we should check the return of {{handshakeTimeout.cancel()}} and exit early if it's {{false}}, rather than reading from a closed socket and getting a random error. 
Also, as we cancel the timeout before setting the state to "complete", I think it would be safer to check for {{if (state == COMPLETE || (handshakeTimeout != null && handshakeTimeout.isCancelled()))}} in {{failHandshake()}} (admittedly we're not really racy because we're on the netty event loop, but it doesn't feel harmful to future proof in case this changes). * Regarding the sending and receiving buffer size handling: ** we don't seem to be handling the defaults for these buffers the same way for inbound and outbound connections. For outbound connections, we use the default defined in {{OutboundConnectionParams}}, but for inbound ones it appears we just default to whatever Netty defaults to (in {{NettyFactory.createInboundChannel}}). Can we consolidate that (probably want to always use our own defaults)? ** since we use separate connections for inbound and outbound, I assume we barely ever use the receive buffer for outbound connections and vice-versa. That is, there is handshaking which sends messages both ways in both cases, but the messages exchanged are tiny. Would it make sense to hardcode the size of the buffer that each side barely uses to something big enough for the handshake but small otherwise? That way, when users configure {{DatabaseDescriptor.getInternodeRecvBufferSize/getInternodeSendBufferSize}}, it only impacts what I assume they truly mean to impact. ** in the {{Builder.build}} method, it feels weird to only check {{sendReceiveBufferSize}} as
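One way the buffer-size consolidation suggested above could look is sketched below; the constants and method names are hypothetical, not from the patch:

```java
// Hypothetical sketch of unified send/receive buffer defaults for both
// connection directions; all constants and names are illustrative only.
public class BufferSizes
{
    // tiny handshake messages are all the "unused" direction ever carries
    public static final int HANDSHAKE_ONLY_BUFFER_SIZE = 1 << 12; // 4 KiB
    public static final int DEFAULT_BUFFER_SIZE = 1 << 16;        // 64 KiB

    // The operator-configured value only affects the direction it is meant
    // for; zero or negative means "use our own default" (never Netty's).
    public static int sendBufferForOutbound(int configured)
    {
        return configured > 0 ? configured : DEFAULT_BUFFER_SIZE;
    }

    // Outbound connections only ever receive handshake responses, so their
    // receive buffer can be hardcoded small regardless of configuration.
    public static int recvBufferForOutbound()
    {
        return HANDSHAKE_ONLY_BUFFER_SIZE;
    }
}
```

In a real implementation these values would be applied as channel options on both the inbound and outbound bootstraps, so neither side silently falls back to Netty's defaults.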
[jira] [Commented] (CASSANDRA-8457) nio MessagingService
[ https://issues.apache.org/jira/browse/CASSANDRA-8457?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15868315#comment-15868315 ] Jason Brown commented on CASSANDRA-8457: [~jjordan] That's a fair assessment. I'll table the large messages for the moment, and focus my energies a) on getting CASSANDRA-12229 posted and b) pinging [~slebresne] for more comments when he can :)
[jira] [Commented] (CASSANDRA-8457) nio MessagingService
[ https://issues.apache.org/jira/browse/CASSANDRA-8457?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15868140#comment-15868140 ] Jeremiah Jordan commented on CASSANDRA-8457: Is adding more complexity to this already complex patch really the answer here (handling large messages)? It seems to me like we should finish up the patch as is and make "handle large messages" an optimization ticket after we get this in. Getting streaming to work with this seems like a better place to spend our time.
[jira] [Commented] (CASSANDRA-8457) nio MessagingService
[ https://issues.apache.org/jira/browse/CASSANDRA-8457?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15868091#comment-15868091 ] Ariel Weisberg commented on CASSANDRA-8457: --- {quote} Thus, I propose to bring back the SwappingByteBufDataOutputStreamPlus that I had in an earlier commit. To recap, the basic idea is to provide a DataOutputPlus that has a backing ByteBuffer that is written to, and when it is filled, it is written to the netty context and flushed, then allocate a new buffer for more writes - kinda similar to a BufferedOutputStream, but replacing the backing buffer when full. Bringing this idea back is also what underpins one of the major performance things I wanted to address: buffering up smaller messages into one buffer to avoid going back to the netty allocator for every tiny buffer we might need - think Mutation acks. {quote} What thread is going to be writing to the output stream to serialize the messages? If it's a Netty thread, you can't block inside a serialization method waiting for the bytes to drain to a socket that is not keeping up. You also can't wind out of the serialization method and continue it later. If it's an application thread, then it's no longer asynchronous, and a slow connection can block the application and prevent it from doing, say, a quorum write to just the fast nodes. You would also need to lock during serialization or queue concurrently sent messages behind the one currently being written. With large messages we aren't really fully eliminating the issue, only making it a factor better. At the other end you still need to materialize a buffer containing the message + the object graph you are going to materialize. This is different from how things worked previously, where we had a dedicated thread that would read fixed size buffers and then materialize just the object graph from that. To really solve this we need to be able to avoid buffering the entire message at both the sending and receiving sides. 
The buffering is worse because we are allocating contiguous memory, and not just doubling the space impact. We could make it incrementally better by using chains of fixed size buffers so there is less external fragmentation and allocator overhead. That's still committing additional memory compared to pre-8457, but at least it's being committed in a more reasonable way. I think the most elegant solution is to use a lightweight thread implementation. What we will probably be boxed into doing is making the serialization of result data and other large message portions able to yield. This will bound the memory committed to large messages to the largest atomic portion we have to serialize (Cell?). Something like an output stream being able to say "shouldYield". If you continue to write it will continue to buffer and not fail, but use memory. Then serializers can implement a return value for serialize which indicates whether there is more to serialize. You would check shouldYield after each Cell or some unit of work when serializing. Most of these large things being serialized are iterators. The trick will be that most serialization is stateless, and objects are serialized concurrently, so you can't store the serialization state in the object being serialized safely.
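The "shouldYield" idea above could be sketched roughly like this; all names are hypothetical, and a real implementation would serialize Cells rather than raw byte arrays:

```java
import java.util.Iterator;

// Rough sketch of yieldable serialization: write cells until the output asks
// to yield, and let the iterator itself carry the resume point, since the
// object being serialized cannot safely hold per-serialization state.
public class YieldingSerializer
{
    public interface Output
    {
        void write(byte[] cell);
        boolean shouldYield();  // e.g. "the downstream buffer is full enough"
    }

    // Returns true if more cells remain, so the caller can resume later with
    // the same iterator; returns false once everything has been serialized.
    public static boolean serializeSome(Iterator<byte[]> cells, Output out)
    {
        while (cells.hasNext())
        {
            out.write(cells.next());
            if (out.shouldYield() && cells.hasNext())
                return true;
        }
        return false;
    }
}
```

Keeping the resume point in the iterator sidesteps the concurrency problem noted above: two simultaneous serializations of the same object each hold their own iterator.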
[jira] [Commented] (CASSANDRA-8457) nio MessagingService
[ https://issues.apache.org/jira/browse/CASSANDRA-8457?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15865754#comment-15865754 ] Jason Brown commented on CASSANDRA-8457: rebased, and pulled in CASSANDRA-13090
[jira] [Commented] (CASSANDRA-8457) nio MessagingService
[ https://issues.apache.org/jira/browse/CASSANDRA-8457?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15864943#comment-15864943 ] Jason Brown commented on CASSANDRA-8457: One other thing [~aweisberg] and I discussed was large messages and how they would be handled. In the current 8457 implementation, we will simply allocate a buffer of the {{serializedSize}}. If a message is supposed to be 50MB, we'll allocate that and roll on. With enough large messages, sent to enough peers, we could OOM or get into some serious memory pressure problems. For comparison, the existing {{OutboundTcpConnection}} uses a {{BufferedOutputStream}}, which defaults to 64k, which we constantly reuse and never need to realloc. Thus, I propose to bring back the {{SwappingByteBufDataOutputStreamPlus}} that I had in an earlier commit. To recap, the basic idea is to provide a {{DataOutputPlus}} that has a backing {{ByteBuffer}} that is written to; when it is filled, it is written to the netty context and flushed, then a new buffer is allocated for more writes - kinda similar to a {{BufferedOutputStream}}, but replacing the backing buffer when full. Bringing this idea back is also what underpins one of the major performance things I wanted to address: buffering up smaller messages into one buffer to avoid going back to the netty allocator for every tiny buffer we might need - think Mutation acks. We definitely need to address the large buffer issue, and I wouldn't mind knocking out the "buffering small messages" as it's really the same code (that I've written before). wdyt [~slebresne] and [~aweisberg]? 
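A minimal sketch of the buffer-swapping idea, with a plain {{Consumer}} standing in for the netty channel context; the class and method names are hypothetical, not the actual {{SwappingByteBufDataOutputStreamPlus}}:

```java
import java.nio.ByteBuffer;
import java.util.function.Consumer;

// Rough sketch of the buffer-swapping idea: writes go into a fixed-size
// backing buffer, and each time it fills up it is handed downstream and
// replaced, instead of sizing one buffer per message.
public class SwappingBufferOutput
{
    private final int capacity;
    private final Consumer<ByteBuffer> downstream; // stands in for ctx.writeAndFlush(...)
    private ByteBuffer current;

    public SwappingBufferOutput(int capacity, Consumer<ByteBuffer> downstream)
    {
        this.capacity = capacity;
        this.downstream = downstream;
        this.current = ByteBuffer.allocate(capacity);
    }

    public void write(byte[] bytes)
    {
        int offset = 0;
        while (offset < bytes.length)
        {
            // copy as much as fits, then swap if the buffer is now full
            int n = Math.min(current.remaining(), bytes.length - offset);
            current.put(bytes, offset, n);
            offset += n;
            if (!current.hasRemaining())
                swap();
        }
    }

    private void swap()
    {
        current.flip();
        downstream.accept(current);
        current = ByteBuffer.allocate(capacity);
    }
}
```

This also shows why the same mechanism covers both problems in the comment above: a 50MB message is streamed out in capacity-sized chunks, while many tiny messages can accumulate in one buffer before it ships.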
[jira] [Commented] (CASSANDRA-8457) nio MessagingService
[ https://issues.apache.org/jira/browse/CASSANDRA-8457?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15864888#comment-15864888 ] Jason Brown commented on CASSANDRA-8457: Thanks to [~aweisberg], caught another race condition in {{OutboundMessagingHandler}} and pushed. Needs to be updated when CASSANDRA-13090 is committed (probably today).
[jira] [Commented] (CASSANDRA-8457) nio MessagingService
[ https://issues.apache.org/jira/browse/CASSANDRA-8457?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15863760#comment-15863760 ] Norman Maurer commented on CASSANDRA-8457: -- [~jasobrown] whooot :) I will do another review as well, just to double-check again.
[jira] [Commented] (CASSANDRA-8457) nio MessagingService
[ https://issues.apache.org/jira/browse/CASSANDRA-8457?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15863758#comment-15863758 ] Jason Brown commented on CASSANDRA-8457: OK, so I've been performance load testing the snot of this code for the last several weeks, with help from netty committers, flight recorder, and flame graphs. As a result, I've made some major and some minor tweaks, and now I'm slightly faster than trunk with slightly better throughput. I have some optimizations in my back pocket that will increase even more, but as Sylvain has stated before, we'll save those for follow up tickets. trunk {code} id, type total ops,op/s,pk/s, row/s,mean, med, .95, .99,.999, max, time, stderr, errors, gc: #, max ms, sum ms, sdv ms, mb 4 threadCount, total,233344,3889,3889,3889, 1.0, 1.0, 1.2, 1.3, 1.5,68.2, 60.0, 0.01549, 0, 9, 538, 538, 4,5381 8 threadCount, total,544637,9076,9076,9076, 0.8, 0.8, 1.0, 1.1, 1.4,73.8, 60.0, 0.00978, 0, 20, 1267,1267, 5, 11848 16 threadCount, total, 1126627, 18774, 18774, 18774, 0.8, 0.8, 0.9, 1.0, 5.5,78.2, 60.0, 0.01882, 0, 40, 2665,2665, 6, 23666 24 threadCount, total, 1562460, 26036, 26036, 26036, 0.9, 0.8, 1.0, 1.1, 9.1,81.3, 60.0, 0.00837, 0, 55, 3543,3543, 9, 32619 36 threadCount, total, 2098097, 34962, 34962, 34962, 1.0, 0.9, 1.1, 1.3,60.9,83.0, 60.0, 0.00793, 0, 73, 4665,4665, 7, 43144 54 threadCount, total, 2741814, 45686, 45686, 45686, 1.1, 1.0, 1.4, 1.7,62.2, 131.7, 60.0, 0.01321, 0, 93, 5748,5748, 7, 55097 81 threadCount, total, 3851131, 64166, 64166, 64166, 1.2, 1.0, 1.6, 2.6,62.3, 151.7, 60.0, 0.01152, 0,159, 8190,8521, 14, 106805 121 threadCount, total, 4798169, 79947, 79947, 79947, 1.5, 1.1, 2.0, 3.0,63.5, 117.8, 60.0, 0.05689, 0,165, 9323,9439, 5, 97536 181 threadCount, total, 5647043, 94088, 94088, 94088, 1.9, 1.4, 2.6, 4.9,68.5, 169.2, 60.0, 0.01639, 0,195, 10106, 11011, 11, 126422 271 threadCount, total, 6450510, 107461, 107461, 107461, 2.5, 1.8, 3.7,12.0,75.4, 155.8, 60.0, 0.01542, 0,228, 
10304, 12789, 9, 143857 406 threadCount, total, 6700764, 111635, 111635, 111635, 3.6, 2.5, 5.3,55.8,75.6, 196.5, 60.0, 0.01800, 0,243, 9995, 13170, 7, 144166 609 threadCount, total, 7119535, 118477, 118477, 118477, 5.1, 3.5, 7.9,62.8,85.1, 170.0, 60.1, 0.01775, 0,250, 10149, 13781, 7, 148118 913 threadCount, total, 7093347, 117981, 117981, 117981, 7.7, 4.9,15.7,71.3, 101.1, 173.4, 60.1, 0.02780, 0,246, 10327, 13859, 8, 155896 {code} 8457 {code} id, type total ops,op/s,pk/s, row/s,mean, med, .95, .99,.999, max, time, stderr, errors, gc: #, max ms, sum ms, sdv ms, mb 4 threadCount, total,161668,2694,2694,2694, 1.4, 1.4, 1.6, 1.7, 3.2,68.2, 60.0, 0.01264, 0, 6, 363, 363, 4,3631 8 threadCount, total,498139,8301,8301,8301, 0.9, 0.9, 1.1, 1.3, 1.8,73.5, 60.0, 0.00446, 0, 19, 1164,1164, 6, 11266 16 threadCount, total,765437, 12756, 12756, 12756, 1.2, 1.2, 1.4, 1.5, 5.7,74.8, 60.0, 0.01251, 0, 29, 1819,1819, 5, 17238 24 threadCount, total, 1122768, 18710, 18710, 18710, 1.2, 1.2, 1.4, 1.5, 8.5, 127.7, 60.0, 0.00871, 0, 42, 2538,2538, 5, 25054 36 threadCount, total, 1649658, 27489, 27489, 27489, 1.3, 1.2, 1.4, 1.6,60.1,77.7, 60.0, 0.00627, 0, 57, 3652,3652, 7, 33743 54 threadCount, total, 2258999, 37641, 37641, 37641, 1.4, 1.3, 1.6, 1.8,62.5,81.7, 60.0, 0.00771, 0, 79, 4908,4908, 6, 46789 81 threadCount, total, 3255005, 54220, 54220, 54220, 1.5, 1.2, 1.7, 2.2,63.8, 133.4, 60.0, 0.02030, 0,117, 6953,7008, 9, 72208 121 threadCount, total, 4643184, 77293, 77293, 77293, 1.5,
[jira] [Commented] (CASSANDRA-8457) nio MessagingService
[ https://issues.apache.org/jira/browse/CASSANDRA-8457?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15735327#comment-15735327 ] Jason Brown commented on CASSANDRA-8457: Ok, [~slebresne], next round of review is ready :) I'm currently ferreting out a regression with {{TriggersTest}}, but that shouldn't hold things up any further. bq. {{MessageOutHandler.SwappingByteBufDataOutputStreamPlus}} - I'm wondering if it's not premature optimization Yeah, you may be correct here. This is one of the first parts I wrote, and clearly I was worried about the "pack multiple messages into a single flush" issue. It's possible that a simpler implementation, where each message gets its own buffer, might just be best. However, one counter argument here could be that when we have a very large message to serialize, think tens of MBs, we would allocate that one large buffer, wait to serialize data into it, and then send/flush it. {{OutboundTcpConnection}} and my current branch will buffer a certain subset of that data (say 64k) and send/flush that, rather than wait for it all to be serialized. I'm not sure if that's compelling enough at this time, or if we consider that followup ticket/work. Either way, I've removed {{MessageOutHandler.SwappingByteBufDataOutputStreamPlus}}. bq. OMC.enqueue and background queue Unfortunately, there are a few things that get in our way of making it that simple, assuming we don't want to block threads. The biggest is that we don't want to write general messages to the channel until the internode messaging handshake completes successfully: it's a three-way handshake, and if we start sending general messages in the middle of the handshake, it'll break the handshake. Further, it's not so clean or obvious how to get the ChannelFuture instance without *some* degree of blocking (some of which we might be able to avoid with some clever tricks @normanm showed me). 
Ultimately, we need a singular {{Channel}} instance, which would require blocking to get only one instance. Another reason why I opted for the backlog queue was to allow as close to an ordered sending of the messages as possible to a given peer: in other words, to write to the channel as many backlogged messages as possible before sending the new message. For the vast majority of cassandra's functionality, ordering of messages is irrelevant. However, I'm wondering how sensitive repair and cluster membership changes might be to this potential edge-case of message reordering. bq. I'd expect {{enqueue()}} (which might warrant a rename) to just do a {{ctx.write()}} Makes sense, done. That being said, I've left the code in the newly-named {{#sendMessage}} and {{#finishHandshake}} a bit tentative, with a note that you and I should hammer this part out further :) bq. ... flushing ... I really like what you've done wrt flushing in {{FlushingHandler}} and {{CoalescingStrategies}}. I think it was because I was trying to maintain both blocking and non-blocking behaviors in {{CoalescingStrategies}} that the code got twisted and complex. Thus, I've eliminated {{CoalescingMessageOutHandler}} and brought in your {{FlushingHandler}} and {{CoalescingStrategies}} changes. One thing to point out is that when scheduling a task to a netty executor thread, it will execute on the same IO thread as the current context. Thus, in {{FlushingHandler}} you don't have to worry about any concurrency between a new message arrival and the execution of the task, as they execute on the same thread. I've removed your (well-documented!) javadoc comments and the volatile keyword on the {{scheduledFlush}} member field. I agree we can address "targeted benchmarking ... in a separate ticket", as I think getting the essential behavior of flush and coalesce right is most important here. bq. Regarding dropped messages, the current implementation was going through the {{MessagingService.incrementDroppedMessages()}} ... 
Hmm, {{OutboundTcpConnection}} has its own [counter for connection-specific dropped messages|https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/net/OutboundTcpConnection.java#L136], which I've maintained in {{OutboundMessagingConnection}}. It seems like {{OutboundTcpConnection}} should probably be reporting its dropped messages to {{MessagingService.incrementDroppedMessages()}} in addition to what it currently does. If you think that's legit, I'll fix it in this patch, and I'll open a separate ticket to fix it in 3.0+. wdyt? bq. InboundHandshakeHandler#handshakeHandlerChannelHandlerName Removed this. I think I needed it with an earlier version of netty? Either way, killed it. bq. In {{OutboundMessagingConnection#writeBacklogToChannel}}, we seem to be sending timed-out messages if those are retried This behavior was changed in CASSANDRA-12192, so I've retained it here. bq. {{MessagingService.MessageSender}} Removed bq.
[jira] [Commented] (CASSANDRA-8457) nio MessagingService
[ https://issues.apache.org/jira/browse/CASSANDRA-8457?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15643999#comment-15643999 ] Sylvain Lebresne commented on CASSANDRA-8457:
-
Alright, here's a new round of reviews:
* On {{MessageOutHandler}}:
** I'm a bit unclear on the benefits of the {{SwappingByteBufDataOutputStreamPlus}} (whose name I'd btw probably shorten given it's private), compared to the simple solution of just allocating a {{ByteBuf}} per message. My analysis is this: if we have no coalescing (which may end up being the default back in 4.0, CASSANDRA-12676), then for any message smaller than {{bufferCapacity}} we end up allocating more than needed since we flush for each message, so it's less optimal. For larger messages, we end up allocating multiple small buffers (instead of a larger one), which may be better, but I'm not sure how much. With coalescing, I guess we do end up potentially packing multiple messages into a {{ByteBuf}}, but does that make such a big difference (I mean, even if we were to have one {{ByteBuf}} per message, we would still not flush on every message, which feels the most important part)? So to sum up, I can buy that it's slightly better in some cases, but it seems also slightly less optimal in others and it's unclear how much it helps when it does, so in the absence of any benchmarking, I'm wondering if it's not premature optimization. Shouldn't we leave that to later and keep it simpler?
** If we do keep {{SwappingByteBufDataOutputStreamPlus}}, I'd at least inline {{WriteState}}. It creates a new indirection which doesn't seem to have much benefit, and I admit the naming confused me (a "state" is not the first place I looked to find the buffer).
* In {{OutboundMessagingConnection}}:
** In {{writeBacklogToChannel}}, we seem to be sending timed-out messages if those are retried: I don't think we want to do that (we should always drop a timed-out message). 
** I'm not a fan of the use of the {{backlog}} queue for sending messages. I'm no Netty expert but that doesn't seem to fit Netty's "spirit". Once we're in a stable state (the connection is created), I'd expect {{enqueue()}} (which might warrant a rename) to just do a {{ctx.write()}}, so it's confusing to me that it doesn't. We do need to handle connection creation and retries, but it feels we can handle that easily enough through the future on channel creation. To clarify, I'd expect something along the lines of (it's simplified, just to clarify what I have in mind):
{noformat}
void enqueue(MessageOut msg, int id)
{
    QueuedMessage qm = new QueuedMessage(msg, id);
    if (state == State.READY)
    {
        channel.write(qm);
    }
    else
    {
        ChannelFuture connectionFuture = connect(); // <- connect would deal with concurrency.
        connectionFuture.addListener(f -> f.channel().write(qm));
    }
}
{noformat}
** I'm a bit unclear on the {{flush()}} strategy. The code seems to flush basically on every message (ok, once per emptying of the queue, but if we ignore connection creation, this can very well be on every message or very close to it), but this seems to defeat the coalescing handler since that handler will flush on flush (does that make sense?). I could be getting this wrong, but if I'm not, it seems we shouldn't flush in {{OutboundMessagingConnection}} but delegate that task to the coalescing handler. But see the next point for a more general remark on flush.
* In general, there are two important operations that I think could be better extracted (so it's cleaner where those happen): flushes and timing out messages. Here's my thinking:
** Flushes: it's my understanding that when you flush is pretty important to performance. So it would be great if we could centralize where we flush, if at all possible. In particular, and as said above, I feel we should not flush in {{OutboundMessagingConnection}}. What I would do however is add a {{FlushingHandler}} for that task. 
That handler would mostly be the {{CoalescingMessageOutHandler}} renamed, but the point of the renaming is to make it clear that it's where the decision to flush happens.
** Timeouts: the patch currently looks for timed-out messages in different places and it would be great to consolidate that. In general, I think it's enough to check whether a message is timed out once in the pipeline: messages will rarely time out on the sending side by design, and if they do, it's likely because the channel is overwhelmed and the message has sat on the Netty event executor queue too long, and I believe we'd catch them well enough with a simple {{MessageTimeoutingHandler}}.
* Regarding dropped messages, the current implementation was going through the {{MessagingService.incrementDroppedMessages()}} methods, which is still used by other parts of the code and is distinguishing between cross-node and local messages, but the new code seems to have its own separate
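The single-point timeout check proposed above could look roughly like the following sketch: a queued message carries its enqueue time, and one place in the pipeline decides whether to drop it before it reaches the wire. The shape of {{QueuedMessage}} here is simplified and the names are illustrative, not the actual patch code:

```java
import java.util.concurrent.TimeUnit;

// Sketch: one consolidated "is this message too old to send?" check,
// applied once in the outbound pipeline instead of in several places.
public class TimeoutCheck
{
    static class QueuedMessage
    {
        final String payload;
        final long enqueuedNanos;
        final long timeoutNanos;

        QueuedMessage(String payload, long enqueuedNanos, long timeoutMillis)
        {
            this.payload = payload;
            this.enqueuedNanos = enqueuedNanos;
            this.timeoutNanos = TimeUnit.MILLISECONDS.toNanos(timeoutMillis);
        }

        // true => the message sat in the queue too long; drop, don't send
        boolean isTimedOut(long nowNanos)
        {
            return nowNanos - enqueuedNanos > timeoutNanos;
        }
    }

    public static void main(String[] args)
    {
        long t0 = 0;
        QueuedMessage m = new QueuedMessage("mutation", t0, 2000); // 2s timeout
        assert !m.isTimedOut(t0 + TimeUnit.MILLISECONDS.toNanos(1500)); // still fresh: send
        assert m.isTimedOut(t0 + TimeUnit.MILLISECONDS.toNanos(2500));  // expired: drop
        System.out.println("ok");
    }
}
```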
[jira] [Commented] (CASSANDRA-8457) nio MessagingService
[ https://issues.apache.org/jira/browse/CASSANDRA-8457?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15583503#comment-15583503 ] Jason Brown commented on CASSANDRA-8457:
Addressed a few problems due to failing dtests. The currently failing utests and dtests are almost *all* related to streaming; I'm working on the few remaining dtests.
* correctly handle canceled futures in {{OutboundMessagingConnection#handleMessageFuture}}
* fixed NPE in {{OutboundMessagingConnection#reconnectWithNewIp}}
* refactored some code wrt outbound connection creation
* added package-level documentation, primarily describing the handshake and messaging format

Also, I have a patched version of netty 4.1.6 in this branch that fixes a flush problem in {{LZ4FrameEncoder}}. Netty folks are reviewing now, and will hopefully get an updated netty lib out soon. [~slebresne] ready for your next round of comments when you have time :)
> nio MessagingService > > > Key: CASSANDRA-8457 > URL: https://issues.apache.org/jira/browse/CASSANDRA-8457 > Project: Cassandra > Issue Type: New Feature >Reporter: Jonathan Ellis >Assignee: Jason Brown >Priority: Minor > Labels: netty, performance > Fix For: 4.x > > > Thread-per-peer (actually two each incoming and outbound) is a big > contributor to context switching, especially for larger clusters. Let's look > at switching to nio, possibly via Netty. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (CASSANDRA-8457) nio MessagingService
[ https://issues.apache.org/jira/browse/CASSANDRA-8457?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15527775#comment-15527775 ] Jason Brown commented on CASSANDRA-8457:
TL;DR - I've addressed everything except for the interaction between {{ClientHandshakeHandler}} and {{InternodeMessagingConnection}} (both are now renamed). I've noticed the odd rub there, as well, for a while, and I'll take some time to reconsider it.

re: "talking points"
- Backward compatibility - bit the bullet, and just yanked the old code
- streaming - [~slebresne] and I talked offline, and CASSANDRA-12229 will address the streaming parts, and will be worked on/reviewed concurrently. Both tickets will be committed together to avoid breaking streaming.

re: comments section 1
- Netty openssl - when I implemented this back in February, there was no mechanism to use {{KeyManagerFactory}} with the OpenSSL implementation. Fortunately, this has changed since I last checked in, so I've deleted the extra {{keyfile}} and friends entries from the yaml/{{Config}}.
- "old code" - deleted now
- package javadoc - I absolutely want this :), I just want things to be more solid code-wise before diving into that work.
- naming - names are now more consistent using In/Out (or Inbound/Outbound), and use of client/server is removed.

re: comments section 2
- {{getSocketThreads()}} - I've removed this for now; it will be resolved with CASSANDRA-12229
- {{MessagingService}} renames - done
- {{MessagingService#createConnection()}} - In the previous implementation, {{OutboundTcpConnectionPool}} only blocked on creating the threads for its wrapped {{OutboundTcpConnection}} instances (gossip, large, and small messages). No sockets were actually opened until a message was actually sent to that peer ({{OutboundTcpConnection#connect()}}). 
Since we do not spawn a separate thread for each connection type (even though we will have separate sockets), I don't think it's necessary to block in {{MessagingService#createConnection()}}, or more correctly now, {{MessagingService.NettySender#getMessagingConnection()}}.
- "Seems {{NettySender.listen()}} always starts a non-secure connection" - You are correct; however, it looks like we've always been doing it that way (for better or worse). I've gone ahead and made the change (it's a one-liner, plus a couple extra lines for error checking).
- {{ClientConnect#connectComplete}} - I've renamed the function to be more accurate ({{connectCallback}}).
- {{CoalescingMessageOutHandler}} - done

Other issues resolved, as well. Branch has been pushed (with several commits at the top) and tests are running.
[jira] [Commented] (CASSANDRA-8457) nio MessagingService
[ https://issues.apache.org/jira/browse/CASSANDRA-8457?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15523138#comment-15523138 ] Sylvain Lebresne commented on CASSANDRA-8457:
-
Here come more review bits, though it's still incomplete as I have to context switch a bit. I'll start with a number of general remarks, followed by some more specific comments on some classes, and a few random nits to finish.

The first thing that I think we should discuss is how we want to handle the transition to this (that is, what happens to the old implementation). Namely, I see 2 possible options:
# we simply switch to Netty, period, and the old implementation goes away;
# we keep both implementations in, with a configuration flag to pick which one is used as implemented in the patch, and this probably until 5.0.

I think the main advantage of keeping both implementations for a while is that it's somewhat safer and easier for users to evaluate outside of the 4.0 upgrade itself. But the downsides are obviously that it's a lot more code to maintain, it's more messy right off the bat, and given that only the new implementation will stay eventually, it's only pushing the switch to later. Personally, since we'll ship this in a new major (4.0) and since the cost of maintaining 2 implementations is pretty unpleasant and we don't have infinite dev resources, I'd have a slight preference for option 1, but I'm not going to argue strongly either way. Mostly, I want us to decide, because I think this influences the final product here.

The second thing I want to mention is streaming. I could be missing something here, but it appears streaming is just not supported at all by the new code, meaning that the patch as is is unusable outside of toy testing (I mean by that that if you start a node with the Netty MessagingService, it seems you can't receive any streaming, so you're not a fully functioning node). So I'm kind of wondering what the plan is here. 
I'm happy to tinker on another ticket, as streaming is a beast on its own, but I'm not sure I'm OK committing anything until it's a fully working product. So anyway, with those talking points out of the way, a few more "technical" general remarks on the patch:
* Netty's openssl: what's the fundamental difference between the new {{keyfile}} and the old {{keystore}}? It sounds confusing to have to configure new things when there are already pre-existing parameters for SSL.
* Some "old" code is made public (many constants in particular) and reused by the "new" code, while other bits are copy-pasted, but it's not very consistent and makes things sometimes a bit hard to follow ({{QueuedMessage}} (used by new code too) being an internal class of {{OutboundTcpConnection}} (old code) is a particularly bad offender imo). That's typically where I want us to make a decision on the point I raise above, as whatever decision we take, I feel there are some cleanups to be done.
* It would be really nice to have a package javadoc for the new "async" package that explains the main classes involved and how they interact.
* There are a few TODOs that need to be handled.
* I feel some of the naming could be made more consistent. In particular, the patch uses Client/Server for some classes (e.g. {{ClientHandshakeHandler}}/{{ClientConnector}}) but In/Out for others (e.g. {{MessageInHandler}}). And a name like {{InternodeMessagingPool}} is imo less good than the old {{OutboundTcpConnectionPool}} in that it doesn't indicate it's only for outbound connections. I'd prefer avoiding Client/Server for internode stuff and reserving it for native protocol classes, sticking to In/Out (or Inbound/Outbound when that reads better) for internode classes, like we did before. I understand this may have been to avoid name clashes with existing classes, but maybe that's where we should also decide how long the old implementation is to stay in. 
Then a few more specific comments:
* In {{MessagingService}}:
** We should probably move everything related to {{ServerSocket}} inside the new {{BlockingIOSender}} (though again, if we decide to just ditch the old implementation, it simply goes away).
** Related to the previous point, {{getSocketThreads()}} is used by {{StreamingTransferTest}} to apparently wait for the streaming connections to close. Returning an empty list seems to kind of break that. I think we could replace the method by some {{awaitAllStreamingConnectionsClosed()}} (that would still just be for testing). Of course, that's kind of streaming related.
** I would probably rename {{switchIpAddress()}} to {{reconnectWithNewIp()}} as it makes it more explicit what does happen (or at least have some javadoc). Similarly, {{getCurrentEndpoint()}} is imo not very clearly named (not this patch's fault though), maybe {{getPreferredAddress()}}?
** In {{createConnection()}}, it appears the old implementation was starting the
[jira] [Commented] (CASSANDRA-8457) nio MessagingService
[ https://issues.apache.org/jira/browse/CASSANDRA-8457?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15514498#comment-15514498 ] Jason Brown commented on CASSANDRA-8457:
I've pushed a couple of commits to the same branch, and kicked off tests now.

re: {{LegacyClientHandler}} - it ended up not being too difficult to parse the message "header" to get through to the payload size. It's implemented in {{MessageReceiveHandler}}. I've also removed {{LegacyClientHandler}}, the original message-in classes {{MessageInHandler}} and {{AppendingByteBufInputStream}}, and their tests.

re: flush - yup, dumped my counter thing, and am using {{FlushConsolidationHandler}}. There are still some finer semantics/behaviors to groom over wrt flushing, but this is a better solution already.

re: {{MessagingService}} - As a short-term solution, I created an interface to abstract out all the blocking IO vs netty related stuff in MS. It's not a thing of beauty, but hopefully it's cleaner and won't need to live all that long. I hope we can live with this during the scope of the review.
[jira] [Commented] (CASSANDRA-8457) nio MessagingService
[ https://issues.apache.org/jira/browse/CASSANDRA-8457?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15508103#comment-15508103 ] Jason Brown commented on CASSANDRA-8457:
re: {{LegacyClientHandler}}, and v3 vs v4 (or, "to frame or not to frame")

I'll admit this was a more difficult decision to make in this patch. While I'm not worried (right now) about the size occurring twice in the message (it is only four bytes, and we could move it ;)), I was more concerned about being able to fish out the payload size from *after* the message's parameters, for which we have no size information. A quick scan of the current code shows we only use the message headers for tracing and multi-DC mutations. In both cases, the total data is small (tens of bytes). Of course, we don't know if we have all the bytes for deserializing without a size field. In the case of having headers in a message but not enough bytes in the buffer, we'll then need to wait until more bytes come in and try deserializing again. If we are OK with that possibility (looping a couple of times for more bytes, creating extra garbage), then I have no major concerns about switching to that model, especially if we avoid altering the protocol, which I agree would be great to defer to a different ticket if possible.

bq. Why not just submit a flush task to the channel {{EventLoop}} ...

Yeah, this makes sense, and is much more sane than my counters idea. I spoke with the netty folks, and they recommended using netty's {{FlushConsolidationHandler}}, so I'll give that a shot before trying something custom. (That spotify example was instructive, though, so thanks!)

bq. The changes to {{MessagingService}} ...

tbh, I think we only need the two implementations until either CASSANDRA-12229 or CASSANDRA-8911 resolves how streaming is affected by this ticket (streaming is effectively broken, for now, with the current internode messaging changes because it shares the same listening socket with streaming). 
This assumes, of course, that either of those tickets ships in the same rev as this one, which might not be a great idea. So for now, at a minimum, I'll abstract out the implementations, and we can throw out the legacy one when either of the other two tickets is ready as well. I'm open for discussion on how/if changes to internode messaging should affect streaming (I've discussed it a bit on CASSANDRA-12229). Thanks for the initial comments, I'll have some updated code in a day or so.
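The "wait until more bytes come in and try deserializing again" loop Jason describes can be sketched with a plain {{ByteBuffer}}: attempt a decode, and when the buffer does not yet hold a whole message, rewind and bail out so the caller retries once more bytes arrive. This is an illustrative sketch of the retry pattern, not the actual Cassandra decoder:

```java
import java.nio.ByteBuffer;

// Sketch: try to read a length-prefixed payload; return null (leaving the
// buffer position untouched) when not all bytes have arrived yet.
public class TryDecode
{
    static byte[] tryDecode(ByteBuffer buf)
    {
        if (buf.remaining() < Integer.BYTES)
            return null;                 // size field not fully buffered yet
        buf.mark();
        int size = buf.getInt();
        if (buf.remaining() < size)
        {
            buf.reset();                 // rewind: wait for more bytes, retry later
            return null;
        }
        byte[] payload = new byte[size];
        buf.get(payload);
        return payload;
    }

    public static void main(String[] args)
    {
        ByteBuffer buf = ByteBuffer.allocate(64);
        buf.putInt(3);                   // payload size
        buf.put(new byte[] { 1, 2 });    // only part of the payload so far
        buf.flip();
        assert tryDecode(buf) == null;   // incomplete: caller loops for more bytes

        buf.compact();
        buf.put((byte) 3);               // last byte arrives
        buf.flip();
        byte[] payload = tryDecode(buf);
        assert payload != null && payload.length == 3;
        System.out.println("ok");
    }
}
```

The extra garbage Jason mentions comes from repeating the failed decode attempt each time more bytes arrive; the mark/reset keeps the partial bytes in place between attempts.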
[jira] [Commented] (CASSANDRA-8457) nio MessagingService
[ https://issues.apache.org/jira/browse/CASSANDRA-8457?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15502872#comment-15502872 ] Sylvain Lebresne commented on CASSANDRA-8457:
-
I'm still very much in the process of reviewing this and I'll have more on this later, but I wanted to provide a few early remarks instead of waiting until I've fully digested it all:
* On {{LegacyClientHandler}}, I'm not sure it's really needed as current v3 messages actually do ship with their payload size. Now, granted, said size is actually within the {{MessageIn/MessageOut}} object rather than the "framing" (the distinction between the two being a bit of an implementation detail though tbh), and there is a little bit more work to get to it than if it was the first thing in the "frame" as done by this patch for v4, but it should still be a lot easier than what {{LegacyClientHandler}} does. This also means the v4 format includes the (exact same) payload size twice, which feels wasteful. On that v4 format, I actually think we could stand to rethink that framing/message format a bit as we can easily save a few bytes, *but* that's definitively for a different ticket. But maybe for this ticket it's actually simpler to stick to the v3 format, using the payload size that prefixes said payload.
* Regarding the flushing, I agree relying on counters doesn't feel great. Why not just submit a flush task to the channel {{EventLoop}} (something like what [this|https://github.com/spotify/netty-zmtp/blob/master/src/main/java/com/spotify/netty4/util/BatchFlusher.java] does), which according to the Netty 4 thread model should ensure it happens after any other outstanding writes.
* The changes to {{MessagingService}} feel a bit ugly tbh given that we only ever use one implementation at runtime. I'd prefer making {{MessagingService}} abstract, abstracting the (relatively few) methods that use the connections, and having 2 sub-classes. 
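The BatchFlusher-style idea Sylvain points to — schedule at most one flush task on the event loop per burst of writes, so the flush lands after all outstanding writes — can be imitated with a plain single-threaded executor standing in for a Netty {{EventLoop}}. This is a sketch of the pattern under that stand-in, not the Spotify code or the actual patch:

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicInteger;

// Sketch: each write runs on the "event loop"; a flush task is only queued
// when one isn't already pending, so a burst of writes shares one flush.
public class BatchFlusher
{
    final ExecutorService eventLoop = Executors.newSingleThreadExecutor();
    final AtomicBoolean flushPending = new AtomicBoolean();
    final AtomicInteger flushes = new AtomicInteger();

    void write(Runnable writeOp)
    {
        eventLoop.execute(() -> {
            writeOp.run();
            // Queue one flush per burst; it runs after all writes queued so far,
            // because the single-threaded loop executes tasks in order.
            if (flushPending.compareAndSet(false, true))
                eventLoop.execute(() -> {
                    flushPending.set(false);
                    flushes.incrementAndGet(); // channel.flush() would go here
                });
        });
    }

    public static void main(String[] args) throws Exception
    {
        BatchFlusher f = new BatchFlusher();
        for (int i = 0; i < 100; i++)
            f.write(() -> { /* channel.write(...) stand-in */ });
        f.eventLoop.shutdown();
        f.eventLoop.awaitTermination(5, TimeUnit.SECONDS);
        // Far fewer flushes than writes; at least one, at most one per write.
        assert f.flushes.get() >= 1 && f.flushes.get() <= 100;
        System.out.println("flushes: " + f.flushes.get());
    }
}
```

Because the flush task is queued behind the writes that triggered it, ordering comes for free from the event loop, which is the property Sylvain attributes to the Netty 4 thread model.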
[jira] [Commented] (CASSANDRA-8457) nio MessagingService
[ https://issues.apache.org/jira/browse/CASSANDRA-8457?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15419276#comment-15419276 ] Jason Brown commented on CASSANDRA-8457:
rebased and cleaned up some of the error handling/channel closing based on more discussion with [~norman] and Scott Mitchell.
[jira] [Commented] (CASSANDRA-8457) nio MessagingService
[ https://issues.apache.org/jira/browse/CASSANDRA-8457?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15372251#comment-15372251 ] Jason Brown commented on CASSANDRA-8457:
I've pushed a new commit to the same branch after addressing the comments from [~norman] and Scott Mitchell - thanks guys for the feedback! Tests are running now.
[jira] [Commented] (CASSANDRA-8457) nio MessagingService
[ https://issues.apache.org/jira/browse/CASSANDRA-8457?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15370179#comment-15370179 ] Norman Maurer commented on CASSANDRA-8457:
--
[~jasobrown] I did a first review round and left some comments. All in all it looks solid from a Netty perspective; I can't say a lot about the Cassandra perspective here :) Let me know if you have any questions etc.
[jira] [Commented] (CASSANDRA-8457) nio MessagingService
[ https://issues.apache.org/jira/browse/CASSANDRA-8457?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15368604#comment-15368604 ] Jason Brown commented on CASSANDRA-8457:
Here's the first pass at switching internode messaging to netty.
||8457||
|[branch|https://github.com/jasobrown/cassandra/tree/8457]|
|[dtest|http://cassci.datastax.com/view/Dev/view/jasobrown/job/jasobrown-8457-dtest/]|
|[testall|http://cassci.datastax.com/view/Dev/view/jasobrown/job/jasobrown-8457-testall/]|

I've tried to preserve as much of the functionality/behaviors of the existing implementation as I could, and some aspects were a bit tricky in the non-blocking IO, netty world. I've also documented the code as extensively as I can, and I still want to add more high-level docs on 1) the internode protocol itself, and 2) the use of netty in internode messaging. Hopefully the current state of documentation helps understanding and reviewing the changes. Here are some high-level notes as points of departure/interest/discussion:
- I've left the existing {{OutboundTcpConnection}} code largely intact for the short term (read on for more detail), but mostly the new and existing behaviors coexist in the code together (though not at run time).
- There is a yaml property to enable/disable using netty for internode messaging. If disabled, we'll fall back to the existing {{OutboundTcpConnection}} code. Part of this stems from the fact that streaming uses the same socket infrastructure and handshake as internode messaging, and streaming would be broken without the {{OutboundTcpConnection}} implementation. I am knee-deep in switching streaming over to a non-blocking, netty-based solution, but that is a separate ticket/body of work.
- In order to support non-blocking IO, I've altered the internode messaging protocol such that each message is framed, and the frame contains a message size. 
The protocol change is what forces these changes to happen at a major rev update, hence 4.0.
- Backward compatibility - We will need to handle the case of cluster upgrade where some nodes are on the previous version of the protocol (not upgraded), and some are upgraded. The upgraded nodes will still need to behave and operate correctly with the older nodes, and that functionality is encapsulated and documented in {{LegacyClientHandler}} (for the receive side) and {{MessageOutHandler}} (for the send side).
- Message coalescing - The existing behaviors in {{CoalescingStrategies}} are predicated on parking the thread to allow outbound messages to arrive (and be coalesced). Parking a thread in a non-blocking/netty context is a bad thing, so I've inverted the behavior of message coalescing a bit. Instead of blocking a thread, I've extended the {{CoalescingStrategies.CoalescingStrategy}} implementations to return a 'time to wait' to let messages arrive for sending. I then schedule a task in the netty scheduler to execute that many nanoseconds in the future, queuing up incoming messages, and then send them out when the scheduled task executes (this is {{CoalescingMessageOutHandler}}). I've also added callback functions to the {{CoalescingStrategies.CoalescingStrategy}} implementations for the non-blocking paradigm to record updates to the strategy (for recalculation of the time window, etc).
- Message flushing - Currently in {{OutboundTcpConnection}}, we only call flush on the output stream if the backlog is empty (there are no more messages to send to the peer). Unfortunately there's no equivalent API in netty to know whether there are any messages in the channel waiting to be sent. 
The solution that I've gone with is to have a shared counter outside of the channel ({{InternodeMessagingConnection#outboundCount}}) and inside the channel ({{CoalescingMessageOutHandler#outboundCounter}}), and when {{CoalescingMessageOutHandler}} sees that the value of that counter is zero, it knows it can explicitly call flush. I'm not entirely thrilled with this approach, and there are some potential race/correctness problems (and complexity!) when reconnections occur, so I'm open to suggestions on how to achieve this functionality.
- I've included support for netty's OpenSSL library. The operator will need to deploy an extra netty jar (http://netty.io/wiki/forked-tomcat-native.html) to get the OpenSSL behavior (I'm not sure if we can or want to include it in our distro). {{SSLFactory}} needed to be refactored a bit to support the OpenSSL functionality.

I'll be doing some more extensive testing next week (including a more thorough exploration of the backward compatibility).
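The inverted, non-blocking coalescing described above — queue outgoing messages and schedule one task on the event loop to send everything that accumulated during the coalesce window — can be sketched with a {{ScheduledExecutorService}} standing in for the netty scheduler. Names and the fixed window here are illustrative, not the actual {{CoalescingMessageOutHandler}} code:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Sketch: instead of parking a thread, the first message of a burst opens a
// coalesce window; one scheduled task later sends everything that queued up.
public class CoalescingSend
{
    final ScheduledExecutorService eventLoop = Executors.newSingleThreadScheduledExecutor();
    final List<String> pending = new ArrayList<>();       // only touched on eventLoop
    final List<List<String>> batches = new ArrayList<>(); // what went to the wire
    final long windowNanos = TimeUnit.MILLISECONDS.toNanos(100);

    void enqueue(String msg)
    {
        eventLoop.execute(() -> {
            pending.add(msg);
            if (pending.size() == 1) // first message of the burst opens the window
                eventLoop.schedule(this::sendAll, windowNanos, TimeUnit.NANOSECONDS);
        });
    }

    void sendAll() // runs on the event loop: no locking needed
    {
        batches.add(new ArrayList<>(pending));
        pending.clear();
    }

    public static void main(String[] args) throws Exception
    {
        CoalescingSend c = new CoalescingSend();
        c.enqueue("a");
        c.enqueue("b");
        c.enqueue("c");
        Thread.sleep(300); // let the window expire
        c.eventLoop.shutdown();
        c.eventLoop.awaitTermination(5, TimeUnit.SECONDS);
        int total = c.batches.stream().mapToInt(List::size).sum();
        assert total == 3; // typically one coalesced batch of all three
        System.out.println(c.batches);
    }
}
```

Because both the enqueue and the scheduled send run on the same event loop thread, there is no concurrency between message arrival and the send task, which is the property Jason notes above.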
[jira] [Commented] (CASSANDRA-8457) nio MessagingService
[ https://issues.apache.org/jira/browse/CASSANDRA-8457?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15141370#comment-15141370 ] Norman Maurer commented on CASSANDRA-8457:
--
[~jasobrown] you know where to find me ;)
[jira] [Commented] (CASSANDRA-8457) nio MessagingService
[ https://issues.apache.org/jira/browse/CASSANDRA-8457?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14271701#comment-14271701 ] Ariel Weisberg commented on CASSANDRA-8457: --- Took a stab at writing an adaptive approach to coalescing based on a moving average. Numbers look good for the workloads tested. Code https://github.com/aweisberg/cassandra/compare/6be33289f34782e12229a7621022bb5ce66b2f1b...e48133c4d5acbaa6563ea48a0ca118c278b2f6f7

Testing in AWS, 14 servers, 6 clients. Using a fixed coalescing window at low concurrency there is a drop in performance from 6746 to 3929; with adaptive coalescing I got 6758. At medium concurrency (5 threads per client, 6 clients) I got 31097 with coalescing disabled and 31120 with coalescing enabled. At high concurrency (500 threads per client, 6 clients) I got 479532 with coalescing and 166010 without. This is with a maximum coalescing window of 200 milliseconds.

I added debug output to log when coalescing starts and stops and it's interesting. At the beginning of the benchmark things flap, but they don't flap madly; after a few minutes it settles. I also notice a strange thing where CPU utilization at the start of a benchmark is 500% or so and then after a while it climbs, like something somewhere is warming up or balancing. I recall seeing this in GCE as well.

I had one of the OutboundTcpConnections (first to get the permit) log a trace of all outgoing message times. I threw that into a histogram for informational purposes. 50% of messages are sent within 100 microseconds of each other and 92% are sent within one millisecond. This is without any coalescing. Selected rows (value is the inter-send gap in microseconds):
{noformat}
    Value  Percentile  TotalCount  1/(1-Percentile)
    0.000      0.0000        5554              1.00
    5.703      0.1000      124565              1.11
   13.263      0.2000      249128              1.25
   24.143      0.3000      373630              1.43
   40.607      0.4000      498108              1.67
   94.015      0.5000      622664              2.00
  244.351      0.6000      747137              2.50
  362.239      0.7000      871641              3.33
  428.031      0.7500      933978              4.00
  520.703      0.8000      996254              5.00
  672.767      0.8500     1058457              6.67
  821.247      0.9000     1120774             10.00
 1095.679      0.9500     1183041             20.00
 1354.751      0.9750     1214090             40.00
 1634.303      0.9875     1229643             80.00
 2174.975      0.9961     1240326            256.00
 3131.391      0.9990     1243969           1024.00
{noformat}
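The adaptive scheme can be sketched as an EWMA over message inter-arrival gaps: coalesce only when messages are arriving close together, and size the wait proportionally. This is an illustrative sketch with invented names and constants, not the patch linked above:

```java
// Sketch: adapt the coalescing window to an exponentially weighted moving
// average (EWMA) of outgoing-message inter-arrival gaps. All names and
// constants here are illustrative assumptions.
final class AdaptiveCoalescer {
    private static final double ALPHA = 0.1;   // EWMA smoothing factor
    private final long maxWindowNanos;         // upper bound on how long we wait
    private double avgGapNanos = Double.NaN;   // moving average of gaps
    private long lastArrivalNanos = -1;

    AdaptiveCoalescer(long maxWindowNanos) { this.maxWindowNanos = maxWindowNanos; }

    /** Record an outgoing message; return how many nanos to wait for company. */
    long onMessage(long nowNanos) {
        if (lastArrivalNanos >= 0) {
            long gap = nowNanos - lastArrivalNanos;
            avgGapNanos = Double.isNaN(avgGapNanos) ? gap : ALPHA * gap + (1 - ALPHA) * avgGapNanos;
        }
        lastArrivalNanos = nowNanos;
        if (Double.isNaN(avgGapNanos))
            return 0;                          // first message: send immediately
        if (avgGapNanos > maxWindowNanos)
            return 0;                          // traffic too sparse: coalescing only adds latency
        return (long) Math.min(maxWindowNanos, 2 * avgGapNanos);
    }
}
```

With a 200 microsecond cap this disables itself at low concurrency (large gaps) and kicks in at high concurrency (small gaps), which mirrors the flapping-then-settling behavior described above.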
[jira] [Commented] (CASSANDRA-8457) nio MessagingService
[ https://issues.apache.org/jira/browse/CASSANDRA-8457?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14268221#comment-14268221 ] Ariel Weisberg commented on CASSANDRA-8457: --- I wanted to account for the impact of coalescing at low concurrency. Low concurrency is not a recipe for great performance, but it is part of the out of the box experience and people do compare different databases at low concurrency. In GCE coalescing provided a 12% increase in throughput in this specific message heavy high concurrency workload. The penalty is that at low concurrency there is an immediate loss of performance with any coalescing, and a larger window has a greater impact at low concurrency, so there is tension there: the larger the window, the better the performance bump at high concurrency.

Testing with 3 client threads each running on a dedicated client instance (3 threads total). This is in GCE, with TCP no delay on and coalescing:
||Coalesce window microseconds||Throughput||
|0|2191|
|6|1910|
|12|1873|
|25|1867|
|50|1779|
|100|1667|
|150|1566|
|200|1491|

I also tried disabling coalescing when using HSHA and it didn't seem to make a difference. Surprising considering the impact of 25 microseconds of coalescing intra-cluster. I also experimented with some other things: binding interrupts to cores 0 and 8 and tasksetting C* off of those cores. I didn't see a big impact. I did see a small positive impact using 3 clients and 8 servers, which means the measurements with 2 clients might be a little suspect. With 3 clients and 200 microseconds of coalescing it peaked at 165k in GCE. I also found out that banned CPUs in irqbalance is broken and has no effect, and this has been the case for some time. RightScale does not offer instances with enhanced networking. To find out whether coalescing provides real benefits in EC2/Xen or milder GCE-like benefits I will have to get my hands on some. 
[jira] [Commented] (CASSANDRA-8457) nio MessagingService
[ https://issues.apache.org/jira/browse/CASSANDRA-8457?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14266963#comment-14266963 ] Ariel Weisberg commented on CASSANDRA-8457: --- I ran on GCE. 11 n1-highcpu-16, 2 clients, 9 servers. ubuntu-1404-trusty-v20141212. GCE distributes interrupts to all cores by default. The conclusion is that on GCE at this cluster config coalescing provides some benefit, although not to the extreme degree it was beneficial in EC2 without enhanced networking. The window to coalesce in doesn't have to be large to get the value. Binding interrupts to core 0 has a negative impact on throughput that doesn't change depending on whether no delay is enabled/disabled, but coalescing and binding yields similar throughput to distributing interrupts and coalescing. It's very much an apples to oranges comparison to the EC2 c3.8xlarge, which is a much bigger instance (certainly in terms of RAM and exposed hardware threads) and more likely to benefit from more packet throughput. They are also completely different virtualization technologies, and I guess it shows with things like toggling no delay having no impact in GCE even though performing the coalescing manually does. 
Running with TCP no delay off:
132254
132862

With TCP no delay on and coalescing:
||Coalesce window microseconds||Throughput||
|200|149791|
|150|148755|
|100|147678|
|50|142565|
|25|144542|
|12|141679|
|6|142626|
|0|133905|
|0|132849|

I then tested with all interrupts bound to core 0.

With TCP no delay off:
116157

No delay on, no coalescing:
118134

No delay on, coalescing 200 microseconds:
147257
146453
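The no-delay toggle in these runs corresponds to the standard TCP_NODELAY socket option; in Java it is a one-liner (sketched here with an illustrative wrapper class):

```java
import java.net.Socket;
import java.net.SocketException;

final class NoDelayConfig {
    // Nagle's algorithm batches small writes in the kernel; disabling it
    // ("no delay on") pushes each small message out immediately, which is
    // why application-level coalescing then matters so much.
    static void setNoDelay(Socket socket, boolean noDelay) throws SocketException {
        socket.setTcpNoDelay(noDelay);
    }
}
```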
[jira] [Commented] (CASSANDRA-8457) nio MessagingService
[ https://issues.apache.org/jira/browse/CASSANDRA-8457?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14266422#comment-14266422 ] Ariel Weisberg commented on CASSANDRA-8457: --- https://forums.aws.amazon.com/thread.jspa?messageID=459260 This looks like a Xen/EC2 issue. I'll bet bare metal never has this issue because it has multiple interrupt vectors for NICs. The only work around is using multiple elastic network interfaces and doing something to make that practical for intra-cluster communication. 
[jira] [Commented] (CASSANDRA-8457) nio MessagingService
[ https://issues.apache.org/jira/browse/CASSANDRA-8457?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14266402#comment-14266402 ] Ariel Weisberg commented on CASSANDRA-8457: --- I think I stumbled onto what is going on based on Benedict's suggestion to disable TCP no delay. It looks like there is a small message performance issue. This is something I have seen before in EC2 where you can only send a surprisingly small number of messages in/out of a VM. I don't have the numbers from when I micro benchmarked it, but it is something like 450k messages with TCP no delay and a million or so without. Adding more sockets helps but it doesn't even double the number of messages in/out. Throwing more cores at the problem doesn't help; you just end up with under-utilized cores, which matches the mysterious levels of starvation I was seeing in C* even though I was exposing sufficient concurrency.

14 server nodes, 6 client nodes, 500 threads per client. Server started with row_cache_size_in_mb: 2000, key_cache_size_in_mb: 500, rpc_max_threads: 1024, rpc_min_threads: 16, native_transport_max_threads: 1024, 8-gig old gen, 2-gig new gen. Client running CL=ALL and the same schema I have been using throughout this ticket.

With no delay off, first set of runs:
390264
387958
392322

After replacing 10 instances:
366579
365818
378221

No delay on:
162987

Modified trunk to fix a bug batching messages and add a configurable window for coalescing multiple messages into a socket; see https://github.com/aweisberg/cassandra/compare/f733996...49c6609
||Coalesce window microseconds||Throughput||
|250|502614|
|200|496206|
|150|487195|
|100|423415|
|50|326648|
|25|308175|
|12|292894|
|6|268456|
|0|153688|

I did not expect to get mileage out of coalescing at the application level but it works extremely well. CPU utilization is still low at 1800%. 
There seems to be less correlation between CPU utilization and throughput as I vary the coalescing window and throughput changes dramatically. I do see that core 0 is looking pretty saturated and is only 10% idle. That might be the next or actual bottleneck.

What role this optimization plays at different cluster sizes is an important question. There has to be a tipping point where coalescing stops working because not enough packets go to each endpoint at the same time. With vnodes it wouldn't be unusual to be communicating with a large number of other hosts, right? It also takes a significant amount of additional latency to get the mileage at high levels of throughput, but at lower concurrency there is no benefit and it will probably show up as decreased throughput. That makes it tough to crank it up as a default: either it is adaptive or most people don't get the benefit. At high levels of throughput it is a clear latency win; latency is much lower for individual requests on average.

Making this a config option is viable as a starting point, possibly with a separate option for local/remote DC coalescing. Ideally we could make it adapt to the workload. I am going to chase down what impact coalescing has at lower levels of concurrency so we can quantify the cost of turning it on. I'm also going to try and get to the bottom of all interrupts going to core 0. Maybe it is the real problem and coalescing is just a band-aid to get more throughput. 
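The configurable coalescing window amounts to blocking for the first message, then waiting briefly for stragglers before doing a single socket write. A minimal sketch of that drain loop (method and parameter names are illustrative, not the linked branch):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.TimeUnit;

final class CoalescingDrainer {
    // Block for the first message, then wait up to windowMicros for more,
    // so the caller can flush the whole batch with one write/syscall.
    static <M> List<M> nextBatch(BlockingQueue<M> queue, long windowMicros, int maxBatch)
            throws InterruptedException {
        List<M> batch = new ArrayList<>();
        batch.add(queue.take());                        // first message: block indefinitely
        long deadline = System.nanoTime() + TimeUnit.MICROSECONDS.toNanos(windowMicros);
        while (batch.size() < maxBatch) {
            long remaining = deadline - System.nanoTime();
            if (remaining <= 0)
                break;                                  // window expired
            M m = queue.poll(remaining, TimeUnit.NANOSECONDS);
            if (m == null)
                break;                                  // nothing else arrived in time
            batch.add(m);
        }
        return batch;
    }
}
```

A zero window degenerates to one message per write, matching the |0| rows in the tables above.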
[jira] [Commented] (CASSANDRA-8457) nio MessagingService
[ https://issues.apache.org/jira/browse/CASSANDRA-8457?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14263096#comment-14263096 ] Ariel Weisberg commented on CASSANDRA-8457: --- I can't get performance counters for cache behaviors on EC2 as far as I can tell, and I don't have a good answer for why I get the performance numbers I am seeing. I ran the measurements with CL.QUORUM, ONE, and ALL against trunk and my branch with/without rpc_max_threads increased to 1024. This was prompted by measurements on a 15 node cluster where CL.ONE was 10x faster than CL.ALL. I measured the full matrix on a 9 node cluster and CL.ONE was 5x faster than CL.ALL, which with RF=5 is the expected performance delta.

I definitely see under-utilization. With CL.ONE runs sit right at 1600% and with CL.ALL they don't make it up that high, although trunk does better in that respect. The under-utilization is worse with the modified code that uses SEPExecutor. I may have to run with 15 nodes again to see if the jump from 9 to 15 is what causes CL.ALL to perform worse, or if the difference is that I was using a placement group and 14.04 in the 9 node cluster.

The change to use SEPExecutor for writes was slightly slower to a lot slower in the QUORUM and ALL cases at 9 nodes. I think that is a dead end, but I do wonder if that is because SEPExecutor might not have the same cache-friendly behavior that running dedicated threads does. Dedicated threads require signaling and context switching, but thread scheduling policies could result in the threads servicing each socket always running in the same spot. I am going to try again with netty. I should at least be able to match the performance of trunk with a non-blocking approach so I think it is still worth digging. 
[jira] [Commented] (CASSANDRA-8457) nio MessagingService
[ https://issues.apache.org/jira/browse/CASSANDRA-8457?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14260565#comment-14260565 ] Ariel Weisberg commented on CASSANDRA-8457: --- I reran this with placement groups. Took a while to get cluster setup scripted. Unfortunately I couldn't go past 20 nodes due to a restriction on # of spot instances. Might be able to start regular instances in the same placement group; still working on it. I upped the number of servers to 14 thinking the clients weren't heavily loaded, since I got more throughput per client with only 3 servers and there is not a lot of load on them. This is going from 9 to 14 servers. Throughput was worse, so something is wrong. This is a different kernel; I went from 12.04 with the 3.2 kernel to 14.04 running 3.13, so more than one variable has changed. I wanted to add more clients to see if that would get the throughput up. Something is very fishy. Clients claim they are fully connected to the cluster. They print New Cassandra host /172.31.56.45:9042 added for each server node.
||Run||trunk||modified||
|#1|101,395|114,981|
|#2|123,916|98,308|
|#3|117,831|109,442|
|#4|120,267|111,571|

Going to try and find out why adding capacity did not yield more throughput so I can start running valid benchmarks. 
[jira] [Commented] (CASSANDRA-8457) nio MessagingService
[ https://issues.apache.org/jira/browse/CASSANDRA-8457?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14256293#comment-14256293 ] Ariel Weisberg commented on CASSANDRA-8457: --- Testing on AWS with 6 clients and 9 servers, c3.8xlarge running Ubuntu 12.04 and no placement groups. I changed the workload to run RF=5 with the same data set size and increased the number of operations so the test runs longer. All tests were run on the same set of instances without restarting the instances.

Numbers for trunk are all over the map and range from fantastically faster to slower. The modified version was more consistent but slower. When trunk was faster it had better CPU utilization. There is a correlation between low CPU utilization (not 1600% on a 16 core / 32 thread server) and lower throughput. It kind of suggests there is some kind of starvation or contention preventing tasks from flowing.
||Run||Trunk||Modified||
|#1|147925|119058|
|#2|101740|139155|
|#3|197265|147973|
|#4|105774|#|

Trunk performance is not ideal if only from a consistency perspective. I am going to try and get the performance counters for cache misses and such from perf as well as a profile tomorrow. 
[jira] [Commented] (CASSANDRA-8457) nio MessagingService
[ https://issues.apache.org/jira/browse/CASSANDRA-8457?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14256326#comment-14256326 ] Ariel Weisberg commented on CASSANDRA-8457: --- Repeating these benchmarks, trunk can be slower as often or more often than it is faster. It's one of those peak performance vs. average and worst case sort of decisions.

This is odd. Some measurements running trunk show a soft interrupt hotspot on one CPU. I don't see why that should happen. Interrupt hotspots are usually related to not enough TCP streams and there should be several in this workload. Kind of looks like someone somewhere pinned interrupts to core 0.
{noformat}
11:20:04 PM  CPU    %usr  %nice   %sys  %iowait  %irq  %soft  %steal  %guest   %idle
11:20:05 PM  all   11.19   0.00   9.82     0.00  0.00   0.98    1.27    0.00   76.74
11:20:05 PM    0   18.97   0.00  12.07     0.00  0.00  32.76    1.72    0.00   34.48
11:20:05 PM    1    6.12   0.00  10.20     0.00  0.00   0.00    4.08    0.00   79.59
11:20:05 PM    2    7.55   0.00  11.32     0.00  0.00   0.00    3.77    0.00   77.36
11:20:05 PM    3   11.11   0.00   9.26     0.00  0.00   0.00    1.85    0.00   77.78
(cores 4-31 similar: %soft 0.00 on every core except core 0)
{noformat}
Modified mpstat output is similar. perf stat doesn't have access to performance counters in these instances. I'll have to see if I can get instances that do that. I have a flight recording of each, but not a flight recording of great performance on trunk. 
[jira] [Commented] (CASSANDRA-8457) nio MessagingService
[ https://issues.apache.org/jira/browse/CASSANDRA-8457?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14249678#comment-14249678 ] Benedict commented on CASSANDRA-8457: - A couple of comments on the code:
# Requiring synchronized seems _possible_ to negate any benefit, since we could easily get more threads competing. Certainly the cost seems higher than what we save from a lazySet; my typical strategy for this situation is to set the wakeup flag to true, check if any more work needs to be done, and _if so_ atomically re-adopt the state. So the dispatch task would be a loop, terminating only when there is definitely no work to do.
# If we're worrying about context switching, we should probably switch to CLQ so that producers never conflict. If we're worried about less efficient draining, we can later introduce an MPSC queue with an efficient drainTo().
# droppedUpdater.decrementAndGet() looks like a typo

However I doubt any of these will have a _significant_ impact. The stress test you're running should exercise these pathways in a typical manner. I'm not terribly surprised at a lack of impact, but it is probably worth trying on a much larger cluster to see if more-often-empty queues and a surfeit of threads can elicit a result. 
[jira] [Commented] (CASSANDRA-8457) nio MessagingService
[ https://issues.apache.org/jira/browse/CASSANDRA-8457?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14249854#comment-14249854 ] sankalp kohli commented on CASSANDRA-8457: -- I agree with [~iamaleksey]. We should try this with many nodes. 
[jira] [Commented] (CASSANDRA-8457) nio MessagingService
[ https://issues.apache.org/jira/browse/CASSANDRA-8457?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14249884#comment-14249884 ] T Jake Luciani commented on CASSANDRA-8457: --- This test doesn't deal with the signaling cost within the node. We have a thread blocking on the readExecutors completing. This does tie into CASSANDRA-5239 but part of the goal here is to push the callbacks to the edge without relying on explicit block/signalling calls. 
[jira] [Commented] (CASSANDRA-8457) nio MessagingService
[ https://issues.apache.org/jira/browse/CASSANDRA-8457?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14249930#comment-14249930 ] Ariel Weisberg commented on CASSANDRA-8457: --- #1 The wakeup is protected by a CAS so in the common case there shouldn't be multiple threads contending to dispatch. The synchronized block is there for the case where the thread that is finishing up dispatching signals that it is going to sleep and a dispatch task will be necessary for the next submission. At that point it has to check the queue one more time to avoid lost wakeups, and it is possible a new dispatch task will be created while that is happening. The synchronized forces the new task to wait while the last check and drain completes. How often this race occurs and blocks a thread I have no idea; I could add a counter and check. The only way to avoid it is to lock while checking the queue-empty condition and updating the needs-wakeup field, or to have a 1:1 mapping between sockets and dispatch threads (AKA not SEP). This would force producers to lock on task submission as well. I don't see how the dispatch task can atomically check that there is no work to do and set the needs-wakeup flag at the same time. At that point is there a reason to use a lock free queue?

#2 I didn't replace the queue because I needed to maintain size for the dropped message functionality and I didn't want to reason about maintaining size non-atomically with queue operations like offer/poll/drainTo. I could give it a whirl. I am also not sure how well iterator.remove in CLQ works, but I can check.

#3 Indeed, this is a typo.

Jake, it definitely doesn't address several sources of signaling, but it should reduce the total # of threads signaled per request. I will profile the two versions today and then add more nodes. For benchmark purposes I could disable the message dropping functionality and use MPSCLinkedQueue from Netty. 
[jira] [Commented] (CASSANDRA-8457) nio MessagingService
[ https://issues.apache.org/jira/browse/CASSANDRA-8457?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14249940#comment-14249940 ] Benedict commented on CASSANDRA-8457: - bq. The only way to avoid it is to lock while checking the queue empty condition and updating the needs wakeup field, or to have a 1:1 mapping between sockets and dispatch threads (AKA not SEP). This would force producers to lock on task submission as well. I don't see how the dispatch task can atomically check that there is no work to do and set the needs wakeup flag at the same time. At that point is there a reason to use a lock free queue?
{code}
private Runnable dispatchTask = new Runnable()
{
    @Override
    public void run()
    {
        while (true)
        {
            while (!backlog.isEmpty())
                dispatchQueue();
            needsWakeup = true;
            if (backlog.isEmpty() || !needsWakeupUpdater.compareAndSet(OutboundTcpConnection.this, TRUE, FALSE))
                return;
        }
    }
};
{code}
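The snippet above is the consumer side; a self-contained sketch of the whole wakeup protocol, with AtomicBoolean standing in for the field updater and illustrative names, looks roughly like this:

```java
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.Executor;
import java.util.concurrent.atomic.AtomicBoolean;

// Sketch of the lock-free wakeup protocol: at most one dispatch task is
// live at a time, and the needsWakeup flag hands off responsibility
// between producers and the dispatcher without losing wakeups.
final class Dispatcher {
    private final Queue<Runnable> backlog = new ConcurrentLinkedQueue<>();
    private final AtomicBoolean needsWakeup = new AtomicBoolean(true); // true = dispatcher parked
    private final Executor executor;

    Dispatcher(Executor executor) { this.executor = executor; }

    void enqueue(Runnable task) {
        backlog.add(task);
        // Producer side: start a dispatch task only if the dispatcher parked
        // and we win the CAS, so exactly one task owns the queue.
        if (needsWakeup.get() && needsWakeup.compareAndSet(true, false))
            executor.execute(this::dispatch);
    }

    private void dispatch() {
        while (true) {
            Runnable r;
            while ((r = backlog.poll()) != null)
                r.run();
            needsWakeup.set(true);
            // Re-check: if work slipped in after we declared ourselves asleep
            // and no producer claimed the wakeup, take it back; otherwise exit.
            if (backlog.isEmpty() || !needsWakeup.compareAndSet(true, false))
                return;
        }
    }
}
```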
[jira] [Commented] (CASSANDRA-8457) nio MessagingService
[ https://issues.apache.org/jira/browse/CASSANDRA-8457?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14249943#comment-14249943 ] Ariel Weisberg commented on CASSANDRA-8457: --- Well there you have it. Thanks Benedict. 
[jira] [Commented] (CASSANDRA-8457) nio MessagingService
[ https://issues.apache.org/jira/browse/CASSANDRA-8457?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14249085#comment-14249085 ] Ariel Weisberg commented on CASSANDRA-8457: --- I have some code and results. https://github.com/aweisberg/cassandra/tree/C-8457

I tested on AWS using a 3 node cluster of c3.8xlarge instances in the same placement group using HVM with Ubuntu 14.04. Other than /etc/security/limits.conf I made no changes to the install, which was the RightScale ServerTemplate Base ServerTemplate for Linux (v14.1.0). Config provided to cstar bootstrap was
{code:JavaScript}
{
  revision: aweisberg/C-8457,
  label: test,
  yaml: key_cache_size_in_mb: 256\nrow_cache_size_in_mb: 2000\ncommitlog_sync: periodic\ncommitlog_sync_batch_window_in_ms: null\ncommitlog_sync_period_in_ms: 1\ncompaction_throughput_mb_per_sec: 0\nconcurrent_compactors: 4,
  env: MAX_HEAP_SIZE=8g\nHEAP_NEWSIZE=2g,
  options: { use_vnodes: true }
}
{
  commitlog_directory: /mnt/ephemeral/commitlog,
  data_file_directories: [ /mnt/ephemeral/datadir ],
  block_devices: [ /dev/mapper/vg--data-ephemeral0 ],
  blockdev_readahead: 128,
  hosts: {
    ec2-54-175-1-84.compute-1.amazonaws.com: { internal_ip: 172.31.49.199, hostname: ec2-54-175-1-84.compute-1.amazonaws.com, seed: true },
    ec2-54-175-32-238.compute-1.amazonaws.com: { internal_ip: 172.31.53.77, hostname: ec2-54-175-32-238.compute-1.amazonaws.com, seed: true },
    ec2-54-175-32-206.compute-1.amazonaws.com: { internal_ip: 172.31.57.63, hostname: ec2-54-175-32-206.compute-1.amazonaws.com, seed: true }
  },
  user: ariel_weisberg,
  name: example1,
  saved_caches_directory: /mnt/ephemeral/caches
}
{code}
To populate data I used
bq. ./cassandra-stress write n=10 -pop seq=1...10 no-wrap -rate threads=50 -col 'n=fixed(1)' -schema 'replication(factor=3)' -node file=$HOME/hosts
To read I used
bq. ./cassandra-stress read n=1000 cl=ALL -pop 'dist=UNIFORM(1...10)' -rate threads=200 -col 'n=fixed(1)' -schema 'replication(factor=3)' -node file=~/hosts

I ran two client instances on two nodes (one per node), also on c3.8xlarge in the same placement group.

Unmodified trunk:
{noformat}
op rate                   : 87497
partition rate            : 87497
row rate                  : 87497
latency mean              : 2.3
latency median            : 2.1
latency 95th percentile   : 3.2
latency 99th percentile   : 3.7
latency 99.9th percentile : 4.5
latency max               : 124.0
total gc count            : 28
total gc mb               : 44299
total gc time (s)         : 1
avg gc time(ms)           : 21
stdev gc time(ms)         : 17
Total operation time      : 00:01:54
END
op rate                   : 87598
partition rate            : 87598
row rate                  : 87598
latency mean              : 2.3
latency median            : 2.1
latency 95th percentile   : 3.2
latency 99th percentile   : 3.8
latency 99.9th percentile : 4.4
latency max               : 124.8
total gc count            : 133
total gc mb               : 211358
total gc time (s)         : 3
avg gc time(ms)           : 20
stdev gc time(ms)         : 17
Total operation time      : 00:01:54
END
{noformat}
Modified:
{noformat}
op rate                   : 87476
partition rate            : 87476
row rate                  : 87476
latency mean              : 2.3
latency median            : 2.1
latency 95th percentile   : 3.2
latency 99th percentile   : 3.7
latency 99.9th percentile : 4.0
latency max               : 130.2
total gc count            : 102
total gc mb               : 165487
total gc time (s)         : 3
avg gc time(ms)           : 25
stdev gc time(ms)         : 21
Total operation time      : 00:01:54
END
op rate                   : 87347
partition rate            : 87347
row rate                  : 87347
latency mean              : 2.3
latency median            : 2.1
latency 95th percentile   : 3.1
latency 99th percentile   : 3.6
latency 99.9th percentile : 3.9
latency max               : 129.2
total gc count            : 59
total gc mb               : 93416
total gc time (s)         : 1
avg gc time(ms)           : 23
stdev gc time(ms)         : 16
Total operation time      : 00:01:54
END
{noformat}
[~benedict] Can you look at the code and the stress params and validate that you think I am measuring what I think I am measuring? I am going to profile the client and server tomorrow to get my bearings on what executing this workload actually looks like. 
nio MessagingService Key: CASSANDRA-8457 URL: https://issues.apache.org/jira/browse/CASSANDRA-8457 Project: Cassandra Issue Type: New Feature
[jira] [Commented] (CASSANDRA-8457) nio MessagingService
[ https://issues.apache.org/jira/browse/CASSANDRA-8457?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14249139#comment-14249139 ] Jonathan Ellis commented on CASSANDRA-8457:
---
TLDR throughput is about the same, but latency variance improves?
[jira] [Commented] (CASSANDRA-8457) nio MessagingService
[ https://issues.apache.org/jira/browse/CASSANDRA-8457?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14249237#comment-14249237 ] Ariel Weisberg commented on CASSANDRA-8457:
---
For me the TLDR is that they look the same. I would have to run a bunch more times, restarting the DB each time, then throw away the best and the worst runs and average the rest to see if there is a difference; if there is one, it is small. I want to understand why the change had zero impact as opposed to an insignificant one.

The number is the run, so matching numbers ran at the same time.

Trunk
||Client||Throughput||99.9%tile||
|A #1|87497|4.5|
|B #1|87598|4.4|
|A #2|89089|3.7|
|B #2|88461|4.1|
|A #3|89002|4.4|
|B #3|88529|4.4|
|restart| | |
|A #4|87140|3.8|
|B #4|87434|4.2|

With changes
||Client||Throughput||99.9%tile||
|A #1|87476|4.0|
|B #1|87347|3.9|
|restart| | |
|A #2|87688|4.0|
|B #2|87849|4.4|
|A #3|87905|4.0|
|B #3|88834|3.9|
|A #4|88132|4.3|
|B #4|88187|3.8|
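The "throw away the best and the worst and then average" procedure described above can be sketched as follows. This is a hypothetical standalone helper, not part of the Cassandra tree; the sample arrays are the client-A throughput columns from the tables above.

```java
import java.util.Arrays;

public class TrimmedMean
{
    // Drop the single best and worst run, then average the rest.
    static double trimmedMean(double[] runs)
    {
        if (runs.length < 3)
            throw new IllegalArgumentException("need at least 3 runs");
        double[] sorted = runs.clone();
        Arrays.sort(sorted);
        double sum = 0;
        for (int i = 1; i < sorted.length - 1; i++) // skip min and max
            sum += sorted[i];
        return sum / (sorted.length - 2);
    }

    public static void main(String[] args)
    {
        double[] trunk   = { 87497, 89089, 89002, 87140 }; // client A, trunk
        double[] patched = { 87476, 87688, 87905, 88132 }; // client A, with changes
        System.out.printf("trunk=%.1f patched=%.1f%n",
                          trimmedMean(trunk), trimmedMean(patched));
    }
}
```

On these four-run samples the trimmed means land within a fraction of a percent of each other, consistent with the "they look the same" reading.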
[jira] [Commented] (CASSANDRA-8457) nio MessagingService
[ https://issues.apache.org/jira/browse/CASSANDRA-8457?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14249242#comment-14249242 ] Aleksey Yeschenko commented on CASSANDRA-8457:
--
Were we expecting any difference at all with just 3 nodes?
[jira] [Commented] (CASSANDRA-8457) nio MessagingService
[ https://issues.apache.org/jira/browse/CASSANDRA-8457?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14243924#comment-14243924 ] Benedict commented on CASSANDRA-8457:
-
bq. cstar doesn't support multiple stress clients
Stress could be modified to support simulating true multiple-client access; I've filed CASSANDRA-8466.

What we really need is to be able to fire up a (much) larger cluster, though. With our current hardware that would probably necessitate multiple VMs per node - say 4, giving a viable cluster of 24, which is probably about the bare minimum for these kinds of tests. This necessarily pollutes the results somewhat, since each VM will have only half a CPU and incur extra thread-signalling penalties, but it's better than nothing. Either that, or we get a bunch of cheapo nodes, or we add EC2 integration.

[~enigmacurry] any plans in the works to introduce support for large clusters?
[jira] [Commented] (CASSANDRA-8457) nio MessagingService
[ https://issues.apache.org/jira/browse/CASSANDRA-8457?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14244239#comment-14244239 ] Ryan McGuire commented on CASSANDRA-8457:
-
I've just tried to bring up a 25 node EC2 cluster on cstar_perf, and it did so without hiccup. We have an automatic tool for installing cstar_perf.tool on EC2; it's still lacking integration with the frontend, though (meaning you can run tests against it with the frontend if we attach the cluster - you just can't create the cluster from the frontend yet).
[jira] [Commented] (CASSANDRA-8457) nio MessagingService
[ https://issues.apache.org/jira/browse/CASSANDRA-8457?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14244258#comment-14244258 ] Ryan McGuire commented on CASSANDRA-8457:
-
@Benedict modifying cstar_perf to run multiple instances per node is a larger task, and I'm wondering how useful that will be (seems like a lot of resource contention / non-real-world variables). Assuming we had the alternative of 100% automated EC2 cluster bootstrapping/teardown, how often would we want to run these larger tests for it to be worth it?
[jira] [Commented] (CASSANDRA-8457) nio MessagingService
[ https://issues.apache.org/jira/browse/CASSANDRA-8457?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14244297#comment-14244297 ] Benedict commented on CASSANDRA-8457:
-
[~enigmacurry] if we could fire up EC2 instances from cstar, that would seem to me to be plenty sufficient for ad hoc testing. We can figure something out internally for a more regular CI approach.
[jira] [Commented] (CASSANDRA-8457) nio MessagingService
[ https://issues.apache.org/jira/browse/CASSANDRA-8457?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14242859#comment-14242859 ] Benedict commented on CASSANDRA-8457:
-
FTR, I strongly doubt _context switching_ is actually as much of a problem as we think, although constraining it is never a bad thing. The big hit we have is _thread signalling_ costs, which is a different but related beast. Certainly the talking point that raised this was discussing system time spent serving context switches, which would definitely be referring to signalling, not the switching itself.

Now, we do use a BlockingQueue for OutboundTcpConnection, which will incur these costs; however, I strongly suspect the impact will be much lower than predicted - especially as the testing done to flag this up was on small clusters with RF=1, where these threads would not be exercised at all. The costs of going to the network itself are likely to exceed the context-switching costs, and naturally permit messages to accumulate in the queue, reducing the number of signals actually needed.

There are then the negative performance implications we have found from small numbers of connections under NIO to consider, so this change could have significant downsides for the majority of deployed clusters (although if we get batching in the client driver we may see these penalties disappear).

To establish if there's likely a benefit to exploit, we could most likely refactor this code comparatively minimally (rather than rewriting to NIO/Netty) to make use of the SharedExecutorPool and establish whether such a positive effect is indeed to be had, as this would reduce the number of threads in flight to those actually serving work on the OTCs. This wouldn't affect the ITC, but I am dubious of their contribution. We should probably also actually test whether this is indeed a problem for clusters at scale performing in-memory CL1 reads.
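The accumulation effect described above - messages piling up in the queue so that fewer wakeup signals are needed - can be illustrated with a minimal sketch. This is a hypothetical example, not Cassandra's actual OutboundTcpConnection: the writer pays at most one blocking take() (and thus at most one thread signal) per batch, and drainTo() sweeps up everything that arrived in the meantime for free.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// Hypothetical sketch of a batching outbound writer.
class BatchingWriter
{
    private final BlockingQueue<byte[]> queue = new LinkedBlockingQueue<>();

    void enqueue(byte[] message)
    {
        queue.add(message);
    }

    int pending()
    {
        return queue.size();
    }

    // One iteration of the writer loop: block for the first message
    // (this is where a thread signal may be paid), then drain whatever
    // else accumulated without any further signalling.
    List<byte[]> nextBatch()
    {
        List<byte[]> batch = new ArrayList<>();
        try
        {
            batch.add(queue.take());
        }
        catch (InterruptedException e)
        {
            Thread.currentThread().interrupt();
            return batch;
        }
        queue.drainTo(batch);
        return batch;
    }
}
```

Under load the batch grows and the per-message signalling cost shrinks, which is why the network write itself, not the queue handoff, tends to dominate.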
[jira] [Commented] (CASSANDRA-8457) nio MessagingService
[ https://issues.apache.org/jira/browse/CASSANDRA-8457?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14243008#comment-14243008 ] Ariel Weisberg commented on CASSANDRA-8457:
---
bq. To establish if there's likely a benefit to exploit, we could most likely refactor this code comparatively minimally (than rewriting to NIO/Netty) to make use of the SharedExecutorPool to establish if such a positive effect is indeed to be had, as this would reduce the number of threads in flight to those actually serving work on the OTCs. This wouldn't affect the ITC, but I am dubious of their contribution. We should probably also actually test if this is indeed a problem from clusters at scale performing in-memory CL1 reads.

I wonder what there is to be gained by having a single socket for inbound/outbound?

Running a representative test will take some doing. cstar doesn't support multiple stress clients, and it seems like the clusters only have 3 nodes? This is another argument for getting some decent-size performance runs working in CI rather than doing one-off manual tests. Having profiling artifacts collected as part of this would also make doing performance research and validation easier. I feel pretty underinformed when we discuss what to do next, due to the lack of profiling information and the lack of canonical/repeatable performance data and workloads.
[jira] [Commented] (CASSANDRA-8457) nio MessagingService
[ https://issues.apache.org/jira/browse/CASSANDRA-8457?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14243095#comment-14243095 ] T Jake Luciani commented on CASSANDRA-8457:
---
bq. Running a representative test will take some doing. cstar doesn't support multiple stress clients and it seems like the clusters only have 3 nodes?
But if you run with RF 1 you can stress the internal network, which is what we are changing in this ticket.
[jira] [Commented] (CASSANDRA-8457) nio MessagingService
[ https://issues.apache.org/jira/browse/CASSANDRA-8457?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14243188#comment-14243188 ] Ariel Weisberg commented on CASSANDRA-8457:
---
Thanks, Jake, that is a good point. 3 nodes is still a problem, as that allows these threads to get a lot hotter than they normally would in a larger cluster. I will try Benedict's suggestion, since that would be easy to put in.