[jira] [Commented] (CASSANDRA-8789) OutboundTcpConnectionPool should route messages to sockets by size not type
[ https://issues.apache.org/jira/browse/CASSANDRA-8789?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14503945#comment-14503945 ]

Michael Kjellman commented on CASSANDRA-8789:

My testing has shown that relying on message size as a heuristic to determine the channel/socket to write to has adverse effects under load. The problem is that this mixes high-priority "Command" verbs (e.g. GOSSIP_DIGEST_SYN/GOSSIP_DIGEST_ACK), which cannot be delayed in any way due to the current implementation of FailureDetector, with lower-priority "Response/Data" verbs (e.g. MUTATION/READ/REQUEST_RESPONSE). The effect is that nodes will flap and be incorrectly considered DOWN, because Gossip verbs are now queued behind lower-priority messages and fail to be sent in time.

The implementation of MessagingService is "fire and forget"; however, for most messages we do expect some form of ACK. For instance, each MUTATION expects a REQUEST_RESPONSE within a given timeout; otherwise a hint is generated. Here lies the problem: the REQUEST_RESPONSE verb is 6 bytes (with no payload, so it is now considered "small"). We also have INTERNAL_RESPONSE (also 6 bytes). Routing by size, instead of by priority or the old hard-coded Command/Data implementation (sending high-priority messages like GOSSIP over one channel and normal/low-priority messages over another), means that after this change the REQUEST_RESPONSE for each MUTATION will be sent over the same channel that used to be reserved for GOSSIP (or other high-priority Command) verbs. If the kernel buffers back up sufficiently (although we have the NO_DELAY option on the socket, it isn't very difficult under moderate/high load to still saturate the NIC), we've now moved an ACK message for every MUTATION onto the same socket that is sending GOSSIP messages.

Eventually, if we back up with enough small messages, we will likely end up unable to send *important* messages (e.g. GOSSIP_DIGEST_SYN/GOSSIP_DIGEST_ACK), FD will be falsely triggered, and nodes will be incorrectly marked DOWN. Additionally, once we hit this condition, we end up flapping as GOSSIP messages eventually get through, which compounds the problem.

h4. How to reproduce:

I'm unable to figure out the new stress, so I ran the stress from 2.0 against trunk (commit sha 1fab7b785dc5e440a773828ff17e927a1f3c2e5f from 4/20/15) with all defaults except for changing the replication factor from its default of 1 to 3. I'm pretty sure the reason I can't easily reproduce with the new stress is that I can't figure out the command-line parsing to change it from the default of 8 threads back to the 30-thread default of the old stress. While it's crazy to run with 30 threads, this simulates enough traffic on my 2014 MacBook Pro to actually back up the kernel buffers on loopback, which will trigger this.

1) Set up a 3 node ccm cluster locally with all defaults (ccm create tcptest --install-dir=/Users/username/pathto/cassandra-apache/ && ccm populate -n 3 && ccm start)
2) Run stress from 2.0 using all defaults aside from specifying RF=3 (tools/bin/cassandra-stress -l 3)
3) Monitor FailureDetector messages in the logs, overall load written, etc.

h4. Expected Results:

# Without these changes, stress will not time out while inserting data. With this change, I've now observed timeouts starting 50% of the way through the 1 million records.
{noformat}
Operation [303198] retried 10 times - error inserting key 0303198 ((TTransportException): java.net.SocketException: Broken pipe)
{noformat}
# Although MUTATION messages are expected to be dropped under high load etc., GOSSIP messages should not fail to be written to the socket in a timely manner, so that FD (FailureDetector) does not incorrectly mark nodes DOWN.
# The amount of inserted load reported in nodetool ring should be ~250MB using the 2.0 stress tool. On my machine I saw a "final" load of 1.44MB on node(1), and only ~65MB on node(2,3). This is due to FD marking the nodes down, dropping mutations and creating hints. (Additionally, once in this state, memory overhead gets even worse as we generate unnecessary hints, because in the prior design we were able to actually write to the socket.)

h4. Alternative Proposal

I'm 100% on board with using a more priority-based system to better utilize the two channels/sockets we have. For instance:

MUTATION(2),
READ_REPAIR(3),
REQUEST_RESPONSE(2),
REPLICATION_FINISHED(1),
INTERNAL_RESPONSE(1),
COUNTER_MUTATION(2),
GOSSIP_DIGEST_SYN(1),
GOSSIP_DIGEST_ACK(1),
GOSSIP_DIGEST_ACK2(1),

That way we can use the priorities to route small messages like SNAPSHOT, TRUNCATE, GOSSIP_DIGEST_SYN over the high-priority channel and the normal-priority messages over the other channel/socket.
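The alternative proposal in the comment above can be illustrated with a minimal sketch. The verb names and priority numbers are taken from the comment's list; the class name, the Channel enum, and the rule "priority 1 goes to the high-priority channel" are assumptions made for illustration and are not Cassandra's actual API.

```java
// Hypothetical sketch of priority-based routing; not Cassandra's real classes.
final class PriorityRoutingSketch {
    enum Channel { HIGH_PRIORITY, NORMAL }

    // Verbs and priorities copied from the list in the comment; 1 is most urgent.
    enum Verb {
        MUTATION(2), READ_REPAIR(3), REQUEST_RESPONSE(2), REPLICATION_FINISHED(1),
        INTERNAL_RESPONSE(1), COUNTER_MUTATION(2), GOSSIP_DIGEST_SYN(1),
        GOSSIP_DIGEST_ACK(1), GOSSIP_DIGEST_ACK2(1);

        final int priority;

        Verb(int priority) {
            this.priority = priority;
        }
    }

    // Assumed rule: only priority-1 verbs may use the high-priority socket.
    static Channel channelFor(Verb verb) {
        return verb.priority == 1 ? Channel.HIGH_PRIORITY : Channel.NORMAL;
    }

    public static void main(String[] args) {
        // Print the channel each verb would be written to under the assumed rule.
        for (Verb verb : Verb.values()) {
            System.out.println(verb + " -> " + channelFor(verb));
        }
    }
}
```

Under this assumed rule, REQUEST_RESPONSE (priority 2) stays off the socket that carries the gossip verbs, regardless of its 6-byte size.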
[jira] [Commented] (CASSANDRA-8789) OutboundTcpConnectionPool should route messages to sockets by size not type
[ https://issues.apache.org/jira/browse/CASSANDRA-8789?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14503977#comment-14503977 ]

Brandon Williams commented on CASSANDRA-8789:

I agree on priority-based messaging. Gossip is fairly low throughput, but also very important to get delivered and should take priority.

> OutboundTcpConnectionPool should route messages to sockets by size not type
> ---
>
> Key: CASSANDRA-8789
> URL: https://issues.apache.org/jira/browse/CASSANDRA-8789
> Project: Cassandra
> Issue Type: Improvement
> Components: Core
> Reporter: Ariel Weisberg
> Assignee: Ariel Weisberg
> Fix For: 3.0
>
> Attachments: 8789.diff
>
>
> I was looking at this trying to understand what messages flow over which connection.
> For reads the request goes out over the command connection and the response comes back over the ack connection.
> For writes the request goes out over the command connection and the response comes back over the command connection.
> Reads get a dedicated socket for responses. Mutation commands and responses both travel over the same socket along with read requests.
> Sockets are used uni-directional so there are actually four sockets in play and four threads at each node (2 inbounded, 2 outbound).
> CASSANDRA-488 doesn't leave a record of what the impact of this change was. If someone remembers what situations were made better it would be good to know.
> I am not clear on when/how this is helpful. The consumer side shouldn't be blocking so the only head of line blocking issue is the time it takes to transfer data over the wire.
> If message size is the cause of blocking issues then the current design mixes small messages and large messages on the same connection retaining the head of line blocking.
> Read requests share the same connection as write requests (which are large), and write acknowledgments (which are small) share the same connections as write requests. The only winner is read acknowledgements.

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
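The routing rule named in the ticket title (by size, not by type) can be sketched roughly as follows. This is only an illustration: the class, the enum, and in particular the 64 KiB threshold are assumptions, since the thread does not state the cut-off used by the attached patch.

```java
// Hypothetical sketch of size-based socket selection; names and threshold are assumed.
final class SizeRoutingSketch {
    enum Socket { SMALL_MESSAGES, LARGE_MESSAGES }

    // Assumed cut-off; the ticket text does not give the real value.
    static final int LARGE_MESSAGE_THRESHOLD_BYTES = 64 * 1024;

    // Pick a socket from the serialized size alone, ignoring the verb.
    static Socket socketFor(int serializedSizeBytes) {
        if (serializedSizeBytes < 0) {
            throw new IllegalArgumentException("size must be non-negative");
        }
        return serializedSizeBytes > LARGE_MESSAGE_THRESHOLD_BYTES
                ? Socket.LARGE_MESSAGES
                : Socket.SMALL_MESSAGES;
    }

    public static void main(String[] args) {
        // A 6-byte response and a 1 MiB payload end up on different sockets.
        System.out.println("6 bytes -> " + socketFor(6));
        System.out.println("1 MiB   -> " + socketFor(1 << 20));
    }
}
```

With a rule like this, any sufficiently small message shares a socket with every other small message, whatever its verb, which is the behavior debated in the comments.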
[jira] [Commented] (CASSANDRA-8789) OutboundTcpConnectionPool should route messages to sockets by size not type
[ https://issues.apache.org/jira/browse/CASSANDRA-8789?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14504005#comment-14504005 ]

Benedict commented on CASSANDRA-8789:

I don't doubt there are problems with this, but I'm not sure they're significantly worse under the new scheme than the old...

Currently messages are split along the following boundaries: REQUEST_RESPONSE, INTERNAL_RESPONSE, GOSSIP, READ, MUTATION, COUNTER_MUTATION, ANTI_ENTROPY, MIGRATION, MISC, TRACING, READ_REPAIR; READ_RESPONSE is half of the problem messages you highlighted, and in many workloads likely significantly more of a problem than mutations (since with clustering data they have the potential to deliver much larger payloads), and they currently operate on the same channel as gossip. The main difference is that you won't see them on a pure stress write workload; with a mixed workload you would. So if this is a potentially serious problem, it is likely already being exhibited.

I should make clear that I'm not disputing there's a problem - this seems very clearly something we want to avoid. But I don't think we have made matters _worse_ with this ticket (though the profile has perhaps changed).

Introducing "extra" channels that are managed via NIO, for which we have no throughput requirements, only latency, seems like a potential solution to this. Or a priority queue and a capped send buffer size (capped low for slow WAN connections, for instance). I would quite like to see us abstract MessagingService so that not only can the transport be pluggable, but it can be different per end-point (e.g. cross-dc) and per message type. I think all of these endeavours are orthogonal to this ticket, though, and deserve their own.
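The "priority queue and a capped send buffer size" idea from the comment above might look something like the sketch below. Everything here is hypothetical: the class and method names, the priority numbers, and the batching rule are assumptions for illustration, not an existing Cassandra mechanism.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.PriorityBlockingQueue;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical sketch; none of these names exist in Cassandra.
final class PrioritySendQueueSketch {
    /** One queued message; a lower priority number is more urgent. */
    static final class Outbound implements Comparable<Outbound> {
        final String verb;
        final int priority;
        final int sizeBytes;
        final long seq; // arrival order, keeps FIFO within one priority

        Outbound(String verb, int priority, int sizeBytes, long seq) {
            this.verb = verb;
            this.priority = priority;
            this.sizeBytes = sizeBytes;
            this.seq = seq;
        }

        @Override
        public int compareTo(Outbound other) {
            int byPriority = Integer.compare(priority, other.priority);
            return byPriority != 0 ? byPriority : Long.compare(seq, other.seq);
        }
    }

    private final PriorityBlockingQueue<Outbound> queue = new PriorityBlockingQueue<>();
    private final AtomicLong nextSeq = new AtomicLong();
    private final int sendBufferCapBytes;

    PrioritySendQueueSketch(int sendBufferCapBytes) {
        this.sendBufferCapBytes = sendBufferCapBytes;
    }

    void enqueue(String verb, int priority, int sizeBytes) {
        queue.add(new Outbound(verb, priority, sizeBytes, nextSeq.getAndIncrement()));
    }

    /**
     * Removes the next batch to hand to the socket: most urgent first, stopping
     * before the byte cap is exceeded. At least one message is always taken so
     * that an oversized message cannot stall the queue.
     */
    List<String> nextBatch() {
        List<String> batch = new ArrayList<>();
        int bytes = 0;
        Outbound next;
        while ((next = queue.poll()) != null) {
            if (!batch.isEmpty() && bytes + next.sizeBytes > sendBufferCapBytes) {
                queue.add(next); // put it back; seq preserves its original position
                break;
            }
            bytes += next.sizeBytes;
            batch.add(next.verb);
        }
        return batch;
    }
}
```

Because only a capped number of bytes is handed to the socket at a time, a gossip message enqueued late still overtakes any mutation traffic that has not yet been moved into the send buffer.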
[jira] [Commented] (CASSANDRA-8789) OutboundTcpConnectionPool should route messages to sockets by size not type
[ https://issues.apache.org/jira/browse/CASSANDRA-8789?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14504050#comment-14504050 ]

Ariel Weisberg commented on CASSANDRA-8789:

I can reproduce this using the 2.0 version of stress, which is interesting. It didn't reproduce with a write-only workload of stress on trunk. The why of that is probably interesting as well. I will look into it more tomorrow.
[jira] [Commented] (CASSANDRA-8789) OutboundTcpConnectionPool should route messages to sockets by size not type
[ https://issues.apache.org/jira/browse/CASSANDRA-8789?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14504158#comment-14504158 ]

Benedict commented on CASSANDRA-8789:

2.0 stress, AFAICR, does not load balance. By default 2.1 does (smart thrift routing round-robins the owning nodes for any token). So all of the writes to the cluster are likely being piped through a single node in the 2.0 experiment (so over just two tcp connections), instead of evenly spread over six.
[jira] [Commented] (CASSANDRA-8789) OutboundTcpConnectionPool should route messages to sockets by size not type
[ https://issues.apache.org/jira/browse/CASSANDRA-8789?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14509731#comment-14509731 ]

Ariel Weisberg commented on CASSANDRA-8789:

[~mkjellman] I tried this with the socket change reverted and initially I thought it mattered, but I think I was swapping when it passed with the change reverted. I tried it three times and both variants do the same thing. The first node OOMs and the heap dump blames tasks sitting in SEPExecutor.

I also ran with flight recorder and checked the node serving client traffic and one of the other nodes. There is some significant blocking on the coordinating node, but the longest pause was 300 milliseconds and the total duration was 2 seconds for a 1 minute period (200 pauses). If I chased those down I bet they would be correlated with GC pauses.

I was able to get 2.1.2 to write hints, but not to fail the same way that trunk does with SEPExecutor OOM. Still digging into why trunk fares worse.
[jira] [Commented] (CASSANDRA-8789) OutboundTcpConnectionPool should route messages to sockets by size not type
[ https://issues.apache.org/jira/browse/CASSANDRA-8789?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14509954#comment-14509954 ]

Michael Kjellman commented on CASSANDRA-8789:

I'm less concerned about hints being generated in general. With the old stress + defaults (and RF=3 to generate lots of MUTATIONs between nodes), hints will be generated because we will never be able to keep up and send all of the REQUEST_RESPONSE messages before we see timeouts. The real concern I have is that Gossiper/FD will kick in and mark healthy, up nodes DOWN simply because we can't get gossip messages out onto the wire as they back up behind all of the REQUEST_RESPONSE messages...
[jira] [Commented] (CASSANDRA-8789) OutboundTcpConnectionPool should route messages to sockets by size not type
[ https://issues.apache.org/jira/browse/CASSANDRA-8789?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14509969#comment-14509969 ]

Aleksey Yeschenko commented on CASSANDRA-8789:

Should we look in the logs before and after for failure detector mentions then?
[jira] [Commented] (CASSANDRA-8789) OutboundTcpConnectionPool should route messages to sockets by size not type
[ https://issues.apache.org/jira/browse/CASSANDRA-8789?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14509975#comment-14509975 ]

Michael Kjellman commented on CASSANDRA-8789:

You should be able to run the old stress for all 1 million rows without FD/Gossip down'ing any nodes. I generally just tail the logs while stress runs to ensure there are no logs from the Gossiper class (assuming the log level is set to the default INFO level).
[jira] [Commented] (CASSANDRA-8789) OutboundTcpConnectionPool should route messages to sockets by size not type
[ https://issues.apache.org/jira/browse/CASSANDRA-8789?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14509978#comment-14509978 ]

Ariel Weisberg commented on CASSANDRA-8789:

This is just an OOM. Nothing special is going on WRT Gossip/FD. What Benedict and I have been positing is that there is no change in behavior from previous versions in terms of what messages are contending for access to the socket for this workload, and I think that I have confirmed that.

That doesn't mean there aren't some conditions where head of line blocking would be an issue for gossip, but I am guessing that they would have to be pretty weird. Even then the real solution might actually be to base failure detection on all incoming messages and not just Gossip. It's a little weird to me that only heartbeats count as liveness/reachability, but it really depends on what you are trying to prove.
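The suggestion above, basing failure detection on all incoming messages rather than only on Gossip, can be sketched as follows. This is a deliberately simplified fixed-timeout model written for illustration; the class and method names are hypothetical, and it does not reproduce how Cassandra's FailureDetector actually computes liveness.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

// Hypothetical sketch: any received message counts as evidence of liveness.
final class AnyMessageLivenessSketch {
    private final ConcurrentMap<String, Long> lastHeardMillis = new ConcurrentHashMap<>();
    private final long timeoutMillis; // assumed fixed timeout, for simplicity

    AnyMessageLivenessSketch(long timeoutMillis) {
        this.timeoutMillis = timeoutMillis;
    }

    // Called for every inbound message from an endpoint, whatever its verb.
    void onMessageReceived(String endpoint, long nowMillis) {
        lastHeardMillis.merge(endpoint, nowMillis, Long::max);
    }

    // An endpoint is considered alive if anything arrived within the timeout.
    boolean isAlive(String endpoint, long nowMillis) {
        Long last = lastHeardMillis.get(endpoint);
        return last != null && nowMillis - last <= timeoutMillis;
    }
}
```

In this model a node that is still delivering mutation responses would not be marked DOWN merely because its gossip messages were delayed; whether that is the right definition of liveness depends, as the comment says, on what you are trying to prove.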
[jira] [Commented] (CASSANDRA-8789) OutboundTcpConnectionPool should route messages to sockets by size not type
[ https://issues.apache.org/jira/browse/CASSANDRA-8789?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14509992#comment-14509992 ]

Ariel Weisberg commented on CASSANDRA-8789:

I can do that with 2.1.2 (I went to 10 million) so it's not a head of line blocking issue since gossip is sharing a socket with mutation responses. I think flight recorder confirms that by showing a rough bound on how long threads are waiting to write to sockets.
[jira] [Commented] (CASSANDRA-8789) OutboundTcpConnectionPool should route messages to sockets by size not type
[ https://issues.apache.org/jira/browse/CASSANDRA-8789?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14510016#comment-14510016 ]

Benedict commented on CASSANDRA-8789:

I should clarify here that I do think MUTATION messages could plausibly delay gossip messages where they couldn't before. However, REQUEST_RESPONSE messages, mentioned above as the potential cause, could always cause head-of-line blocking for gossip messages. So my position is only that the head-of-line blocking concern is not a new one, not that its characteristics are identical.
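To put a rough number on the head-of-line blocking discussed in this comment: a message queued behind other data on the same socket waits at least as long as that data takes to cross the wire. The following is a back-of-the-envelope sketch only; the class name, method name, and the figures in `main` are hypothetical and are not measurements from this ticket.

```java
/** Hypothetical illustration: lower bound on the wait caused by bytes queued ahead on one socket. */
final class HeadOfLineDelay {
    /** Milliseconds a newly queued message waits while bytesAhead drain at the given link rate. */
    static double waitMillis(long bytesAhead, long linkBitsPerSecond) {
        return bytesAhead * 8.0 * 1000.0 / linkBitsPerSecond;
    }

    public static void main(String[] args) {
        // Assumed numbers: 10 MB already queued, 100 Mbit/s link.
        System.out.println(waitMillis(10_000_000L, 100_000_000L) + " ms before the next message is sent");
    }
}
```

The estimate ignores kernel buffering and any processing on the receiving side, so it should be read as a floor rather than a prediction.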
[jira] [Commented] (CASSANDRA-8789) OutboundTcpConnectionPool should route messages to sockets by size not type
[ https://issues.apache.org/jira/browse/CASSANDRA-8789?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14511331#comment-14511331 ]

Ariel Weisberg commented on CASSANDRA-8789:

I was able to reproduce the OOM once in 2.1.2. I have found that the mutation stage is filling up with tasks, and they look like responses to writes. In 2.1.2, when it succeeds, it looks like it is just dropping the messages. The reason it fails at 300k is that some 50k or so get processed and 250k back up, causing the OOM.

We could try to make this more robust against overload, say by having the producer (IncomingTcpConnection) detect overload and start dropping messages without relying on the consumer (MutationStage) to drop them.

I am leaning towards not trying to fix this wart because it requires somewhat unrealistic conditions: there has to be no load balancing, a heap that is too small, and an oversubscribed instance.

[~mkjellman] I created CASSANDRA-9237 for the issue of Gossip sharing a connection with most traffic.
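The producer-side shedding idea in this comment could look roughly like the sketch below. This is not Cassandra code: `SheddingQueue`, `tryEnqueue`, and the capacity are hypothetical names and values chosen for illustration. The only library behaviour relied on is that `ArrayBlockingQueue.offer` returns false instead of blocking when the queue is full.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.atomic.AtomicLong;

/** Hypothetical illustration: a bounded hand-off that drops work when the consumer falls behind. */
final class SheddingQueue<T> {
    private final BlockingQueue<T> queue;
    private final AtomicLong dropped = new AtomicLong();

    SheddingQueue(int capacity) {
        this.queue = new ArrayBlockingQueue<>(capacity);
    }

    /** Called by the producer. Returns false (and counts a drop) instead of buffering without bound. */
    boolean tryEnqueue(T message) {
        if (queue.offer(message)) {
            return true;
        }
        dropped.incrementAndGet();
        return false;
    }

    /** Called by the consumer. Returns null when nothing is queued. */
    T poll() {
        return queue.poll();
    }

    long droppedCount() {
        return dropped.get();
    }

    int size() {
        return queue.size();
    }

    public static void main(String[] args) {
        // Assumed numbers: a consumer that never drains, capacity 4, 10 incoming messages.
        SheddingQueue<Integer> q = new SheddingQueue<>(4);
        for (int i = 0; i < 10; i++) {
            q.tryEnqueue(i);
        }
        System.out.println("queued=" + q.size() + " dropped=" + q.droppedCount());
    }
}
```

The design choice here is that heap usage is bounded by the queue capacity regardless of how slowly the consumer runs, at the cost of dropping messages earlier than the consumer-side timeout would.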
[jira] [Commented] (CASSANDRA-8789) OutboundTcpConnectionPool should route messages to sockets by size not type
[ https://issues.apache.org/jira/browse/CASSANDRA-8789?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14511595#comment-14511595 ]

Pavel Yaskevich commented on CASSANDRA-8789:

[~aweisberg] I kind of lost track of what is going on in this ticket. On one side, Michael is saying that the problem is with prioritization and that he never got anything to OOM at all, which [~benedict] seems to confirm (?), but you keep saying that this is an OOM for you every time. So maybe it's worthwhile to try to figure out how to reproduce the exact problem Michael is talking about instead?

Also, CASSANDRA-9237 seems to try to address the same problem that is caused by this ticket, so why do you need a separate ticket for it instead of re-opening this one and working here? I would understand if you opened a separate ticket for the OOM, though, which sounds like a different problem...
[jira] [Commented] (CASSANDRA-8789) OutboundTcpConnectionPool should route messages to sockets by size not type
[ https://issues.apache.org/jira/browse/CASSANDRA-8789?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14511879#comment-14511879 ]

Ariel Weisberg commented on CASSANDRA-8789:

[~xedin] I tried to reproduce what Michael described and I found a root cause that is different, and it seems to be an issue across multiple versions. IOW I think it is unrelated to this ticket. It's definitely worth reproducing the problem Michael is talking about, which is why I created a ticket for that specific issue. AFAIK no one besides myself has tested with and without this change on trunk and found that it has an impact.

[~mkjellman] if you run this using your reproducer steps and let it hang long enough, do you get the heap dump and OOM? If you revert the change, are you saying everything starts to work for you?
[jira] [Commented] (CASSANDRA-8789) OutboundTcpConnectionPool should route messages to sockets by size not type
[ https://issues.apache.org/jira/browse/CASSANDRA-8789?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14512147#comment-14512147 ]

Michael Kjellman commented on CASSANDRA-8789:

I tried to cleanly revert the following commits to demonstrate that stress functions "as expected" without the changes from CASSANDRA-8789, but I got into conflict hell.

{noformat}
ebd0ae820a3fc7c13d58b6ddb48ba4d26b3fcd65
144644bbf77a546c45db384e2dbc18e13f65c9ce
1caa4f942662cd49609e86e2cd747421a9d71700
16499ca9b0080ea4d3c4ed3bc55c753bacc3c24e
828496492c51d7437b690999205ecc941f41a0a9
{noformat}

I tried to checkout 21bdf8700601f8150e8c13e0b4f71e061822c802, however the build is broken in that commit and it was reverted by jbellis in b25adc765769869d16410f1ca156227745d9b17b.

I next tried to checkout 21bdf8700601f8150e8c13e0b4f71e061822c802-1 (1279009e0e29267d8fc3300071034e2ede6065ca), which I could build and, unlike a few other commits I tried, there were no exceptions logged while inserting data. In this commit, though, I do see issues with stress around 300k rows even with the OutboundTcpConnection changes backed out.

I next checked out 8896a70b015102c212d0a27ed1f4e1f0fabe85c4, which is the commit immediately before 828496492c51d7437b690999205ecc941f41a0a9 for OutboundTcpConnection. Testing against that commit, I'm able to insert all rows as expected and Gossiper does not mark any nodes down during the stress run. This commit, however, was logging intermittent NPEs (although load after stress otherwise looks sane...).

{noformat}
ERROR [CompactionExecutor:2] 2015-04-24 17:48:20,741 CassandraDaemon.java:182 - Exception in thread Thread[CompactionExecutor:2,1,main]
java.lang.NullPointerException: null
    at org.apache.cassandra.io.sstable.format.SSTableReader$Tidier.tidy(SSTableReader.java:1798) ~[main/:na]
{noformat}

h4. Stress Output

{noformat}
Michaels-MacBook-Pro:cassandra-aml mkjellman$ tools/bin/cassandra-stress -l 3
null
total,interval_op_rate,interval_key_rate,latency,95th,99.9th,elapsed_time
31769,3176,3176,1.3,100.7,562.8,10
84820,5305,5305,1.1,76.7,452.8,20
152130,6731,6731,2.1,50.0,2137.6,30
267053,11492,11492,2.4,19.2,2137.6,40
346529,7947,7947,2.4,14.9,2159.1,51
455677,10914,10914,2.3,9.6,2131.6,61
619967,16429,16429,1.8,7.2,203.1,71
796739,17677,17677,1.3,5.3,202.4,81
967800,17106,17106,0.9,5.1,202.2,91
100,3220,3220,0.9,4.8,202.2,95

Averages from the middle 80% of values:
interval_op_rate : 11865
interval_key_rate : 11865
latency median: 2.0
latency 95th percentile : 17.7
latency 99.9th percentile : 1495.2
Total operation time : 00:01:35
END
{noformat}

h4. nodetool ring output

{noformat}
Datacenter: datacenter1
==========
Address    Rack   Status  State   Load       Owns  Token
                                                   3074457345618258602
127.0.0.1  rack1  Up      Normal  294.67 MB  ?     -9223372036854775808
127.0.0.2  rack1  Up      Normal  246.91 MB  ?     -3074457345618258603
127.0.0.3  rack1  Up      Normal  247.19 MB  ?     3074457345618258602
{noformat}

So, I would agree that until we find the source of the regression (unrelated to this ticket) that is causing stress to fail even with the changes to OutboundTcpConnection reverted, we can't move forward with evaluating the merits of changes to the actual TCP socket handling that affect the timely delivery of GOSSIP messages/verbs.
[jira] [Commented] (CASSANDRA-8789) OutboundTcpConnectionPool should route messages to sockets by size not type
[ https://issues.apache.org/jira/browse/CASSANDRA-8789?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14512166#comment-14512166 ]

Michael Kjellman commented on CASSANDRA-8789:

I just tried the following: checkout 8896a70b015102c212d0a27ed1f4e1f0fabe85c4 (on which I'm able to insert all 100k records without issue) and then apply 828496492c51d7437b690999205ecc941f41a0a9 and 144644bbf77a546c45db384e2dbc18e13f65c9ce.

I started seeing failures 1/3 of the way through stress, with messages like the following in the logs:

{noformat}
WARN [GossipTasks:1] 2015-04-24 18:32:16,832 Gossiper.java:685 - Gossip stage has 3 pending tasks; skipping status check (no nodes will be marked down)
INFO [GossipTasks:1] 2015-04-24 18:32:40,995 Gossiper.java:938 - InetAddress /127.0.0.1 is now DOWN
INFO [HANDSHAKE-/127.0.0.1] 2015-04-24 18:32:42,002 OutboundTcpConnection.java:485 - Handshaking version with /127.0.0.1
INFO [HANDSHAKE-/127.0.0.1] 2015-04-24 18:32:47,004 OutboundTcpConnection.java:494 - Cannot handshake version with /127.0.0.1
INFO [HANDSHAKE-/127.0.0.1] 2015-04-24 18:32:47,004 OutboundTcpConnection.java:485 - Handshaking version with /127.0.0.1
INFO [HANDSHAKE-/127.0.0.1] 2015-04-24 18:32:52,005 OutboundTcpConnection.java:494 - Cannot handshake version with /127.0.0.1
INFO [HANDSHAKE-/127.0.0.1] 2015-04-24 18:32:52,010 OutboundTcpConnection.java:485 - Handshaking version with /127.0.0.1
INFO [HANDSHAKE-/127.0.0.1] 2015-04-24 18:32:57,010 OutboundTcpConnection.java:494 - Cannot handshake version with /127.0.0.1
INFO [HANDSHAKE-/127.0.0.1] 2015-04-24 18:32:57,011 OutboundTcpConnection.java:485 - Handshaking version with /127.0.0.1
INFO [HANDSHAKE-/127.0.0.1] 2015-04-24 18:33:02,012 OutboundTcpConnection.java:494 - Cannot handshake version with /127.0.0.1
INFO [HANDSHAKE-/127.0.0.1] 2015-04-24 18:33:02,022 OutboundTcpConnection.java:485 - Handshaking version with /127.0.0.1
INFO [HANDSHAKE-/127.0.0.1] 2015-04-24 18:33:07,023 OutboundTcpConnection.java:494 - Cannot handshake version with /127.0.0.1
{noformat}

So, in summary, I am able to reproduce this: 2.0 stress fails with the changes to OutboundTcpConnection/OutboundTcpConnectionPool (828496492c51d7437b690999205ecc941f41a0a9/144644bbf77a546c45db384e2dbc18e13f65c9ce) applied against 8896a70b015102c212d0a27ed1f4e1f0fabe85c4, a commit on which I can otherwise run cassandra-stress -l 3 without failure.
[jira] [Commented] (CASSANDRA-8789) OutboundTcpConnectionPool should route messages to sockets by size not type
[ https://issues.apache.org/jira/browse/CASSANDRA-8789?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14512523#comment-14512523 ]

Ariel Weisberg commented on CASSANDRA-8789:

I ran exactly what you suggested, except I routed gossip on the large message socket and set the large message threshold to Integer.MAX_VALUE. getConnection() looked like this:

{noformat}
/**
 * returns the appropriate connection based on message type.
 * returns null if a connection could not be established.
 */
OutboundTcpConnection getConnection(MessageOut msg)
{
    if (msg.getStage() == Stage.GOSSIP)
    {
        return largeMessages;
    }
    return msg.payloadSize(smallMessages.getTargetVersion()) > LARGE_MESSAGE_THRESHOLD
           ? largeMessages
           : smallMessages;
}
{noformat}

And it fails in the exact same way.

The fact that you have to pull in the coalescing fixes to get it to fail further confirms my belief that messaging got faster (when there are no network issues), not slower, and that this is what leads to the node hanging. 2.0 doesn't log pending tasks in each stage, so I would have to instrument some more to confirm this is the issue.

Trying to further prove that thesis, I cherry-picked only 144644bbf77a546c45db384e2dbc18e13f65c9ce and it ran 10 million writes with no problem.

That doesn't mean there isn't a head-of-line blocking issue when network connections are genuinely slow. That's why I created CASSANDRA-9237, and I have a couple of ideas for how to make FD less dependent on heartbeats or how to keep gossip messages from being blocked.

Taking it one step further, I added back coalescing, but not the full deal. I just fixed a bug in OutboundTcpConnection where it would never write multiple messages at once without flushing.

{noformat}
diff --git a/src/java/org/apache/cassandra/net/OutboundTcpConnection.java b/src/java/org/apache/cassandra/net/OutboundTcpConnection.java
index cddce07..e90cef8 100644
--- a/src/java/org/apache/cassandra/net/OutboundTcpConnection.java
+++ b/src/java/org/apache/cassandra/net/OutboundTcpConnection.java
@@ -132,7 +132,7 @@ public class OutboundTcpConnection extends Thread
         outer:
         while (true)
         {
-            if (backlog.drainTo(drainedMessages, drainedMessages.size()) == 0)
+            if (backlog.drainTo(drainedMessages, 128) == 0)
             {
                 try
                 {
@@ -142,7 +142,7 @@ public class OutboundTcpConnection extends Thread
                 {
                     throw new AssertionError(e);
                 }
-
+                backlog.drainTo(drainedMessages, 127);
             }
             currentMsgBufferCount = drainedMessages.size();
{noformat}

With this change it fails.

Fundamentally, the changes in this ticket, as Benedict pointed out, are not completely new. Gossip always contended with mutation responses and read responses. The big change is that "small" mutations share a socket with gossip messages.
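The `drainTo` bug fixed by the diff in this comment can be reproduced in isolation. The sketch below assumes the target list is empty when the call is made, so that `drainedMessages.size()` evaluates to 0; the class and helper names are hypothetical, and a plain `LinkedBlockingQueue<String>` stands in for the connection's backlog. `BlockingQueue.drainTo(Collection, int)` removes at most the given number of elements, so a limit of 0 drains nothing.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

/** Hypothetical illustration of the two drainTo calls that appear in the diff. */
final class DrainToDemo {
    /** Mirrors the pre-patch call: the limit is the size of the target list, which is 0 when it is empty. */
    static int drainWithListSizeLimit(BlockingQueue<String> backlog, List<String> drained) {
        return backlog.drainTo(drained, drained.size());
    }

    /** Mirrors the patched call: a fixed upper bound on the batch size. */
    static int drainWithFixedLimit(BlockingQueue<String> backlog, List<String> drained, int max) {
        return backlog.drainTo(drained, max);
    }

    /** Builds a stand-in backlog holding n queued messages. */
    static BlockingQueue<String> backlogOf(int n) {
        BlockingQueue<String> queue = new LinkedBlockingQueue<>();
        for (int i = 0; i < n; i++) {
            queue.add("msg-" + i);
        }
        return queue;
    }

    public static void main(String[] args) {
        BlockingQueue<String> backlog = backlogOf(5);
        List<String> drained = new ArrayList<>();
        System.out.println("list-size limit drained " + drainWithListSizeLimit(backlog, drained));
        System.out.println("fixed limit drained " + drainWithFixedLimit(backlog, drained, 128));
    }
}
```

Under that assumption, the pre-patch call always returns 0, which would explain why messages were only ever picked up one at a time and never batched.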
[jira] [Commented] (CASSANDRA-8789) OutboundTcpConnectionPool should route messages to sockets by size not type
[ https://issues.apache.org/jira/browse/CASSANDRA-8789?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14512525#comment-14512525 ]

Benedict commented on CASSANDRA-8789:

I'm confused by that output and your analysis, so if you could clarify it would be appreciated.

The message on node1 doesn't indicate anything about the TCP connection, only that we have 3 gossip messages on the node that have yet to be processed, meaning the gossip _stage_ (thread pool) is backed up for some reason, possibly due to the node being overloaded.

The second set of messages, on the other hand, seems to indicate that node1 really is suffering difficulty, because the other node cannot re-establish its connection to it after the connection was forcibly closed by the gossiper (though it is possible we have some other problems wrt reconnection that I'm not aware of).
[jira] [Commented] (CASSANDRA-8789) OutboundTcpConnectionPool should route messages to sockets by size not type
[ https://issues.apache.org/jira/browse/CASSANDRA-8789?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14512580#comment-14512580 ]

Michael Kjellman commented on CASSANDRA-8789:

{quote}
Gossip always contended with mutation responses and read responses.
{quote}

No, they didn't. This is why there were two sockets in the first place: a Command socket and a Data socket.

I have said since day one, when I raised this as a concern, that changes to Gossip (large and definitely outside the scope of 3.0) could be made so that this might not be an issue. Today, with these changes and today's Gossip implementation, this is a regression.
[jira] [Commented] (CASSANDRA-8789) OutboundTcpConnectionPool should route messages to sockets by size not type
[ https://issues.apache.org/jira/browse/CASSANDRA-8789?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14512587#comment-14512587 ]

Benedict commented on CASSANDRA-8789:

bq. Gossip always contended with mutation *responses* and read responses.

I suspect there may be an issue with nomenclature here. These statements made by Ariel are both true, but the internal nomenclature for both of these is REQUEST_RESPONSE.
[jira] [Commented] (CASSANDRA-8789) OutboundTcpConnectionPool should route messages to sockets by size not type
[ https://issues.apache.org/jira/browse/CASSANDRA-8789?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14512597#comment-14512597 ]

Benedict commented on CASSANDRA-8789:

FTR, my current perception of this is:

* it does look to me like the increased throughput of the new code is a plausible cause of server degradation in these localhost tests, since we know that the server has no extra shedding logic in place beyond the normal timeout.
** improved shedding should be addressed separately, e.g. CASSANDRA-8518
* that doesn't mean head-of-line blocking isn't a real concern, especially for low bandwidth links
** it does seem likely to already be an issue in 2.0/2.1 to some greater or lesser degree, given the existing combination of gossip with read response data
** however this does change the exposure profile, and especially for smart routed clients it might exacerbate this problem in certain cases
** I don't think the exposure profile is sufficiently different to consider this a regression or to revert the other positive improvements delivered by this change
* I do think we can quite easily manage this by opening a new connection, managed by netty or raw NIO, over which we communicate only gossip messages (or other low frequency, high urgency messages)
** this would in the typical case mean we are using no more connections than 2.1 (though with large mutations/responses we may end up using 50% more connections), but:
*** these connections would not have significant threading impacts
*** nor would they have any impact on the improved throughput delivered by coalescing
* CASSANDRA-9237 is IMO a good place to continue this discussion
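The dedicated gossip connection proposed in this comment can be sketched as a routing decision made before the size check. Everything below is illustrative rather than Cassandra code: `RoutingSketch`, `Channel`, the reduced `Verb` enum, and the threshold value are assumptions introduced for this example, and the sketch says nothing about how the extra connection would be managed (netty or raw NIO).

```java
/** Hypothetical illustration: size-based routing with and without a dedicated gossip channel. */
final class RoutingSketch {
    enum Verb { GOSSIP_DIGEST_SYN, GOSSIP_DIGEST_ACK, MUTATION, READ, REQUEST_RESPONSE }

    enum Channel { GOSSIP, SMALL, LARGE }

    /** Assumed threshold, for illustration only. */
    static final int LARGE_MESSAGE_THRESHOLD = 64 * 1024;

    static boolean isGossip(Verb verb) {
        return verb == Verb.GOSSIP_DIGEST_SYN || verb == Verb.GOSSIP_DIGEST_ACK;
    }

    /** Size-only routing: a small gossip message shares the small-message channel with other traffic. */
    static Channel routeBySize(int payloadSize) {
        return payloadSize > LARGE_MESSAGE_THRESHOLD ? Channel.LARGE : Channel.SMALL;
    }

    /** Variant with a dedicated channel: gossip verbs are separated first, everything else goes by size. */
    static Channel routeWithGossipChannel(Verb verb, int payloadSize) {
        if (isGossip(verb)) {
            return Channel.GOSSIP;
        }
        return routeBySize(payloadSize);
    }

    public static void main(String[] args) {
        System.out.println("size only, gossip syn: " + routeBySize(100));
        System.out.println("size only, write ack:  " + routeBySize(6));
        System.out.println("dedicated, gossip syn: " + routeWithGossipChannel(Verb.GOSSIP_DIGEST_SYN, 100));
        System.out.println("dedicated, write ack:  " + routeWithGossipChannel(Verb.REQUEST_RESPONSE, 6));
    }
}
```

With size-only routing the gossip digest and the 6-byte write acknowledgement end up on the same channel, while the variant keeps them apart without changing how the remaining traffic is split.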
[jira] [Commented] (CASSANDRA-8789) OutboundTcpConnectionPool should route messages to sockets by size not type
[ https://issues.apache.org/jira/browse/CASSANDRA-8789?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14324537#comment-14324537 ]

Benedict commented on CASSANDRA-8789:
-------------------------------------

is this based on latest trunk? got a failed apply. Much prefer github links so this isn't a problem :)
[jira] [Commented] (CASSANDRA-8789) OutboundTcpConnectionPool should route messages to sockets by size not type
[ https://issues.apache.org/jira/browse/CASSANDRA-8789?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14324577#comment-14324577 ]

Ariel Weisberg commented on CASSANDRA-8789:
-------------------------------------------

Well... I avoid rebasing onto trunk frequently because a good chunk of the time I do that I get something that is not working, meaning I can't run a benchmark to evaluate performance. It also means my baseline is slightly more suspect as various things change, and I have to take earlier performance numbers with a grain of salt.

I rebased off of trunk:
https://github.com/aweisberg/cassandra/compare/C-8789-2?expand=1
[jira] [Commented] (CASSANDRA-8789) OutboundTcpConnectionPool should route messages to sockets by size not type
[ https://issues.apache.org/jira/browse/CASSANDRA-8789?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14324597#comment-14324597 ]

Ariel Weisberg commented on CASSANDRA-8789:
-------------------------------------------

I should also add that I originally did this off of C-8692 so that the performance measurements would be meaningful since coalescing is a pre-req for this to a degree.
[jira] [Commented] (CASSANDRA-8789) OutboundTcpConnectionPool should route messages to sockets by size not type
[ https://issues.apache.org/jira/browse/CASSANDRA-8789?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14324683#comment-14324683 ]

Benedict commented on CASSANDRA-8789:
-------------------------------------

It's a nice neat patch. It might be worth adding a comment on the payloadSize memoization noting that we piggyback on the visibility guarantees of the queue we use to pass the message to another thread (since we do always pass it), and that once handed over we should never call payloadSize() again on the thread that has handed off ownership.

When I commit I'll also clean up some legacy cruft, like some generic parameters, and normalise the operation over both connections (in one place we just list them both, in the other two we construct an array and iterate; I'd prefer to do just one). But these are unrelated to this patch.
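The hand-off rule described in the comment above can be sketched as follows. This is a simplified illustration under stated assumptions, not the actual MessageOut code: the `Message` class, its fields and `sendAndMeasure` are hypothetical, and `payload.length` stands in for a costlier serialized-size computation. The only library guarantee relied on is the documented one that actions before placing an object into a `BlockingQueue` happen-before actions after another thread removes it.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

class MemoizedSizeHandoff {
    // Hypothetical stand-in for an outbound message.
    static final class Message {
        private final byte[] payload;
        // Plain, non-volatile cache: safe only because the message is always
        // published to the other thread through a queue, never shared directly.
        private int cachedSize = -1;

        Message(byte[] payload) { this.payload = payload; }

        int payloadSize() {
            if (cachedSize < 0) {
                cachedSize = payload.length; // stand-in for an expensive size computation
            }
            return cachedSize;
        }
    }

    static int sendAndMeasure(byte[] payload) {
        BlockingQueue<Message> queue = new LinkedBlockingQueue<>();
        int[] seenByConsumer = new int[1];

        Thread consumer = new Thread(() -> {
            try {
                // take() happens-after put(), so the memoized value is visible here.
                seenByConsumer[0] = queue.take().payloadSize();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        consumer.start();

        Message message = new Message(payload);
        message.payloadSize(); // producer memoizes, e.g. while choosing a connection
        try {
            queue.put(message); // ownership passes to the consumer here
            // From this point the producer must not call message.payloadSize() again.
            consumer.join();    // join() makes the consumer's write visible to us
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return seenByConsumer[0];
    }

    public static void main(String[] args) {
        System.out.println(sendAndMeasure(new byte[6])); // 6
    }
}
```

If the producer kept calling payloadSize() after the hand-off, both threads could touch the unsynchronized cache field concurrently, which is exactly the situation the proposed code comment is meant to warn against.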
[jira] [Commented] (CASSANDRA-8789) OutboundTcpConnectionPool should route messages to sockets by size not type
[ https://issues.apache.org/jira/browse/CASSANDRA-8789?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14328285#comment-14328285 ]

Ariel Weisberg commented on CASSANDRA-8789:
-------------------------------------------

I added a comment to MessageOut.payloadSize.
[jira] [Commented] (CASSANDRA-8789) OutboundTcpConnectionPool should route messages to sockets by size not type
[ https://issues.apache.org/jira/browse/CASSANDRA-8789?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14349381#comment-14349381 ]

Aleksey Yeschenko commented on CASSANDRA-8789:
----------------------------------------------

Looks neat indeed.

Not trying to stall progress here, but do we have numbers on this (standalone, and/or with 8692 included)? Mostly just curious.
[jira] [Commented] (CASSANDRA-8789) OutboundTcpConnectionPool should route messages to sockets by size not type
[ https://issues.apache.org/jira/browse/CASSANDRA-8789?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14349390#comment-14349390 ]

Ariel Weisberg commented on CASSANDRA-8789:
-------------------------------------------

Yes, it's in this comment:
https://issues.apache.org/jira/browse/CASSANDRA-8789?focusedCommentId=14320467&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14320467
[jira] [Commented] (CASSANDRA-8789) OutboundTcpConnectionPool should route messages to sockets by size not type
[ https://issues.apache.org/jira/browse/CASSANDRA-8789?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14483270#comment-14483270 ]

Ariel Weisberg commented on CASSANDRA-8789:
-------------------------------------------

Good catch. Agreed, it's just for the size estimate, which we don't use for anything other than a heuristic, so current version is fine. I'll set that as the initial value for OutboundTcpConnection.targetVersion.

[Code on github|https://github.com/apache/cassandra/compare/trunk...aweisberg:C-8789-3?expand=1]

Running unit tests now.