[ https://issues.apache.org/jira/browse/CASSANDRA-8789?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14512166#comment-14512166 ]
Michael Kjellman edited comment on CASSANDRA-8789 at 4/25/15 1:41 AM:
----------------------------------------------------------------------

I just tried the following. I checked out 8896a70b015102c212d0a27ed1f4e1f0fabe85c4 (against which I'm able to insert all 100k records without issue) and then applied 828496492c51d7437b690999205ecc941f41a0a9 and 144644bbf77a546c45db384e2dbc18e13f65c9ce.

I started seeing failures about 1/3 of the way through stress, with messages like the following in the logs:

h4. ccm node1 showlog
{noformat}
WARN [GossipTasks:1] 2015-04-24 18:32:16,832 Gossiper.java:685 - Gossip stage has 3 pending tasks; skipping status check (no nodes will be marked down)
INFO [GossipTasks:1] 2015-04-24 18:32:40,995 Gossiper.java:938 - InetAddress /127.0.0.1 is now DOWN
{noformat}

h4. ccm node2 showlog
{noformat}
INFO [HANDSHAKE-/127.0.0.1] 2015-04-24 18:32:42,002 OutboundTcpConnection.java:485 - Handshaking version with /127.0.0.1
INFO [HANDSHAKE-/127.0.0.1] 2015-04-24 18:32:47,004 OutboundTcpConnection.java:494 - Cannot handshake version with /127.0.0.1
INFO [HANDSHAKE-/127.0.0.1] 2015-04-24 18:32:47,004 OutboundTcpConnection.java:485 - Handshaking version with /127.0.0.1
INFO [HANDSHAKE-/127.0.0.1] 2015-04-24 18:32:52,005 OutboundTcpConnection.java:494 - Cannot handshake version with /127.0.0.1
INFO [HANDSHAKE-/127.0.0.1] 2015-04-24 18:32:52,010 OutboundTcpConnection.java:485 - Handshaking version with /127.0.0.1
INFO [HANDSHAKE-/127.0.0.1] 2015-04-24 18:32:57,010 OutboundTcpConnection.java:494 - Cannot handshake version with /127.0.0.1
INFO [HANDSHAKE-/127.0.0.1] 2015-04-24 18:32:57,011 OutboundTcpConnection.java:485 - Handshaking version with /127.0.0.1
INFO [HANDSHAKE-/127.0.0.1] 2015-04-24 18:33:02,012 OutboundTcpConnection.java:494 - Cannot handshake version with /127.0.0.1
INFO [HANDSHAKE-/127.0.0.1] 2015-04-24 18:33:02,022 OutboundTcpConnection.java:485 - Handshaking version with /127.0.0.1
INFO [HANDSHAKE-/127.0.0.1] 2015-04-24 18:33:07,023 OutboundTcpConnection.java:494 - Cannot handshake version with /127.0.0.1
{noformat}

So, in summary: with the changes to OutboundTcpConnection/OutboundTcpConnectionPool (828496492c51d7437b690999205ecc941f41a0a9/144644bbf77a546c45db384e2dbc18e13f65c9ce) applied on top of 8896a70b015102c212d0a27ed1f4e1f0fabe85c4, I am able to cause the Gossiper/FD to mark nodes DOWN and have 2.0 stress fail. Without those changes, I was able to run cassandra-stress -l 3 against 8896a70b015102c212d0a27ed1f4e1f0fabe85c4 without failure (as detailed in my previous comment on this ticket).

> OutboundTcpConnectionPool should route messages to sockets by size not type
> ---------------------------------------------------------------------------
>
>                 Key: CASSANDRA-8789
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-8789
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: Core
>            Reporter: Ariel Weisberg
>            Assignee: Ariel Weisberg
>             Fix For: 3.0
>
>         Attachments: 8789.diff
>
>
> I was looking at this trying to understand what messages flow over which
> connection.
> For reads the request goes out over the command connection and the response
> comes back over the ack connection.
> For writes the request goes out over the command connection and the response
> comes back over the command connection.
> Reads get a dedicated socket for responses. Mutation commands and responses
> both travel over the same socket along with read requests.
> Sockets are used uni-directionally so there are actually four sockets in play
> and four threads at each node (2 inbound, 2 outbound).
> CASSANDRA-488 doesn't leave a record of what the impact of this change was.
> If someone remembers what situations were made better it would be good to
> know.
> I am not clear on when/how this is helpful.
> The consumer side shouldn't be blocking, so the only head-of-line blocking
> issue is the time it takes to transfer data over the wire.
> If message size is the cause of blocking issues then the current design mixes
> small messages and large messages on the same connection, retaining the
> head-of-line blocking.
> Read requests share the same connection as write requests (which are large),
> and write acknowledgments (which are small) share the same connection as
> write requests. The only winner is read acknowledgments.
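
To make the idea in the ticket title concrete, here is a minimal sketch of routing by size rather than by message type. It is not taken from 8789.diff or from Cassandra's code: the class names, the two-connection layout, and the 64 KiB cutoff are all assumptions chosen for illustration, and it is written in Python only to keep it short.

```python
from dataclasses import dataclass, field
from typing import List

# Hypothetical cutoff; the ticket text does not state a value.
LARGE_MESSAGE_THRESHOLD_BYTES = 64 * 1024


@dataclass
class Connection:
    """Stand-in for one uni-directional outbound socket and its queue."""
    name: str
    queued: List[bytes] = field(default_factory=list)

    def enqueue(self, payload: bytes) -> None:
        self.queued.append(payload)


@dataclass
class SizeRoutedPool:
    """Illustrative pool that picks a connection by payload size, not type."""
    small: Connection = field(default_factory=lambda: Connection("small"))
    large: Connection = field(default_factory=lambda: Connection("large"))

    def connection_for(self, payload: bytes) -> Connection:
        # Only payloads strictly above the cutoff use the "large" connection,
        # so small requests and acks never queue behind a bulky mutation.
        if len(payload) > LARGE_MESSAGE_THRESHOLD_BYTES:
            return self.large
        return self.small

    def send(self, payload: bytes) -> str:
        conn = self.connection_for(payload)
        conn.enqueue(payload)
        return conn.name
```

Under these assumptions, a small write acknowledgment and a small read request would share the "small" connection, while a large mutation would go to the "large" one regardless of whether it is a command or a response.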