[jira] [Comment Edited] (CASSANDRA-8789) OutboundTcpConnectionPool should route messages to sockets by size not type
[ https://issues.apache.org/jira/browse/CASSANDRA-8789?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14512587#comment-14512587 ]

Benedict edited comment on CASSANDRA-8789 at 4/25/15 4:50 PM:
--

bq. Gossip always contended with mutation *responses* and read responses.

I suspect there may be an issue with nomenclature here. Both of Ariel's statements are true, but the internal nomenclature for both of these is REQUEST_RESPONSE. To further clarify: the distinction has never been command/data, but command/acknowledgement, where the acknowledgement in the case of a read request includes the entire data for serving that read request.

OutboundTcpConnectionPool should route messages to sockets by size not type
---
Key: CASSANDRA-8789
URL: https://issues.apache.org/jira/browse/CASSANDRA-8789
Project: Cassandra
Issue Type: Improvement
Components: Core
Reporter: Ariel Weisberg
Assignee: Ariel Weisberg
Fix For: 3.0
Attachments: 8789.diff

I was looking at this trying to understand what messages flow over which connection. For reads, the request goes out over the command connection and the response comes back over the ack connection. For writes, the request goes out over the command connection and the response comes back over the command connection. Reads get a dedicated socket for responses; mutation commands and responses both travel over the same socket, along with read requests. Sockets are used uni-directionally, so there are actually four sockets in play and four threads at each node (2 inbound, 2 outbound). CASSANDRA-488 doesn't leave a record of what the impact of this change was. If someone remembers what situations were made better, it would be good to know.

I am not clear on when/how this is helpful. The consumer side shouldn't be blocking, so the only head-of-line blocking issue is the time it takes to transfer data over the wire. If message size is the cause of blocking issues, then the current design mixes small messages and large messages on the same connection, retaining the head-of-line blocking. Read requests share the same connection as write requests (which are large), and write acknowledgments (which are small) share the same connections as write requests. The only winner is read acknowledgements.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
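The routing-by-size idea in the ticket title can be illustrated with a minimal sketch. The class, method, and the 64 KB threshold below are hypothetical illustrations of the concept, not Cassandra's actual OutboundTcpConnectionPool API:

```java
// Hypothetical sketch: pick an outbound socket by serialized message size
// rather than by verb/type, so small messages (acks, gossip) never queue
// behind large payloads. Threshold is an assumed value for illustration.
class SizeBasedRouter {
    static final int LARGE_THRESHOLD_BYTES = 64 * 1024; // assumed cutoff

    enum Socket { SMALL_MESSAGES, LARGE_MESSAGES }

    // Route by payload size: everything over the threshold goes to the
    // large-message socket, everything else to the small-message socket.
    static Socket route(int serializedSizeBytes) {
        return serializedSizeBytes > LARGE_THRESHOLD_BYTES
                ? Socket.LARGE_MESSAGES
                : Socket.SMALL_MESSAGES;
    }
}
```

Under this scheme a write acknowledgement and a large mutation would land on different sockets regardless of their verb, which is the contrast with the existing command/ack split described above.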
[jira] [Comment Edited] (CASSANDRA-8789) OutboundTcpConnectionPool should route messages to sockets by size not type
[ https://issues.apache.org/jira/browse/CASSANDRA-8789?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14512580#comment-14512580 ]

Michael Kjellman edited comment on CASSANDRA-8789 at 4/25/15 4:37 PM:
--

{quote}
Gossip always contended with mutation responses and read responses.
{quote}

No, they didn't. This is why there were two sockets in the first place: a Command socket and a Data socket. I have said since day one, when I raised this as a concern, that changes to Gossip (large, and definitely outside the scope of 3.0) could be made so this might not be an issue. With these changes and today's Gossip implementation, this is a regression.
[jira] [Comment Edited] (CASSANDRA-8789) OutboundTcpConnectionPool should route messages to sockets by size not type
[ https://issues.apache.org/jira/browse/CASSANDRA-8789?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14511879#comment-14511879 ]

Ariel Weisberg edited comment on CASSANDRA-8789 at 4/24/15 10:09 PM:
-

[~xedin] I tried to reproduce what Michael described and found a root cause that is different; it seems to be an issue across multiple versions, both with and without the changes to OTCP. In other words, I think it is unrelated to this ticket. It's definitely worth reproducing the problem Michael is talking about, which is why I created a ticket for that specific issue. AFAIK no one besides myself has tested with and without this change on trunk and found that it has an impact.

[~mkjellman] if you run this using your reproducer steps and let it hang long enough, do you get the heap dump and OOM? If you revert the change, are you saying everything starts to work for you? The reason I think you are seeing the same thing I am is that it flakes out at 300k for the reason I mentioned earlier (only 250k fits on heap).
[jira] [Comment Edited] (CASSANDRA-8789) OutboundTcpConnectionPool should route messages to sockets by size not type
[ https://issues.apache.org/jira/browse/CASSANDRA-8789?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14512166#comment-14512166 ]

Michael Kjellman edited comment on CASSANDRA-8789 at 4/25/15 1:46 AM:
--

I just tried the following: check out 8896a70b015102c212d0a27ed1f4e1f0fabe85c4 (at which I'm able to insert all 100k records without issue) and then apply 828496492c51d7437b690999205ecc941f41a0a9 and 144644bbf77a546c45db384e2dbc18e13f65c9ce. I started seeing failures a third of the way through stress, with messages like the following in the logs:

h4. ccm node1 showlog
{noformat}
WARN [GossipTasks:1] 2015-04-24 18:32:16,832 Gossiper.java:685 - Gossip stage has 3 pending tasks; skipping status check (no nodes will be marked down)
{noformat}

h4. ccm node2 showlog
{noformat}
INFO [GossipTasks:1] 2015-04-24 18:32:40,995 Gossiper.java:938 - InetAddress /127.0.0.1 is now DOWN
INFO [HANDSHAKE-/127.0.0.1] 2015-04-24 18:32:42,002 OutboundTcpConnection.java:485 - Handshaking version with /127.0.0.1
INFO [HANDSHAKE-/127.0.0.1] 2015-04-24 18:32:47,004 OutboundTcpConnection.java:494 - Cannot handshake version with /127.0.0.1
INFO [HANDSHAKE-/127.0.0.1] 2015-04-24 18:32:47,004 OutboundTcpConnection.java:485 - Handshaking version with /127.0.0.1
INFO [HANDSHAKE-/127.0.0.1] 2015-04-24 18:32:52,005 OutboundTcpConnection.java:494 - Cannot handshake version with /127.0.0.1
INFO [HANDSHAKE-/127.0.0.1] 2015-04-24 18:32:52,010 OutboundTcpConnection.java:485 - Handshaking version with /127.0.0.1
INFO [HANDSHAKE-/127.0.0.1] 2015-04-24 18:32:57,010 OutboundTcpConnection.java:494 - Cannot handshake version with /127.0.0.1
INFO [HANDSHAKE-/127.0.0.1] 2015-04-24 18:32:57,011 OutboundTcpConnection.java:485 - Handshaking version with /127.0.0.1
INFO [HANDSHAKE-/127.0.0.1] 2015-04-24 18:33:02,012 OutboundTcpConnection.java:494 - Cannot handshake version with /127.0.0.1
INFO [HANDSHAKE-/127.0.0.1] 2015-04-24 18:33:02,022 OutboundTcpConnection.java:485 - Handshaking version with /127.0.0.1
INFO [HANDSHAKE-/127.0.0.1] 2015-04-24 18:33:07,023 OutboundTcpConnection.java:494 - Cannot handshake version with /127.0.0.1
{noformat}

So, in summary: I am able to cause Gossiper/FD to DOWN nodes and have 2.0 stress fail with the changes to OutboundTcpConnection/OutboundTcpConnectionPool (828496492c51d7437b690999205ecc941f41a0a9/144644bbf77a546c45db384e2dbc18e13f65c9ce) applied against 8896a70b015102c212d0a27ed1f4e1f0fabe85c4, against which (as detailed in a previous comment on this ticket) I was able to run cassandra-stress -l 3 without failure.
[jira] [Comment Edited] (CASSANDRA-8789) OutboundTcpConnectionPool should route messages to sockets by size not type
[ https://issues.apache.org/jira/browse/CASSANDRA-8789?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14511331#comment-14511331 ]

Ariel Weisberg edited comment on CASSANDRA-8789 at 4/24/15 4:58 PM:

I was able to reproduce the OOM once in 2.1.2. I have found that the mutation stage is filling up with tasks, and they look like responses to writes. In 2.1.2, when it succeeds, it looks like it is just dropping the messages. The reason it fails at 300k is that some 50k or so get processed and 250k back up, causing the OOM.

We could try to make this more robust against overload, say by having the producer (IncomingTcpConnection) detect overload and start dropping messages without relying on the consumer (MutationStage) to drop them. I am leaning towards not trying to fix this wart because it requires somewhat unrealistic conditions: there has to be no load balancing, a heap that is too small, and an oversubscribed instance. The appropriate (if flawed) load shedding mechanism is in place, and there are already tickets to deal with the issue of having too much in-flight data.

[~mkjellman] I created CASSANDRA-9237 for the issue of Gossip sharing a connection with most traffic.
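The producer-side shedding suggested above can be sketched minimally. `SheddingProducer`, its queue capacity, and `deliver` are hypothetical stand-ins for an IncomingTcpConnection-style reader handing work to a bounded MutationStage queue, not Cassandra code:

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// Hypothetical sketch: the producer drops a message itself when the
// consumer's queue is saturated, instead of enqueueing unboundedly and
// relying on the consumer to shed load (the pattern that lets 250k
// responses back up on the heap and OOM).
class SheddingProducer {
    private final BlockingQueue<byte[]> mutationStage;
    private long dropped;

    SheddingProducer(int capacity) {
        // Bounded queue models the consumer's limited in-flight budget.
        this.mutationStage = new ArrayBlockingQueue<>(capacity);
    }

    // Returns true if accepted, false if shed due to overload.
    boolean deliver(byte[] message) {
        boolean accepted = mutationStage.offer(message); // non-blocking
        if (!accepted) {
            dropped++; // shed at the producer; nothing accumulates on heap
        }
        return accepted;
    }

    long droppedCount() { return dropped; }
}
```

The key design point is that `offer` never blocks and never grows the queue past its bound, so overload shows up as a drop counter rather than as heap pressure.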
[jira] [Comment Edited] (CASSANDRA-8789) OutboundTcpConnectionPool should route messages to sockets by size not type
[ https://issues.apache.org/jira/browse/CASSANDRA-8789?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14509731#comment-14509731 ] Ariel Weisberg edited comment on CASSANDRA-8789 at 4/23/15 8:46 PM: [~mkjellman] I tried this reverting the socket change and initially I thought it mattered, but I think I was swapping when it passed with the change reverted. I tried it three times and they do the same thing. The first node OOMs and the heap dump blames tasks sitting in SEPExecutor. I also ran with flight recorder and checked the node serving client traffic and one of the other nodes. There is some significant blocking on the coordinating node, but the longest pause was 300 milliseconds and total duration was 2 seconds for a 1 minute period (200 pauses). If I chased those down I bet they are correlated with GC pauses. I was able to get 2.1.2 to write hints, but not to fail the same way that trunk does with SEPExecutor OOM. Still digging into why trunk fares worse. I checked and disabling coalescing and reverting the change to OutboundTcpConnectionPool doesn't make things better. was (Author: aweisberg): [~mkjellman] I tried this reverting the socket change and initially I thought it mattered, but I think I was swapping when it passed with the change reverted. I tried it three times and they do the same thing. The first node OOMs and the heap dump blames tasks sitting in SEPExecutor. I also ran with flight recorder and checked the node serving client traffic and one of the other nodes. There is some significant blocking on the coordinating node, but the longest pause was 300 milliseconds and total duration was 2 seconds for a 1 minute period (200 pauses). If I chased those down I bet they are correlated with GC pauses. I was able to get 2.1.2 to write hints, but not to fail the same way that trunk does with SEPExecutor OOM. Still digging into why trunk fares worse. 
OutboundTcpConnectionPool should route messages to sockets by size not type
---
Key: CASSANDRA-8789
URL: https://issues.apache.org/jira/browse/CASSANDRA-8789
Project: Cassandra
Issue Type: Improvement
Components: Core
Reporter: Ariel Weisberg
Assignee: Ariel Weisberg
Fix For: 3.0
Attachments: 8789.diff

I was looking at this trying to understand which messages flow over which connection. For reads, the request goes out over the command connection and the response comes back over the ack connection. For writes, the request goes out over the command connection and the response comes back over the command connection.

Reads get a dedicated socket for responses. Mutation commands and responses both travel over the same socket along with read requests. Sockets are used unidirectionally, so there are actually four sockets in play and four threads at each node (2 inbound, 2 outbound).

CASSANDRA-488 doesn't leave a record of what the impact of this change was. If someone remembers what situations were made better it would be good to know. I am not clear on when/how this is helpful. The consumer side shouldn't be blocking, so the only head of line blocking issue is the time it takes to transfer data over the wire. If message size is the cause of blocking issues, then the current design mixes small messages and large messages on the same connection, retaining the head of line blocking: read requests share the same connection as write requests (which are large), and write acknowledgments (which are small) share the same connection as write requests. The only winner is read acknowledgements.

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
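The routing-by-size idea in the ticket title can be sketched as follows. This is a hypothetical illustration, not Cassandra's actual OutboundTcpConnectionPool API; the 64 KB threshold and the class and method names are assumptions for the example.

```java
// Hypothetical sketch: route each outbound message to one of two lanes
// (each backed by its own socket) based on serialized size rather than
// message type, so small messages (acks, gossip) never queue behind
// large mutations on a shared connection.
public class SizeBasedPool {
    // Assumed cutoff between "small" and "large" messages; tunable.
    static final int LARGE_THRESHOLD = 64 * 1024;

    enum Lane { SMALL, LARGE }

    static Lane laneFor(int serializedSizeBytes) {
        return serializedSizeBytes >= LARGE_THRESHOLD ? Lane.LARGE : Lane.SMALL;
    }

    public static void main(String[] args) {
        System.out.println(laneFor(100));     // a write ack -> SMALL
        System.out.println(laneFor(1 << 20)); // a 1 MB mutation -> LARGE
    }
}
```

The point of the sketch is that the routing key is the size of the payload, so a small read request and a small write ack share one lane regardless of their types, which is exactly the property the type-based split lacks.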
[jira] [Comment Edited] (CASSANDRA-8789) OutboundTcpConnectionPool should route messages to sockets by size not type
[ https://issues.apache.org/jira/browse/CASSANDRA-8789?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14510016#comment-14510016 ] Benedict edited comment on CASSANDRA-8789 at 4/23/15 10:54 PM:

I should clarify here that I do think MUTATION messages could plausibly delay gossip messages where they couldn't before. However, REQUEST_RESPONSE messages, mentioned above as the potential cause, could always cause head of line blocking for gossip messages. So my position is only that the head of line blocking concern is not a new one, not that its characteristics are identical. I don't, however, have any data or position on what bearing these theoretical analyses have on the perceived issue.

was (Author: benedict):
I should clarify here that I do think MUTATION messages could plausibly delay gossip messages where they couldn't before. However, REQUEST_RESPONSE messages, mentioned above as the potential cause, could always cause head of line blocking for gossip messages. So my position is only that the head of line blocking concern is not a new one, not that its characteristics are identical.
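The head of line blocking being discussed is purely transfer-time delay on the wire, as the ticket description notes. A back-of-envelope calculation makes the magnitude concrete; the link speed and message size here are illustrative assumptions, not measurements from this ticket.

```java
// Illustration of head-of-line blocking on a shared socket: a small ack
// queued behind a large mutation must wait for the large message's entire
// transfer time before it can be sent.
public class HolBlocking {
    static double transferSeconds(long bytes, double linkBytesPerSec) {
        return bytes / linkBytesPerSec;
    }

    public static void main(String[] args) {
        double gbitLink = 125_000_000.0;        // 1 Gbit/s expressed in bytes/sec
        long largeMutation = 10L * 1024 * 1024; // an assumed 10 MB write
        // The ack behind it is delayed by the whole transfer, roughly 84 ms.
        System.out.printf("ack delayed %.1f ms%n",
                1000 * transferSeconds(largeMutation, gbitLink));
    }
}
```

An ~84 ms stall is large relative to gossip intervals and typical response latencies, which is why whether small messages share a socket with large ones matters more than which message *types* share a socket.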
[jira] [Comment Edited] (CASSANDRA-8789) OutboundTcpConnectionPool should route messages to sockets by size not type
[ https://issues.apache.org/jira/browse/CASSANDRA-8789?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14504158#comment-14504158 ] Benedict edited comment on CASSANDRA-8789 at 4/21/15 2:05 AM:

2.0 stress, AFAICR, does not load balance. By default 2.1 does (smart thrift routing round-robins the owning nodes for any token). So all of the writes to the cluster are likely being piped through a single node in the 2.0 experiment (so over just two tcp connections), instead of evenly spread over all three (i.e. six tcp connections).

was (Author: benedict):
2.0 stress, AFAICR, does not load balance. By default 2.1 does (smart thrift routing round-robins the owning nodes for any token). So all of the writes to the cluster are likely being piped through a single node in the 2.0 experiment (so over just two tcp connections), instead of evenly spread over six.
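The round-robin routing Benedict describes can be sketched as below. This is a simplified illustration of the idea (cycling requests across the replicas that own a token), not the actual 2.1 stress or thrift-routing code; the class and method names are invented for the example.

```java
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch: token-aware round-robin routing. Each request for a
// token is sent to the next owning replica in turn, so load (and the tcp
// connections carrying it) spreads over all owners instead of one node.
public class RoundRobinReplicaRouter {
    private final AtomicInteger counter = new AtomicInteger();

    public String pick(List<String> owningReplicas) {
        int i = Math.floorMod(counter.getAndIncrement(), owningReplicas.size());
        return owningReplicas.get(i);
    }

    public static void main(String[] args) {
        RoundRobinReplicaRouter router = new RoundRobinReplicaRouter();
        List<String> replicas = List.of("node1", "node2", "node3");
        for (int n = 0; n < 4; n++) {
            System.out.println(router.pick(replicas)); // cycles node1,node2,node3,node1
        }
    }
}
```

With three owning nodes this produces the "six tcp connections" distribution in the comment (two unidirectional command/ack sockets per coordinator), versus only two when every write is coordinated by one node.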