[ https://issues.apache.org/jira/browse/CASSANDRA-14358?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16464566#comment-16464566 ]
Jason Brown commented on CASSANDRA-14358:
-----------------------------------------

[~alienth] nice detective work. It looks like {{10.0.161.40}} hasn't closed the previous sockets/connections. If you still have the logs of {{10.0.161.40}} handy, can you see if, during and after the bounce, there are log statements about {{10.0.107.88}} being down or alive (or restarted)? I'm not looking at the code to check the exact log messages, but typically they are emitted by {{Gossiper}}.

> OutboundTcpConnection can hang for many minutes when nodes restart
> ------------------------------------------------------------------
>
>                 Key: CASSANDRA-14358
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-14358
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Streaming and Messaging
>         Environment: Cassandra 2.1.19 (also reproduced on 3.0.15), running with {{internode_encryption: all}} and the EC2 multi region snitch on Linux 4.13 within the same AWS region. The smallest cluster I've seen the problem on is 12 nodes; it reproduces more reliably on 40+ nodes, and 300 node clusters consistently reproduce it on at least one node.
> So all the connections are SSL and we're connecting on the internal IP addresses (not the public endpoint ones).
> Potentially relevant sysctls:
> {noformat}
> /proc/sys/net/ipv4/tcp_syn_retries = 2
> /proc/sys/net/ipv4/tcp_synack_retries = 5
> /proc/sys/net/ipv4/tcp_keepalive_time = 7200
> /proc/sys/net/ipv4/tcp_keepalive_probes = 9
> /proc/sys/net/ipv4/tcp_keepalive_intvl = 75
> /proc/sys/net/ipv4/tcp_retries2 = 15
> {noformat}
>            Reporter: Joseph Lynch
>            Assignee: Joseph Lynch
>            Priority: Major
>         Attachments: 10 Minute Partition.pdf
>
>
> I've been trying to debug nodes not being able to see each other during longer (~5 minute+) Cassandra restarts in 3.0.x and 2.1.x, which can contribute to {{UnavailableExceptions}} during rolling restarts of 3.0.x and 2.1.x clusters for us. I think I finally have a lead. It appears that prior to trunk (with the awesome Netty refactor) we do not set socket connect timeouts on SSL connections (in 2.1.x, 3.0.x, or 3.11.x), nor, as far as I can tell, do we set {{SO_TIMEOUT}} on outbound connections. I believe this means that we could potentially block forever on {{connect}} or {{recv}} syscalls, and we could block forever on the SSL handshake as well. I think the OS will protect us somewhat (and that may be what's causing the eventual timeout), but given the right network conditions our {{OutboundTcpConnection}} threads can just be stuck, never making any progress, until the OS intervenes.
> I have attached some logs of such a network partition during a rolling restart where an old node in the cluster has a completely foobarred {{OutboundTcpConnection}} for ~10 minutes before finally getting a {{java.net.SocketException: Connection timed out (Write failed)}} and immediately reconnecting successfully. I conclude that the old node is the problem because the new node (the one that restarted) is sending ECHOs to the old node, and the old node is sending ECHOs and REQUEST_RESPONSES to the new node's ECHOs, but the new node never receives the ECHOs. This appears, to me, to indicate that the old node's {{OutboundTcpConnection}} thread is just stuck and can't make any forward progress. By the time we could notice this and slap TRACE logging on, the only thing we see is, ~10 minutes later, a {{SocketException}} inside {{writeConnected}}'s flush and an immediate recovery.
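To make the missing-timeout observation above concrete, here is a minimal Java sketch (not Cassandra's actual code; the class name, parameters, and the 2s/2-minute values are purely illustrative) of the difference between letting an SSL socket factory connect with no bound and connecting a plain socket first with an explicit connect timeout and {{SO_TIMEOUT}}, then layering TLS on top:

{noformat}
import javax.net.ssl.SSLContext;
import javax.net.ssl.SSLSocket;
import java.net.InetSocketAddress;
import java.net.Socket;

public class ConnectTimeoutSketch
{
    // Hypothetical helper, not part of Cassandra: bounds connect, handshake and reads.
    public static SSLSocket connectWithTimeouts(SSLContext ctx, String host, int port) throws Exception
    {
        Socket raw = new Socket();                               // unconnected socket
        raw.connect(new InetSocketAddress(host, port), 2000);    // connect timeout: 2s (illustrative)
        raw.setSoTimeout(120_000);                               // SO_TIMEOUT: blocking reads give up after 2 min

        // Wrap the already-connected socket in TLS; the handshake's reads now honour
        // SO_TIMEOUT instead of blocking indefinitely the way
        // ctx.getSocketFactory().createSocket(host, port) can.
        SSLSocket ssl = (SSLSocket) ctx.getSocketFactory().createSocket(raw, host, port, true);
        ssl.startHandshake();
        return ssl;
    }
}
{noformat}

With the first form, only OS-level limits (SYN retries, keepalives, tcp_retries2) eventually unblock the thread; with the second, both the connect and any blocking read, including the handshake's reads, are bounded.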
> It is interesting to me that the exception happens in {{writeConnected}} and that it's a _connection timeout_ (and since we see {{Write failed}} I believe this can't be a connection reset), because my understanding is that we should have a fully handshaked SSL connection at that point in the code.
> Current theory:
> # "New" node restarts, "Old" node calls [newSocket|https://github.com/apache/cassandra/blob/6f30677b28dcbf82bcd0a291f3294ddf87dafaac/src/java/org/apache/cassandra/net/OutboundTcpConnection.java#L433]
> # Old node starts [creating a new|https://github.com/apache/cassandra/blob/6f30677b28dcbf82bcd0a291f3294ddf87dafaac/src/java/org/apache/cassandra/net/OutboundTcpConnectionPool.java#L141] SSL socket
> # {{SSLFactory}} calls [createSocket|https://github.com/apache/cassandra/blob/cassandra-3.11/src/java/org/apache/cassandra/security/SSLFactory.java#L98], which conveniently calls connect with a default timeout of "forever". We could hang here forever until the OS kills us.
> # If we continue, we get to [writeConnected|https://github.com/apache/cassandra/blob/6f30677b28dcbf82bcd0a291f3294ddf87dafaac/src/java/org/apache/cassandra/net/OutboundTcpConnection.java#L263], which eventually calls [flush|https://github.com/apache/cassandra/blob/6f30677b28dcbf82bcd0a291f3294ddf87dafaac/src/java/org/apache/cassandra/net/OutboundTcpConnection.java#L341] on the output stream and can also hang forever. I think the probability is especially high when a node is restarting and is overwhelmed with SSL handshakes and such.
> I don't fully understand the attached traceback, as it appears we are getting a {{Connection timed out}} from a {{send}} failure (my understanding is that you can only get a connection timeout prior to a send), but I think it's reasonable that we have a timeout configuration issue. I'd like to try to make Cassandra robust to networking issues like this, maybe by:
> # Changing the {{SSLFactory}} {{getSocket}} methods to provide connection timeouts of 2s (equivalent to trunk's [timeout|https://github.com/apache/cassandra/blob/11496039fb18bb45407246602e31740c56d28157/src/java/org/apache/cassandra/net/async/NettyFactory.java#L329])
> # Appropriately setting recv timeouts via {{SO_TIMEOUT}}, maybe something like 2 minutes (in old versions via [setSoTimeout|https://docs.oracle.com/javase/8/docs/api/java/net/Socket.html#setSoTimeout-int-], in trunk via [SO_TIMEOUT|http://netty.io/4.0/api/io/netty/channel/ChannelOption.html#SO_TIMEOUT])
> # Since we can't set send timeouts as far as I know (thanks, Java), adding some kind of watchdog that ensures OutboundTcpConnection is making progress on its queue and, if it makes no progress for ~30s-1m, forces a disconnect (see the sketch below).
> If anyone has insight or suggestions, I'd be grateful. I am going to rule out whether this is keepalive duration by setting tcp_keepalive_probes to like 1 and maybe tcp_retries2 to like 8 to get more information about the state of the TCP connections the next time this happens. It's a very rare bug, and when it does happen I only have 10 minutes to jump on the nodes and fix it before it fixes itself, so I'll do my best.
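For context on why the OS does eventually intervene: with the reporter's sysctls, an established connection whose peer has vanished keeps retransmitting with exponential backoff until {{tcp_retries2}} is exhausted. A rough, back-of-the-envelope estimate is sketched below (assuming the common 200ms initial retransmission timeout doubling up to a 120s cap; real values depend on the measured RTT). At the default of 15 it comes out to roughly 15 minutes, in line with the observed ~10+ minute hang, and it shows why dropping to 8 would surface the failure in under two minutes.

{noformat}
// Back-of-the-envelope only (not Cassandra code): estimate how long Linux keeps
// retransmitting on an established connection before failing the socket with
// ETIMEDOUT ("Connection timed out"), assuming a 200ms initial RTO that doubles
// up to a 120s cap. Actual timeouts vary with the measured RTT.
public class RetransmitTimeoutEstimate
{
    static double estimateSeconds(int tcpRetries2)
    {
        double rto = 0.2, total = 0;
        for (int i = 0; i <= tcpRetries2; i++)
        {
            total += rto;                     // wait one RTO for each (re)transmission
            rto = Math.min(rto * 2, 120.0);   // exponential backoff, capped at 120s
        }
        return total;
    }

    public static void main(String[] args)
    {
        System.out.println(estimateSeconds(15));   // ~924.6s (~15 min), the default
        System.out.println(estimateSeconds(8));    // ~102.2s (~1.7 min), the proposed experiment
    }
}
{noformat}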
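And a rough sketch of the watchdog idea in proposal 3 (purely hypothetical; none of these names exist in Cassandra, and the 60s stall threshold and 10s check interval are illustrative): the sending thread stamps the time whenever it actually writes a message, and a scheduled task forces a disconnect when there is backlog but no recent progress.

{noformat}
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;
import java.util.function.BooleanSupplier;

// Hypothetical sketch, not Cassandra code: detect a stalled outbound connection
// and force a disconnect so the stuck thread gets an exception and can reconnect.
public class ConnectionWatchdog
{
    private static final long STALL_THRESHOLD_NANOS = TimeUnit.SECONDS.toNanos(60);

    private final AtomicLong lastProgressNanos = new AtomicLong(System.nanoTime());
    private final ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();

    // Called by the sending thread each time it successfully writes a message.
    public void markProgress()
    {
        lastProgressNanos.set(System.nanoTime());
    }

    // queueNonEmpty: reports whether there is backlog waiting to be sent.
    // forceDisconnect: closes the socket so the connection is re-established.
    public void start(BooleanSupplier queueNonEmpty, Runnable forceDisconnect)
    {
        scheduler.scheduleWithFixedDelay(() -> {
            long stalledNanos = System.nanoTime() - lastProgressNanos.get();
            if (queueNonEmpty.getAsBoolean() && stalledNanos > STALL_THRESHOLD_NANOS)
            {
                forceDisconnect.run();   // unblock a thread stuck in connect/handshake/flush
                markProgress();          // avoid firing repeatedly while reconnecting
            }
        }, 10, 10, TimeUnit.SECONDS);
    }

    public void stop()
    {
        scheduler.shutdownNow();
    }
}
{noformat}

Closing the socket from the watchdog thread should turn an indefinite hang in connect, handshake, or flush into a {{SocketException}} on the stuck thread, after which the existing reconnect path can take over.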