[ https://issues.apache.org/jira/browse/CASSANDRA-17744?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Brandon Williams updated CASSANDRA-17744: ----------------------------------------- Bug Category: Parent values: Availability(12983) Complexity: Normal Discovered By: User Report Fix Version/s: 4.0.x Severity: Normal Assignee: Brandon Williams Status: Open (was: Triage Needed) > Gossip issues following completion of upgrade to 4.0.4 > ------------------------------------------------------ > > Key: CASSANDRA-17744 > URL: https://issues.apache.org/jira/browse/CASSANDRA-17744 > Project: Cassandra > Issue Type: Bug > Components: Cluster/Gossip > Reporter: Gil Ganz > Assignee: Brandon Williams > Priority: Normal > Fix For: 4.0.x > > > Recently upgrade a 3.11.9 cluster to 4.0.4 , server are onprem, centos 7 > All went well, the moment the last node came up, all hell broke loose. > Nodes starting seeing each other as down, gossip pending tasks was climbing > in all nodes. > Eventually we discovered we had a lot of outdated seeds addresses in the yaml > file (along with many good ones), after removing all the bad addresses, and > many restarts, we were able to get all nodes up and running. > The bad seeds configuration was there all along in the 3.11 setup, but only > once 4.0.4 was up on all nodes we had an issue. > INFO [Native-Transport-Requests-22] 2022-06-02 20:30:00,285 > QueryProcessor.java:547 - Fully upgraded to at least 4.0.4 > INFO [Messaging-EventLoop-3-18] 2022-06-02 20:30:00,314 > OutboundConnection.java:1150 - > /10.106.26.152:7000(/10.106.26.152:58984)->/10.110.48.16:7000-SMALL_MESSAGES-b7dbabb6 > successfully connected, version = 12, framing = CRC, encryption = unencrypted > INFO [Messaging-EventLoop-3-3] 2022-06-02 20:30:00,412 > InboundConnectionInitiator.java:464 - > /10.110.48.16:7000(/10.110.48.16:60706)->/10.106.26.152:7000-SMALL_MESSAGES-8023af58 > messaging connection established, version = 12, framing = CRC, encryption = > unencrypted > INFO [GossipStage:1] 2022-06-02 20:30:02,214 Gossiper.java:1312 - > InetAddress /10.110.48.16:7000 is now UP > INFO [Messaging-EventLoop-3-15] 2022-06-02 20:30:02,987 > OutboundConnection.java:1150 - > /10.106.26.152:7000(/10.106.26.152:38532)->/10.11.40.208:7000-SMALL_MESSAGES-3a3e215a > successfully connected, version = 12, framing = LZ4, encryption = unencrypted > INFO [Messaging-EventLoop-3-4] 2022-06-02 20:30:03,137 > InboundConnectionInitiator.java:464 - > /10.21.26.123:7000(/10.21.26.123:58102)->/10.106.26.152:7000-SMALL_MESSAGES-6e8f7fd3 > messaging connection established, version = 12, framing = LZ4, encryption = > unencrypted > INFO [Messaging-EventLoop-3-5] 2022-06-02 20:30:03,262 > InboundConnectionInitiator.java:464 - > /10.11.40.208:7000(/10.11.40.208:40720)->/10.106.26.152:7000-SMALL_MESSAGES-ebd509fe > messaging connection established, version = 12, framing = LZ4, encryption = > unencrypted > INFO [GossipStage:1] 2022-06-02 20:30:03,339 StorageService.java:2785 - Node > /10.110.48.16:7000 state jump to NORMAL > INFO [Messaging-EventLoop-3-6] 2022-06-02 20:30:04,533 > InboundConnectionInitiator.java:464 - > /10.40.0.135:7000(/10.40.0.135:43604)->/10.106.26.152:7000-SMALL_MESSAGES-72c1cc35 > messaging connection established, version = 12, framing = LZ4, encryption = > unencrypted > INFO [Messaging-EventLoop-3-6] 2022-06-02 20:30:05,011 > OutboundConnection.java:1150 - > /10.106.26.152:7000(/10.106.26.152:40678)->/10.106.48.97:7000-SMALL_MESSAGES-ad50944c > successfully connected, version = 12, framing = CRC, encryption = unencrypted > INFO [Messaging-EventLoop-3-7] 2022-06-02 20:30:05,011 > InboundConnectionInitiator.java:464 - > /10.106.48.97:7000(/10.106.48.97:57368)->/10.106.26.152:7000-SMALL_MESSAGES-c98513ad > messaging connection established, version = 12, framing = CRC, encryption = > unencrypted > INFO [Messaging-EventLoop-3-8] 2022-06-02 20:30:14,783 NoSpamLogger.java:92 > - /10.106.26.152:7000->/10.20.0.91:7000-URGENT_MESSAGES-[no-channel] failed > to connect > io.netty.channel.ConnectTimeoutException: connection timed out: > /10.20.0.91:7000 > at > io.netty.channel.epoll.AbstractEpollChannel$AbstractEpollUnsafe$2.run(AbstractEpollChannel.java:576) > at io.netty.util.concurrent.PromiseTask.runTask(PromiseTask.java:98) > at > io.netty.util.concurrent.ScheduledFutureTask.run(ScheduledFutureTask.java:170) > at > io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:164) > at > io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:472) > at io.netty.channel.epoll.EpollEventLoop.run(EpollEventLoop.java:384) > at > io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:989) > at > io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74) > at > io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30) > at java.lang.Thread.run(Thread.java:748) > WARN [GossipTasks:1] 2022-06-02 20:30:41,944 Gossiper.java:1036 - Gossip > stage has 1 pending tasks; skipping status check (no nodes will be marked > down) > INFO [Service Thread] 2022-06-02 20:30:43,977 GCInspector.java:299 - GPGC > New GC in 687ms. GenPauseless New Gen: 46198001664 -> 1505914880; > GenPauseless Old Gen: 4413739008 -> 4769091584; > INFO [ScheduledTasks:1] 2022-06-02 20:30:44,700 MessagingMetrics.java:206 - > READ_REQ messages were dropped in last 5000 ms: 34 internal and 1 cross node. > Mean internal dropped latency: 252 ms and Mean cross-node dropped latency: > 268 ms > WARN [GossipTasks:1] 2022-06-02 20:31:17,011 Gossiper.java:1036 - Gossip > stage has 20 pending tasks; skipping status check (no nodes will be marked > down) > WARN [GossipTasks:1] 2022-06-02 20:31:18,112 Gossiper.java:1036 - Gossip > stage has 21 pending tasks; skipping status check (no nodes will be marked > down) > WARN [GossipTasks:1] 2022-06-02 20:31:19,214 Gossiper.java:1036 - Gossip > stage has 21 pending tasks; skipping status check (no nodes will be marked > down) > WARN [GossipTasks:1] 2022-06-02 20:31:20,315 Gossiper.java:1036 - Gossip > stage has 21 pending tasks; skipping status check (no nodes will be marked > down) > WARN [GossipTasks:1] 2022-06-02 20:31:21,416 Gossiper.java:1036 - Gossip > stage has 22 pending tasks; skipping status check (no nodes will be marked > down) > WARN [GossipTasks:1] 2022-06-02 20:31:22,518 Gossiper.java:1036 - Gossip > stage has 21 pending tasks; skipping status check (no nodes will be marked > down) > WARN [GossipTasks:1] 2022-06-02 20:31:23,619 Gossiper.java:1036 - Gossip > stage has 21 pending tasks; skipping status check (no nodes will be marked > down) > WARN [GossipTasks:1] 2022-06-02 20:39:30,445 Gossiper.java:1036 - Gossip > stage has 279 pending tasks; skipping status check (no nodes will be marked > down) > WARN [GossipTasks:1] 2022-06-02 20:39:31,546 Gossiper.java:1036 - Gossip > stage has 280 pending tasks; skipping status check (no nodes will be marked > down) > WARN [MutationStage-96] 2022-06-02 20:39:31,962 NoSpamLogger.java:95 - > /10.106.26.152:7000->/10.110.46.37:7000-SMALL_MESSAGES-60dee8f0 dropping > message of type READ_RSP whose timeout expired before reaching the network > WARN [GossipTasks:1] 2022-06-02 20:39:32,648 Gossiper.java:1036 - Gossip > stage has 279 pending tasks; skipping status check (no nodes will be marked > down) > WARN [GossipTasks:1] 2022-06-02 20:39:33,749 Gossiper.java:1036 - Gossip > stage has 281 pending tasks; skipping status check (no nodes will be marked > down) > WARN [GossipTasks:1] 2022-06-02 20:39:34,850 Gossiper.java:1036 - Gossip > stage has 284 pending tasks; skipping status check (no nodes will be marked > down) > Half an hour later > WARN [GossipTasks:1] 2022-06-02 21:08:52,338 Gossiper.java:1036 - Gossip > stage has 3697 pending tasks; skipping status check (no nodes will be marked > down) > At the same time in debug log (10.20.0.91 was bad seed) > DEBUG [Native-Transport-Requests-10] 2022-06-02 20:30:36,900 > ReadCallback.java:119 - Timed out; received 1 of 2 responses (including data) > DEBUG [Native-Transport-Requests-29] 2022-06-02 20:30:36,914 > ReadCallback.java:119 - Timed out; received 1 of 2 responses (including data) > DEBUG [Native-Transport-Requests-28] 2022-06-02 20:30:36,931 > ReadCallback.java:119 - Timed out; received 1 of 2 responses (including data) > DEBUG [Native-Transport-Requests-41] 2022-06-02 20:30:36,969 > ReadCallback.java:119 - Timed out; received 1 of 2 responses (including data) > DEBUG [Native-Transport-Requests-18] 2022-06-02 20:30:36,977 > ReadCallback.java:119 - Timed out; received 1 of 2 responses (including data) > DEBUG [Native-Transport-Requests-1] 2022-06-02 20:30:37,007 > ReadCallback.java:119 - Timed out; received 1 of 2 responses (including data) > DEBUG [Native-Transport-Requests-11] 2022-06-02 20:30:37,028 > ReadCallback.java:119 - Timed out; received 1 of 2 responses (including data) > DEBUG [Native-Transport-Requests-37] 2022-06-02 20:30:37,034 > ReadCallback.java:119 - Timed out; received 1 of 2 responses (including data) > DEBUG [Native-Transport-Requests-47] 2022-06-02 20:30:37,038 > ReadCallback.java:119 - Timed out; received 1 of 2 responses (including data) > DEBUG [Native-Transport-Requests-38] 2022-06-02 20:30:37,061 > ReadCallback.java:119 - Timed out; received 1 of 2 responses (including data) > DEBUG [Native-Transport-Requests-12] 2022-06-02 20:30:37,066 > ReadCallback.java:119 - Timed out; received 1 of 2 responses (including data) > DEBUG [Native-Transport-Requests-61] 2022-06-02 20:30:37,069 > ReadCallback.java:119 - Timed out; received 1 of 2 responses (including data) > DEBUG [Native-Transport-Requests-58] 2022-06-02 20:30:37,112 > ReadCallback.java:119 - Timed out; received 1 of 2 responses (including data) > DEBUG [Native-Transport-Requests-10] 2022-06-02 20:30:37,117 > ReadCallback.java:119 - Timed out; received 1 of 2 responses (including data) > DEBUG [Native-Transport-Requests-24] 2022-06-02 20:30:37,124 > ReadCallback.java:119 - Timed out; received 1 of 2 responses (including data) > DEBUG [Native-Transport-Requests-30] 2022-06-02 20:30:37,127 > ReadCallback.java:119 - Timed out; received 1 of 2 responses (including data) > DEBUG [Native-Transport-Requests-28] 2022-06-02 20:30:37,147 > ReadCallback.java:119 - Timed out; received 1 of 2 responses (including data) > DEBUG [Native-Transport-Requests-17] 2022-06-02 20:30:37,158 > ReadCallback.java:119 - Timed out; received 1 of 2 responses (including data) > DEBUG [Native-Transport-Requests-16] 2022-06-02 20:30:37,191 > ReadCallback.java:119 - Timed out; received 1 of 2 responses (including data) > DEBUG [Native-Transport-Requests-21] 2022-06-02 20:30:37,204 > ReadCallback.java:119 - Timed out; received 1 of 2 responses (including data) > DEBUG [Native-Transport-Requests-44] 2022-06-02 20:30:37,232 > ReadCallback.java:119 - Timed out; received 1 of 2 responses (including data) > DEBUG [Native-Transport-Requests-31] 2022-06-02 20:30:37,232 > ReadCallback.java:119 - Timed out; received 1 of 2 responses (including data) > DEBUG [Native-Transport-Requests-43] 2022-06-02 20:30:37,243 > ReadCallback.java:119 - Timed out; received 1 of 2 responses (including data) > DEBUG [Native-Transport-Requests-25] 2022-06-02 20:30:37,253 > ReadCallback.java:119 - Timed out; received 1 of 2 responses (including data) > DEBUG [Native-Transport-Requests-29] 2022-06-02 20:30:37,261 > ReadCallback.java:119 - Timed out; received 1 of 2 responses (including data) > DEBUG [Native-Transport-Requests-5] 2022-06-02 20:31:15,803 > ReadCallback.java:119 - Timed out; received 1 of 2 responses (including data) > DEBUG [Native-Transport-Requests-6] 2022-06-02 20:31:15,826 > ReadCallback.java:119 - Timed out; received 1 of 2 responses (including data) > DEBUG [Native-Transport-Requests-32] 2022-06-02 20:31:15,840 > ReadCallback.java:119 - Timed out; received 1 of 2 responses (including data) > DEBUG [Native-Transport-Requests-28] 2022-06-02 20:31:15,867 > ReadCallback.java:119 - Timed out; received 1 of 2 responses (including data) > DEBUG [Native-Transport-Requests-51] 2022-06-02 20:31:15,875 > ReadCallback.java:119 - Timed out; received 1 of 2 responses (including data) > WARN [GossipTasks:1] 2022-06-02 20:31:15,909 Gossiper.java:1036 - Gossip > stage has 19 pending tasks; skipping status check (no nodes will be marked > down) > DEBUG [Native-Transport-Requests-16] 2022-06-02 20:31:15,931 > ReadCallback.java:119 - Timed out; received 1 of 2 responses (including data) > DEBUG [Native-Transport-Requests-46] 2022-06-02 20:31:15,960 > ReadCallback.java:119 - Timed out; received 1 of 2 responses (including data) > INFO [Messaging-EventLoop-3-8] 2022-06-02 20:31:16,005 NoSpamLogger.java:92 > - /10.106.26.152:7000->/10.20.0.91:7000-URGENT_MESSAGES-[no-channel] failed > to connect > io.netty.channel.ConnectTimeoutException: connection timed out: > /10.20.0.91:7000 > at > io.netty.channel.epoll.AbstractEpollChannel$AbstractEpollUnsafe$2.run(AbstractEpollChannel.java:576) > at io.netty.util.concurrent.PromiseTask.runTask(PromiseTask.java:98) > at > io.netty.util.concurrent.ScheduledFutureTask.run(ScheduledFutureTask.java:170) > at > io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:164) > at > io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:472) > at io.netty.channel.epoll.EpollEventLoop.run(EpollEventLoop.java:384) > at > io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:989) > at > io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74) > at > io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30) > at java.lang.Thread.run(Thread.java:748) > DEBUG [Native-Transport-Requests-13] 2022-06-02 20:31:16,092 > ReadCallback.java:119 - Timed out; received 1 of 2 responses (including data) > DEBUG [Native-Transport-Requests-24] 2022-06-02 20:31:16,104 > ReadCallback.java:119 - Timed out; received 1 of 2 responses (including data) > DEBUG [Native-Transport-Requests-9] 2022-06-02 20:31:16,109 > ReadCallback.java:119 - Timed out; received 1 of 2 responses (including data) > DEBUG [Native-Transport-Requests-31] 2022-06-02 20:31:16,113 > ReadCallback.java:119 - Timed out; received 1 of 2 responses (including data) > DEBUG [Native-Transport-Requests-48] 2022-06-02 20:31:16,134 > ReadCallback.java:119 - Timed out; received 1 of 2 responses (including data) > DEBUG [Native-Transport-Requests-3] 2022-06-02 20:31:16,140 > ReadCallback.java:119 - Timed out; received 1 of 2 responses (including data) > DEBUG [Native-Transport-Requests-25] 2022-06-02 20:31:16,155 > ReadCallback.java:119 - Timed out; received 1 of 2 responses (including data) > DEBUG [Native-Transport-Requests-45] 2022-06-02 20:31:16,155 > ReadCallback.java:119 - Timed out; received 1 of 2 responses (including data) > DEBUG [Native-Transport-Requests-70] 2022-06-02 20:31:16,162 > ReadCallback.java:119 - Timed out; received 1 of 2 responses (including data) > DEBUG [Native-Transport-Requests-10] 2022-06-02 20:31:16,773 > ReadCallback.java:119 - Timed out; received 1 of 2 responses (including data) > DEBUG [Native-Transport-Requests-2] 2022-06-02 20:31:16,774 > ReadCallback.java:119 - Timed out; received 1 of 2 responses (including data) -- This message was sent by Atlassian Jira (v8.20.10#820010) --------------------------------------------------------------------- To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org