[ 
https://issues.apache.org/jira/browse/CASSANDRA-17744?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Brandon Williams updated CASSANDRA-17744:
-----------------------------------------
     Bug Category: Parent values: Availability(12983)
       Complexity: Normal
    Discovered By: User Report
    Fix Version/s: 4.0.x
         Severity: Normal
         Assignee: Brandon Williams
           Status: Open  (was: Triage Needed)

> Gossip issues following completion of upgrade to 4.0.4
> ------------------------------------------------------
>
>                 Key: CASSANDRA-17744
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-17744
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Cluster/Gossip
>            Reporter: Gil Ganz
>            Assignee: Brandon Williams
>            Priority: Normal
>             Fix For: 4.0.x
>
>
> Recently upgrade a 3.11.9 cluster to 4.0.4 , server are onprem, centos 7
> All went well, the moment the last node came up, all hell broke loose.
> Nodes starting seeing each other as down, gossip pending tasks was climbing 
> in all nodes.
> Eventually we discovered we had a lot of outdated seeds addresses in the yaml 
> file (along with many good ones), after removing all the bad addresses, and 
> many restarts, we were able to get all nodes up and running.
> The bad seeds configuration was there all along in the 3.11 setup, but only 
> once 4.0.4 was up on all nodes we had an issue.
> INFO  [Native-Transport-Requests-22] 2022-06-02 20:30:00,285 
> QueryProcessor.java:547 - Fully upgraded to at least 4.0.4
> INFO  [Messaging-EventLoop-3-18] 2022-06-02 20:30:00,314 
> OutboundConnection.java:1150 - 
> /10.106.26.152:7000(/10.106.26.152:58984)->/10.110.48.16:7000-SMALL_MESSAGES-b7dbabb6
>  successfully connected, version = 12, framing = CRC, encryption = unencrypted
> INFO  [Messaging-EventLoop-3-3] 2022-06-02 20:30:00,412 
> InboundConnectionInitiator.java:464 - 
> /10.110.48.16:7000(/10.110.48.16:60706)->/10.106.26.152:7000-SMALL_MESSAGES-8023af58
>  messaging connection established, version = 12, framing = CRC, encryption = 
> unencrypted
> INFO  [GossipStage:1] 2022-06-02 20:30:02,214 Gossiper.java:1312 - 
> InetAddress /10.110.48.16:7000 is now UP
> INFO  [Messaging-EventLoop-3-15] 2022-06-02 20:30:02,987 
> OutboundConnection.java:1150 - 
> /10.106.26.152:7000(/10.106.26.152:38532)->/10.11.40.208:7000-SMALL_MESSAGES-3a3e215a
>  successfully connected, version = 12, framing = LZ4, encryption = unencrypted
> INFO  [Messaging-EventLoop-3-4] 2022-06-02 20:30:03,137 
> InboundConnectionInitiator.java:464 - 
> /10.21.26.123:7000(/10.21.26.123:58102)->/10.106.26.152:7000-SMALL_MESSAGES-6e8f7fd3
>  messaging connection established, version = 12, framing = LZ4, encryption = 
> unencrypted
> INFO  [Messaging-EventLoop-3-5] 2022-06-02 20:30:03,262 
> InboundConnectionInitiator.java:464 - 
> /10.11.40.208:7000(/10.11.40.208:40720)->/10.106.26.152:7000-SMALL_MESSAGES-ebd509fe
>  messaging connection established, version = 12, framing = LZ4, encryption = 
> unencrypted
> INFO  [GossipStage:1] 2022-06-02 20:30:03,339 StorageService.java:2785 - Node 
> /10.110.48.16:7000 state jump to NORMAL
> INFO  [Messaging-EventLoop-3-6] 2022-06-02 20:30:04,533 
> InboundConnectionInitiator.java:464 - 
> /10.40.0.135:7000(/10.40.0.135:43604)->/10.106.26.152:7000-SMALL_MESSAGES-72c1cc35
>  messaging connection established, version = 12, framing = LZ4, encryption = 
> unencrypted
> INFO  [Messaging-EventLoop-3-6] 2022-06-02 20:30:05,011 
> OutboundConnection.java:1150 - 
> /10.106.26.152:7000(/10.106.26.152:40678)->/10.106.48.97:7000-SMALL_MESSAGES-ad50944c
>  successfully connected, version = 12, framing = CRC, encryption = unencrypted
> INFO  [Messaging-EventLoop-3-7] 2022-06-02 20:30:05,011 
> InboundConnectionInitiator.java:464 - 
> /10.106.48.97:7000(/10.106.48.97:57368)->/10.106.26.152:7000-SMALL_MESSAGES-c98513ad
>  messaging connection established, version = 12, framing = CRC, encryption = 
> unencrypted
> INFO  [Messaging-EventLoop-3-8] 2022-06-02 20:30:14,783 NoSpamLogger.java:92 
> - /10.106.26.152:7000->/10.20.0.91:7000-URGENT_MESSAGES-[no-channel] failed 
> to connect
> io.netty.channel.ConnectTimeoutException: connection timed out: 
> /10.20.0.91:7000
>         at 
> io.netty.channel.epoll.AbstractEpollChannel$AbstractEpollUnsafe$2.run(AbstractEpollChannel.java:576)
>         at io.netty.util.concurrent.PromiseTask.runTask(PromiseTask.java:98)
>         at 
> io.netty.util.concurrent.ScheduledFutureTask.run(ScheduledFutureTask.java:170)
>         at 
> io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:164)
>         at 
> io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:472)
>         at io.netty.channel.epoll.EpollEventLoop.run(EpollEventLoop.java:384)
>         at 
> io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:989)
>         at 
> io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
>         at 
> io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
>         at java.lang.Thread.run(Thread.java:748)
> WARN  [GossipTasks:1] 2022-06-02 20:30:41,944 Gossiper.java:1036 - Gossip 
> stage has 1 pending tasks; skipping status check (no nodes will be marked 
> down)
> INFO  [Service Thread] 2022-06-02 20:30:43,977 GCInspector.java:299 - GPGC 
> New GC in 687ms.  GenPauseless New Gen: 46198001664 -> 1505914880; 
> GenPauseless Old Gen: 4413739008 -> 4769091584;
> INFO  [ScheduledTasks:1] 2022-06-02 20:30:44,700 MessagingMetrics.java:206 - 
> READ_REQ messages were dropped in last 5000 ms: 34 internal and 1 cross node. 
> Mean internal dropped latency: 252 ms and Mean cross-node dropped latency: 
> 268 ms
> WARN  [GossipTasks:1] 2022-06-02 20:31:17,011 Gossiper.java:1036 - Gossip 
> stage has 20 pending tasks; skipping status check (no nodes will be marked 
> down)
> WARN  [GossipTasks:1] 2022-06-02 20:31:18,112 Gossiper.java:1036 - Gossip 
> stage has 21 pending tasks; skipping status check (no nodes will be marked 
> down)
> WARN  [GossipTasks:1] 2022-06-02 20:31:19,214 Gossiper.java:1036 - Gossip 
> stage has 21 pending tasks; skipping status check (no nodes will be marked 
> down)
> WARN  [GossipTasks:1] 2022-06-02 20:31:20,315 Gossiper.java:1036 - Gossip 
> stage has 21 pending tasks; skipping status check (no nodes will be marked 
> down)
> WARN  [GossipTasks:1] 2022-06-02 20:31:21,416 Gossiper.java:1036 - Gossip 
> stage has 22 pending tasks; skipping status check (no nodes will be marked 
> down)
> WARN  [GossipTasks:1] 2022-06-02 20:31:22,518 Gossiper.java:1036 - Gossip 
> stage has 21 pending tasks; skipping status check (no nodes will be marked 
> down)
> WARN  [GossipTasks:1] 2022-06-02 20:31:23,619 Gossiper.java:1036 - Gossip 
> stage has 21 pending tasks; skipping status check (no nodes will be marked 
> down)
> WARN  [GossipTasks:1] 2022-06-02 20:39:30,445 Gossiper.java:1036 - Gossip 
> stage has 279 pending tasks; skipping status check (no nodes will be marked 
> down)
> WARN  [GossipTasks:1] 2022-06-02 20:39:31,546 Gossiper.java:1036 - Gossip 
> stage has 280 pending tasks; skipping status check (no nodes will be marked 
> down)
> WARN  [MutationStage-96] 2022-06-02 20:39:31,962 NoSpamLogger.java:95 - 
> /10.106.26.152:7000->/10.110.46.37:7000-SMALL_MESSAGES-60dee8f0 dropping 
> message of type READ_RSP whose timeout expired before reaching the network
> WARN  [GossipTasks:1] 2022-06-02 20:39:32,648 Gossiper.java:1036 - Gossip 
> stage has 279 pending tasks; skipping status check (no nodes will be marked 
> down)
> WARN  [GossipTasks:1] 2022-06-02 20:39:33,749 Gossiper.java:1036 - Gossip 
> stage has 281 pending tasks; skipping status check (no nodes will be marked 
> down)
> WARN  [GossipTasks:1] 2022-06-02 20:39:34,850 Gossiper.java:1036 - Gossip 
> stage has 284 pending tasks; skipping status check (no nodes will be marked 
> down)
> Half an hour later
> WARN  [GossipTasks:1] 2022-06-02 21:08:52,338 Gossiper.java:1036 - Gossip 
> stage has 3697 pending tasks; skipping status check (no nodes will be marked 
> down)
> At the same time in debug log (10.20.0.91 was bad seed)
> DEBUG [Native-Transport-Requests-10] 2022-06-02 20:30:36,900 
> ReadCallback.java:119 - Timed out; received 1 of 2 responses (including data)
> DEBUG [Native-Transport-Requests-29] 2022-06-02 20:30:36,914 
> ReadCallback.java:119 - Timed out; received 1 of 2 responses (including data)
> DEBUG [Native-Transport-Requests-28] 2022-06-02 20:30:36,931 
> ReadCallback.java:119 - Timed out; received 1 of 2 responses (including data)
> DEBUG [Native-Transport-Requests-41] 2022-06-02 20:30:36,969 
> ReadCallback.java:119 - Timed out; received 1 of 2 responses (including data)
> DEBUG [Native-Transport-Requests-18] 2022-06-02 20:30:36,977 
> ReadCallback.java:119 - Timed out; received 1 of 2 responses (including data)
> DEBUG [Native-Transport-Requests-1] 2022-06-02 20:30:37,007 
> ReadCallback.java:119 - Timed out; received 1 of 2 responses (including data)
> DEBUG [Native-Transport-Requests-11] 2022-06-02 20:30:37,028 
> ReadCallback.java:119 - Timed out; received 1 of 2 responses (including data)
> DEBUG [Native-Transport-Requests-37] 2022-06-02 20:30:37,034 
> ReadCallback.java:119 - Timed out; received 1 of 2 responses (including data)
> DEBUG [Native-Transport-Requests-47] 2022-06-02 20:30:37,038 
> ReadCallback.java:119 - Timed out; received 1 of 2 responses (including data)
> DEBUG [Native-Transport-Requests-38] 2022-06-02 20:30:37,061 
> ReadCallback.java:119 - Timed out; received 1 of 2 responses (including data)
> DEBUG [Native-Transport-Requests-12] 2022-06-02 20:30:37,066 
> ReadCallback.java:119 - Timed out; received 1 of 2 responses (including data)
> DEBUG [Native-Transport-Requests-61] 2022-06-02 20:30:37,069 
> ReadCallback.java:119 - Timed out; received 1 of 2 responses (including data)
> DEBUG [Native-Transport-Requests-58] 2022-06-02 20:30:37,112 
> ReadCallback.java:119 - Timed out; received 1 of 2 responses (including data)
> DEBUG [Native-Transport-Requests-10] 2022-06-02 20:30:37,117 
> ReadCallback.java:119 - Timed out; received 1 of 2 responses (including data)
> DEBUG [Native-Transport-Requests-24] 2022-06-02 20:30:37,124 
> ReadCallback.java:119 - Timed out; received 1 of 2 responses (including data)
> DEBUG [Native-Transport-Requests-30] 2022-06-02 20:30:37,127 
> ReadCallback.java:119 - Timed out; received 1 of 2 responses (including data)
> DEBUG [Native-Transport-Requests-28] 2022-06-02 20:30:37,147 
> ReadCallback.java:119 - Timed out; received 1 of 2 responses (including data)
> DEBUG [Native-Transport-Requests-17] 2022-06-02 20:30:37,158 
> ReadCallback.java:119 - Timed out; received 1 of 2 responses (including data)
> DEBUG [Native-Transport-Requests-16] 2022-06-02 20:30:37,191 
> ReadCallback.java:119 - Timed out; received 1 of 2 responses (including data)
> DEBUG [Native-Transport-Requests-21] 2022-06-02 20:30:37,204 
> ReadCallback.java:119 - Timed out; received 1 of 2 responses (including data)
> DEBUG [Native-Transport-Requests-44] 2022-06-02 20:30:37,232 
> ReadCallback.java:119 - Timed out; received 1 of 2 responses (including data)
> DEBUG [Native-Transport-Requests-31] 2022-06-02 20:30:37,232 
> ReadCallback.java:119 - Timed out; received 1 of 2 responses (including data)
> DEBUG [Native-Transport-Requests-43] 2022-06-02 20:30:37,243 
> ReadCallback.java:119 - Timed out; received 1 of 2 responses (including data)
> DEBUG [Native-Transport-Requests-25] 2022-06-02 20:30:37,253 
> ReadCallback.java:119 - Timed out; received 1 of 2 responses (including data)
> DEBUG [Native-Transport-Requests-29] 2022-06-02 20:30:37,261 
> ReadCallback.java:119 - Timed out; received 1 of 2 responses (including data)
> DEBUG [Native-Transport-Requests-5] 2022-06-02 20:31:15,803 
> ReadCallback.java:119 - Timed out; received 1 of 2 responses (including data)
> DEBUG [Native-Transport-Requests-6] 2022-06-02 20:31:15,826 
> ReadCallback.java:119 - Timed out; received 1 of 2 responses (including data)
> DEBUG [Native-Transport-Requests-32] 2022-06-02 20:31:15,840 
> ReadCallback.java:119 - Timed out; received 1 of 2 responses (including data)
> DEBUG [Native-Transport-Requests-28] 2022-06-02 20:31:15,867 
> ReadCallback.java:119 - Timed out; received 1 of 2 responses (including data)
> DEBUG [Native-Transport-Requests-51] 2022-06-02 20:31:15,875 
> ReadCallback.java:119 - Timed out; received 1 of 2 responses (including data)
> WARN  [GossipTasks:1] 2022-06-02 20:31:15,909 Gossiper.java:1036 - Gossip 
> stage has 19 pending tasks; skipping status check (no nodes will be marked 
> down)
> DEBUG [Native-Transport-Requests-16] 2022-06-02 20:31:15,931 
> ReadCallback.java:119 - Timed out; received 1 of 2 responses (including data)
> DEBUG [Native-Transport-Requests-46] 2022-06-02 20:31:15,960 
> ReadCallback.java:119 - Timed out; received 1 of 2 responses (including data)
> INFO  [Messaging-EventLoop-3-8] 2022-06-02 20:31:16,005 NoSpamLogger.java:92 
> - /10.106.26.152:7000->/10.20.0.91:7000-URGENT_MESSAGES-[no-channel] failed 
> to connect
> io.netty.channel.ConnectTimeoutException: connection timed out: 
> /10.20.0.91:7000
>         at 
> io.netty.channel.epoll.AbstractEpollChannel$AbstractEpollUnsafe$2.run(AbstractEpollChannel.java:576)
>         at io.netty.util.concurrent.PromiseTask.runTask(PromiseTask.java:98)
>         at 
> io.netty.util.concurrent.ScheduledFutureTask.run(ScheduledFutureTask.java:170)
>         at 
> io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:164)
>         at 
> io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:472)
>         at io.netty.channel.epoll.EpollEventLoop.run(EpollEventLoop.java:384)
>         at 
> io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:989)
>         at 
> io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
>         at 
> io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
>         at java.lang.Thread.run(Thread.java:748)
> DEBUG [Native-Transport-Requests-13] 2022-06-02 20:31:16,092 
> ReadCallback.java:119 - Timed out; received 1 of 2 responses (including data)
> DEBUG [Native-Transport-Requests-24] 2022-06-02 20:31:16,104 
> ReadCallback.java:119 - Timed out; received 1 of 2 responses (including data)
> DEBUG [Native-Transport-Requests-9] 2022-06-02 20:31:16,109 
> ReadCallback.java:119 - Timed out; received 1 of 2 responses (including data)
> DEBUG [Native-Transport-Requests-31] 2022-06-02 20:31:16,113 
> ReadCallback.java:119 - Timed out; received 1 of 2 responses (including data)
> DEBUG [Native-Transport-Requests-48] 2022-06-02 20:31:16,134 
> ReadCallback.java:119 - Timed out; received 1 of 2 responses (including data)
> DEBUG [Native-Transport-Requests-3] 2022-06-02 20:31:16,140 
> ReadCallback.java:119 - Timed out; received 1 of 2 responses (including data)
> DEBUG [Native-Transport-Requests-25] 2022-06-02 20:31:16,155 
> ReadCallback.java:119 - Timed out; received 1 of 2 responses (including data)
> DEBUG [Native-Transport-Requests-45] 2022-06-02 20:31:16,155 
> ReadCallback.java:119 - Timed out; received 1 of 2 responses (including data)
> DEBUG [Native-Transport-Requests-70] 2022-06-02 20:31:16,162 
> ReadCallback.java:119 - Timed out; received 1 of 2 responses (including data)
> DEBUG [Native-Transport-Requests-10] 2022-06-02 20:31:16,773 
> ReadCallback.java:119 - Timed out; received 1 of 2 responses (including data)
> DEBUG [Native-Transport-Requests-2] 2022-06-02 20:31:16,774 
> ReadCallback.java:119 - Timed out; received 1 of 2 responses (including data)



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org

Reply via email to