The only errors I see in the logs prior to the gossip pending issue are things
like this:

INFO  [Messaging-EventLoop-3-32] 2022-06-02 20:29:44,833 NoSpamLogger.java:92 - /X:7000->/Y:7000-URGENT_MESSAGES-[no-channel] failed to connect
io.netty.channel.AbstractChannel$AnnotatedConnectException: finishConnect(..) failed: No route to host: /Y:7000
Caused by: java.net.ConnectException: finishConnect(..) failed: No route to host
        at io.netty.channel.unix.Errors.throwConnectException(Errors.java:124)
        at io.netty.channel.unix.Socket.finishConnect(Socket.java:251)
        at io.netty.channel.epoll.AbstractEpollChannel$AbstractEpollUnsafe.doFinishConnect(AbstractEpollChannel.java:673)
        at io.netty.channel.epoll.AbstractEpollChannel$AbstractEpollUnsafe.finishConnect(AbstractEpollChannel.java:650)
        at io.netty.channel.epoll.AbstractEpollChannel$AbstractEpollUnsafe.epollOutReady(AbstractEpollChannel.java:530)
        at io.netty.channel.epoll.EpollEventLoop.processReady(EpollEventLoop.java:470)
        at io.netty.channel.epoll.EpollEventLoop.run(EpollEventLoop.java:378)
        at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:989)
        at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
        at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
        at java.lang.Thread.run(Thread.java:748)

The remote IP mentioned here is one that appears in the seed list (there are
20 other valid IP addresses in the seeds clause), but it is no longer a valid
IP; it's an old IP of an existing server, and it does not appear in the peers
table. I will try to reproduce the issue with this IP removed from the seed
list.
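
For reference, the seed list on our nodes is defined under seed_provider in
cassandra.yaml; a rough sketch of what the entry would look like with the
stale address dropped (the class name is the default SimpleSeedProvider, and
the addresses below are placeholders rather than our real ones):

    seed_provider:
        - class_name: org.apache.cassandra.locator.SimpleSeedProvider
          parameters:
              # stale/old address removed; only currently reachable nodes listed
              - seeds: "10.0.0.1,10.0.0.2,10.0.0.3"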


On Mon, Jun 6, 2022 at 9:39 PM C. Scott Andreas <sc...@paradoxica.net>
wrote:

> Hi Gil, thanks for reaching out.
>
> Can you check Cassandra's logs to see if any uncaught exceptions are being
> thrown? What you described suggests the possibility of an uncaught
> exception being thrown in the Gossiper thread, preventing further tasks
> from making progress; however, I'm not aware of any open issues in 4.0.4
> that would result in this.
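>
> A quick way to scan for that, just as a sketch (paths assume a package-style
> install with logs under /var/log/cassandra; adjust as needed):
>
>     # any ERROR-level entries or stack traces around the time gossip stalled
>     grep -n -E "ERROR|Exception" /var/log/cassandra/system.log | less
>
>     # narrowing to the gossip threads (they log as GossipStage)
>     grep -n "GossipStage" /var/log/cassandra/system.log | grep -i -E "error|exception"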
>
> Would be eager to investigate immediately if so.
>
> – Scott
>
> On Jun 6, 2022, at 11:04 AM, Gil Ganz <gilg...@gmail.com> wrote:
>
>
> Hey
> We have a big cluster (>500 nodes, on-prem, multiple datacenters, most with
> vnodes=32 but some with 128) that was recently upgraded from 3.11.9 to
> 4.0.4. Servers are all CentOS 7.
>
> We have been dealing with a few issues related to gossip since then:
> 1 - The moment the last node in the cluster came up on 4.0.4 and all nodes
> were on the same version, gossip pending tasks started climbing to very high
> numbers (>1M) on every node in the cluster, and the cluster was quickly all
> but down. It took us a few hours of stopping/starting nodes, and adding more
> nodes to the seed list, to finally get the cluster back up.
> 2 - We notice pending gossip tasks climbing to very high numbers (~50k) on
> random nodes in the cluster, without any meaningful event having happened,
> and the count doesn't look like it will come down on its own (we track it as
> shown in the sketch after this list). After a few hours we restart those
> nodes and it goes back to 0.
> 3 - Doing a rolling restart of a list of servers is now an issue; more often
> than not, one of the nodes we restart comes up with gossip issues, and we
> need a second restart to get its gossip pending tasks back to 0.
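>
> For completeness, this is roughly how we watch the pending count (nodetool is
> assumed to be on the PATH; the "Pending" column for GossipStage is the number
> that climbs):
>
>     nodetool tpstats | grep -E "Pool Name|GossipStage"
>
> (Our monitoring scrapes roughly the equivalent JMX metric,
> org.apache.cassandra.metrics:type=ThreadPools,path=internal,scope=GossipStage,name=PendingTasks.)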
>
> Is there a known issue related to gossip in big clusters, in recent
> versions?
> Is there any tuning that can be done?
>
> Just to give a sense of how big the gossip information in this cluster is:
> the output of "nodetool gossipinfo" is ~300 KB.
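> (Measured crudely, with something along the lines of
>
>     nodetool gossipinfo | wc -c
>
> i.e. the byte count of the plain-text output.)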
>
> gil
>
>
>
