Will do.

On Tue, Jun 7, 2022 at 6:12 PM Jeff Jirsa <jji...@gmail.com> wrote:
> This deserves a JIRA ticket please.
>
> (I assume the sending host is randomly choosing the bad IP and blocking on
> it for some period of time, causing other tasks to pile up, but it should
> be investigated as a regression).
>
> On Tue, Jun 7, 2022 at 7:52 AM Gil Ganz <gilg...@gmail.com> wrote:
>
>> Yes, I know the issue with the peers table, we had it in different
>> clusters. In this case it appears the cause of the problem was indeed a
>> bad ip in the seed list.
>> After removing it from all nodes and reloading seeds, running a rolling
>> restart does not cause any gossip issues, and in general the number of
>> gossip pending tasks is 0 all the time, vs jumping to 2-5 pending tasks
>> every once in a while before this change.
>>
>> Interesting that this bad ip didn't cause an issue in 3.11.9, I guess
>> something in the way gossip works in c*4 made it cause a real issue
>> after the upgrade.
>>
>> On Tue, Jun 7, 2022 at 12:04 PM Bowen Song <bo...@bso.ng> wrote:
>>
>>> Regarding the "ghost IP", you may want to check the system.peers_v2
>>> table by doing "select * from system.peers_v2 where peer =
>>> '123.456.789.012';"
>>>
>>> I've seen this (non-)issue many times, and I had to do "delete from
>>> system.peers_v2 where peer=..." to fix it, as our client-side driver,
>>> the Python cassandra-driver, reads the token ring information from
>>> this table and uses it for routing requests.
>>>
>>> On 07/06/2022 05:22, Gil Ganz wrote:
>>>
>>> The only errors I see in the logs prior to the gossip pending issue are
>>> things like this:
>>>
>>> INFO [Messaging-EventLoop-3-32] 2022-06-02 20:29:44,833
>>> NoSpamLogger.java:92 - /X:7000->/Y:7000-URGENT_MESSAGES-[no-channel]
>>> failed to connect
>>> io.netty.channel.AbstractChannel$AnnotatedConnectException:
>>> finishConnect(..) failed: No route to host: /Y:7000
>>> Caused by: java.net.ConnectException: finishConnect(..) failed: No
>>> route to host
>>>   at io.netty.channel.unix.Errors.throwConnectException(Errors.java:124)
>>>   at io.netty.channel.unix.Socket.finishConnect(Socket.java:251)
>>>   at io.netty.channel.epoll.AbstractEpollChannel$AbstractEpollUnsafe.doFinishConnect(AbstractEpollChannel.java:673)
>>>   at io.netty.channel.epoll.AbstractEpollChannel$AbstractEpollUnsafe.finishConnect(AbstractEpollChannel.java:650)
>>>   at io.netty.channel.epoll.AbstractEpollChannel$AbstractEpollUnsafe.epollOutReady(AbstractEpollChannel.java:530)
>>>   at io.netty.channel.epoll.EpollEventLoop.processReady(EpollEventLoop.java:470)
>>>   at io.netty.channel.epoll.EpollEventLoop.run(EpollEventLoop.java:378)
>>>   at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:989)
>>>   at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
>>>   at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
>>>   at java.lang.Thread.run(Thread.java:748)
>>>
>>> The remote ip mentioned here is an ip that appears in the seed list
>>> (there are 20 other valid ip addresses in the seed clause), but it's no
>>> longer a valid ip; it's an old ip of an existing server (it's not in
>>> the peers table). I will try to reproduce the issue with this ip
>>> removed from the seed list.
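[Sketch, for reference: the seed-list cleanup described earlier in this thread would look roughly like the following on each node. nodetool reloadseeds / getseeds are available from 4.0 onwards; the IP below is a placeholder, not an address from this cluster.]

    # remove the stale address (e.g. 10.1.2.3) from the seeds parameter
    # in cassandra.yaml, then on the same node, without a restart:
    nodetool reloadseeds   # ask the seed provider to re-read the seed list
    nodetool getseeds      # confirm the stale ip is no longer returned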
>>>
>>> On Mon, Jun 6, 2022 at 9:39 PM C. Scott Andreas <sc...@paradoxica.net>
>>> wrote:
>>>
>>>> Hi Gil, thanks for reaching out.
>>>>
>>>> Can you check Cassandra's logs to see if any uncaught exceptions are
>>>> being thrown? What you described suggests the possibility of an
>>>> uncaught exception being thrown in the Gossiper thread, preventing
>>>> further tasks from making progress; however, I'm not aware of any open
>>>> issues in 4.0.4 that would result in this.
>>>>
>>>> Would be eager to investigate immediately if so.
>>>>
>>>> – Scott
>>>>
>>>> On Jun 6, 2022, at 11:04 AM, Gil Ganz <gilg...@gmail.com> wrote:
>>>>
>>>> Hey,
>>>> We have a big cluster (>500 nodes, onprem, multiple datacenters, most
>>>> with vnodes=32, but some with 128) that was recently upgraded from
>>>> 3.11.9 to 4.0.4. Servers are all centos 7.
>>>>
>>>> We have been dealing with a few issues related to gossip since then:
>>>> 1 - The moment the last node in the cluster was up with 4.0.4, and all
>>>> nodes were on the same version, gossip pending tasks started to climb
>>>> to very high numbers (>1M) on all nodes in the cluster, and quickly
>>>> the cluster was practically down. It took us a few hours of
>>>> stopping/starting nodes, and adding more nodes to the seed list, to
>>>> finally get the cluster back up.
>>>> 2 - We notice that pending gossip tasks go up to very high numbers
>>>> (50k) on random nodes in the cluster, without any meaningful event
>>>> having happened, and it doesn't look like they will go down on their
>>>> own. After a few hours we restart those nodes and it goes back to 0.
>>>> 3 - Doing a rolling restart on a list of servers is now an issue. More
>>>> often than not, one of the nodes we restart comes up with gossip
>>>> issues, and we need a 2nd restart to get the gossip pending tasks back
>>>> to 0.
>>>>
>>>> Is there a known issue related to gossip in big clusters, in recent
>>>> versions? Is there any tuning that can be done?
>>>>
>>>> Just to give a sense of how big the gossip information in this cluster
>>>> is, "nodetool gossipinfo" output size is ~300kb.
>>>>
>>>> gil
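[Sketch, for reference: quick ways to watch the symptoms described in the original message; exact output format may vary between versions.]

    nodetool tpstats | grep GossipStage   # active/pending gossip tasks on this node
    nodetool gossipinfo | wc -c           # rough size of the gossip state (~300kb here)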