I reviewed the sosreports and provide some general analysis below.

[sosreport-juju-machine-2-lxc-1-2020-11-10-tayyude]

I don't see any sign in this log of package upgrades or VIP stop/starts,
so I suspect this host may be unrelated.

[sosreport-juju-caae6f-19-lxd-6-20201110230352.tar.xz]

This is a charm-keystone node.

Looking at this sosreport, my general finding is that everything worked
correctly on this specific host.

unattended-upgrades.log:
We can see the upgrade starts at 2020-11-10 06:17:03 and finishes at
2020-11-10 06:17:48.

syslog.1:
Nov 10 06:17:41 juju-caae6f-19-lxd-6 crmd[41203]:   notice: Result of probe 
operation for res_ks_680cfdf_vip on juju-caae6f-19-lxd-6: 7 (not running)
Nov 10 06:19:44 juju-caae6f-19-lxd-6 crmd[41203]:   notice: Result of start 
operation for res_ks_680cfdf_vip on juju-caae6f-19-lxd-6: 0 (ok)

We also see that the VIP moved around to different hosts a few times,
likely as a result of each host successively upgrading, which makes
sense. I don't see any sign in this log of the mentioned lrmd issue.
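
If useful, a rough way to quantify the moves is to list every start/stop
result for the VIP resource across the rotated syslogs (resource name
taken from the excerpt above; the mysql VIP on the percona nodes will
have a different res_* name, and this assumes it is run from inside the
extracted sosreport):

  grep -hE 'Result of (start|stop) operation for res_ks_680cfdf_vip' \
      var/log/syslog*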

[mysql issue]

What we do see, however, are "Too many connections" errors from MySQL
in the keystone logs. This generally happens because when the VIP moves
from one host to another, all of the old connections are left behind and
go stale: once the VIP is removed, traffic for those connections is sent
to the new VIP owner, which doesn't have the corresponding TCP
connections, so the old node never receives the TCP resets the remote
end sends. The stale connections sit there until wait_timeout is reached
(typically either 180s/3 min or 3600s/1 hour in our deployments). The
problem occurs when the VIP fails *back* to a host it already failed
away from: many of the connection slots are still held by the stale
connections, and you run out of connections if your max_connections
limit is not at least double your normal connection count. This will
eventually self-resolve once the stale connections time out, but that
may take up to an hour.
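
As a rough illustration (to be run on the percona mysql nodes rather
than this keystone node, and assuming local admin access to mysql),
comparing the live connection count against the limit, and counting
long-idle "Sleep" threads, shows whether there is headroom for a
failover's worth of stale connections:

  # Configured limit vs. current connection count.
  mysql -e "SHOW GLOBAL VARIABLES LIKE 'max_connections';
            SHOW GLOBAL STATUS LIKE 'Threads_connected';"

  # Long-idle connections; a spike of these right after a VIP move is
  # the stale-connection signature described above.
  mysql -e "SELECT count(*) FROM information_schema.processlist
            WHERE command = 'Sleep' AND time > 180;"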

Note that this sosreport is from a keystone node that *also* has charm-
hacluster/corosync/pacemaker, but the mysql issue discussed above would
have occurred on the percona mysql nodes. To analyse the number of
failovers we would need sosreports from the mysql node(s).

[summary]

I think there are likely two potential issues here, based on what has
been described so far.

Firstly, the networkd issue is likely not related to this specific case:
that only happens when systemd itself is upgraded and networkd is
restarted as a result, which shouldn't have happened here.

(Issue 1) The first is that we hit max_connections due to the multiple
successive MySQL VIP failovers, because max_connections is not at least
2x the steady-state connection count. It also seems possible that in
some cases the VIP may shift back to the same host a 3rd time by chance,
in which case you may end up needing 3x. I think we could potentially
improve that by modifying the pacemaker resource scripts to kill active
connections when the VIP departs, or by ensuring that max_connections is
2-3x the steady-state active connection count. That should likely go
into a new bug against charm-percona-cluster, as it ships its own
resource agent. We could also potentially add a configurable nagios
check for active connections in excess of 50% of max_connections.
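
As a very rough sketch of those two ideas (this is not the charm's
actual resource agent or nagios check, and the 60s idle threshold is an
arbitrary illustration), the stop action for the VIP could be followed
by something along these lines, and the check is a simple ratio test:

  # Hypothetical hook run when the VIP leaves this node: kill idle
  # client connections so their slots are freed immediately rather
  # than waiting for wait_timeout.
  mysql -NBe "SELECT concat('KILL ', id, ';')
              FROM information_schema.processlist
              WHERE command = 'Sleep' AND time > 60;" | mysql

  # Hypothetical nagios-style check: warn when active connections
  # exceed 50% of max_connections.
  used=$(mysql -NBe "SHOW GLOBAL STATUS LIKE 'Threads_connected';" |
         awk '{print $2}')
  limit=$(mysql -NBe "SHOW GLOBAL VARIABLES LIKE 'max_connections';" |
          awk '{print $2}')
  [ "$used" -gt "$((limit / 2))" ] && echo "WARN: $used/$limit connections"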

(Issue 2) It was described that pacemaker got into a bad state during
the restart: the lrmd didn't exit and didn't work correctly until it was
manually killed and restarted. I think we need to get more
logs/sosreports from the nodes that hit that specific issue; it sounds
like something that may be a bug specific to a certain scenario, or
perhaps to the older xenial version [this USN-4623-1 update applied to
all LTS releases: 16.04/18.04/20.04].
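
If the lrmd issue reproduces, capturing something like the following
from an affected node before the process is killed would help narrow it
down (standard commands, nothing specific to this deployment assumed):

  # Is a pre-upgrade lrmd still running, and since when?
  ps -o pid,lstart,args -C lrmd

  # Current pacemaker service state and recent journal entries.
  systemctl status pacemaker
  journalctl -u pacemaker | tail -n 200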
