Okay, I made test packages for Bionic and Xenial based on the above:

The PPA is available here:
https://launchpad.net/~mruffell/+archive/ubuntu/lp1874075-test

It contains (based off of -updates):
Xenial:
rabbitmq-server         3.5.7-1ubuntu0.16.04.2+lp1874075v20200629b1
Bionic:
rabbitmq-server         3.6.10-1ubuntu0.1+lp1874075v20200629b1 

Debdiffs for the above builds are:
Xenial: https://paste.ubuntu.com/p/Jm8ZctJzny/
Bionic: https://paste.ubuntu.com/p/j6cBPzgWMD/

On Bionic:
When you install the test packages on both nodes and reboot them, then attempt
to reproduce, the node which is attempting to rejoin the cluster will stay in
the systemd activating state, and the wrapper script terminates after 5 minutes
(300 seconds), i.e. 10x 30000ms timeouts. When the wrapper script terminates,
it terminates with an error exit code, and systemd restarts the service. This
repeats until the node joins the cluster, at which stage the systemd
status turns active. Problem is fixed.

On Xenial:
When you install the test packages on both nodes and reboot them, then attempt 
to reproduce, the node which is attempting to rejoin the cluster will stay in 
the systemd activating state, and the wrapper script terminates after 60 
seconds. This is much shorter than Bionic. When the wrapper script terminates, 
it terminates with an error exit code, and systemd restarts the service. This 
continues forever until the node joins the cluster, at which stage the systemd 
status turns active. Problem is fixed.

It seems the timeouts are governed by the
mnesia_table_loading_retry_limit and mnesia_table_loading_retry_timeout
values, and ignore the -t 600 that we pass into 'rabbitmqctl wait'.
Nicolas, it seems you are right: if we didn't want our services
to restart every 60 (Xenial) or 300 (Bionic) seconds, we would need to
adjust these timeouts. The problem is that we would have to introduce new
configuration files to do this, which is normally frowned upon when doing
an SRU.
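
As a rough sketch of the arithmetic behind those numbers: the wait per
start attempt is the retry limit multiplied by the per-retry timeout.
The figures below assume a limit of 10 and a timeout of 30000 ms for
Bionic, which is what adds up to the 300 seconds observed; the function
names are made up for illustration, and the Xenial figure is simply the
60 seconds observed, not derived from any setting.

```python
# Sketch only: estimates how long one start attempt waits for Mnesia
# tables, and how often systemd would retry. Values are assumptions
# taken from the observations in this comment, not read from RabbitMQ.

def table_loading_wait_seconds(retry_limit: int, retry_timeout_ms: int) -> float:
    """Total wait before giving up: one timeout per retry."""
    return retry_limit * retry_timeout_ms / 1000


def restart_cycle_seconds(wait_seconds: float, restart_sec: float) -> float:
    """Time between consecutive start attempts with Restart=on-failure."""
    return wait_seconds + restart_sec


bionic_wait = table_loading_wait_seconds(retry_limit=10, retry_timeout_ms=30000)
print(bionic_wait)                             # 300.0
print(restart_cycle_seconds(bionic_wait, 10))  # 310.0
print(restart_cycle_seconds(60, 10))           # 70 (Xenial, observed 60 s)
```

So with RestartSec=10, a node that cannot rejoin would be retried
roughly every 310 seconds on Bionic and every 70 seconds on Xenial.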

Now that we have Restart=on-failure and RestartSec=10, should I still add
config to change mnesia_table_loading_retry_timeout? To be honest I am happy
with leaving them as is, and just relying on Restart=on-failure to do
its job. @ddstreet do you have any strong opinions? Is a service
restarting every 60 seconds unacceptable until the node can rejoin the
cluster?

Nicolas, can you install and test these packages and double check that
you also see what I see? If everything is good, you can submit new
debdiffs for Xenial and Bionic based on mine, and we can get some new
builds into -proposed.

Nicolas, I think you were more or less right all along, and all you were
missing was Restart=on-failure and RestartSec=10 in the service file.
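
For reference, the two directives would sit in the [Service] section of
the unit. The fragment below is only a minimal sketch, written as a
systemd drop-in; the drop-in path is a hypothetical example, and the
test packages carry the change in the packaged service file itself (see
the debdiffs above for the actual change).

```
# /etc/systemd/system/rabbitmq-server.service.d/restart.conf
# (hypothetical drop-in path, for illustration only)
[Service]
# Restart the service whenever the start wrapper exits with an error
Restart=on-failure
# Wait 10 seconds before each restart attempt
RestartSec=10
```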

-- 
You received this bug notification because you are a member of Ubuntu
Bugs, which is subscribed to Ubuntu.
https://bugs.launchpad.net/bugs/1874075

Title:
  rabbitmq-server startup timeouts differ between SysV and systemd

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/rabbitmq-server/+bug/1874075/+subscriptions

