Okay, I made test packages for Bionic and Xenial based on the above. The PPA is available here: https://launchpad.net/~mruffell/+archive/ubuntu/lp1874075-test
It contains (based off of -updates):

Xenial: rabbitmq-server 3.5.7-1ubuntu0.16.04.2+lp1874075v20200629b1
Bionic: rabbitmq-server 3.6.10-1ubuntu0.1+lp1874075v20200629b1

Debdiffs for the above builds are:

Xenial: https://paste.ubuntu.com/p/Jm8ZctJzny/
Bionic: https://paste.ubuntu.com/p/j6cBPzgWMD/

On Bionic:

When you install the test packages on both nodes and reboot them, then attempt to reproduce, the node which is attempting to rejoin the cluster will stay in the systemd activating state, and the wrapper script terminates after 5 minutes (300 seconds), i.e. 10x 30000ms timeouts. When the wrapper script terminates, it terminates with an error exit code, and systemd restarts the service. This continues forever until the node joins the cluster, at which stage the systemd status turns active. Problem is fixed.

On Xenial:

When you install the test packages on both nodes and reboot them, then attempt to reproduce, the node which is attempting to rejoin the cluster will stay in the systemd activating state, and the wrapper script terminates after 60 seconds. This is much shorter than on Bionic. When the wrapper script terminates, it terminates with an error exit code, and systemd restarts the service. This continues forever until the node joins the cluster, at which stage the systemd status turns active. Problem is fixed.

It seems the timeouts happen at the mercy of the mnesia_table_loading_retry_limit and mnesia_table_loading_retry_timeout values, ignoring the -t 600 that we pass into 'rabbitmqctl wait'.

Nicolas, it seems you are right, and that if we didn't want our services to restart every 60 (Xenial) or 300 (Bionic) seconds, we would need to adjust these timeouts. The problem is, we would have to introduce new configuration files to do this, which is normally frowned on when doing an SRU.

Now that we have Restart=on-failure and RestartSec=10, should I add config to change mnesia_table_loading_retry_timeout?
To be honest, I am happy with leaving them as is, and just relying on Restart=on-failure to do its job.

@ddstreet do you have any strong opinions? Is a service restarting every 60 seconds unacceptable until the node can rejoin the cluster?

Nicolas, can you install and test these packages and double check that you also see what I see? If everything is good, you can submit new debdiffs for Xenial and Bionic based on mine, and we can get some new builds into -proposed.

Nicolas, I think you were more or less right all along, and all you were missing was Restart=on-failure and RestartSec=10 in the service file.

--
You received this bug notification because you are a member of STS Sponsors, which is subscribed to the bug report.
https://bugs.launchpad.net/bugs/1874075

Title: rabbitmq-server startup timeouts differ between SysV and systemd

Status in rabbitmq-server package in Ubuntu: Fix Released
Status in rabbitmq-server source package in Xenial: Fix Committed
Status in rabbitmq-server source package in Bionic: Fix Committed
Status in rabbitmq-server source package in Eoan: Won't Fix
Status in rabbitmq-server source package in Focal: Fix Committed
Status in rabbitmq-server source package in Groovy: Fix Released
Status in rabbitmq-server package in Debian: New

Bug description:

The startup timeouts were recently adjusted and synchronized between the SysV and systemd startup files.

https://github.com/rabbitmq/rabbitmq-server-release/pull/129

The new startup files should be included in this package.

[Impact]

After starting the RabbitMQ server process, the startup script will wait for the server to start by calling `rabbitmqctl wait` and will time out after 10 s. The startup time of the server depends on how quickly the Mnesia database becomes available, and the server will time out after `mnesia_table_loading_retry_timeout` ms times `mnesia_table_loading_retry_limit` retries. By default this wait is 30,000 ms times 10 retries, i.e. 300 s.
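
The arithmetic above can be checked with a minimal Python sketch. The function names are illustrative assumptions and not part of RabbitMQ or the packaging; the Mnesia defaults and the 10 s and 600 s script timeouts are the values quoted in this bug description.

```python
def mnesia_wait_budget_s(retry_timeout_ms: int, retry_limit: int) -> float:
    """Total time the server waits for Mnesia tables, in seconds.

    Models the product described above:
    mnesia_table_loading_retry_timeout (ms) * mnesia_table_loading_retry_limit.
    """
    return retry_timeout_ms * retry_limit / 1000.0


def script_gives_up_first(script_timeout_s: float,
                          retry_timeout_ms: int,
                          retry_limit: int) -> bool:
    """True if the startup script times out before the server stops waiting."""
    return script_timeout_s < mnesia_wait_budget_s(retry_timeout_ms, retry_limit)


if __name__ == "__main__":
    # Defaults from the bug description: 30,000 ms times 10 retries.
    print(mnesia_wait_budget_s(30_000, 10))
    # Old script timeout (10 s) versus the proposed default (600 s).
    print(script_gives_up_first(10, 30_000, 10))
    print(script_gives_up_first(600, 30_000, 10))
```

With the defaults, the 10 s script timeout is well below the server's own wait budget, while the proposed 600 s default is above it.
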
The mismatch between these two timeout values might lead to the startup script failing prematurely while the server is still waiting for the Mnesia tables.

This change introduces the variable `RABBITMQ_STARTUP_TIMEOUT` and the `--timeout` option into the startup script. The default value for this timeout is set to 10 minutes (600 seconds).

This change also updates the systemd service file to match the timeout values between the two service management methods.

[Scope]

Upstream patch: https://github.com/rabbitmq/rabbitmq-server-release/pull/129

* Fix is not included in the Debian package
* Fix is not included in any Ubuntu series
* Groovy and Focal can apply the upstream patch as is
* Bionic and Xenial need an additional fix in the systemd service file to set the `RABBITMQ_STARTUP_TIMEOUT` variable for the `rabbitmq-server-wait` helper script.

[Test Case]

In a clustered setup with two nodes, A and B:

1. create queue on A
2. shut down B
3. shut down A
4. boot B

The broker on B will wait for A. The systemd service will wait for 10 seconds and then fail. Boot A and the rabbitmq-server process on B will complete startup.

[Regression Potential]

This change alters the behavior of the startup scripts when the Mnesia database takes a long time to become available. This might lead to failures further down the service dependency chain.

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/rabbitmq-server/+bug/1874075/+subscriptions
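
For the test case above, a rough Python sketch can estimate how many times systemd restarts the service on B before A comes back. This is a simplified model under stated assumptions, not a description of how systemd or the wrapper script are implemented: every start attempt is assumed to run for a fixed time before the wrapper exits with an error (60 s on Xenial, 300 s on Bionic, as observed with the test packages), systemd waits RestartSec=10 between attempts, and B joins as soon as A is back. The function name is hypothetical.

```python
def failed_attempts_before_join(peer_back_after_s: float,
                                attempt_s: float,
                                restart_sec: float = 10.0) -> int:
    """Count start attempts that fail before the peer node is back.

    Assumption: an attempt that starts at `start` fails if it ends
    (start + attempt_s) at or before the moment the peer returns.
    """
    failures = 0
    start = 0.0
    while start + attempt_s <= peer_back_after_s:
        failures += 1
        # Next attempt begins after the failed one plus the RestartSec delay.
        start += attempt_s + restart_sec
    return failures


if __name__ == "__main__":
    # Node A comes back 10 minutes after B was booted (illustrative value).
    print(failed_attempts_before_join(600, attempt_s=60))   # Xenial-like
    print(failed_attempts_before_join(600, attempt_s=300))  # Bionic-like
```

Under these assumptions, the shorter Xenial attempt means more restarts for the same outage, which is the trade-off behind the question about whether a restart every 60 seconds is acceptable.
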

