The debugging was long… but the short version is: when running QA, the RAPI /version check, which runs right after masterfailover, sometimes failed with "Can't connect". Debugging led to identifying that RAPI returns before the (forked) child is ready to serve answers, and if the QA machine is close to the QA cluster, it will end up trying to talk to RAPI before RAPI has finished startup.
Or so I believe, from the logs.

This patch series changes the startup sequence so that:

- the parent returns only after the "preparation" work has been done,
  which for RAPI, confd and noded means after we bound to the socket and
  called listen(), but before the mainloop started
- the functionality needed for the above feature also gives us error
  reporting for free; before, ganeti-rapi (parent) exited cleanly and
  without error, no matter if the child was unable to actually start

Iustin Pop (8):
  Abstract some daemon functionality
  Abstract daemon file descriptor setup
  Use only one version of WritePidFile
  Change utils.GenericMain protocol
  Change daemon.GenericMain/utils.Daemonize workflow
  Convert ganeti daemons to the three-stage startup
  Enhance the error reporting
  Fix a rare bug in StartDaemonChild and GenericMain

 daemons/ganeti-confd          |   15 +++-
 daemons/ganeti-masterd        |   16 +++-
 daemons/ganeti-noded          |   13 +++-
 daemons/ganeti-rapi           |   15 +++-
 lib/daemon.py                 |   78 +++++++++++++++----
 lib/utils.py                  |  172 +++++++++++++++++++++++++---------------
 test/ganeti.utils_unittest.py |   11 ++-
 7 files changed, 223 insertions(+), 97 deletions(-)
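For readers who want a picture of the new startup sequence, here is a minimal sketch of the idea and not the code from this series. It assumes the child reports back to the parent over a pipe, which is one common way to do this; the cover letter itself does not say which mechanism the patches use. The names start_daemon, make_listener and serve_once are hypothetical and do not correspond to functions in lib/daemon.py or lib/utils.py. The sketch also uses a single fork and a Unix socket to stay short, whereas a real daemon would detach fully from its parent.

```python
import os
import socket


def start_daemon(prepare, serve):
    """Fork a child; return in the parent only once the child is prepared.

    Returns (pid, error_message). error_message is "" when the child
    finished its preparation work successfully.
    """
    rfd, wfd = os.pipe()
    pid = os.fork()
    if pid == 0:
        # Child: do the preparation work, report the outcome, then serve.
        os.close(rfd)
        try:
            state = prepare()
        except Exception as err:
            # Preparation failed: send the reason to the parent and exit.
            os.write(wfd, str(err).encode("utf-8"))
            os.close(wfd)
            os._exit(1)
        # Closing the pipe without writing anything signals success.
        os.close(wfd)
        try:
            serve(state)
        finally:
            os._exit(0)
    # Parent: block until the child closes its end of the pipe.
    os.close(wfd)
    chunks = []
    while True:
        data = os.read(rfd, 4096)
        if not data:
            break
        chunks.append(data)
    os.close(rfd)
    return pid, b"".join(chunks).decode("utf-8")


def make_listener(path):
    """Hypothetical preparation step: bind and listen, no mainloop yet."""
    sock = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    sock.bind(path)
    sock.listen(5)
    return sock


def serve_once(sock):
    """Hypothetical stand-in for the mainloop: answer a single client."""
    conn, _ = sock.accept()
    conn.sendall(b"ok")
    conn.close()
```

With this shape, a caller that connects right after start_daemon returns cannot hit the window described above, because listen() has already been called in the child. A failed bind also reaches the parent as a non-empty error string instead of a silent clean exit.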
