The debugging was long, but the short version is: when running QA,
the RAPI /version check, which runs right after masterfailover, was
sometimes failing with "Can't connect". Debugging led to identifying
that ganeti-rapi (parent) returns before the (forked) child is ready
to serve answers, so if the QA machine is close to the QA cluster, it
ends up trying to talk to RAPI before RAPI has finished starting up.

Or so I believe, from the logs.

This patch series changes the startup sequence so that:

- the parent returns only after the "preparation" work has been done,
  which for RAPI, confd and noded means after we have bound to the
  socket and called listen(), but before the mainloop has started
- the functionality needed for the above also gives us error
  reporting for free; before, ganeti-rapi (parent) exited cleanly and
  without error even if the child was unable to actually start
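
To illustrate the idea behind the two points above, here is a minimal
sketch of a parent that waits on a pipe until the forked child has
finished its preparation. This is not the code from the series: the
names start_daemon, prepare_fn and serve_fn are hypothetical, and the
"empty read means ready" convention is an assumption made for the
example.

```python
import os


def start_daemon(prepare_fn, serve_fn):
    """Fork a child and return in the parent only once it is prepared.

    Returns (child_pid, error_message); the message is empty on success.
    All names here are illustrative, not the Ganeti API.
    """
    rpipe, wpipe = os.pipe()
    pid = os.fork()
    if pid == 0:
        # Child: run the preparation step, e.g. bind() and listen().
        os.close(rpipe)
        try:
            state = prepare_fn()
        except Exception as err:
            # Report the failure to the parent and exit non-zero.
            os.write(wpipe, str(err).encode())
            os._exit(1)
        # Closing the pipe without writing signals "ready" to the parent.
        os.close(wpipe)
        try:
            serve_fn(state)  # the main loop starts only after the report
        finally:
            os._exit(0)
    # Parent: block until the child closes its end of the pipe.
    os.close(wpipe)
    chunks = []
    while True:
        data = os.read(rpipe, 4096)
        if not data:
            break
        chunks.append(data)
    os.close(rpipe)
    return pid, b"".join(chunks).decode()
```

With this shape, a client that connects as soon as start_daemon()
returns finds the socket already listening, and a failure during
preparation reaches the parent as a non-empty message that it can turn
into a non-zero exit code.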

Iustin Pop (8):
  Abstract some daemon functionality
  Abstract daemon file descriptor setup
  Use only one version of WritePidFile
  Change utils.GenericMain protocol
  Change daemon.GenericMain/utils.Daemonize workflow
  Convert ganeti daemons to the three-stage startup
  Enhance the error reporting
  Fix a rare bug in StartDaemonChild and GenericMain

 daemons/ganeti-confd          |   15 +++-
 daemons/ganeti-masterd        |   16 +++-
 daemons/ganeti-noded          |   13 +++-
 daemons/ganeti-rapi           |   15 +++-
 lib/daemon.py                 |   78 +++++++++++++++----
 lib/utils.py                  |  172 +++++++++++++++++++++++++---------------
 test/ganeti.utils_unittest.py |   11 ++-
 7 files changed, 223 insertions(+), 97 deletions(-)
