Good morning. We have two CloudStack 4.16.0 environments where HA k8s
clusters (meaning clusters with more than one control node) consistently
fail to provision.

Clusters with one control node reliably deploy their VMs and networking,
and then start. However, when we allocate two or more control nodes, the
k8s cluster invariably remains in the "Starting" state indefinitely, even
though all of the VMs have started.

Logging into the nodes reveals that not all of the nodes are running
deploy-kube-system successfully. The failed nodes lack a "success" file in
the core user's home directory. In every case, we can manually
re-run deploy-kube-system and the process will complete on that node.

When looking at the cloud-init.log file, we see:

###

2022-01-06 13:57:58,275 - subp.py[DEBUG]: Running command ['/var/lib/cloud/instance/scripts/runcmd'] with allowed return codes [0] (shell=False, capture=False)
2022-01-06 13:57:58,296 - subp.py[DEBUG]: Unexpected error while running command.
Command: ['/var/lib/cloud/instance/scripts/runcmd']
Exit code: 5
Reason: -
Stdout: -
Stderr: -
2022-01-06 13:57:58,296 - cc_scripts_user.py[WARNING]: Failed to run module scripts-user (scripts in /var/lib/cloud/instance/scripts)
2022-01-06 13:57:58,296 - handlers.py[DEBUG]: finish: modules-final/config-scripts-user: FAIL: running config-scripts-user with frequency once-per-instance
2022-01-06 13:57:58,296 - util.py[WARNING]: Running module scripts-user (<module 'cloudinit.config.cc_scripts_user' from '/usr/lib/python3/dist-packages/cloudinit/config/cc_scripts_user.py'>) failed
2022-01-06 13:57:58,296 - util.py[DEBUG]: Running module scripts-user (<module 'cloudinit.config.cc_scripts_user' from '/usr/lib/python3/dist-packages/cloudinit/config/cc_scripts_user.py'>) failed
Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/cloudinit/stages.py", line 848, in _run_modules
    ran, _r = cc.run(run_name, mod.handle, func_args,
  File "/usr/lib/python3/dist-packages/cloudinit/cloud.py", line 54, in run
    return self._runners.run(name, functor, args, freq, clear_on_fail)
  File "/usr/lib/python3/dist-packages/cloudinit/helpers.py", line 185, in run
    results = functor(*args)
  File "/usr/lib/python3/dist-packages/cloudinit/config/cc_scripts_user.py", line 45, in handle
    subp.runparts(runparts_path)
  File "/usr/lib/python3/dist-packages/cloudinit/subp.py", line 384, in runparts
    raise RuntimeError(
RuntimeError: Runparts: 1 failures (runcmd) in 1 attempted commands
2022-01-06 13:57:58,300 - stages.py[DEBUG]: Running module ssh-authkey-fingerprints (<module 'cloudinit.config.cc_ssh_authkey_fingerprints' from '/usr/lib/python3/dist-packages/cloudinit/config/cc_ssh_authkey_fingerprints.py'>) with frequency once-per-instance

###

Given that /var/lib/cloud/instance/scripts/runcmd contains the setup and
deploy scripts, this failure looks directly related to the problem.
However, other than exit code 5, the log gives us little to go on.
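
To compare nodes quickly, a minimal Python sketch along these lines could
pull the failed command and its exit code out of each node's cloud-init.log.
The function name find_failed_commands and the regular expression are
illustrative assumptions based only on the record format shown in the
excerpt; they are not part of cloud-init or CloudStack.

```python
import re

# Pattern for the multi-line failure record that cloud-init's subp.py writes.
# \s+ tolerates a line wrap between "running" and "command." in pasted logs.
FAILURE_RE = re.compile(
    r"Unexpected error while running\s+command\.\s*\n"
    r"Command: (?P<command>.+)\n"
    r"Exit code: (?P<exit_code>\S+)"
)


def find_failed_commands(log_text):
    """Return a (command, exit_code) tuple for every failure record found."""
    return [
        (m.group("command").strip(), m.group("exit_code"))
        for m in FAILURE_RE.finditer(log_text)
    ]


# Sample copied from the log excerpt; on a node, read the real log file instead.
SAMPLE = """\
2022-01-06 13:57:58,296 - subp.py[DEBUG]: Unexpected error while running
command.
Command: ['/var/lib/cloud/instance/scripts/runcmd']
Exit code: 5
Reason: -
"""

for command, exit_code in find_failed_commands(SAMPLE):
    print(f"{command} exited with code {exit_code}")
```

The exit code is kept as a string because the neighbouring fields in the
record can be "-". The function only reports what cloud-init logged; it does
not explain why runcmd returned 5.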

We also have several other CS 4.16.0 environments where this problem does
*not* occur -- they are identically deployed with the same underlying
scripts.

Any thoughts or suggestions on where we can look to troubleshoot are
greatly appreciated!

-- 
William (B.J.) Lawson, MD
