Good morning... we have two CloudStack 4.16.0 environments where HA K8s clusters (meaning clusters with more than one control node) consistently fail to provision.
Clusters with 1 control node reliably deploy their VMs, networking, and start... however, when allocating 2+ control nodes, the K8s cluster invariably remains in the "Starting" state indefinitely, despite all of the VMs being started. Logging into the nodes reveals that not all of them run deploy-kube-system successfully. The failed nodes lack a "success" file in the core user's home directory. In every case, we can manually re-run deploy-kube-system and the process will complete on that node.

When looking at the cloud-init.log file, we see:

###
2022-01-06 13:57:58,275 - subp.py[DEBUG]: Running command ['/var/lib/cloud/instance/scripts/runcmd'] with allowed return codes [0] (shell=False, capture=False)
2022-01-06 13:57:58,296 - subp.py[DEBUG]: Unexpected error while running command.
Command: ['/var/lib/cloud/instance/scripts/runcmd']
Exit code: 5
Reason: -
Stdout: -
Stderr: -
2022-01-06 13:57:58,296 - cc_scripts_user.py[WARNING]: Failed to run module scripts-user (scripts in /var/lib/cloud/instance/scripts)
2022-01-06 13:57:58,296 - handlers.py[DEBUG]: finish: modules-final/config-scripts-user: FAIL: running config-scripts-user with frequency once-per-instance
2022-01-06 13:57:58,296 - util.py[WARNING]: Running module scripts-user (<module 'cloudinit.config.cc_scripts_user' from '/usr/lib/python3/dist-packages/cloudinit/config/cc_scripts_user.py'>) failed
2022-01-06 13:57:58,296 - util.py[DEBUG]: Running module scripts-user (<module 'cloudinit.config.cc_scripts_user' from '/usr/lib/python3/dist-packages/cloudinit/config/cc_scripts_user.py'>) failed
Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/cloudinit/stages.py", line 848, in _run_modules
    ran, _r = cc.run(run_name, mod.handle, func_args,
  File "/usr/lib/python3/dist-packages/cloudinit/cloud.py", line 54, in run
    return self._runners.run(name, functor, args, freq, clear_on_fail)
  File "/usr/lib/python3/dist-packages/cloudinit/helpers.py", line 185, in run
    results = functor(*args)
  File "/usr/lib/python3/dist-packages/cloudinit/config/cc_scripts_user.py", line 45, in handle
    subp.runparts(runparts_path)
  File "/usr/lib/python3/dist-packages/cloudinit/subp.py", line 384, in runparts
    raise RuntimeError(
RuntimeError: Runparts: 1 failures (runcmd) in 1 attempted commands
2022-01-06 13:57:58,300 - stages.py[DEBUG]: Running module ssh-authkey-fingerprints (<module 'cloudinit.config.cc_ssh_authkey_fingerprints' from '/usr/lib/python3/dist-packages/cloudinit/config/cc_ssh_authkey_fingerprints.py'>) with frequency once-per-instance
###

Given that /var/lib/cloud/instance/scripts/runcmd contains the setup and deploy scripts, this seems like evidence of the problem. However, other than Exit code 5 we're not seeing much to go on.

We also have several other CS 4.16.0 environments where this problem does *not* occur -- they are identically deployed with the same underlying scripts.

Any thoughts or suggestions on where we can look to troubleshoot are greatly appreciated!

-- William (B.J.) Lawson, MD