Hi Pearl, thanks for your help. You are correct -- in all cases, running the deploy-kube-system script manually on the failed node completes successfully and without issues.
I suspect you are also correct that there may be a race condition where setup-kube-system has not completed successfully before deploy-kube-system first runs, although by the time we log into a node that has failed initial setup, the setup-kube-system service appears to have completed normally:

● setup-kube-system.service
     Loaded: loaded (/etc/systemd/system/setup-kube-system.service; static)
     Active: inactive (dead)

We are running Kubernetes version 1.22.2 and have tried templates from CloudStack and ShapeBlue, although the checksums are identical, so we didn't expect a difference there.

Are there any longer-running operations in setup-kube-system that might be worth exploring in more detail? I've seen in the source code that there are separate cloud-init files for k8s-control-node-add (I think that means "additional"?) and k8s-control-node (which might be the primary control node), so I was thinking there might be a process in the k8s-control-node-add file that was tripping things up, since the issue only shows up with > 1 control node. However, the node(s) that fail to complete seem to be random: typically it is one or more of the additional control nodes, but compute nodes will also fail to complete.

Appreciate any thoughts / logging techniques to look for more information!

BJ

On Thu, Jan 6, 2022 at 11:29 PM Pearl d'Silva <pearl.dsi...@shapeblue.com> wrote:

> Hi,
>
> Could you please try running the deploy-kube-system script manually on the node. Maybe this would give us a hint as to what the issue could be. It could so happen that setup-kube-system service may have not completed successfully - particularly the 'kubeadm init' operation and deploy-kube-system requires setup-kube-system service to have run successfully. So, it also may be worth checking the status of the setup-kube-system service. Could also please share the Kubernetes version.
>
> Thanks,
> Pearl
>
> ________________________________
> From: William (B.J.) Lawson, MD <lawson...@gmail.com>
> Sent: Thursday, January 6, 2022 8:04 PM
> To: users@cloudstack.apache.org <users@cloudstack.apache.org>
> Subject: 4.16.0: Unpredictable failure to successfully complete deploy-kube-system on HA k8s clusters
>
> Good morning... we have two Cloudstack 4.16.0 environments where HA k8s clusters (meaning clusters with > 1 control node) consistently fail to provision successfully.
>
> Clusters with 1 control reliably deploy their VMs, networking, and start... however, when allocating 2+ control nodes, invariably the K8s cluster remains indefinitely in the "Starting" state despite all of the VMs being started.
>
> Logging into the nodes reveals that not all of the nodes are running deploy-kube-system successfully. The failed nodes lack a "success" file in the core user's home directory. In every case, we can manually re-run deploy-kube-system and the process will complete on that node.
>
> When looking at the cloud-init.log file, we see:
>
> ###
>
> 2022-01-06 13:57:58,275 - subp.py[DEBUG]: Running command ['/var/lib/cloud/instance/scripts/runcmd'] with allowed return codes [0] (shell=False, capture=False)
> 2022-01-06 13:57:58,296 - subp.py[DEBUG]: Unexpected error while running command.
> Command: ['/var/lib/cloud/instance/scripts/runcmd']
> Exit code: 5
> Reason: -
> Stdout: -
> Stderr: -
> 2022-01-06 13:57:58,296 - cc_scripts_user.py[WARNING]: Failed to run module scripts-user (scripts in /var/lib/cloud/instance/scripts)
> 2022-01-06 13:57:58,296 - handlers.py[DEBUG]: finish: modules-final/config-scripts-user: FAIL: running config-scripts-user with frequency once-per-instance
> 2022-01-06 13:57:58,296 - util.py[WARNING]: Running module scripts-user (<module 'cloudinit.config.cc_scripts_user' from '/usr/lib/python3/dist-packages/cloudinit/config/cc_scripts_user.py'>) failed
> 2022-01-06 13:57:58,296 - util.py[DEBUG]: Running module scripts-user (<module 'cloudinit.config.cc_scripts_user' from '/usr/lib/python3/dist-packages/cloudinit/config/cc_scripts_user.py'>) failed
> Traceback (most recent call last):
>   File "/usr/lib/python3/dist-packages/cloudinit/stages.py", line 848, in _run_modules
>     ran, _r = cc.run(run_name, mod.handle, func_args,
>   File "/usr/lib/python3/dist-packages/cloudinit/cloud.py", line 54, in run
>     return self._runners.run(name, functor, args, freq, clear_on_fail)
>   File "/usr/lib/python3/dist-packages/cloudinit/helpers.py", line 185, in run
>     results = functor(*args)
>   File "/usr/lib/python3/dist-packages/cloudinit/config/cc_scripts_user.py", line 45, in handle
>     subp.runparts(runparts_path)
>   File "/usr/lib/python3/dist-packages/cloudinit/subp.py", line 384, in runparts
>     raise RuntimeError(
> RuntimeError: Runparts: 1 failures (runcmd) in 1 attempted commands
> 2022-01-06 13:57:58,300 - stages.py[DEBUG]: Running module ssh-authkey-fingerprints (<module 'cloudinit.config.cc_ssh_authkey_fingerprints' from '/usr/lib/python3/dist-packages/cloudinit/config/cc_ssh_authkey_fingerprints.py'>) with frequency once-per-instance
>
> ###
>
> Given that /var/lib/cloud/instance/scripts/runcmd contains the setup and deploy scripts, this seems like evidence of the problem. However, other than Exit code 5 we're not seeing much to go on.
>
> We also have several other CS 4.16.0 environments where this problem does *not* occur -- they are identically deployed with the same underlying scripts.
>
> Any thoughts or suggestions on where we can look to troubleshoot are greatly appreciated!
>
> --
> William (B.J.) Lawson, MD

--
William (B.J.) Lawson, MD
919.335.3107 (direct)
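
On the question of logging techniques: one low-effort step is to pull every failed command and its exit code out of cloud-init.log on each node, so the failing and healthy nodes can be compared side by side. The following is only a minimal sketch. It assumes the log uses the multi-line "Unexpected error while running command." record shown in the excerpt above, and `summarize_failures` is a hypothetical helper name, not part of cloud-init or CloudStack. The sample input is the record quoted in this thread.

```shell
#!/bin/sh
# Sketch: summarize failed commands recorded in a cloud-init log.
# summarize_failures is a hypothetical helper, not a cloud-init tool.
summarize_failures() {
    # $1: path to a cloud-init log file (usually /var/log/cloud-init.log)
    awk '
        # A failure record starts with this marker line.
        /Unexpected error while running command/ { in_err = 1; next }
        # Remember the command named in the record.
        in_err && /^Command:/ { cmd = $0; sub(/^Command: */, "", cmd); next }
        # Print one summary line once the exit code is seen.
        in_err && /^Exit code:/ {
            code = $0
            sub(/^Exit code: */, "", code)
            printf "exit=%s command=%s\n", code, cmd
            in_err = 0
        }
    ' "$1"
}

# Sample input: the failure record quoted in this thread.
sample=$(mktemp)
cat > "$sample" <<'EOF'
2022-01-06 13:57:58,275 - subp.py[DEBUG]: Running command ['/var/lib/cloud/instance/scripts/runcmd'] with allowed return codes [0] (shell=False, capture=False)
2022-01-06 13:57:58,296 - subp.py[DEBUG]: Unexpected error while running command.
Command: ['/var/lib/cloud/instance/scripts/runcmd']
Exit code: 5
Reason: -
Stdout: -
Stderr: -
EOF

summarize_failures "$sample"
rm -f "$sample"
```

Run against the sample, this prints `exit=5 command=['/var/lib/cloud/instance/scripts/runcmd']`. On a node you would call `summarize_failures /var/log/cloud-init.log` instead. The exit code alone does not say which step inside runcmd failed; cloud-init typically writes the stdout and stderr of those scripts to /var/log/cloud-init-output.log, so that file may be the next place to look.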