Hi Pearl, thanks for your help.

You are correct -- in all cases, running the deploy-kube-system script
manually on the failed node completes successfully and without issues.

I suspect you are also correct that there may be some race condition where
setup-kube-system has not completed successfully before deploy-kube-system
first runs, although by the time we log into a node that has failed initial
setup, the setup-kube-system service appears to have completed normally:

● setup-kube-system.service
     Loaded: loaded (/etc/systemd/system/setup-kube-system.service; static)
     Active: inactive (dead)
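
One way to rule the race in or out would be to make the second step wait
explicitly for the first. Below is a minimal Python sketch, not taken from the
CloudStack scripts. The helper name wait_until is my own, and so is the marker
path in the usage note: kubeadm init normally writes /etc/kubernetes/admin.conf
on a control node, but I have not confirmed what deploy-kube-system actually
depends on.

```python
import time
from typing import Callable


def wait_until(
    check: Callable[[], bool],
    timeout_s: float = 600.0,
    interval_s: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll check() until it returns True or timeout_s elapses.

    Returns True if the condition was met, False on timeout.
    The sleep function is injectable so the loop can be tested quickly.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        if check():
            return True
        if time.monotonic() >= deadline:
            return False
        sleep(interval_s)
```

On a node this could be called with something like
wait_until(lambda: os.path.exists("/etc/kubernetes/admin.conf")) before
invoking deploy-kube-system, logging a clear message if it returns False.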

We are running Kubernetes version 1.22.2 and have tried templates from
CloudStack and ShapeBlue -- although the checksums are identical, so we
didn't expect a difference there.

Are there any longer-running operations in setup-kube-system that might be
worth trying to explore in more detail?

I've seen in the source code that there are separate cloud-init files
for k8s-control-node-add (I think that means "additional"?)
and k8s-control-node (which might be the primary control node), so I was
thinking there might be a step in the k8s-control-node-add file
that is tripping things up, since the issue only shows up with > 1 control
node.

However, the node(s) that fail to complete seem to be random -- typically
it is one or more of the additional control nodes, but compute nodes
also fail to complete.

Appreciate any thoughts / logging techniques to look for more information!
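
In case it helps compare nodes, here is a minimal Python sketch for pulling
failed commands and their exit codes out of cloud-init.log. The function name
find_failed_commands is hypothetical, and the parsing assumes the
"Command: / Exit code:" layout seen in the excerpt quoted below, with each
field on its own line in the real file.

```python
import re
from typing import Dict, List

# Field lines that cloud-init prints as part of a failed-command report.
CMD_RE = re.compile(r"^Command:\s*(.+)$")
EXIT_RE = re.compile(r"^Exit code:\s*(\S+)")


def find_failed_commands(log_text: str) -> List[Dict[str, str]]:
    """Return one {"command", "exit_code"} dict per failure report found."""
    failures: List[Dict[str, str]] = []
    current = None
    for raw in log_text.splitlines():
        line = raw.strip()
        cmd = CMD_RE.match(line)
        if cmd:
            # Start of a new failure report; the exit code follows on a later line.
            current = {"command": cmd.group(1), "exit_code": ""}
            failures.append(current)
            continue
        code = EXIT_RE.match(line)
        if code and current is not None:
            current["exit_code"] = code.group(1)
            current = None
    return failures
```

Running this over /var/log/cloud-init.log from a good node and a failed node
should at least show whether exit code 5 is the only difference. On most
images cloud-init also writes script output to /var/log/cloud-init-output.log,
which may show what runcmd printed before it exited.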

BJ

On Thu, Jan 6, 2022 at 11:29 PM Pearl d'Silva <pearl.dsi...@shapeblue.com>
wrote:

> Hi,
>
> Could you please try running the deploy-kube-system script manually on the
> node? Maybe this would give us a hint as to what the issue could be. It
> may be that the setup-kube-system service has not completed
> successfully - particularly the 'kubeadm init' operation - and
> deploy-kube-system requires the setup-kube-system service to have run
> successfully. So, it may also be worth checking the status of the
> setup-kube-system service. Could you also please share the Kubernetes version?
>
> Thanks,
> Pearl
>
>
> ________________________________
> From: William (B.J.) Lawson, MD <lawson...@gmail.com>
> Sent: Thursday, January 6, 2022 8:04 PM
> To: users@cloudstack.apache.org <users@cloudstack.apache.org>
> Subject: 4.16.0: Unpredictable failure to successfully complete
> deploy-kube-system on HA k8s clusters
>
> Good morning... we have two CloudStack 4.16.0 environments where HA k8s
> clusters (meaning clusters with > 1 control node) consistently fail to
> provision successfully.
>
> Clusters with 1 control node reliably deploy their VMs, networking, and start...
> however, when allocating 2+ control nodes, invariably the K8s cluster
> remains indefinitely in the "Starting" state despite all of the VMs being
> started.
>
> Logging into the nodes reveals that not all of the nodes are running
> deploy-kube-system successfully. The failed nodes lack a "success" file in
> the core user's home directory. In every case, we can manually
> re-run deploy-kube-system and the process will complete on that node.
>
> When looking at the cloud-init.log file, we see:
>
> ###
>
> 2022-01-06 13:57:58,275 - subp.py[DEBUG]: Running command
> ['/var/lib/cloud/instance/scripts/runcmd'] with allowed return codes [0]
> (shell=False, capture=False)
> 2022-01-06 13:57:58,296 - subp.py[DEBUG]: Unexpected error while running
> command.
> Command: ['/var/lib/cloud/instance/scripts/runcmd']
> Exit code: 5
> Reason: -
> Stdout: -
> Stderr: -
> 2022-01-06 13:57:58,296 - cc_scripts_user.py[WARNING]: Failed to run module
> scripts-user (scripts in /var/lib/cloud/instance/scripts)
> 2022-01-06 13:57:58,296 - handlers.py[DEBUG]: finish:
> modules-final/config-scripts-user: FAIL: running config-scripts-user with
> frequency once-per-instance
> 2022-01-06 13:57:58,296 - util.py[WARNING]: Running module scripts-user
> (<module 'cloudinit.config.cc_scripts_user' from
> '/usr/lib/python3/dist-packages/cloudinit/config/cc_scripts_user.py'>)
> failed
> 2022-01-06 13:57:58,296 - util.py[DEBUG]: Running module scripts-user
> (<module 'cloudinit.config.cc_scripts_user' from
> '/usr/lib/python3/dist-packages/cloudinit/config/cc_scripts_user.py'>)
> failed
> Traceback (most recent call last):
>   File "/usr/lib/python3/dist-packages/cloudinit/stages.py", line 848, in
> _run_modules
>     ran, _r = cc.run(run_name, mod.handle, func_args,
>   File "/usr/lib/python3/dist-packages/cloudinit/cloud.py", line 54, in run
>     return self._runners.run(name, functor, args, freq, clear_on_fail)
>   File "/usr/lib/python3/dist-packages/cloudinit/helpers.py", line 185, in
> run
>     results = functor(*args)
>   File
> "/usr/lib/python3/dist-packages/cloudinit/config/cc_scripts_user.py", line
> 45, in handle
>     subp.runparts(runparts_path)
>   File "/usr/lib/python3/dist-packages/cloudinit/subp.py", line 384, in
> runparts
>     raise RuntimeError(
> RuntimeError: Runparts: 1 failures (runcmd) in 1 attempted commands
> 2022-01-06 13:57:58,300 - stages.py[DEBUG]: Running module
> ssh-authkey-fingerprints (<module
> 'cloudinit.config.cc_ssh_authkey_fingerprints' from
>
> '/usr/lib/python3/dist-packages/cloudinit/config/cc_ssh_authkey_fingerprints.py'>)
> with frequency once-per-instance
>
> ###
>
> Given that /var/lib/cloud/instance/scripts/runcmd contains the setup and
> deploy scripts, this seems like evidence of the problem. However, other
> than Exit code 5 we're not seeing much to go on.
>
> We also have several other CS 4.16.0 environments where this problem does
> *not* occur -- they are identically deployed with the same underlying
> scripts.
>
> Any thoughts or suggestions on where we can look to troubleshoot are
> greatly appreciated!
>
> --
> William (B.J.) Lawson, MD
>
>
>
>

-- 
William (B.J.) Lawson, MD
919.335.3107 (direct)
