Hi,

Could you please try running the deploy-kube-system script manually on the
node? That may give us a hint as to what the issue is. It is possible that the
setup-kube-system service did not complete successfully, particularly the
'kubeadm init' operation, and deploy-kube-system requires the
setup-kube-system service to have run successfully. So it may also be worth
checking the status of the setup-kube-system service. Could you also please
share the Kubernetes version?
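
For the status check, a minimal sketch along these lines could be run on each
control node. It assumes the nodes use systemd and that the unit is named
setup-kube-system as above; the helper name unit_report is only illustrative.

```python
import subprocess


def unit_report(unit, run=subprocess.run):
    """Return systemd's ActiveState and Result properties for a unit."""
    # 'systemctl show' prints one KEY=VALUE pair per line for each
    # requested property.
    proc = run(
        ["systemctl", "show", unit, "--property=ActiveState,Result"],
        capture_output=True,
        text=True,
        check=False,
    )
    report = {}
    for line in proc.stdout.splitlines():
        key, sep, value = line.partition("=")
        if sep:
            report[key] = value
    return report
```

Calling unit_report("setup-kube-system") on a node where the service did not
finish cleanly should show a Result other than "success", and
'journalctl -u setup-kube-system' would then be the next place to look.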

Thanks,
Pearl


________________________________
From: William (B.J.) Lawson, MD <lawson...@gmail.com>
Sent: Thursday, January 6, 2022 8:04 PM
To: users@cloudstack.apache.org <users@cloudstack.apache.org>
Subject: 4.16.0: Unpredictable failure to successfully complete 
deploy-kube-system on HA k8s clusters

Good morning... we have two CloudStack 4.16.0 environments where HA k8s
clusters (meaning clusters with > 1 control node) consistently fail to
provision successfully.

Clusters with 1 control node reliably deploy their VMs and networking, and
start. However, when allocating 2+ control nodes, the K8s cluster invariably
remains in the "Starting" state indefinitely, even though all of the VMs have
started.

Logging into the nodes reveals that not all of the nodes are running
deploy-kube-system successfully. The failed nodes lack a "success" file in
the core user's home directory. In every case, we can manually
re-run deploy-kube-system and the process will complete on that node.

When looking at the cloud-init.log file, we see:

###

2022-01-06 13:57:58,275 - subp.py[DEBUG]: Running command
['/var/lib/cloud/instance/scripts/runcmd'] with allowed return codes [0]
(shell=False, capture=False)
2022-01-06 13:57:58,296 - subp.py[DEBUG]: Unexpected error while running
command.
Command: ['/var/lib/cloud/instance/scripts/runcmd']
Exit code: 5
Reason: -
Stdout: -
Stderr: -
2022-01-06 13:57:58,296 - cc_scripts_user.py[WARNING]: Failed to run module
scripts-user (scripts in /var/lib/cloud/instance/scripts)
2022-01-06 13:57:58,296 - handlers.py[DEBUG]: finish:
modules-final/config-scripts-user: FAIL: running config-scripts-user with
frequency once-per-instance
2022-01-06 13:57:58,296 - util.py[WARNING]: Running module scripts-user
(<module 'cloudinit.config.cc_scripts_user' from
'/usr/lib/python3/dist-packages/cloudinit/config/cc_scripts_user.py'>)
failed
2022-01-06 13:57:58,296 - util.py[DEBUG]: Running module scripts-user
(<module 'cloudinit.config.cc_scripts_user' from
'/usr/lib/python3/dist-packages/cloudinit/config/cc_scripts_user.py'>)
failed
Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/cloudinit/stages.py", line 848, in
_run_modules
    ran, _r = cc.run(run_name, mod.handle, func_args,
  File "/usr/lib/python3/dist-packages/cloudinit/cloud.py", line 54, in run
    return self._runners.run(name, functor, args, freq, clear_on_fail)
  File "/usr/lib/python3/dist-packages/cloudinit/helpers.py", line 185, in
run
    results = functor(*args)
  File
"/usr/lib/python3/dist-packages/cloudinit/config/cc_scripts_user.py", line
45, in handle
    subp.runparts(runparts_path)
  File "/usr/lib/python3/dist-packages/cloudinit/subp.py", line 384, in
runparts
    raise RuntimeError(
RuntimeError: Runparts: 1 failures (runcmd) in 1 attempted commands
2022-01-06 13:57:58,300 - stages.py[DEBUG]: Running module
ssh-authkey-fingerprints (<module
'cloudinit.config.cc_ssh_authkey_fingerprints' from
'/usr/lib/python3/dist-packages/cloudinit/config/cc_ssh_authkey_fingerprints.py'>)
with frequency once-per-instance

###

Given that /var/lib/cloud/instance/scripts/runcmd contains the setup and
deploy scripts, this seems like evidence of the problem. However, other
than Exit code 5 we're not seeing much to go on.
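
To compare nodes quickly, the failing command and its exit code can be pulled
out of cloud-init.log with something like the sketch below. It relies only on
the line layout visible in the excerpt above (a timestamped "Unexpected error
while running" line followed by "Command:" and "Exit code:" lines); the
function name find_failed_commands is illustrative, and the path to the log
file is left to the caller.

```python
import re

FAILURE_MARKER = "Unexpected error while running"
TIMESTAMP = re.compile(r"\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}")


def find_failed_commands(log_text):
    """Return (timestamp, command, exit_code) for each failed command."""
    failures = []
    lines = log_text.splitlines()
    for i, line in enumerate(lines):
        if FAILURE_MARKER not in line:
            continue
        timestamp = line.split(" - ", 1)[0]
        command = exit_code = None
        # The detail lines follow the marker; stop at the next
        # timestamped log entry.
        for detail in lines[i + 1:]:
            if TIMESTAMP.match(detail):
                break
            if detail.startswith("Command:"):
                command = detail[len("Command:"):].strip()
            elif detail.startswith("Exit code:"):
                exit_code = detail[len("Exit code:"):].strip()
        failures.append((timestamp, command, exit_code))
    return failures
```

Running this over the log from a failed node and from a healthy node would
show whether the runcmd failure is specific to the nodes without the
"success" file. It does not explain what exit code 5 means; that still has to
come from the script that runcmd invokes.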

We also have several other CS 4.16.0 environments where this problem does
*not* occur -- they are identically deployed with the same underlying
scripts.

Any thoughts or suggestions on where we can look to troubleshoot are
greatly appreciated!

--
William (B.J.) Lawson, MD

 
