[I] CKS: NPE when trying to remove a external node from a cks cluster [cloudstack]

via GitHub Thu, 04 Sep 2025 11:16:08 -0700


kiranchavala opened a new issue, #11581:
URL: https://github.com/apache/cloudstack/issues/11581


   ### problem
   
   CKS: NPE when trying to remove an external node from a CKS cluster
   
   ### versions
   
   ACS 4.20.x
   
   ### The steps to reproduce the bug
   
   1. Register an external CKS template:
   
   https://download.cloudstack.org/testing/custom_templates/ubuntu/22.04/22.04/cks-ubuntu-2204-kvm.qcow2.bz2
   
   2. Launch a CKS cluster.
   
   3. Launch an Ubuntu VM with the template mentioned above.
   
   4. Once the Ubuntu VM boots up, add the management server public key to it.
   
   5. Add the Ubuntu VM as an external node to the CKS cluster.
   
   <img width="590" height="384" alt="Image" src="https://github.com/user-attachments/assets/e561e608-75f6-406e-9681-b3c4eddd2745" />
   
   6. The CKS cluster goes into the Importing state.
   
   7. The external node ends up in the NotReady state because its disk runs out of space.
   
   Log in to the external node and check cloud-init-output.log:
   
   ```
   unpacking registry.k8s.io/etcd:3.5.21-0 (sha256:d58c035df557080a27387d687092e3fc2b64c6d0e3162dc51453a115f847d121)...time="2025-09-04T09:32:05Z" level=info msg="apply failure, attempting cleanup" error="failed to extract layer sha256:edcdf51bd97dae2c7c6a75ab21cf445d5997888402357f5cc36e7582543431ac: write /var/lib/containerd/io.containerd.snapshotter.v1.overlayfs/snapshots/45/fs/usr/local/bin/etcd-3.4.18: no space left on device: unknown" key="extract-939483347-ZepJ sha256:c6230a0bcc0db1264e316a45b18c0a8dfab3c4818a4245035770b4e58967e035"
   time="2025-09-04T09:32:05Z" level=warning msg="extraction snapshot removal failed" error="write /var/lib/containerd/io.containerd.metadata.v1.bolt/meta.db: no space left on device: unknown" key="extract-939483347-ZepJ sha256:c6230a0bcc0db1264e316a45b18c0a8dfab3c4818a4245035770b4e58967e035"
   ctr: failed to extract layer sha256:edcdf51bd97dae2c7c6a75ab21cf445d5997888402357f5cc36e7582543431ac: write /var/lib/containerd/io.containerd.snapshotter.v1.overlayfs/snapshots/45/fs/usr/local/bin/etcd-3.4.18: no space left on device: unknown
   ctr: failed to ingest "blobs/sha256/0038afa1c30b6e7c6ed64ebbb3593756f0a5328da72cf3304be62b27cb40139a": failed to open writer: mkdir /var/lib/containerd/io.containerd.content.v1.content/ingest/f2e067fbba1183a1b7465f8eb36511a50660708c165d166d396d63d79880c233: no space left on device: unknown
   ctr: failed to ingest "blobs/sha256/0038afa1c30b6e7c6ed64ebbb3593756f0a5328da72cf3304be62b27cb40139a": failed to open writer: mkdir /var/lib/containerd/io.containerd.content.v1.content/ingest/f2e067fbba1183a1b7465f8eb36511a50660708c165d166d396d63d79880c233: no space left on device: unknown
   Loading docker image /mnt/k8sdisk//docker/etcd:3.5.21-0.tar failed!
   ctr: failed to ingest "blobs/sha256/0bab15eea81d0fe6ab56ebf5fba14e02c4c1775a7f7436fbddd3505add4e18fa": failed to open writer: mkdir /var/lib/containerd/io.containerd.content.v1.content/ingest/e32451d5bcc4edc020e31aecb569918a2e63aca4585fb9735e3e19ad02b818a0: no space left on device: unknown
   ctr: failed to ingest "blobs/sha256/0bab15eea81d0fe6ab56ebf5fba14e02c4c1775a7f7436fbddd3505add4e18fa": failed to open writer: mkdir /var/lib/containerd/io.containerd.content.v1.content/ingest/e32451d5bcc4edc020e31aecb569918a2e63aca4585fb9735e3e19ad02b818a0: no space left on device: unknown
   ctr: failed to ingest "blobs/sha256/0bab15eea81d0fe6ab56ebf5fba14e02c4c1775a7f7436fbddd3505add4e18fa": failed to open writer: mkdir /var/lib/containerd/io.containerd.content.v1.content/ingest/e32451d5bcc4edc020e31aecb569918a2e63aca4585fb9735e3e19ad02b818a0: no space left on device: unknown
   
   ```
   
   8. Stop the CKS cluster in order to remove the external node.
   
   9. The CKS cluster goes into the Stopped state, but the job that adds the external node keeps running:
   
   ```
   [root@ref-trl-9383-k-Mol8-kiran-chavala-mgmt1 ~]# tail -f /var/log/cloudstack/management/management-server.log | grep job-295
   2025-09-04 10:07:11,732 DEBUG [c.c.k.c.u.KubernetesClusterUtil] (API-Job-Executor-52:[ctx-09df8212, job-295, ctx-993d72f7]) (logid:00b69bc8) Checking ready nodes for the Kubernetes cluster KubernetesCluster {"id":9,"name":"isolated","uuid":"8663f80d-3ce7-47f4-b852-4eb145453e0a"} with total 3 provisioned nodes
   2025-09-04 10:07:12,529 DEBUG [c.c.k.c.u.KubernetesClusterUtil] (API-Job-Executor-52:[ctx-09df8212, job-295, ctx-993d72f7]) (logid:00b69bc8) Kubernetes cluster KubernetesCluster {"id":9,"name":"isolated","uuid":"8663f80d-3ce7-47f4-b852-4eb145453e0a"} has total 3 provisioned nodes while 2 ready now
   2025-09-04 10:07:16,880 WARN  [o.a.c.f.j.i.AsyncJobMonitor] (Timer-0:[ctx-2caef2bd]) (logid:299149e7) Task (job-295) has been pending for 2208 seconds
   2025-09-04 10:07:27,530 DEBUG [c.c.k.c.u.KubernetesClusterUtil] (API-Job-Executor-52:[ctx-09df8212, job-295, ctx-993d72f7]) (logid:00b69bc8) Checking ready nodes for the Kubernetes cluster KubernetesCluster {"id":9,"name":"isolated","uuid":"8663f80d-3ce7-47f4-b852-4eb145453e0a"} with total 3 provisioned nodes
   2025-09-04 10:07:28,209 DEBUG [c.c.k.c.u.KubernetesClusterUtil] (API-Job-Executor-52:[ctx-09df8212, job-295, ctx-993d72f7]) (logid:00b69bc8) Kubernetes cluster KubernetesCluster {"id":9,"name":"isolated","uuid":"8663f80d-3ce7-47f4-b852-4eb145453e0a"} has total 3 provisioned nodes while 2 ready now
   
   ```
   
   10. Destroy the external node and start the CKS cluster.
   
   An exception is observed:
   
   <img width="521" height="226" alt="Image" src="https://github.com/user-attachments/assets/658e4714-f933-4a36-a93f-97da589490e5" />
   
   11. The CKS cluster remains in the Alert state.
   
   12. Destroy the CKS cluster.
   
   Exception:
   
   <img width="1470" height="956" alt="Image" src="https://github.com/user-attachments/assets/bee7acb9-6f9c-4165-8ca1-935324b832d9" />
   
   ```
   
   2025-09-04 11:01:47,260 ERROR [c.c.a.ApiAsyncJobDispatcher] (API-Job-Executor-64:[ctx-bc12e86e, job-333]) (logid:2e067630) Unexpected exception while executing org.apache.cloudstack.api.command.user.kubernetes.cluster.DeleteKubernetesClusterCmd
   java.lang.NullPointerException: Cannot invoke "com.cloud.vm.VMInstanceVO.getBackupOfferingId()" because "vm" is null
        at com.cloud.kubernetes.cluster.KubernetesClusterManagerImpl.checkIfVmsAssociatedWithBackupOffering(KubernetesClusterManagerImpl.java:2010)
        at com.cloud.kubernetes.cluster.actionworkers.KubernetesClusterDestroyWorker.destroy(KubernetesClusterDestroyWorker.java:267)
        at com.cloud.kubernetes.cluster.KubernetesClusterManagerImpl.destroyKubernetesCluster(KubernetesClusterManagerImpl.java:2384)
        at com.cloud.kubernetes.cluster.KubernetesClusterManagerImpl.destroyKubernetesCluster(KubernetesClusterManagerImpl.java:2392)
        at com.cloud.kubernetes.cluster.KubernetesClusterManagerImpl.deleteKubernetesCluster(KubernetesClusterManagerImpl.java:1969)
        at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:77)
        at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.base/java.lang.reflect.Method.invoke(Method.java:569)
        at org.springframework.aop.support.AopUtils.invokeJoinpointUsingReflection(AopUtils.java:344)
        at org.springframework.aop.framework.ReflectiveMethodInvocation.invokeJoinpoint(ReflectiveMethodInvocation.java:198)
        at org.springframework.aop.framework.ReflectiveMethodInvocation.proceed(ReflectiveMethodInvocation.java:163)
        at org.apache.cloudstack.network.contrail.management.EventUtils$EventInterceptor.invoke(EventUtils.java:109)
        at org.springframework.aop.framework.ReflectiveMethodInvocation.proceed(ReflectiveMethodInvocation.java:175)
        at com.cloud.event.ActionEventInterceptor.invoke(ActionEventInterceptor.java:52)
        at org.springframework.aop.framework.ReflectiveMethodInvocation.proceed(ReflectiveMethodInvocation.java:175)
        at org.springframework.aop.interceptor.ExposeInvocationInterceptor.invoke(ExposeInvocationInterceptor.java:97)
        at org.springframework.aop.framework.ReflectiveMethodInvocation.proceed(ReflectiveMethodInvocation.java:186)
        at org.springframework.aop.framework.JdkDynamicAopProxy.invoke(JdkDynamicAopProxy.java:215)
        at jdk.proxy3/jdk.proxy3.$Proxy534.deleteKubernetesCluster(Unknown Source)
        at org.apache.cloudstack.api.command.user.kubernetes.cluster.DeleteKubernetesClusterCmd.execute(DeleteKubernetesClusterCmd.java:95)
        at com.cloud.api.ApiDispatcher.dispatch(ApiDispatcher.java:173)
        at com.cloud.api.ApiAsyncJobDispatcher.runJob(ApiAsyncJobDispatcher.java:110)
   
   ```
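
The trace above shows `vm` being null inside `checkIfVmsAssociatedWithBackupOffering`. A plausible but unconfirmed reading is that the cluster still references the destroyed external node, so the VM lookup returns nothing. The sketch below only illustrates a null-tolerant version of that kind of check. `VmRecord`, `BackupOfferingCheckSketch`, and the map-based lookup are hypothetical stand-ins, not the CloudStack classes or the actual fix.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for a VM row; NOT com.cloud.vm.VMInstanceVO.
final class VmRecord {
    final long id;
    final Long backupOfferingId; // null when the VM has no backup offering

    VmRecord(long id, Long backupOfferingId) {
        this.id = id;
        this.backupOfferingId = backupOfferingId;
    }
}

final class BackupOfferingCheckSketch {
    // Returns the ids of VMs that still exist and have a backup offering.
    // Ids whose lookup returns null are skipped instead of dereferenced.
    static List<Long> vmsWithBackupOffering(List<Long> clusterVmIds, Map<Long, VmRecord> vmStore) {
        List<Long> result = new ArrayList<>();
        for (Long vmId : clusterVmIds) {
            VmRecord vm = vmStore.get(vmId);
            if (vm == null) {
                // Assumed case: the cluster still maps a VM that was already destroyed.
                continue;
            }
            if (vm.backupOfferingId != null) {
                result.add(vm.id);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        Map<Long, VmRecord> store = new HashMap<>();
        store.put(1L, new VmRecord(1L, null));
        store.put(2L, new VmRecord(2L, 7L));
        // Id 3 is still listed for the cluster, but its VM record is gone.
        System.out.println(vmsWithBackupOffering(Arrays.asList(1L, 2L, 3L), store)); // prints [2]
    }
}
```

Whether a missing VM should be skipped silently, logged, or cleaned up from the cluster mapping is a design decision for the maintainers; the sketch just shows that the destroy path does not have to fail on it.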
   
   
   
   ...
   
   
   ### What to do about it?
   
   Workaround
   
   Deploy the external Ubuntu node with a root disk larger than 20 GB.
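
To confirm that a node hit the disk-space problem this workaround avoids, the cloud-init output can be scanned for the `no space left on device` message seen in the log in step 7. This is a minimal sketch; the class name `DiskFullLogScan` and the default file path are assumptions for illustration only.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;
import java.util.stream.Collectors;

final class DiskFullLogScan {
    // Message taken from the containerd/ctr errors quoted in this report.
    static final String MARKER = "no space left on device";

    // Returns only the log lines that contain the disk-full marker.
    static List<String> diskFullLines(List<String> logLines) {
        return logLines.stream()
                .filter(line -> line.contains(MARKER))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) throws IOException {
        // Assumed default path; pass the real log location as the first argument.
        Path log = Paths.get(args.length > 0 ? args[0] : "cloud-init-output.log");
        for (String line : diskFullLines(Files.readAllLines(log))) {
            System.out.println(line);
        }
    }
}
```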
   
   The NPE still needs to be fixed, because it leaves the CKS cluster stuck in the Alert state.
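
Separately from the NPE, the management-server log in step 9 shows the add-node job still checking for ready nodes after the cluster was stopped. This report does not show how that wait is implemented, so the following is only a generic sketch of a readiness wait that gives up on a timeout or when an abort condition (for example, the cluster no longer being in the expected state) becomes true. All names here are hypothetical.

```java
import java.util.concurrent.TimeUnit;
import java.util.function.BooleanSupplier;

final class BoundedReadinessWait {
    // Polls until allNodesReady is true (returns true), or until shouldAbort
    // is true, the timeout elapses, or the thread is interrupted (returns false).
    static boolean waitUntilReady(BooleanSupplier allNodesReady,
                                  BooleanSupplier shouldAbort,
                                  long timeoutMillis,
                                  long pollIntervalMillis) {
        long deadline = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(timeoutMillis);
        while (true) {
            if (shouldAbort.getAsBoolean()) {
                return false; // e.g. the cluster was stopped while the job was waiting
            }
            if (allNodesReady.getAsBoolean()) {
                return true;
            }
            if (System.nanoTime() - deadline >= 0) {
                return false; // timed out
            }
            try {
                Thread.sleep(pollIntervalMillis);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // preserve the interrupt for callers
                return false;
            }
        }
    }
}
```

A caller could pass a supplier that compares the ready-node count with the provisioned count, and an abort supplier that reports whether the cluster has left the state the job expects.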


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

