zap51 commented on issue #7829:
URL: https://github.com/apache/cloudstack/issues/7829#issuecomment-1677912427

   > I created a new cluster to test this with 1 + 10 nodes. In this cluster, 
   > 
   > 1. I can find the "org.apache.cloudstack.storage.command.AttachCommand" 
entries for every node. **Even the failed ones**
   > 
   > 2. The logs and the CloudStack GUI reports the jobs being finished (both 
attach and detach jobs):
   > ```
   > Aug 14 12:27:37 se-flem-001 java[595738]: INFO  
[c.c.k.c.a.KubernetesClusterActionWorker] (API-Job-Executor-14:ctx-63b2bc93 
job-26834 ctx-15a4fa84) (logid:bc9fab70) Attached binaries ISO for VM : 
test-cluster-control-189f3756fa2 in cluster: test-cluster
   > Aug 14 12:27:37 se-flem-001 java[595738]: INFO  
[c.c.k.c.a.KubernetesClusterActionWorker] (API-Job-Executor-14:ctx-63b2bc93 
job-26834 ctx-15a4fa84) (logid:bc9fab70) Attached binaries ISO for VM : 
test-cluster-node-189f375d12e in cluster: test-cluster
   > ... same for the nodes in between
   > Aug 14 12:27:43 se-flem-001 java[595738]: INFO  
[c.c.k.c.a.KubernetesClusterActionWorker] (API-Job-Executor-14:ctx-63b2bc93 
job-26834 ctx-15a4fa84) (logid:bc9fab70) Attached binaries ISO for VM : 
test-cluster-node-189f3787958 in cluster: test-cluster
   > Aug 14 12:27:43 se-flem-001 java[595738]: INFO  
[c.c.k.c.a.KubernetesClusterActionWorker] (API-Job-Executor-14:ctx-63b2bc93 
job-26834 ctx-15a4fa84) (logid:bc9fab70) Attached binaries ISO for VM : 
test-cluster-node-189f378ccf0 in cluster: test-cluster
   > ```
   > ```
   > Aug 14 12:56:11 se-flem-001 java[595738]: INFO  
[c.c.k.c.a.KubernetesClusterActionWorker] (API-Job-Executor-14:ctx-63b2bc93 
job-26834 ctx-15a4fa84) (logid:bc9fab70) Detached Kubernetes binaries from VM : 
test-cluster-control-189f3756fa2 in the Kubernetes cluster : test-cluster
   > Aug 14 12:56:12 se-flem-001 java[595738]: INFO  
[c.c.k.c.a.KubernetesClusterActionWorker] (API-Job-Executor-14:ctx-63b2bc93 
job-26834 ctx-15a4fa84) (logid:bc9fab70) Detached Kubernetes binaries from VM : 
test-cluster-node-189f375d12e in the Kubernetes cluster : test-cluster
   > ... same for the nodes in between
   > Aug 14 12:56:54 se-flem-001 java[595738]: INFO  
[c.c.k.c.a.KubernetesClusterActionWorker] (API-Job-Executor-14:ctx-63b2bc93 
job-26834 ctx-15a4fa84) (logid:bc9fab70) Detached Kubernetes binaries from VM : 
test-cluster-node-189f3787958 in the Kubernetes cluster : test-cluster
   > Aug 14 12:56:56 se-flem-001 java[595738]: INFO  
[c.c.k.c.a.KubernetesClusterActionWorker] (API-Job-Executor-14:ctx-63b2bc93 
job-26834 ctx-15a4fa84) (logid:bc9fab70) Detached Kubernetes binaries from VM : 
test-cluster-node-189f378ccf0 in the Kubernetes cluster : test-cluster
   > ```
   > 
   > I've tried to find relevant logs after the Attach and Detach jobs 
complete, but I only find one log entry after the last detach saying that the 
job failed:
   > 
   > ```
   > 2023-08-14 12:56:56,831 INFO  [c.c.k.c.a.KubernetesClusterActionWorker] 
(API-Job-Executor-14:ctx-63b2bc93 job-26834 ctx-15a4fa84) (logid:bc9fab70) 
Detached Kubernetes binaries from VM : test-cluster-node-189f378ccf0 in the 
Kubernetes cluster : test-cluster
   > 2023-08-14 12:56:56,832 DEBUG [o.a.c.f.j.i.AsyncJobManagerImpl] 
(API-Job-Executor-14:ctx-63b2bc93 job-26834) (logid:bc9fab70) Complete async 
job-26834, jobStatus: FAILED, resultCode: 530, result: 
org.apache.cloudstack.api.response.ExceptionResponse/null/{"uuidList":[],"errorcode":"530","errortext":"Failed
 to setup Kubernetes cluster : test-cluster in usable state as unable to 
provision API endpoint for the cluster"}
   > ```
   > 
   > So it seems we hit a timeout (_Waiting for Binaries directory 
/mnt/k8sdisk/ to be available, sleeping for 15 seconds, attempt: 100_) even 
though the logs say the attach/detach jobs run.
   
   I'll try to reproduce this issue and come back. @weizhouapache has provided 
a patch to only consider images/ISOs when the secondary storage has the image 
in the READY state, but it still doesn't work.
   
   @saffronjam would you be able to check whether the binary ISOs present on 
the secondary storage match across all the NFS shares (RW and RO)? Comparing 
checksums would be a good way to verify this; a rough sketch follows below. 
Also, do we happen to see any network issues between the hypervisors and the 
NFS server? I haven't taken a full look at the logs yet; I'll come back.
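   For the checksum comparison, here is a minimal sketch, assuming the RW and 
RO shares are mounted somewhere reachable (the mount points and the ISO's 
relative install path below are placeholders) and the ISO sits at the same 
relative path on both:

```python
# Rough sketch of the checksum comparison: hash the Kubernetes binaries ISO on
# each secondary-storage NFS mount and flag any mismatch between the copies.
import hashlib
from pathlib import Path

# Hypothetical mount points of the RW and RO NFS shares.
MOUNTS = ["/mnt/secstore-rw", "/mnt/secstore-ro"]
# Placeholder relative path of the ISO inside the secondary storage layout.
ISO_RELATIVE_PATH = "template/tmpl/1/1234/kubernetes-binaries.iso"

def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Stream the file in 1 MiB chunks and return its SHA-256 hex digest."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

checksums = {mount: sha256_of(Path(mount) / ISO_RELATIVE_PATH) for mount in MOUNTS}
for mount, checksum in checksums.items():
    print(f"{mount}: {checksum}")
if len(set(checksums.values())) > 1:
    print("Mismatch: the ISO differs between the shares")
```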
   
   Thanks,
   Jayanth

