Re: [slurm-users] Jobs stuck in "completing" (CG) state
On 10/24/20 9:22 am, Kimera Rodgers wrote:

> [root@kla-ac-ohpc-01 critical]# srun -c 8 --pty bash -i
> srun: error: slurm_receive_msgs: Socket timed out on send/recv operation
> srun: error: Task launch for 37.0 failed on node c-node3: Socket timed out on send/recv operation
> srun: error: Application launch failed: Socket timed out on send/recv operation
> srun: Job step aborted: Waiting up to 32 seconds for job step to finish.

To me this looks like a networking issue, perhaps firewall/iptables rules blocking connections.

Best of luck,
Chris
--
Chris Samuel : http://www.csamuel.org/ : Berkeley, CA, USA
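One rough way to test the firewall theory is to check from the head node whether the slurmd port on the affected compute node accepts TCP connections. The sketch below is only an illustration: the helper name `port_reachable` is made up here, and the port numbers are Slurm's defaults, which a site may have changed through `SlurmdPort` and `SlurmctldPort` in slurm.conf.

```python
import socket

# Slurm's default ports. A site may override them via SlurmctldPort and
# SlurmdPort in slurm.conf, so treat these values as assumptions.
SLURMCTLD_PORT = 6817
SLURMD_PORT = 6818


def port_reachable(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name-resolution failures.
        return False
```

For example, `port_reachable("c-node3", SLURMD_PORT)` run on the head node checks the node named in the srun error, and `port_reachable("kla-ac-ohpc-01", SLURMCTLD_PORT)` run on a compute node checks the reverse direction. A passing check does not rule out a firewall problem, since srun also accepts connections back from the compute nodes on other ports.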
Re: [slurm-users] Jobs stuck in "completing" (CG) state
This can happen if the underlying storage is wedged. I would check that it is working properly. Usually the only way to clear this state is to either fix the stuck storage or reboot the node.

-Paul Edmon-

On 10/24/2020 12:22 PM, Kimera Rodgers wrote:

> I'm setting up Slurm on an OpenHPC cluster with one master node and 5 compute nodes.
> When I run test jobs, they get stuck in the CG state.
> Can someone give me a hint as to where I might have gone wrong?
>
> [root@kla-ac-ohpc-01 critical]# srun -c 8 --pty bash -i
> srun: error: slurm_receive_msgs: Socket timed out on send/recv operation
> srun: error: Task launch for 37.0 failed on node c-node3: Socket timed out on send/recv operation
> srun: error: Application launch failed: Socket timed out on send/recv operation
> srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
>
> [root@kla-ac-ohpc-01 critical]# squeue
> JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
>    36    normal bash test CG 0:53     2 c-node[1-2]
>    37    normal bash root CG 0:52     1 c-node3
>
> Thank you.
>
> Regards,
> Rodgers
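To check whether storage on a compute node is wedged, one option is to `stat` each shared mount with a timeout, since a hung network filesystem usually makes such calls block indefinitely. This is a minimal sketch, not a definitive diagnostic: the helper name `path_responsive` is hypothetical, and which paths to probe depends on the site.

```python
import subprocess


def path_responsive(path, timeout=5.0):
    """Return True if `stat` on path exits successfully within timeout seconds.

    Returns False if stat times out (a hint that the mount is hung) and
    also if stat fails outright, for example because the path is missing.
    """
    proc = subprocess.Popen(
        ["stat", path],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    try:
        return proc.wait(timeout=timeout) == 0
    except subprocess.TimeoutExpired:
        # Do not wait for the child here: a process blocked in
        # uninterruptible I/O may never exit, even after kill().
        proc.kill()
        return False
```

Run on a compute node such as c-node3, a call like `path_responsive("/home")` returning False would point at the storage rather than at Slurm itself. The path `/home` is only an assumed example of a shared mount.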
[slurm-users] Jobs stuck in "completing" (CG) state
I'm setting up Slurm on an OpenHPC cluster with one master node and 5 compute nodes.
When I run test jobs, they get stuck in the CG state.
Can someone give me a hint as to where I might have gone wrong?

[root@kla-ac-ohpc-01 critical]# srun -c 8 --pty bash -i
srun: error: slurm_receive_msgs: Socket timed out on send/recv operation
srun: error: Task launch for 37.0 failed on node c-node3: Socket timed out on send/recv operation
srun: error: Application launch failed: Socket timed out on send/recv operation
srun: Job step aborted: Waiting up to 32 seconds for job step to finish.

[root@kla-ac-ohpc-01 critical]# squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
   36    normal bash test CG 0:53     2 c-node[1-2]
   37    normal bash root CG 0:52     1 c-node3

Thank you.

Regards,
Rodgers