[slurm-users] Cloud nodes remain in state "alloc#"
Hi,

I'm using Slurm's elastic compute functionality to spin up nodes in the cloud, alongside a controller which is also in the cloud.

When executing a job, Slurm correctly places a node into the state "alloc#" and calls my resume program. My resume program successfully provisions the cloud node, and slurmd comes up without a problem. My resume program then retrieves the IP address of the cloud node and updates the controller as follows:

    scontrol update nodename=foo nodeaddr=bar

And then nothing happens! The node remains in the state "alloc#" until ResumeTimeout is reached, at which point the controller gives up.

I'm fairly confident that slurmd is able to talk to the controller, because if I specify an incorrect hostname for the controller in my slurm.conf, then slurmd immediately errors on startup and exits with a message saying something like "unable to contact controller".

What am I missing? Thanks very much in advance if anybody has any ideas!
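[Editor's note: for illustration, a minimal resume-program sketch of the flow described above. provision_node and get_node_ip are hypothetical placeholders for the cloud-provider calls; the scontrol usage is standard. One common gap is that updating nodeaddr alone is not always enough, since the controller may also resolve the node via its NodeHostname, so many cloud setups update both:]

    #!/bin/bash
    # ResumeProgram sketch: slurmctld passes the hostlist (e.g. "cloud[1-3]") as $1.
    for node in $(scontrol show hostnames "$1"); do
        provision_node "$node"        # hypothetical cloud-provider call
        ip=$(get_node_ip "$node")     # hypothetical address lookup
        # Update both the address RPCs are sent to and the hostname the
        # controller expects slurmd to register with:
        scontrol update nodename="$node" nodeaddr="$ip" nodehostname="$ip"
    done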
Re: [slurm-users] Jobs stuck in "completing" (CG) state
On 10/24/20 9:22 am, Kimera Rodgers wrote:
> [root@kla-ac-ohpc-01 critical]# srun -c 8 --pty bash -i
> srun: error: slurm_receive_msgs: Socket timed out on send/recv operation
> srun: error: Task launch for 37.0 failed on node c-node3: Socket timed out on send/recv operation
> srun: error: Application launch failed: Socket timed out on send/recv operation
> srun: Job step aborted: Waiting up to 32 seconds for job step to finish.

To me this looks like networking issues, perhaps firewall/iptables rules blocking connections.

Best of luck,
Chris

--
Chris Samuel : http://www.csamuel.org/ : Berkeley, CA, USA
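[Editor's note: a few quick checks along the lines Chris suggests; a sketch assuming the default SlurmctldPort=6817 and SlurmdPort=6818 and firewalld as on a stock CentOS/OpenHPC install. Hostnames are taken from the output above; the interface name is an assumption to adjust for your site:]

    # On a compute node: can it reach slurmctld (default port 6817)?
    nc -zv kla-ac-ohpc-01 6817

    # On the controller: can it reach slurmd on the node (default port 6818)?
    nc -zv c-node3 6818

    # Inspect the active firewalld rules on each host:
    firewall-cmd --list-all

    # srun also needs ephemeral ports back to the submit host, so many
    # sites simply trust the cluster-internal interface (eth1 here is an
    # assumption):
    firewall-cmd --permanent --zone=trusted --add-interface=eth1
    firewall-cmd --reload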
Re: [slurm-users] Jobs stuck in "completing" (CG) state
This can happen if the underlying storage is wedged. I would check that it is working properly. Usually the only way to clear this state is either to fix the stuck storage or to reboot the node.

-Paul Edmon-

On 10/24/2020 12:22 PM, Kimera Rodgers wrote:
> I'm setting up Slurm on an OpenHPC cluster with one master node and 5
> compute nodes. When I run test jobs, the jobs get stuck in the CG
> ("completing") state. Can someone give me a hint on where I might have
> gone wrong?
>
> [root@kla-ac-ohpc-01 critical]# srun -c 8 --pty bash -i
> srun: error: slurm_receive_msgs: Socket timed out on send/recv operation
> srun: error: Task launch for 37.0 failed on node c-node3: Socket timed out on send/recv operation
> srun: error: Application launch failed: Socket timed out on send/recv operation
> srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
>
> [root@kla-ac-ohpc-01 critical]# squeue
>   JOBID PARTITION  NAME  USER  ST  TIME  NODES  NODELIST(REASON)
>      36    normal  bash  test  CG  0:53      2  c-node[1-2]
>      37    normal  bash  root  CG  0:52      1  c-node3
>
> Thank you.
> Regards,
> Rodgers
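[Editor's note: a rough sketch of the recovery path Paul describes. The node name is taken from the squeue output above; the reason string is arbitrary. Note that marking a node down clears its jobs, including those stuck in CG:]

    # Look for processes stuck in uninterruptible sleep (D state), a
    # classic symptom of wedged storage such as a hung NFS mount:
    ps axo pid,stat,wchan:32,cmd | awk '$2 ~ /D/'

    # If the storage cannot be revived, mark the node down, reboot it,
    # then return it to service:
    scontrol update nodename=c-node3 state=down reason="wedged storage"
    scontrol update nodename=c-node3 state=resume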
[slurm-users] Jobs stuck in "completing" (CG) state
I'm setting up Slurm on an OpenHPC cluster with one master node and 5 compute nodes. When I run test jobs, the jobs get stuck in the CG ("completing") state. Can someone give me a hint on where I might have gone wrong?

    [root@kla-ac-ohpc-01 critical]# srun -c 8 --pty bash -i
    srun: error: slurm_receive_msgs: Socket timed out on send/recv operation
    srun: error: Task launch for 37.0 failed on node c-node3: Socket timed out on send/recv operation
    srun: error: Application launch failed: Socket timed out on send/recv operation
    srun: Job step aborted: Waiting up to 32 seconds for job step to finish.

    [root@kla-ac-ohpc-01 critical]# squeue
      JOBID PARTITION  NAME  USER  ST  TIME  NODES  NODELIST(REASON)
         36    normal  bash  test  CG  0:53      2  c-node[1-2]
         37    normal  bash  root  CG  0:52      1  c-node3

Thank you.
Regards,
Rodgers
Re: [slurm-users] pam_slurm_adopt always claims no active jobs even when they exist
Hi Paul,

maybe this is totally unrelated, but we also have a similar issue with pam_slurm_adopt in case ConstrainRAMSpace=no is set in cgroup.conf and more than one job is running on that node. There is a bug report open at: https://bugs.schedmd.com/show_bug.cgi?id=9355

As a workaround, we currently advise users not to use ssh but to attach an interactive shell under an already allocated job by running the following command:

    srun --jobid=<jobid> --pty /bin/bash

For a single-node job, the user does not even need to know the node that the job is running on. For a multi-node job, the user can still use the '-w <nodename>' option to specify a specific node.

Best regards
Jürgen

--
Jürgen Salk
Scientific Software & Compute Services (SSCS)
Kommunikations- und Informationszentrum (kiz)
Universität Ulm
Telefon: +49 (0)731 50-22478
Telefax: +49 (0)731 50-22471

* Paul Raines [201023 13:13]:
>
> I am running Slurm 20.02.3 on CentOS 7 systems. I have pam_slurm_adopt
> setup in /etc/pam.d/system-auth and slurm.conf has PrologFlags=Contain,X11
> I also have masked systemd-logind
>
> But pam_slurm_adopt always denies login with "Access denied by
> pam_slurm_adopt: you have no active jobs on this node" even when the
> user most definitely has a job running on the node via srun
>
> Any clues as to why pam_slurm_adopt thinks there is no job?
>
> serena [raines] squeue
>   JOBID PARTITION  NAME    USER  ST      TIME  NODES  NODELIST(REASON)
>     785    lcnrtx  tcsh  raines   R  19:44:51      1  rtx-03
>
> serena [raines] ssh rtx-03
> Access denied by pam_slurm_adopt: you have no active jobs on this node
> Authentication failed.
>
> -- GPG A997BA7A | 87FC DA31 5F00 C885 0DC3 E28F BD0D 4B33 A997 BA7A
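[Editor's note: purely illustrative, using the job from Paul's squeue output above to show the workaround Jürgen describes. On Slurm releases from 20.11 onward the extra step may also need --overlap to share resources with the running step; on 20.02 as here that flag does not exist:]

    # Attach an interactive shell inside the allocation of job 785:
    srun --jobid=785 --pty /bin/bash

    # For a multi-node job, pin the shell to a specific node:
    srun --jobid=785 -w rtx-03 --pty /bin/bash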