[slurm-users] Cloud nodes remain in state "alloc#"

2020-10-24 Thread Rupert Madden-Abbott
Hi,

I'm using Slurm's elastic compute functionality to spin up nodes in the
cloud, alongside a controller which is also in the cloud.

When executing a job, Slurm correctly places a node into the state "alloc#"
and calls my resume program. My resume program successfully provisions the
cloud node and slurmd comes up without a problem.

My resume program then retrieves the IP address of my cloud node and
updates the controller as follows:

scontrol update nodename=foo nodeaddr=bar
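
For context, the resume program follows roughly this pattern (the
"my-cloud-provision" step is just a stand-in for my actual cloud tooling):

  #!/bin/bash
  # ResumeProgram: slurmctld calls this with a hostlist of nodes to bring up
  for node in $(scontrol show hostnames "$1"); do
      # placeholder for the real provisioning step
      ip=$(my-cloud-provision "$node")
      # tell the controller where to reach the freshly booted node
      scontrol update nodename="$node" nodeaddr="$ip"
  done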

And then nothing happens! The node remains in the state "alloc#" until the
ResumeTimeout is reached at which point the controller gives up.

I'm fairly confident that slurmd is able to talk to the controller,
because if I specify an incorrect hostname for the controller in my
slurm.conf, then slurmd immediately errors on startup and exits with a
message saying something like "unable to contact controller".
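
For what it's worth, this is how I've been sanity-checking connectivity
in each direction (assuming the default ports, 6817 for slurmctld and
6818 for slurmd, and that nc is installed):

  # on the cloud node: can slurmd reach the controller?
  scontrol ping

  # on the controller: can slurmctld reach the node's slurmd?
  nc -zv bar 6818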

What am I missing?

Thanks very much in advance if anybody has any ideas!


Re: [slurm-users] Jobs stuck in "completing" (CG) state

2020-10-24 Thread Chris Samuel

On 10/24/20 9:22 am, Kimera Rodgers wrote:


[root@kla-ac-ohpc-01 critical]# srun -c 8 --pty bash -i
srun: error: slurm_receive_msgs: Socket timed out on send/recv operation
srun: error: Task launch for 37.0 failed on node c-node3: Socket timed out on send/recv operation
srun: error: Application launch failed: Socket timed out on send/recv operation

srun: Job step aborted: Waiting up to 32 seconds for job step to finish.


To me this looks like networking issues, perhaps firewall/iptables rules 
blocking connections.
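
A quick way to check on the compute nodes, assuming a stock
CentOS/OpenHPC setup with firewalld/iptables and the default Slurm
ports (6817 for slurmctld, 6818 for slurmd):

  firewall-cmd --state
  iptables -L -n
  # from the head node, confirm slurmd on the failing node is reachable:
  nc -zv c-node3 6818

Bear in mind that launched steps also connect back to srun on the
submitting host, so traffic needs to be open in both directions.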


Best of luck,
Chris
--
Chris Samuel  :  http://www.csamuel.org/  :  Berkeley, CA, USA



Re: [slurm-users] Jobs stuck in "completing" (CG) state

2020-10-24 Thread Paul Edmon
This can happen if the underlying storage is wedged.  I would check that 
it is working properly.


Usually the only way to clear this state is either fix the stuck storage 
or reboot the node.
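
One quick way to confirm that, assuming you can still log in to the
node: look for the job's processes stuck in uninterruptible sleep
(state D), which usually means hung I/O, and check the kernel log for
hung-task messages:

  ps -eo pid,stat,wchan:32,comm | awk '$2 ~ /D/'
  dmesg | grep -i "blocked for more than"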


-Paul Edmon-

On 10/24/2020 12:22 PM, Kimera Rodgers wrote:
I'm setting up Slurm on an OpenHPC cluster with one master node and 5 
compute nodes.

When I run test jobs, they get stuck in the CG (completing) state.

Can someone give me a hint on where I might have gone wrong?

[root@kla-ac-ohpc-01 critical]# srun -c 8 --pty bash -i
srun: error: slurm_receive_msgs: Socket timed out on send/recv operation
srun: error: Task launch for 37.0 failed on node c-node3: Socket timed out on send/recv operation
srun: error: Application launch failed: Socket timed out on send/recv operation

srun: Job step aborted: Waiting up to 32 seconds for job step to finish.

[root@kla-ac-ohpc-01 critical]# squeue
  JOBID PARTITION  NAME  USER ST   TIME  NODES NODELIST(REASON)
     36    normal  bash  test CG   0:53      2 c-node[1-2]
     37    normal  bash  root CG   0:52      1 c-node3

Thank you.

Regards,
Rodgers




[slurm-users] Jobs stuck in "completing" (CG) state

2020-10-24 Thread Kimera Rodgers
I'm setting up Slurm on an OpenHPC cluster with one master node and 5 compute
nodes.
When I run test jobs, they get stuck in the CG (completing) state.

Can someone give me a hint on where I might have gone wrong?

[root@kla-ac-ohpc-01 critical]# srun -c 8 --pty bash -i
srun: error: slurm_receive_msgs: Socket timed out on send/recv operation
srun: error: Task launch for 37.0 failed on node c-node3: Socket timed out on send/recv operation
srun: error: Application launch failed: Socket timed out on send/recv operation
srun: Job step aborted: Waiting up to 32 seconds for job step to finish.

[root@kla-ac-ohpc-01 critical]# squeue
  JOBID PARTITION  NAME  USER ST   TIME  NODES NODELIST(REASON)
     36    normal  bash  test CG   0:53      2 c-node[1-2]
     37    normal  bash  root CG   0:52      1 c-node3

Thank you.

Regards,
Rodgers


Re: [slurm-users] pam_slurm_adopt always claims no active jobs even when they do

2020-10-24 Thread Juergen Salk
Hi Paul,

Maybe this is totally unrelated, but we also have a similar issue with
pam_slurm_adopt when ConstrainRAMSpace=no is set in cgroup.conf and
more than one job is running on that node. There is a bug report open
at:

  https://bugs.schedmd.com/show_bug.cgi?id=9355
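
That is, the trigger on our side is just this line in cgroup.conf (the
rest of the file is site-specific and omitted):

  ConstrainRAMSpace=no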

As a workaround, we currently advise users not to use ssh but to
attach an interactive shell to an already allocated job by running the
following command:

  srun --jobid <jobid> --pty /bin/bash

For a single-node job, the user does not even need to know which node
the job is running on. For a multi-node job, the user can still use
the '-w <nodename>' option to specify a particular node.
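
For example, assuming the job shows up as job ID 12345 in squeue (a
made-up ID for illustration), the whole workflow is:

  squeue -u $USER                         # look up the job ID
  srun --jobid 12345 --pty /bin/bash      # attach a shell inside that job

  # for a multi-node job, pick the node explicitly:
  srun --jobid 12345 -w <nodename> --pty /bin/bash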

Best regards
Jürgen

-- 
Jürgen Salk
Scientific Software & Compute Services (SSCS)
Kommunikations- und Informationszentrum (kiz)
Universität Ulm
Phone: +49 (0)731 50-22478
Fax: +49 (0)731 50-22471



* Paul Raines  [201023 13:13]:
> 
> I am running Slurm 20.02.3 on CentOS 7 systems.  I have pam_slurm_adopt
> set up in /etc/pam.d/system-auth, and slurm.conf has PrologFlags=Contain,X11.
> I also have masked systemd-logind.
> 
> But pam_slurm_adopt always denies login with "Access denied by
> pam_slurm_adopt: you have no active jobs on this node" even when the
> user most definitely has a job running on the node via srun.
> 
> Any clues as to why pam_slurm_adopt thinks there is no job?
> 
> serena [raines] squeue
>   JOBID PARTITION  NAME    USER ST       TIME  NODES NODELIST(REASON)
>     785    lcnrtx  tcsh  raines  R   19:44:51      1 rtx-03
> serena [raines] ssh rtx-03
> Access denied by pam_slurm_adopt: you have no active jobs on this node
> Authentication failed.
> 
> 

-- 
GPG A997BA7A | 87FC DA31 5F00 C885 0DC3  E28F BD0D 4B33 A997 BA7A