Make sure you don’t have a firewall blocking connections back to the login node 
from the cluster. We had that problem at Rutgers before.
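
One way to rule that out (a sketch, assuming firewalld on the login node; the port range below is arbitrary) is to pin the ports that salloc/srun listen on with SrunPortRange in slurm.conf, then open just that range on the login node so compute nodes can connect back:

    # slurm.conf (same on all nodes) -- restrict the srun/salloc listen ports
    SrunPortRange=60001-63000

    # on the login node: allow the compute nodes to connect back
    firewall-cmd --permanent --add-port=60001-63000/tcp
    firewall-cmd --reload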

Sent from my iPhone

On May 19, 2023, at 13:13, Prentice Bisbal <pbis...@pppl.gov> wrote:



Brian,

Thanks for the reply. I was hoping that would be the fix, but that doesn't 
seem to be the case. I'm using 22.05.8, which isn't that old. I double-checked 
the archived documentation for version 22.05.8, and setting

LaunchParameters=use_interactive_step



should be valid here. From 
https://slurm.schedmd.com/archive/slurm-22.05.8/slurm.conf.html:

use_interactive_step
Have salloc use the Interactive Step to launch a shell on an allocated compute 
node rather than locally to wherever salloc was invoked. This is accomplished 
by launching the srun command with InteractiveStepOptions as options.

This does not affect salloc called with a command as an argument. These jobs 
will continue to be executed as the calling user on the calling host.

and

InteractiveStepOptions
When LaunchParameters=use_interactive_step is enabled, launching salloc will 
automatically start an srun process with InteractiveStepOptions to launch a 
terminal on a node in the job allocation. The default value is "--interactive 
--preserve-env --pty $SHELL". The "--interactive" option is intentionally not 
documented in the srun man page. It is meant only to be used in 
InteractiveStepOptions in order to create an "interactive step" that will not 
consume resources so that other steps may run in parallel with the interactive 
step.

According to that, setting LaunchParameters=use_interactive_step should be 
enough, since "--interactive --preserve-env --pty $SHELL" is the default.

A colleague pointed out that my slurm.conf was setting LaunchParameters to 
"user_interactive_step" when it should be "use_interactive_step", but correcting 
that didn't fix my problem; it just changed the symptom. Now when I try to start 
an interactive shell, it hangs and eventually returns an error:

[pbisbal@ranger ~]$ salloc -n 1 -t 00:10:00 --mem=1G
salloc: Granted job allocation 29
salloc: Waiting for resource configuration
salloc: Nodes ranger-s22-07 are ready for job
srun: error: timeout waiting for task launch, started 0 of 1 tasks
srun: launch/slurm: launch_p_step_launch: StepId=29.interactive aborted before 
step completely launched.
srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
srun: error: Timed out waiting for job step to complete
salloc: Relinquishing job allocation 29
[pbisbal@ranger ~]$
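
In case the interactive step's srun is failing to connect back to salloc on the 
login node, a quick connectivity check (a sketch; the listen port changes on 
every invocation and shows up in salloc -vvvvv output, e.g. 43881 in my earlier 
run) would be, while salloc is still waiting:

    # run on the compute node, against the login node and the reported port
    nc -zv ranger 43881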




On 5/19/23 11:28 AM, Brian Andrus wrote:

Defaulting to a shell for salloc is a newer feature.

For your version, you should:

    srun -n 1 -t 00:10:00 --mem=1G --pty bash

Brian Andrus

On 5/19/2023 8:24 AM, Ryan Novosielski wrote:
I’m not at a computer, and we run an older version of Slurm, so I can’t say 
with 100% confidence how this has changed and I can’t be too specific, but 
I know that this is the behavior you should expect from that command. I believe 
that there are configuration options to make it behave differently.

Otherwise, you can use srun to run commands on the assigned node.

I think if you search this list for “interactive,” or search the Slurm bugs 
database, you will see some other conversations about this.

Sent from my iPhone

On May 19, 2023, at 10:35, Prentice Bisbal <pbis...@pppl.gov> wrote:



I'm setting up Slurm from scratch for the first time ever. I'm using 22.05.8 
since I haven't had a chance to upgrade our DB server to 23.02 yet. When I try to 
use salloc to get a shell on a compute node (ranger-s22-07), I end up with a 
shell on the login node (ranger):

[pbisbal@ranger ~]$ salloc -n 1 -t 00:10:00  --mem=1G
salloc: Granted job allocation 23
salloc: Waiting for resource configuration
salloc: Nodes ranger-s22-07 are ready for job
[pbisbal@ranger ~]$



Any ideas what's going wrong here? I have the following line in my slurm.conf:

LaunchParameters=user_interactive_step
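
(A quick sanity check, just a sketch, is to confirm what the running daemons 
actually loaded, in case slurm.conf on disk and the live config have drifted:

    scontrol show config | grep -i launchparameters
)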


When I run salloc with -vvvvv, here's what I see:

[pbisbal@ranger ~]$ salloc -vvvvv -n 1 -t 00:10:00  --mem=1G
salloc: defined options
salloc: -------------------- --------------------
salloc: mem                 : 1G
salloc: ntasks              : 1
salloc: time                : 00:10:00
salloc: verbose             : 5
salloc: -------------------- --------------------
salloc: end of defined options
salloc: debug3: Trying to load plugin /usr/lib64/slurm/select_cons_res.so
salloc: debug3: plugin_load_from_file->_verify_syms: found Slurm plugin 
name:Consumable Resources (CR) Node Selection plugin type:select/cons_res 
version:0x160508
salloc: select/cons_res: common_init: select/cons_res loaded
salloc: debug3: Success.
salloc: debug3: Trying to load plugin /usr/lib64/slurm/select_cons_tres.so
salloc: debug3: plugin_load_from_file->_verify_syms: found Slurm plugin 
name:Trackable RESources (TRES) Selection plugin type:select/cons_tres 
version:0x160508
salloc: select/cons_tres: common_init: select/cons_tres loaded
salloc: debug3: Success.
salloc: debug3: Trying to load plugin /usr/lib64/slurm/select_cray_aries.so
salloc: debug3: plugin_load_from_file->_verify_syms: found Slurm plugin 
name:Cray/Aries node selection plugin type:select/cray_aries version:0x160508
salloc: select/cray_aries: init: Cray/Aries node selection plugin loaded
salloc: debug3: Success.
salloc: debug3: Trying to load plugin /usr/lib64/slurm/select_linear.so
salloc: debug3: plugin_load_from_file->_verify_syms: found Slurm plugin 
name:Linear node selection plugin type:select/linear version:0x160508
salloc: select/linear: init: Linear node selection plugin loaded with argument 
20
salloc: debug3: Success.
salloc: debug:  Entering slurm_allocation_msg_thr_create()
salloc: debug:  port from net_stream_listen is 43881
salloc: debug:  Entering _msg_thr_internal
salloc: debug4: eio: handling events for 1 objects
salloc: debug3: eio_message_socket_readable: shutdown 0 fd 6
salloc: debug3: Trying to load plugin /usr/lib64/slurm/auth_munge.so
salloc: debug3: plugin_load_from_file->_verify_syms: found Slurm plugin 
name:Munge authentication plugin type:auth/munge version:0x160508
salloc: debug:  auth/munge: init: Munge authentication plugin loaded
salloc: debug3: Success.
salloc: debug3: Trying to load plugin /usr/lib64/slurm/hash_k12.so
salloc: debug3: plugin_load_from_file->_verify_syms: found Slurm plugin 
name:KangarooTwelve hash plugin type:hash/k12 version:0x160508
salloc: debug:  hash/k12: init: init: KangarooTwelve hash plugin loaded
salloc: debug3: Success.
salloc: Granted job allocation 24
salloc: Waiting for resource configuration
salloc: Nodes ranger-s22-07 are ready for job
salloc: debug:  laying out the 1 tasks on 1 hosts ranger-s22-07 dist 8192
[pbisbal@ranger ~]$

This is all I see in /var/log/slurm/slurmd.log on the compute node:

[2023-05-19T10:21:36.898] [24.extern] task/cgroup: _memcg_initialize: job: 
alloc=1024MB mem.limit=1024MB memsw.limit=unlimited
[2023-05-19T10:21:36.899] [24.extern] task/cgroup: _memcg_initialize: step: 
alloc=1024MB mem.limit=1024MB memsw.limit=unlimited



And this is all I see in /var/log/slurm/slurmctld.log on the controller:


[2023-05-19T10:18:16.815] sched: _slurm_rpc_allocate_resources JobId=23 
NodeList=ranger-s22-07 usec=1136
[2023-05-19T10:18:22.423] Time limit exhausted for JobId=22
[2023-05-19T10:21:36.861] sched: _slurm_rpc_allocate_resources JobId=24 
NodeList=ranger-s22-07 usec=1039

Here's my slurm.conf file:


# grep -v ^# /etc/slurm/slurm.conf  | grep -v ^$

ClusterName=ranger
SlurmctldHost=ranger-master
EnforcePartLimits=ALL
JobSubmitPlugins=lua,require_timelimit
LaunchParameters=user_interactive_step
MaxStepCount=2500
MaxTasksPerNode=32
MpiDefault=none
ProctrackType=proctrack/cgroup
PrologFlags=contain
ReturnToService=0
SlurmctldPidFile=/var/run/slurmctld.pid
SlurmctldPort=6817
SlurmdPidFile=/var/run/slurmd.pid
SlurmdPort=6818
SlurmdSpoolDir=/var/spool/slurmd
SlurmUser=slurm
StateSaveLocation=/var/spool/slurmctld
SwitchType=switch/none
TaskPlugin=task/affinity,task/cgroup
TopologyPlugin=topology/tree
CompleteWait=32
InactiveLimit=0
KillWait=30
MinJobAge=300
SlurmctldTimeout=120
SlurmdTimeout=300
Waittime=0
DefMemPerCPU=5000
SchedulerType=sched/backfill
SelectType=select/cons_tres
SelectTypeParameters=CR_Core_Memory
PriorityType=priority/multifactor
PriorityDecayHalfLife=15-0
PriorityCalcPeriod=15
PriorityFavorSmall=NO
PriorityMaxAge=180-0
PriorityWeightAge=5000
PriorityWeightFairshare=5000
PriorityWeightJobSize=5000
AccountingStorageEnforce=all
AccountingStorageHost=slurm.pppl.gov
AccountingStorageType=accounting_storage/slurmdbd
AccountingStoreFlags=job_script
JobCompType=jobcomp/none
JobAcctGatherFrequency=30
JobAcctGatherParams=UsePss
JobAcctGatherType=jobacct_gather/cgroup
SlurmctldDebug=info
SlurmctldLogFile=/var/log/slurm/slurmctld.log
SlurmdDebug=info
SlurmdLogFile=/var/log/slurm/slurmd.log
NodeName=ranger-s22-07 CPUs=72 Boards=1 SocketsPerBoard=4 CoresPerSocket=18 
ThreadsPerCore=1 RealMemory=384880 State=UNKNOWN
PartitionName=all Nodes=ALL Default=YES GraceTime=300 MaxTime=24:00:00 State=UP

--
Prentice
