Re: [slurm-users] Can sinfo/scontrol be called from job_submit.lua?

2022-10-12 Thread Groner, Rob
Well, there are numerous ways to do it, but I was trying to do it as much as 
possible from within the slurm infrastructure.

Basically, I want to react when someone submits a job requesting specific 
features that aren't actively available yet, and some of the actions I need to 
take will involve slurm commands.  This seems a bit like the cloud scheduling 
interface, but it's not a cloud service I'm talking about...it's our own 
hardware.

Otherwise, I would think that gathering information to make a decision while in 
the job_submit.lua would be a normal expectation.  Is there really no way to 
know how many nodes are up or what features are on the system while I'm 
processing in the job submit?  sacctmgr seems to work fine in there.

Rob


From: slurm-users  on behalf of Thomas 
M. Payerle 
Sent: Tuesday, October 11, 2022 5:31 PM
To: Slurm User Community List 
Subject: Re: [slurm-users] Can sinfo/scontrol be called from job_submit.lua?

Running scontrol/sinfo from within a job_submit.lua script seems to be opening 
a big can of worms; it might be doable, but it would scare me.  Since it 
sounds like you are only doing this for a fairly limited amount of information, 
which presumably does not change frequently, perhaps it would be better to have 
a cron job periodically write the desired information to a file, and have 
job_submit.lua read the information from that file?
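
Something along these lines in job_submit.lua could then pick that file up at
submit time (an untested sketch; the path and the one-feature-per-line format
are just assumptions):

    -- Hypothetical: a cron job writes the currently available node features,
    -- one per line, to /etc/slurm/available_features.
    local FEATURE_FILE = "/etc/slurm/available_features"

    local function feature_is_available(feature)
       local f = io.open(FEATURE_FILE, "r")
       if f == nil then
          return false
       end
       for line in f:lines() do
          if line == feature then
             f:close()
             return true
          end
       end
       f:close()
       return false
    end

    function slurm_job_submit(job_desc, part_list, submit_uid)
       -- Note: job_desc.features can be a boolean expression (a&b, a|b);
       -- this sketch only handles a single plain feature name.
       if job_desc.features ~= nil and
          not feature_is_available(job_desc.features) then
          slurm.log_info("uid %d requested unavailable feature(s): %s",
                         submit_uid, job_desc.features)
          -- react here (kick off provisioning, warn the user, etc.)
       end
       return slurm.SUCCESS
    end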

On Tue, Oct 11, 2022 at 5:17 PM Groner, Rob <rug...@psu.edu> wrote:
I am testing a method where, when a job gets submitted asking for specific 
features, then, if those features don't exist, I'll do something.

The job_submit.lua plugin has worked to determine when a job is submitted 
asking for the specific features.  I'm at the point of checking if those 
features exist already (the features are part of a nodeset and part of a 
partition, so jobs submitted asking for those features will just go to 
pending if no nodes exist that offer those features).  I thought to use "sinfo" 
to get a list of existing features on the system...but it fails to run.  The 
same for trying to use scontrol.

When I submit a job that requests the features, and so the sinfo command runs, 
it all hangs for about 10 seconds and then says:

[me@testsch (RC) slurm] sbatch ./gctest_account_test.sh
sbatch: error: Batch job submission failed: Socket timed out on send/recv 
operation

In the slurmctld.log, I see:
[2022-10-10T17:12:13.933] error: slurm_msg_sendto: address:port=10.6.88.99:40100 msg_type=4004: Unexpected missing socket error


I'll note that "sinfo -V" works...but I suspect it's because it's not trying to 
communicate outside of itself with the slurmctld.

Any suggestions on what to try?  Or is there a better slurm-ic way to do what 
I'm trying to do?

Rob




--
Tom Payerle
DIT-ACIGS/Mid-Atlantic Crossroads      paye...@umd.edu
5825 University Research Park   (301) 405-6135
University of Maryland
College Park, MD 20740-3831


Re: [slurm-users] Can sinfo/scontrol be called from job_submit.lua?

2022-10-12 Thread Ole Holm Nielsen

Hi Rob,
On 10/12/22 15:40, Groner, Rob wrote:
Otherwise, I would think that gathering information to make a decision 
while in the job_submit.lua would be a normal expectation.  Is there 
really no way to know how many nodes are up or what features are on the 
system while I'm processing in the job submit? sacctmgr seems to work fine 
in there.


My 2 cents:

The slurmctld calls the Lua function slurm_job_submit(job_desc, part_list, 
submit_uid) which you provide in /etc/slurm/job_submit.lua.  This means 
that the Lua script only has access to the job's data "job_desc", the 
partition list, and the userid.  That's it.
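
For illustration, the whole interface the script provides is roughly this
(untested sketch from memory):

    function slurm_job_submit(job_desc, part_list, submit_uid)
       -- Only job_desc, part_list and submit_uid are in scope here;
       -- there is no handle for querying node state.
       return slurm.SUCCESS
    end

    function slurm_job_modify(job_desc, job_rec, part_list, modify_uid)
       return slurm.SUCCESS
    end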


It seems that slurm_job_submit was not designed to provide the kind of 
information that you are asking for.


/Ole



Re: [slurm-users] Check consistency

2022-10-12 Thread Davide DelVento
Thanks. I don't see anything wrong from that log.


On Fri, Oct 7, 2022 at 7:32 AM Paul Edmon  wrote:
>
> The slurmctld log will print out if hosts are out of sync with the
> slurmctld's slurm.conf.  That said, it doesn't report on cgroup consistency
> changes like that.  It's possible that dialing up the verbosity on the
> slurmd logs may give that info, but I haven't seen it in normal operation.
>
> -Paul Edmon-
>
> On 10/6/22 5:47 PM, Davide DelVento wrote:
> > Is there a simple way to check that what slurm is running is what the
> > config says it should be?
> >
> > For example, my understanding is that changing cgroup.conf should be
> > followed by 'systemctl stop slurmd' on all compute nodes, then
> > 'systemctl restart slurmctld' on the head node, then 'systemctl start
> > slurmd' on the compute nodes.
> >
> > Assuming this is correct, is there a way to query the nodes and ask if
> > they are indeed running what the config is saying (or alternatively
> > have them dump their config files somewhere for me to manually run a
> > diff on)?
> >
> > Thanks,
> > Davide
> >
>



[slurm-users] Trying to troubleshoot slurmctld start failure

2022-10-12 Thread Sopena Ballesteros Manuel
Dear Slurm user community,


I am new to slurm and am trying to start a slurmd and slurmctld on the same 
machine. I started with slurmctld, which is having issues.


$ slurmctld -D -f /etc/slurm/slurm.conf -vvv
slurmctld: debug:  slurmctld log levels: stderr=debug2 logfile=debug2 
syslog=quiet
slurmctld: debug:  Log file re-opened
slurmctld: pidfile not locked, assuming no running daemon
slurmctld: debug:  slurmscriptd: Got ack from slurmctld, initialization 
successful
slurmctld: debug:  slurmctld: slurmscriptd fork()'d and initialized.
slurmctld: debug:  _slurmscriptd_mainloop: started
slurmctld: debug:  _slurmctld_listener_thread: started listening to slurmscriptd
slurmctld: slurmctld version 22.05.4 started on cluster cluster-nomad
slurmctld: cred/munge: init: Munge credential signature plugin loaded
slurmctld: debug:  auth/munge: init: Munge authentication plugin loaded
slurmctld: select/cons_res: common_init: select/cons_res loaded
slurmctld: select/cons_tres: common_init: select/cons_tres loaded
slurmctld: select/cray_aries: init: Cray/Aries node selection plugin loaded
slurmctld: preempt/none: init: preempt/none loaded
slurmctld: debug:  acct_gather_energy/none: init: AcctGatherEnergy NONE plugin 
loaded
slurmctld: debug:  acct_gather_profile/none: init: AcctGatherProfile NONE 
plugin loaded
slurmctld: debug:  acct_gather_interconnect/none: init: AcctGatherInterconnect 
NONE plugin loaded
slurmctld: debug:  acct_gather_filesystem/none: init: AcctGatherFilesystem NONE 
plugin loaded
slurmctld: debug2: No acct_gather.conf file (/etc/slurm/acct_gather.conf)
slurmctld: debug:  jobacct_gather/none: init: Job accounting gather NOT_INVOKED 
plugin loaded
slurmctld: ext_sensors/none: init: ExtSensors NONE plugin loaded
slurmctld: debug:  MPI: Loading all types
slurmctld: error:  mpi/pmix_v3: init: (null) [0]: mpi_pmix.c:195: pmi/pmix: can 
not load PMIx library
slurmctld: error: Couldn't load specified plugin name for mpi/pmix_v3: Plugin 
init() callback failed
slurmctld: error: MPI: Cannot create context for mpi/pmix_v3
slurmctld: debug2: No mpi.conf file (/etc/slurm/mpi.conf)
slurmctld: accounting_storage/none: init: Accounting storage NOT INVOKED plugin 
loaded
slurmctld: debug:  create_mmap_buf: Failed to mmap file 
`/var/spool/slurmctld/assoc_usage`, No such device
slurmctld: debug2: No Assoc usage file (/var/spool/slurmctld/assoc_usage) to 
recover
slurmctld: debug:  switch Cray/Aries plugin loaded.
slurmctld: debug:  switch/none: init: switch NONE plugin loaded
slurmctld: debug:  Reading slurm.conf file: /etc/slurm/slurm.conf
slurmctld: debug:  NodeNames=x1004c1s5b0n0 setting Sockets=10 based on 
CPUs(10)/(CoresPerSocket(1)/ThreadsPerCore(1))
slurmctld: No memory enforcing mechanism configured.
slurmctld: topology/none: init: topology NONE plugin loaded
slurmctld: debug:  No DownNodes
slurmctld: debug:  slurmctld log levels: stderr=debug2 logfile=debug2 
syslog=quiet
slurmctld: debug:  Log file re-opened
slurmctld: sched: Backfill scheduler plugin loaded
slurmctld: route/default: init: route default plugin loaded
slurmctld: debug:  _slurmscriptd_mainloop: finished
Segmentation fault

Could someone please help me understand what the issue is?


thank you


Re: [slurm-users] Trying to troubleshoot slurmctld start failure

2022-10-12 Thread Kevin Buckley

On 2022/10/13 03:42, Sopena Ballesteros Manuel wrote:

Dear Slurm user community,


I am new to slurm and am trying to start a slurmd and slurmctld on the same 
machine. I started with slurmctld, which is having issues.


slurmctld: ext_sensors/none: init: ExtSensors NONE plugin loaded
slurmctld: debug:  MPI: Loading all types
slurmctld: error:  mpi/pmix_v3: init: (null) [0]: mpi_pmix.c:195: pmi/pmix: can 
not load PMIx library
slurmctld: error: Couldn't load specified plugin name for mpi/pmix_v3: Plugin 
init() callback failed
slurmctld: error: MPI: Cannot create context for mpi/pmix_v3
slurmctld: debug2: No mpi.conf file (/etc/slurm/mpi.conf)


We don't use PMIx here, but this bit in the mpi.conf manpage

  PMIxEnv=
  Comma separated list of environment variables to be set in job
  environments to be used by PMIx. Defaults to not being set.

suggests that you could set LD_LIBRARY_PATH or a similar environment
variable to expose your local PMIx library to Slurm jobs, so maybe the
daemons need something similar.

Maybe try running the daemon startup with a lookup path set?
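
E.g., when starting the daemon by hand as in your example (untested; the PMIx
prefix below is just a guess, point it at wherever libpmix.so actually lives
on your system):

    export LD_LIBRARY_PATH=/opt/pmix/lib:$LD_LIBRARY_PATH
    slurmctld -D -f /etc/slurm/slurm.conf -vvv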

Kevin Buckley
--
Supercomputing Systems Administrator
Pawsey Supercomputing Centre



[slurm-users] GPU-node not waking up after power-save

2022-10-12 Thread Loris Bennett
Hi,

We use Slurm's power saving mechanism to switch off idle nodes.  However,
we don't currently use it for our GPU nodes.  This is because in the
past these nodes failed to wake up again when jobs were submitted to the
GPU partition.  Before we look into the issue again, prompted by the current
energy situation, I was wondering whether this is a problem others have (had).

So does power-saving work in general for GPU nodes and, if so, are there
any extra steps one needs to take in order to set things up properly?

Cheers,

Loris

-- 
Dr. Loris Bennett (Herr/Mr)
ZEDAT, Freie Universität Berlin Email loris.benn...@fu-berlin.de