Re: [slurm-users] slurm sinfo format memory

2023-07-20 Thread Feng Zhang
It looks like Slurm itself only supports that format (in MB). The output
format of Slurm commands is not very user friendly to me. It would be nice
if Slurm added some convenient shortcuts for cases like the sinfo output
discussed in this email thread, e.g. support for "lazy" options such as
sinfo -ABC.

For the desired GB format, one workaround is to write a wrapper shell
script that reads the sinfo output, converts the MB values to GB, and
prints the result to the screen.
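
For example, a minimal sketch of such a wrapper (this assumes the same
format string as in the question; divide by 1000 instead of 1024 if you
prefer decimal GB):

   #!/bin/bash
   # Wrap sinfo and convert the MEMORY and FREE_MEM columns from MB to GB.
   sinfo -o "%8n %8m %8e %C" | awk '
   NR == 1 { print; next }    # keep the header line as-is
   { printf "%-8s %-8s %-8s %s\n", $1, int($2/1024) "GB", int($3/1024) "GB", $4 }'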


On Thu, Jul 20, 2023 at 12:28 PM Arsene Marian Alain wrote:

>
> Dear slurm users,
>
> I would like to see the following information for my nodes: hostname,
> total memory, free memory and CPUs. So I used 'sinfo -o "%8n %8m %8e %C"',
> but the output shows the memory in MB, like "190560", and I need it in GB
> (without decimals if possible), like "190GB". Any ideas or suggestions on
> how I can do that?
>
> current output:
>
> HOSTNAME MEMORY   FREE_MEM CPUS(A/I/O/T)
> node01   190560   125249   60/4/0/64
> node02   190560   171944   40/24/0/64
> node05   93280    91584    0/40/0/40
> node06   513120   509448   0/96/0/96
> node07   513120   512086   0/96/0/96
> node08   513120   512328   0/96/0/96
> node09   513120   512304   0/96/0/96
>
> desired output:
>
> HOSTNAME MEMORY   FREE_MEM CPUS(A/I/O/T)
> node01   190GB    125GB    60/4/0/64
> node02   190GB    171GB    40/24/0/64
> node05   93GB     91GB     0/40/0/40
> node06   512GB    500GB    0/96/0/96
> node07   512GB    512GB    0/96/0/96
> node08   512GB    512GB    0/96/0/96
> node09   512GB    512GB    0/96/0/96
>
> I would appreciate any help.
>
> Thank you.
>
> Best Regards,
>
> Alain
>


[slurm-users] slurm sinfo format memory

2023-07-20 Thread Arsene Marian Alain

Dear slurm users,

I would like to see the following information for my nodes: hostname, total
memory, free memory and CPUs. So I used 'sinfo -o "%8n %8m %8e %C"', but the
output shows the memory in MB, like "190560", and I need it in GB (without
decimals if possible), like "190GB". Any ideas or suggestions on how I can
do that?

current output:

HOSTNAME MEMORY   FREE_MEM CPUS(A/I/O/T)
node01   190560   125249   60/4/0/64
node02   190560   171944   40/24/0/64
node05   93280    91584    0/40/0/40
node06   513120   509448   0/96/0/96
node07   513120   512086   0/96/0/96
node08   513120   512328   0/96/0/96
node09   513120   512304   0/96/0/96

desired output:

HOSTNAME MEMORY   FREE_MEM CPUS(A/I/O/T)
node01   190GB    125GB    60/4/0/64
node02   190GB    171GB    40/24/0/64
node05   93GB     91GB     0/40/0/40
node06   512GB    500GB    0/96/0/96
node07   512GB    512GB    0/96/0/96
node08   512GB    512GB    0/96/0/96
node09   512GB    512GB    0/96/0/96



I would appreciate any help.

Thank you.

Best Regards,

Alain


Re: [slurm-users] GRES and GPUs

2023-07-20 Thread Xaver Stiensmeier

Hey everyone,

I am answering my own question:
It wasn't working because I needed to reload slurmd on the machine, too.
So the full "test GPU management without a GPU" workflow is:

1. Start your Slurm cluster.
2. Add a GPU to an instance of your choice in slurm.conf.

   For example:

   DebugFlags=GRES   # consider this for initial setup
   SelectType=select/cons_tres
   GresTypes=gpu
   NodeName=master SocketsPerBoard=8 CoresPerSocket=1 RealMemory=8000 GRES=gpu:1 State=UNKNOWN

3. Register it in gres.conf and give it some file:

   NodeName=master Name=gpu File=/dev/tty0 Count=1   # Count seems to be optional

4. Reload slurmctld (on the master) and slurmd (on the GPU node):

   sudo systemctl restart slurmctld
   sudo systemctl restart slurmd

I haven't tested this solution thoroughly yet, but at least commands like

   sudo systemctl restart slurmd   # master

run without any issues afterwards.
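
To sanity-check the setup afterwards, something like the following should
work (the node name "master" and the gres string just match the example
above; adjust them for your cluster):

   # Check that the controller now reports the GRES on the node
   scontrol show node master | grep -i gres

   # Or list the configured GRES per node
   sinfo -N -o "%N %G"

   # Try to actually allocate the (fake) GPU
   srun -w master --gres=gpu:1 hostname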

Thank you for all your help!

Best regards,
Xaver

On 19.07.23 17:05, Xaver Stiensmeier wrote:


Hi Hermann,

Count doesn't make a difference, but I noticed that when I reconfigure
Slurm and do reloads afterwards, the error "gpu count lower than
configured" no longer appears. So maybe a reconfigure is simply needed
after reloading slurmctld - or maybe the error is just not shown anymore
because the node is still invalid? However, I still get the error:

    error: _slurm_rpc_node_registration node=NName: Invalid argument

If I understand correctly, this is telling me that there's something
wrong with my slurm.conf. I know that all pre-existing parameters are
correct, so I assume it must be the gpus entry, but I don't see where
it's wrong:

NodeName=NName SocketsPerBoard=8 CoresPerSocket=1 RealMemory=8000
Gres=gpu:1 State=CLOUD # bibiserv

Thanks for all the help,
Xaver

On 19.07.23 15:04, Hermann Schwärzler wrote:

Hi Xaver,

I think you are missing the "Count=..." part in gres.conf

It should read

NodeName=NName Name=gpu File=/dev/tty0 Count=1

in your case.

Regards,
Hermann

On 7/19/23 14:19, Xaver Stiensmeier wrote:

Okay,

thanks to S. Zhang I was able to figure out why nothing changed.
While I did restart slurmctld at the beginning of my tests, I didn't
do so later because I felt it was unnecessary, but it is right there
in the fourth line of the log that this is needed. Somehow I misread
it and thought slurmctld was restarted automatically.

Given the setup:

slurm.conf
...
GresTypes=gpu
NodeName=NName SocketsPerBoard=8 CoresPerSocket=1 RealMemory=8000
GRES=gpu:1 State=UNKNOWN
...

gres.conf
NodeName=NName Name=gpu File=/dev/tty0

When restarting, I get the following error:

error: Setting node NName state to INVAL with reason:gres/gpu count
reported lower than configured (0 < 1)

So it is still not working, but at least I get a more helpful log
message. Because I know that this /dev/tty trick works, I am still
unsure where the current error lies, but I will try to investigate
it further. I am thankful for any ideas in that regard.

Best regards,
Xaver

On 19.07.23 10:23, Xaver Stiensmeier wrote:


Alright,

I tried a few more things, but I still wasn't able to get past:
srun: error: Unable to allocate resources: Invalid generic resource
(gres) specification.

I should mention that the node I am trying to test GPUs on doesn't
really have a GPU, but Rob was kind enough to find out that you do
not need one as long as you just point the gres.conf entry at some
file in /dev/. As mentioned: this is just for testing purposes - in
the end we will run this on a node with a GPU, but it is not
available at the moment.

*The error isn't changing*

If I omit "GresTypes=gpu" and "Gres=gpu:1", I still get the same
error.

*Debug Info*

I added the gpu debug flag and logged the following:

[2023-07-18T14:59:45.026] restoring original state of nodes
[2023-07-18T14:59:45.026] select/cons_tres: part_data_create_array:
select/cons_tres: preparing for 2 partitions
[2023-07-18T14:59:45.026] error: GresPlugins changed from (null) to
gpu ignored
[2023-07-18T14:59:45.026] error: Restart the slurmctld daemon to
change GresPlugins
[2023-07-18T14:59:45.026] read_slurm_conf: backup_controller not
specified
[2023-07-18T14:59:45.026] error: GresPlugins changed from (null) to
gpu ignored
[2023-07-18T14:59:45.026] error: Restart the slurmctld daemon to
change GresPlugins
[2023-07-18T14:59:45.026] select/cons_tres: select_p_reconfigure:
select/cons_tres: reconfigure
[2023-07-18T14:59:45.027] select/cons_tres: part_data_create_array:
select/cons_tres: preparing for 2 partitions
[2023-07-18T14:59:45.027] No parameter for mcs plugin, default
values set
[2023-07-18T14:59:45.027] mcs: MCSParameters = (null). ondemand set.
[2023-07-18T14:59:45.028] _slurm_rpc_reconfigure_controller:
completed usec=5898
[2023-07-18T14:59:45.952]
SchedulerParameters=default_queue_depth=100,max_rpc_cnt=0,max_sched_time=2,partition_job_depth=0,sched_max_job_start=0,sched_min_interval=2

I am a bit unsure wh

Re: [slurm-users] Notify users about job submit plugin actions

2023-07-20 Thread Ole Holm Nielsen
Hi Lorenzo,

On 7/20/23 12:16, Lorenzo Bosio wrote:
> One more thing I'd like to point out, is that I need to monitor jobs going 
> from pending to running state (after waiting in the jobs queue). I 
> currently have a separate pthread to achieve this, but I think at this 
> point the job_submit()/job_modify() function has already exited.
> I do get the output of the slurm_kill_job() function when called, but 
> that's not useful for the user and I couldn't find a way to append custom 
> messages.

Maybe it would be useful to have e-mail notifications sent to your users
when the job changes state?

According to the sbatch man page, users can specify themselves which mail
alerts they would like:

--mail-type=
   Notify user by email when certain event types occur.
   Valid type values are NONE, BEGIN, END, FAIL, REQUEUE,
   ALL (equivalent to BEGIN, END, FAIL, INVALID_DEPEND, REQUEUE,
   and STAGE_OUT), INVALID_DEPEND  (dependency  never  satisfied),
   STAGE_OUT  (burst  buffer  stage  out  and  teardown  completed),
   TIME_LIMIT,  TIME_LIMIT_90  (reached  90 percent of time limit),
   TIME_LIMIT_80 (reached 80 percent of time limit),
   TIME_LIMIT_50 (reached 50 percent of time limit) and
   ARRAY_TASKS (send emails for each array task).
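
For example, to get a mail when the job starts, ends or fails (the address
and script name below are just placeholders):

   sbatch --mail-type=BEGIN,END,FAIL --mail-user=user@example.com job.sh

or, inside the batch script:

   #SBATCH --mail-type=BEGIN,END,FAIL
   #SBATCH --mail-user=user@example.com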

/Ole


Re: [slurm-users] MIG-Slice: Unavailable GRES

2023-07-20 Thread Vogt, Timon

Hi Rob,

thank you very much for that hint. I tried setting the MIG slices 
manually in the gres.conf and it works now.


Thank you very much.
Best regards,
Timon

--
Timon Vogt
Arbeitsgruppe "Computing"
Nationales Hochleistungsrechnen (NHR)
Scientific Employee NHR
Tel.: +49 551 39-30146, E-Mail: timon.v...@gwdg.de
-
Gesellschaft für wissenschaftliche Datenverarbeitung mbH Göttingen (GWDG)
Burckhardtweg 4, 37077 Göttingen, URL: https://gwdg.de

Support: Tel.: +49 551 39-3, URL: https://gwdg.de/support
Sekretariat: Tel.: +49 551 39-30001, E-Mail: g...@gwdg.de

Geschäftsführer: Prof. Dr. Ramin Yahyapour
Aufsichtsratsvorsitzender: Prof. Dr. Christian Griesinger
Sitz der Gesellschaft: Göttingen
Registergericht: Göttingen, Handelsregister-Nr. B 598

Zertifiziert nach ISO 9001 und ISO 27001
-

On 19.07.23 at 21:21, Groner, Rob wrote:
At some point when we were experimenting with MIG, I was being
entirely frustrated in getting it to work until I finally removed the
autodetect from gres.conf and explicitly listed the devices instead.
THEN it worked.  I think you can find the list of device files using
nvidia-smi.


Here is the entry we use in our gres.conf for one of the nodes:

NodeName=p-gc-3037 Name=gpu Type=1g.5gb File=/dev/nvidia-caps/nvidia-cap[66,75,84,102,111,120,129,201,210,219,228,237,246,255]
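
A rough way to find those device files (this is the usual layout with the
NVIDIA driver; exact paths and tools may differ on your system):

   # List the GPUs and MIG devices the driver sees
   nvidia-smi -L

   # The driver exposes the MIG capability device minor numbers here
   cat /proc/driver/nvidia-caps/mig-minors

   # The corresponding device nodes referenced in gres.conf
   ls -l /dev/nvidia-caps/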


Something to TRY anyway.  Odd that 3g.20gb works.  You might try 
reconfiguring the node for that instead and see if it works then.  
We've used 3g.20gb and 1g.5gb on our nodes and it works fine, never 
tried 2g.10gb.


Rob



From: slurm-users on behalf of Vogt, Timon
Sent: Wednesday, July 19, 2023 3:08 PM
To: slurm-us...@schedmd.com
Subject: [slurm-users] MIG-Slice: Unavailable GRES

Dear Slurm Mailing List,

I am experiencing a problem which affects our cluster and for which I am
completely out of ideas by now, so I would like to ask the community for
hints or ideas.

We run a partition on our cluster containing multiple nodes with Nvidia
A100 GPUs (40GB), which we have sliced up using Nvidia Multi-Instance
GPUs (MIG) into one 3g.20gb slice and two 2g.10gb slices per GPU.

Now, when submitting a job to it and requesting the 3g.20gb slice (like
with "srun -p mig-partition -G 3g.20gb:1 hostname"), the job runs fine,
but when a job requests one of the 2g.10gb slices instead (like with
"srun -p mig-partition -G 2g.10gb:1 hostname"), the job does not get
scheduled and the controller repeatedly outputs the error:

slurmctld[28945]: error: _set_job_bits1: job 4780824 failed to find any
available GRES on node 1471
slurmctld[28945]: error: gres_select_filter_select_and_set job 4780824
failed to satisfy gres-per-job counter

Our cluster uses the AutoDetect=nvml feature for the nodes in the
gres.conf and both slice types are defined in "AccountingStorageTRES"
and in the GRES parameter of the node definition. The slurmd on the node
also finds both types of slices and reports the correct amounts. They
are also visible in the "Gres=" section of "scontrol show node", again
in correct amounts.

I have also ensured that the nodes are not used otherwise by creating a
reservation on them accessible only to me, and I have restarted all
slurmd's and the slurmctld.

By now, I am out of ideas. Does someone here have a suggestion on what
else I can try? Has someone already seen this error and knows more 
about it?


Thank you very much in advance and
best regards,
Timon

--
Timon Vogt
Arbeitsgruppe "Computing"
Nationales Hochleistungsrechnen (NHR)
Scientific Employee NHR
Tel.: +49 551 39-30146, E-Mail: timon.v...@gwdg.de
-
Gesellschaft für wissenschaftliche Datenverarbeitung mbH Göttingen (GWDG)
Burckhardtweg 4, 37077 Göttingen, URL: https://gwdg.de

Support: Tel.: +49 551 39-3, URL: https://gwdg.de/support
Sekretariat: Tel.: +49 551 39-30001, E-Mail: g...@gwdg.de

Geschäftsführer: Prof. Dr. Ramin Yahyapour
Aufsichtsratsvorsitzender: Prof. Dr. Christian Griesinger
Sitz der Gesellschaft: Göttingen
Registergericht: Göttingen, Handelsregister-Nr. B 598

Zertifiziert nach ISO 9001 und ISO 27001
-







Re: [slurm-users] Notify users about job submit plugin actions

2023-07-20 Thread Lorenzo Bosio

Hello everyone,

thanks for all the answers.

To elaborate further: I'm developing in C, but that's not a problem,
since I can do the equivalent of the Lua approach, as Jeffrey T Frey said.
One more thing I'd like to point out is that I need to monitor jobs going
from the pending to the running state (after waiting in the job queue). I
currently have a separate pthread to achieve this, but I think at that
point the job_submit()/job_modify() function has already exited.
I do get the output of the slurm_kill_job() function when it is called,
but that's not useful for the user, and I couldn't find a way to append
custom messages.


Again, thanks to everyone who helped.
Regards,
Lorenzo

On 19/07/23 16:00, Jeffrey T Frey wrote:

In case you're developing the plugin in C and not Lua: behind the scenes, the
Lua mechanism concatenates all log_user() strings into a single variable
(user_msg).  When the Lua code completes, the C code sets the *err_msg argument
of the job_submit()/job_modify() function to that string, then NULLs out
user_msg.  (There's a mutex around all of that code, so slurmctld never executes
Lua job submit/modify scripts concurrently.)  The slurmctld then communicates
the returned string back to sbatch/salloc/srun for display to the user.

Your C plugin would do likewise: set *err_msg before returning from
job_submit()/job_modify(). It needn't be mutex'ed if the code is reentrant.

On Jul 19, 2023, at 08:37, Angel de Vicente  wrote:

Hello Lorenzo,

Lorenzo Bosio  writes:


I'm developing a job submit plugin to check whether some conditions are met
before a job runs.
I need a way to notify the user about the plugin's actions (i.e. why their
job was killed and what to do), but after a lot of research I have only
found how to write to the logs, not to the user's shell.
The user gets the output of slurm_kill_job, but I can't find a way to add a
custom note.

Can anyone point me to the right api/function in the code?

In our "job_submit.lua" script we have the following for that purpose:

   slurm.log_user("%s: WARNING: [...]", log_prefix)

--
Ángel de Vicente
Research Software Engineer (Supercomputing and BigData)
Tel.: +34 922-605-747
Web.:http://research.iac.es/proyecto/polmag/

GPG: 0x8BDC390B69033F52



--
Dott. Mag. Lorenzo Bosio
Tecnico di Ricerca
Dipartimento di Informatica


Università degli Studi di Torino
Corso Svizzera, 185 - 10149 Torino
tel. +39 011 670 6836