If those sruns are wrapped in an salloc allocation, they work correctly. The
first srun can be eliminated by setting SallocDefaultCommand for salloc
(disabled in this example with --no-shell):
```
SallocDefaultCommand="srun -n1 -N1 --mem-per-cpu=0 --gres=gpu:0 --mpi=none --pty $SHELL"
```
[user@login005 ~]$ salloc -p GPU --gres=gpu:p100:1 --no-shell
salloc: Good day
salloc: Pending job allocation 7052366
salloc: job 7052366 queued and waiting for resources
salloc: job 7052366 has been allocated resources
salloc: Granted job allocation 7052366
[user@login005 ~]$ srun --jobid 7052366 --gres=gpu:0 --pty bash
[user@gpu045 ~]$ nvidia-smi
No devices were found
[user@gpu045 ~]$ srun nvidia-smi
Fri Dec 13 14:19:45 2019
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 418.87.00 Driver Version: 418.87.00 CUDA Version: 10.1 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 Tesla P100-PCIE... On | 00000000:87:00.0 Off | 0 |
| N/A 31C P0 26W / 250W | 0MiB / 16280MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+
[user@gpu045 ~]$ exit
exit
[user@login005 ~]$ scancel 7052366
[user@login005 ~]$
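The same pattern also works non-interactively: the job allocation requests the
GPU, and each step requests only what it needs. A sketch, not tested here; the
partition name GPU and the gres type p100 are taken from the transcript above
and will differ on other clusters:
```
#!/bin/bash
#SBATCH -p GPU
#SBATCH --gres=gpu:p100:1
#SBATCH -N1 -n4
#SBATCH --time=00:30:00

# This step requests no GPUs, so it does not wait on the allocated gres.
srun --gres=gpu:0 -l hostname

# This step uses the one GPU held by the allocation.
srun --gres=gpu:1 nvidia-smi
```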
On 12/13/19 11:48 AM, Kraus, Sebastian wrote:
Dear Valantis,
thanks for the explanation. But I have to correct you about the second
alternative approach:
```
srun -ppartition -N1 -n4 --gres=gpu:0 --time=00:30:00 --mem=1G -Jjobname --pty /bin/bash -il
srun --gres=gpu:1 -l hostname
```
As expected, this does not work: the "inner" srun job step throws an error
because the generic resource is not available/allocatable:
user@frontend02#-bash_4.2:~:[2]$ srun -pgpu -N1 -n4 --time=00:30:00 --mem=5G
--gres=gpu:0 -Jjobname --pty /bin/bash -il
user@gpu006#bash_4.2:~:[1]$ srun --gres=gpu:1 hostname
srun: error: Unable to create step for job 18044554: Invalid generic resource
(gres) specification
Test it yourself. ;-)
Best
Sebastian
Sebastian Kraus
Team IT am Institut für Chemie
Gebäude C, Straße des 17. Juni 115, Raum C7
Technische Universität Berlin
Fakultät II
Institut für Chemie
Sekretariat C3
Straße des 17. Juni 135
10623 Berlin
Tel.: +49 30 314 22263
Fax: +49 30 314 29309
Email: sebastian.kr...@tu-berlin.de
________________________________________
From: Chrysovalantis Paschoulas <c.paschou...@fz-juelich.de>
Sent: Friday, December 13, 2019 13:05
To: Kraus, Sebastian
Subject: Re: [slurm-users] srun: job steps and generic resources
Hi Sebastian,
the first srun (the shell step) consumes the GRES you requested, and the
second srun waits for it to become available again.
You have to do either
```
srun -ppartition -N1 -n4 --gres=gpu:1 --time=00:30:00 --mem=1G -Jjobname
--pty /bin/bash -il
srun --gres=gpu:0 -l hostname
```
or
```
srun -ppartition -N1 -n4 --gres=gpu:0 --time=00:30:00 --mem=1G -Jjobname
--pty /bin/bash -il
srun --gres=gpu:1 -l hostname
```
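Either way, the underlying behavior is that the interactive shell is itself a
job step and holds whatever GRES were granted to it, so a second step asking
for the same GRES has to wait. On a cluster this can be observed by listing
the job's steps while the inner srun is blocked (a sketch; <jobid> is a
placeholder for the actual job ID):
```
squeue -s -j <jobid>    # lists the running steps; the interactive bash shell appears as one of them
```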
Best Regards,
Valantis
On 13.12.19 12:44, Kraus, Sebastian wrote:
Dear all,
I am facing the following nasty problem.
I usually start interactive batch jobs via:
srun -ppartition -N1 -n4 --time=00:30:00 --mem=1G -Jjobname --pty /bin/bash -il
Then, explicitly starting a job step within such a session via:
srun -l hostname
works fine.
But, as soon as I add a generic resource to the job allocation as with:
srun -ppartition -N1 -n4 --gres=gpu:1 --time=00:30:00 --mem=1G -Jjobname --pty
/bin/bash -il
an explicit job step launched as above via:
srun -l hostname
stalls/blocks indefinitely.
I hope someone out there can explain this behavior to me.
Thanks and best
Sebastian
Sebastian Kraus
Team IT am Institut für Chemie
Technische Universität Berlin
Fakultät II
Institut für Chemie
Sekretariat C3
Straße des 17. Juni 135
10623 Berlin
Email: sebastian.kr...@tu-berlin.de