[slurm-users] Re: scrun: Failed to run the container due to GID mapping configuration

2024-04-04 Thread Markus Kötter via slurm-users

Hi,


On 04.04.24 04:46, Toshiki Sonoda (Fujitsu) via slurm-users wrote:
We set up scrun (slurm 23.11.5) integrated with rootless podman, 



I'd recommend looking into NVIDIA enroot instead.

https://slurm.schedmd.com/SLUG19/NVIDIA_Containers.pdf 





MfG
--
Markus Kötter, +49 681 870832434
30159 Hannover, Lange Laube 6
Helmholtz Center for Information Security





Re: [slurm-users] Multifactor fair-share with single account

2024-01-04 Thread Markus Kötter

Hi,


On 03.01.24 23:47, Kamil Wilczek wrote:

But what if my organisation structure is flat and I have only one
account where all my users reside.


As do I.


Is the fair-share algorithm working in this situation -- does it take
into account users (associations) from this single account and try to
assign a fair-share factor to each user?


Yes.
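
You can inspect the resulting per-user factors with sshare, e.g. (the
account name is a placeholder):

sshare --all --accounts=myaccount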


And what if I have, say, 3 accounts, but I do not want to calculate
fair-share between accounts, but between all associations from all
3 accounts? In other words, is there a fair-share factor for
users/associations instead of accounts?



FairShare=parent
https://slurm.schedmd.com/sacctmgr.html

Setting an account's FairShare to "parent" makes its users effectively
compete at the parent's level, which flattens fair-share across accounts.
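
A minimal sketch with sacctmgr (account names are placeholders):

sacctmgr modify account where name=acct1 set fairshare=parent
# repeat for acct2, acct3, …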

I do not use this.


MfG
--
Markus Kötter, +49 681 870832434
30159 Hannover, Lange Laube 6
Helmholtz Center for Information Security




Re: [slurm-users] Releasing stale allocated TRES

2023-11-23 Thread Markus Kötter

Hi,

On 23.11.23 10:56, Schneider, Gerald wrote:

I have a recurring problem with allocated TRES, which are not
released after all jobs on that node have finished. The TRES are still
marked as allocated, and no new jobs can be scheduled on that node
using those TRES.


Remove the node from slurm.conf (including the Partition definitions)
and restart slurmctld, then re-add it and restart again.
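
Roughly (a sketch; the slurm.conf edits are done by hand, and the
restarts assume slurmctld runs under systemd):

# 1) drop the node from its NodeName= line and from the partitions' Nodes= lists in slurm.conf
systemctl restart slurmctld
# 2) restore the original slurm.conf entries
systemctl restart slurmctld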


MfG
--
Markus Kötter, +49 681 870832434
30159 Hannover, Lange Laube 6
Helmholtz Center for Information Security




Re: [slurm-users] Guarantee minimum amount of GPU resources to a Slurm account

2023-09-13 Thread Markus Kötter

Hi,


Currently, reservations do not work for GRES.

https://bugs.schedmd.com/show_bug.cgi?id=5771

23.11 might change this.


MfG
--
Markus Kötter, +49 681 870832434
30159 Hannover, Lange Laube 6
Helmholtz Center for Information Security




Re: [slurm-users] Keep CPU Jobs Off GPU Nodes

2023-03-29 Thread Markus Kötter

Hello,

On 29.03.23 10:08, René Sitt wrote:
While the cited procedure works great in general, it gets more
complicated for heterogeneous setups, i.e. if you have several GPU types
defined in gres.conf, since the 'tres_per_*' fields can then take the
form of either 'gres:gpu:N' or 'gres:gpu:<type>:N' - depending on
whether the job script specifies a GPU type or not.


Using Lua match (completing the quoted fragment so it runs in
job_submit.lua):

for g in (job_desc.gres or ""):gmatch("[^,]+") do
   -- matches "gres:gpu:<type>:N" as well as "gres:gpu:N"
   local count = g:match("gres:gpu:%w+:(%d+)$") or g:match("gres:gpu:(%d+)$")
   if count then
      -- the job requests GPUs
   end
end


MfG
--
Markus Kötter, +49 681 870832434
30159 Hannover, Lange Laube 6
Helmholtz Center for Information Security




Re: [slurm-users] slurm and singularity

2023-02-07 Thread Markus Kötter

Hi,


On 08.02.23 05:00, Carl Ponder wrote:

Take a look at this extension to SLURM:

https://github.com/NVIDIA/pyxis


https://slurm.schedmd.com/SLUG19/NVIDIA_Containers.pdf

enroot & pyxis - great recommendation for rootless containerized runtime 
environments in HPC.


Free software, no license or DGX required.


Some things to consider:

Cache in /tmp so it's freed upon reboot:
# /etc/enroot/enroot.conf
ENROOT_RUNTIME_PATH /tmp/enroot/user-$(id -u)
ENROOT_CACHE_PATH /tmp/enroot-cache/user-$(id -u)
ENROOT_DATA_PATH /tmp/enroot-data/user-$(id -u)


When using a local container registry, the port in the image URL is
separated with '#':

srun … --container-image mygitlab:5005#path/pytorch:22.12-py3 …



MfG
--
Markus Kötter, +49 681 870832434
30159 Hannover, Lange Laube 6
Helmholtz Center for Information Security




Re: [slurm-users] Enforce gpu usage limits (with GRES?)

2023-02-03 Thread Markus Kötter

Hi,


limits ain't easy.


https://support.ceci-hpc.be/doc/_contents/SubmittingJobs/SlurmLimits.html#precedence



I think there are multiple options, starting with not having GPU
resources in the CPU partition.

Or creating a QOS with
MaxTRES=gres/gpu:A100=0,gres/gpu:K80=0,gres/gpu=0
and attaching it to the CPU partition.
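
A minimal sketch (the QOS and partition names are placeholders):

sacctmgr add qos cpuonly
sacctmgr modify qos cpuonly set MaxTRES=gres/gpu:A100=0,gres/gpu:K80=0,gres/gpu=0

# slurm.conf
PartitionName=cpu Nodes=… QOS=cpuonly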

And the configuration will require some values as well:

# slurm.conf
AccountingStorageEnforce=associations,limits,qos,safe
AccountingStorageTRES=gres/gpu,gres/gpu:A100,gres/gpu:K80

# cgroup.conf
ConstrainDevices=yes

and most likely some others I've missed.


MfG
--
Markus Kötter, +49 681 870832434
30159 Hannover, Lange Laube 6
Helmholtz Center for Information Security




Re: [slurm-users] GPU utilization of running jobs

2022-10-20 Thread Markus Kötter

Hi,


I've shared my take on this here:
https://forums.developer.nvidia.com/t/job-statistics-with-nvidia-data-center-gpu-manager-and-slurm/148768
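
The short version is a prolog/epilog pair around DCGM's job statistics;
a rough sketch (GPU group 0 and these dcgmi stats flags are my
assumptions here - check dcgmi stats --help):

# prolog: enable stats watches and start per-job recording
dcgmi stats -g 0 -e
dcgmi stats -g 0 -s "$SLURM_JOB_ID"
# epilog: stop recording and print the job report
dcgmi stats -x "$SLURM_JOB_ID"
dcgmi stats -j "$SLURM_JOB_ID" -v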


It lacks graphing.

MfG
--
Markus Kötter, +49 681 870832434
30159 Hannover, Lange Laube 6
Helmholtz Center for Information Security




Re: [slurm-users] Deb packages for Ubuntu

2022-07-21 Thread Markus Kötter

Hi,


I maintain debian packages for

  * slurm
  * auks
  * pyxis

on

  * debian bullseye
  * ubuntu bionic / 18.04 (DGXOS 4.x)
  * ubuntu focal / 20.04 (DGXOS 5.x)

I (think I) started off using the scibian packages and moved to gbp
(git-buildpackage), using GitLab and its CI.

Every "version" has its own branch; for Slurm, the CI builds the packages.
auks & pyxis are left to a local Docker container, as they require the
Slurm libs to compile/link.

This allows tracking upstream and maintaining patches for auks.
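
For the build itself, something like this per branch (a sketch; the
branch name is an example):

gbp buildpackage --git-debian-branch=debian/bullseye --git-ignore-new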

Should have chosen github in the first place.
Interested?


MfG
--
Markus Kötter, +49 681 870832434
30159 Hannover, Lange Laube 6
Helmholtz Center for Information Security




Re: [slurm-users] container on slurm cluster

2022-05-18 Thread Markus Kötter

Hi,

On 18.05.22 08:25, Stephan Roth wrote:

Personal note: I'm not sure what I'd choose as a successor to 
Singularity 3.8, yet. Thoughts are welcome.


I can recommend NVIDIA enroot/pyxis.
enroot does unprivileged sandboxes/containers; pyxis is the Slurm SPANK
glue.


https://slurm.schedmd.com/SLUG19/NVIDIA_Containers.pdf

https://github.com/NVIDIA/enroot
https://github.com/NVIDIA/pyxis


Notes from operation:
I recommend using NVMe for the container storage; the default
configuration uses tmpfs.



…/enroot.conf:

…
ENROOT_RUNTIME_PATH /tmp/enroot/user-$(id -u)
ENROOT_DATA_PATH /tmp/enroot-data/user-$(id -u)


tmpfiles.d/…:

d /tmp/enroot/ 0777 root root - -
d /tmp/enroot-data/ 0777 root root - -
d /tmp/pyxis-runtime/ 0777 root root - -


/etc/slurm/plugstack.conf.d/pyxis.conf:

required /usr/lib/x86_64-linux-gnu/slurm/spank_pyxis.so runtime_path=/tmp/pyxis-runtime/

Time savers:
The container URL formatting is … unexpected: '#' is used as the
separator before the path in host:port#path/file - and may need to be
escaped or quoted to avoid being interpreted as a comment.
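
For example (registry host and image as in the note above; quoting
guards the '#'):

srun --container-image 'mygitlab:5005#path/pytorch:22.12-py3' …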


It uses a netrc-formatted .credentials file for container registries
with authentication.

Insert the credentials twice - with and without the port.
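
Roughly like this (host and credentials are placeholders; the path
assumes enroot's default config directory):

# ~/.config/enroot/.credentials
machine mygitlab login myuser password mytoken
machine mygitlab:5005 login myuser password mytoken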

It can do more than documented (e.g. #SBATCH --container-image).



MfG
--
Markus Kötter, +49 681 870832434
30159 Hannover, Lange Laube 6
Helmholtz Center for Information Security

