[slurm-users] Re: scrun: Failed to run the container due to GID mapping configuration
Hi,

On 04.04.24 04:46, Toshiki Sonoda (Fujitsu) via slurm-users wrote:
> We set up scrun (slurm 23.11.5) integrated with rootless podman,

I'd recommend looking into nvidia enroot instead.
https://slurm.schedmd.com/SLUG19/NVIDIA_Containers.pdf

MfG
--
Markus Kötter, +49 681 870832434
30159 Hannover, Lange Laube 6
Helmholtz Center for Information Security

--
slurm-users mailing list -- slurm-users@lists.schedmd.com
To unsubscribe send an email to slurm-users-le...@lists.schedmd.com
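For context on the original error: rootless podman's ID mapping relies on subordinate UID/GID ranges existing for the submitting user, and a missing subgid entry is a common cause of GID mapping failures. A minimal sketch (the username and ranges are assumptions, not from the original thread):

```
# /etc/subuid
alice:100000:65536
# /etc/subgid
alice:100000:65536
```

podman applies these ranges via newuidmap/newgidmap; `podman unshare cat /proc/self/gid_map` shows whether the mapping took effect.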
Re: [slurm-users] Multifactor fair-share with single account
Hi,

On 03.01.24 23:47, Kamil Wilczek wrote:
> But what if my organisation structure is flat and I have only one
> account where all my users reside.

As do I.

> Is the fair-share algorithm working in this situation -- does it take
> into account users (associations) from this single account, and try to
> assign a fair-share factor to each user?

Yes.

> And what if I have, say, 3 accounts, but I do not want to calculate
> fair-share between accounts, but between all associations from all 3
> accounts? In other words, is there a fair-share factor for
> users/associations instead of accounts?

FairShare=parent
https://slurm.schedmd.com/sacctmgr.html

I do not use this.
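With FairShare=parent on the intermediate accounts, their users effectively compete in one flat pool. A hedged sketch of the sacctmgr calls (the account names are assumptions, and this needs a live slurmdbd, so treat it as illustration only):

```
# Rank all users against each other, ignoring the account boundaries
sacctmgr modify account where name=acct1,acct2,acct3 set fairshare=parent
# Inspect the resulting fair-share tree
sshare -a -l
```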
Re: [slurm-users] Releasing stale allocated TRES
Hi,

On 23.11.23 10:56, Schneider, Gerald wrote:
> I have a recurring problem with allocated TRES which are not released
> after all jobs on that node have finished. The TRES are still marked as
> allocated and no new jobs can be scheduled on that node using those
> TRES.

Remove the node from slurm.conf and restart slurmctld, then re-add it and restart again. Remove it from the Partition definitions as well.
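Before the remove/re-add cycle it can help to confirm what slurmctld believes is allocated; a hedged sketch (the node name is an assumption):

```
# Show the TRES slurmctld thinks are in use on the node
scontrol show node gpu01 | grep -i tres
# After removing the node from slurm.conf (and the Partition lines),
# restart the controller; repeat after re-adding it
systemctl restart slurmctld
```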
Re: [slurm-users] Guarantee minimum amount of GPU resources to a Slurm account
Hi,

Currently, reservations do not work for GRES:
https://bugs.schedmd.com/show_bug.cgi?id=5771

23.11 might change this.
Re: [slurm-users] Keep CPU Jobs Off GPU Nodes
Hello,

On 29.03.23 10:08, René Sitt wrote:
> While the cited procedure works great in general, it gets more
> complicated for heterogeneous setups, i.e. if you have several GPU
> types defined in gres.conf, since the 'tres_per_*' fields can then take
> the form of either 'gres:gpu:N' or 'gres:gpu:<type>:N' - depending on
> whether the job script specifies a GPU type or not.

Using lua match:

> for g in job_desc.gres:gmatch("[^,]*") do
>     count = g:match("gres:gpu:%w+:(%d+)$") or g:match("gres:gpu:(%d+)$")
>     if count then
>         -- handle the GPU count here
>     end
> end
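The same two formats can be exercised outside of Slurm; a small shell illustration of the pattern above (pure text processing, the function name is made up):

```shell
# Extract the trailing GPU count from either "gres:gpu:N" or
# "gres:gpu:<type>:N" - mirroring the two lua patterns above.
parse_gpu_count() {
    printf '%s\n' "$1" | sed -n 's/^gres:gpu:\([^:]*:\)\{0,1\}\([0-9][0-9]*\)$/\2/p'
}

parse_gpu_count "gres:gpu:2"      # -> 2
parse_gpu_count "gres:gpu:a100:4" # -> 4
```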
Re: [slurm-users] slurm and singularity
Hi,

On 08.02.23 05:00, Carl Ponder wrote:
> Take a look at this extension to SLURM:
> https://github.com/NVIDIA/pyxis

https://slurm.schedmd.com/SLUG19/NVIDIA_Containers.pdf

enroot & pyxis - a great recommendation for rootless containerized runtime environments in HPC. Free software, no license or DGX required.

Some things to consider:

Cache in /tmp so it is freed upon reboot:

# /etc/enroot/enroot.conf
ENROOT_RUNTIME_PATH /tmp/enroot/user-$(id -u)
ENROOT_CACHE_PATH /tmp/enroot-cache/user-$(id -u)
ENROOT_DATA_PATH /tmp/enroot-data/user-$(id -u)

When using a local container repo, the image URL's port is separated from the path using '#':

srun … --container-image mygitlab:5005#path/pytorch:22.12-py3 …
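Since '#' also starts a comment in batch scripts, quoting the image URL matters; a hedged example batch script (everything except the image URL from above is an assumption):

```
#!/bin/bash
#SBATCH --job-name=pyxis-demo
srun --container-image='mygitlab:5005#path/pytorch:22.12-py3' nvidia-smi
```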
Re: [slurm-users] Enforce gpu usage limits (with GRES?)
Hi,

Limits aren't easy:
https://support.ceci-hpc.be/doc/_contents/SubmittingJobs/SlurmLimits.html#precedence

I think there are multiple options, starting with not having GPU resources in the CPU partition.

Or creating a QOS with

MaxTRES=gres/gpu:A100=0,gres/gpu:K80=0,gres/gpu=0

and attaching it to the CPU partition.

The configuration will require some values as well:

# slurm.conf
AccountingStorageEnforce=associations,limits,qos,safe
AccountingStorageTRES=gres/gpu,gres/gpu:A100,gres/gpu:K80

# cgroup.conf
ConstrainDevices=yes

and most likely some others I missed.
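The QOS route spelled out as a sketch (the QOS name, partition name, and node list are assumptions; the sacctmgr call needs a live slurmdbd):

```
# Create a QOS that zeroes out all GPU TRES
sacctmgr add qos nogpu set MaxTRES=gres/gpu:A100=0,gres/gpu:K80=0,gres/gpu=0

# slurm.conf - attach it to the CPU-only partition
PartitionName=cpu Nodes=cpu[01-10] QOS=nogpu
```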
Re: [slurm-users] GPU utilization of running jobs
Hi,

I've shared my take on this here:
https://forums.developer.nvidia.com/t/job-statistics-with-nvidia-data-center-gpu-manager-and-slurm/148768

It lacks graphing.
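For reference, DCGM's per-job statistics workflow makes suitable prolog/epilog material; a sketch assuming the DCGM host engine (nv-hostengine) is running and that GPU group 0 covers the node's GPUs:

```
# prolog
dcgmi stats -g 0 -e                  # enable stats collection (once)
dcgmi stats -g 0 -s "$SLURM_JOB_ID"  # start recording for this job

# epilog
dcgmi stats -x "$SLURM_JOB_ID"       # stop recording
dcgmi stats -j "$SLURM_JOB_ID" -v    # print the job report
```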
Re: [slurm-users] Deb packages for Ubuntu
Hi,

I maintain Debian packages for

* slurm
* auks
* pyxis

on

* Debian bullseye
* Ubuntu bionic / 18.04 (DGXOS 4.x)
* Ubuntu focal / 20.04 (DGXOS 5.x)

I (think I) started off using the Scibian packages and moved to gbp, using GitLab & the CI. Every "version" has its own branch; for SLURM, the CI builds the packages. auks & pyxis are left to the local docker container, as they require the SLURM libs to compile/link.

This allows tracking upstream and maintaining patches for auks.

Should have chosen GitHub in the first place.

Interested?
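The branch-per-version workflow with git-buildpackage, sketched (the repo URL and branch name are made up):

```
gbp clone https://example.org/pkg/slurm.git   # hypothetical packaging repo
git checkout 23.11                            # one branch per SLURM version
gbp buildpackage --git-debian-branch=23.11
```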
Re: [slurm-users] container on slurm cluster
Hi,

On 18.05.22 08:25, Stephan Roth wrote:
> Personal note: I'm not sure what I'd choose as a successor to
> Singularity 3.8, yet. Thoughts are welcome.

I can recommend nvidia enroot/pyxis. enroot does unprivileged sandboxes/containers, pyxis is the slurm SPANK glue.

https://slurm.schedmd.com/SLUG19/NVIDIA_Containers.pdf
https://github.com/NVIDIA/enroot
https://github.com/NVIDIA/pyxis

Notes from operation:

I recommend using NVMe for the container storage; the default configuration uses tmpfs.

…/enroot.conf
…
ENROOT_RUNTIME_PATH /tmp/enroot/user-$(id -u)
ENROOT_DATA_PATH /tmp/enroot-data/user-$(id -u)

tmpfiles.d/…
d /tmp/enroot/ 0777 root root - -
d /tmp/enroot-data/ 0777 root root - -
d /tmp/pyxis-runtime/ 0777 root root - -

dest: /etc/slurm/plugstack.conf.d/pyxis.conf
content: |
  required /usr/lib/x86_64-linux-gnu/slurm/spank_pyxis.so runtime_path=/tmp/pyxis-runtime/

Time savers:

The container URL formatting is … unexpected: '#' is used as the separator for the path in host:port#path/file - and may need to be escaped to avoid being interpreted as a comment.

It uses a netrc-formatted .credentials file for container registries with authentication. Insert the credentials twice - with and without port.

It can do more than documented (e.g. #SBATCH --container-image).
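The double credentials entry as a fragment (the host and token are assumptions; the file location follows the enroot defaults, which may differ locally):

```
# ~/.config/enroot/.credentials (netrc format)
machine mygitlab login gitlab-ci-token password <token>
machine mygitlab:5005 login gitlab-ci-token password <token>
```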