[slurm-users] Re: How to exclude master from computing? Set to DRAINED?

2024-06-24 Thread Stephan Roth via slurm-users
Dear Xaver, Could you clarify the function of what you call "master"? If it's the Slurm controller, i.e. running slurmctld: Why do you need slurmd running on it as well? Best, Stephan On 24.06.24 13:54, Xaver Stiensmeier via slurm-users wrote: Dear Slurm users, in our project we exclude

Re: [slurm-users] Guarantee minimum amount of GPU resources to a Slurm account

2023-09-13 Thread Stephan Roth
Markus, thanks for the heads-up. I intend to either reserve specific nodes with GPUs or use features. Best, Stephan On 13.09.23 09:08, Markus Kötter wrote: Hi, currently reservations do not work for gres. https://bugs.schedmd.com/show_bug.cgi?id=5771 23.11 might change this. MfG

Re: [slurm-users] Guarantee minimum amount of GPU resources to a Slurm account

2023-09-13 Thread Stephan Roth
Thanks Chris, this completes what I was looking for. Should have had a better look at the scontrol man page. Best, Stephan On 13.09.23 02:24, Chris Samuel wrote: On 12/9/23 9:22 am, Stephan Roth wrote: Thanks Noam, this looks promising! I would suggest that as well as the "magnetic"

Re: [slurm-users] Guarantee minimum amount of GPU resources to a Slurm account

2023-09-12 Thread Stephan Roth
that they specify the name of the reservation. The reservation will only "attract" jobs that meet the access control requirements. (from https://slurm.schedmd.com/reservations.html) On Sep 12, 2023, at 10:14 AM, Stephan Roth

[slurm-users] Guarantee minimum amount of GPU resources to a Slurm account

2023-09-12 Thread Stephan Roth
Dear Slurm users, I'm looking to fulfill the requirement of guaranteeing availability of GPU resources to a Slurm account, while allowing this account to use other available GPU resources as well. The guaranteed GPU resources should be of at least 1 type, optionally up to 3 types, as in:

Re: [slurm-users] Rolling upgrade of compute nodes

2022-05-30 Thread Stephan Roth
mileage may vary depending on job types! Question: Does anyone have bad experiences with upgrading slurmd while the cluster is running production? /Ole -- ETH Zurich Stephan Roth Systems Administrator IT Support Group (ISG) D-ITET ETF D 104 Sternwartstrasse 7 8092 Zurich Phone +41 44 632 30 59

Re: [slurm-users] Rolling upgrade of compute nodes

2022-05-30 Thread Stephan Roth
Hi Byron, If you have the means to set up a test environment to try the upgrade first, I recommend doing so. The upgrade from 19.05 to 20.11 worked for two clusters I maintain with a similar NFS based setup, except we keep the Slurm configuration separated from the Slurm software

Re: [slurm-users] container on slurm cluster

2022-05-18 Thread Stephan Roth
On 17.05.22 17:17, Timo Rothenpieler wrote: On 17.05.2022 15:58, Brian Andrus wrote: You are starting to understand a major issue with most containers. I suggest you check out Singularity, which was built from the ground up to address most issues. And it can run other container types (eg:

Re: [slurm-users] FW: gres/gpu count lower than reported

2022-05-03 Thread Stephan Roth
please contact the sender by return electronic mail and delete all copies of this communication -- ETH Zurich Stephan Roth Systems Administrator IT Support Group (ISG) D-ITET ETF D 104 Sternwartstrasse 7 8092 Zurich Phone +41 44 632 30 59 stephan.r...@ee.ethz.ch www.isg.ee.ethz.ch Work

Re: [slurm-users] MPICH

2022-04-28 Thread Stephan Roth
Hi Diego, I don't know about MPICH, but in case you haven't done this already, you might check on the Slurm side whether everything is ready: Did you make sure your Slurm was built with PMI support (as in `configure ... --with-pmix=/path/to/pmix`)? Do you see the MPI types with `srun --mpi=list`? Does a

Re: [slurm-users] How to tell SLURM to ignore specific GPUs

2022-02-02 Thread Stephan Roth
On 02.02.22 18:32, Michael Di Domenico wrote: On Mon, Jan 31, 2022 at 3:57 PM Stephan Roth wrote: The problem is to identify the cards physically from the information we have, like what's reported with nvidia-smi or available in /proc/driver/nvidia/gpus/*/information The serial number isn't

Re: [slurm-users] How to tell SLURM to ignore specific GPUs

2022-01-31 Thread Stephan Roth
Not a solution, but some ideas & experiences concerning the same topic: A few of our older GPUs used to show the error message "has fallen off the bus" which was only resolved by a full power cycle as well. Something changed, nowadays the error message is "GPU lost" and a normal reboot

Re: [slurm-users] Use gres to handle permissions of /dev/dri/card* and /dev/dri/renderD*?

2022-01-06 Thread Stephan Roth
devices /dev/dri/card* and /dev/dri/renderD* . Is there a way to give access to these devices along with /dev/nvidia* which we use for CUDA? Ideally as a single generic resource that would give permissions to all three files at once. Thank you for any tips. -

Re: [slurm-users] Specify a gpu ID

2021-06-04 Thread Stephan Roth
On 03.06.21 07:11, Ahmad Khalifa wrote: How to send a job to a particular gpu card using its ID (0,1,2...etc)? Why do you need to access a GPU based on its ID? If it's to select a certain GPU type, there are other methods you can use. You could create partitions for the same GPU types or add

Re: [slurm-users] AutoDetect=nvml throwing an error message

2021-04-16 Thread Stephan Roth
ctrack/cgroup >> >> ## Nodes list >> ## use native GPUs >> NodeName=nodeGPU01 SocketsPerBoard=8 CoresPerSocket=16 ThreadsPerCore=2 RealMemory=1024000 State=UNKNOWN Gres=gpu:8 Feature=ht,gpu >> >> ## Partitions list

Re: [slurm-users] changes in slurm.

2020-07-10 Thread Stephan Roth
job. I > am not deleting the partition here. > > Regards > Navin. > --- Stephan Roth | ISG.EE D-ITET ETH Zurich | http://www.isg.ee.ethz.ch +4144 632 30 59 | ETF D 104 | Sternwartstrasse 7 | 8092 Zurich ---

[slurm-users] Automatically cancel jobs not utilizing their GPUs

2020-07-02 Thread Stephan Roth
Hi all, Does anyone have ideas or suggestions on how to automatically cancel jobs which don't utilize the GPUs allocated to them? The Slurm version in use is 19.05. I'm thinking about collecting GPU utilization per process on all nodes with NVML/nvidia-smi, update a mean value of the
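The idea of keeping a mean utilization value per job can be sketched roughly as follows. This is only an illustration under stated assumptions, not the approach actually used in the thread: the class `GpuIdleTracker`, its parameters `threshold_pct` and `min_samples`, and the job IDs are hypothetical, and the collection of samples (for example from nvidia-smi or NVML) as well as the actual cancellation (for example with `scancel`) are left out.

```python
from collections import defaultdict


class GpuIdleTracker:
    """Hypothetical helper: running mean of GPU utilization per job."""

    def __init__(self, threshold_pct=5.0, min_samples=3):
        # Both values are assumptions; tune them for the site's policy.
        self.threshold_pct = threshold_pct
        self.min_samples = min_samples
        self._count = defaultdict(int)
        self._mean = defaultdict(float)

    def add_sample(self, job_id, utilization_pct):
        """Fold one utilization sample (0-100) into the job's running mean."""
        self._count[job_id] += 1
        n = self._count[job_id]
        # Incremental mean update; avoids storing every sample.
        self._mean[job_id] += (utilization_pct - self._mean[job_id]) / n

    def mean(self, job_id):
        """Return the current mean for a job, or None if it has no samples."""
        return self._mean[job_id] if job_id in self._count else None

    def idle_jobs(self):
        """Jobs with enough samples whose mean is below the threshold."""
        return sorted(
            job_id
            for job_id, n in self._count.items()
            if n >= self.min_samples and self._mean[job_id] < self.threshold_pct
        )


if __name__ == "__main__":
    tracker = GpuIdleTracker(threshold_pct=5.0, min_samples=3)
    for pct in (0, 0, 3):
        tracker.add_sample("101", pct)   # nearly idle job
    for pct in (90, 80, 70):
        tracker.add_sample("102", pct)   # busy job
    tracker.add_sample("103", 0)         # too few samples to judge
    print(tracker.idle_jobs())
```

Requiring a minimum number of samples before a job counts as idle is a design choice in this sketch, meant to avoid flagging jobs that are still starting up.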

Re: [slurm-users] How to view GPU indices of the completed jobs?

2020-06-26 Thread Stephan Roth
In regard to Kota's initial question ... "Is there any way (commands, configurations, etc...) to see the allocated GPU indices for completed jobs?" ... I was in need of the same kind of information and found the following: If - ConstrainDevices is on - SlurmdDebug is set to at least "debug"

[slurm-users] How to detect Job submission by srun / interactive jobs

2020-05-18 Thread Stephan Roth
Dear all, Does anybody know of a way to detect whether a job is submitted with srun, preferably in job_submit.lua? The goal is to allow interactive jobs only on specific partitions. Any recommendation or best practice on how to handle interactive jobs is welcome. Thank you, Stephan

Re: [slurm-users] How to use Autodetect=nvml in gres.conf

2020-02-07 Thread Stephan Roth
ing in it that has nv or cuda in its name. Are you sure that slurm distributes nvidia binaries? -Original Message- From: slurm-users On Behalf Of Stephan Roth Sent: Friday, February 7, 2020 2:23 AM To: slurm-users@lists.schedmd.com Subject: Re: [slurm-users] How to use Autodetect=nvml in gres.c

Re: [slurm-users] How to use Autodetect=nvml in gres.conf

2020-02-07 Thread Stephan Roth
ote: I just checked the .deb package that I build from source and there is nothing in it that has nv or cuda in its name. Are you sure that slurm distributes nvidia binaries? -Original Message- From: slurm-users On Behalf Of Stephan Roth Sent: Friday, February 7, 2020 2:23 AM To: sl

Re: [slurm-users] How to use Autodetect=nvml in gres.conf

2020-02-07 Thread Stephan Roth
On 05.02.20 21:06, Dean Schulze wrote: > I need to dynamically configure gpus on my nodes. The gres.conf doc > says to use > > Autodetect=nvml That's all you need in gres.conf provided you don't configure any Gres=... entries for your nodes in your slurm.conf. If you do, make sure the string