Re: [slurm-users] GPU Jobs with Slurm

2021-01-14 Thread Loris Bennett
Hi Abhiram, Abhiram Chintangal writes: > Hello, > > I recently set up a small cluster at work using Warewulf/Slurm. Currently, I > am not able to get the scheduler to > work well with GPUs (Gres). > > While Slurm is able to filter by GPU type, it allocates all the GPUs on the > node. See
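For context, a GPU request of the kind discussed in this thread would typically look something like the sketch below; the partition and GPU type names are placeholders, not taken from the message:

    #!/bin/bash
    #SBATCH --partition=gpu            # placeholder partition name
    #SBATCH --gres=gpu:titanrtx:1      # request one GPU of a given type; the type name is illustrative
    srun nvidia-smi                    # without device constraints this can still list every GPU on the node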

Re: [slurm-users] [EXT] GPU Jobs with Slurm

2021-01-14 Thread Sean Crosby
Hi Abhiram, You need to configure cgroup.conf to constrain the devices a job has access to. See https://slurm.schedmd.com/cgroup.conf.html My cgroup.conf is CgroupAutomount=yes AllowedDevicesFile="/usr/local/slurm/etc/cgroup_allowed_devices_file.conf" ConstrainCores=yes ConstrainRAMSpace=yes Co
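Pieced together, a cgroup.conf along these lines might look like the sketch below (the AllowedDevicesFile path is the one quoted above; the ConstrainDevices line is assumed from later replies in this thread):

    CgroupAutomount=yes
    AllowedDevicesFile="/usr/local/slurm/etc/cgroup_allowed_devices_file.conf"
    ConstrainCores=yes
    ConstrainRAMSpace=yes
    ConstrainDevices=yes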

Re: [slurm-users] [EXT] GPU Jobs with Slurm

2021-01-14 Thread Ole Holm Nielsen
Hi Sean, On 1/14/21 9:19 AM, Sean Crosby wrote: Hi Abhiram, You need to configure cgroup.conf to constrain the devices a job has access to. See https://slurm.schedmd.com/cgroup.conf.html My cgroup.conf is CgroupAutomount=yes AllowedDevicesFile="

Re: [slurm-users] [EXTERNAL] Possible to copy sacctmgr info from one cluster to another?

2021-01-14 Thread Ole Holm Nielsen
On 1/13/21 7:19 PM, Mando Rodriguez wrote: You can ‘dump’ the info in the Slurm database to a file and reload the file (here named cluster.cfg). Dump the info with: sacctmgr dump slurm_cluster file=cluster.cfg You can load the info with: sacctmgr load file=cluster.cfg It saves all accounts,
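Put together, the copy workflow would look roughly like this (a sketch; "slurm_cluster" is the cluster name used in the message, and the edit step is optional):

    # On the source cluster: dump accounts, users and associations to a file
    sacctmgr dump slurm_cluster file=cluster.cfg

    # Edit cluster.cfg if needed (e.g. the cluster name), then on the destination cluster:
    sacctmgr load file=cluster.cfg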

[slurm-users] How to get CPU tres report by group without getting multiple lines by users?

2021-01-14 Thread ichebo...@univ.haifa.ac.il
Hi, I am trying to collect information on our cluster's CPU-hours usage grouped by group, but I always get multiple lines per group. For example: sreport cluster UserUtilizationByAccount start=01/01/20 end=12/31/20 -t hours -T cpu format=Accounts,Used gives me output: GROUP-A 12492 GROUP
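One possible alternative (an assumption on my part, not taken from the truncated reply below) is the account-oriented report, where the roll-up line for each account carries the account total ahead of any per-user lines:

    # Assumed alternative: report utilization per account rather than per user
    sreport cluster AccountUtilizationByUser start=01/01/20 end=12/31/20 -t hours -T cpu format=Accounts,Used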

[slurm-users] problems using all cores (MPI) / cgroups / tasks problem

2021-01-14 Thread Tina Friedrich
Hello All, I've recently upgraded one of my testing systems to 20.11.2. I seem to have a problem - to me it looks as if it's with cgroups, tasks & task affinity/binding - that I can't figure out. What I'm seeing is this, in a nutshell: [arc-login single]$ srun -M arc -p short --exclusive --p

Re: [slurm-users] How to get CPU tres report by group without getting multiple lines by users?

2021-01-14 Thread Ole Holm Nielsen
On 1/14/21 4:20 PM, ichebo...@univ.haifa.ac.il wrote: I am trying to collect information on our cluster's CPU-hours usage grouped by group, but I always get multiple lines per group. For example: sreport cluster UserUtilizationByAccount start=01/01/20 end=12/31/20 -t hours -T cpu format=Accoun

Re: [slurm-users] GPU Jobs with Slurm

2021-01-14 Thread Abhiram Chintangal
Loris, You are correct! Instead of using nvidia-smi as a check, I confirmed the GPU allocation by printing out the environment variable, CUDA_VISIBLE_DEVICES, and it was as expected. Thanks for your help! On Thu, Jan 14, 2021 at 12:18 AM Loris Bennett wrote: > Hi Abhiram, > > Abhiram Chintang
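A check along those lines inside a job script might look like this minimal sketch (not taken from the thread):

    #!/bin/bash
    #SBATCH --gres=gpu:1
    # Should list only the GPU(s) allocated to this job
    echo "CUDA_VISIBLE_DEVICES=${CUDA_VISIBLE_DEVICES}"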

Re: [slurm-users] [EXT] GPU Jobs with Slurm

2021-01-14 Thread Abhiram Chintangal
Sean, Thanks for the clarification. I noticed that I am missing the "AllowedDevices" option in mine. After adding this, the GPU allocations started working. (Slurm version 18.08.8) I was also incorrectly using "nvidia-smi" as a check. Regards, Abhiram On Thu, Jan 14, 2021 at 12:22 AM Sean Crosb

Re: [slurm-users] [EXT] GPU Jobs with Slurm

2021-01-14 Thread Ryan Novosielski
AFAIK, if you have this set up correctly, nvidia-smi will be restricted too, though I think we were seeing a bug there at one time in this version.

Re: [slurm-users] [EXT] GPU Jobs with Slurm

2021-01-14 Thread Abhiram Chintangal
Ryan, That's good to know! It would be great to get this working as users are used to checking via nvidia-smi. For now, I have a few jobs ready for the coming weekend! Will check on this later. Thanks for your help! Abhiram On Thu, Jan 14, 2021 at 3:20 PM Ryan Novosielski wrote: > AFAIK, if

Re: [slurm-users] [EXT] GPU Jobs with Slurm

2021-01-14 Thread Fulcomer, Samuel
AllowedDevicesFile should not be necessary. The relevant devices are identified in gres.conf. "ConstrainDevices=yes" should be all that's needed. nvidia-smi will only see the allocated GPUs. Note that a single allocated GPU will always be shown by nvidia-smi to be GPU 0, regardless of its actual h
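As a concrete sketch of that setup (the device paths, GPU count and type are placeholders, not from the message): gres.conf names the devices and cgroup.conf enables the constraint:

    # gres.conf (placeholder entry for a node with four GPUs)
    Name=gpu Type=titanrtx File=/dev/nvidia[0-3]

    # cgroup.conf
    ConstrainDevices=yes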

Re: [slurm-users] [EXT] GPU Jobs with Slurm

2021-01-14 Thread Fulcomer, Samuel
Also note that there was a bug in an older version of SLURM (pre-17-something) that corrupted the database in a way that prevented GPU/gres fencing. If that affected you and you're still using the same database, GPU fencing probably isn't working. There's a way of fixing this manually through sql h

Re: [slurm-users] GPU Jobs with Slurm

2021-01-14 Thread Loris Bennett
Hi Abhiram, Glad to help, but it turns out I was wrong :-) We also didn't have ConstrainDevices=yes set, so nvidia-smi always showed all the GPUs. Thanks to Ryan and Samuel for putting me straight on that. Regards Loris Abhiram Chintangal writes: > Loris, > > You are correct! Instead of

Re: [slurm-users] Questions about sacctmgr load command

2021-01-14 Thread Diego Zuccato
On 12/01/21 08:13, Loris Bennett wrote: > You can dump the database, edit the dump, and then load the edited file. > On loading, sacctmgr will then identify differences between the current > database and the modified dump and ask you to confirm that you really > want to make those changes. II