[slurm-users] CUDA environment variable not being set
Slurm 18.08, CentOS 7.7.1908

I have 2 M5000 GPUs in a compute node which are defined in the slurm.conf and gres.conf of the cluster, but if I launch a job requesting GPUs, the environment variable CUDA_VISIBLE_DEVICES is never set and I see the following message in the slurmd.log file:

debug: common_gres_set_env: unable to set env vars, no device files configured

Has anyone encountered this before?

Thank you,

SS
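For illustration, "a job requesting GPUs" means something along these lines (other srun options omitted); the empty result is the problem:

$ srun --gres=gpu:1 env | grep CUDA_VISIBLE_DEVICES
(no output - the variable is never set)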
Re: [slurm-users] CUDA environment variable not being set
That usually means you don't have the nvidia kernel module loaded, probably because there's no driver installed.

Relu

On 2020-10-08 14:57, Sajesh Singh wrote:
> debug: common_gres_set_env: unable to set env vars, no device files configured
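For illustration, two quick checks for the driver and its device files (standard commands on a CentOS 7 node):

$ lsmod | grep nvidia      # are the kernel modules loaded?
$ ls -l /dev/nvidia*       # have the device files been created?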
Re: [slurm-users] CUDA environment variable not being set
It seems as though the modules are loaded, as when I run lsmod I get the following:

nvidia_drm             43714  0
nvidia_modeset       1109636  1 nvidia_drm
nvidia_uvm            935322  0
nvidia              20390295  2 nvidia_modeset,nvidia_uvm

Also, the nvidia-smi command returns the following:

nvidia-smi
Thu Oct  8 16:31:57 2020
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.64.00    Driver Version: 440.64.00    CUDA Version: 10.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Quadro M5000        Off  | 00000000:02:00.0 Off |                  Off |
| 33%   21C    P0    45W / 150W |      0MiB /  8126MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Quadro M5000        Off  | 00000000:82:00.0 Off |                  Off |
| 30%   17C    P0    45W / 150W |      0MiB /  8126MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

-SS-

On Thursday, October 8, 2020, Relu Patrascu wrote:
> That usually means you don't have the nvidia kernel module loaded, probably because there's no driver installed.
Re: [slurm-users] CUDA environment variable not being set
From any node you can run scontrol from, what does 'scontrol show node GPUNODENAME | grep -i gres' return? Mine return lines for both "Gres=" and "CfgTRES=".

On Thursday, October 8, 2020, Sajesh Singh wrote:
> It seems as though the modules are loaded, as when I run lsmod I get the following: [...]
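For illustration, on a node where GPUs are fully configured the two lines look something like this (counts, type names, and TRES totals will differ per cluster):

   Gres=gpu:v100:4
   CfgTRES=cpu=40,mem=190000M,billing=40,gres/gpu=4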
Re: [slurm-users] CUDA environment variable not being set
I only get a line returned for "Gres=", but this is the same behavior on another cluster that has GPUs, and the variable gets set on that cluster.

-Sajesh-

--
Sajesh Singh
Manager, Systems and Scientific Computing
American Museum of Natural History
200 Central Park West
New York, NY 10024
(O) (212) 313-7263
(C) (917) 763-9038
(E) ssi...@amnh.org

On Thursday, October 8, 2020, Renfro, Michael wrote:
> From any node you can run scontrol from, what does 'scontrol show node GPUNODENAME | grep -i gres' return? [...]
Re: [slurm-users] CUDA environment variable not being set
Do you have your gres.conf on the nodes also?

Brian Andrus

On 10/8/2020 11:57 AM, Sajesh Singh wrote:
> debug: common_gres_set_env: unable to set env vars, no device files configured
Re: [slurm-users] CUDA environment variable not being set
Yes, it is located in the /etc/slurm directory.

-SS-

On Thursday, October 8, 2020, Brian Andrus wrote:
> Do you have your gres.conf on the nodes also?
Re: [slurm-users] CUDA environment variable not being set
Do you have a line like this in your cgroup_allowed_devices_file.conf?

/dev/nvidia*

Relu

On 2020-10-08 16:32, Sajesh Singh wrote:
> It seems as though the modules are loaded, as when I run lsmod I get the following: [...]
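For illustration, a minimal cgroup_allowed_devices_file.conf along the lines of the example in the Slurm cgroups documentation (the non-GPU entries are the usual defaults and will vary by system):

/dev/null
/dev/urandom
/dev/zero
/dev/sda*
/dev/cpu/*/*
/dev/pts/*
/dev/nvidia*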
Re: [slurm-users] CUDA environment variable not being set
Hi Sajesh,

On 10/8/20 11:57 am, Sajesh Singh wrote:
> debug: common_gres_set_env: unable to set env vars, no device files configured

I suspect the clue is here - what does your gres.conf look like? Does it list the devices in /dev for the GPUs?

All the best,
Chris

--
Chris Samuel : http://www.csamuel.org/ : Berkeley, CA, USA
Re: [slurm-users] CUDA environment variable not being set
Relu,

Thank you. Looks like the fix is indeed the missing file /etc/slurm/cgroup_allowed_devices_file.conf

-SS-

On Thursday, October 8, 2020, Christopher Samuel wrote:
> I suspect the clue is here - what does your gres.conf look like? Does it list the devices in /dev for the GPUs?
Re: [slurm-users] CUDA environment variable not being set
On 10/8/20 3:48 pm, Sajesh Singh wrote:
> Thank you. Looks like the fix is indeed the missing file /etc/slurm/cgroup_allowed_devices_file.conf

No, you don't want that, that will allow all access to GPUs whether people have requested them or not.

What you want is in gres.conf and looks like (hopefully not line wrapped!):

NodeName=nodes[01-18] Name=gpu Type=v100 File=/dev/nvidia0 Cores=0,2,4,6,8
NodeName=nodes[01-18] Name=gpu Type=v100 File=/dev/nvidia1 Cores=10,12,14,16,18
NodeName=nodes[01-18] Name=gpu Type=v100 File=/dev/nvidia2 Cores=20,22,24,26,28
NodeName=nodes[01-18] Name=gpu Type=v100 File=/dev/nvidia3 Cores=30,32,34,36,38

All the best,
Chris

--
Chris Samuel : http://www.csamuel.org/ : Berkeley, CA, USA
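For illustration, adapted to a node with two M5000s (the NodeName and Cores values here are placeholders - check the GPUs' real CPU affinity, e.g. with nvidia-smi topo -m, before committing them):

NodeName=gpunode01 Name=gpu Type=m5000 File=/dev/nvidia0 Cores=0-7
NodeName=gpunode01 Name=gpu Type=m5000 File=/dev/nvidia1 Cores=8-15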
Re: [slurm-users] CUDA environment variable not being set
Christopher,

Thank you for the tip. That works as expected.

-SS-

On Thursday, October 8, 2020, Christopher Samuel wrote:
> No, you don't want that, that will allow all access to GPUs whether people have requested them or not. What you want is in gres.conf [...]
Re: [slurm-users] CUDA environment variable not being set
Hi Sajesh,

On 10/8/20 4:18 pm, Sajesh Singh wrote:
> Thank you for the tip. That works as expected.

No worries, glad it's useful. Do be aware that the core bindings for the GPUs would likely need to be adjusted for your hardware!

Best of luck,
Chris

--
Chris Samuel : http://www.csamuel.org/ : Berkeley, CA, USA