[slurm-users] CUDA environment variable not being set

2020-10-08 Thread Sajesh Singh
Slurm 18.08
CentOS 7.7.1908

I have 2 M5000 GPUs in a compute node, which is defined in the slurm.conf and 
gres.conf of the cluster, but if I launch a job requesting GPUs the environment 
variable CUDA_VISIBLE_DEVICES is never set and I see the following message in 
the slurmd.log file:

debug:  common_gres_set_env: unable to set env vars, no device files configured
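
For reference, a minimal reproduction (the partition name below is just a 
placeholder, not my actual submission script) looks like:

    srun --partition=gpu --gres=gpu:1 env | grep CUDA_VISIBLE_DEVICES

Nothing is printed, where I would expect something like CUDA_VISIBLE_DEVICES=0.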

Has anyone encountered this before?

Thank you,

SS


Re: [slurm-users] CUDA environment variable not being set

2020-10-08 Thread Relu Patrascu
That usually means you don't have the nvidia kernel module loaded, 
probably because there's no driver installed.
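
(A quick way to check, assuming a standard driver install:

    lsmod | grep nvidia
    nvidia-smi

If the module isn't listed or nvidia-smi fails, the driver is the problem.)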


Relu

On 2020-10-08 14:57, Sajesh Singh wrote:


Slurm 18.08

CentOS 7.7.1908

I have 2 M5000 GPUs in a compute node, which is defined in the 
slurm.conf and gres.conf of the cluster, but if I launch a job 
requesting GPUs the environment variable CUDA_VISIBLE_DEVICES is never 
set and I see the following message in the slurmd.log file:


debug:  common_gres_set_env: unable to set env vars, no device files 
configured


Has anyone encountered this before?

Thank you,

SS



Re: [slurm-users] CUDA environment variable not being set

2020-10-08 Thread Sajesh Singh
It seems as though the modules are loaded; when I run lsmod I get the 
following:

nvidia_drm             43714  0
nvidia_modeset       1109636  1 nvidia_drm
nvidia_uvm            935322  0
nvidia              20390295  2 nvidia_modeset,nvidia_uvm

Also the nvidia-smi command returns the following:

nvidia-smi
Thu Oct  8 16:31:57 2020
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.64.00    Driver Version: 440.64.00    CUDA Version: 10.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Quadro M5000        Off  | 00000000:02:00.0 Off |                  Off |
| 33%   21C    P0    45W / 150W |      0MiB /  8126MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Quadro M5000        Off  | 00000000:82:00.0 Off |                  Off |
| 30%   17C    P0    45W / 150W |      0MiB /  8126MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

--

-SS-

From: slurm-users  On Behalf Of Relu 
Patrascu
Sent: Thursday, October 8, 2020 4:26 PM
To: slurm-users@lists.schedmd.com
Subject: Re: [slurm-users] CUDA environment variable not being set

That usually means you don't have the nvidia kernel module loaded, probably 
because there's no driver installed.

Relu


Re: [slurm-users] CUDA environment variable not being set

2020-10-08 Thread Renfro, Michael
From any node you can run scontrol from, what does ‘scontrol show node 
GPUNODENAME | grep -i gres’ return? Mine return lines for both “Gres=” and 
“CfgTRES=”.
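
For example (the node name and counts here are made up, not from a real 
cluster), a GPU node that is fully configured shows something like:

    $ scontrol show node gpunode01 | grep -i gres
       Gres=gpu:2
       CfgTRES=cpu=32,mem=192000M,billing=32,gres/gpu=2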

From: slurm-users  on behalf of Sajesh 
Singh 
Reply-To: Slurm User Community List 
Date: Thursday, October 8, 2020 at 3:33 PM
To: Slurm User Community List 
Subject: Re: [slurm-users] CUDA environment variable not being set



It seems as though the modules are loaded; when I run lsmod I get the 
following:

nvidia_drm             43714  0
nvidia_modeset       1109636  1 nvidia_drm
nvidia_uvm            935322  0
nvidia              20390295  2 nvidia_modeset,nvidia_uvm

Also the nvidia-smi command returns the following:

nvidia-smi
Thu Oct  8 16:31:57 2020
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.64.00    Driver Version: 440.64.00    CUDA Version: 10.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Quadro M5000        Off  | 00000000:02:00.0 Off |                  Off |
| 33%   21C    P0    45W / 150W |      0MiB /  8126MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Quadro M5000        Off  | 00000000:82:00.0 Off |                  Off |
| 30%   17C    P0    45W / 150W |      0MiB /  8126MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

--

-SS-



Re: [slurm-users] CUDA environment variable not being set

2020-10-08 Thread Sajesh Singh
I only get a line returned for “Gres=”, but this is the same behavior on 
another cluster that has GPUs and the variable gets set on that cluster.

-Sajesh-

--
_
Sajesh Singh
Manager, Systems and Scientific Computing
American Museum of Natural History
200 Central Park West
New York, NY 10024

(O) (212) 313-7263
(C) (917) 763-9038
(E) ssi...@amnh.org

From: slurm-users  On Behalf Of Renfro, 
Michael
Sent: Thursday, October 8, 2020 4:53 PM
To: Slurm User Community List 
Subject: Re: [slurm-users] CUDA environment variable not being set

From any node you can run scontrol from, what does ‘scontrol show node 
GPUNODENAME | grep -i gres’ return? Mine return lines for both “Gres=” and 
“CfgTRES=”.



Re: [slurm-users] CUDA environment variable not being set

2020-10-08 Thread Brian Andrus

Do you have your gres.conf on the nodes also?
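
(For comparison only, and assuming the usual NVIDIA device paths, a minimal 
gres.conf for a two-GPU node would be something like:

    NodeName=gpunode01 Name=gpu File=/dev/nvidia[0-1]

with the node name and device files adjusted to match the hardware.)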

Brian Andrus

On 10/8/2020 11:57 AM, Sajesh Singh wrote:


Slurm 18.08

CentOS 7.7.1908

I have 2 M5000 GPUs in a compute node, which is defined in the 
slurm.conf and gres.conf of the cluster, but if I launch a job 
requesting GPUs the environment variable CUDA_VISIBLE_DEVICES is never 
set and I see the following message in the slurmd.log file:


debug:  common_gres_set_env: unable to set env vars, no device files 
configured


Has anyone encountered this before?

Thank you,

SS



Re: [slurm-users] CUDA environment variable not being set

2020-10-08 Thread Sajesh Singh
Yes. It is located in the /etc/slurm directory.

--

-SS-

From: slurm-users  On Behalf Of Brian 
Andrus
Sent: Thursday, October 8, 2020 5:02 PM
To: slurm-users@lists.schedmd.com
Subject: Re: [slurm-users] CUDA environment variable not being set


Do you have your gres.conf on the nodes also?

Brian Andrus


Re: [slurm-users] CUDA environment variable not being set

2020-10-08 Thread Relu Patrascu

Do you have a line like this in your cgroup_allowed_devices_file.conf?

/dev/nvidia*
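
For what it's worth, on systems where that file is used it typically 
whitelists the basic pseudo-devices plus the GPUs; a representative (not 
site-specific) example would be:

    /dev/null
    /dev/urandom
    /dev/zero
    /dev/sda*
    /dev/cpu/*/*
    /dev/pts/*
    /dev/nvidia*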

Relu




Re: [slurm-users] CUDA environment variable not being set

2020-10-08 Thread Christopher Samuel

Hi Sajesh,

On 10/8/20 11:57 am, Sajesh Singh wrote:

debug:  common_gres_set_env: unable to set env vars, no device files 
configured


I suspect the clue is here - what does your gres.conf look like?
Does it list the devices in /dev for the GPUs?
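
(To compare, assuming the standard NVIDIA device naming, something like:

    ls -l /dev/nvidia*

should show /dev/nvidia0 and /dev/nvidia1 on a two-GPU node, and those are the 
paths gres.conf needs in its File= entries.)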

All the best,
Chris
--
  Chris Samuel  :  http://www.csamuel.org/  :  Berkeley, CA, USA



Re: [slurm-users] CUDA environment variable not being set

2020-10-08 Thread Sajesh Singh
Relu,
  Thank you. Looks like the fix is indeed the missing file 
/etc/slurm/cgroup_allowed_devices_file.conf.



-SS-

-Original Message-
From: slurm-users  On Behalf Of 
Christopher Samuel
Sent: Thursday, October 8, 2020 6:10 PM
To: slurm-users@lists.schedmd.com
Subject: Re: [slurm-users] CUDA environment variable not being set


Hi Sajesh,

On 10/8/20 11:57 am, Sajesh Singh wrote:

> debug:  common_gres_set_env: unable to set env vars, no device files 
> configured

I suspect the clue is here - what does your gres.conf look like?
Does it list the devices in /dev for the GPUs?

All the best,
Chris
--
   Chris Samuel  :  http://www.csamuel.org/  :  Berkeley, CA, USA




Re: [slurm-users] CUDA environment variable not being set

2020-10-08 Thread Christopher Samuel

On 10/8/20 3:48 pm, Sajesh Singh wrote:


   Thank you. Looks like the fix is indeed the missing file 
/etc/slurm/cgroup_allowed_devices_file.conf


No, you don't want that; that will allow access to all the GPUs whether 
people have requested them or not.


What you want is in gres.conf and looks like (hopefully not line wrapped!):

NodeName=nodes[01-18] Name=gpu Type=v100 File=/dev/nvidia0 Cores=0,2,4,6,8
NodeName=nodes[01-18] Name=gpu Type=v100 File=/dev/nvidia1 Cores=10,12,14,16,18
NodeName=nodes[01-18] Name=gpu Type=v100 File=/dev/nvidia2 Cores=20,22,24,26,28
NodeName=nodes[01-18] Name=gpu Type=v100 File=/dev/nvidia3 Cores=30,32,34,36,38
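
Once gres.conf lists the device files and slurmd has been restarted on the 
node, a quick sanity check (GRES name as above; add whatever partition or 
account options your site needs) is something like:

    srun --gres=gpu:1 env | grep CUDA_VISIBLE_DEVICES

which should print CUDA_VISIBLE_DEVICES=0, or the index of whichever GPU was 
bound to the job.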


All the best,
Chris
--
  Chris Samuel  :  http://www.csamuel.org/  :  Berkeley, CA, USA



Re: [slurm-users] CUDA environment variable not being set

2020-10-08 Thread Sajesh Singh
Christopher,

 Thank you for the tip. That works as expected. 


-SS-

-Original Message-
From: slurm-users  On Behalf Of 
Christopher Samuel
Sent: Thursday, October 8, 2020 6:52 PM
To: slurm-users@lists.schedmd.com
Subject: Re: [slurm-users] CUDA environment variable not being set


On 10/8/20 3:48 pm, Sajesh Singh wrote:

>Thank you. Looks like the fix is indeed the missing file 
> /etc/slurm/cgroup_allowed_devices_file.conf

No, you don't want that; that will allow access to all the GPUs whether people 
have requested them or not.

What you want is in gres.conf and looks like (hopefully not line wrapped!):

NodeName=nodes[01-18] Name=gpu Type=v100 File=/dev/nvidia0 Cores=0,2,4,6,8
NodeName=nodes[01-18] Name=gpu Type=v100 File=/dev/nvidia1 Cores=10,12,14,16,18
NodeName=nodes[01-18] Name=gpu Type=v100 File=/dev/nvidia2 Cores=20,22,24,26,28
NodeName=nodes[01-18] Name=gpu Type=v100 File=/dev/nvidia3 Cores=30,32,34,36,38

All the best,
Chris
--
   Chris Samuel  :  http://www.csamuel.org/  :  Berkeley, CA, USA



Re: [slurm-users] CUDA environment variable not being set

2020-10-08 Thread Christopher Samuel

Hi Sajesh,

On 10/8/20 4:18 pm, Sajesh Singh wrote:


  Thank you for the tip. That works as expected.


No worries, glad it's useful. Do be aware that the core bindings for the 
GPUs would likely need to be adjusted for your hardware!
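
(One way to work out suitable Cores= values, if the tool is available on the 
node, is:

    nvidia-smi topo -m

which prints each GPU's CPU affinity.)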


Best of luck,
Chris
--
  Chris Samuel  :  http://www.csamuel.org/  :  Berkeley, CA, USA