Do you have a line like this in your cgroup_allowed_devices_file.conf?

/dev/nvidia*
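(For reference, a typical cgroup_allowed_devices_file.conf on a GPU node looks something like the following; the entries other than /dev/nvidia* are illustrative and should match your own hardware:)

```
/dev/null
/dev/urandom
/dev/zero
/dev/sda*
/dev/cpu/*/*
/dev/pts/*
/dev/nvidia*
```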
Relu
On 2020-10-08 16:32, Sajesh Singh wrote:
It seems that the modules are loaded; when I run lsmod I get the following:
nvidia_drm 43714 0
nvidia_modeset 1109636 1 nvidia_drm
nvidia_uvm 935322 0
nvidia 20390295 2 nvidia_modeset,nvidia_uvm
Also the nvidia-smi command returns the following:
nvidia-smi
Thu Oct 8 16:31:57 2020
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.64.00    Driver Version: 440.64.00    CUDA Version: 10.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Quadro M5000        Off  | 00000000:02:00.0 Off |                  Off |
| 33%   21C    P0    45W / 150W |      0MiB /  8126MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Quadro M5000        Off  | 00000000:82:00.0 Off |                  Off |
| 30%   17C    P0    45W / 150W |      0MiB /  8126MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
--
-SS-
*From:* slurm-users <slurm-users-boun...@lists.schedmd.com> *On Behalf
Of *Relu Patrascu
*Sent:* Thursday, October 8, 2020 4:26 PM
*To:* slurm-users@lists.schedmd.com
*Subject:* Re: [slurm-users] CUDA environment variable not being set
That usually means you don't have the nvidia kernel module loaded,
probably because there's no driver installed.
Relu
On 2020-10-08 14:57, Sajesh Singh wrote:
Slurm 18.08
CentOS 7.7.1908
I have 2 M5000 GPUs in a compute node which is defined in the
slurm.conf and gres.conf of the cluster, but if I launch a job
requesting GPUs, the environment variable CUDA_VISIBLE_DEVICES is
never set, and I see the following message in the slurmd.log file:

debug: common_gres_set_env: unable to set env vars, no device files configured
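(Editor's note: this message typically indicates that gres.conf defines the GPU count but no File= entries, so slurmd has no device files to map into CUDA_VISIBLE_DEVICES. A minimal gres.conf sketch for a node with two GPUs, assuming the standard /dev/nvidia device paths, would be:)

```
# gres.conf -- one line per GPU device file (paths assumed)
Name=gpu File=/dev/nvidia0
Name=gpu File=/dev/nvidia1
```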
Has anyone encountered this before?
Thank you,
SS