Do you have a line like this in your cgroup_allowed_devices_file.conf?

/dev/nvidia*
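
For reference, a minimal cgroup_allowed_devices_file.conf that permits access to the GPU device files might look something like the sketch below; the non-GPU entries are illustrative and vary per site:

/dev/null
/dev/urandom
/dev/zero
/dev/sda*
/dev/cpu/*/*
/dev/pts/*
/dev/nvidia*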

Relu

On 2020-10-08 16:32, Sajesh Singh wrote:

It seems as though the modules are loaded, as when I run lsmod I get the following:

nvidia_drm             43714  0
nvidia_modeset       1109636  1 nvidia_drm
nvidia_uvm            935322  0
nvidia              20390295  2 nvidia_modeset,nvidia_uvm

Also, the nvidia-smi command returns the following:

nvidia-smi
Thu Oct  8 16:31:57 2020
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.64.00    Driver Version: 440.64.00    CUDA Version: 10.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Quadro M5000        Off  | 00000000:02:00.0 Off |                  Off |
| 33%   21C    P0    45W / 150W |      0MiB /  8126MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Quadro M5000        Off  | 00000000:82:00.0 Off |                  Off |
| 30%   17C    P0    45W / 150W |      0MiB /  8126MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

--

-SS-

From: slurm-users <slurm-users-boun...@lists.schedmd.com> On Behalf Of Relu Patrascu
Sent: Thursday, October 8, 2020 4:26 PM
To: slurm-users@lists.schedmd.com
Subject: Re: [slurm-users] CUDA environment variable not being set


That usually means you don't have the nvidia kernel module loaded, probably because there's no driver installed.
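
A quick way to verify on the node itself, assuming a standard driver install, is:

lsmod | grep nvidia
ls -l /dev/nvidia*
nvidia-smi

If the module is loaded and the /dev/nvidia* device files exist, the problem is more likely in the Slurm gres/cgroup configuration.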

Relu

On 2020-10-08 14:57, Sajesh Singh wrote:

    Slurm 18.08

    CentOS 7.7.1908

    I have 2 M5000 GPUs in a compute node which are defined in the
    slurm.conf and gres.conf of the cluster, but if I launch a job
    requesting GPUs the environment variable CUDA_VISIBLE_DEVICES is
    never set and I see the following messages in the slurmd.log file:

    debug:  common_gres_set_env: unable to set env vars, no device
    files configured
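
    (For context, that debug message usually means the gres.conf entries
    have no File= lines, so slurmd does not know which device files to
    hand to the job. A gres.conf that does expose the device files might
    look roughly like the sketch below; the device paths are illustrative,
    not necessarily the actual config on this node:

    Name=gpu File=/dev/nvidia0
    Name=gpu File=/dev/nvidia1 )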

    Has anyone encountered this before?

    Thank you,

    SS
