First of all, thank you very much to the Julia and package developers. I 
really enjoy this programming language!

So far I have successfully used Julia's CUDA package on my workstation, 
which contains a single CUDA device. Now I would like to use my 
university's CUDA cluster, where a single node holds 6 CUDA devices. The 
job scheduler is Slurm, which stores the device IDs of as many unused GPUs 
as a job requests in the environment variable "CUDA_VISIBLE_DEVICES".
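
For reference, the variable holds the assigned IDs as the usual 
comma-separated list, so inside a job it can be read along these lines (a 
minimal sketch, not my actual script):

    # CUDA_VISIBLE_DEVICES holds the IDs Slurm assigned to this job,
    # comma-separated, e.g. "3" for a 1-GPU job or "0,2,5" for a 3-GPU job.
    ids = parse.(Int, split(ENV["CUDA_VISIBLE_DEVICES"], ","))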

When a job (1 CPU + 1 GPU) is assigned device ID "0", it runs fine, but 
with any other ID it fails with:

ERROR: CuDriverError(101)
in CuDevice at ~/.julia/CUDA/src/devices.jl:11
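
Boiled down, the failing part is essentially the following (a simplified 
sketch; I am assuming here that handing the ID from "CUDA_VISIBLE_DEVICES" 
straight to CuDevice, the constructor from the error trace, is what 
triggers it):

    using CUDA   # the package from the error trace above

    # The single ID Slurm assigned to this 1-GPU job, e.g. "3".
    id = parse(Int, ENV["CUDA_VISIBLE_DEVICES"])

    # Runs fine when id == 0, fails with CuDriverError(101) otherwise.
    dev = CuDevice(id)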

If I request all 6 CUDA devices of a node, I can run 6 copies of my Julia 
script (each on a different CUDA device) in parallel within the same job 
without any trouble.

So far I have checked that the device IDs assigned by Slurm match those 
reported as free by "nvidia-smi". The Slurm configuration also looks fine 
to me, especially since other users have single-GPU jobs running on the 
cluster. I have also tried CUDA 4.1 instead of 5.0, but that didn't help. 
So I wonder: does Julia's CUDA package at some point assume a particular 
device ID, or a range of device IDs that always starts at "0", or 
something along those lines?

Thanks for your help!
