Re: [slurm-users] slurm, gres:gpu, only 1 GPU out of 4 is detected

2019-11-13 Thread Chris Samuel
On Wednesday, 13 November 2019 10:11:30 AM PST Tamas Hegedus wrote:

> Thanks for your suggestion. You are right, I do not have to deal with
> specific GPUs.
> (I have not tried to compile your code; I simply tested two gromacs runs
> on the same node with --gres=gpu:1.)

How are you controlling access to GPUs?  Is that via cgroups?

If so, you should be fine, but if you're not using cgroups to control access
then you may well find that the two jobs are sharing the same GPU.
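
For reference, a minimal sketch of that cgroup confinement (option names are
from Slurm of this era; the device paths are illustrative, check your own
nodes):

=

# slurm.conf -- track processes and bind tasks with the cgroup plugins
ProctrackType=proctrack/cgroup
TaskPlugin=task/cgroup
GresTypes=gpu

# cgroup.conf -- ConstrainDevices hides unallocated GPU device files
CgroupAutomount=yes
ConstrainDevices=yes

# gres.conf on a 4-GPU node -- one line per physical device
Name=gpu File=/dev/nvidia0
Name=gpu File=/dev/nvidia1
Name=gpu File=/dev/nvidia2
Name=gpu File=/dev/nvidia3

=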

All the best,
Chris
-- 
  Chris Samuel  :  http://www.csamuel.org/  :  Berkeley, CA, USA






Re: [slurm-users] slurm, gres:gpu, only 1 GPU out of 4 is detected

2019-11-13 Thread Tamas Hegedus
Thanks for your suggestion. You are right, I do not have to deal with 
specific GPUs.
(I have not tried to compile your code; I simply tested two gromacs runs
on the same node with --gres=gpu:1.)
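
For the record, each of the two runs was submitted with a batch script along
these lines (the CPU count and input prefix are illustrative):

=

#!/bin/bash
#SBATCH --gres=gpu:1
#SBATCH --cpus-per-task=8

# No -gpu_id needed: the job is handed a single GPU, and CUDA renumbers
# whatever physical card that is as device 0 inside the job.
gmx mdrun -v -pin on -deffnm equi_nvt -nt 8 -nb gpu

=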


On 11/13/19 5:17 PM, Renfro, Michael wrote:

Pretty sure you don’t need to explicitly specify GPU IDs on a Gromacs job 
running inside of Slurm with gres=gpu. Gromacs should only see the GPUs you 
have reserved for that job.

Here’s a short verification program (compile with nvcc) you can run to confirm
that two different GPU jobs see different GPU devices:

=

// From http://www.cs.fsu.edu/~xyuan/cda5125/examples/lect24/devicequery.cu
#include <stdio.h>

void printDevProp(cudaDeviceProp dP)
{
    printf("%s has %d multiprocessors\n", dP.name, dP.multiProcessorCount);
    printf("%s has PCI BusID %d, DeviceID %d\n", dP.name, dP.pciBusID, dP.pciDeviceID);
}

int main()
{
    // Number of CUDA devices
    int devCount;
    cudaGetDeviceCount(&devCount);
    printf("There are %d CUDA devices.\n", devCount);

    // Iterate through devices
    for (int i = 0; i < devCount; ++i)
    {
        // Get device properties
        printf("CUDA Device #%d: ", i);
        cudaDeviceProp devProp;
        cudaGetDeviceProperties(&devProp, i);
        printDevProp(devProp);
    }
    return 0;
}

=

When run from two simultaneous jobs on the same node (each with a single-GPU
gres request), I get:

=

[renfro@gpunode003(job 221584) hw]$ ./cuda_props
There are 1 CUDA devices.
CUDA Device #0: Tesla K80 has 13 multiprocessors
Tesla K80 has PCI BusID 5, DeviceID 0

=

[renfro@gpunode003(job 221585) hw]$ ./cuda_props
There are 1 CUDA devices.
CUDA Device #0: Tesla K80 has 13 multiprocessors
Tesla K80 has PCI BusID 6, DeviceID 0

=



--
Tamas Hegedus, PhD
Senior Research Fellow
Department of Biophysics and Radiation Biology
Semmelweis University | phone: (36) 1-459 1500/60233
Tuzolto utca 37-47| mailto:ta...@hegelab.org
Budapest, 1094, Hungary   | http://www.hegelab.org




Re: [slurm-users] slurm, gres:gpu, only 1 GPU out of 4 is detected

2019-11-13 Thread Renfro, Michael
Pretty sure you don’t need to explicitly specify GPU IDs on a Gromacs job 
running inside of Slurm with gres=gpu. Gromacs should only see the GPUs you 
have reserved for that job.

Here’s a short verification program (compile with nvcc) you can run to confirm
that two different GPU jobs see different GPU devices:

=

// From http://www.cs.fsu.edu/~xyuan/cda5125/examples/lect24/devicequery.cu
#include <stdio.h>

void printDevProp(cudaDeviceProp dP)
{
    printf("%s has %d multiprocessors\n", dP.name, dP.multiProcessorCount);
    printf("%s has PCI BusID %d, DeviceID %d\n", dP.name, dP.pciBusID, dP.pciDeviceID);
}

int main()
{
    // Number of CUDA devices
    int devCount;
    cudaGetDeviceCount(&devCount);
    printf("There are %d CUDA devices.\n", devCount);

    // Iterate through devices
    for (int i = 0; i < devCount; ++i)
    {
        // Get device properties
        printf("CUDA Device #%d: ", i);
        cudaDeviceProp devProp;
        cudaGetDeviceProperties(&devProp, i);
        printDevProp(devProp);
    }
    return 0;
}

=

When run from two simultaneous jobs on the same node (each with a single-GPU
gres request), I get:

=

[renfro@gpunode003(job 221584) hw]$ ./cuda_props
There are 1 CUDA devices.
CUDA Device #0: Tesla K80 has 13 multiprocessors
Tesla K80 has PCI BusID 5, DeviceID 0

=

[renfro@gpunode003(job 221585) hw]$ ./cuda_props
There are 1 CUDA devices.
CUDA Device #0: Tesla K80 has 13 multiprocessors
Tesla K80 has PCI BusID 6, DeviceID 0

=
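
For completeness, a sketch of how the binary was built and the two jobs were
launched; everything beyond --gres is illustrative:

=

nvcc -o cuda_props devicequery.cu

# Submit two single-GPU jobs that can land on the same node:
sbatch --gres=gpu:1 --wrap='./cuda_props'
sbatch --gres=gpu:1 --wrap='./cuda_props'

=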

-- 
Mike Renfro, PhD / HPC Systems Administrator, Information Technology Services
931 372-3601 / Tennessee Tech University

> On Nov 13, 2019, at 9:54 AM, Tamas Hegedus  wrote:
> 
> Hi,
> 
> I run gmx 2019 using GPUs.
> There are 4 GPUs in each of my GPU hosts.
> I have Slurm with gres=gpu configured.
> 
> 1. If I submit a job with --gres=gpu:1, then GPU #0 is identified and used
> (I pass -gpu_id $CUDA_VISIBLE_DEVICES to mdrun).
> 2. If I submit a second job, it fails: $CUDA_VISIBLE_DEVICES is 1 and is
> passed to -gpu_id, but gmx reports only GPU #0 as a compatible GPU.
> From the output:
> 
> gmx mdrun -v -pin on -deffnm equi_nvt -nt 8 -gpu_id 1 -nb gpu -pme gpu
> -npme 1 -ntmpi 4
> 
>  GPU info:
>    Number of GPUs detected: 1
>    #0: NVIDIA GeForce GTX 1080 Ti, compute cap.: 6.1, ECC: no, stat: compatible
> 
> Fatal error:
> You limited the set of compatible GPUs to a set that included ID #1, but
> that ID is not for a compatible GPU. List only compatible GPUs.
> 
> 3. If I log in to that node and run the mdrun command written into the
> output in the previous step, then it selects the right GPU and runs as
> expected.
> 
> $CUDA_DEVICE_ORDER is set to PCI_BUS_ID
> 
> I cannot decide whether this is a Slurm config error or something with
> gromacs, as $CUDA_VISIBLE_DEVICES is set correctly by Slurm and I expect
> gromacs to detect all 4 GPUs.
> 
> Thanks for your help and suggestions,
> Tamas
> 
> --
> 
> Tamas Hegedus, PhD
> Senior Research Fellow
> Department of Biophysics and Radiation Biology
> Semmelweis University | phone: (36) 1-459 1500/60233
> Tuzolto utca 37-47| mailto:ta...@hegelab.org
> Budapest, 1094, Hungary   | http://www.hegelab.org
> 
> 



[slurm-users] slurm, gres:gpu, only 1 GPU out of 4 is detected

2019-11-13 Thread Tamas Hegedus

Hi,

I run gmx 2019 using GPUs.
There are 4 GPUs in each of my GPU hosts.
I have Slurm with gres=gpu configured.

1. If I submit a job with --gres=gpu:1, then GPU #0 is identified and used
(I pass -gpu_id $CUDA_VISIBLE_DEVICES to mdrun).
2. If I submit a second job, it fails: $CUDA_VISIBLE_DEVICES is 1 and is
passed to -gpu_id, but gmx reports only GPU #0 as a compatible GPU.

From the output:

gmx mdrun -v -pin on -deffnm equi_nvt -nt 8 -gpu_id 1 -nb gpu -pme gpu 
-npme 1 -ntmpi 4


  GPU info:
    Number of GPUs detected: 1
    #0: NVIDIA GeForce GTX 1080 Ti, compute cap.: 6.1, ECC:  no, stat: 
compatible


Fatal error:
You limited the set of compatible GPUs to a set that included ID #1, but
that ID is not for a compatible GPU. List only compatible GPUs.

3. If I log in to that node and run the mdrun command written into the
output in the previous step, then it selects the right GPU and runs as
expected.


$CUDA_DEVICE_ORDER is set to PCI_BUS_ID

I cannot decide whether this is a Slurm config error or something with
gromacs, as $CUDA_VISIBLE_DEVICES is set correctly by Slurm and I expect
gromacs to detect all 4 GPUs.
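
A quick check of what a job actually sees (a sketch; assumes interactive
srun access):

=

# Ask Slurm for one GPU and report what the job environment gets.
# nvidia-smi lists every GPU whose /dev file the job can open, so under
# cgroup device confinement it should show only the allocated card;
# CUDA applications then renumber the visible devices starting at 0.
srun --gres=gpu:1 bash -c 'echo "CUDA_VISIBLE_DEVICES=$CUDA_VISIBLE_DEVICES"; nvidia-smi -L'

=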


Thanks for your help and suggestions,
Tamas

--

Tamas Hegedus, PhD
Senior Research Fellow
Department of Biophysics and Radiation Biology
Semmelweis University | phone: (36) 1-459 1500/60233
Tuzolto utca 37-47| mailto:ta...@hegelab.org
Budapest, 1094, Hungary   | http://www.hegelab.org