Re: [slurm-users] slurm, gres:gpu, only 1 GPU out of 4 is detected

2019-11-13 Thread Chris Samuel
On Wednesday, 13 November 2019 10:11:30 AM PST Tamas Hegedus wrote:

> Thanks for your suggestion. You are right, I do not have to deal with
> specific GPUs.
> (I have not tried to compile your code; I simply tested two gromacs runs
> on the same node with the --gres=gpu:1 option.)

How are you controlling access to GPUs?  Is that via cgroups?

If so, you should be fine, but if you're not using cgroups to control access 
then you may well find that the two jobs are sharing the same GPU.
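
For reference, a minimal sketch of the configuration that gives you that cgroup 
confinement (illustrative only; the node name and device paths below are 
assumptions for a 4-GPU node, and TaskPlugin must include task/cgroup):

=

# slurm.conf -- enable the cgroup task plugin and advertise the GRES
TaskPlugin=task/cgroup
GresTypes=gpu
NodeName=gpunode01 Gres=gpu:4 ...

# cgroup.conf -- restrict each job to only the devices it was allocated
ConstrainDevices=yes

# gres.conf on the GPU node -- map the gpu GRES onto the NVIDIA device files
NodeName=gpunode01 Name=gpu File=/dev/nvidia[0-3]

=

With ConstrainDevices=yes, a second --gres=gpu:1 job on the same node simply 
cannot open the first job's /dev/nvidia* device, so the two runs cannot end up 
on the same GPU.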

All the best,
Chris
-- 
  Chris Samuel  :  http://www.csamuel.org/  :  Berkeley, CA, USA






Re: [slurm-users] Upgrade slurm to 19.05.3 from 18.08.7

2019-11-13 Thread Bas van der Vlies




On 11/13/19 8:36 PM, Christopher Samuel wrote:


https://slurm.schedmd.com/quickstart_admin.html#upgrade

As Ole says, *always* upgrade slurmdbd first, then slurmctld, and finally 
the slurmds.  This is required because of the way the RPC protocol support 
for older versions works.




Thanks Chris, I also found the above link. I had read the RPC documentation 
wrong and now have the correct procedure for upgrading.

--
Bas van der Vlies
| Operations, Support & Development | SURFsara | Science Park 140 | 1098 
XG Amsterdam

| T +31 (0) 20 800 1300  | bas.vandervl...@surfsara.nl | www.surfsara.nl |



Re: [slurm-users] Upgrade slurm to 19.05.3 from 18.08.7

2019-11-13 Thread Bas van der Vlies




Hi Bas,

Your order of upgrading is *not recommended*; see for example page 6 of 
the presentation "Field Notes From A MadMan" by Tim Wickberg, SchedMD, on 
the page https://slurm.schedmd.com/publications.html


Versions may be mixed as follows:
slurmdbd >= slurmctld >= slurmd >= commands


Thanks a lot Ole, this helps a lot.

Regards


--
Bas van der Vlies
| Operations, Support & Development | SURFsara | Science Park 140 | 1098 
XG Amsterdam

| T +31 (0) 20 800 1300  | bas.vandervl...@surfsara.nl | www.surfsara.nl |



Re: [slurm-users] Upgrade slurm to 19.05.3 from 18.08.7

2019-11-13 Thread Christopher Samuel

On 11/13/19 10:42 AM, Ole Holm Nielsen wrote:

Your order of upgrading is *not recommended*; see for example page 6 of 
the presentation "Field Notes From A MadMan" by Tim Wickberg, SchedMD, on 
the page https://slurm.schedmd.com/publications.html


Also the documentation for upgrading here:

https://slurm.schedmd.com/quickstart_admin.html#upgrade

As Ole says, *always* upgrade slurmdbd first, then slurmctld, and finally 
the slurmds.  This is required because of the way the RPC protocol support 
for older versions works.
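
For concreteness, a rough sketch of that order (the install steps are 
illustrative only; how you deploy the new packages depends on your 
distribution and build method):

=

# 1. slurmdbd first: stop it, back up the accounting DB, install 19.05.3, restart
systemctl stop slurmdbd
mysqldump slurm_acct_db > slurm_acct_db_backup.sql   # assumes the default DB name
# ... install the 19.05.3 slurmdbd ...
systemctl start slurmdbd

# 2. then slurmctld on the controller
systemctl stop slurmctld
# ... install 19.05.3 ...
systemctl start slurmctld

# 3. finally slurmd on the compute nodes, which can be done as a rolling upgrade
systemctl stop slurmd
# ... install 19.05.3 ...
systemctl start slurmd

=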


--
  Chris Samuel  :  http://www.csamuel.org/  :  Berkeley, CA, USA



Re: [slurm-users] Upgrade slurm to 19.05.3 from 18.08.7

2019-11-13 Thread Ole Holm Nielsen

On 13-11-2019 18:04, Bas van der Vlies wrote:
We currently have version 18.08.7 installed on our cluster and want to 
upgrade to 19.05.3. So I wanted to start small and installed it on one of 
our compute nodes. But if I start the 'slurmd' then our slurmctld will 
complain that:

{{{
[2019-11-13T17:49:37.402] error: slurm_unpack_received_msg: Incompatible 
versions of client and server code
[2019-11-13T17:49:37.412] error: slurm_receive_msg [10.10.0.40:32546]: 
Unspecified error
[2019-11-13T17:49:38.413] error: slurm_unpack_received_msg: Invalid 
Protocol Version 8704 from uid=-1 at 10.10.0.40:32548
[2019-11-13T17:49:38.413] error: slurm_unpack_received_msg: Incompatible 
versions of client and server code

}}}


I have read about the RPC protocol:
  * https://slurm.schedmd.com/rpc.html

Can an old `slurmctld` not communicate with a newer `slurmd`? Or is this 
setup supported and is something else going wrong?


Hi Bas,

Your order of upgrading is *not recommended*; see for example page 6 of 
the presentation "Field Notes From A MadMan" by Tim Wickberg, SchedMD, on 
the page https://slurm.schedmd.com/publications.html


Versions may be mixed as follows:
slurmdbd >= slurmctld >= slurmd >= commands

Perhaps you may find some further useful information on my Slurm Wiki page:
https://wiki.fysik.dtu.dk/niflheim/Slurm_installation

/Ole



Re: [slurm-users] slurm, gres:gpu, only 1 GPU out of 4 is detected

2019-11-13 Thread Tamas Hegedus
Thanks for your suggestion. You are right, I do not have to deal with 
specific GPUs.
(I have not tried to compile your code; I simply tested two gromacs runs 
on the same node with the --gres=gpu:1 option.)


On 11/13/19 5:17 PM, Renfro, Michael wrote:

Pretty sure you don’t need to explicitly specify GPU IDs on a Gromacs job 
running inside of Slurm with gres=gpu. Gromacs should only see the GPUs you 
have reserved for that job.

Here’s a verification program you can run to confirm that two different GPU jobs 
see different GPU devices (compile with nvcc):

=

// From http://www.cs.fsu.edu/~xyuan/cda5125/examples/lect24/devicequery.cu
#include <stdio.h>

// Print a short summary of one CUDA device
void printDevProp(cudaDeviceProp dP)
{
    printf("%s has %d multiprocessors\n", dP.name, dP.multiProcessorCount);
    printf("%s has PCI BusID %d, DeviceID %d\n", dP.name, dP.pciBusID, dP.pciDeviceID);
}

int main()
{
    // Number of CUDA devices visible to this process
    int devCount;
    cudaGetDeviceCount(&devCount);
    printf("There are %d CUDA devices.\n", devCount);

    // Iterate through the visible devices and print their properties
    for (int i = 0; i < devCount; ++i)
    {
        printf("CUDA Device #%d: ", i);
        cudaDeviceProp devProp;
        cudaGetDeviceProperties(&devProp, i);
        printDevProp(devProp);
    }
    return 0;
}

=

When run from two simultaneous jobs on the same node (each with a gres=gpu), I 
get:

=

[renfro@gpunode003(job 221584) hw]$ ./cuda_props
There are 1 CUDA devices.
CUDA Device #0: Tesla K80 has 13 multiprocessors
Tesla K80 has PCI BusID 5, DeviceID 0

=

[renfro@gpunode003(job 221585) hw]$ ./cuda_props
There are 1 CUDA devices.
CUDA Device #0: Tesla K80 has 13 multiprocessors
Tesla K80 has PCI BusID 6, DeviceID 0

=



--
Tamas Hegedus, PhD
Senior Research Fellow
Department of Biophysics and Radiation Biology
Semmelweis University | phone: (36) 1-459 1500/60233
Tuzolto utca 37-47| mailto:ta...@hegelab.org
Budapest, 1094, Hungary   | http://www.hegelab.org




[slurm-users] Upgrade slurm to 19.05.3 from 18.08.7

2019-11-13 Thread Bas van der Vlies
We currently have version 18.08.7 installed on our cluster and want to 
upgrade to 19.05.3. So I wanted to start small and installed it on one of 
our compute nodes. But if I start the 'slurmd' then our slurmctld will 
complain that:

{{{
[2019-11-13T17:49:37.402] error: slurm_unpack_received_msg: Incompatible 
versions of client and server code
[2019-11-13T17:49:37.412] error: slurm_receive_msg [10.10.0.40:32546]: 
Unspecified error
[2019-11-13T17:49:38.413] error: slurm_unpack_received_msg: Invalid 
Protocol Version 8704 from uid=-1 at 10.10.0.40:32548
[2019-11-13T17:49:38.413] error: slurm_unpack_received_msg: Incompatible 
versions of client and server code

}}}


I have read about the RPC protocol:
 * https://slurm.schedmd.com/rpc.html

Can an old `slurmctld` not communicate with a newer `slurmd`? Or is this 
setup supported and is something else going wrong?


Regards

--
Bas van der Vlies
| Operations, Support & Development | SURFsara | Science Park 140 | 1098 
XG Amsterdam

| T +31 (0) 20 800 1300  | bas.vandervl...@surfsara.nl | www.surfsara.nl |



Re: [slurm-users] slurm, gres:gpu, only 1 GPU out of 4 is detected

2019-11-13 Thread Renfro, Michael
Pretty sure you don’t need to explicitly specify GPU IDs on a Gromacs job 
running inside of Slurm with gres=gpu. Gromacs should only see the GPUs you 
have reserved for that job.

Here’s a verification program you can run to confirm that two different GPU jobs 
see different GPU devices (compile with nvcc):

=

// From http://www.cs.fsu.edu/~xyuan/cda5125/examples/lect24/devicequery.cu
#include <stdio.h>

// Print a short summary of one CUDA device
void printDevProp(cudaDeviceProp dP)
{
    printf("%s has %d multiprocessors\n", dP.name, dP.multiProcessorCount);
    printf("%s has PCI BusID %d, DeviceID %d\n", dP.name, dP.pciBusID, dP.pciDeviceID);
}

int main()
{
    // Number of CUDA devices visible to this process
    int devCount;
    cudaGetDeviceCount(&devCount);
    printf("There are %d CUDA devices.\n", devCount);

    // Iterate through the visible devices and print their properties
    for (int i = 0; i < devCount; ++i)
    {
        printf("CUDA Device #%d: ", i);
        cudaDeviceProp devProp;
        cudaGetDeviceProperties(&devProp, i);
        printDevProp(devProp);
    }
    return 0;
}

=

When run from two simultaneous jobs on the same node (each with a gres=gpu), I 
get:

=

[renfro@gpunode003(job 221584) hw]$ ./cuda_props
There are 1 CUDA devices.
CUDA Device #0: Tesla K80 has 13 multiprocessors
Tesla K80 has PCI BusID 5, DeviceID 0

=

[renfro@gpunode003(job 221585) hw]$ ./cuda_props
There are 1 CUDA devices.
CUDA Device #0: Tesla K80 has 13 multiprocessors
Tesla K80 has PCI BusID 6, DeviceID 0

=
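
One more note (my reading of the behaviour, not something stated explicitly 
above): when Slurm sets CUDA_VISIBLE_DEVICES for a job, CUDA re-enumerates the 
visible devices starting at 0, which is why each job reports a single device #0 
with a different PCI bus ID. A hypothetical session inside the second job:

=

$ echo $CUDA_VISIBLE_DEVICES
1
$ ./cuda_props
There are 1 CUDA devices.
CUDA Device #0: Tesla K80 has 13 multiprocessors
Tesla K80 has PCI BusID 6, DeviceID 0

# Inside the job the only valid device ID is 0, so don't pass
# -gpu_id $CUDA_VISIBLE_DEVICES to gromacs; omit -gpu_id or use -gpu_id 0.

=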

-- 
Mike Renfro, PhD / HPC Systems Administrator, Information Technology Services
931 372-3601 / Tennessee Tech University

> On Nov 13, 2019, at 9:54 AM, Tamas Hegedus  wrote:
> 
> Hi,
> 
> I run gmx 2019 using GPUs.
> There are 4 GPUs in my GPU hosts.
> I have Slurm configured with gres=gpu.
> 
> 1. If I submit a job with --gres=gpu:1 then GPU #0 is identified and used
> (-gpu_id $CUDA_VISIBLE_DEVICES).
> 2. If I submit a second job, it fails: $CUDA_VISIBLE_DEVICES is 1 and that
> ID is selected, but gmx identifies only GPU #0 as a compatible GPU.
> From the output:
> 
> gmx mdrun -v -pin on -deffnm equi_nvt -nt 8 -gpu_id 1 -nb gpu -pme gpu
> -npme 1 -ntmpi 4
> 
>  GPU info:
>Number of GPUs detected: 1
>#0: NVIDIA GeForce GTX 1080 Ti, compute cap.: 6.1, ECC:  no, stat:
> compatible
> 
> Fatal error:
> You limited the set of compatible GPUs to a set that included ID #1, but
> that
> ID is not for a compatible GPU. List only compatible GPUs.
> 
> 3. If I log in to that node and run the mdrun command written to the
> output in the previous step, then it selects the right GPU and runs as
> expected.
> 
> $CUDA_DEVICE_ORDER is set to PCI_BUS_ID
> 
> I cannot decide if this is a Slurm config error or something with
> gromacs, as $CUDA_VISIBLE_DEVICES is set correctly by Slurm and I expect
> gromacs to detect all 4 GPUs.
> 
> Thanks for your help and suggestions,
> Tamas
> 
> --
> 
> Tamas Hegedus, PhD
> Senior Research Fellow
> Department of Biophysics and Radiation Biology
> Semmelweis University | phone: (36) 1-459 1500/60233
> Tuzolto utca 37-47| mailto:ta...@hegelab.org
> Budapest, 1094, Hungary   | http://www.hegelab.org
> 
> 



[slurm-users] slurm, gres:gpu, only 1 GPU out of 4 is detected

2019-11-13 Thread Tamas Hegedus

Hi,

I run gmx 2019 using GPUs.
There are 4 GPUs in my GPU hosts.
I have Slurm configured with gres=gpu.

1. If I submit a job with --gres=gpu:1 then GPU #0 is identified and used 
(-gpu_id $CUDA_VISIBLE_DEVICES).
2. If I submit a second job, it fails: $CUDA_VISIBLE_DEVICES is 1 and that 
ID is selected, but gmx identifies only GPU #0 as a compatible GPU.

From the output:

gmx mdrun -v -pin on -deffnm equi_nvt -nt 8 -gpu_id 1 -nb gpu -pme gpu 
-npme 1 -ntmpi 4


  GPU info:
    Number of GPUs detected: 1
    #0: NVIDIA GeForce GTX 1080 Ti, compute cap.: 6.1, ECC:  no, stat: 
compatible


Fatal error:
You limited the set of compatible GPUs to a set that included ID #1, but 
that

ID is not for a compatible GPU. List only compatible GPUs.

3. If I log in to that node and run the mdrun command written to the 
output in the previous step, then it selects the right GPU and runs as 
expected.


$CUDA_DEVICE_ORDER is set to PCI_BUS_ID

I cannot decide if this is a Slurm config error or something with 
gromacs, as $CUDA_VISIBLE_DEVICES is set correctly by Slurm and I expect 
gromacs to detect all 4 GPUs.


Thanks for your help and suggestions,
Tamas

--

Tamas Hegedus, PhD
Senior Research Fellow
Department of Biophysics and Radiation Biology
Semmelweis University | phone: (36) 1-459 1500/60233
Tuzolto utca 37-47| mailto:ta...@hegelab.org
Budapest, 1094, Hungary   | http://www.hegelab.org