Re: [gmx-users] How to use multiple nodes, each with 2 CPUs and 3 GPUs

2013-04-25 Thread Szilárd Páll
Hi,

You should really check out the documentation on how to use mdrun 4.6:
http://www.gromacs.org/Documentation/Acceleration_and_parallelization#Running_simulations

Brief summary: when running on GPUs, every domain is assigned to a set
of CPU cores and a GPU, hence you need to start as many PP MPI ranks
per node as there are GPUs (or pass a PP-GPU mapping manually).
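
For example, on one of your nodes (16 cores, 3 GPUs) a single-node run
along these lines should give the intended one-rank-per-GPU layout (the
explicit -gpu_id 012 mapping just spells out what mdrun would pick by
default, so it is optional here):
mdrun -ntmpi 3 -ntomp 5 -gpu_id 012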


Now, there are some slight complications due to the inconvenient
hardware setup of the machines you are using. When the number of cores
is not divisible by the number of GPUs, you end up wasting cores: in
your case only 3*5=15 of the 16 cores per compute node will be used.
To make things even worse, unless you use "-pin on" (which is the
default behavior *only* if you use all cores in a node), mdrun will
not lock threads to cores but will let the OS move them around, which
can cause severe performance degradation.
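
So with a 3x5 layout you would want to request pinning explicitly,
something like:
mdrun -ntmpi 3 -ntomp 5 -pin on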

However, you can actually work around these issues and get good
performance by using separate PME ranks. You can just try using 3 PP +
1 PME rank per compute node with four OpenMP threads each, e.g.:
mpirun -np 4*Nnodes mdrun_mpi -npme Nnodes -ntomp 4
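For example, on 2 nodes that would expand to something like this
(assuming the MPI-enabled binary is installed as mdrun_mpi and noting
that -npme counts the total number of separate PME ranks across all
nodes):
mpirun -np 8 mdrun_mpi -npme 2 -ntomp 4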
If you are lucky with the PP/PME load, this should work well, and even
if you get some PP-PME imbalance, it should hurt performance far less
than the awkward 3x5 thread layout.

Cheers,
--
Szilárd


On Wed, Apr 24, 2013 at 7:08 PM, Christopher Neale wrote:
> Dear Users:
>
> I am having trouble getting any speedup by using more than one node,
> where each node has 2 8-core cpus and 3 GPUs. I am using gromacs 4.6.1.
>
> ...

RE: [gmx-users] How to use multiple nodes, each with 2 CPUs and 3 GPUs

2013-04-25 Thread Berk Hess
Hi,

You're using thread-MPI, but to run across multiple nodes you need to
compile with real MPI and then start as many MPI processes as the total
number of GPUs.
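
Roughly, something along these lines (a sketch; the exact cmake options
and binary name depend on your installation, mdrun_mpi being the usual
name for an MPI build):
cmake .. -DGMX_MPI=ON -DGMX_GPU=ON
make install
mpirun -np 9 mdrun_mpi -ntomp 5   # e.g. 3 nodes x 3 GPUs = 9 ranks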

Cheers,

Berk

> From: chris.ne...@mail.utoronto.ca
> To: gmx-users@gromacs.org
> Date: Wed, 24 Apr 2013 17:08:28 +
> Subject: [gmx-users] How to use multiple nodes, each with 2 CPUs and 3 GPUs
> 
> Dear Users:
> 
> I am having trouble getting any speedup by using more than one node, 
> where each node has 2 8-core cpus and 3 GPUs. I am using gromacs 4.6.1.
> 
> I saw this post, indicating that the .log file output about number of gpus 
> used might not be accurate:
> http://lists.gromacs.org/pipermail/gmx-users/2013-March/079802.html
> 
> Still, I'm getting 21.2 ns/day on 1 node, 21.2 ns/day on 2 nodes, and 20.5 
> ns/day on 3 nodes. 
> Somehow I think I have not configured the mpirun -np and mdrun -ntomp
> correctly 
> (although I have tried numerous combinations).
> 
> On 1 node, I can just run mdrun without mpirun like this:
> http://lists.gromacs.org/pipermail/gmx-users/2013-March/079802.html
> 
> For that run, the top of the .log file is:
> Log file opened on Wed Apr 24 11:36:53 2013
> Host: kfs179  pid: 59561  nodeid: 0  nnodes:  1
> Gromacs version:VERSION 4.6.1
> Precision:  single
> Memory model:   64 bit
> MPI library:thread_mpi
> OpenMP support: enabled
> GPU support:enabled
> invsqrt routine:gmx_software_invsqrt(x)
> CPU acceleration:   AVX_256
> FFT library:fftw-3.3.3-sse2
> Large file support: enabled
> RDTSCP usage:   enabled
> Built on:   Tue Apr 23 12:59:48 EDT 2013
> Built by:   cne...@kfslogin2.nics.utk.edu [CMAKE]
> Build OS/arch:  Linux 2.6.32-220.4.1.el6.x86_64 x86_64
> Build CPU vendor:   GenuineIntel
> Build CPU brand:Intel(R) Xeon(R) CPU E5-2670 0 @ 2.60GHz
> Build CPU family:   6   Model: 45   Stepping: 7
> Build CPU features: aes apic avx clfsh cmov cx8 cx16 htt lahf_lm mmx msr 
> nonstop_tsc pcid pclmuldq pdcm pdpe1gb popcnt pse rdtscp sse2 sse3 sse4.1 
> sse4.2 ssse3 tdt x2apic
> C compiler: /opt/intel/composer_xe_2011_sp1.11.339/bin/intel64/icc 
> Intel icc (ICC) 12.1.5 20120612
> C compiler flags:   -mavx   -std=gnu99 -Wall   -ip -funroll-all-loops  -O3 
> -DNDEBUG
> C++ compiler:   /opt/intel/composer_xe_2011_sp1.11.339/bin/intel64/icpc 
> Intel icpc (ICC) 12.1.5 20120612
> C++ compiler flags: -mavx   -Wall   -ip -funroll-all-loops  -O3 -DNDEBUG
> CUDA compiler:  nvcc: NVIDIA (R) Cuda compiler driver;Copyright (c) 
> 2005-2012 NVIDIA Corporation;Built on Thu_Apr__5_00:24:31_PDT_2012;Cuda 
> compilation tools, release 4.2, V0.2.1221
> CUDA driver:5.0
> CUDA runtime:   4.20
> ...
> 
> ...
> Initializing Domain Decomposition on 3 nodes
> Dynamic load balancing: yes
> Will sort the charge groups at every domain (re)decomposition
> Initial maximum inter charge-group distances:
> two-body bonded interactions: 0.431 nm, LJ-14, atoms 101 108
>   multi-body bonded interactions: 0.431 nm, Proper Dih., atoms 101 108
> Minimum cell size due to bonded interactions: 0.475 nm
> Maximum distance for 7 constraints, at 120 deg. angles, all-trans: 1.175 nm
> Estimated maximum distance required for P-LINCS: 1.175 nm
> This distance will limit the DD cell size, you can override this with -rcon
> Using 0 separate PME nodes, per user request
> Scaling the initial minimum size with 1/0.8 (option -dds) = 1.25
> Optimizing the DD grid for 3 cells with a minimum initial size of 1.469 nm
> The maximum allowed number of cells is: X 5 Y 5 Z 6
> Domain decomposition grid 3 x 1 x 1, separate PME nodes 0
> PME domain decomposition: 3 x 1 x 1
> Domain decomposition nodeid 0, coordinates 0 0 0
> 
> Using 3 MPI threads
> Using 5 OpenMP threads per tMPI thread
> 
> Detecting CPU-specific acceleration.
> Present hardware specification:
> Vendor: GenuineIntel
> Brand:  Intel(R) Xeon(R) CPU E5-2670 0 @ 2.60GHz
> Family:  6  Model: 45  Stepping:  7
> Features: aes apic avx clfsh cmov cx8 cx16 htt lahf_lm mmx msr nonstop_tsc 
> pcid pclmuldq pdcm pdpe1gb popcnt pse rdtscp sse2 sse3 sse4.1 sse4.2 ssse3 
> tdt x2apic
> Acceleration most likely to fit this hardware: AVX_256
> Acceleration selected at GROMACS compile time: AVX_256
> 
> 
> 3 GPUs detected:
>   #0: NVIDIA Tesla M2090, compute cap.: 2.0, ECC: yes, stat: compatible
>   #1: NVIDIA Tesla M2090, compute cap.: 2.0, ECC: yes, stat: compatible
>   #2: NVIDIA Tesla M2090, compute cap.: 2.0, ECC: yes, stat: compatible
> 
> 3 GPUs auto-selected for this run: #0, #1, #2
> 
> Will do PME sum in reciprocal space.
> ...
> 
> ...
> 
>  R E A L   C Y C L E   A N D   T I M E   A C C O U N T I N G
> 
>  Computing:        Nodes   Th.     Count  Wall t (s)     G-Cycles       %
> -----------------------------------------------------------------------------
>  Domain decomp.        3    5      4380      23.714      922.574       6.7
>  DD comm. load         3    5      4379       0.054        2.114       0.0
>  DD comm. bounds       3    5       438