Re: [gmx-users] How to use multiple nodes, each with 2 CPUs and 3 GPUs
Hi,

You should really check out the documentation on how to use mdrun 4.6:
http://www.gromacs.org/Documentation/Acceleration_and_parallelization#Running_simulations

Brief summary: when running on GPUs, every domain is assigned to a set of CPU cores and a GPU, hence you need to start as many PP MPI ranks per node as there are GPUs (or pass a PP-GPU mapping manually).

Now, there are some slight complications with the inconvenient hardware setup of the machines you are using. When the number of cores is not divisible by the number of GPUs, you'll end up wasting cores; in your case only 3*5=15 cores per compute node will be used. What makes things even worse, unless you use "-pin on" (which is the default behavior *only* if you use all cores in a node), is that mdrun will not lock threads to cores and will let them be moved around by the OS, which can cause severe performance degradation.

However, you can work around these issues and get good performance by using separate PME ranks. You can try using 3 PP + 1 PME ranks per compute node with four OpenMP threads each, i.e.:

mpirun -np 4*Nnodes mdrun_mpi -npme 1 -ntomp 4

If you are lucky with the PP/PME load this should work well, and even if you get some PP-PME imbalance, it should hurt performance far less than the inconvenient 3x5-thread setup.

Cheers,
--
Szilárd

On Wed, Apr 24, 2013 at 7:08 PM, Christopher Neale wrote:
> Dear Users:
>
> I am having trouble getting any speedup by using more than one node,
> where each node has 2 8-core CPUs and 3 GPUs. I am using gromacs 4.6.1.
>
> I saw this post, indicating that the .log file output about number of gpus
> used might not be accurate:
> http://lists.gromacs.org/pipermail/gmx-users/2013-March/079802.html
>
> Still, I'm getting 21.2 ns/day on 1 node, 21.2 ns/day on 2 nodes, and 20.5
> ns/day on 3 nodes.
> Somehow I think I have not configured the mpirun -np and mdrun -ntomp
> correctly (although I have tried numerous combinations).
>
> On 1 node, I can just run mdrun without mpirun like this:
> http://lists.gromacs.org/pipermail/gmx-users/2013-March/079802.html
>
> For that run, the top of the .log file is:
> Log file opened on Wed Apr 24 11:36:53 2013
> Host: kfs179  pid: 59561  nodeid: 0  nnodes: 1
> Gromacs version:    VERSION 4.6.1
> Precision:          single
> Memory model:       64 bit
> MPI library:        thread_mpi
> OpenMP support:     enabled
> GPU support:        enabled
> invsqrt routine:    gmx_software_invsqrt(x)
> CPU acceleration:   AVX_256
> FFT library:        fftw-3.3.3-sse2
> Large file support: enabled
> RDTSCP usage:       enabled
> Built on:           Tue Apr 23 12:59:48 EDT 2013
> Built by:           cne...@kfslogin2.nics.utk.edu [CMAKE]
> Build OS/arch:      Linux 2.6.32-220.4.1.el6.x86_64 x86_64
> Build CPU vendor:   GenuineIntel
> Build CPU brand:    Intel(R) Xeon(R) CPU E5-2670 0 @ 2.60GHz
> Build CPU family:   6   Model: 45   Stepping: 7
> Build CPU features: aes apic avx clfsh cmov cx8 cx16 htt lahf_lm mmx msr
>   nonstop_tsc pcid pclmuldq pdcm pdpe1gb popcnt pse rdtscp sse2 sse3 sse4.1
>   sse4.2 ssse3 tdt x2apic
> C compiler:         /opt/intel/composer_xe_2011_sp1.11.339/bin/intel64/icc
>   Intel icc (ICC) 12.1.5 20120612
> C compiler flags:   -mavx -std=gnu99 -Wall -ip -funroll-all-loops -O3 -DNDEBUG
> C++ compiler:       /opt/intel/composer_xe_2011_sp1.11.339/bin/intel64/icpc
>   Intel icpc (ICC) 12.1.5 20120612
> C++ compiler flags: -mavx -Wall -ip -funroll-all-loops -O3 -DNDEBUG
> CUDA compiler:      nvcc: NVIDIA (R) Cuda compiler driver; Copyright (c)
>   2005-2012 NVIDIA Corporation; Built on Thu_Apr__5_00:24:31_PDT_2012;
>   Cuda compilation tools, release 4.2, V0.2.1221
> CUDA driver:        5.0
> CUDA runtime:       4.20
> ...
>
> ...
> Initializing Domain Decomposition on 3 nodes
> Dynamic load balancing: yes
> Will sort the charge groups at every domain (re)decomposition
> Initial maximum inter charge-group distances:
>     two-body bonded interactions: 0.431 nm, LJ-14, atoms 101 108
>   multi-body bonded interactions: 0.431 nm, Proper Dih., atoms 101 108
> Minimum cell size due to bonded interactions: 0.475 nm
> Maximum distance for 7 constraints, at 120 deg. angles, all-trans: 1.175 nm
> Estimated maximum distance required for P-LINCS: 1.175 nm
> This distance will limit the DD cell size, you can override this with -rcon
> Using 0 separate PME nodes, per user request
> Scaling the initial minimum size with 1/0.8 (option -dds) = 1.25
> Optimizing the DD grid for 3 cells with a minimum initial size of 1.469 nm
> The maximum allowed number of cells is: X 5 Y 5 Z 6
> Domain decomposition grid 3 x 1 x 1, separate PME nodes 0
> PME domain decomposition: 3 x 1 x 1
> Domain decomposition nodeid 0, coordinates 0 0 0
>
> Using 3 MPI threads
> Using 5 OpenMP threads per tMPI thread
>
> Detecting CPU-specific acceleration.
> Present hardware specification:
> Vendor: GenuineIntel
> Brand:  Intel(R) Xeon(R) CPU E5-2670 0 @ 2.60GHz
> Family:
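[Editor's sketch] Szilárd's "3 PP + 1 PME ranks per node" suggestion can be written out as a small job-script fragment. This is a hypothetical sketch, not from the original post: the node count, binary name `mdrun_mpi` (an MPI-enabled build), and output file names are assumptions. Note that mdrun's `-npme` option counts the *total* number of separate PME ranks, so "one PME rank per node" means passing the node count to `-npme`:

```shell
#!/bin/sh
# Sketch of the suggested layout: 4 MPI ranks per node
# (3 PP ranks, one per GPU, plus 1 separate PME rank),
# each running 4 OpenMP threads -> 4*4 = all 16 cores per node,
# which also makes "-pin on" behave as intended.
NNODES=2                         # hypothetical number of compute nodes
RANKS_PER_NODE=4                 # 3 PP + 1 PME
NP=$((NNODES * RANKS_PER_NODE))  # total MPI ranks to launch
NPME=$NNODES                     # one separate PME rank per node (total count)
echo "mpirun -np $NP mdrun_mpi -npme $NPME -ntomp 4 -pin on"
```

Running the script only prints the launch line; on a real cluster you would drop the `echo` and let the scheduler set `NNODES`.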
RE: [gmx-users] How to use multiple nodes, each with 2 CPUs and 3 GPUs
Hi,

You're using thread-MPI, but you should compile with MPI, and then start as many processes as there are GPUs in total.

Cheers,
Berk

> From: chris.ne...@mail.utoronto.ca
> To: gmx-users@gromacs.org
> Date: Wed, 24 Apr 2013 17:08:28 +
> Subject: [gmx-users] How to use multiple nodes, each with 2 CPUs and 3 GPUs
>
> Dear Users:
>
> I am having trouble getting any speedup by using more than one node,
> where each node has 2 8-core CPUs and 3 GPUs. I am using gromacs 4.6.1.
>
> I saw this post, indicating that the .log file output about number of gpus
> used might not be accurate:
> http://lists.gromacs.org/pipermail/gmx-users/2013-March/079802.html
>
> Still, I'm getting 21.2 ns/day on 1 node, 21.2 ns/day on 2 nodes, and 20.5
> ns/day on 3 nodes.
> Somehow I think I have not configured the mpirun -np and mdrun -ntomp
> correctly (although I have tried numerous combinations).
>
> On 1 node, I can just run mdrun without mpirun like this:
> http://lists.gromacs.org/pipermail/gmx-users/2013-March/079802.html
>
> For that run, the top of the .log file is:
> [... same build and domain-decomposition log as quoted in the previous message ...]
> Present hardware specification:
> Vendor: GenuineIntel
> Brand:  Intel(R) Xeon(R) CPU E5-2670 0 @ 2.60GHz
> Family:  6  Model: 45  Stepping: 7
> Features: aes apic avx clfsh cmov cx8 cx16 htt lahf_lm mmx msr nonstop_tsc
>   pcid pclmuldq pdcm pdpe1gb popcnt pse rdtscp sse2 sse3 sse4.1 sse4.2 ssse3
>   tdt x2apic
> Acceleration most likely to fit this hardware: AVX_256
> Acceleration selected at GROMACS compile time: AVX_256
>
>
> 3 GPUs detected:
>   #0: NVIDIA Tesla M2090, compute cap.: 2.0, ECC: yes, stat: compatible
>   #1: NVIDIA Tesla M2090, compute cap.: 2.0, ECC: yes, stat: compatible
>   #2: NVIDIA Tesla M2090, compute cap.: 2.0, ECC: yes, stat: compatible
>
> 3 GPUs auto-selected for this run: #0, #1, #2
>
> Will do PME sum in reciprocal space.
> ...
>
> ...
>
>      R E A L   C Y C L E   A N D   T I M E   A C C O U N T I N G
>
>  Computing:        Nodes  Th.  Count  Wall t (s)  G-Cycles    %
> ----------------------------------------------------------------
>  Domain decomp.        3    5   4380      23.714   922.574   6.7
>  DD comm. load         3    5   4379       0.054     2.114   0.0
>  DD comm. bounds       3    5    438
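[Editor's sketch] Berk's advice amounts to rebuilding GROMACS against a real MPI library instead of the built-in thread-MPI (which cannot cross node boundaries, so multi-node runs silently gain nothing). A minimal build sketch under assumptions: the source and install paths are hypothetical; `GMX_MPI` and `GMX_GPU` are the CMake switches used by GROMACS 4.6. No test is attached since this is a build-configuration fragment, not runnable in isolation.

```shell
# Sketch: rebuild GROMACS 4.6 against a real MPI library so that ranks
# can span nodes. Requires an MPI compiler wrapper (e.g. mpicc) on PATH.
# -DGMX_MPI=ON  -> MPI-enabled binary (conventionally installed as mdrun_mpi)
# -DGMX_GPU=ON  -> keep CUDA GPU support enabled
cd gromacs-4.6.1            # hypothetical source directory
mkdir -p build && cd build
cmake .. -DGMX_MPI=ON -DGMX_GPU=ON \
         -DCMAKE_INSTALL_PREFIX="$HOME/gromacs-4.6.1-mpi"
make -j 8 && make install
```

After this, launching one PP rank per GPU across nodes (e.g. `mpirun -np 3*Nnodes mdrun_mpi ...`) actually distributes work, unlike the thread-MPI build whose log above shows `MPI library: thread_mpi`.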