Hi,

You're using thread-MPI, which only runs within a single node. For multi-node runs you need to compile GROMACS against a real MPI library, and then start as many MPI processes as the total number of GPUs.
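On your setup (2 nodes, each with 16 cores and 3 GPUs) that means 6 MPI processes in total, 3 per node, with the remaining cores filled by OpenMP threads. A rough sketch of what the launch could look like, based on your own PBS request and mdrun options (untested; -npernode is OpenMPI-specific syntax, -gpu_id only makes the default GPU assignment explicit, and paths/file names are yours, so adjust to your MPI and cluster):

#PBS -l walltime=00:10:00,nodes=2:ppn=12:gpus=3:shared

cd $PBS_O_WORKDIR

# 6 ranks = 2 nodes x 3 GPUs; -npernode 3 places 3 ranks on each node (OpenMPI syntax)
# -ntomp 4 gives 3 x 4 = 12 threads per node, matching the ppn=12 request
# -npme 0 keeps every rank a PP rank, so ranks per node match GPUs per node
# -gpu_id 012 maps the 3 PP ranks on each node to GPUs 0, 1 and 2 (the default)
mpirun -np 6 -npernode 3 mdrun_mpi -ntomp 4 -npme 0 -gpu_id 012 \
       -notunepme -deffnm md3 -dlb yes -cpt 60 -maxh 0.1 -cpi md3.cpt

If the number of PP processes per node does not match the number of GPUs detected on that node, you get exactly the "mismatching number of PP MPI processes and GPUs per node" error shown in your logs.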
Cheers,

Berk

> From: chris.ne...@mail.utoronto.ca
> To: gmx-users@gromacs.org
> Date: Wed, 24 Apr 2013 17:08:28 +0000
> Subject: [gmx-users] How to use multiple nodes, each with 2 CPUs and 3 GPUs
>
> Dear Users:
>
> I am having trouble getting any speedup by using more than one node,
> where each node has two 8-core CPUs and 3 GPUs. I am using gromacs 4.6.1.
>
> I saw this post, indicating that the .log file output about the number of
> GPUs used might not be accurate:
> http://lists.gromacs.org/pipermail/gmx-users/2013-March/079802.html
>
> Still, I'm getting 21.2 ns/day on 1 node, 21.2 ns/day on 2 nodes, and
> 20.5 ns/day on 3 nodes. Somehow I think I have not configured the
> mpirun -np and mdrun -ntomp options correctly (although I have tried
> numerous combinations).
>
> On 1 node, I can just run mdrun without mpirun like this:
> http://lists.gromacs.org/pipermail/gmx-users/2013-March/079802.html
>
> For that run, the top of the .log file is:
>
> Log file opened on Wed Apr 24 11:36:53 2013
> Host: kfs179 pid: 59561 nodeid: 0 nnodes: 1
> Gromacs version: VERSION 4.6.1
> Precision: single
> Memory model: 64 bit
> MPI library: thread_mpi
> OpenMP support: enabled
> GPU support: enabled
> invsqrt routine: gmx_software_invsqrt(x)
> CPU acceleration: AVX_256
> FFT library: fftw-3.3.3-sse2
> Large file support: enabled
> RDTSCP usage: enabled
> Built on: Tue Apr 23 12:59:48 EDT 2013
> Built by: cne...@kfslogin2.nics.utk.edu [CMAKE]
> Build OS/arch: Linux 2.6.32-220.4.1.el6.x86_64 x86_64
> Build CPU vendor: GenuineIntel
> Build CPU brand: Intel(R) Xeon(R) CPU E5-2670 0 @ 2.60GHz
> Build CPU family: 6 Model: 45 Stepping: 7
> Build CPU features: aes apic avx clfsh cmov cx8 cx16 htt lahf_lm mmx msr nonstop_tsc pcid pclmuldq pdcm pdpe1gb popcnt pse rdtscp sse2 sse3 sse4.1 sse4.2 ssse3 tdt x2apic
> C compiler: /opt/intel/composer_xe_2011_sp1.11.339/bin/intel64/icc Intel icc (ICC) 12.1.5 20120612
> C compiler flags: -mavx -std=gnu99 -Wall -ip -funroll-all-loops -O3 -DNDEBUG
> C++ compiler: /opt/intel/composer_xe_2011_sp1.11.339/bin/intel64/icpc Intel icpc (ICC) 12.1.5 20120612
> C++ compiler flags: -mavx -Wall -ip -funroll-all-loops -O3 -DNDEBUG
> CUDA compiler: nvcc: NVIDIA (R) Cuda compiler driver; Copyright (c) 2005-2012 NVIDIA Corporation; Built on Thu_Apr__5_00:24:31_PDT_2012; Cuda compilation tools, release 4.2, V0.2.1221
> CUDA driver: 5.0
> CUDA runtime: 4.20
> ...
> <snip>
> ...
> Initializing Domain Decomposition on 3 nodes
> Dynamic load balancing: yes
> Will sort the charge groups at every domain (re)decomposition
> Initial maximum inter charge-group distances:
>     two-body bonded interactions: 0.431 nm, LJ-14, atoms 101 108
>   multi-body bonded interactions: 0.431 nm, Proper Dih., atoms 101 108
> Minimum cell size due to bonded interactions: 0.475 nm
> Maximum distance for 7 constraints, at 120 deg. angles, all-trans: 1.175 nm
> Estimated maximum distance required for P-LINCS: 1.175 nm
> This distance will limit the DD cell size, you can override this with -rcon
> Using 0 separate PME nodes, per user request
> Scaling the initial minimum size with 1/0.8 (option -dds) = 1.25
> Optimizing the DD grid for 3 cells with a minimum initial size of 1.469 nm
> The maximum allowed number of cells is: X 5 Y 5 Z 6
> Domain decomposition grid 3 x 1 x 1, separate PME nodes 0
> PME domain decomposition: 3 x 1 x 1
> Domain decomposition nodeid 0, coordinates 0 0 0
>
> Using 3 MPI threads
> Using 5 OpenMP threads per tMPI thread
>
> Detecting CPU-specific acceleration.
> Present hardware specification:
> Vendor: GenuineIntel
> Brand: Intel(R) Xeon(R) CPU E5-2670 0 @ 2.60GHz
> Family: 6 Model: 45 Stepping: 7
> Features: aes apic avx clfsh cmov cx8 cx16 htt lahf_lm mmx msr nonstop_tsc pcid pclmuldq pdcm pdpe1gb popcnt pse rdtscp sse2 sse3 sse4.1 sse4.2 ssse3 tdt x2apic
> Acceleration most likely to fit this hardware: AVX_256
> Acceleration selected at GROMACS compile time: AVX_256
>
>
> 3 GPUs detected:
>   #0: NVIDIA Tesla M2090, compute cap.: 2.0, ECC: yes, stat: compatible
>   #1: NVIDIA Tesla M2090, compute cap.: 2.0, ECC: yes, stat: compatible
>   #2: NVIDIA Tesla M2090, compute cap.: 2.0, ECC: yes, stat: compatible
>
> 3 GPUs auto-selected for this run: #0, #1, #2
>
> Will do PME sum in reciprocal space.
> ...
> <snip>
> ...
>
>      R E A L   C Y C L E   A N D   T I M E   A C C O U N T I N G
>
>  Computing:          Nodes   Th.     Count  Wall t (s)     G-Cycles       %
> -----------------------------------------------------------------------------
>  Domain decomp.          3    5       4380      23.714      922.574     6.7
>  DD comm. load           3    5       4379       0.054        2.114     0.0
>  DD comm. bounds         3    5       4381       0.056        2.193     0.0
>  Neighbor search         3    5       4380      11.325      440.581     3.2
>  Launch GPU ops.         3    5      87582       3.970      154.455     1.1
>  Comm. coord.            3    5      39411       2.522       98.132     0.7
>  Force                   3    5      43791      55.351     2153.409    15.5
>  Wait + Comm. F          3    5      43791       2.800      108.930     0.8
>  PME mesh                3    5      43791      97.377     3788.427    27.3
>  Wait GPU nonlocal       3    5      43791       0.027        1.046     0.0
>  Wait GPU local          3    5      43791       0.009        0.347     0.0
>  NB X/F buffer ops.      3    5     166404       3.426      133.276     1.0
>  Write traj.             3    5          2       0.028        1.087     0.0
>  Update                  3    5      43791      73.140     2845.491    20.5
>  Constraints             3    5      87582      65.339     2541.981    18.3
>  Comm. energies          3    5       4380       0.102        3.955     0.0
>  Rest                    3                      17.332      674.286     4.9
> -----------------------------------------------------------------------------
>  Total                   3                     356.572    13872.284   100.0
> -----------------------------------------------------------------------------
> -----------------------------------------------------------------------------
>  PME redist. X/F         3    5      87582      10.668      415.017     3.0
>  PME spread/gather       3    5      87582      44.767     1741.641    12.6
>  PME 3D-FFT              3    5      87582      26.979     1049.617     7.6
>  PME 3D-FFT Comm.        3    5      87582      11.085      431.273     3.1
>  PME solve               3    5      43791       3.705      144.139     1.0
> -----------------------------------------------------------------------------
>
>                Core t (s)   Wall t (s)        (%)
>        Time:     5341.770      356.572     1498.1
>                  (ns/day)    (hour/ns)
> Performance:       21.222        1.131
> Finished mdrun on node 0 Wed Apr 24 11:42:50 2013
>
>
>
> ###########################################################################################
>
> For my MPI run, I ran on a single node like this:
>
> mpirun -np 1 /nics/b/home/cneale/exe/gromacs-4.6.1_cuda/exec2/bin/mdrun_mpi -notunepme -deffnm md3 -dlb yes -npme -1 -cpt 60 -maxh 0.1 -cpi md3.cpt
>
> And the top of the log is the same, except:
>
> MPI library: MPI
> ...
> <snip>
> ...
> Using 1 MPI process
> Using 16 OpenMP threads
> ...
>
> To run on 2 nodes, I got errors if I did not specify mpirun -np:
>
> Using 24 MPI processes
> Using 1 OpenMP thread per MPI process
>
> Detecting CPU-specific acceleration.
> Present hardware specification:
> Vendor: GenuineIntel
> Brand: Intel(R) Xeon(R) CPU E5-2670 0 @ 2.60GHz
> Family: 6 Model: 45 Stepping: 7
> Features: aes apic avx clfsh cmov cx8 cx16 htt lahf_lm mmx msr nonstop_tsc pcid pclmuldq pdcm pdpe1gb popcnt pse rdtscp sse2 sse3 sse4.1 sse4.2 ssse3 tdt x2apic
> Acceleration most likely to fit this hardware: AVX_256
> Acceleration selected at GROMACS compile time: AVX_256
>
>
> 3 GPUs detected on host kfs179:
>   #0: NVIDIA Tesla M2090, compute cap.: 2.0, ECC: yes, stat: compatible
>   #1: NVIDIA Tesla M2090, compute cap.: 2.0, ECC: yes, stat: compatible
>   #2: NVIDIA Tesla M2090, compute cap.: 2.0, ECC: yes, stat: compatible
>
>
> -------------------------------------------------------
> Program mdrun_mpi, VERSION 4.6.1
> Source code file: /nics/b/home/cneale/exe/gromacs-4.6.1_cuda/source/src/gmxlib/gmx_detect_hardware.c, line: 356
>
> Fatal error:
> Incorrect launch configuration: mismatching number of PP MPI processes and GPUs per node.
> mdrun_mpi was started with 12 PP MPI processes per node, but only 3 GPUs were detected.
> For more information and tips for troubleshooting, please check the GROMACS
> website at http://www.gromacs.org/Documentation/Errors
> -------------------------------------------------------
>
> Thanx for Using GROMACS - Have a Nice Day
>
>
> #######################
>
> So I tried lots of different mpirun -np options, but only -np 2 and -np 3
> worked; i.e., it worked when gromacs did:
>
> Using 2 MPI processes
> Using 8 OpenMP threads per MPI process
>
> or
>
> Using 3 MPI processes
> Using 5 OpenMP threads per MPI process
>
> but -np 4, 6, and 32 all failed.
>
> For example, when I use mpirun -np 32, I get:
>
> Using 32 MPI processes
> Using 1 OpenMP thread per MPI process
>
> WARNING: On node 0: oversubscribing the available 16 logical CPU cores per node with 20 MPI processes.
>          This will cause considerable performance loss!
>
> Detecting CPU-specific acceleration.
> Present hardware specification:
> Vendor: GenuineIntel
> Brand: Intel(R) Xeon(R) CPU E5-2670 0 @ 2.60GHz
> Family: 6 Model: 45 Stepping: 7
> Features: aes apic avx clfsh cmov cx8 cx16 htt lahf_lm mmx msr nonstop_tsc pcid pclmuldq pdcm pdpe1gb popcnt pse rdtscp sse2 sse3 sse4.1 sse4.2 ssse3 tdt x2apic
> Acceleration most likely to fit this hardware: AVX_256
> Acceleration selected at GROMACS compile time: AVX_256
>
>
> 3 GPUs detected on host kfs179:
>   #0: NVIDIA Tesla M2090, compute cap.: 2.0, ECC: yes, stat: compatible
>   #1: NVIDIA Tesla M2090, compute cap.: 2.0, ECC: yes, stat: compatible
>   #2: NVIDIA Tesla M2090, compute cap.: 2.0, ECC: yes, stat: compatible
>
>
> -------------------------------------------------------
> Program mdrun_mpi, VERSION 4.6.1
> Source code file: /nics/b/home/cneale/exe/gromacs-4.6.1_cuda/source/src/gmxlib/gmx_detect_hardware.c, line: 356
>
> Fatal error:
> Incorrect launch configuration: mismatching number of PP MPI processes and GPUs per node.
> mdrun_mpi was started with 20 PP MPI processes per node, but only 3 GPUs were detected.
> For more information and tips for troubleshooting, please check the GROMACS
> website at http://www.gromacs.org/Documentation/Errors
> -------------------------------------------------------
>
>
> ###########
>
> All of this makes me think that only 1 node is being picked up. I suppose
> that it is possibly my fault with the submission, etc.,
> since this is a new cluster to me, but my PBS script asks for 2 nodes and
> showq reports that 2 nodes were allocated when it is running.
>
> #PBS -l walltime=00:10:00,nodes=2:ppn=12:gpus=3:shared
>
> $ showq | grep cneale
> 288686   cneale   Running   32   00:09:53   Wed Apr 24 12:42:10
>
>
> Thank you,
> Chris.