Hi,

I am working on a node with an Intel(R) Xeon(R) CPU E5-2630 v3 @ 2.40GHz (16 physical cores, 32 logical cores) and one NVIDIA GeForce GTX 980 Ti GPU. I am launching a series of 2 ns molecular dynamics simulations of a system of 60000 atoms. I tried various combinations of settings, but I obtained the best performance with the command

  gmx mdrun -deffnm md_LIG -cpt 1 -cpo restart1.cpt -pin on

which uses 32 OpenMP threads, 1 MPI thread, and the GPU. At the end of the .log file of the MD production run I get this message:

  NOTE: The GPU has >25% less load than the CPU. This imbalance causes performance loss.

I don't know how I can improve this balance, whether by decreasing the load on the CPU or increasing it on the GPU. Do you have any suggestions?
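For example, would it make sense to split the run over several thread-MPI ranks that share the GPU, along these lines? (Just a sketch of the kind of combinations I mean: -ntmpi/-ntomp set the thread-MPI rank and OpenMP thread counts, and -gpu_id with one digit per PP rank maps every rank onto GPU 0. I don't know whether these splits are actually sensible for this system.)

  gmx mdrun -deffnm md_LIG -cpt 1 -cpo restart1.cpt -pin on -ntmpi 2 -ntomp 16 -gpu_id 00
  gmx mdrun -deffnm md_LIG -cpt 1 -cpo restart1.cpt -pin on -ntmpi 4 -ntomp 8 -gpu_id 0000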
"gmx mdrun -deffnm md_LIG -cpt 1 -cpo restart1.cpt -pin on" which use 32 OpenMP threads, 1 MPI thread, and the GPU. At the end of the file.log of molecular dynamic production I obtain this message: "NOTE: The GPU has >25% less load than the CPU. This imbalance causes performance loss." I don't know how can improve the load on CPU more than this, or how I can decrease the load on GPU. Do you have any suggestions? Thank you in advance. Cheers, Davide Bonanni Initial and final part of LOG file here: Log file opened on Sun Jul 9 04:02:44 2017 Host: bigblue pid: 16777 rank ID: 0 number of ranks: 1 :-) GROMACS - gmx mdrun, VERSION 5.1.4 (-: GROMACS: gmx mdrun, VERSION 5.1.4 Executable: /usr/bin/gmx Data prefix: /usr/local/gromacs Command line: gmx mdrun -deffnm md_fluo_7 -cpt 1 -cpo restart1.cpt -pin on GROMACS version: VERSION 5.1.4 Precision: single Memory model: 64 bit MPI library: thread_mpi OpenMP support: enabled (GMX_OPENMP_MAX_THREADS = 32) GPU support: enabled OpenCL support: disabled invsqrt routine: gmx_software_invsqrt(x) SIMD instructions: AVX2_256 FFT library: fftw-3.3.4-sse2-avx RDTSCP usage: enabled C++11 compilation: disabled TNG support: enabled Tracing support: disabled Built on: Tue 8 Nov 12:26:14 CET 2016 Built by: root@bigblue [CMAKE] Build OS/arch: Linux 3.10.0-327.el7.x86_64 x86_64 Build CPU vendor: GenuineIntel Build CPU brand: Intel(R) Xeon(R) CPU E5-2630 v3 @ 2.40GHz Build CPU family: 6 Model: 63 Stepping: 2 Build CPU features: aes apic avx avx2 clfsh cmov cx8 cx16 f16c fma htt lahf_lm mmx msr nonstop_tsc pcid pclmuldq pdcm pdpe1gb popcnt pse rdrnd rdtscp sse2 sse3 sse4.1 sse4.2 ssse3 tdt x2apic C compiler: /bin/cc GNU 4.8.5 C compiler flags: -march=core-avx2 -Wextra -Wno-missing-field-initializers -Wno-sign-compare -Wpointer-arith -Wall -Wno-unused -Wunused-value -Wunused-parameter -O3 -DNDEBUG -funroll-all-loops -fexcess-precision=fast -Wno-array-bounds C++ compiler: /bin/c++ GNU 4.8.5 C++ compiler flags: -march=core-avx2 -Wextra -Wno-missing-field-initializers -Wpointer-arith -Wall -Wno-unused-function -O3 -DNDEBUG -funroll-all-loops -fexcess-precision=fast -Wno-array-bounds Boost version: 1.55.0 (internal) CUDA compiler: /usr/local/cuda/bin/nvcc nvcc: NVIDIA (R) Cuda compiler driver;Copyright (c) 2005-2016 NVIDIA Corporation;Built on Sun_Sep__4_22:14:01_CDT_2016;Cuda compilation tools, release 8.0, V8.0.44 CUDA compiler flags:-gencode;arch=compute_20,code=sm_20;-gencode;arch= compute_30,code=sm_30;-gencode;arch=compute_35,code= sm_35;-gencode;arch=compute_37,code=sm_37;-gencode;arch= compute_50,code=sm_50;-gencode;arch=compute_52,code= sm_52;-gencode;arch=compute_60,code=sm_60;-gencode;arch= compute_61,code=sm_61;-gencode;arch=compute_60,code= compute_60;-gencode;arch=compute_61,code=compute_61;-use_fast_math;; ;-march=core-avx2;-Wextra;-Wno-missing-field-initializers;-Wpointer-arith;- Wall;-Wno-unused-function;-O3;-DNDEBUG;-funroll-all-loops;- fexcess-precision=fast;-Wno-array-bounds; CUDA driver: 8.0 CUDA runtime: 8.0 Running on 1 node with total 16 cores, 32 logical cores, 1 compatible GPU Hardware detected: CPU info: Vendor: GenuineIntel Brand: Intel(R) Xeon(R) CPU E5-2630 v3 @ 2.40GHz Family: 6 model: 63 stepping: 2 CPU features: aes apic avx avx2 clfsh cmov cx8 cx16 f16c fma htt lahf_lm mmx msr nonstop_tsc pcid pclmuldq pdcm pdpe1gb popcnt pse rdrnd rdtscp sse2 sse3 sse4.1 sse4.2 ssse3 tdt x2apic SIMD instructions most likely to fit this hardware: AVX2_256 SIMD instructions selected at GROMACS compile time: AVX2_256 GPU info: Number of GPUs detected: 1 #0: 
    #0: NVIDIA GeForce GTX 980 Ti, compute cap.: 5.2, ECC: no, stat: compatible

Changing nstlist from 20 to 40, rlist from 1.2 to 1.2

Input Parameters:
   integrator                     = sd
   tinit                          = 0
   dt                             = 0.002
   nsteps                         = 1000000
   init-step                      = 0
   simulation-part                = 1
   comm-mode                      = Linear
   nstcomm                        = 100
   bd-fric                        = 0
   ld-seed                        = 57540858
   emtol                          = 10
   emstep                         = 0.01
   niter                          = 20
   fcstep                         = 0
   nstcgsteep                     = 1000
   nbfgscorr                      = 10
   rtpi                           = 0.05
   nstxout                        = 5000
   nstvout                        = 500
   nstfout                        = 0
   nstlog                         = 500
   nstcalcenergy                  = 100
   nstenergy                      = 1000
   nstxout-compressed             = 0
   compressed-x-precision         = 1000
   cutoff-scheme                  = Verlet
   nstlist                        = 40
   ns-type                        = Grid
   pbc                            = xyz
   periodic-molecules             = FALSE
   verlet-buffer-tolerance        = 0.005
   rlist                          = 1.2
   rlistlong                      = 1.2
   nstcalclr                      = 20
   coulombtype                    = PME
   coulomb-modifier               = Potential-shift
   rcoulomb-switch                = 0
   rcoulomb                       = 1.2
   epsilon-r                      = 1
   epsilon-rf                     = inf
   vdw-type                       = Cut-off
   vdw-modifier                   = Potential-switch
   rvdw-switch                    = 1
   rvdw                           = 1.2
   DispCorr                       = EnerPres
   table-extension                = 1
   fourierspacing                 = 0.12
   fourier-nx                     = 72
   fourier-ny                     = 72
   fourier-nz                     = 72
   pme-order                      = 6
   ewald-rtol                     = 1e-06
   ewald-rtol-lj                  = 0.001
   lj-pme-comb-rule               = Geometric
   ewald-geometry                 = 0
   epsilon-surface                = 0
   implicit-solvent               = No
   gb-algorithm                   = Still
   nstgbradii                     = 1
   rgbradii                       = 1
   gb-epsilon-solvent             = 80
   gb-saltconc                    = 0
   gb-obc-alpha                   = 1
   gb-obc-beta                    = 0.8
   gb-obc-gamma                   = 4.85
   gb-dielectric-offset           = 0.009
   sa-algorithm                   = Ace-approximation
   sa-surface-tension             = 2.05016
   tcoupl                         = No
   nsttcouple                     = -1
   nh-chain-length                = 0
   print-nose-hoover-chain-variables = FALSE
   pcoupl                         = Parrinello-Rahman
   pcoupltype                     = Isotropic
   nstpcouple                     = 20
   tau-p                          = 1

Using 1 MPI thread
Using 32 OpenMP threads

1 compatible GPU is present, with ID 0
1 GPU auto-selected for this run.
Mapping of GPU ID to the 1 PP rank in this node: 0

Will do PME sum in reciprocal space for electrostatic interactions.

++++ PLEASE READ AND CITE THE FOLLOWING REFERENCE ++++
U. Essmann, L. Perera, M. L. Berkowitz, T. Darden, H. Lee and L. G. Pedersen
A smooth particle mesh Ewald method
J. Chem. Phys. 103 (1995) pp. 8577-8592
-------- -------- --- Thank You --- -------- --------

Will do ordinary reciprocal space Ewald sum.
Using a Gaussian width (1/beta) of 0.34693 nm for Ewald
Cut-off's:   NS: 1.2   Coulomb: 1.2   LJ: 1.2
Long Range LJ corr.: <C6> 3.2003e-04
System total charge, top. A: -0.000 top. B: -0.000
Generated table with 1100 data points for Ewald. Tabscale = 500 points/nm
Generated table with 1100 data points for LJ6Switch. Tabscale = 500 points/nm
Generated table with 1100 data points for LJ12Switch. Tabscale = 500 points/nm
Generated table with 1100 data points for 1-4 COUL. Tabscale = 500 points/nm
Generated table with 1100 data points for 1-4 LJ6. Tabscale = 500 points/nm
Generated table with 1100 data points for 1-4 LJ12. Tabscale = 500 points/nm
Potential shift: LJ r^-12: 0.000e+00 r^-6: 0.000e+00, Ewald -1.000e-06
Initialized non-bonded Ewald correction tables, spacing: 9.71e-04 size: 1237

Using GPU 8x8 non-bonded kernels

NOTE: With GPUs, reporting energy group contributions is not supported

There are 39 atoms and 39 charges for free energy perturbation
Pinning threads with an auto-selected logical core stride of 1

Initializing LINear Constraint Solver
-------- -------- --- Thank You --- -------- --------

There are: 59559 Atoms
Initial temperature: 301.342 K

Started mdrun on rank 0 Sun Jul  9 04:02:47 2017

           Step           Time         Lambda
              0        0.00000        0.35000

..... ..... ..... ..... .....
M E G A - F L O P S   A C C O U N T I N G

 NB=Group-cutoff nonbonded kernels    NxN=N-by-N cluster Verlet kernels
 RF=Reaction-Field  VdW=Van der Waals  QSTab=quadratic-spline table
 W3=SPC/TIP3p  W4=TIP4p (single or pairs)
 V&F=Potential and force  V=Potential only  F=Force only

 Computing:                         M-Number         M-Flops  % Flops
-----------------------------------------------------------------------------
 NB Free energy kernel        7881861.469518     7881861.470      0.1
 Pair Search distance check    211801.978992     1906217.811      0.0
 NxN Ewald Elec. + LJ [F]    61644114.490880  5732902647.652     91.3
 NxN Ewald Elec. + LJ [V&F]    622729.312576    79086622.697      1.3
 1,4 nonbonded interactions     15157.138733     1364142.486      0.0
 Calc Weights                  178677.178677     6432378.432      0.1
 Spread Q Bspline            25729513.729488    51459027.459      0.8
 Gather F Bspline            25729513.729488   154377082.377      2.5
 3D-FFT                      27628393.815424   221027150.523      3.5
 Solve PME                      10366.046848      663426.998      0.0
 Shift-X                         1489.034559        8934.207      0.0
 Angles                         10513.850597     1766326.900      0.0
 Propers                        18191.018191     4165743.166      0.1
 Impropers                       1133.001133      235664.236      0.0
 Virial                          2980.259604       53644.673      0.0
 Update                         59559.059559     1846330.846      0.0
 Stop-CM                          595.649559        5956.496      0.0
 Calc-Ekin                       5956.019118      160812.516      0.0
 Lincs                          11610.011610      696600.697      0.0
 Lincs-Mat                     588728.588728     2354914.355      0.0
 Constraint-V                  130824.130824     1046593.047      0.0
 Constraint-Vir                  2980.409607       71529.831      0.0
 Settle                         35868.035868    11585375.585      0.2
-----------------------------------------------------------------------------
 Total                                        6281098984.459    100.0
-----------------------------------------------------------------------------


R E A L   C Y C L E   A N D   T I M E   A C C O U N T I N G

On 1 MPI rank, each using 32 OpenMP threads

 Computing:          Num   Num      Call    Wall time   Giga-Cycles
                     Ranks Threads  Count      (s)       total sum     %
-----------------------------------------------------------------------------
 Neighbor search        1   32      25001     170.606    13073.577    1.5
 Launch GPU ops.        1   32    1000001      97.251     7452.377    0.8
 Force                  1   32    1000001    2462.595   188709.029   21.0
 PME mesh               1   32    1000001    7214.132   552819.972   61.5
 Wait GPU local         1   32    1000001      22.963     1759.683    0.2
 NB X/F buffer ops.     1   32    1975001     303.888    23287.017    2.6
 Write traj.            1   32       2190      41.970     3216.155    0.4
 Update                 1   32    2000002     374.895    28728.243    3.2
 Constraints            1   32    2000002     718.184    55034.545    6.1
 Rest                                          315.793    24199.295    2.7
-----------------------------------------------------------------------------
 Total                                       11722.279   898279.893  100.0
-----------------------------------------------------------------------------
 Breakdown of PME mesh computation
-----------------------------------------------------------------------------
 PME spread/gather      1   32    4000004    5659.890   433718.207   48.3
 PME 3D-FFT             1   32    4000004    1447.568   110927.319   12.3
 PME solve Elec         1   32    2000002      85.838     6577.816    0.7
-----------------------------------------------------------------------------

 GPU timings
-----------------------------------------------------------------------------
 Computing:                     Count  Wall t (s)      ms/step       %
-----------------------------------------------------------------------------
 Pair list H2D                  25001      14.012        0.560      0.6
 X / q H2D                    1000001     171.474        0.171      7.7
 Nonbonded F kernel            970000    1852.997        1.910     82.8
 Nonbonded F+ene k.              5000      13.053        2.611      0.6
 Nonbonded F+prune k.           20000      47.018        2.351      2.1
 Nonbonded F+ene+prune k.        5001      15.825        3.164      0.7
 F D2H                        1000001     124.521        0.125      5.6
-----------------------------------------------------------------------------
 Total                                    2238.898        2.239    100.0
-----------------------------------------------------------------------------

Force evaluation time GPU/CPU: 2.239 ms/9.677 ms = 0.231
For optimal performance this ratio should be close to 1!

NOTE: The GPU has >25% less load than the CPU. This imbalance causes
      performance loss.

               Core t (s)   Wall t (s)        (%)
       Time:   374361.605    11722.279     3193.6
                         3h15:22
                 (ns/day)    (hour/ns)
Performance:       14.741        1.628
Finished mdrun on rank 0 Sun Jul  9 07:18:10 2017
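P.S. In case it is useful context: when comparing settings I time short test runs rather than the full 2 ns, roughly like this (a sketch only: bench.tpr is a placeholder name for a test input, and -nsteps/-resethway/-noconfout just keep the timing runs short and cheap):

  #!/bin/bash
  # Compare a few thread-MPI / OpenMP splits on short runs of the same .tpr.
  for ntmpi in 1 2 4; do
      ntomp=$((32 / ntmpi))                      # keep all 32 logical cores busy
      gpu_ids=$(printf '0%.0s' $(seq 1 $ntmpi))  # one '0' per PP rank -> all ranks on GPU 0
      gmx mdrun -s bench.tpr -pin on -ntmpi $ntmpi -ntomp $ntomp \
          -gpu_id $gpu_ids -nsteps 20000 -resethway -noconfout \
          -g bench_${ntmpi}x${ntomp}.log
  done
  grep Performance bench_*.log   # ns/day for each combination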