BS"D

In case anybody is interested, we have tested Gromacs on a Threadripper machine with two GPUs.
Hardware:
  Ryzen Threadripper 1950X 16-core CPU (multithreading on), with Corsair H100i V2 liquid cooling
  Asus Prime X399-A motherboard
  2 x GeForce GTX 1080 GPUs
  32 GB of 3200 MHz memory
  Samsung 850 Pro 512 GB SSD

OS, software:
  CentOS 7.4, with 4.14 kernel from ELRepo
  gcc 4.8.5 and gcc 5.5.0
  FFTW 3.3.7 (AVX2 enabled)
  CUDA 8
  Gromacs 2016.4
  Gromacs 2018-rc1 and final 2018
  Thread-MPI used throughout

I managed to compile gcc 5.5.0, but when I went to use it to compile Gromacs, the compiler could not recognise the hardware, although the native gcc 4.8.5 had no problem. In 2016.4 I was able to specify which SIMD set to use, so this was not an issue (a sketch of such a configure line is given at the end of this section). In any case there was very little difference between gcc 5.5.0 and 4.8.5, so I used 4.8.5 for 2018. Any ideas how to overcome this problem with 5.5.0?

————————————
Gromacs 2016.4
————————————

System: Protein/DNA complex with 438,397 atoms (including waters/ions), 100 ps NPT equilibration.

Allowing Gromacs to choose how to allocate the hardware (8 tMPI ranks, 4 threads per rank, both GPUs):
12.4 ns/day

Telling it to use 4 tMPI ranks, 8 threads per rank, both GPUs:
12.2 ns/day

Running on "real" cores only, 4 tMPI ranks, 4 threads per rank, 2 GPUs:
10.2 ns/day

1 tMPI rank, 16 threads per rank, *one* GPU ("half" the machine; pin on, but pinstride and pinoffset automatic):
10.6 ns/day

1 tMPI rank, 16 threads per rank, one GPU, with all pinning options set manually:
gmx mdrun -v -deffnm test.npt -s test.npt.tpr -pin on -ntomp 16 -ntmpi 1 -gpu_id 0 -pinoffset 0 -pinstride 2
12.3 ns/day

Presumably the gain here is because "-pinstride 2" caused the job to run on the "real" cores (1,2,3…15) and not on virtual cores. The automatic pinstride above used hardware-thread pairs [0,16], [1,17], [2,18]…[7,23], half of which are virtual, and so gave only 10.6 ns/day. (A sketch of running two such pinned jobs side by side, one per GPU, follows below.)

** So there was very little gain from the second GPU, and very little gain from multithreading. **

Using AVX_256 rather than AVX2_256 with the above command gave a small speed-up (although using AVX instead of AVX2 for FFTW made things worse):
12.5 ns/day
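Side note on the two-jobs idea: since a single-GPU run like the one above uses only half the machine, two such jobs should be able to run side by side, one per GPU, with non-overlapping pinning. A sketch of what that could look like (the offset values are in hardware-thread units and assume Gromacs's core-grouped thread ordering on this box; the pinning report in each log file would need to confirm the layout):

gmx mdrun -deffnm job1 -pin on -ntmpi 1 -ntomp 16 -gpu_id 0 -pinoffset 0  -pinstride 1
gmx mdrun -deffnm job2 -pin on -ntmpi 1 -ntomp 16 -gpu_id 1 -pinoffset 16 -pinstride 1

With -pinstride 1 each job packs both hardware threads of eight physical cores; given how little multithreading gained above, 8 threads per job with -pinstride 2 (one thread per physical core) might do just as well.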
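And for reference, "specifying which SIMD set to use" amounts to setting GMX_SIMD at configure time instead of letting cmake detect it. A minimal sketch (the gcc 5.5.0 install prefix and FFTW location here are assumptions, not the actual paths used):

CC=/opt/gcc-5.5.0/bin/gcc CXX=/opt/gcc-5.5.0/bin/g++ \
cmake .. -DGMX_SIMD=AVX2_256 -DGMX_GPU=on \
         -DCMAKE_PREFIX_PATH=/opt/fftw-3.3.7

This sidesteps Gromacs's hardware detection; if gcc 5.5.0 itself is the problem, passing explicit instruction-set flags (e.g. -mavx2 -mfma via CMAKE_C_FLAGS/CMAKE_CXX_FLAGS, since the gcc 5.x series predates -march=znver1) would be the analogous compiler-side workaround.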
To compare with an Intel Xeon Silver system:
  2 x Xeon Silver 4116 (2.1 GHz base clock, 12 cores each, no Hyper-Threading), 64 GB memory
  2 x GeForce GTX 1080s (as used in the above tests)
  gcc 4.8.5
  Gromacs 2016.4, with MPI, AVX_256 (compiled on an older GPU machine, and not by me)

2 MPI ranks, 12 threads per rank, 2 GPUs: 11.7 ns/day
4 MPI ranks, 6 threads per rank, 2 GPUs: 13.0 ns/day
6 MPI ranks, 4 threads per rank, 2 GPUs: 14.0 ns/day

To compare with the AMD machine, using the same number of cores:
1 MPI rank, 16 threads, 1 GPU: 11.2 ns/day

—————————————————
Gromacs 2018-rc1 (using gcc 4.8.5)
—————————————————

Using AVX_256, in 'classic' mode (not using a GPU for PME):
8 tMPI ranks, 4 threads per rank, 2 GPUs: 12.7 ns/day (a modest speed-up from 12.4 ns/day with 2016.4)

Now using a GPU for PME:
gmx mdrun -v -deffnm test.npt -s test.npt.tpr -pme gpu -pin on
(this used 1 tMPI rank, 32 OpenMP threads, 1 GPU): 14.9 ns/day

Forcing the program to use both GPUs:
gmx mdrun -v -deffnm test.npt -s test.npt.tpr -pme gpu -pin on -ntmpi 4 -npme 1 -gputasks 0011 -nb gpu
18.5 ns/day

Now with AVX2_128:
19.0 ns/day

Now forcing dynamic load balancing:
gmx mdrun -v -deffnm test.npt -s test.npt.tpr -pme gpu -pin on -ntmpi 4 -npme 1 -gputasks 0011 -nb gpu -dlb yes
20.1 ns/day

Now using more (8) tMPI ranks:
gmx mdrun -v -deffnm test.npt -s test.npt.tpr -pme gpu -pin on -ntmpi 8 -npme 1 -gputasks 00001111 -nb gpu -dlb yes
20.7 ns/day

And finally, the final 2018 release (AVX2_128) with the above command line:
20.9 ns/day

Here are the final lines from the log file:

Dynamic load balancing report:
 DLB was permanently on during the run per user request.
 Average load imbalance: 7.7%.
 The balanceable part of the MD step is 51%, load imbalance is computed from this.
 Part of the total run time spent waiting due to load imbalance: 3.9%.
 Steps where the load balancing was limited by -rdd, -rcon and/or -dds: X 0 %
 Average PME mesh/force load: 1.275
 Part of the total run time spent waiting due to PP/PME imbalance: 9.4 %

NOTE: 9.4 % performance was lost because the PME ranks
      had more work to do than the PP ranks.
      You might want to increase the number of PME ranks
      or increase the cut-off and the grid spacing.

     R E A L   C Y C L E   A N D   T I M E   A C C O U N T I N G

On 7 MPI ranks doing PP, each using 4 OpenMP threads, and
on 1 MPI rank doing PME, using 4 OpenMP threads

 Computing:             Num   Num      Call    Wall time    Giga-Cycles
                        Ranks Threads  Count      (s)       total sum    %
-----------------------------------------------------------------------------
 Domain decomp.            7    4        500      13.721      1306.196   2.9
 DD comm. load             7    4        500       0.366        34.875   0.1
 DD comm. bounds           7    4        500       0.036         3.445   0.0
 Send X to PME             7    4      50001       7.047       670.854   1.5
 Neighbor search           7    4        501       6.060       576.925   1.3
 Launch GPU ops.           7    4     100002      11.335      1079.049   2.4
 Comm. coord.              7    4      49500      38.156      3632.409   8.1
 Force                     7    4      50001      38.357      3651.633   8.1
 Wait + Comm. F            7    4      50001      42.186      4016.143   8.9
 PME mesh *                1    4      50001     205.801      2798.887   6.2
 PME wait for PP *                               207.924      2827.762   6.3
 Wait + Recv. PME F        7    4      50001      70.682      6728.928  14.9
 Wait PME GPU gather       7    4      50001      28.106      2675.682   5.9
 Wait GPU NB nonloc.       7    4      50001      20.463      1948.121   4.3
 Wait GPU NB local         7    4      50001      12.992      1236.845   2.7
 NB X/F buffer ops.        7    4     199002      24.396      2322.498   5.2
 Write traj.               7    4        501       9.081       864.479   1.9
 Update                    7    4      50001      24.809      2361.775   5.2
 Constraints               7    4      50001      79.806      7597.527  16.9
 Comm. energies            7    4       2501      11.961      1138.713   2.5
-----------------------------------------------------------------------------
 Total                                           413.769     45018.045 100.0
-----------------------------------------------------------------------------
(*) Note that with separate PME ranks, the walltime column actually sums to
    twice the total reported, but the cycle count total and % are correct.

               Core t (s)   Wall t (s)        (%)
       Time:    13240.604      413.769     3200.0
                 (ns/day)    (hour/ns)
Performance:       20.882        1.149
--
Harry M. Greenblatt
Associate Staff Scientist
Dept of Structural Biology
harry.greenbl...@weizmann.ac.il
Weizmann Institute of Science       Phone:     972-8-934-6340
234 Herzl St.                       Facsimile: 972-8-934-3361
Rehovot, 7610001 Israel