Hi,

Thanks for the report!
Did you build with or without hwloc? There is a known issue with the
automatic pin stride when not using hwloc: when half or fewer of the
hardware threads are launched, it leads to "compact" pinning (using half
of the cores with 2 threads/core) instead of using all cores with
1 thread/core, which is the default behavior on Intel.

When it comes to running "wide" ranks (i.e. many OpenMP threads per rank)
on Zen/Ryzen, things are not straightforward, so the default of 16/32
threads on 16 cores + 1 GPU is not great. If already running domain
decomposition, 4-8 threads/rank is generally best, but unfortunately this
will often not be better than simply using no DD and taking the hit of
the threading inefficiency.

A few more comments in-line.

On Wed, Jan 24, 2018 at 10:14 AM, Harry Mark Greenblatt <
harry.greenbl...@weizmann.ac.il> wrote:

> BS”D
>
> In case anybody is interested we have tested Gromacs on a Threadripper
> machine with two GPU’s.
>
> Hardware:
>
> Ryzen Threadripper 1950X 16 core CPU (multithreading on), with Corsair
> H100i V2 Liquid cooling
> Asus Prime X399-A M/B
> 2 X Geforce GTX 1080 GPU’s
> 32 GB of 3200MHz memory
> Samsung 850 Pro 512GB SSD
>
> OS, software:
>
> Centos 7.4, with 4.14 Kernel from ElRepo
> gcc 4.8.5 and gcc 5.5.0
> fftw 3.3.7 (AVX2 enabled)
> Cuda 8
> Gromacs 2016.4
> Gromacs 2018-rc1 and final 2018
> Using thread-MPI
>
> I managed to compile gcc 5.5.0, but when I went to use it to compile
> Gromacs, the compiler could not recognise the hardware, although the
> native gcc 4.8.5 had no problem. In 2016.4 I was able to specify which
> SIMD set to use, so this was not an issue. In any case there was very
> little difference between gcc 5.5.0 and 4.8.5, so I used 4.8.5 for 2018.
> Any ideas how to overcome this problem with 5.5.0?
>
> ————————————
> Gromacs 2016.4
> ————————————
>
> System: Protein/DNA complex, with 438,397 atoms (including waters/ions),
> 100 ps npt equilibration.
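To make the rank x thread trade-off above concrete, here is a quick sketch
(plain Python, not GROMACS code; the function name is made up for
illustration) that enumerates the possible tMPI-rank x OpenMP-thread splits
of a 16-core CPU; the 4-8 threads/rank range is the heuristic I mentioned:

```python
# Sketch: all ways to factor 16 cores into tMPI ranks x OpenMP threads.
def decompositions(n_cores):
    """Return all (ranks, threads) pairs with ranks * threads == n_cores."""
    return [(n_cores // t, t) for t in range(1, n_cores + 1) if n_cores % t == 0]

for ranks, threads in decompositions(16):
    # 4-8 threads/rank is the rule of thumb from the discussion above
    note = "  <- often a good range on Zen" if 4 <= threads <= 8 else ""
    print(f"{ranks:2d} ranks x {threads:2d} threads{note}")
```

Which of these actually wins depends on the system and on GPU task
assignment, so it has to be benchmarked.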
>
> Allowing Gromacs to choose how it wanted to allocate the hardware gave
>
> 8 tMPI ranks, 4 threads per rank, both GPU’s
>
> 12.4 ns/day
>
> When I told it to use 4 tMPI ranks, 8 threads per rank, both GPU’s
>
> 12.2 ns/day
>
> Running on “real” cores only
>
> 4 tMPI ranks, 4 threads per rank, 2 GPU’s
>
> 10.2 ns/day
>
> 1 tMPI rank, 16 threads per rank, *one* GPU (“half” the machine; pin on,
> but pinstride and pinoffset automatic)
>
> 10.6 ns/day
>
> 1 tMPI rank, 16 threads per rank, one GPU, and manually set all pinning
> options:
>
> gmx mdrun -v -deffnm test.npt -s test.npt.tpr -pin on -ntomp 16 -ntmpi 1
> -gpu_id 0 -pinoffset 0 -pinstride 2
>
> 12.3 ns/day
>
> Presumably, the gain here is because “pinstride 2” caused the job to run
> on the “real” (1,2,3…15) cores, and not on virtual cores. The automatic
> pinstride above used cores [0,16], [1,17], [2,18]…[7,23], half of which
> are virtual and so gave only 10.6 ns/day.
>
> ** So there was very little gain from the second GPU, and very little
> gain from multithreading. **
>
> Using AVX_256 and not AVX2_256 with the above command gave a small speed
> up (although using AVX instead of AVX2 for FFTW made things worse).
>
> 12.5 ns/day
>
> To compare with an Intel Xeon Silver system:
> 2 x Xeon Silver 4116 (2.1GHz base clock, 12 cores each, no
> Hyperthreading), 64GB memory
> 2 x Geforce 1080’s (as used in the above tests)
>
> gcc 4.8.5
> Gromacs 2016.4, with MPI, AVX_256 (compiled on an older GPU machine, and
> not by me).

AVX2_256 should give some benefit, but not a lot. (BTW, on Silver do not
use AVX_512; even on the Gold / 2-FMA Skylake-X parts, AVX2 tends to be
better when running with GPUs.)

> 2 MPI ranks, 12 threads each rank, 2 GPU’s
>
> 11.7 ns/day
>
> 4 MPI ranks, 6 threads each rank, 2 GPU’s
>
> 13.0 ns/day
>
> 6 MPI ranks, 4 threads each rank, 2 GPU’s
>
> 14.0 ns/day

Similar effect as noted wrt Ryzen.
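The stride effect you observed can be sketched numerically. The toy below
(plain Python, not what mdrun literally does) assumes the two SMT siblings
of each physical core are adjacent in the pinning order, and shows why
stride 1 packs threads onto half the cores while stride 2 spreads them over
all of them:

```python
# Illustration of pin stride on an SMT-enabled 16-core CPU.
# hw is the pinning order: (core0, sib0), (core0, sib1), (core1, sib0), ...
def pinned_cores(n_threads, stride, offset=0, n_cores=16, smt=2):
    hw = [(core, sib) for core in range(n_cores) for sib in range(smt)]
    return [hw[offset + i * stride] for i in range(n_threads)]

# stride 1 ("compact"): 16 threads land on only 8 physical cores, 2 per core
print(len({core for core, _ in pinned_cores(16, stride=1)}))  # -> 8
# stride 2: 16 threads land on all 16 physical cores, 1 per core
print(len({core for core, _ in pinned_cores(16, stride=2)}))  # -> 16
```

With hwloc available, mdrun can detect the topology and pick the right
stride itself; without it, setting -pinstride explicitly (as you did) is
the safe workaround.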
>
> To compare with the AMD machine, same number of cores
>
> 1 MPI rank, 16 threads, 1 GPU
>
> 11.2 ns/day

(Side-note: a bit of an apples-and-oranges comparison, isn't it?)

> —————————————————
> Gromacs 2018 rc1 (using gcc 4.8.5)
> —————————————————
>
> Using AVX_256

You should be using AVX2_128 or AVX2_256 on Zen! The former will be
fastest in CPU-only runs; the latter can often be (a bit) faster in
GPU-accelerated runs.

> In ‘classic’ mode, not using the GPU for PME
>
> 8 tMPI ranks, 4 threads per rank, 2 GPU’s
>
> 12.7 ns/day (modest speed up from 12.4 ns/day with 2016.4)
>
> Now use a GPU for PME
>
> gmx mdrun -v -deffnm test.npt -s test.npt.tpr -pme gpu -pin on
>
> used 1 tMPI rank, 32 OpenMP threads, 1 GPU
>
> 14.9 ns/day
>
> Forcing the program to use both GPU’s
>
> gmx mdrun -v -deffnm test.npt -s test.npt.tpr -pme gpu -pin on -ntmpi 4
> -npme 1 -gputasks 0011 -nb gpu
>
> 18.5 ns/day
>
> Now with AVX2_128
>
> 19.0 ns/day
>
> Now force Dynamic Load Balancing
>
> gmx mdrun -v -deffnm test.npt -s test.npt.tpr -pme gpu -pin on -ntmpi 4
> -npme 1 -gputasks 0011 -nb gpu -dlb yes

I would recommend *against* doing that unless you have concrete cases
where this is better than "-dlb auto" -- and if you have such cases,
please share them, as it is not expected behavior. (Note: DLB has
acquired the capability to observe when turning it on leads to a
performance drop, and it switches itself off automatically in such
cases!)

> 20.1 ns/day
>
> Now use more (8) tMPI ranks
>
> gmx mdrun -v -deffnm test.npt -s test.npt.tpr -pme gpu -pin on -ntmpi 8
> -npme 1 -gputasks 00001111 -nb gpu -dlb yes
>
> 20.7 ns/day

Good job! A few more tweaks for the ambitious:
- Note that PME does not need many threads, so you could further tune
this run to use, say, 1-2 threads for the PME rank and more for the rest
of the ranks. This might or might not give an improvement.
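To illustrate the PME-thread tweak: here is a back-of-the-envelope budget
for a run like the 4-rank one above if the single PME rank were trimmed to
2 threads (e.g. via mdrun's -ntomp_pme flag). The arithmetic is only a
sketch; whether it actually helps has to be measured.

```python
# Hypothetical thread budget: trim the PME rank and redistribute to PP ranks.
def thread_budget(hw_threads, n_ranks, n_pme, pme_threads):
    """Return (PP ranks, threads per PP rank) after reserving PME threads."""
    pp_ranks = n_ranks - n_pme
    pp_threads = (hw_threads - n_pme * pme_threads) // pp_ranks
    return pp_ranks, pp_threads

# 32 hw threads, -ntmpi 4 -npme 1, PME rank trimmed to 2 threads:
pp_ranks, pp_threads = thread_budget(hw_threads=32, n_ranks=4, n_pme=1, pme_threads=2)
print(f"{pp_ranks} PP ranks x {pp_threads} threads + 1 PME rank x 2 threads")
```

Compared with the default even split (4 ranks x 8 threads), this frees
threads for the PP ranks, which usually have more CPU work per step.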
> And finally, using 2018 (AVX2_128) with the above command line
>
> 20.9 ns/day
>
> Here are the final lines from the log file
>
> Dynamic load balancing report:
>  DLB was permanently on during the run per user request.
>  Average load imbalance: 7.7%.
>  The balanceable part of the MD step is 51%, load imbalance is computed
>  from this.
>  Part of the total run time spent waiting due to load imbalance: 3.9%.
>  Steps where the load balancing was limited by -rdd, -rcon and/or -dds:
>  X 0 %
>  Average PME mesh/force load: 1.275
>  Part of the total run time spent waiting due to PP/PME imbalance: 9.4 %
>
> NOTE: 9.4 % performance was lost because the PME ranks
>       had more work to do than the PP ranks.
>       You might want to increase the number of PME ranks
>       or increase the cut-off and the grid spacing.
>
>      R E A L   C Y C L E   A N D   T I M E   A C C O U N T I N G
>
> On 7 MPI ranks doing PP, each using 4 OpenMP threads, and
> on 1 MPI rank doing PME, using 4 OpenMP threads
>
>  Computing:            Num     Num     Call    Wall time   Giga-Cycles
>                        Ranks Threads  Count      (s)       total sum    %
> -------------------------------------------------------------------------
>  Domain decomp.          7      4       500      13.721     1306.196   2.9
>  DD comm. load           7      4       500       0.366       34.875   0.1
>  DD comm. bounds         7      4       500       0.036        3.445   0.0
>  Send X to PME           7      4     50001       7.047      670.854   1.5
>  Neighbor search         7      4       501       6.060      576.925   1.3
>  Launch GPU ops.         7      4    100002      11.335     1079.049   2.4
>  Comm. coord.            7      4     49500      38.156     3632.409   8.1
>  Force                   7      4     50001      38.357     3651.633   8.1
>  Wait + Comm. F          7      4     50001      42.186     4016.143   8.9
>  PME mesh *              1      4     50001     205.801     2798.887   6.2
>  PME wait for PP *                              207.924     2827.762   6.3
>  Wait + Recv. PME F      7      4     50001      70.682     6728.928  14.9
>  Wait PME GPU gather     7      4     50001      28.106     2675.682   5.9
>  Wait GPU NB nonloc.     7      4     50001      20.463     1948.121   4.3
>  Wait GPU NB local       7      4     50001      12.992     1236.845   2.7
>  NB X/F buffer ops.      7      4    199002      24.396     2322.498   5.2
>  Write traj.             7      4       501       9.081      864.479   1.9
>  Update                  7      4     50001      24.809     2361.775   5.2
>  Constraints             7      4     50001      79.806     7597.527  16.9
>  Comm. energies          7      4      2501      11.961     1138.713   2.5
> -------------------------------------------------------------------------
>  Total                                          413.769    45018.045 100.0
> -------------------------------------------------------------------------
> (*) Note that with separate PME ranks, the walltime column actually sums
> to twice the total reported, but the cycle count total and % are correct.
> -------------------------------------------------------------------------
>
>                Core t (s)   Wall t (s)        (%)
>        Time:    13240.604      413.769     3200.0
>                  (ns/day)    (hour/ns)
> Performance:       20.882        1.149
>
>
> --------------------------------------------------------------------
> Harry M. Greenblatt
> Associate Staff Scientist
> Dept of Structural Biology      harry.greenbl...@weizmann.ac.il
> Weizmann Institute of Science   Phone:     972-8-934-6340
> 234 Herzl St.                   Facsimile: 972-8-934-3361
> Rehovot, 7610001
> Israel
>
> --
> Gromacs Users mailing list
>
> * Please search the archive at http://www.gromacs.org/Support
> /Mailing_Lists/GMX-Users_List before posting!
>
> * Can't post? Read http://www.gromacs.org/Support/Mailing_Lists
>
> * For (un)subscribe requests visit
> https://maillist.sys.kth.se/mailman/listinfo/gromacs.org_gmx-users or
> send a mail to gmx-users-requ...@gromacs.org.