Hi,

Thanks for the report!
Did you build with or without hwloc? There is a known issue with the
automatic pin stride when not using hwloc: when half or fewer of the
hardware threads are launched, it leads to "compact" pinning (using half
of the cores with 2 threads/core) instead of using all cores with
1 thread/core, which is the default behavior on Intel.

When it comes to running "wide" ranks (i.e. many OpenMP threads per rank)
on Zen/Ryzen, things are not straightforward, so the default of 16/32
threads on 16 cores + 1 GPU is not great. If already running domain
decomposition, 4-8 threads/rank is generally best, but unfortunately this
will often not be better than simply using no DD and taking the hit of
the threading inefficiency.

A few more comments in-line.

On Wed, Jan 24, 2018 at 10:14 AM, Harry Mark Greenblatt <
harry.greenbl...@weizmann.ac.il> wrote:

> BS”D
>
> In case anybody is interested we have tested Gromacs on a Threadripper
> machine with two GPU’s.
>
> Hardware:
>
> Ryzen Threadripper 1950X 16 core CPU (multithreading on), with Corsair
> H100i V2 Liquid cooling
> Asus Prime X399-A M/B
> 2 X Geforce GTX 1080 GPU’s
> 32 GB of 3200MHz memory
> Samsung 850 Pro 512GB SSD
>
> OS, software:
>
> Centos 7.4, with 4.14 Kernel from ElRepo
> gcc 4.8.5 and gcc 5.5.0
> fftw 3.3.7 (AVX2 enabled)
> Cuda 8
> Gromacs 2016.4
> Gromacs 2018-rc1 and final 2018
> Using thread-MPI
>
> I managed to compile gcc 5.5.0, but when I went to use it to compile
> Gromacs, the compiler could not recognise the hardware, although the
> native gcc 4.8.5 had no problem. In 2016.4 I was able to specify which
> SIMD set to use, so this was not an issue. In any case there was very
> little difference between gcc 5.5.0 and 4.8.5, so I used 4.8.5 for 2018.
> Any ideas how to overcome this problem with 5.5.0?
>
> ————————————
> Gromacs 2016.4
> ————————————
>
> System: Protein/DNA complex, with 438,397 atoms (including waters/ions),
> 100 ps npt equilibration.
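To make the rank x thread trade-off above concrete, here is a quick sketch
(plain Python, not GROMACS code; the function name is made up for
illustration) that enumerates the possible tMPI-rank x OpenMP-thread splits
of a 16-core CPU; the 4-8 threads/rank range is the heuristic I mentioned:

```python
# Sketch: all ways to factor 16 cores into tMPI ranks x OpenMP threads.
def decompositions(n_cores):
    """Return all (ranks, threads) pairs with ranks * threads == n_cores."""
    return [(n_cores // t, t) for t in range(1, n_cores + 1) if n_cores % t == 0]

for ranks, threads in decompositions(16):
    # 4-8 threads/rank is the rule of thumb from the discussion above
    note = "  <- often a good range on Zen" if 4 <= threads <= 8 else ""
    print(f"{ranks:2d} ranks x {threads:2d} threads{note}")
```

Which of these actually wins depends on the system and on GPU task
assignment, so it has to be benchmarked.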
>
> Allowing Gromacs to choose how it wanted to allocate the hardware gave
>
> 8 tMPI ranks, 4 threads per rank, both GPU’s
>
> 12.4 ns/day
>
> When I told it to use 4 tMPI ranks, 8 threads per rank, both GPU’s
>
> 12.2 ns/day
>
> Running on “real” cores only
>
> 4 tMPI ranks, 4 threads per rank, 2 GPU’s
>
> 10.2 ns/day
>
> 1 tMPI rank, 16 threads per rank, *one* GPU (“half” the machine; pin on,
> but pinstride and pinoffset automatic)
>
> 10.6 ns/day
>
> 1 tMPI rank, 16 threads per rank, one GPU, and manually set all pinning
> options:
>
> gmx mdrun -v -deffnm test.npt -s test.npt.tpr -pin on -ntomp 16 -ntmpi 1
> -gpu_id 0 -pinoffset 0 -pinstride 2
>
> 12.3 ns/day
>
> Presumably, the gain here is because “pinstride 2” caused the job to run
> on the “real” (1,2,3…15) cores, and not on virtual cores. The automatic
> pinstride above used cores [0,16], [1,17], [2,18]…[7,23], half of which
> are virtual and so gave only 10.6 ns/day.
>
> ** So there was very little gain from the second GPU, and very little
> gain from multithreading. **
>
> Using AVX_256 and not AVX2_256 with the above command gave a small speed
> up (although using AVX instead of AVX2 for FFTW made things worse).
>
> 12.5 ns/day
>
> To compare with an Intel Xeon Silver system:
> 2 x Xeon Silver 4116 (2.1GHz base clock, 12 cores each, no
> Hyperthreading), 64GB memory
> 2 x Geforce 1080’s (as used in the above tests)
>
> gcc 4.8.5
> Gromacs 2016.4, with MPI, AVX_256 (compiled on an older GPU machine, and
> not by me).

AVX2_256 should give some benefit, but not a lot. (BTW, on Silver do not
use AVX_512; even on the Gold / 2-FMA Skylake-X parts, AVX2 tends to be
better when running with GPUs.)

> 2 MPI ranks, 12 threads each rank, 2 GPU’s
>
> 11.7 ns/day
>
> 4 MPI ranks, 6 threads each rank, 2 GPU’s
>
> 13.0 ns/day
>
> 6 MPI ranks, 4 threads each rank, 2 GPU’s
>
> 14.0 ns/day

Similar effect as noted wrt Ryzen.
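The stride effect you observed can be sketched numerically. The toy below
(plain Python, not what mdrun literally does) assumes the two SMT siblings
of each physical core are adjacent in the pinning order, and shows why
stride 1 packs threads onto half the cores while stride 2 spreads them over
all of them:

```python
# Illustration of pin stride on an SMT-enabled 16-core CPU.
# hw is the pinning order: (core0, sib0), (core0, sib1), (core1, sib0), ...
def pinned_cores(n_threads, stride, offset=0, n_cores=16, smt=2):
    hw = [(core, sib) for core in range(n_cores) for sib in range(smt)]
    return [hw[offset + i * stride] for i in range(n_threads)]

# stride 1 ("compact"): 16 threads land on only 8 physical cores, 2 per core
print(len({core for core, _ in pinned_cores(16, stride=1)}))  # -> 8
# stride 2: 16 threads land on all 16 physical cores, 1 per core
print(len({core for core, _ in pinned_cores(16, stride=2)}))  # -> 16
```

With hwloc available, mdrun can detect the topology and pick the right
stride itself; without it, setting -pinstride explicitly (as you did) is
the safe workaround.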
>
> To compare with the AMD machine, same number of cores
>
> 1 MPI rank, 16 threads, 1 GPU
>
> 11.2 ns/day

(Side-note: a bit of an apples-and-oranges comparison, isn't it?)

> —————————————————
> Gromacs 2018 rc1 (using gcc 4.8.5)
> —————————————————
>
> Using AVX_256

You should be using AVX2_128 or AVX2_256 on Zen! The former will be
fastest in CPU-only runs; the latter can often be (a bit) faster in
GPU-accelerated runs.

> In ‘classic’ mode, not using the GPU for PME
>
> 8 tMPI ranks, 4 threads per rank, 2 GPU’s
>
> 12.7 ns/day (modest speed up from 12.4 ns/day with 2016.4)
>
> Now use a GPU for PME
>
> gmx mdrun -v -deffnm test.npt -s test.npt.tpr -pme gpu -pin on
>
> used 1 tMPI rank, 32 OpenMP threads, 1 GPU
>
> 14.9 ns/day
>
> Forcing the program to use both GPU’s
>
> gmx mdrun -v -deffnm test.npt -s test.npt.tpr -pme gpu -pin on -ntmpi 4
> -npme 1 -gputasks 0011 -nb gpu
>
> 18.5 ns/day
>
> Now with AVX2_128
>
> 19.0 ns/day
>
> Now force Dynamic Load Balancing
>
> gmx mdrun -v -deffnm test.npt -s test.npt.tpr -pme gpu -pin on -ntmpi 4
> -npme 1 -gputasks 0011 -nb gpu -dlb yes

I would recommend *against* doing that unless you have concrete cases
where this is better than "-dlb auto" -- and if you have such cases,
please share them, as it is not expected behavior. (Note: DLB has
acquired the capability to observe when turning it on leads to a
performance drop, and it switches itself off automatically in such
cases!)

> 20.1 ns/day
>
> Now use more (8) tMPI ranks
>
> gmx mdrun -v -deffnm test.npt -s test.npt.tpr -pme gpu -pin on -ntmpi 8
> -npme 1 -gputasks 00001111 -nb gpu -dlb yes
>
> 20.7 ns/day

Good job! A few more tweaks for the ambitious:
- Note that PME does not need many threads, so you could further tune
this run to use, say, 1-2 threads for the PME rank and more for the rest
of the ranks. This might or might not give an improvement.
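To illustrate the PME-thread tweak: here is a back-of-the-envelope budget
for a run like the 4-rank one above if the single PME rank were trimmed to
2 threads (e.g. via mdrun's -ntomp_pme flag). The arithmetic is only a
sketch; whether it actually helps has to be measured.

```python
# Hypothetical thread budget: trim the PME rank and redistribute to PP ranks.
def thread_budget(hw_threads, n_ranks, n_pme, pme_threads):
    """Return (PP ranks, threads per PP rank) after reserving PME threads."""
    pp_ranks = n_ranks - n_pme
    pp_threads = (hw_threads - n_pme * pme_threads) // pp_ranks
    return pp_ranks, pp_threads

# 32 hw threads, -ntmpi 4 -npme 1, PME rank trimmed to 2 threads:
pp_ranks, pp_threads = thread_budget(hw_threads=32, n_ranks=4, n_pme=1, pme_threads=2)
print(f"{pp_ranks} PP ranks x {pp_threads} threads + 1 PME rank x 2 threads")
```

Compared with the default even split (4 ranks x 8 threads), this frees
threads for the PP ranks, which usually have more CPU work per step.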
> And finally, using 2018 (AVX2_128) with the above command line
>
> 20.9 ns/day
>
> Here are the final lines from the log file
>
> Dynamic load balancing report:
>  DLB was permanently on during the run per user request.
>  Average load imbalance: 7.7%.
>  The balanceable part of the MD step is 51%, load imbalance is computed
>  from this.
>  Part of the total run time spent waiting due to load imbalance: 3.9%.
>  Steps where the load balancing was limited by -rdd, -rcon and/or -dds:
>  X 0 %
>  Average PME mesh/force load: 1.275
>  Part of the total run time spent waiting due to PP/PME imbalance: 9.4 %
>
> NOTE: 9.4 % performance was lost because the PME ranks
>       had more work to do than the PP ranks.
>       You might want to increase the number of PME ranks
>       or increase the cut-off and the grid spacing.
>
>      R E A L   C Y C L E   A N D   T I M E   A C C O U N T I N G
>
> On 7 MPI ranks doing PP, each using 4 OpenMP threads, and
> on 1 MPI rank doing PME, using 4 OpenMP threads
>
>  Computing:            Num     Num     Call    Wall time   Giga-Cycles
>                        Ranks Threads  Count      (s)       total sum    %
> -------------------------------------------------------------------------
>  Domain decomp.          7      4       500      13.721     1306.196   2.9
>  DD comm. load           7      4       500       0.366       34.875   0.1
>  DD comm. bounds         7      4       500       0.036        3.445   0.0
>  Send X to PME           7      4     50001       7.047      670.854   1.5
>  Neighbor search         7      4       501       6.060      576.925   1.3
>  Launch GPU ops.         7      4    100002      11.335     1079.049   2.4
>  Comm. coord.            7      4     49500      38.156     3632.409   8.1
>  Force                   7      4     50001      38.357     3651.633   8.1
>  Wait + Comm. F          7      4     50001      42.186     4016.143   8.9
>  PME mesh *              1      4     50001     205.801     2798.887   6.2
>  PME wait for PP *                              207.924     2827.762   6.3
>  Wait + Recv. PME F      7      4     50001      70.682     6728.928  14.9
>  Wait PME GPU gather     7      4     50001      28.106     2675.682   5.9
>  Wait GPU NB nonloc.     7      4     50001      20.463     1948.121   4.3
>  Wait GPU NB local       7      4     50001      12.992     1236.845   2.7
>  NB X/F buffer ops.      7      4    199002      24.396     2322.498   5.2
>  Write traj.             7      4       501       9.081      864.479   1.9
>  Update                  7      4     50001      24.809     2361.775   5.2
>  Constraints             7      4     50001      79.806     7597.527  16.9
>  Comm. energies          7      4      2501      11.961     1138.713   2.5
> -------------------------------------------------------------------------
>  Total                                          413.769    45018.045 100.0
> -------------------------------------------------------------------------
> (*) Note that with separate PME ranks, the walltime column actually sums
> to twice the total reported, but the cycle count total and % are correct.
> -------------------------------------------------------------------------
>
>                Core t (s)   Wall t (s)        (%)
>        Time:    13240.604      413.769     3200.0
>                  (ns/day)    (hour/ns)
> Performance:       20.882        1.149
>
>
> --------------------------------------------------------------------
> Harry M. Greenblatt
> Associate Staff Scientist
> Dept of Structural Biology      harry.greenbl...@weizmann.ac.il
> Weizmann Institute of Science   Phone:     972-8-934-6340
> 234 Herzl St.                   Facsimile: 972-8-934-3361
> Rehovot, 7610001
> Israel
>
> --
> Gromacs Users mailing list
>
> * Please search the archive at http://www.gromacs.org/Support
> /Mailing_Lists/GMX-Users_List before posting!
>
> * Can't post? Read http://www.gromacs.org/Support/Mailing_Lists
>
> * For (un)subscribe requests visit
> https://maillist.sys.kth.se/mailman/listinfo/gromacs.org_gmx-users or
> send a mail to gmx-users-requ...@gromacs.org.