Hi Szilárd,

My comment was as follows:

1. We have been unable to pin threads (mdrun overrides -pin on) with or without stride set to 4.

2. We have basically accepted that POWER9/V100 performance (in ns/day) on identical systems is much worse than what we get from an Intel-based machine. Both jobs use -nt 32 and four GPUs.

3. We have not tried to reach out to IBM or take any other steps. As I said, we accepted crappy performance.

We would of course very much appreciate any further clarification from you -- without pointing to the specific issue (e.g., with the OS), I am unable to productively bug our sysadmins (the cluster is institution-wide and there are only two people who have to deal with all the users). I myself do not have admin privileges on this machine. The only reason I commented was that Jon revitalized my old thread. ;)

Alex

On 4/24/2020 2:31 PM, Szilárd Páll wrote:
On Fri, Apr 24, 2020 at 5:55 AM Alex <nedoma...@gmail.com> wrote:

Hi Kevin,

We've been having issues with Power9/V100 very similar to what Jon
described and basically settled on what I believe is sub-par
performance. We tested it on systems with ~30-50K particles and threads
simply cannot be pinned.

What does that mean, and how did you verify it?
The Linux kernel can in general set affinities on ppc64el, whether that is
requested by mdrun or by some other tool, so if you have observed that the
affinity mask is not respected (or does not change), that is more likely an
OS/setup issue, I'd think.

What is different compared to x86 is the hardware thread layout: on Power9
(with default Linux kernel configs) the hardware threads are exposed by the
OS as consecutive "CPUs" rather than strided by the number of cores.

I could try to sum up some details on how to set affinities (with mdrun or
with external tools), if that is of interest. However, it really should be
something that is possible to do even through the job scheduler (along with
a reasonable system configuration).
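
For example, something along these lines should work (a sketch only; the
thread count is illustrative and it assumes the default layout where the
four hardware threads of a core are consecutive logical CPUs):

# One rank, one OpenMP thread per physical core: with SMT4 exposed as
# consecutive logical CPUs, a pin stride of 4 skips the extra hardware
# threads of each core.
gmx mdrun -ntmpi 1 -ntomp 32 -pin on -pinoffset 0 -pinstride 4 -s topol.tpr

# Or restrict the process mask externally; taskset accepts a strided CPU
# list, and mdrun will then detect and respect the externally set affinity.
taskset -c 0-127:4 gmx mdrun -ntmpi 1 -ntomp 32 -s topol.tpr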


As far as Gromacs is concerned, our brand-new
Power9 nodes operate as if they were based on Intel CPUs (two threads
per core)

Unless the hardware thread layout has been changed, that is perhaps not the
case; see above.


and zero advantage of IBM parallelization is being taken.

You mean the SMT4?


Other users of the same nodes reported similar issues with other
software, which to me suggests that our sysadmins don't really know how
to set these nodes up.

At this point, if someone could figure out a clear set of build
instructions in combination with slurm/mdrun inputs, it would be very
much appreciated.

Have you checked the public documentation on ORNL's site? GROMACS has been
used successfully on Summit. What about IBM support?

--
Szilárd


Alex

On 4/23/2020 9:37 PM, Kevin Boyd wrote:
I'm not entirely sure how thread-pinning plays with slurm allocations on
partial nodes. I always reserve the entire node when I use thread
pinning,
and run a bunch of simulations by pinning to different cores manually,
rather than relying on slurm to divvy up resources for multiple jobs.

Looking at both logs now, a few more points:

* Your benchmarks are short enough that little things like cores spinning up their frequencies can matter. I suggest running longer (increase nsteps in the mdp or at the command line) and throwing away your initial benchmark data (see -resetstep and -resethway) to avoid artifacts.
* Your benchmark system is quite small for such a powerful GPU. I might expect better performance running multiple simulations per GPU if the workflows being run can rely on replicates, and a larger system would probably scale better to the V100.
* The P100/Intel system appears to have pinned cores properly; it's unclear whether that had a real impact on these benchmarks.
* It looks like the CPU-based computations were the primary contributors to the observed difference in performance. That should decrease or go away with increased core counts and with shifting the update phase to the GPU. It may be (I have no prior experience to indicate either way) that the Intel cores are simply better on a one-to-one basis than the Power cores. If you have 4-8 cores per simulation (try -ntomp 4 and increase the allocation of your Slurm job), the individual core performance shouldn't matter too much; right now you are almost certainly bottlenecked on one CPU core per GPU, which can emphasize performance differences (see the sketch after this list).
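
To make that concrete, here's a rough sketch combining those suggestions
(step counts, core counts, offsets, and the run1/run2 names are all
illustrative, and the offsets assume the Power9 SMT4 layout with
consecutive hardware threads):

# Longer benchmark, discarding the first half of the run from the timers:
gmx mdrun -s bench.tpr -nsteps 50000 -resethway -ntomp 4 -pin on -pinstride 4

# Two replicate runs sharing one V100, pinned to disjoint physical cores
# (offset 16 -> logical CPUs 16,20,24,28, i.e. cores 4-7 on that layout):
gmx mdrun -deffnm run1 -ntomp 4 -pin on -pinoffset 0  -pinstride 4 -gpu_id 0 &
gmx mdrun -deffnm run2 -ntomp 4 -pin on -pinoffset 16 -pinstride 4 -gpu_id 0 &
wait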

Kevin

On Thu, Apr 23, 2020 at 6:43 PM Jonathan D. Halverson <
halver...@princeton.edu> wrote:

Hi Kevin,

md.log for the Intel run is here:


https://github.com/jdh4/running_gromacs/blob/master/03_benchmarks/md.log.intel-broadwell-P100
Thanks for the info on constraints with 2020. I'll try some runs with
different values of -pinoffset for 2019.6.

I know a group at NIST is having the same or similar problems with
POWER9/V100.

Jon
________________________________
From: gromacs.org_gmx-users-boun...@maillist.sys.kth.se <
gromacs.org_gmx-users-boun...@maillist.sys.kth.se> on behalf of Kevin
Boyd <kevin.b...@uconn.edu>
Sent: Thursday, April 23, 2020 9:08 PM
To: gmx-us...@gromacs.org <gmx-us...@gromacs.org>
Subject: Re: [gmx-users] GROMACS performance issues on POWER9/V100 node

Hi,

Can you post the full log for the Intel system? I typically find the "Real
cycle and time accounting" section a better place to start debugging
performance issues.

A couple of quick notes; I'd need a side-by-side comparison for more useful
analysis, and these points may apply to both systems, so they may not be
your root cause:
* At first glance, your Power system spends 1/3 of its time in constraint calculation, which is unusual. This can be reduced in two ways: first, by adding more CPU cores; it doesn't make a ton of sense to benchmark on one core if your applications will use more. Second, if you upgrade to Gromacs 2020 you can probably put the constraint calculation on the GPU with -update gpu.
* The Power system log has this line:

https://github.com/jdh4/running_gromacs/blob/master/03_benchmarks/md.log#L304

indicating that threads perhaps were not actually pinned. Try adding -pinoffset 0 (or some other core) to specify where you want the process pinned (a sketch follows below).
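
For example, a sketch of the full mdrun line with GROMACS 2020 (the core
count, stride, and offset are illustrative; drop -update gpu on 2019.x,
where it does not exist):

gmx mdrun -s bench.tpr -ntmpi 1 -ntomp 4 \
    -nb gpu -pme gpu -update gpu \
    -pin on -pinoffset 0 -pinstride 4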

Kevin

On Thu, Apr 23, 2020 at 9:40 AM Jonathan D. Halverson <
halver...@princeton.edu> wrote:

We are finding that GROMACS (2018.x, 2019.x, 2020.x) performs worse on
an
IBM POWER9/V100 node versus an Intel Broadwell/P100. Both are running
RHEL
7.7 and Slurm 19.05.5. We have no concerns about GROMACS on our Intel
nodes. Everything below is about the POWER9/V100 node.

We ran the RNASE benchmark with 2019.6 with PME and cubic box using 1
CPU-core and 1 GPU (
ftp://ftp.gromacs.org/pub/benchmarks/rnase_bench_systems.tar.gz) and
found that the Broadwell/P100 gives 144 ns/day while POWER9/V100 gives
102
ns/day. The difference in performance is roughly the same for the
larger
ADH benchmark and when different numbers of CPU-cores are used. GROMACS
is
always underperforming on our POWER9/V100 nodes. We have pinning turned
on
(see Slurm script at bottom).

Below is our build procedure on the POWER9/V100 node:

version_gmx=2019.6
wget ftp://ftp.gromacs.org/pub/gromacs/gromacs-${version_gmx}.tar.gz
tar zxvf gromacs-${version_gmx}.tar.gz
cd gromacs-${version_gmx}
mkdir build && cd build

module purge
module load rh/devtoolset/7
module load cudatoolkit/10.2

OPTFLAGS="-Ofast -mcpu=power9 -mtune=power9 -mvsx -DNDEBUG"

cmake3 .. -DCMAKE_BUILD_TYPE=Release \
-DCMAKE_C_COMPILER=gcc -DCMAKE_C_FLAGS_RELEASE="$OPTFLAGS" \
-DCMAKE_CXX_COMPILER=g++ -DCMAKE_CXX_FLAGS_RELEASE="$OPTFLAGS" \
-DGMX_BUILD_MDRUN_ONLY=OFF -DGMX_MPI=OFF -DGMX_OPENMP=ON \
-DGMX_SIMD=IBM_VSX -DGMX_DOUBLE=OFF \
-DGMX_BUILD_OWN_FFTW=ON \
-DGMX_GPU=ON -DGMX_CUDA_TARGET_SM=70 \
-DGMX_OPENMP_MAX_THREADS=128 \
-DCMAKE_INSTALL_PREFIX=$HOME/.local \
-DGMX_COOL_QUOTES=OFF -DREGRESSIONTEST_DOWNLOAD=ON

make -j 10
make check
make install

45 of the 46 tests pass with the exception being HardwareUnitTests.
There
are several posts about this and apparently it is not a concern. The
full
build log is here:

https://github.com/jdh4/running_gromacs/blob/master/03_benchmarks/build.log
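
(As a sanity check on the resulting binary, gmx --version reports the SIMD
and GPU support compiled in; for this build it should list IBM_VSX and
CUDA. The grep pattern below is just for convenience:)

$HOME/.local/bin/gmx --version | grep -iE "simd|gpu|cuda"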

Here is more info about our POWER9/V100 node:

$ lscpu
Architecture:          ppc64le
Byte Order:            Little Endian
CPU(s):                128
On-line CPU(s) list:   0-127
Thread(s) per core:    4
Core(s) per socket:    16
Socket(s):             2
NUMA node(s):          6
Model:                 2.3 (pvr 004e 1203)
Model name:            POWER9, altivec supported
CPU max MHz:           3800.0000
CPU min MHz:           2300.0000

You can see that we have 4 hardware threads per physical core. If we use 4
hardware threads on the RNASE benchmark instead of 1, the performance goes
to 119 ns/day, which is still about 20% less than the Broadwell/P100 value.
When using multiple CPU-cores on the POWER9/V100 node there is significant
variation in the execution time of the code.
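
(The logical-CPU-to-core mapping can be confirmed with lscpu; with the
default kernel layout the four hardware threads of core 0 appear as CPUs
0-3:)

lscpu --extended=CPU,CORE,SOCKET,NODE | head -9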

There are four GPUs per POWER9/V100 node:

$ nvidia-smi -q
Driver Version                      : 440.33.01
CUDA Version                        : 10.2
GPU 00000004:04:00.0
      Product Name                    : Tesla V100-SXM2-32GB

The GPUs have been shown to perform as expected on other applications.




The following lines are found in md.log for the POWER9/V100 run:

Overriding thread affinity set outside gmx mdrun
Pinning threads with an auto-selected logical core stride of 128
NOTE: Thread affinity was not set.

The full md.log is available here:

https://github.com/jdh4/running_gromacs/blob/master/03_benchmarks/md.log



Below are the MegaFlops Accounting tables for the POWER9/V100 versus the
Broadwell/P100:

================ IBM POWER9 WITH NVIDIA V100 ================
Computing:                               M-Number         M-Flops  % Flops
-----------------------------------------------------------------------------
 Pair Search distance check             297.763872        2679.875     0.0
 NxN Ewald Elec. + LJ [F]            244214.215808    16118138.243    98.0
 NxN Ewald Elec. + LJ [V&F]            2483.565760      265741.536     1.6
 1,4 nonbonded interactions              53.415341        4807.381     0.0
 Shift-X                                  3.029040          18.174     0.0
 Angles                                  37.043704        6223.342     0.0
 Propers                                 55.825582       12784.058     0.1
 Impropers                                4.220422         877.848     0.0
 Virial                                   2.432585          43.787     0.0
 Stop-CM                                  2.452080          24.521     0.0
 Calc-Ekin                               48.128080        1299.458     0.0
 Lincs                                   20.536159        1232.170     0.0
 Lincs-Mat                              444.613344        1778.453     0.0
 Constraint-V                           261.192228        2089.538     0.0
 Constraint-Vir                           2.430161          58.324     0.0
 Settle                                  73.382008       23702.389     0.1
-----------------------------------------------------------------------------
 Total                                                16441499.096   100.0
-----------------------------------------------------------------------------

================ INTEL BROADWELL WITH NVIDIA P100 ================
Computing:                               M-Number         M-Flops  % Flops
-----------------------------------------------------------------------------
 Pair Search distance check             271.334272        2442.008     0.0
 NxN Ewald Elec. + LJ [F]            191599.850112    12645590.107    98.0
 NxN Ewald Elec. + LJ [V&F]            1946.866432      208314.708     1.6
 1,4 nonbonded interactions              53.415341        4807.381     0.0
 Shift-X                                  3.029040          18.174     0.0
 Bonds                                   10.541054         621.922     0.0
 Angles                                  37.043704        6223.342     0.0
 Propers                                 55.825582       12784.058     0.1
 Impropers                                4.220422         877.848     0.0
 Virial                                   2.432585          43.787     0.0
 Stop-CM                                  2.452080          24.521     0.0
 Calc-Ekin                               48.128080        1299.458     0.0
 Lincs                                    9.992997         599.580     0.0
 Lincs-Mat                               50.775228         203.101     0.0
 Constraint-V                           240.108012        1920.864     0.0
 Constraint-Vir                           2.323707          55.769     0.0
 Settle                                  73.382008       23702.389     0.2
-----------------------------------------------------------------------------
 Total                                                12909529.017   100.0
-----------------------------------------------------------------------------
Some of the rows are identical between the two tables above. The largest
difference is observed for the "NxN Ewald Elec. + LJ [F]" row.



Here is our Slurm script:

#!/bin/bash
#SBATCH --job-name=gmx           # create a short name for your job
#SBATCH --nodes=1                # node count
#SBATCH --ntasks=1               # total number of tasks across all nodes
#SBATCH --cpus-per-task=1        # cpu-cores per task (>1 if multi-threaded tasks)
#SBATCH --mem=4G                 # memory per node (4G per cpu-core is default)
#SBATCH --time=00:10:00          # total run time limit (HH:MM:SS)
#SBATCH --gres=gpu:1             # number of gpus per node

module purge
module load cudatoolkit/10.2

BCH=../rnase_cubic
gmx grompp -f $BCH/pme_verlet.mdp -c $BCH/conf.gro -p $BCH/topol.top -o bench.tpr
gmx mdrun -pin on -ntmpi $SLURM_NTASKS -ntomp $SLURM_CPUS_PER_TASK -s bench.tpr



How do we get optimal performance out of GROMACS on our POWER9/V100
nodes?
Jon
--
Gromacs Users mailing list

* Please search the archive at 
http://www.gromacs.org/Support/Mailing_Lists/GMX-Users_List before posting!

* Can't post? Read http://www.gromacs.org/Support/Mailing_Lists

* For (un)subscribe requests visit
https://maillist.sys.kth.se/mailman/listinfo/gromacs.org_gmx-users or send a 
mail to gmx-users-requ...@gromacs.org.
