Re: [QE-users] Sub optimal performance on 32 core AMD machine

Tobias Klöffel Tue, 17 Nov 2020 11:06:41 -0800

Dear all,
Just some quick remarks:

OpenMP in QE is added on top of MPI, so with just one 32- or 24-core machine it is more or less useless to use OpenMP at all. This is true for almost any code with hybrid parallelization.
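To make "which settings?" concrete, here is a minimal sketch of how the possible MPI-rank / OpenMP-thread splits of a fixed core count could be enumerated for a scaling test. The input file name `scf.in`, the helper names, and the assumption that `mpirun` and `pw.x` are on the PATH are placeholders, not something taken from this thread. The script only prints the command lines; it does not run anything.

```python
import os
import shlex


def decompositions(total_cores):
    """Yield (mpi_ranks, omp_threads) pairs whose product equals total_cores."""
    for ranks in range(1, total_cores + 1):
        if total_cores % ranks == 0:
            yield ranks, total_cores // ranks


def build_run(ranks, threads, input_file="scf.in", exe="pw.x"):
    """Return (env, argv) for one hybrid run.

    input_file and exe are hypothetical placeholders for the real job.
    """
    env = dict(os.environ, OMP_NUM_THREADS=str(threads))
    argv = ["mpirun", "-np", str(ranks), exe, "-in", input_file]
    return env, argv


if __name__ == "__main__":
    # 24 matches the Threadripper 3960X mentioned in the thread; use 32 for the EPYC box.
    for ranks, threads in decompositions(24):
        env, argv = build_run(ranks, threads)
        print(f"OMP_NUM_THREADS={env['OMP_NUM_THREADS']} {shlex.join(argv)}")
```

Reporting the wall time for each printed line, together with the pure-MPI case (threads = 1), would show directly whether the OpenMP layer helps at all on a single node.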

You say you have played around with OpenMPI/OpenMP settings but nothing changed. Which settings?

If you don't share your system, or at least the computationally relevant parameters (FFT grid size, number of plane-wave coefficients, type of pseudopotentials, type of calculation, etc.), nobody can guess what kind of performance you may or may not expect. E.g. it may also be that your simulations are just too small: if they run for just seconds on a quad-core machine, nothing will change going to more cores.
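One way to collect those parameters is to pull them out of the header of a pw.x output file. The sketch below assumes the header contains lines shaped like the ones in `SAMPLE`; the exact wording can differ between QE versions, and the numbers in `SAMPLE` are made up for illustration, not taken from anyone's run in this thread.

```python
import re

# Hypothetical excerpt; line layout assumed from typical pw.x output, values invented.
SAMPLE = """\
     number of atoms/cell      =          100
     kinetic-energy cutoff     =      60.0000  Ry
     number of k points=     4
     Dense  grid:   250000 G-vectors     FFT dimensions: (  96, 120, 144)
"""

PATTERNS = {
    "atoms": r"number of atoms/cell\s*=\s*(\d+)",
    "ecutwfc_ry": r"kinetic-energy cutoff\s*=\s*([\d.]+)\s*Ry",
    "k_points": r"number of k points\s*=\s*(\d+)",
    "g_vectors": r"Dense\s+grid:\s*(\d+)\s+G-vectors",
    "fft_grid": r"FFT dimensions:\s*\(\s*(\d+),\s*(\d+),\s*(\d+)\s*\)",
}


def summarize(text):
    """Return the matched fields as strings (or a tuple of strings for the FFT grid)."""
    found = {}
    for key, pattern in PATTERNS.items():
        match = re.search(pattern, text)
        if match:
            groups = match.groups()
            found[key] = groups if len(groups) > 1 else groups[0]
    return found


if __name__ == "__main__":
    for key, value in summarize(SAMPLE).items():
        print(f"{key}: {value}")
```

Fields that are not found are simply left out of the result, so a missing key is a hint that the pattern needs adjusting for the QE version in use.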

The debug environment switch for MKL depends on the version of MKL itself. More recent versions need more 'sophisticated' workarounds, but if you google you will find them, too.
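Whether the switch does anything for a given MKL version is easiest to check by timing the same job with and without it. A minimal sketch, assuming the variable in question is the `MKL_DEBUG_CPU_TYPE=5` setting quoted further down in the thread; the command in the example is a do-nothing placeholder that should be swapped for the real MKL-linked run.

```python
import os
import subprocess
import sys
import time


def timed_run(argv, extra_env=None):
    """Run argv once and return wall-clock seconds.

    extra_env entries are layered on top of the current environment.
    """
    env = dict(os.environ)
    env.update(extra_env or {})
    start = time.perf_counter()
    subprocess.run(argv, env=env, check=True)
    return time.perf_counter() - start


if __name__ == "__main__":
    # Placeholder workload; replace with the real command line, e.g. an mpirun call.
    cmd = [sys.executable, "-c", "pass"]
    plain = timed_run(cmd)
    tweaked = timed_run(cmd, {"MKL_DEBUG_CPU_TYPE": "5"})
    print(f"without switch: {plain:.3f} s, with switch: {tweaked:.3f} s")
```

If the two timings are identical within noise, the installed MKL most likely ignores the variable and a different workaround is needed.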

Last point: if you don't specify exactly how your workstations are configured, nobody knows what you are talking about. The same goes for compilers etc.; they all carry a version number.
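A short script can gather most of that information for pasting into a post. This is only a sketch: the list of tools queried (`gcc`, `gfortran`, `mpirun`) is an assumption about a typical GNU/Open MPI setup and should be adapted to the compilers and MPI actually used.

```python
import os
import platform
import shutil
import subprocess


def first_line(argv):
    """Return the first output line of a command, or None if it is missing or fails."""
    if shutil.which(argv[0]) is None:
        return None
    try:
        out = subprocess.run(argv, capture_output=True, text=True, timeout=10)
    except (OSError, subprocess.TimeoutExpired):
        return None
    lines = (out.stdout or out.stderr).strip().splitlines()
    return lines[0] if lines else None


def system_report():
    """Collect OS, CPU count, and tool versions into a dictionary."""
    return {
        "os": platform.platform(),
        "machine": platform.machine(),
        "logical_cpus": os.cpu_count(),
        "gcc": first_line(["gcc", "--version"]),
        "gfortran": first_line(["gfortran", "--version"]),
        "mpirun": first_line(["mpirun", "--version"]),
    }


if __name__ == "__main__":
    for key, value in system_report().items():
        print(f"{key}: {value}")
```

Tools that are not installed show up as `None`, which is itself useful information for whoever reads the report.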


best,
Tobias

On 11/17/20 7:24 PM, Pamela Whitfield wrote:

Michal
I have a very similar use-case and looked into many of the same issues when I got my Threadripper 3960X system at the beginning of the year to supplement my old dual-Xeon setup. In the past few days I've been revisiting compilation as I got hold of a Quadro GV100 for GPU acceleration of my optimizations.
Basically it seems as though builds targeting Zen2 either can't handle combined MPI and OpenMP at all, or do so poorly even when they run.
Best performance for pw.x on v6.5 (I've been playing with GIPAW and there's no 6.6-compatible version yet) has been with a simple gcc/OpenMPI compilation without OpenMP threading and with about 20 MPI processes on my 24-core CPU. Compiling with GCC or the PGI compiler made little difference, although only the more recent PGI compilers will have Zen2 optimization.
I get little benefit from Intel MKL over openblas/lapack/fftw3, even with the debug tweaks, etc.
Puget Systems numbers with other programs suggest that OpenMP-only performs better than OpenMPI on Threadripper, but I find the opposite with QE.
I did try disabling hyperthreading in the BIOS but that made no difference to the performance.
GPU compilation really shows the issue with MPI/OpenMP clashing. With the Xeons I could compile code with MKL that would run well on a Quadro K6000 while offloading to the CPU with MPI when needed. It could still be a compiler issue (have to use PGI with the GPU version) but it just doesn't work with the 3960X, and some things don't thread well with pure OpenMP (e.g. dftd3 versus dftd2), so I'll still need to use separately compiled versions of 6.5 for different problems.
BTW with a dual-CPU system you may benefit from pinning threads to particular CPUs - it works on the dual Xeon in any case. My Threadripper balances the load across the cores in a pretty dynamic manner and that's on a single socket.
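For a pure-MPI run, pinning could look roughly like the sketch below. It assumes Open MPI's `mpirun` (the `--map-by`, `--bind-to` and `--report-bindings` options are Open MPI specific; other MPI launchers use different flags), and `pw.x` / `scf.in` are placeholder names. The function only builds the environment and argument list so the resulting command can be inspected before it is run.

```python
import os
import shlex


def pinned_mpi_run(ranks, exe="pw.x", input_file="scf.in"):
    """Return (env, argv) for a pure-MPI run with one rank pinned per core.

    Ranks are distributed round-robin over the sockets and each is bound to
    a single core; OpenMP is limited to one thread per rank.
    """
    env = dict(os.environ, OMP_NUM_THREADS="1")
    argv = [
        "mpirun", "-np", str(ranks),
        "--map-by", "socket",   # spread ranks across both CPUs
        "--bind-to", "core",    # keep each rank on its own core
        "--report-bindings",    # print where each rank ended up
        exe, "-in", input_file,
    ]
    return env, argv


if __name__ == "__main__":
    env, argv = pinned_mpi_run(16)
    print(f"OMP_NUM_THREADS={env['OMP_NUM_THREADS']} {shlex.join(argv)}")
```

The binding report at startup makes it easy to confirm that the ranks really are split evenly between the two sockets.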
Best regards
Pam Whitfield

Independent Consultant





Message: 1
Date: Mon, 16 Nov 2020 15:19:04 +0100
From: Michal Husak <michal.hu...@vscht.cz>
To: Quantum ESPRESSO users Forum <users@lists.quantum-espresso.org>
Subject: [QE-users] Sub optimal performance on 32 core AMD machine

I purchased a new PC with 2x 16-core AMD EPYC processors: 64
logical cores with hyper-threading ...
I was hoping my QM programs (Quantum Espresso, CASTEP) would run faster on the new
system than on my old 4-core Intel i7 machine (8 years old) ....

To my great surprise, the opposite is almost true :-(.
My main task is scf and geometry optimization of medium-sized organic
molecular crystals (about 100 C, H, N atoms per unit cell) ...

I was playing with OpenMPI/OpenMP setup changes ...
I was playing with the secret MKL_DEBUG_CPU_TYPE=5 parameter
(the workaround for Intel MKL-compiled code running slowly on AMD) ...

Nothing helps; the best speed is obtained when I use only 4 cores
(OpenMPI or OpenMP - results are similar) ...
Using 16 or 32 cores gives almost no benefit ...
The CPU load for runs on 1/4/8/16/32 cores corresponds to the number of CPUs
set, i.e. they do try to do something ...

Any idea what I should check or try to optimize?

Maybe the bottleneck is memory access, not CPU power (I have 128
GB of RAM, almost unused)?

Michal Husak

UCT Prague



_______________________________________________
Quantum ESPRESSO is supported by MaX (www.max-centre.eu)
users mailing list users@lists.quantum-espresso.org
https://lists.quantum-espresso.org/mailman/listinfo/users



--
M.Sc. Tobias Klöffel
=======================================================
HPC (High Performance Computing) group
Erlangen Regional Computing Center(RRZE)
Friedrich-Alexander-Universität Erlangen-Nürnberg
Martensstr. 1
91058 Erlangen

Room: 1.133
Phone: +49 (0) 9131 / 85 - 20101

=======================================================

E-mail: tobias.kloef...@fau.de

