Hi folks, I have been trying to use the GPU-accelerated version of pw.x from QE 7.0
lately, and have noticed that it is significantly (roughly 10x) slower than the CPU
version. The GPU nodes I use have one AMD EPYC 7763 processor (64 cores, 128
threads) and four NVIDIA A100 GPUs (40 GB each), and the CPU nodes have two AMD
EPYC 7763 processors. The time reports from runs on identical input files are below
(GPU first, then CPU):
GPU Version:
init_run : 14.17s CPU 19.29s WALL ( 1 calls)
electrons : 1352.63s CPU 1498.17s WALL ( 19 calls)
update_pot : 144.15s CPU 158.77s WALL ( 18 calls)
forces : 144.74s CPU 158.92s WALL ( 19 calls)
Called by init_run:
wfcinit : 0.14s CPU 2.10s WALL ( 1 calls)
2.10s GPU ( 1 calls)
potinit : 12.83s CPU 13.78s WALL ( 1 calls)
hinit0 : 0.29s CPU 0.35s WALL ( 1 calls)
Called by electrons:
c_bands : 30.64s CPU 38.04s WALL ( 173 calls)
sum_band : 36.93s CPU 40.47s WALL ( 173 calls)
v_of_rho : 1396.71s CPU 1540.48s WALL ( 185 calls)
newd : 13.67s CPU 20.30s WALL ( 185 calls)
9.04s GPU ( 167 calls)
mix_rho : 26.02s CPU 27.31s WALL ( 173 calls)
vdW_kernel : 4.99s CPU 5.01s WALL ( 1 calls)
Called by c_bands:
init_us_2 : 0.24s CPU 0.39s WALL ( 347 calls)
init_us_2:gp : 0.23s CPU 0.38s WALL ( 347 calls)
regterg : 29.53s CPU 36.07s WALL ( 173 calls)
Called by *egterg:
rdiaghg : 0.61s CPU 1.74s WALL ( 585 calls)
1.72s GPU ( 585 calls)
h_psi : 26.71s CPU 33.73s WALL ( 611 calls)
33.69s GPU ( 611 calls)
s_psi : 0.08s CPU 0.16s WALL ( 611 calls)
0.14s GPU ( 611 calls)
g_psi : 0.00s CPU 0.04s WALL ( 437 calls)
0.04s GPU ( 437 calls)
Called by h_psi:
h_psi:calbec : 0.27s CPU 0.32s WALL ( 611 calls)
0.32s GPU ( 611 calls)
vloc_psi : 26.11s CPU 33.04s WALL ( 611 calls)
33.02s GPU ( 611 calls)
add_vuspsi : 0.06s CPU 0.14s WALL ( 611 calls)
0.13s GPU ( 611 calls)
General routines
calbec : 0.32s CPU 0.37s WALL ( 860 calls)
fft : 778.93s CPU 892.58s WALL ( 12061 calls)
13.39s GPU ( 1263 calls)
ffts : 12.40s CPU 12.96s WALL ( 173 calls)
fftw : 30.44s CPU 39.53s WALL ( 3992 calls)
38.80s GPU ( 3992 calls)
Parallel routines
PWSCF : 27m46.53s CPU 30m49.28s WALL
CPU Version:
init_run : 2.35s CPU 2.79s WALL ( 1 calls)
electrons : 99.04s CPU 142.56s WALL ( 19 calls)
update_pot : 9.01s CPU 13.47s WALL ( 18 calls)
forces : 9.89s CPU 14.35s WALL ( 19 calls)
Called by init_run:
wfcinit : 0.08s CPU 0.17s WALL ( 1 calls)
potinit : 1.27s CPU 1.50s WALL ( 1 calls)
hinit0 : 0.27s CPU 0.33s WALL ( 1 calls)
Called by electrons:
c_bands : 28.09s CPU 33.01s WALL ( 173 calls)
sum_band : 13.69s CPU 14.89s WALL ( 173 calls)
v_of_rho : 56.29s CPU 95.06s WALL ( 185 calls)
newd : 5.60s CPU 6.38s WALL ( 185 calls)
mix_rho : 1.37s CPU 1.65s WALL ( 173 calls)
vdW_kernel : 0.84s CPU 0.88s WALL ( 1 calls)
Called by c_bands:
init_us_2 : 0.54s CPU 0.62s WALL ( 347 calls)
init_us_2:cp : 0.54s CPU 0.62s WALL ( 347 calls)
regterg : 27.54s CPU 32.31s WALL ( 173 calls)
Called by *egterg:
rdiaghg : 0.45s CPU 0.49s WALL ( 584 calls)
h_psi : 23.00s CPU 27.54s WALL ( 610 calls)
s_psi : 0.64s CPU 0.66s WALL ( 610 calls)
g_psi : 0.04s CPU 0.04s WALL ( 436 calls)
Called by h_psi:
h_psi:calbec : 1.53s CPU 1.75s WALL ( 610 calls)
vloc_psi : 20.46s CPU 24.73s WALL ( 610 calls)
vloc_psi:tg_ : 1.62s CPU 1.71s WALL ( 610 calls)
add_vuspsi : 0.82s CPU 0.86s WALL ( 610 calls)
General routines
calbec : 2.20s CPU 2.52s WALL ( 859 calls)
fft : 40.10s CPU 76.07s WALL ( 12061 calls)
ffts : 0.66s CPU 0.73s WALL ( 173 calls)
fftw : 18.72s CPU 22.92s WALL ( 8916 calls)
Parallel routines
fft_scatt_xy : 15.80s CPU 20.80s WALL ( 21150 calls)
fft_scatt_yz : 27.55s CPU 58.79s WALL ( 21150 calls)
fft_scatt_tg : 3.60s CPU 4.31s WALL ( 8916 calls)
PWSCF : 2m 1.29s CPU 2m54.94s WALL
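For reference, the overall slowdown can be checked directly from the two PWSCF wall
times above; a minimal Python sketch (only the arithmetic, nothing beyond the
reported numbers):

```python
# Sanity-check the slowdown from the two "PWSCF ... WALL" lines in the
# reports above (GPU: 30m49.28s, CPU: 2m54.94s).

def to_seconds(stamp: str) -> float:
    """Convert a QE-style time stamp such as '30m49.28s' or '2m 1.29s' to seconds."""
    stamp = stamp.replace(" ", "").rstrip("s")
    if "m" in stamp:
        minutes, seconds = stamp.split("m")
        return int(minutes) * 60 + float(seconds)
    return float(stamp)

gpu_wall = to_seconds("30m49.28s")  # GPU run, PWSCF WALL
cpu_wall = to_seconds("2m54.94s")   # CPU run, PWSCF WALL

print(f"GPU/CPU wall-time ratio: {gpu_wall / cpu_wall:.1f}x")  # about 10.6x
```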
This version of QE was compiled on the Perlmutter supercomputer at NERSC. Here
are the compile specifications:
# Modules
Currently Loaded Modules:
1) craype-x86-milan
2) libfabric/1.11.0.4.114
3) craype-network-ofi
4) perftools-base/22.04.0
5) xpmem/2.3.2-2.2_7.5__g93dd7ee.shasta
6) xalt/2.10.2
7) nvidia/21.11 (g,c)
8) craype/2.7.15 (c)
9) cray-dsmml/0.2.2
10) cray-mpich/8.1.15 (mpi)
11) PrgEnv-nvidia/8.3.3 (cpe)
12) Nsight-Compute/2022.1.1
13) Nsight-Systems/2022.2.1
14) cudatoolkit/11.5 (g)
15) cray-fftw/3.3.8.13 (math)
16) cray-hdf5-parallel/1.12.1.1 (io)
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/opt/cray/pe/fftw/3.3.8.13/x86_milan/lib:/opt/cray/pe/hdf5-parallel/1.12.1.1/nvidia/20.7/lib
./configure CC=cc CXX=CC FC=ftn MPIF90=ftn --with-cuda=$CUDA_HOME \
  --with-cuda-cc=80 --with-cuda-runtime=11.0 --enable-parallel --enable-openmp \
  --disable-shared --with-scalapack=yes FFLAGS="-Mpreprocess" \
  FCFLAGS="-Mpreprocess" LDFLAGS="-acc" --with-libxc \
  --with-libxc-prefix=/global/common/software/nersc/pm-2021q4/sw/libxc/v5.2.2/alv-gpu \
  --with-hdf5=${HDF5_DIR}
make veryclean
make all
# go to the EPW directory and run make; then go to the main binary directory and
# link to the epw.x executable
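In case it helps with reproduction, here is roughly the kind of Slurm batch script
used to launch the GPU runs — a minimal sketch only, assuming one MPI rank per A100;
the account name, time limit, and input/output file names are placeholders, not
taken from the runs above:

```shell
#!/bin/bash
#SBATCH -A <account>          # placeholder account
#SBATCH -C gpu                # Perlmutter GPU-node constraint
#SBATCH -N 1
#SBATCH --ntasks-per-node=4   # one MPI rank per A100
#SBATCH --gpus-per-task=1
#SBATCH -t 00:30:00

# Fill the remaining CPU cores on the 64-core EPYC with OpenMP threads.
export OMP_NUM_THREADS=16

srun pw.x -input scf.in > scf.out
```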
If any more information is required, please let me know and I will provide it
promptly!
users mailing list [email protected]
https://lists.quantum-espresso.org/mailman/listinfo/users