Hi folks, I have been trying to use the GPU-accelerated version of pw.x from QE 7.0
lately, and have noticed that it is significantly (roughly 10x) slower than the CPU
version. The GPU nodes I use have one AMD EPYC 7763 processor (64 cores, 128
threads) and four NVIDIA A100 GPUs (40 GB each), and the CPU nodes have two AMD
EPYC 7763 processors. The time reports from runs on identical input files are below
(GPU first, then CPU):
GPU Version:
init_run : 14.17s CPU 19.29s WALL ( 1 calls)
electrons : 1352.63s CPU 1498.17s WALL ( 19 calls)
update_pot : 144.15s CPU 158.77s WALL ( 18 calls)
forces : 144.74s CPU 158.92s WALL ( 19 calls)
Called by init_run:
wfcinit : 0.14s CPU 2.10s WALL ( 1 calls)
2.10s GPU ( 1 calls)
potinit : 12.83s CPU 13.78s WALL ( 1 calls)
hinit0 : 0.29s CPU 0.35s WALL ( 1 calls)
Called by electrons:
c_bands : 30.64s CPU 38.04s WALL ( 173 calls)
sum_band : 36.93s CPU 40.47s WALL ( 173 calls)
v_of_rho : 1396.71s CPU 1540.48s WALL ( 185 calls)
newd : 13.67s CPU 20.30s WALL ( 185 calls)
9.04s GPU ( 167 calls)
mix_rho : 26.02s CPU 27.31s WALL ( 173 calls)
vdW_kernel : 4.99s CPU 5.01s WALL ( 1 calls)
Called by c_bands:
init_us_2 : 0.24s CPU 0.39s WALL ( 347 calls)
init_us_2:gp : 0.23s CPU 0.38s WALL ( 347 calls)
regterg : 29.53s CPU 36.07s WALL ( 173 calls)
Called by *egterg:
rdiaghg : 0.61s CPU 1.74s WALL ( 585 calls)
1.72s GPU ( 585 calls)
h_psi : 26.71s CPU 33.73s WALL ( 611 calls)
33.69s GPU ( 611 calls)
s_psi : 0.08s CPU 0.16s WALL ( 611 calls)
0.14s GPU ( 611 calls)
g_psi : 0.00s CPU 0.04s WALL ( 437 calls)
0.04s GPU ( 437 calls)
Called by h_psi:
h_psi:calbec : 0.27s CPU 0.32s WALL ( 611 calls)
0.32s GPU ( 611 calls)
vloc_psi : 26.11s CPU 33.04s WALL ( 611 calls)
33.02s GPU ( 611 calls)
add_vuspsi : 0.06s CPU 0.14s WALL ( 611 calls)
0.13s GPU ( 611 calls)
General routines
calbec : 0.32s CPU 0.37s WALL ( 860 calls)
fft : 778.93s CPU 892.58s WALL ( 12061 calls)
13.39s GPU ( 1263 calls)
ffts : 12.40s CPU 12.96s WALL ( 173 calls)
fftw : 30.44s CPU 39.53s WALL ( 3992 calls)
38.80s GPU ( 3992 calls)
Parallel routines
PWSCF : 27m46.53s CPU 30m49.28s WALL
CPU Version:
init_run : 2.35s CPU 2.79s WALL ( 1 calls)
electrons : 99.04s CPU 142.56s WALL ( 19 calls)
update_pot : 9.01s CPU 13.47s WALL ( 18 calls)
forces : 9.89s CPU 14.35s WALL ( 19 calls)
Called by init_run:
wfcinit : 0.08s CPU 0.17s WALL ( 1 calls)
potinit : 1.27s CPU 1.50s WALL ( 1 calls)
hinit0 : 0.27s CPU 0.33s WALL ( 1 calls)
Called by electrons:
c_bands : 28.09s CPU 33.01s WALL ( 173 calls)
sum_band : 13.69s CPU 14.89s WALL ( 173 calls)
v_of_rho : 56.29s CPU 95.06s WALL ( 185 calls)
newd : 5.60s CPU 6.38s WALL ( 185 calls)
mix_rho : 1.37s CPU 1.65s WALL ( 173 calls)
vdW_kernel : 0.84s CPU 0.88s WALL ( 1 calls)
Called by c_bands:
init_us_2 : 0.54s CPU 0.62s WALL ( 347 calls)
init_us_2:cp : 0.54s CPU 0.62s WALL ( 347 calls)
regterg : 27.54s CPU 32.31s WALL ( 173 calls)
Called by *egterg:
rdiaghg : 0.45s CPU 0.49s WALL ( 584 calls)
h_psi : 23.00s CPU 27.54s WALL ( 610 calls)
s_psi : 0.64s CPU 0.66s WALL ( 610 calls)
g_psi : 0.04s CPU 0.04s WALL ( 436 calls)
Called by h_psi:
h_psi:calbec : 1.53s CPU 1.75s WALL ( 610 calls)
vloc_psi : 20.46s CPU 24.73s WALL ( 610 calls)
vloc_psi:tg_ : 1.62s CPU 1.71s WALL ( 610 calls)
add_vuspsi : 0.82s CPU 0.86s WALL ( 610 calls)
General routines
calbec : 2.20s CPU 2.52s WALL ( 859 calls)
fft : 40.10s CPU 76.07s WALL ( 12061 calls)
ffts : 0.66s CPU 0.73s WALL ( 173 calls)
fftw : 18.72s CPU 22.92s WALL ( 8916 calls)
Parallel routines
fft_scatt_xy : 15.80s CPU 20.80s WALL ( 21150 calls)
fft_scatt_yz : 27.55s CPU 58.79s WALL ( 21150 calls)
fft_scatt_tg : 3.60s CPU 4.31s WALL ( 8916 calls)
PWSCF : 2m 1.29s CPU 2m54.94s WALL
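For reference, the overall slowdown can be checked directly from the two PWSCF wall
times above; a minimal Python sketch (only the arithmetic, nothing beyond the
reported numbers):

```python
# Sanity-check the slowdown from the two "PWSCF ... WALL" lines in the
# reports above (GPU: 30m49.28s, CPU: 2m54.94s).

def to_seconds(stamp: str) -> float:
    """Convert a QE-style time stamp such as '30m49.28s' or '2m 1.29s' to seconds."""
    stamp = stamp.replace(" ", "").rstrip("s")
    if "m" in stamp:
        minutes, seconds = stamp.split("m")
        return int(minutes) * 60 + float(seconds)
    return float(stamp)

gpu_wall = to_seconds("30m49.28s")  # GPU run, PWSCF WALL
cpu_wall = to_seconds("2m54.94s")   # CPU run, PWSCF WALL

print(f"GPU/CPU wall-time ratio: {gpu_wall / cpu_wall:.1f}x")  # about 10.6x
```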
This version of QE was compiled on the Perlmutter supercomputer at NERSC. Here
are the compile specifications:
# Modules
Currently Loaded Modules:
1) craype-x86-milan
2) libfabric/1.11.0.4.114
3) craype-network-ofi
4) perftools-base/22.04.0
5) xpmem/2.3.2-2.2_7.5__g93dd7ee.shasta
6) xalt/2.10.2
7) nvidia/21.11 (g,c)
8) craype/2.7.15 (c)
9) cray-dsmml/0.2.2
10) cray-mpich/8.1.15 (mpi)
11) PrgEnv-nvidia/8.3.3 (cpe)
12) Nsight-Compute/2022.1.1
13) Nsight-Systems/2022.2.1
14) cudatoolkit/11.5 (g)
15) cray-fftw/3.3.8.13 (math)
16) cray-hdf5-parallel/1.12.1.1 (io)
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/opt/cray/pe/fftw/3.3.8.13/x86_milan/lib:/opt/cray/pe/hdf5-parallel/1.12.1.1/nvidia/20.7/lib
./configure CC=cc CXX=CC FC=ftn MPIF90=ftn --with-cuda=$CUDA_HOME \
  --with-cuda-cc=80 --with-cuda-runtime=11.0 --enable-parallel --enable-openmp \
  --disable-shared --with-scalapack=yes FFLAGS="-Mpreprocess" \
  FCFLAGS="-Mpreprocess" LDFLAGS="-acc" --with-libxc \
  --with-libxc-prefix=/global/common/software/nersc/pm-2021q4/sw/libxc/v5.2.2/alv-gpu \
  --with-hdf5=${HDF5_DIR}
make veryclean
make all
# go to the EPW directory and run make; then go to the main binary directory and
# link to the epw.x executable
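In case it helps with reproduction, here is roughly the kind of Slurm batch script
used to launch the GPU runs — a minimal sketch only, assuming one MPI rank per A100;
the account name, time limit, and input/output file names are placeholders, not
taken from the runs above:

```shell
#!/bin/bash
#SBATCH -A <account>          # placeholder account
#SBATCH -C gpu                # Perlmutter GPU-node constraint
#SBATCH -N 1
#SBATCH --ntasks-per-node=4   # one MPI rank per A100
#SBATCH --gpus-per-task=1
#SBATCH -t 00:30:00

# Fill the remaining CPU cores on the 64-core EPYC with OpenMP threads.
export OMP_NUM_THREADS=16

srun pw.x -input scf.in > scf.out
```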
If any more information is required, please let me know and I will provide it
promptly!
users mailing list [email protected]
https://lists.quantum-espresso.org/mailman/listinfo/users