My current submission script uses 4 tasks per node, and my input has only 1 
k-point. It is probably pertinent to mention that I am running molecular systems, 
not a crystal or any other repeating structure. There are only 31 Kohn-Sham 
states in the system, and the FFT grid is (192,192,192). I had assumed that the 
GPU code would always be faster than the CPU code, maybe not by much, but 
certainly not 8-10x slower. Is that an unrealistic expectation?

________________________________
From: Paolo Giannozzi <[email protected]>
Sent: Monday, October 28, 2024 12:04 PM
To: Quantum ESPRESSO users Forum <[email protected]>
Cc: Dyer, Brock <[email protected]>
Subject: Re: [QE-users] [QE-GPU] GPU runs significantly slower than CPU runs.

Performance on GPU depends on many factors, e.g., the size of the
system and how the code is run. One should run one MPI task per GPU and
use low-communication parallelization (e.g., over k-points) whenever possible.
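On a 4-GPU node, "one MPI per GPU" typically means a launch along these lines. This is only a sketch assuming a Slurm system similar to Perlmutter; the constraint name, time limit, and file names are illustrative assumptions, not taken from the original message:

```shell
#!/bin/bash
#SBATCH -N 1                 # one GPU node
#SBATCH -C gpu               # request the GPU partition (site-specific)
#SBATCH -G 4                 # 4 GPUs on the node
#SBATCH -t 00:30:00

# One MPI rank per GPU: 4 tasks on a 4-GPU node,
# with each rank bound to its own device.
srun -n 4 --ntasks-per-node=4 --gpus-per-task=1 \
     ./pw.x -input scf.in > scf.out
```

Note that with a single k-point, as in the input described above, k-point parallelization (`-nk`) cannot help, so the rank-to-GPU mapping is the main knob.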

Paolo

On 10/24/24 17:38, Dyer, Brock wrote:
>
> Hi folks, I have been trying to use the QE 7.0 GPU-accelerated version
> of pw.x lately, and have noticed that it is significantly (10x) slower
> than the CPU version. The GPU nodes I use have an AMD EPYC 7763
> processor (64 cores, 128 threads) and 4 NVIDIA A100 (40gb each) GPUs,
> and the CPU nodes have 2 AMD EPYC 7763 processors. The time reports from
> runs on identical input files are below (GPU first, then CPU):
>
> GPU Version:
>
>       init_run     :     14.17s CPU     19.29s WALL (       1 calls)
>       electrons    :   1352.63s CPU   1498.17s WALL (      19 calls)
>       update_pot   :    144.15s CPU    158.77s WALL (      18 calls)
>       forces       :    144.74s CPU    158.92s WALL (      19 calls)
>
>       Called by init_run:
>       wfcinit      :      0.14s CPU      2.10s WALL (       1 calls)
>                                          2.10s GPU  (       1 calls)
>       potinit      :     12.83s CPU     13.78s WALL (       1 calls)
>       hinit0       :      0.29s CPU      0.35s WALL (       1 calls)
>
>       Called by electrons:
>       c_bands      :     30.64s CPU     38.04s WALL (     173 calls)
>       sum_band     :     36.93s CPU     40.47s WALL (     173 calls)
>       v_of_rho     :   1396.71s CPU   1540.48s WALL (     185 calls)
>       newd         :     13.67s CPU     20.30s WALL (     185 calls)
>                                          9.04s GPU  (     167 calls)
>       mix_rho      :     26.02s CPU     27.31s WALL (     173 calls)
>       vdW_kernel   :      4.99s CPU      5.01s WALL (       1 calls)
>
>       Called by c_bands:
>       init_us_2    :      0.24s CPU      0.39s WALL (     347 calls)
>       init_us_2:gp :      0.23s CPU      0.38s WALL (     347 calls)
>       regterg      :     29.53s CPU     36.07s WALL (     173 calls)
>
>       Called by *egterg:
>       rdiaghg      :      0.61s CPU      1.74s WALL (     585 calls)
>                                                  1.72s GPU  (     585 calls)
>       h_psi        :     26.71s CPU     33.73s WALL (     611 calls)
>                                                  33.69s GPU  (     611 calls)
>       s_psi        :      0.08s CPU      0.16s WALL (     611 calls)
>                                                  0.14s GPU  (     611 calls)
>       g_psi        :      0.00s CPU      0.04s WALL (     437 calls)
>                                                  0.04s GPU  (     437 calls)
>
>       Called by h_psi:
>       h_psi:calbec :      0.27s CPU      0.32s WALL (     611 calls)
>                                                      0.32s GPU  (     611 calls)
>       vloc_psi     :     26.11s CPU     33.04s WALL (     611 calls)
>                                                     33.02s GPU  (     611 calls)
>       add_vuspsi   :      0.06s CPU      0.14s WALL (     611 calls)
>                                                     0.13s GPU  (     611 calls)
>
>       General routines
>       calbec       :      0.32s CPU      0.37s WALL (     860 calls)
>       fft          :    778.93s CPU    892.58s WALL (   12061 calls)
>                                                  13.39s GPU  (    1263 calls)
>       ffts         :     12.40s CPU     12.96s WALL (     173 calls)
>       fftw         :     30.44s CPU     39.53s WALL (    3992 calls)
>                                                  38.80s GPU  (    3992 calls)
>       Parallel routines
>       PWSCF        :  27m46.53s CPU  30m49.28s WALL
>
> CPU Version:
>
>
>       init_run     :      2.35s CPU      2.79s WALL (       1 calls)
>
>       electrons    :     99.04s CPU    142.56s WALL (      19 calls)
>       update_pot   :      9.01s CPU     13.47s WALL (      18 calls)
>       forces       :      9.89s CPU     14.35s WALL (      19 calls)
>
>       Called by init_run:
>       wfcinit      :      0.08s CPU      0.17s WALL (       1 calls)
>       potinit      :      1.27s CPU      1.50s WALL (       1 calls)
>       hinit0       :      0.27s CPU      0.33s WALL (       1 calls)
>
>       Called by electrons:
>       c_bands      :     28.09s CPU     33.01s WALL (     173 calls)
>       sum_band     :     13.69s CPU     14.89s WALL (     173 calls)
>       v_of_rho     :     56.29s CPU     95.06s WALL (     185 calls)
>       newd         :      5.60s CPU      6.38s WALL (     185 calls)
>       mix_rho      :      1.37s CPU      1.65s WALL (     173 calls)
>       vdW_kernel   :      0.84s CPU      0.88s WALL (       1 calls)
>
>       Called by c_bands:
>       init_us_2    :      0.54s CPU      0.62s WALL (     347 calls)
>       init_us_2:cp :      0.54s CPU      0.62s WALL (     347 calls)
>       regterg      :     27.54s CPU     32.31s WALL (     173 calls)
>
>       Called by *egterg:
>       rdiaghg      :      0.45s CPU      0.49s WALL (     584 calls)
>       h_psi        :     23.00s CPU     27.54s WALL (     610 calls)
>       s_psi        :      0.64s CPU      0.66s WALL (     610 calls)
>       g_psi        :      0.04s CPU      0.04s WALL (     436 calls)
>
>       Called by h_psi:
>       h_psi:calbec :      1.53s CPU      1.75s WALL (     610 calls)
>       vloc_psi     :     20.46s CPU     24.73s WALL (     610 calls)
>       vloc_psi:tg_ :      1.62s CPU      1.71s WALL (     610 calls)
>       add_vuspsi   :      0.82s CPU      0.86s WALL (     610 calls)
>
>       General routines
>       calbec       :      2.20s CPU      2.52s WALL (     859 calls)
>       fft          :     40.10s CPU     76.07s WALL (   12061 calls)
>       ffts         :      0.66s CPU      0.73s WALL (     173 calls)
>       fftw         :     18.72s CPU     22.92s WALL (    8916 calls)
>
>       Parallel routines
>       fft_scatt_xy :     15.80s CPU     20.80s WALL (   21150 calls)
>       fft_scatt_yz :     27.55s CPU     58.79s WALL (   21150 calls)
>       fft_scatt_tg :      3.60s CPU      4.31s WALL (    8916 calls)
>
>       PWSCF        :   2m 1.29s CPU   2m54.94s WALL
>
>
> This version of QE was compiled on the Perlmutter supercomputer at
> NERSC. Here are the compile specifications:
>
> # Modules
>
>
> Currently Loaded Modules:
>    1) craype-x86-milan
>    2) libfabric/1.11.0.4.114
>    3) craype-network-ofi
>    4) perftools-base/22.04.0
>    5) xpmem/2.3.2-2.2_7.5__g93dd7ee.shasta
>    6) xalt/2.10.2
>    7) nvidia/21.11                         (g,c)
>    8) craype/2.7.15                        (c)
>    9) cray-dsmml/0.2.2
>   10) cray-mpich/8.1.15                    (mpi)
>   11) PrgEnv-nvidia/8.3.3                  (cpe)
>   12) Nsight-Compute/2022.1.1
>   13) Nsight-Systems/2022.2.1
>   14) cudatoolkit/11.5                     (g)
>   15) cray-fftw/3.3.8.13                   (math)
>   16) cray-hdf5-parallel/1.12.1.1          (io)
>
> export
> LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/opt/cray/pe/fftw/3.3.8.13/x86_milan/lib:/opt/cray/pe/hdf5-parallel/1.12.1.1/nvidia/20.7/lib
>
> ./configure CC=cc CXX=CC FC=ftn MPIF90=ftn --with-cuda=$CUDA_HOME
> --with-cuda-cc=80 --with-cuda-runtime=11.0 --enable-parallel
> --enable-openmp --disable-shared --with-scalapack=yes
> FFLAGS="-Mpreprocess" FCFLAGS="-Mpreprocess" LDFLAGS="-acc" --with-libxc
> --with-libxc-prefix=/global/common/software/nersc/pm-2021q4/sw/libxc/v5.2.2/alv-gpu
>  --with-hdf5=${HDF5_DIR}
>
> make veryclean
> make all
>
> # go to the EPW directory and run make; then go to the main binary
> directory and link the epw.x executable
>
>
> If there is any more information required, please let me know and I will
> try to get it promptly!
>
>
>
> _______________________________________________
> The Quantum ESPRESSO community stands by the Ukrainian
> people and expresses its concerns about the devastating
> effects that the Russian military offensive has on their
> country and on the free and peaceful scientific, cultural,
> and economic cooperation amongst peoples
> _______________________________________________
> Quantum ESPRESSO is supported by MaX (https://www.max-centre.eu)
> users mailing list [email protected]
> https://lists.quantum-espresso.org/mailman/listinfo/users

--
Paolo Giannozzi, Dip. Scienze Matematiche Informatiche e Fisiche,
Univ. Udine, via delle Scienze 206, 33100 Udine Italy, +39-0432-558216
