My current submission script uses 4 tasks per node, and my input has only 1 k-point. I feel it is pertinent to mention that I am running molecular systems, not a crystal or any sort of repeating structure. There are only 31 Kohn-Sham states in the system, and the FFT grid is (192,192,192). I just sort of assumed that the GPU code would always be faster than the CPU code, maybe not by much, but certainly not 8-10x slower. Is that an unrealistic expectation?
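For concreteness, "4 tasks per node" means a job script roughly like the sketch below (file names are placeholders; the directives are standard Slurm ones for a 4-GPU Perlmutter node, with one MPI rank per GPU):

    #!/bin/bash
    #SBATCH -C gpu                 # Perlmutter GPU node
    #SBATCH -N 1                   # one node
    #SBATCH --ntasks-per-node=4    # 4 MPI tasks per node...
    #SBATCH --gpus-per-node=4      # ...one per A100
    #SBATCH -t 01:00:00

    # one MPI rank per GPU; input/output names are placeholders
    srun pw.x -inp molecule.in > molecule.out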
________________________________
From: Paolo Giannozzi <[email protected]>
Sent: Monday, October 28, 2024 12:04 PM
To: Quantum ESPRESSO users Forum <[email protected]>
Cc: Dyer, Brock <[email protected]>
Subject: Re: [QE-users] [QE-GPU] GPU runs significantly slower than CPU runs.

The performance on GPU depends upon a lot of factors, e.g., the size of
the system and how the code is run. One should run one MPI process per
GPU and use low-communication parallelization (e.g., on k-points)
whenever possible.

Paolo

On 10/24/24 17:38, Dyer, Brock wrote:
>
> Hi folks, I have been trying to use the QE 7.0 GPU-accelerated version
> of pw.x lately, and have noticed that it is significantly (10x) slower
> than the CPU version. The GPU nodes I use have an AMD EPYC 7763
> processor (64 cores, 128 threads) and 4 NVIDIA A100 (40 GB each) GPUs,
> and the CPU nodes have 2 AMD EPYC 7763 processors. The time reports
> from runs on identical input files are below (GPU first, then CPU):
>
> GPU Version:
>
>     init_run     :     14.17s CPU     19.29s WALL (       1 calls)
>     electrons    :   1352.63s CPU   1498.17s WALL (      19 calls)
>     update_pot   :    144.15s CPU    158.77s WALL (      18 calls)
>     forces       :    144.74s CPU    158.92s WALL (      19 calls)
>
>     Called by init_run:
>     wfcinit      :      0.14s CPU      2.10s WALL (       1 calls)
>                                        2.10s GPU  (       1 calls)
>     potinit      :     12.83s CPU     13.78s WALL (       1 calls)
>     hinit0       :      0.29s CPU      0.35s WALL (       1 calls)
>
>     Called by electrons:
>     c_bands      :     30.64s CPU     38.04s WALL (     173 calls)
>     sum_band     :     36.93s CPU     40.47s WALL (     173 calls)
>     v_of_rho     :   1396.71s CPU   1540.48s WALL (     185 calls)
>     newd         :     13.67s CPU     20.30s WALL (     185 calls)
>                                        9.04s GPU  (     167 calls)
>     mix_rho      :     26.02s CPU     27.31s WALL (     173 calls)
>     vdW_kernel   :      4.99s CPU      5.01s WALL (       1 calls)
>
>     Called by c_bands:
>     init_us_2    :      0.24s CPU      0.39s WALL (     347 calls)
>     init_us_2:gp :      0.23s CPU      0.38s WALL (     347 calls)
>     regterg      :     29.53s CPU     36.07s WALL (     173 calls)
>
>     Called by *egterg:
>     rdiaghg      :      0.61s CPU      1.74s WALL (     585 calls)
>                                        1.72s GPU  (     585 calls)
>     h_psi        :     26.71s CPU     33.73s WALL (     611 calls)
>                                       33.69s GPU  (     611 calls)
>     s_psi        :      0.08s CPU      0.16s WALL (     611 calls)
>                                        0.14s GPU  (     611 calls)
>     g_psi        :      0.00s CPU      0.04s WALL (     437 calls)
>                                        0.04s GPU  (     437 calls)
>
>     Called by h_psi:
>     h_psi:calbec :      0.27s CPU      0.32s WALL (     611 calls)
>                                        0.32s GPU  (     611 calls)
>     vloc_psi     :     26.11s CPU     33.04s WALL (     611 calls)
>                                       33.02s GPU  (     611 calls)
>     add_vuspsi   :      0.06s CPU      0.14s WALL (     611 calls)
>                                        0.13s GPU  (     611 calls)
>
>     General routines
>     calbec       :      0.32s CPU      0.37s WALL (     860 calls)
>     fft          :    778.93s CPU    892.58s WALL (   12061 calls)
>                                       13.39s GPU  (    1263 calls)
>     ffts         :     12.40s CPU     12.96s WALL (     173 calls)
>     fftw         :     30.44s CPU     39.53s WALL (    3992 calls)
>                                       38.80s GPU  (    3992 calls)
>
>     Parallel routines
>
>     PWSCF        :  27m46.53s CPU  30m49.28s WALL
>
> CPU Version:
>
>     init_run     :      2.35s CPU      2.79s WALL (       1 calls)
>     electrons    :     99.04s CPU    142.56s WALL (      19 calls)
>     update_pot   :      9.01s CPU     13.47s WALL (      18 calls)
>     forces       :      9.89s CPU     14.35s WALL (      19 calls)
>
>     Called by init_run:
>     wfcinit      :      0.08s CPU      0.17s WALL (       1 calls)
>     potinit      :      1.27s CPU      1.50s WALL (       1 calls)
>     hinit0       :      0.27s CPU      0.33s WALL (       1 calls)
>
>     Called by electrons:
>     c_bands      :     28.09s CPU     33.01s WALL (     173 calls)
>     sum_band     :     13.69s CPU     14.89s WALL (     173 calls)
>     v_of_rho     :     56.29s CPU     95.06s WALL (     185 calls)
>     newd         :      5.60s CPU      6.38s WALL (     185 calls)
>     mix_rho      :      1.37s CPU      1.65s WALL (     173 calls)
>     vdW_kernel   :      0.84s CPU      0.88s WALL (       1 calls)
>
>     Called by c_bands:
>     init_us_2    :      0.54s CPU      0.62s WALL (     347 calls)
>     init_us_2:cp :      0.54s CPU      0.62s WALL (     347 calls)
>     regterg      :     27.54s CPU     32.31s WALL (     173 calls)
>
>     Called by *egterg:
>     rdiaghg      :      0.45s CPU      0.49s WALL (     584 calls)
>     h_psi        :     23.00s CPU     27.54s WALL (     610 calls)
>     s_psi        :      0.64s CPU      0.66s WALL (     610 calls)
>     g_psi        :      0.04s CPU      0.04s WALL (     436 calls)
>
>     Called by h_psi:
>     h_psi:calbec :      1.53s CPU      1.75s WALL (     610 calls)
>     vloc_psi     :     20.46s CPU     24.73s WALL (     610 calls)
>     vloc_psi:tg_ :      1.62s CPU      1.71s WALL (     610 calls)
>     add_vuspsi   :      0.82s CPU      0.86s WALL (     610 calls)
>
>     General routines
>     calbec       :      2.20s CPU      2.52s WALL (     859 calls)
>     fft          :     40.10s CPU     76.07s WALL (   12061 calls)
>     ffts         :      0.66s CPU      0.73s WALL (     173 calls)
>     fftw         :     18.72s CPU     22.92s WALL (    8916 calls)
>
>     Parallel routines
>     fft_scatt_xy :     15.80s CPU     20.80s WALL (   21150 calls)
>     fft_scatt_yz :     27.55s CPU     58.79s WALL (   21150 calls)
>     fft_scatt_tg :      3.60s CPU      4.31s WALL (    8916 calls)
>
>     PWSCF        :   2m 1.29s CPU   2m54.94s WALL
>
> This version of QE was compiled on the Perlmutter supercomputer at
> NERSC. Here are the compile specifications:
>
> # Modules
>
> Currently Loaded Modules:
>   1) craype-x86-milan
>   2) libfabric/1.11.0.4.114
>   3) craype-network-ofi
>   4) perftools-base/22.04.0
>   5) xpmem/2.3.2-2.2_7.5__g93dd7ee.shasta
>   6) xalt/2.10.2
>   7) nvidia/21.11 (g,c)
>   8) craype/2.7.15 (c)
>   9) cray-dsmml/0.2.2
>  10) cray-mpich/8.1.15 (mpi)
>  11) PrgEnv-nvidia/8.3.3 (cpe)
>  12) Nsight-Compute/2022.1.1
>  13) Nsight-Systems/2022.2.1
>  14) cudatoolkit/11.5 (g)
>  15) cray-fftw/3.3.8.13 (math)
>  16) cray-hdf5-parallel/1.12.1.1 (io)
>
> export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/opt/cray/pe/fftw/3.3.8.13/x86_milan/lib:/opt/cray/pe/hdf5-parallel/1.12.1.1/nvidia/20.7/lib
>
> ./configure CC=cc CXX=CC FC=ftn MPIF90=ftn --with-cuda=$CUDA_HOME \
>   --with-cuda-cc=80 --with-cuda-runtime=11.0 --enable-parallel \
>   --enable-openmp --disable-shared --with-scalapack=yes \
>   FFLAGS="-Mpreprocess" FCFLAGS="-Mpreprocess" LDFLAGS="-acc" \
>   --with-libxc --with-libxc-prefix=/global/common/software/nersc/pm-2021q4/sw/libxc/v5.2.2/alv-gpu \
>   --with-hdf5=${HDF5_DIR}
>
> make veryclean
> make all
>
> # go to EPW directory: make; then go to main binary directory and link
> # to epw.x executable
>
> If there is any more information required, please let me know and I
> will try to get it promptly!

--
Paolo Giannozzi, Dip. Scienze Matematiche Informatiche e Fisiche,
Univ. Udine, via delle Scienze 206, 33100 Udine Italy, +39-0432-558216
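In command-line terms, Paolo's suggestion amounts to something like the following (a minimal sketch, not taken from the thread; -nk is pw.x's k-point pool flag, and file names are placeholders):

    # one MPI rank per GPU (4 ranks, 4 GPUs per node)
    srun -n 4 --gpus-per-node=4 pw.x -inp molecule.in > molecule.out

    # with several k-points, pool parallelization cuts communication,
    # e.g. one pool per GPU when there are at least 4 k-points:
    srun -n 4 --gpus-per-node=4 pw.x -nk 4 -inp crystal.in > crystal.out

With a single k-point, as in the molecular runs above, only plane-wave (G-vector) parallelization is available, so the -nk flag does not apply.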
_______________________________________________
The Quantum ESPRESSO community stands by the Ukrainian people and
expresses its concerns about the devastating effects that the Russian
military offensive has on their country and on the free and peaceful
scientific, cultural, and economic cooperation amongst peoples
_______________________________________________
Quantum ESPRESSO is supported by MaX (www.max-centre.eu)
users mailing list [email protected]
https://lists.quantum-espresso.org/mailman/listinfo/users
