Hello,

I am able to run hypre on GPUs successfully, but the run seems to consume a lot of memory. I ran ksp/ksp/tutorial/ex45 on a 320 x 320 x 320 grid with 6 GPUs, using the following options:
mpirun -n 6 ./ex45 -da_grid_x 320 -da_grid_y 320 -da_grid_z 320 -dm_mat_type hypre -dm_vec_type cuda -ksp_type cg -pc_type hypre -pc_hypre_type boomeramg -ksp_monitor -log_view -malloc_dump -memory_view -malloc_log

From the -log_view output (also attached) I get the following memory consumption:

Summary of Memory Usage in PETSc
Maximum (over computational time) process memory:        total 9.7412e+09 max 1.6999e+09 min 1.5368e+09
Current process memory:                                  total 8.1640e+09 max 1.4359e+09 min 1.2733e+09
Maximum (over computational time) space PetscMalloc()ed: total 7.7661e+08 max 1.3401e+08 min 1.1148e+08
Current space PetscMalloc()ed:                           total 1.8356e+06 max 3.0594e+05 min 3.0594e+05

Each GPU is an NVIDIA Tesla V100, yet even with 4 GPUs the run fails with a CUDA out-of-memory error for this problem size. Judging from the memory figures listed above, the problem should be able to run on a single GPU. Is hypre's memory usage perhaps not included in the output above?

Best,
Karthik.
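P.S. As a rough sanity check (assuming a 7-point stencil in double precision): 320^3 ≈ 3.3e7 unknowns gives ~0.26 GB per vector and ~2.3e8 matrix nonzeros, i.e. roughly 2.7 GB for the fine-grid operator, and the BoomerAMG hierarchy typically adds a small multiple of that (its operator complexity). As far as I understand, the PetscMalloc() figures cover only PETSc's own allocations and the process memory is host resident set size, so device memory allocated inside hypre would appear in neither. One way to watch it directly is to query the CUDA runtime; the sketch below uses the standard cudaMemGetInfo() call, but report_gpu_memory itself is just a hypothetical helper I made up, not a PETSc or hypre API:

    #include <stdio.h>
    #include <cuda_runtime.h>

    /* Print how much device memory is currently in use.  cudaMemGetInfo()
       reports free/total bytes for the active device, so allocations that
       hypre makes directly with cudaMalloc show up even though
       PetscMalloc() never sees them. */
    static void report_gpu_memory(const char *label)
    {
      size_t free_bytes, total_bytes;
      if (cudaMemGetInfo(&free_bytes, &total_bytes) == cudaSuccess)
        printf("[%s] device memory in use: %.2f GB of %.2f GB\n",
               label, (total_bytes - free_bytes) / 1e9, total_bytes / 1e9);
    }

Calling this before and after PCSetUp()/KSPSolve() in ex45 (or simply watching nvidia-smi during the run) should show how much device memory the BoomerAMG setup takes on each rank.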
  0 KSP Residual norm 3.354654370474e+03
  1 KSP Residual norm 1.369260898558e+03
  2 KSP Residual norm 4.509282508695e+02
  3 KSP Residual norm 7.819563394025e+01
  4 KSP Residual norm 1.741266244858e+01
  5 KSP Residual norm 3.208614741531e+00
  6 KSP Residual norm 4.495268736218e-01
  7 KSP Residual norm 6.305590303007e-02
  8 KSP Residual norm 1.247226090546e-02
Residual norm 2.19965e-05

Summary of Memory Usage in PETSc
Maximum (over computational time) process memory:        total 9.7412e+09 max 1.6999e+09 min 1.5368e+09
Current process memory:                                  total 8.1640e+09 max 1.4359e+09 min 1.2733e+09
Maximum (over computational time) space PetscMalloc()ed: total 7.7661e+08 max 1.3401e+08 min 1.1148e+08
Current space PetscMalloc()ed:                           total 1.8356e+06 max 3.0594e+05 min 3.0594e+05

************************************************************************************************************************
***             WIDEN YOUR WINDOW TO 120 CHARACTERS.  Use 'enscript -r -fCourier9' to print this document            ***
************************************************************************************************************************

---------------------------------------------- PETSc Performance Summary: ----------------------------------------------

./ex45 on a arch-linux2-c-opt named sqg2b13.bullx with 6 processors, by kxc07-lxm25 Fri Dec 3 12:03:20 2021
Using Petsc Development GIT revision: v3.16.1-353-g887dddf386  GIT Date: 2021-11-19 20:24:41 +0000

                         Max       Max/Min     Avg       Total
Time (sec):           3.918e+01     1.000   3.918e+01
Objects:              2.300e+01     1.000   2.300e+01
Flop:                 5.478e+08     1.009   5.461e+08  3.277e+09
Flop/sec:             1.398e+07     1.009   1.394e+07  8.364e+07
Memory:               1.340e+08     1.202   1.294e+08  7.766e+08
MPI Messages:         0.000e+00     0.000   0.000e+00  0.000e+00
MPI Message Lengths:  0.000e+00     0.000   0.000e+00  0.000e+00
MPI Reductions:       7.200e+01     1.000

Flop counting convention: 1 flop = 1 real number operation of type (multiply/divide/add/subtract)
                          e.g., VecAXPY() for real vectors of length N --> 2N flop
                          and VecAXPY() for complex vectors of length N --> 8N flop

Summary of Stages:   ----- Time ------  ----- Flop ------  --- Messages ---  -- Message Lengths --  -- Reductions --
                        Avg     %Total     Avg     %Total    Count   %Total     Avg         %Total    Count   %Total
 0:      Main Stage: 3.9179e+01 100.0%  3.2768e+09 100.0%  0.000e+00   0.0%  0.000e+00        0.0%  5.400e+01  75.0%

------------------------------------------------------------------------------------------------------------------------
See the 'Profiling' chapter of the users' manual for details on interpreting output.
Phase summary info:
   Count: number of times phase was executed
   Time and Flop: Max - maximum over all processors
                  Ratio - ratio of maximum to minimum over all processors
   Mess: number of messages sent
   AvgLen: average message length (bytes)
   Reduct: number of global reductions
   Global: entire computation
   Stage: stages of a computation. Set stages with PetscLogStagePush() and PetscLogStagePop().
      %T - percent time in this phase          %F - percent flop in this phase
      %M - percent messages in this phase      %L - percent message lengths in this phase
      %R - percent reductions in this phase
   Total Mflop/s: 10e-6 * (sum of flop over all processors)/(max time over all processors)
   GPU Mflop/s: 10e-6 * (sum of flop on GPU over all processors)/(max GPU time over all processors)
   CpuToGpu Count: total number of CPU to GPU copies per processor
   CpuToGpu Size (Mbytes): 10e-6 * (total size of CPU to GPU copies per processor)
   GpuToCpu Count: total number of GPU to CPU copies per processor
   GpuToCpu Size (Mbytes): 10e-6 * (total size of GPU to CPU copies per processor)
   GPU %F: percent flops on GPU in this event
------------------------------------------------------------------------------------------------------------------------
Event                Count      Time (sec)       Flop                               --- Global ---  --- Stage ----  Total    GPU    - CpuToGpu -   - GpuToCpu - GPU
                   Max Ratio   Max      Ratio   Max  Ratio  Mess   AvgLen  Reduct  %T %F %M %L %R  %T %F %M %L %R Mflop/s Mflop/s Count   Size   Count   Size  %F
---------------------------------------------------------------------------------------------------------------------------------------------------------------

--- Event Stage 0: Main Stage

BuildTwoSided          1 1.0 4.3384e-01 9319.4 0.00e+00 0.0 0.0e+00 0.0e+00 1.0e+00  1  0  0  0  1   1  0  0  0  2      0       0      0 0.00e+00    0 0.00e+00  0
BuildTwoSidedF         1 1.0 4.3386e-01 7318.8 0.00e+00 0.0 0.0e+00 0.0e+00 1.0e+00  1  0  0  0  1   1  0  0  0  2      0       0      0 0.00e+00    0 0.00e+00  0
MatMult                9 1.0 1.7409e-02    1.1 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0      0       0      0 0.00e+00    0 0.00e+00  0
MatAssemblyBegin       1 1.0 4.3390e-01 4536.4 0.00e+00 0.0 0.0e+00 0.0e+00 1.0e+00  1  0  0  0  1   1  0  0  0  2      0       0      0 0.00e+00    0 0.00e+00  0
MatAssemblyEnd         1 1.0 1.8470e+00    1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  5  0  0  0  0   5  0  0  0  0      0       0      0 0.00e+00    0 0.00e+00  0
KSPSetUp               1 1.0 1.2262e-01    1.9 0.00e+00 0.0 0.0e+00 0.0e+00 6.0e+00  0  0  0  0  8   0  0  0  0 11      0       0      0 0.00e+00    0 0.00e+00  0
KSPSolve               1 1.0 7.1338e-01    1.6 5.26e+08 1.0 0.0e+00 0.0e+00 2.5e+01  1 96  0  0 35   1 96  0  0 46   4410  363268      1 4.37e+01    0 0.00e+00 100
DMCreateMat            1 1.0 1.3376e+00    1.0 0.00e+00 0.0 0.0e+00 0.0e+00 2.0e+00  3  0  0  0  3   3  0  0  0  4      0       0      0 0.00e+00    0 0.00e+00  0
SFSetGraph             1 1.0 5.8724e-03    1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0      0       0      0 0.00e+00    0 0.00e+00  0
VecTDot               16 1.0 4.8565e-03    1.4 1.75e+08 1.0 0.0e+00 0.0e+00 1.6e+01  0 32  0  0 22   0 32  0  0 30 215911  483839      0 0.00e+00    0 0.00e+00 100
VecNorm               10 1.0 3.7066e-03    2.4 1.10e+08 1.0 0.0e+00 0.0e+00 1.0e+01  0 20  0  0 14   0 20  0  0 19 176808  528830      0 0.00e+00    0 0.00e+00 100
VecCopy                2 1.0 8.3062e-03    1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0      0       0      0 0.00e+00    0 0.00e+00  0
VecSet                16 1.0 2.6356e-03    1.8 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0      0       0      0 0.00e+00    0 0.00e+00  0
VecAXPY               17 1.0 6.6446e-03    1.0 1.86e+08 1.0 0.0e+00 0.0e+00 0.0e+00  0 34  0  0  0   0 34  0  0  0 167671  385598      1 4.37e+01    0 0.00e+00 100
VecAYPX                7 1.0 2.0584e-03    1.0 7.67e+07 1.0 0.0e+00 0.0e+00 0.0e+00  0 14  0  0  0   0 14  0  0  0 222868  231525      0 0.00e+00    0 0.00e+00 100
VecCUDACopyTo          2 1.0 7.1194e-03    1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0      0       0      2 8.74e+01    0 0.00e+00  0
PCSetUp                1 1.0 3.3350e+01    1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 85  0  0  0  0  85  0  0  0  0      0       0      0 0.00e+00    0 0.00e+00  0
PCApply                9 1.0 6.7858e-01    1.6 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  1  0  0  0  0   1  0  0  0  0      0       0      1 4.37e+01    0 0.00e+00  0
---------------------------------------------------------------------------------------------------------------------------------------------------------------

Memory usage is given in bytes:

Object Type          Creations   Destructions     Memory  Descendants' Mem.
Reports information only for process 0.

--- Event Stage 0: Main Stage

       Krylov Solver     1              1         1672     0.
     DMKSP interface     1              1          664     0.
              Matrix     1              1         3008     0.
    Distributed Mesh     1              1         5560     0.
           Index Set     2              2     22257168     0.
   IS L to G Mapping     1              1     22257320     0.
   Star Forest Graph     3              3         3672     0.
     Discrete System     1              1          904     0.
           Weak Form     1              1          624     0.
              Vector     8              8    262977408     0.
      Preconditioner     1              1         1512     0.
              Viewer     2              1          848     0.
========================================================================================================================
Average time to get PetscTime(): 2.65e-08
Average time for MPI_Barrier(): 5.0166e-06
Average time for zero size MPI_Send(): 4.92433e-06
#PETSc Option Table entries:
-da_grid_x 320
-da_grid_y 320
-da_grid_z 320
-dm_mat_type hypre
-dm_vec_type cuda
-ksp_monitor
-ksp_type cg
-log_view
-malloc_dump
-malloc_log
-memory_view
-pc_hypre_type boomeramg
-pc_type hypre
#End of PETSc Option Table entries
Compiled without FORTRAN kernels
Compiled with full precision matrices (default)
sizeof(short) 2 sizeof(int) 4 sizeof(long) 8 sizeof(void*) 8 sizeof(PetscScalar) 8 sizeof(PetscInt) 4
Configure options: --with-debugging=0 --with-blaslapack-dir=/lustre/scafellpike/local/apps/intel/intel_cs/2018.0.128/mkl --with-cuda=1 --with-cuda-arch=70 --download-hypre=yes --download-hypre-configure-arguments="--with-cuda=yes --enable-gpu-profiling=yes --enable-cusparse=yes --enable-cublas=yes --enable-curand=yes --enable-unified-memory=yes HYPRE_CUDA_SM=70" --with-shared-libraries=1 --known-mpi-shared-libraries=1 --with-cc=mpicc --with-cxx=mpicxx -with-fc=mpif90
-----------------------------------------
Libraries compiled on 2021-12-02 20:26:42 on hcxlogin3
Machine characteristics: Linux-3.10.0-1127.el7.x86_64-x86_64-with-redhat-7.8-Maipo
Using PETSc directory: /lustre/scafellpike/local/HT04048/lxm25/kxc07-lxm25/petsc-main/petsc
Using PETSc arch: arch-linux2-c-opt
-----------------------------------------
Using C compiler: mpicc -fPIC -Wall -Wwrite-strings -Wno-strict-aliasing -Wno-unknown-pragmas -fstack-protector -fvisibility=hidden -g -O
Using Fortran compiler: mpif90 -fPIC -Wall -ffree-line-length-0 -Wno-unused-dummy-argument -g -O
-----------------------------------------
Using include paths: -I/lustre/scafellpike/local/HT04048/lxm25/kxc07-lxm25/petsc-main/petsc/include -I/lustre/scafellpike/local/HT04048/lxm25/kxc07-lxm25/petsc-main/petsc/arch-linux2-c-opt/include -I/lustre/scafellpike/local/apps/intel/intel_cs/2018.0.128/mkl/include -I/lustre/scafellpike/local/apps/cuda/11.2/include
-----------------------------------------
Using C linker: mpicc
Using Fortran linker: mpif90
Using libraries: -Wl,-rpath,/lustre/scafellpike/local/HT04048/lxm25/kxc07-lxm25/petsc-main/petsc/arch-linux2-c-opt/lib -L/lustre/scafellpike/local/HT04048/lxm25/kxc07-lxm25/petsc-main/petsc/arch-linux2-c-opt/lib -lpetsc -Wl,-rpath,/lustre/scafellpike/local/HT04048/lxm25/kxc07-lxm25/petsc-main/petsc/arch-linux2-c-opt/lib -L/lustre/scafellpike/local/HT04048/lxm25/kxc07-lxm25/petsc-main/petsc/arch-linux2-c-opt/lib -Wl,-rpath,/lustre/scafellpike/local/apps/intel/intel_cs/2018.0.128/mkl/lib/intel64 -L/lustre/scafellpike/local/apps/intel/intel_cs/2018.0.128/mkl/lib/intel64 -Wl,-rpath,/lustre/scafellpike/local/apps/cuda/11.2/lib64 -L/lustre/scafellpike/local/apps/cuda/11.2/lib64
-L/lustre/scafellpike/local/apps/cuda/11.2/lib64/stubs -Wl,-rpath,/lustre/scafellpike/local/apps/gcc7/openmpi/4.0.4-cuda11.2/lib -L/lustre/scafellpike/local/apps/gcc7/openmpi/4.0.4-cuda11.2/lib -Wl,-rpath,/opt/lsf/10.1/linux3.10-glibc2.17-x86_64/lib -L/opt/lsf/10.1/linux3.10-glibc2.17-x86_64/lib -Wl,-rpath,/lustre/scafellpike/local/apps/gcc7/gcc/7.2.0/lib/gcc/x86_64-pc-linux-gnu/7.2.0 -L/lustre/scafellpike/local/apps/gcc7/gcc/7.2.0/lib/gcc/x86_64-pc-linux-gnu/7.2.0 -Wl,-rpath,/lustre/scafellpike/local/apps/gcc7/gcc/7.2.0/lib/gcc -L/lustre/scafellpike/local/apps/gcc7/gcc/7.2.0/lib/gcc -Wl,-rpath,/lustre/scafellpike/local/apps/gcc7/gcc/7.2.0/lib64 -L/lustre/scafellpike/local/apps/gcc7/gcc/7.2.0/lib64 -Wl,-rpath,/lustre/scafellpike/local/apps/gcc7/gcc/7.2.0/lib -L/lustre/scafellpike/local/apps/gcc7/gcc/7.2.0/lib -lHYPRE -lmkl_intel_lp64 -lmkl_core -lmkl_sequential -lpthread -lm -lcudart -lcufft -lcublas -lcusparse -lcusolver -lcurand -lcuda -lX11 -lstdc++ -ldl -lmpi_usempi_ignore_tkr -lmpi_mpifh -lmpi -lgfortran -lm -lgfortran -lm -lgcc_s -lquadmath -lstdc++ -ldl -----------------------------------------