Hello,

I am able to run hypre on GPUs successfully, but the problem seems to consume a lot of memory. I ran ksp/ksp/tutorials/ex45 on a 320 x 320 x 320 grid using 6 GPUs with the following options:


mpirun -n 6 ./ex45 -da_grid_x 320 -da_grid_y 320 -da_grid_z 320 -dm_mat_type hypre -dm_vec_type cuda -ksp_type cg -pc_type hypre -pc_hypre_type boomeramg -ksp_monitor -log_view -malloc_dump -memory_view -malloc_log
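
For scale, here is a rough back-of-envelope estimate of the footprint of the assembled operator for this grid. This is only my own estimate (it assumes ex45's 7-point finite-difference stencil, CSR-like storage with 8-byte values and 4-byte column indices as in the build below, and a handful of CG work vectors), not anything reported by PETSc:

#include <stdio.h>

int main(void)
{
  const double n      = 320.0 * 320.0 * 320.0;   /* unknowns from -da_grid_{x,y,z} 320          */
  const double nnz    = 7.0 * n;                 /* assumed ~7 nonzeros per row (7-point stencil) */
  const double mat_gb = nnz * (8.0 + 4.0) / 1e9; /* 8-byte value + 4-byte column index per nonzero */
  const double vec_gb = 6.0 * n * 8.0 / 1e9;     /* assumed ~6 work vectors of doubles            */
  printf("unknowns %.3e, matrix ~%.2f GB, vectors ~%.2f GB\n", n, mat_gb, vec_gb);
  return 0;
}

That comes out to roughly 2.8 GB for the matrix and 1.6 GB for the vectors, well within the 16 GB (or 32 GB) of a single V100, before whatever hypre needs for the BoomerAMG hierarchy.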



From the -log_view output (also attached) I get the following memory consumption:

Summary of Memory Usage in PETSc
Maximum (over computational time) process memory:        total 9.7412e+09 max 1.6999e+09 min 1.5368e+09
Current process memory:                                  total 8.1640e+09 max 1.4359e+09 min 1.2733e+09
Maximum (over computational time) space PetscMalloc()ed: total 7.7661e+08 max 1.3401e+08 min 1.1148e+08
Current space PetscMalloc()ed:                           total 1.8356e+06 max 3.0594e+05 min 3.0594e+05

Each GPU is an NVIDIA Tesla V100. Even with 4 GPUs the run fails with a CUDA memory allocation error for the above problem. From the memory output listed above, I believe the problem should fit on a single GPU. Is the memory used by hypre not included in the numbers above?
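
In case it is useful for the discussion, below is a minimal sketch (my own, not a PETSc or hypre routine; ReportDeviceMemory is a hypothetical helper) of how I could print per-rank device memory with the CUDA runtime API, to see directly what ends up on each GPU:

#include <stdio.h>
#include <mpi.h>
#include <cuda_runtime.h>

/* Hypothetical helper: print used/total memory on the device visible to this
 * rank.  These are device-wide numbers, so ranks sharing a GPU report the
 * same totals. */
static void ReportDeviceMemory(MPI_Comm comm, const char *label)
{
  size_t freeB = 0, totalB = 0;
  int    rank  = 0;

  cudaMemGetInfo(&freeB, &totalB);
  MPI_Comm_rank(comm, &rank);
  printf("[rank %d] %s: %.2f GB used of %.2f GB\n", rank, label,
         (double)(totalB - freeB) / 1e9, (double)totalB / 1e9);
}

Calling something like this before and after KSPSetUp() and KSPSolve() in a copy of ex45 would show how much device memory the BoomerAMG setup itself takes, which I cannot see from -memory_view.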

Best,
Karthik.


  0 KSP Residual norm 3.354654370474e+03 
  1 KSP Residual norm 1.369260898558e+03 
  2 KSP Residual norm 4.509282508695e+02 
  3 KSP Residual norm 7.819563394025e+01 
  4 KSP Residual norm 1.741266244858e+01 
  5 KSP Residual norm 3.208614741531e+00 
  6 KSP Residual norm 4.495268736218e-01 
  7 KSP Residual norm 6.305590303007e-02 
  8 KSP Residual norm 1.247226090546e-02 
Residual norm 2.19965e-05
Summary of Memory Usage in PETSc
Maximum (over computational time) process memory:        total 9.7412e+09 max 1.6999e+09 min 1.5368e+09
Current process memory:                                  total 8.1640e+09 max 1.4359e+09 min 1.2733e+09
Maximum (over computational time) space PetscMalloc()ed: total 7.7661e+08 max 1.3401e+08 min 1.1148e+08
Current space PetscMalloc()ed:                           total 1.8356e+06 max 3.0594e+05 min 3.0594e+05
************************************************************************************************************************
***             WIDEN YOUR WINDOW TO 120 CHARACTERS.  Use 'enscript -r -fCourier9' to print this document            ***
************************************************************************************************************************

---------------------------------------------- PETSc Performance Summary: ----------------------------------------------

./ex45 on a arch-linux2-c-opt named sqg2b13.bullx with 6 processors, by kxc07-lxm25 Fri Dec  3 12:03:20 2021
Using Petsc Development GIT revision: v3.16.1-353-g887dddf386  GIT Date: 2021-11-19 20:24:41 +0000

                         Max       Max/Min     Avg       Total
Time (sec):           3.918e+01     1.000   3.918e+01
Objects:              2.300e+01     1.000   2.300e+01
Flop:                 5.478e+08     1.009   5.461e+08  3.277e+09
Flop/sec:             1.398e+07     1.009   1.394e+07  8.364e+07
Memory:               1.340e+08     1.202   1.294e+08  7.766e+08
MPI Messages:         0.000e+00     0.000   0.000e+00  0.000e+00
MPI Message Lengths:  0.000e+00     0.000   0.000e+00  0.000e+00
MPI Reductions:       7.200e+01     1.000

Flop counting convention: 1 flop = 1 real number operation of type (multiply/divide/add/subtract)
                            e.g., VecAXPY() for real vectors of length N --> 2N flop
                            and VecAXPY() for complex vectors of length N --> 8N flop

Summary of Stages:   ----- Time ------  ----- Flop ------  --- Messages ---  -- Message Lengths --  -- Reductions --
                        Avg     %Total     Avg     %Total    Count   %Total     Avg         %Total    Count   %Total
 0:      Main Stage: 3.9179e+01 100.0%  3.2768e+09 100.0%  0.000e+00   0.0%  0.000e+00        0.0%  5.400e+01  75.0%

------------------------------------------------------------------------------------------------------------------------
See the 'Profiling' chapter of the users' manual for details on interpreting output.
Phase summary info:
   Count: number of times phase was executed
   Time and Flop: Max - maximum over all processors
                  Ratio - ratio of maximum to minimum over all processors
   Mess: number of messages sent
   AvgLen: average message length (bytes)
   Reduct: number of global reductions
   Global: entire computation
   Stage: stages of a computation. Set stages with PetscLogStagePush() and PetscLogStagePop().
      %T - percent time in this phase         %F - percent flop in this phase
      %M - percent messages in this phase     %L - percent message lengths in this phase
      %R - percent reductions in this phase
   Total Mflop/s: 10e-6 * (sum of flop over all processors)/(max time over all processors)
   GPU Mflop/s: 10e-6 * (sum of flop on GPU over all processors)/(max GPU time over all processors)
   CpuToGpu Count: total number of CPU to GPU copies per processor
   CpuToGpu Size (Mbytes): 10e-6 * (total size of CPU to GPU copies per processor)
   GpuToCpu Count: total number of GPU to CPU copies per processor
   GpuToCpu Size (Mbytes): 10e-6 * (total size of GPU to CPU copies per processor)
   GPU %F: percent flops on GPU in this event
------------------------------------------------------------------------------------------------------------------------
Event                Count      Time (sec)     Flop                              --- Global ---  --- Stage ----  Total   GPU    - CpuToGpu -   - GpuToCpu - GPU
                   Max Ratio  Max     Ratio   Max  Ratio  Mess   AvgLen  Reduct  %T %F %M %L %R  %T %F %M %L %R Mflop/s Mflop/s Count   Size   Count   Size  %F
---------------------------------------------------------------------------------------------------------------------------------------------------------------

--- Event Stage 0: Main Stage

BuildTwoSided          1 1.0 4.3384e-019319.4 0.00e+00 0.0 0.0e+00 0.0e+00 1.0e+00  1  0  0  0  1   1  0  0  0  2     0       0      0 0.00e+00    0 0.00e+00  0
BuildTwoSidedF         1 1.0 4.3386e-017318.8 0.00e+00 0.0 0.0e+00 0.0e+00 1.0e+00  1  0  0  0  1   1  0  0  0  2     0       0      0 0.00e+00    0 0.00e+00  0
MatMult                9 1.0 1.7409e-02 1.1 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0       0      0 0.00e+00    0 0.00e+00  0
MatAssemblyBegin       1 1.0 4.3390e-014536.4 0.00e+00 0.0 0.0e+00 0.0e+00 1.0e+00  1  0  0  0  1   1  0  0  0  2     0       0      0 0.00e+00    0 0.00e+00  0
MatAssemblyEnd         1 1.0 1.8470e+00 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  5  0  0  0  0   5  0  0  0  0     0       0      0 0.00e+00    0 0.00e+00  0
KSPSetUp               1 1.0 1.2262e-01 1.9 0.00e+00 0.0 0.0e+00 0.0e+00 6.0e+00  0  0  0  0  8   0  0  0  0 11     0       0      0 0.00e+00    0 0.00e+00  0
KSPSolve               1 1.0 7.1338e-01 1.6 5.26e+08 1.0 0.0e+00 0.0e+00 2.5e+01  1 96  0  0 35   1 96  0  0 46  4410   363268      1 4.37e+01    0 0.00e+00 100
DMCreateMat            1 1.0 1.3376e+00 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 2.0e+00  3  0  0  0  3   3  0  0  0  4     0       0      0 0.00e+00    0 0.00e+00  0
SFSetGraph             1 1.0 5.8724e-03 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0       0      0 0.00e+00    0 0.00e+00  0
VecTDot               16 1.0 4.8565e-03 1.4 1.75e+08 1.0 0.0e+00 0.0e+00 1.6e+01  0 32  0  0 22   0 32  0  0 30 215911   483839      0 0.00e+00    0 0.00e+00 100
VecNorm               10 1.0 3.7066e-03 2.4 1.10e+08 1.0 0.0e+00 0.0e+00 1.0e+01  0 20  0  0 14   0 20  0  0 19 176808   528830      0 0.00e+00    0 0.00e+00 100
VecCopy                2 1.0 8.3062e-03 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0       0      0 0.00e+00    0 0.00e+00  0
VecSet                16 1.0 2.6356e-03 1.8 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0       0      0 0.00e+00    0 0.00e+00  0
VecAXPY               17 1.0 6.6446e-03 1.0 1.86e+08 1.0 0.0e+00 0.0e+00 0.0e+00  0 34  0  0  0   0 34  0  0  0 167671   385598      1 4.37e+01    0 0.00e+00 100
VecAYPX                7 1.0 2.0584e-03 1.0 7.67e+07 1.0 0.0e+00 0.0e+00 0.0e+00  0 14  0  0  0   0 14  0  0  0 222868   231525      0 0.00e+00    0 0.00e+00 100
VecCUDACopyTo          2 1.0 7.1194e-03 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0       0      2 8.74e+01    0 0.00e+00  0
PCSetUp                1 1.0 3.3350e+01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 85  0  0  0  0  85  0  0  0  0     0       0      0 0.00e+00    0 0.00e+00  0
PCApply                9 1.0 6.7858e-01 1.6 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  1  0  0  0  0   1  0  0  0  0     0       0      1 4.37e+01    0 0.00e+00  0
---------------------------------------------------------------------------------------------------------------------------------------------------------------

Memory usage is given in bytes:

Object Type          Creations   Destructions     Memory  Descendants' Mem.
Reports information only for process 0.

--- Event Stage 0: Main Stage

       Krylov Solver     1              1         1672     0.
     DMKSP interface     1              1          664     0.
              Matrix     1              1         3008     0.
    Distributed Mesh     1              1         5560     0.
           Index Set     2              2     22257168     0.
   IS L to G Mapping     1              1     22257320     0.
   Star Forest Graph     3              3         3672     0.
     Discrete System     1              1          904     0.
           Weak Form     1              1          624     0.
              Vector     8              8    262977408     0.
      Preconditioner     1              1         1512     0.
              Viewer     2              1          848     0.
========================================================================================================================
Average time to get PetscTime(): 2.65e-08
Average time for MPI_Barrier(): 5.0166e-06
Average time for zero size MPI_Send(): 4.92433e-06
#PETSc Option Table entries:
-da_grid_x 320
-da_grid_y 320
-da_grid_z 320
-dm_mat_type hypre
-dm_vec_type cuda
-ksp_monitor
-ksp_type cg
-log_view
-malloc_dump
-malloc_log
-memory_view
-pc_hypre_type boomeramg
-pc_type hypre
#End of PETSc Option Table entries
Compiled without FORTRAN kernels
Compiled with full precision matrices (default)
sizeof(short) 2 sizeof(int) 4 sizeof(long) 8 sizeof(void*) 8 sizeof(PetscScalar) 8 sizeof(PetscInt) 4
Configure options: --with-debugging=0 --with-blaslapack-dir=/lustre/scafellpike/local/apps/intel/intel_cs/2018.0.128/mkl --with-cuda=1 --with-cuda-arch=70 --download-hypre=yes --download-hypre-configure-arguments="--with-cuda=yes --enable-gpu-profiling=yes --enable-cusparse=yes --enable-cublas=yes --enable-curand=yes --enable-unified-memory=yes HYPRE_CUDA_SM=70" --with-shared-libraries=1 --known-mpi-shared-libraries=1 --with-cc=mpicc --with-cxx=mpicxx -with-fc=mpif90
-----------------------------------------
Libraries compiled on 2021-12-02 20:26:42 on hcxlogin3 
Machine characteristics: Linux-3.10.0-1127.el7.x86_64-x86_64-with-redhat-7.8-Maipo
Using PETSc directory: /lustre/scafellpike/local/HT04048/lxm25/kxc07-lxm25/petsc-main/petsc
Using PETSc arch: arch-linux2-c-opt
-----------------------------------------

Using C compiler: mpicc  -fPIC -Wall -Wwrite-strings -Wno-strict-aliasing -Wno-unknown-pragmas -fstack-protector -fvisibility=hidden -g -O
Using Fortran compiler: mpif90  -fPIC -Wall -ffree-line-length-0 -Wno-unused-dummy-argument -g -O
-----------------------------------------

Using include paths: -I/lustre/scafellpike/local/HT04048/lxm25/kxc07-lxm25/petsc-main/petsc/include -I/lustre/scafellpike/local/HT04048/lxm25/kxc07-lxm25/petsc-main/petsc/arch-linux2-c-opt/include -I/lustre/scafellpike/local/apps/intel/intel_cs/2018.0.128/mkl/include -I/lustre/scafellpike/local/apps/cuda/11.2/include
-----------------------------------------

Using C linker: mpicc
Using Fortran linker: mpif90
Using libraries:
-Wl,-rpath,/lustre/scafellpike/local/HT04048/lxm25/kxc07-lxm25/petsc-main/petsc/arch-linux2-c-opt/lib
-L/lustre/scafellpike/local/HT04048/lxm25/kxc07-lxm25/petsc-main/petsc/arch-linux2-c-opt/lib -lpetsc
-Wl,-rpath,/lustre/scafellpike/local/HT04048/lxm25/kxc07-lxm25/petsc-main/petsc/arch-linux2-c-opt/lib
-L/lustre/scafellpike/local/HT04048/lxm25/kxc07-lxm25/petsc-main/petsc/arch-linux2-c-opt/lib
-Wl,-rpath,/lustre/scafellpike/local/apps/intel/intel_cs/2018.0.128/mkl/lib/intel64
-L/lustre/scafellpike/local/apps/intel/intel_cs/2018.0.128/mkl/lib/intel64
-Wl,-rpath,/lustre/scafellpike/local/apps/cuda/11.2/lib64
-L/lustre/scafellpike/local/apps/cuda/11.2/lib64
-L/lustre/scafellpike/local/apps/cuda/11.2/lib64/stubs
-Wl,-rpath,/lustre/scafellpike/local/apps/gcc7/openmpi/4.0.4-cuda11.2/lib
-L/lustre/scafellpike/local/apps/gcc7/openmpi/4.0.4-cuda11.2/lib
-Wl,-rpath,/opt/lsf/10.1/linux3.10-glibc2.17-x86_64/lib
-L/opt/lsf/10.1/linux3.10-glibc2.17-x86_64/lib
-Wl,-rpath,/lustre/scafellpike/local/apps/gcc7/gcc/7.2.0/lib/gcc/x86_64-pc-linux-gnu/7.2.0
-L/lustre/scafellpike/local/apps/gcc7/gcc/7.2.0/lib/gcc/x86_64-pc-linux-gnu/7.2.0
-Wl,-rpath,/lustre/scafellpike/local/apps/gcc7/gcc/7.2.0/lib/gcc
-L/lustre/scafellpike/local/apps/gcc7/gcc/7.2.0/lib/gcc
-Wl,-rpath,/lustre/scafellpike/local/apps/gcc7/gcc/7.2.0/lib64
-L/lustre/scafellpike/local/apps/gcc7/gcc/7.2.0/lib64
-Wl,-rpath,/lustre/scafellpike/local/apps/gcc7/gcc/7.2.0/lib
-L/lustre/scafellpike/local/apps/gcc7/gcc/7.2.0/lib
-lHYPRE -lmkl_intel_lp64 -lmkl_core -lmkl_sequential -lpthread -lm -lcudart -lcufft -lcublas -lcusparse -lcusolver -lcurand -lcuda -lX11 -lstdc++ -ldl -lmpi_usempi_ignore_tkr -lmpi_mpifh -lmpi -lgfortran -lm -lgfortran -lm -lgcc_s -lquadmath -lstdc++ -ldl
-----------------------------------------
