Another small question -- I'm a little confused about timing GPU codes with PETSc. I have a code that looks like:

```
start = now()
for (int i = 0; i < 10; i++) {
  MatMult(A, x, y);
}
end = now()
print((end - start) / 10)
```
If I run this program with `-vec_type cuda -mat_type aijcusparse`, the GPUs are indeed utilized, but the recorded time is very tiny (I imagine it is just tracking the cost of launching CUDA kernels). However, if I add `-log_view` to the command line arguments, then the resulting time printed matches what is recorded by `nvprof`. What is the correct way to benchmark PETSc with GPUs without having `-log_view` turned on?

Thanks,

Rohan

On Sat, Jan 15, 2022 at 7:37 AM Barry Smith <bsm...@petsc.dev> wrote:

>
> Oh yes, you are correct for this operation since the handling of
> different nonzero pattern is not trivial to implement well for the GPU.
>
> On Jan 15, 2022, at 1:17 AM, Rohan Yadav <roh...@alumni.cmu.edu> wrote:
>
> Scanning the source code for mpiseqaijcusparse confirms my thoughts --
> when used with DIFFERENT_NONZERO_PATTERN, it falls back to calling
> MatAXPY_SeqAIJ, copying the data back over to the host.
>
> Rohan
>
> On Fri, Jan 14, 2022 at 10:16 PM Rohan Yadav <roh...@alumni.cmu.edu>
> wrote:
>
>>
>>
>> ---------- Forwarded message ---------
>> From: Rohan Yadav <roh...@alumni.cmu.edu>
>> Date: Fri, Jan 14, 2022 at 10:03 PM
>> Subject: Re: [petsc-dev] Using PETSC with GPUs
>> To: Barry Smith <bsm...@petsc.dev>
>>
>>
>> Ok, I'll try looking with greps like that and see what I find.
>>
>> > My guess why your code is not using the seqaijcusparse is that you are
>> not setting the type before you call MatLoad() hence it loads with SeqAIJ.
>> -mat_type does not magically change a type once a matrix has a set type. I
>> agree our documentation on how to make objects be GPU objects is horrible
>> now.
>>
>> I printed out my matrices with the PetscViewer objects and can confirm
>> that the type is seqaijcusparse. Perhaps for the way I'm using it
>> (DIFFERENT_NONZERO_PATTERN) the kernel is unsupported? I'm not sure how to
>> get any more diagnostic info about why the CUDA kernel isn't called...
>>
>> Rohan
>>
>> On Fri, Jan 14, 2022 at 9:46 PM Barry Smith <bsm...@petsc.dev> wrote:
>>
>>>
>>> This changes rapidly and depends on whether the backend is CUDA, HIP, SYCL,
>>> or Kokkos. The only way to find out definitively is with, for example,
>>>
>>> git grep MatMult_ | egrep -i "(cusparse|cublas|cuda)"
>>>
>>>
>>> Because of our, unfortunately, earlier naming choices you need to kind
>>> of know what to grep for; for CUDA it may be cuSPARSE or cuBLAS.
>>>
>>> Not yet merged branches may also have some operations that are still
>>> being developed.
>>>
>>> My guess why your code is not using the seqaijcusparse is that you are
>>> not setting the type before you call MatLoad() hence it loads with SeqAIJ.
>>> -mat_type does not magically change a type once a matrix has a set type. I
>>> agree our documentation on how to make objects be GPU objects is horrible
>>> now.
>>>
>>> Barry
>>>
>>>
>>> On Jan 15, 2022, at 12:31 AM, Rohan Yadav <roh...@alumni.cmu.edu> wrote:
>>>
>>> I was wondering if there is a definitive list for what operations are
>>> and aren't supported for distributed GPU execution. For some operations,
>>> like `MatMult`, it is clear that MPIAIJCUSPARSE implements MatMult from the
>>> documentation, but for other operations it is unclear, such as MatMatMult.
>>> Another scenario is the MatAXPY kernel, which supposedly has a
>>> SeqAIJCUSPARSE implementation, which I take to mean that it can only execute
>>> on a single GPU. However, even if I pass -mat_type seqaijcusparse to the
>>> kernel it doesn't seem to utilize the GPU.
>>>
>>> Rohan
>>>
>>> On Fri, Jan 14, 2022 at 4:05 PM Barry Smith <bsm...@petsc.dev> wrote:
>>>
>>>>
>>>> Just use 1 MPI rank.
>>>>
>>>>
>>>> ------------------------------------------------------------------------------------------------------------------------
>>>> Event               Count       Time (sec)          Flop                                --- Global ---      --- Stage ----      Total     GPU  - CpuToGpu -   - GpuToCpu -  GPU
>>>>                    Max Ratio        Max  Ratio      Max Ratio    Mess  AvgLen  Reduct  %T  %F  %M  %L  %R  %T  %F  %M  %L  %R Mflop/s Mflop/s Count     Size Count     Size  %F
>>>> ---------------------------------------------------------------------------------------------------------------------------------------------------------------
>>>>
>>>> --- Event Stage 0: Main Stage
>>>>
>>>> BuildTwoSided        1   1.0 1.8650e-01 3467.8 0.00e+00   0.0 2.0e+00 4.0e+00 1.0e+00   0   0   3   0   2   0   0   3   0   4       0       0     0 0.00e+00     0 0.00e+00   0
>>>> MatMult             30   1.0 6.6642e+01    1.0 1.16e+10   1.0 6.4e+01 6.4e+08 1.0e+00  65 100  91  93   2  65 100  91  93   4     346       0     0 0.00e+00    31 2.65e+04   0
>>>>
>>>> From this it is clear the matrix never ended up on the GPU, but the
>>>> vector did. For each multiply, it is copying the vector from the GPU to the
>>>> CPU and then doing the MatMult on the CPU. If the MatMult was done on the
>>>> GPU the final number in the row would be 100% indicating all the flops were
>>>> done on the GPU and the fifth from the end value of 0 would be some large
>>>> number, being the flop rate on the GPU.
>>>>
>>>>
>>>>
>>>> On Jan 14, 2022, at 4:59 PM, Rohan Yadav <roh...@alumni.cmu.edu> wrote:
>>>>
>>>> A log_view is attached at the end of the mail.
>>>>
>>>> I am running on a large problem size (639 million nonzeros).
>>>>
>>>> > * I assume you are assembling the matrix on the CPU. The copy of data
>>>> to the GPU takes time and you really should be creating the matrix on the
>>>> GPU
>>>>
>>>> How do I do this? I'm loading the matrix in from a file, but I'm
>>>> running the computation several times (and with a warmup), so I would
>>>> expect that the data is copied onto the GPU the first time.
>>>> My (cpu) code to do this is here:
>>>> https://github.com/rohany/taco/blob/5c0a4f4419ba392838590ce24e0043f632409e7b/petsc/benchmark.cpp#L68
>>>>
>>>> Log view:
>>>>
>>>> ---------------------------------------------- PETSc Performance Summary: ----------------------------------------------
>>>>
>>>> ./bin/benchmark on a named lassen75 with 2 processors, by yadav2 Fri Jan 14 13:54:09 2022
>>>> Using Petsc Release Version 3.16.3, unknown
>>>>
>>>>                          Max       Max/Min     Avg        Total
>>>> Time (sec):           1.026e+02     1.000   1.026e+02
>>>> Objects:              1.200e+01     1.000   1.200e+01
>>>> Flop:                 1.156e+10     1.009   1.151e+10  2.303e+10
>>>> Flop/sec:             1.127e+08     1.009   1.122e+08  2.245e+08
>>>> MPI Messages:         3.500e+01     1.000   3.500e+01  7.000e+01
>>>> MPI Message Lengths:  2.210e+10     1.000   6.313e+08  4.419e+10
>>>> MPI Reductions:       4.100e+01     1.000
>>>>
>>>> Flop counting convention: 1 flop = 1 real number operation of type (multiply/divide/add/subtract)
>>>>                           e.g., VecAXPY() for real vectors of length N --> 2N flop
>>>>                           and VecAXPY() for complex vectors of length N --> 8N flop
>>>>
>>>> Summary of Stages:   ----- Time ------   ----- Flop ------   --- Messages ---   -- Message Lengths --   -- Reductions --
>>>>                         Avg      %Total      Avg      %Total    Count    %Total      Avg        %Total     Count    %Total
>>>>  0:      Main Stage: 1.0257e+02 100.0%   2.3025e+10 100.0%   7.000e+01 100.0%   6.313e+08      100.0%   2.300e+01  56.1%
>>>>
>>>> ------------------------------------------------------------------------------------------------------------------------
>>>> See the 'Profiling' chapter of the users' manual for details on interpreting output.
>>>> Phase summary info:
>>>>    Count: number of times phase was executed
>>>>    Time and Flop: Max - maximum over all processors
>>>>                   Ratio - ratio of maximum to minimum over all processors
>>>>    Mess: number of messages sent
>>>>    AvgLen: average message length (bytes)
>>>>    Reduct: number of global reductions
>>>>    Global: entire computation
>>>>    Stage: stages of a computation.
>>>>           Set stages with PetscLogStagePush() and PetscLogStagePop().
>>>>       %T - percent time in this phase         %F - percent flop in this phase
>>>>       %M - percent messages in this phase     %L - percent message lengths in this phase
>>>>       %R - percent reductions in this phase
>>>>    Total Mflop/s: 10e-6 * (sum of flop over all processors)/(max time over all processors)
>>>>    GPU Mflop/s: 10e-6 * (sum of flop on GPU over all processors)/(max GPU time over all processors)
>>>>    CpuToGpu Count: total number of CPU to GPU copies per processor
>>>>    CpuToGpu Size (Mbytes): 10e-6 * (total size of CPU to GPU copies per processor)
>>>>    GpuToCpu Count: total number of GPU to CPU copies per processor
>>>>    GpuToCpu Size (Mbytes): 10e-6 * (total size of GPU to CPU copies per processor)
>>>>    GPU %F: percent flops on GPU in this event
>>>>
>>>> ------------------------------------------------------------------------------------------------------------------------
>>>> Event               Count       Time (sec)          Flop                                --- Global ---      --- Stage ----      Total     GPU  - CpuToGpu -   - GpuToCpu -  GPU
>>>>                    Max Ratio        Max  Ratio      Max Ratio    Mess  AvgLen  Reduct  %T  %F  %M  %L  %R  %T  %F  %M  %L  %R Mflop/s Mflop/s Count     Size Count     Size  %F
>>>> ---------------------------------------------------------------------------------------------------------------------------------------------------------------
>>>>
>>>> --- Event Stage 0: Main Stage
>>>>
>>>> BuildTwoSided        1   1.0 1.8650e-01 3467.8 0.00e+00   0.0 2.0e+00 4.0e+00 1.0e+00   0   0   3   0   2   0   0   3   0   4       0       0     0 0.00e+00     0 0.00e+00   0
>>>> MatMult             30   1.0 6.6642e+01    1.0 1.16e+10   1.0 6.4e+01 6.4e+08 1.0e+00  65 100  91  93   2  65 100  91  93   4     346       0     0 0.00e+00    31 2.65e+04   0
>>>> MatAssemblyBegin     1   1.0 3.1100e-07    1.1 0.00e+00   0.0 0.0e+00 0.0e+00 0.0e+00   0   0   0   0   0   0   0   0   0   0       0       0     0 0.00e+00     0 0.00e+00   0
>>>> MatAssemblyEnd       1   1.0 1.9798e+01    1.0 0.00e+00   0.0 0.0e+00 0.0e+00 4.0e+00  19   0   0   0  10  19   0   0   0  17       0       0     0 0.00e+00     0 0.00e+00   0
>>>> MatLoad              1   1.0 3.5519e+01    1.0 0.00e+00   0.0 6.0e+00 5.4e+08 1.6e+01  35   0   9   7  39  35   0   9   7  70       0       0     0 0.00e+00     0 0.00e+00   0
>>>> VecSet               5   1.0 5.8959e-02    1.1 0.00e+00   0.0 0.0e+00 0.0e+00 0.0e+00   0   0   0   0   0   0   0   0   0   0       0       0     0 0.00e+00     0 0.00e+00   0
>>>> VecScatterBegin     30   1.0 5.4085e+00    1.0 0.00e+00   0.0 6.4e+01 6.4e+08 1.0e+00   5   0  91  93   2   5   0  91  93   4       0       0     0 0.00e+00     0 0.00e+00   0
>>>> VecScatterEnd       30   1.0 9.2544e+00    2.5 0.00e+00   0.0 0.0e+00 0.0e+00 0.0e+00   6   0   0   0   0   6   0   0   0   0       0       0     0 0.00e+00     0 0.00e+00   0
>>>> VecCUDACopyFrom     31   1.0 4.0174e-01    1.0 0.00e+00   0.0 0.0e+00 0.0e+00 0.0e+00   0   0   0   0   0   0   0   0   0   0       0       0     0 0.00e+00    31 2.65e+04   0
>>>> SFSetGraph           1   1.0 4.4912e-02    1.0 0.00e+00   0.0 0.0e+00 0.0e+00 0.0e+00   0   0   0   0   0   0   0   0   0   0       0       0     0 0.00e+00     0 0.00e+00   0
>>>> SFSetUp              1   1.0 5.2595e+00    1.0 0.00e+00   0.0 4.0e+00 1.7e+08 1.0e+00   5   0   6   2   2   5   0   6   2   4       0       0     0 0.00e+00     0 0.00e+00   0
>>>> SFPack              30   1.0 3.4021e-02    1.0 0.00e+00   0.0 0.0e+00 0.0e+00 0.0e+00   0   0   0   0   0   0   0   0   0   0       0       0     0 0.00e+00     0 0.00e+00   0
>>>> SFUnpack            30   1.0 1.9222e-05    1.5 0.00e+00   0.0 0.0e+00 0.0e+00 0.0e+00   0   0   0   0   0   0   0   0   0   0       0       0     0 0.00e+00     0 0.00e+00   0
>>>>
>>>> ---------------------------------------------------------------------------------------------------------------------------------------------------------------
>>>>
>>>> Memory usage is given in bytes:
>>>>
>>>> Object Type          Creations   Destructions      Memory   Descendants' Mem.
>>>> Reports information only for process 0.
>>>>
>>>> --- Event Stage 0: Main Stage
>>>>
>>>>               Matrix         3              0           0     0.
>>>>               Viewer         2              0           0     0.
>>>>               Vector         4              1        1792     0.
>>>>            Index Set         2              2   335250404     0.
>>>>    Star Forest Graph         1              0           0     0.
>>>>
>>>> ========================================================================================================================
>>>> Average time to get PetscTime(): 3.77e-08
>>>> Average time for MPI_Barrier(): 8.754e-07
>>>> Average time for zero size MPI_Send(): 2.6755e-06
>>>> #PETSc Option Table entries:
>>>> -log_view
>>>> -mat_type aijcusparse
>>>> -matrix /p/gpfs1/yadav2/tensors//petsc/kmer_V1r.petsc
>>>> -n 20
>>>> -vec_type cuda
>>>> -warmup 10
>>>> #End of PETSc Option Table entries
>>>> Compiled without FORTRAN kernels
>>>> Compiled with full precision matrices (default)
>>>> sizeof(short) 2 sizeof(int) 4 sizeof(long) 8 sizeof(void*) 8
>>>> sizeof(PetscScalar) 8 sizeof(PetscInt) 4
>>>> Configure options: --download-c2html=0 --download-hwloc=0
>>>> --download-sowing=0 --prefix=./petsc-install/ --with-64-bit-indices=0
>>>> --with-blaslapack-lib="/usr/tcetmp/packages/lapack/lapack-3.9.0-gcc-7.3.1/lib/liblapack.so
>>>> /usr/tcetmp/packages/lapack/lapack-3.9.0-gcc-7.3.1/lib/libblas.so"
>>>> --with-cc=/usr/tce/packages/spectrum-mpi/spectrum-mpi-rolling-release-gcc-8.3.1/bin/mpigcc
>>>> --with-clanguage=C --with-cxx-dialect=C++17
>>>> --with-cxx=/usr/tce/packages/spectrum-mpi/spectrum-mpi-rolling-release-gcc-8.3.1/bin/mpig++
>>>> --with-cuda=1 --with-debugging=0
>>>> --with-fc=/usr/tce/packages/spectrum-mpi/spectrum-mpi-rolling-release-gcc-8.3.1/bin/mpigfortran
>>>> --with-fftw=0
>>>> --with-hdf5-dir=/usr/tcetmp/packages/petsc/build/3.13.0/spack/opt/spack/linux-rhel7-power9le/xl_r-16.1/hdf5-1.10.6-e7e7urb5k7va3ib7j4uro56grvzmcmd4
>>>> --with-hdf5=1 --with-mumps=0 --with-precision=double --with-scalapack=0
>>>> --with-scalar-type=real --with-shared-libraries=1 --with-ssl=0
>>>> --with-suitesparse=0 --with-trilinos=0 --with-valgrind=0 --with-x=0
>>>> --with-zlib-include=/usr/include --with-zlib-lib=/usr/lib64/libz.so
>>>> --with-zlib=1 CFLAGS="-g -DNoChange" COPTFLAGS="-O3" CXXFLAGS="-O3"
>>>> CXXOPTFLAGS="-O3" FFLAGS=-g CUDAFLAGS=-std=c++17 FOPTFLAGS=
>>>> PETSC_ARCH=arch-linux-c-opt
>>>> -----------------------------------------
>>>> Libraries compiled on 2022-01-14 20:56:04 on lassen99
>>>> Machine characteristics:
>>>> Linux-4.14.0-115.21.2.1chaos.ch6a.ppc64le-ppc64le-with-redhat-7.6-Maipo
>>>> Using PETSc directory: /g/g15/yadav2/taco/petsc/petsc/petsc-install
>>>> Using PETSc arch:
>>>> -----------------------------------------
>>>>
>>>> Using C compiler:
>>>> /usr/tce/packages/spectrum-mpi/spectrum-mpi-rolling-release-gcc-8.3.1/bin/mpigcc
>>>> -g -DNoChange -fPIC "-O3"
>>>> Using Fortran compiler:
>>>> /usr/tce/packages/spectrum-mpi/spectrum-mpi-rolling-release-gcc-8.3.1/bin/mpigfortran
>>>> -g -fPIC
>>>> -----------------------------------------
>>>>
>>>> Using include paths:
>>>> -I/g/g15/yadav2/taco/petsc/petsc/petsc-install/include
>>>> -I/usr/tcetmp/packages/petsc/build/3.13.0/spack/opt/spack/linux-rhel7-power9le/xl_r-16.1/hdf5-1.10.6-e7e7urb5k7va3ib7j4uro56grvzmcmd4/include
>>>> -I/usr/include -I/usr/tce/packages/cuda/cuda-11.1.0/include
>>>> -----------------------------------------
>>>>
>>>> Using C linker:
>>>> /usr/tce/packages/spectrum-mpi/spectrum-mpi-rolling-release-gcc-8.3.1/bin/mpigcc
>>>> Using Fortran linker:
>>>> /usr/tce/packages/spectrum-mpi/spectrum-mpi-rolling-release-gcc-8.3.1/bin/mpigfortran
>>>> Using libraries:
>>>> -Wl,-rpath,/g/g15/yadav2/taco/petsc/petsc/petsc-install/lib
>>>> -L/g/g15/yadav2/taco/petsc/petsc/petsc-install/lib -lpetsc
>>>> -Wl,-rpath,/usr/tcetmp/packages/lapack/lapack-3.9.0-gcc-7.3.1/lib
>>>> -L/usr/tcetmp/packages/lapack/lapack-3.9.0-gcc-7.3.1/lib
>>>> -Wl,-rpath,/usr/tcetmp/packages/petsc/build/3.13.0/spack/opt/spack/linux-rhel7-power9le/xl_r-16.1/hdf5-1.10.6-e7e7urb5k7va3ib7j4uro56grvzmcmd4/lib
>>>> -L/usr/tcetmp/packages/petsc/build/3.13.0/spack/opt/spack/linux-rhel7-power9le/xl_r-16.1/hdf5-1.10.6-e7e7urb5k7va3ib7j4uro56grvzmcmd4/lib
>>>> -Wl,-rpath,/usr/tce/packages/cuda/cuda-11.1.0/lib64
>>>> -L/usr/tce/packages/cuda/cuda-11.1.0/lib64
>>>> -Wl,-rpath,/usr/tce/packages/spectrum-mpi/ibm/spectrum-mpi-rolling-release/lib
>>>> -L/usr/tce/packages/spectrum-mpi/ibm/spectrum-mpi-rolling-release/lib
>>>> -Wl,-rpath,/usr/tce/packages/gcc/gcc-8.3.1/rh/usr/lib/gcc/ppc64le-redhat-linux/8
>>>> -L/usr/tce/packages/gcc/gcc-8.3.1/rh/usr/lib/gcc/ppc64le-redhat-linux/8
>>>> -Wl,-rpath,/usr/tce/packages/gcc/gcc-8.3.1/rh/usr/lib/gcc
>>>> -L/usr/tce/packages/gcc/gcc-8.3.1/rh/usr/lib/gcc
>>>> -Wl,-rpath,/usr/tce/packages/gcc/gcc-8.3.1/rh/usr/lib64
>>>> -L/usr/tce/packages/gcc/gcc-8.3.1/rh/usr/lib64
>>>> -Wl,-rpath,/usr/tce/packages/gcc/gcc-8.3.1/rh/usr/lib
>>>> -L/usr/tce/packages/gcc/gcc-8.3.1/rh/usr/lib -llapack -lblas -lhdf5_hl
>>>> -lhdf5 -lm /usr/lib64/libz.so -lcuda -lcudart -lcufft -lcublas -lcusparse
>>>> -lcusolver -lcurand -lstdc++ -ldl -lmpiprofilesupport -lmpi_ibm_usempi
>>>> -lmpi_ibm_mpifh -lmpi_ibm -lgfortran -lm -lgfortran -lm -lgcc_s -lquadmath
>>>> -lpthread -lquadmath -lstdc++ -ldl
>>>> -----------------------------------------
>>>>
>>>> On Fri, Jan 14, 2022 at 1:43 PM Mark Adams <mfad...@lbl.gov> wrote:
>>>>
>>>>> There are a few things:
>>>>> * GPUs have higher latencies and so you basically need a large
>>>>> enough problem to get GPU speedup
>>>>> * I assume you are assembling the matrix on the CPU. The copy of data
>>>>> to the GPU takes time and you really should be creating the matrix on the
>>>>> GPU
>>>>> * I agree with Barry: roughly 1M / GPU is around where you start
>>>>> seeing a win but this depends on a lot of things.
>>>>> * There are startup costs, like the CPU-GPU copy. It is best to run
>>>>> one mat-vec, or whatever, push a new stage and then run the benchmark. The
>>>>> timing for this new stage will be separate in the log view data. Look at
>>>>> that.
>>>>> - You can fake this by running your benchmark many times to amortize
>>>>> any setup costs.
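The warm-up advice in the list above can be sketched without PETSc. This is a minimal host-only C++ sketch under stated assumptions, not code from this thread: `bench_after_warmup`, `BenchResult`, and `FirstCallSlowOp` are hypothetical names, and the 20 ms sleep only stands in for a one-time cost such as the first CPU-to-GPU copy. In a PETSc program the boundary between the two phases would presumably be marked with the `PetscLogStagePush()` and `PetscLogStagePop()` calls named in the `-log_view` legend.

```cpp
#include <cassert>
#include <chrono>
#include <functional>
#include <thread>

// Timings for the two phases, reported separately (like two log stages).
struct BenchResult {
    double warmup_seconds;  // total time spent in the warm-up calls
    double mean_seconds;    // mean time per call in the measured phase
};

// Hypothetical stand-in for an operation with a one-time setup cost:
// only the first call is slow, every later call returns immediately.
struct FirstCallSlowOp {
    bool first = true;
    void operator()() {
        if (first) {
            std::this_thread::sleep_for(std::chrono::milliseconds(20));
            first = false;
        }
    }
};

// Run `op` `warmup` times, then time `iters` further calls on their own,
// so that one-time costs do not leak into the reported mean.
inline BenchResult bench_after_warmup(const std::function<void()>& op,
                                      int warmup, int iters) {
    using Clock = std::chrono::steady_clock;
    assert(warmup >= 0 && iters > 0);
    const auto t0 = Clock::now();
    for (int i = 0; i < warmup; i++) op();
    const auto t1 = Clock::now();
    for (int i = 0; i < iters; i++) op();
    const auto t2 = Clock::now();
    return {std::chrono::duration<double>(t1 - t0).count(),
            std::chrono::duration<double>(t2 - t1).count() / iters};
}
```

Calling `bench_after_warmup(std::ref(op), 1, 10)` with a `FirstCallSlowOp op` puts the slow first call into `warmup_seconds` and leaves `mean_seconds` reflecting only the steady-state calls. With `warmup` set to 0 the one-time cost is instead spread over the measured calls, which is the "amortize" variant mentioned in the last bullet.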
>>>>>
>>>>> On Fri, Jan 14, 2022 at 4:27 PM Rohan Yadav <roh...@alumni.cmu.edu>
>>>>> wrote:
>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> I'm looking to use PETSc with GPUs to do some linear algebra
>>>>>> operations, like SpMV, SpMM etc. Building PETSc with `--with-cuda=1` and
>>>>>> running with `-mat_type aijcusparse -vec_type cuda` gives me a large
>>>>>> slowdown from the same code running on the CPU. This is not entirely
>>>>>> unexpected, as things like data transfer costs across PCIe might
>>>>>> erroneously be included in my timing. Are there some examples of
>>>>>> benchmarking GPU computations with PETSc, or just the proper way to write
>>>>>> code in PETSc that will work for CPUs and GPUs?
>>>>>>
>>>>>> Rohan
>>>>>>
>>>>>
>>>>
>>>
>
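On the timing question itself: a host-side timer around asynchronous work measures only what has finished by the time the stop timestamp is read. The sketch below reproduces that effect on the CPU alone, as a rough analogy rather than PETSc or CUDA code. `enqueue_all` and `mean_seconds` are hypothetical names, the background thread stands in for a GPU stream, and `stream.wait()` plays the role that a device synchronization (for CUDA, presumably `cudaDeviceSynchronize()`) would play before the timer is stopped. Whether that is the approach PETSc recommends is not settled in this thread.

```cpp
#include <cassert>
#include <chrono>
#include <future>
#include <thread>

using Clock = std::chrono::steady_clock;

// Hypothetical stand-in for a GPU stream: starts `iters` operations of
// `work_ms` milliseconds each on a background thread and returns at once.
inline std::future<void> enqueue_all(int iters, int work_ms) {
    return std::async(std::launch::async, [iters, work_ms] {
        for (int i = 0; i < iters; i++)
            std::this_thread::sleep_for(std::chrono::milliseconds(work_ms));
    });
}

// Mean seconds per operation as seen by a host-side timer. With
// `synchronize` false the stop time is read right after the launch,
// before the background work has finished.
inline double mean_seconds(int iters, int work_ms, bool synchronize) {
    assert(iters > 0);
    const auto start = Clock::now();
    std::future<void> stream = enqueue_all(iters, work_ms);
    if (synchronize) stream.wait();  // wait for all outstanding work
    const auto end = Clock::now();
    stream.wait();                   // always drain before returning
    return std::chrono::duration<double>(end - start).count() / iters;
}
```

With ten operations of 5 ms each, `mean_seconds(10, 5, true)` reports at least about 5 ms per operation, while `mean_seconds(10, 5, false)` reports only the launch overhead. That matches the "very tiny" times seen without `-log_view` at the top of the thread.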