Hi All,

I'm trying to use the MatMatMult function with 1 sparse matrix B and two
dense matrices A, C. I'm computing A = B * C.

My code is below:

```

void spmm(Mat B, int warmup, int niter) {
  Mat A, C;
  PetscInt i, j = 32, k;
  MatGetSize(B, &i, &k);
  MatCreateDenseCUDA(PETSC_COMM_WORLD, PETSC_DECIDE, PETSC_DECIDE, i,
j, NULL, &A);
  MatCreateDenseCUDA(PETSC_COMM_WORLD, PETSC_DECIDE, PETSC_DECIDE, k,
j, NULL, &C);

  // Initialize entries in the output.
  MatZeroEntries(A);
  setMatToConstant(C, 1.0);

  // Finally, do the computation.
  auto avgTime = benchmarkWithWarmup(warmup, niter, [&]() {
    MatMatMult(B, C, MAT_REUSE_MATRIX, PETSC_DEFAULT, &A);
  });
  PetscPrintf(PETSC_COMM_WORLD, "Average time: %lf ms.\n", avgTime * 1000);
}

```

where benchmarkWithWarmup is a simple wrapper function that runs a
lambda several times.


I'm running this function with arguments `-vec_type cuda -mat_type
aijcusparse`, and see that the performance is relatively slow. I'm
wondering if I'm using the API incorrectly, or the computation is
executing as expected. `nvprof` shows that much of the time is spent
in a device to host memcpys:

```
            Type  Time(%)      Time     Calls       Avg       Min
 Max  Name
 GPU activities:   87.32%  11.9978s        33  363.57ms  1.5040us
388.26ms  [CUDA memcpy DtoH]
                    8.71%  1.19611s        30  39.870ms  37.421ms
39.976ms  void cusparse::csrmm_kernel<cusparse::CsrMMPolicy<unsigned
int=128, bool=0, bool=0, unsigned int=8, unsigned int=16, unsigned
int=4, unsigned int=0>, int, double, double, double>(bool=0,
cusparse::csrmm_kernel<cusparse::CsrMMPolicy<unsigned int=128, bool=0,
bool=0, unsigned int=8, unsigned int=16, unsigned int=4, unsigned
int=0>, int, double, double, double>,
cusparse::csrmm_kernel<cusparse::CsrMMPolicy<unsigned int=128, bool=0,
bool=0, unsigned int=8, unsigned int=16, unsigned int=4, unsigned
int=0>, int, double, double, double>, bool=0 const *, bool=0 const ,
bool=0, bool=0, int, cusparseOperation_t,
cusparse::csrmm_kernel<cusparse::CsrMMPolicy<unsigned int=128, bool=0,
bool=0, unsigned int=8, unsigned int=16, unsigned int=4, unsigned
int=0>, int, double, double, double> const *,
cusparse::csrmm_kernel<cusparse::CsrMMPolicy<unsigned int=128, bool=0,
bool=0, unsigned int=8, unsigned int=16, unsigned int=4, unsigned
int=0>, int, double, double, double> const , unsigned int=8 const *,
unsigned int=8 const ,
cusparse::csrmm_kernel<cusparse::CsrMMPolicy<unsigned int=128, bool=0,
bool=0, unsigned int=8, unsigned int=16, unsigned int=4, unsigned
int=0>, int, double, double, double>, unsigned int=16*,
cusparse::csrmm_kernel<cusparse::CsrMMPolicy<unsigned int=128, bool=0,
bool=0, unsigned int=8, unsigned int=16, unsigned int=4, unsigned
int=0>, int, double, double, double>)
                    3.87%  531.56ms        14  37.968ms  1.0240us
227.29ms  [CUDA memcpy HtoD]
                    0.07%  9.7452ms         6  1.6242ms  1.0880us
3.2481ms  [CUDA memset]
                    0.02%  2.8727ms         1  2.8727ms  2.8727ms
2.8727ms  void 
thrust::cuda_cub::core::_kernel_agent<thrust::cuda_cub::__parallel_for::ParallelForAgent<thrust::cuda_cub::__uninitialized_fill::functor<thrust::device_ptr<double>,
double>, unsigned long>,
thrust::cuda_cub::__uninitialized_fill::functor<thrust::device_ptr<double>,
double>, unsigned long>(thrust::device_ptr<double>, double)
                    0.01%  1.4953ms         2  747.67us  56.188us
1.4392ms  void 
thrust::cuda_cub::core::_kernel_agent<thrust::cuda_cub::__parallel_for::ParallelForAgent<thrust::cuda_cub::__uninitialized_fill::functor<thrust::device_ptr<int>,
int>, unsigned long>,
thrust::cuda_cub::__uninitialized_fill::functor<thrust::device_ptr<int>,
int>, unsigned long>(thrust::device_ptr<int>, int)```

The logview output is:

```
---------------------------------------------- PETSc Performance
Summary: ----------------------------------------------

/g/g15/yadav2/taco/petsc/bin/benchmark on a  named lassen457 with 2
processors, by yadav2 Wed Feb  2 17:23:19 2022
Using Petsc Release Version 3.16.3, unknown

                         Max       Max/Min     Avg       Total
Time (sec):           1.163e+02     1.000   1.163e+02
Objects:              4.800e+01     1.000   4.800e+01
Flop:                 6.338e+11     1.065   6.144e+11  1.229e+12
Flop/sec:             5.451e+09     1.065   5.284e+09  1.057e+10
MPI Messages:         3.500e+01     1.000   3.500e+01  7.000e+01
MPI Message Lengths:  2.544e+09     1.000   7.267e+07  5.087e+09
MPI Reductions:       8.100e+01     1.000

Flop counting convention: 1 flop = 1 real number operation of type
(multiply/divide/add/subtract)
                            e.g., VecAXPY() for real vectors of length
N --> 2N flop
                            and VecAXPY() for complex vectors of
length N --> 8N flop

Summary of Stages:   ----- Time ------  ----- Flop ------  ---
Messages ---  -- Message Lengths --  -- Reductions --
                        Avg     %Total     Avg     %Total    Count
%Total     Avg         %Total    Count   %Total
 0:      Main Stage: 1.1628e+02 100.0%  1.2288e+12 100.0%  7.000e+01
100.0%  7.267e+07      100.0%  6.300e+01  77.8%

------------------------------------------------------------------------------------------------------------------------
See the 'Profiling' chapter of the users' manual for details on
interpreting output.
Phase summary info:
   Count: number of times phase was executed
   Time and Flop: Max - maximum over all processors
                  Ratio - ratio of maximum to minimum over all processors
   Mess: number of messages sent
   AvgLen: average message length (bytes)
   Reduct: number of global reductions
   Global: entire computation
   Stage: stages of a computation. Set stages with PetscLogStagePush()
and PetscLogStagePop().
      %T - percent time in this phase         %F - percent flop in this phase
      %M - percent messages in this phase     %L - percent message
lengths in this phase
      %R - percent reductions in this phase
   Total Mflop/s: 10e-6 * (sum of flop over all processors)/(max time
over all processors)
   GPU Mflop/s: 10e-6 * (sum of flop on GPU over all processors)/(max
GPU time over all processors)
   CpuToGpu Count: total number of CPU to GPU copies per processor
   CpuToGpu Size (Mbytes): 10e-6 * (total size of CPU to GPU copies
per processor)
   GpuToCpu Count: total number of GPU to CPU copies per processor
   GpuToCpu Size (Mbytes): 10e-6 * (total size of GPU to CPU copies
per processor)
   GPU %F: percent flops on GPU in this event
------------------------------------------------------------------------------------------------------------------------
Event                Count      Time (sec)     Flop
          --- Global ---  --- Stage ----  Total   GPU    - CpuToGpu -
 - GpuToCpu - GPU
                   Max Ratio  Max     Ratio   Max  Ratio  Mess
AvgLen  Reduct  %T %F %M %L %R  %T %F %M %L %R Mflop/s Mflop/s Count
Size   Count   Size  %F
---------------------------------------------------------------------------------------------------------------------------------------------------------------

--- Event Stage 0: Main Stage

BuildTwoSided          2 1.0 4.4400e-01567.5 0.00e+00 0.0 2.0e+00
4.0e+00 2.0e+00  0  0  3  0  2   0  0  3  0  3     0       0      0
0.00e+00    0 0.00e+00  0
BuildTwoSidedF         1 1.0 4.4395e-0115659.1 0.00e+00 0.0 0.0e+00
0.0e+00 1.0e+00  0  0  0  0  1   0  0  0  0  2     0       0      0
0.00e+00    0 0.00e+00  0
MatAssemblyBegin      32 1.0 4.4400e-017378.9 0.00e+00 0.0 0.0e+00
0.0e+00 1.0e+00  0  0  0  0  1   0  0  0  0  2     0       0      0
0.00e+00    0 0.00e+00  0
MatAssemblyEnd        32 1.0 1.8511e+00 2.2 0.00e+00 0.0 0.0e+00
0.0e+00 6.0e+00  1  0  0  0  7   1  0  0  0 10     0       0      0
0.00e+00    0 0.00e+00  0
MatZeroEntries         1 1.0 3.3306e-03 1.0 0.00e+00 0.0 0.0e+00
0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0       0      0
0.00e+00    0 0.00e+00  0
MatLoad                1 1.0 1.7220e+01 1.0 0.00e+00 0.0 6.0e+00
-8.8e+07 2.1e+01 15  0  9-10 26  15  0  9-10 33     0       0      0
0.00e+00    0 0.00e+00  0
MatMatMultSym         60 1.0 9.2215e-01 2.6 0.00e+00 0.0 4.0e+00
7.3e+05 3.2e+01  1  0  6  0 40   1  0  6  0 51     0       0      0
0.00e+00    0 0.00e+00  0
MatMatMultNum         30 1.0 4.2967e+01 1.0 6.34e+11 1.1 6.0e+01
9.4e+07 0.0e+00 37100 86110  0  37100 86110  0 28598   920026      2
6.71e+03   30 8.73e+04 98
MatCUSPARSCopyTo       1 1.0 4.4761e-01 1.0 0.00e+00 0.0 0.0e+00
0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0       0      1
3.80e+03    0 0.00e+00  0
MatDenseCopyTo         1 1.0 2.2742e-01 1.0 0.00e+00 0.0 0.0e+00
0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0       0      1
2.91e+03    0 0.00e+00  0
MatDenseCopyFrom      31 1.0 1.2006e+01 1.0 0.00e+00 0.0 0.0e+00
0.0e+00 0.0e+00 10  0  0  0  0  10  0  0  0  0     0       0      0
0.00e+00   31 9.02e+04  0
VecSet                 3 1.0 4.1917e-04 1.1 0.00e+00 0.0 0.0e+00
0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0       0      0
0.00e+00    0 0.00e+00  0
SFSetGraph             1 1.0 1.9180e-04 1.1 0.00e+00 0.0 0.0e+00
0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0       0      0
0.00e+00    0 0.00e+00  0
SFSetUp                1 1.0 1.3672e-02 1.1 0.00e+00 0.0 4.0e+00
7.3e+05 1.0e+00  0  0  6  0  1   0  0  6  0  2     0       0      0
0.00e+00    0 0.00e+00  0
---------------------------------------------------------------------------------------------------------------------------------------------------------------

Memory usage is given in bytes:

Object Type          Creations   Destructions     Memory  Descendants' Mem.
Reports information only for process 0.

--- Event Stage 0: Main Stage

              Matrix    37             30   2867511840     0.
              Viewer     2              0            0     0.
              Vector     4              1         1792     0.
           Index Set     2              2      1495248     0.
   Star Forest Graph     3              0            0     0.
========================================================================================================================
Average time to get PetscTime(): 3.83e-08
Average time for MPI_Barrier(): 7.874e-07
Average time for zero size MPI_Send(): 3.4035e-06
#PETSc Option Table entries:
-bench spmm
-enable_gpu
-log_view
-mat_type aijcusparse
-matload_block_size 1
-matrix /p/gpfs1/yadav2/tensors/petsc/arabic-2005.petsc
-n 20
-vec_type cuda
-warmup 10
```


Thanks,


Rohan Yadav

Reply via email to