It looks like cusparsestruct->stream is always created (not NULL). I don't know the logic of the "if (!cusparsestruct->stream)" check. --Junchao Zhang
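A minimal sketch of what that observation implies (hypothetical, not the actual aijcusparse.cu source): if the stream is always created, the condition is never true and the synchronization below it never runs.

    /* If cusparsestruct->stream is always non-NULL, this branch never fires,
       so this WaitForGPU() is effectively dead code. */
    if (!cusparsestruct->stream) {
      ierr = WaitForGPU();CHKERRCUDA(ierr);
    }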
On Mon, Sep 23, 2019 at 5:04 PM Mills, Richard Tran via petsc-dev <petsc-dev@mcs.anl.gov> wrote:

In MatMultAdd_SeqAIJCUSPARSE, before Junchao's changes, towards the end of the function it had

    if (!yy) { /* MatMult */
      if (!cusparsestruct->stream) {
        ierr = WaitForGPU();CHKERRCUDA(ierr);
      }
    }

I assume we don't need the logic that does this only in the MatMult() (no-add) case and should just do it all the time, for the purposes of timing if no other reason. Is there some reason NOT to do this, because of worries about the effects that these WaitForGPU() invocations might have on performance?

I notice other problems in aijcusparse.cu, now that I look closer. In MatMultTransposeAdd_SeqAIJCUSPARSE(), I see that we have GPU timing calls around the cusparse_csr_spmv() (but no WaitForGPU() inside the timed region). I believe this is another place where we get a meaningless timing. It looks like we need a WaitForGPU() there, and then maybe another one inside the timed region that handles the scatter. (I don't know whether that part happens asynchronously or not.) But do we potentially want two WaitForGPU() calls in one function, just to help with getting timings? I don't have a good idea of how much overhead this adds.

--Richard

On 9/21/19 12:03 PM, Zhang, Junchao via petsc-dev wrote:

I made the following changes:

1) In MatMultAdd_SeqAIJCUSPARSE, use this code sequence at the end:

    ierr = WaitForGPU();CHKERRCUDA(ierr);
    ierr = PetscLogGpuTimeEnd();CHKERRQ(ierr);
    ierr = PetscLogGpuFlops(2.0*a->nz);CHKERRQ(ierr);
    PetscFunctionReturn(0);

2) In MatMult_MPIAIJCUSPARSE, use the following code sequence. The old code swapped the first two lines. Since MatMultAdd_SeqAIJCUSPARSE is blocking when -log_view is used, I changed the order to get better overlap:

    ierr = VecScatterBegin(a->Mvctx,xx,a->lvec,INSERT_VALUES,SCATTER_FORWARD);CHKERRQ(ierr);
    ierr = (*a->A->ops->mult)(a->A,xx,yy);CHKERRQ(ierr);
    ierr = VecScatterEnd(a->Mvctx,xx,a->lvec,INSERT_VALUES,SCATTER_FORWARD);CHKERRQ(ierr);
    ierr = (*a->B->ops->multadd)(a->B,a->lvec,yy,yy);CHKERRQ(ierr);

3) Log time directly in the test code, so we also know the execution time without -log_view (and hence without its CUDA synchronization); a sketch follows below. I manually calculated the Total Mflop/s for these cases for easy comparison.
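For item 3, a minimal sketch of how the timing might be done directly in the test code. This is hypothetical driver code, not the actual benchmark: the matrix A, the vectors x and y, and the loop count of 100 are assumptions chosen to match the tables below.

    /* Assumes #include <petscmat.h> and <cuda_runtime.h>, and that A, x, y
       have already been created and assembled. */
    PetscLogDouble t0,t1;
    cudaError_t    cerr;
    PetscInt       i;

    ierr = MPI_Barrier(PETSC_COMM_WORLD);CHKERRQ(ierr);
    ierr = PetscTime(&t0);CHKERRQ(ierr);
    for (i=0; i<100; i++) {
      ierr = MatMult(A,x,y);CHKERRQ(ierr);
    }
    cerr = cudaDeviceSynchronize();CHKERRCUDA(cerr); /* make sure all GPU work has finished before stopping the clock */
    ierr = MPI_Barrier(PETSC_COMM_WORLD);CHKERRQ(ierr);
    ierr = PetscTime(&t1);CHKERRQ(ierr);
    ierr = PetscPrintf(PETSC_COMM_WORLD,"MatMult: 100 calls in %g s\n",t1-t0);CHKERRQ(ierr);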
<<Note the CPU versions are copied from yesterday's results>>

---------------------------------------------------------------------------------------------------------------------------------------------------------------
Event                Count      Time (sec)     Flop                             --- Global ---  --- Stage ----  Total    GPU    - CpuToGpu -   - GpuToCpu -  GPU
                       Max Ratio  Max     Ratio   Max  Ratio  Mess   AvgLen  Reduct  %T %F %M %L %R  %T %F %M %L %R  Mflop/s Mflop/s Count  Size    Count  Size  %F
---------------------------------------------------------------------------------------------------------------------------------------------------------------

6 MPI ranks
MatMult              100 1.0 1.1895e+01 1.0 9.63e+09 1.1 2.8e+03 2.2e+05 0.0e+00 24 99 97 18 0 100100100100 0 4743 0 0 0.00e+00 0 0.00e+00 0
VecScatterBegin      100 1.0 4.9145e-02 3.0 0.00e+00 0.0 2.8e+03 2.2e+05 0.0e+00 0 0 97 18 0 0 0100100 0 0 0 0 0.00e+00 0 0.00e+00 0
VecScatterEnd        100 1.0 2.9441e+00 133 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 3 0 0 0 0 13 0 0 0 0 0 0 0 0.00e+00 0 0.00e+00 0

24 MPI ranks
MatMult              100 1.0 3.1431e+00 1.0 2.63e+09 1.2 1.9e+04 5.9e+04 0.0e+00 8 99 97 25 0 100100100100 0 17948 0 0 0.00e+00 0 0.00e+00 0
VecScatterBegin      100 1.0 2.0583e-02 2.3 0.00e+00 0.0 1.9e+04 5.9e+04 0.0e+00 0 0 97 25 0 0 0100100 0 0 0 0 0.00e+00 0 0.00e+00 0
VecScatterEnd        100 1.0 1.0639e+00 50.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 2 0 0 0 0 19 0 0 0 0 0 0 0 0.00e+00 0 0.00e+00 0

42 MPI ranks
MatMult              100 1.0 2.0519e+00 1.0 1.52e+09 1.3 3.5e+04 4.1e+04 0.0e+00 23 99 97 30 0 100100100100 0 27493 0 0 0.00e+00 0 0.00e+00 0
VecScatterBegin      100 1.0 2.0971e-02 3.4 0.00e+00 0.0 3.5e+04 4.1e+04 0.0e+00 0 0 97 30 0 1 0100100 0 0 0 0 0.00e+00 0 0.00e+00 0
VecScatterEnd        100 1.0 8.5184e-01 62.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 6 0 0 0 0 24 0 0 0 0 0 0 0 0.00e+00 0 0.00e+00 0

6 MPI ranks + 6 GPUs + regular SF + log_view
MatMult              100 1.0 1.6863e-01 1.0 9.66e+09 1.1 2.8e+03 2.2e+05 0.0e+00 0 99 97 18 0 100100100100 0 335743 629278 100 1.02e+02 100 2.69e+02 100
VecScatterBegin      100 1.0 5.0157e-02 1.6 0.00e+00 0.0 2.8e+03 2.2e+05 0.0e+00 0 0 97 18 0 24 0100100 0 0 0 0 0.00e+00 100 2.69e+02 0
VecScatterEnd        100 1.0 4.9155e-02 2.5 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 20 0 0 0 0 0 0 0 0.00e+00 0 0.00e+00 0
VecCUDACopyTo        100 1.0 9.5078e-03 2.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 4 0 0 0 0 0 0 100 1.02e+02 0 0.00e+00 0
VecCopyFromSome      100 1.0 2.8485e-02 1.4 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 14 0 0 0 0 0 0 0 0.00e+00 100 2.69e+02 0

6 MPI ranks + 6 GPUs + regular SF + No log_view
MatMult:             100 1.0 1.4180e-01  399268

6 MPI ranks + 6 GPUs + CUDA-aware SF + log_view
MatMult              100 1.0 1.1053e-01 1.0 9.66e+09 1.1 2.8e+03 2.2e+05 0.0e+00 1 99 97 18 0 100100100100 0 512224 642075 0 0.00e+00 0 0.00e+00 100
VecScatterBegin      100 1.0 8.3418e-03 1.5 0.00e+00 0.0 2.8e+03 2.2e+05 0.0e+00 0 0 97 18 0 6 0100100 0 0 0 0 0.00e+00 0 0.00e+00 0
VecScatterEnd        100 1.0 2.2619e-02 1.6 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 16 0 0 0 0 0 0 0 0.00e+00 0 0.00e+00 0

6 MPI ranks + 6 GPUs + CUDA-aware SF + No log_view
MatMult:             100 1.0 9.8344e-02  575717

24 MPI ranks + 6 GPUs + regular SF + log_view
MatMult              100 1.0 1.1572e-01 1.0 2.63e+09 1.2 1.9e+04 5.9e+04 0.0e+00 0 99 97 25 0 100100100100 0 489223 708601 100 4.61e+01 100 6.72e+01 100
VecScatterBegin      100 1.0 2.0641e-02 2.0 0.00e+00 0.0 1.9e+04 5.9e+04 0.0e+00 0 0 97 25 0 13 0100100 0 0 0 0 0.00e+00 100 6.72e+01 0
VecScatterEnd        100 1.0 6.8114e-02 5.6 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 38 0 0 0 0 0 0 0 0.00e+00 0 0.00e+00 0
VecCUDACopyTo        100 1.0 6.6646e-03 2.5 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 3 0 0 0 0 0 0 100 4.61e+01 0 0.00e+00 0
VecCopyFromSome      100 1.0 1.0546e-02 1.7 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 7 0 0 0 0 0 0 0 0.00e+00 100 6.72e+01 0

24 MPI ranks + 6 GPUs + regular SF + No log_view
MatMult:             100 1.0 9.8254e-02  576201

24 MPI ranks + 6 GPUs + CUDA-aware SF + log_view
MatMult              100 1.0 1.1602e-01 1.0 2.63e+09 1.2 1.9e+04 5.9e+04 0.0e+00 1 99 97 25 0 100100100100 0 487956 707524 0 0.00e+00 0 0.00e+00 100
VecScatterBegin      100 1.0 2.7088e-02 7.0 0.00e+00 0.0 1.9e+04 5.9e+04 0.0e+00 0 0 97 25 0 8 0100100 0 0 0 0 0.00e+00 0 0.00e+00 0
VecScatterEnd        100 1.0 8.4262e-02 3.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 1 0 0 0 0 52 0 0 0 0 0 0 0 0.00e+00 0 0.00e+00 0

24 MPI ranks + 6 GPUs + CUDA-aware SF + No log_view
MatMult:             100 1.0 1.0397e-01  544510
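Going back to Richard's point about MatMultTransposeAdd_SeqAIJCUSPARSE: a hedged sketch of the timing pattern he suggests, with the synchronization inside the timed region. The cusparse_csr_spmv() arguments are elided, and the error-checking macros are assumed to be the ones already used in aijcusparse.cu; this is not the actual source.

    ierr = PetscLogGpuTimeBegin();CHKERRQ(ierr);
    stat = cusparse_csr_spmv(/* ...actual arguments elided... */);CHKERRCUSPARSE(stat);
    ierr = WaitForGPU();CHKERRCUDA(ierr);   /* wait for the SpMV to finish so the GPU timer measures completed work */
    ierr = PetscLogGpuTimeEnd();CHKERRQ(ierr);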