> On Sep 21, 2019, at 11:00 AM, Zhang, Junchao <jczh...@mcs.anl.gov> wrote:
> We log gpu time before/after cusparse calls. 
> https://gitlab.com/petsc/petsc/blob/master/src%2Fmat%2Fimpls%2Faij%2Fseq%2Fseqcusparse%2Faijcusparse.cu#L1441
> But according to 
> https://docs.nvidia.com/cuda/cusparse/index.html#asynchronous-execution, 
> cusparse is asynchronous. Does that mean the gpu time is meaningless?
> --Junchao Zhang

  Yes it looks like those numbers are meaningless from that routine, thanks for 

  The ierr=WaitForGPU();CHKERRCUDA(ierr); call is what gets turned on for timing, 
and it would capture all the compute time. Perhaps it could be moved to the 
appropriate places in that routine. Of course, when running with -log_view, 
WaitForGPU() is turned on, and that may slow things down so we don't get the 
best numbers; I have no idea if they would be noticeably higher if WaitForGPU() 
were off.
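
  As a rough illustration of why a timer around an asynchronous call needs a 
wait before it is stopped, here is a minimal Python sketch. It is an analogy 
only: a background thread stands in for the GPU, launch_async and timed_call 
are hypothetical names made up for this example, and join() plays the role that 
WaitForGPU() plays in the PETSc routine. It is not how PETSc or cusparse is 
implemented.

```python
import threading
import time


def launch_async(work_seconds):
    """Start background 'device' work and return immediately.

    ASSUMPTION: a sleeping thread stands in for an asynchronous GPU kernel.
    """
    worker = threading.Thread(target=time.sleep, args=(work_seconds,))
    worker.start()
    return worker


def timed_call(work_seconds, wait_before_stop):
    """Return the elapsed time seen by a timer wrapped around the launch."""
    start = time.perf_counter()
    worker = launch_async(work_seconds)
    if wait_before_stop:
        worker.join()  # analogous to waiting for the GPU before stopping the timer
    elapsed = time.perf_counter() - start
    worker.join()  # always let the background work finish before returning
    return elapsed


if __name__ == "__main__":
    print("timer stopped without waiting:", timed_call(0.2, wait_before_stop=False))
    print("timer stopped after waiting:  ", timed_call(0.2, wait_before_stop=True))
```

  Without the wait, the timer only sees the cost of launching the work, so any 
rate derived from it says little about the actual compute time.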

  I don't understand how the streams are used (see if(!cusparsestruct->stream){ ); 
it seems some methods use them in some places but not in others.


> On Sat, Sep 21, 2019 at 8:30 AM Smith, Barry F. <bsm...@mcs.anl.gov> wrote:
>    Hannah, Junchao and Richard,
>     The on-GPU flop rates for 24 MPI ranks and 24 MPS GPUs look totally 
> funky: 951558 and 973391 are so much lower than the unvirtualized 3084009 
> and 3133521, and yet the total time to solution is similar for the runs.
>     Is it possible these are being counted or calculated wrong? If not, what 
> does this mean? Please check the code that computes them (I can't imagine it 
> is wrong but ...)
>     It means the GPUs are taking 3.x times longer to do the multiplies in the 
> MPS case, but where is that time coming from in the other numbers? 
> Communication time doesn't drop that much?
>     I can't present these numbers with this huge inconsistency.
> Thanks,
>    Barry
> > On Sep 20, 2019, at 11:22 PM, Zhang, Junchao via petsc-dev 
> > <petsc-dev@mcs.anl.gov> wrote:
> > 
> > I downloaded a sparse matrix (HV15R) from the Florida Sparse Matrix 
> > Collection. Its size is about 2M x 2M. Then I ran the same MatMult 100 
> > times on one node of Summit with -mat_type aijcusparse -vec_type cuda. I 
> > found MatMult was almost dominated by VecScatter in this simple test. Using 
> > 6 MPI ranks + 6 GPUs, I found CUDA-aware SF could improve performance. But 
> > when I enabled Multi-Process Service on Summit and used 24 ranks + 6 GPUs, 
> > I found CUDA-aware SF hurt performance. I don't know why and have to 
> > profile it. I will also collect data with multiple nodes. Are the matrix 
> > and tests appropriate?
> > 
> > ------------------------------------------------------------------------------------------------------------------------
> > Event                Count      Time (sec)     Flop                              --- Global ---  --- Stage ----  Total   GPU    - CpuToGpu -   - GpuToCpu - GPU
> >                    Max Ratio  Max     Ratio   Max  Ratio  Mess   AvgLen  Reduct  %T %F %M %L %R  %T %F %M %L %R Mflop/s Mflop/s Count   Size   Count   Size  %F
> > ---------------------------------------------------------------------------------------------------------------------------------------------------------------
> > 6 MPI ranks (CPU version)
> > MatMult              100 1.0 1.1895e+01 1.0 9.63e+09 1.1 2.8e+03 2.2e+05 0.0e+00 24 99 97 18  0 100100100100  0  4743       0      0 0.00e+00    0 0.00e+00  0
> > VecScatterBegin      100 1.0 4.9145e-02 3.0 0.00e+00 0.0 2.8e+03 2.2e+05 0.0e+00  0  0 97 18  0   0  0100100  0     0       0      0 0.00e+00    0 0.00e+00  0
> > VecScatterEnd        100 1.0 2.9441e+00133  0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  3  0  0  0  0  13  0  0  0  0     0       0      0 0.00e+00    0 0.00e+00  0
> > 
> > 6 MPI ranks + 6 GPUs + regular SF
> > MatMult              100 1.0 1.7800e-01 1.0 9.66e+09 1.1 2.8e+03 2.2e+05 0.0e+00  0 99 97 18  0 100100100100  0 318057   3084009 100 1.02e+02  100 2.69e+02 100
> > VecScatterBegin      100 1.0 1.2786e-01 1.3 0.00e+00 0.0 2.8e+03 2.2e+05 0.0e+00  0  0 97 18  0  64  0100100  0     0       0      0 0.00e+00  100 2.69e+02  0
> > VecScatterEnd        100 1.0 6.2196e-02 3.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0  22  0  0  0  0     0       0      0 0.00e+00    0 0.00e+00  0
> > VecCUDACopyTo        100 1.0 1.0850e-02 2.3 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   5  0  0  0  0     0       0    100 1.02e+02    0 0.00e+00  0
> > VecCopyFromSome      100 1.0 1.0263e-01 1.2 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0  54  0  0  0  0     0       0      0 0.00e+00  100 2.69e+02  0
> > 
> > 6 MPI ranks + 6 GPUs + CUDA-aware SF
> > MatMult              100 1.0 1.1112e-01 1.0 9.66e+09 1.1 2.8e+03 2.2e+05 0.0e+00  1 99 97 18  0 100100100100  0 509496   3133521   0 0.00e+00    0 0.00e+00 100
> > VecScatterBegin      100 1.0 7.9461e-02 1.1 0.00e+00 0.0 2.8e+03 2.2e+05 0.0e+00  1  0 97 18  0  70  0100100  0     0       0      0 0.00e+00    0 0.00e+00  0
> > VecScatterEnd        100 1.0 2.2805e-02 1.5 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0  17  0  0  0  0     0       0      0 0.00e+00    0 0.00e+00  0
> > 
> > 24 MPI ranks + 6 GPUs + regular SF
> > MatMult              100 1.0 1.1094e-01 1.0 2.63e+09 1.2 1.9e+04 5.9e+04 0.0e+00  1 99 97 25  0 100100100100  0 510337   951558  100 4.61e+01  100 6.72e+01 100
> > VecScatterBegin      100 1.0 4.8966e-02 1.8 0.00e+00 0.0 1.9e+04 5.9e+04 0.0e+00  0  0 97 25  0  34  0100100  0     0       0      0 0.00e+00  100 6.72e+01  0
> > VecScatterEnd        100 1.0 7.2969e-02 4.9 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  1  0  0  0  0  42  0  0  0  0     0       0      0 0.00e+00    0 0.00e+00  0
> > VecCUDACopyTo        100 1.0 4.4487e-03 1.8 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   3  0  0  0  0     0       0    100 4.61e+01    0 0.00e+00  0
> > VecCopyFromSome      100 1.0 4.3315e-02 1.9 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0  29  0  0  0  0     0       0      0 0.00e+00  100 6.72e+01  0
> > 
> > 24 MPI ranks + 6 GPUs + CUDA-aware SF
> > MatMult              100 1.0 1.4597e-01 1.2 2.63e+09 1.2 1.9e+04 5.9e+04 0.0e+00  1 99 97 25  0 100100100100  0 387864   973391    0 0.00e+00    0 0.00e+00 100
> > VecScatterBegin      100 1.0 6.4899e-02 2.9 0.00e+00 0.0 1.9e+04 5.9e+04 0.0e+00  1  0 97 25  0  35  0100100  0     0       0      0 0.00e+00    0 0.00e+00  0
> > VecScatterEnd        100 1.0 1.1179e-01 4.1 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  1  0  0  0  0  48  0  0  0  0     0       0      0 0.00e+00    0 0.00e+00  0
> > 
> > 
> > --Junchao Zhang
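
The MatMult figures quoted in the tables above can be cross-checked with a 
little arithmetic. The sketch below copies the Time, Total Mflop/s, and GPU 
Mflop/s columns from the four GPU runs and derives two things: the ratio 
between GPU rates, and the GPU time each rate would imply. The derivation rests 
on an assumption that is not confirmed in this thread, namely that Total Mflop/s 
is the total flop count divided by the logged wall time and that GPU Mflop/s is 
the same flop count divided by the logged GPU time. RUNS, implied_gpu_seconds, 
and rate_ratio are hypothetical names used only for this example.

```python
# MatMult rows copied from the -log_view tables quoted above.
RUNS = {
    "6 ranks, regular SF": {"time_s": 1.7800e-01, "total_mflops": 318057, "gpu_mflops": 3084009},
    "6 ranks, CUDA-aware SF": {"time_s": 1.1112e-01, "total_mflops": 509496, "gpu_mflops": 3133521},
    "24 ranks, regular SF": {"time_s": 1.1094e-01, "total_mflops": 510337, "gpu_mflops": 951558},
    "24 ranks, CUDA-aware SF": {"time_s": 1.4597e-01, "total_mflops": 387864, "gpu_mflops": 973391},
}


def implied_gpu_seconds(run):
    """GPU time implied by the logged rates.

    ASSUMPTION: total Mflop/s = flops / wall time and GPU Mflop/s = flops / GPU time.
    """
    total_mflop = run["total_mflops"] * run["time_s"]
    return total_mflop / run["gpu_mflops"]


def rate_ratio(fast_name, slow_name):
    """How many times higher the GPU rate is in one run than in another."""
    return RUNS[fast_name]["gpu_mflops"] / RUNS[slow_name]["gpu_mflops"]


if __name__ == "__main__":
    for name, run in RUNS.items():
        print(f"{name:26s} implied GPU time {implied_gpu_seconds(run):.4f} s "
              f"of {run['time_s']:.4f} s logged")
    print("regular SF rate ratio:", rate_ratio("6 ranks, regular SF", "24 ranks, regular SF"))
```

With these figures, the regular-SF GPU rate drops by a factor of roughly 3.2 
between the 6-rank and 24-rank runs, which matches the "3.x" mentioned earlier 
in the thread. If the GPU timers are stopped without waiting for the 
asynchronous cusparse calls, the implied GPU times should be treated with the 
same caution as the rates themselves.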
