Note, the numerical problems that we are seeing look a lot like a race
condition of some sort. They happen with empty processors and go away
under cuda-memcheck (a valgrind-like tool).

I did try adding WaitForGPU(), but maybe I did not do it right, or there
are other synchronization mechanisms I should be using.
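
For reference, the two synchronization mechanisms I am aware of are a
per-stream wait and a device-wide wait; WaitForGPU() is, I believe,
essentially cudaDeviceSynchronize() under the hood. A minimal standalone
sketch (the kernel and sizes are made up; this is not PETSc code):

  #include <cstdio>
  #include <cuda_runtime.h>

  __global__ void scale(double *x, double a, int n)
  {
    int i = blockIdx.x*blockDim.x + threadIdx.x;
    if (i < n) x[i] *= a;                    /* stand-in for the SpMV kernel */
  }

  int main()
  {
    const int    n = 1<<20;
    double       *x;
    cudaStream_t s;

    cudaMalloc(&x,n*sizeof(double));
    cudaMemset(x,0,n*sizeof(double));
    cudaStreamCreate(&s);

    scale<<<(n+255)/256,256,0,s>>>(x,2.0,n); /* the launch itself is asynchronous */

    cudaStreamSynchronize(s);                /* option 1: wait only for work queued on stream s */
    cudaDeviceSynchronize();                 /* option 2: wait for all work on the device */

    cudaStreamDestroy(s);
    cudaFree(x);
    printf("done\n");
    return 0;
  }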


On Mon, Sep 23, 2019 at 6:28 PM Zhang, Junchao via petsc-dev <
petsc-dev@mcs.anl.gov> wrote:

> It looks like cusparsestruct->stream is always created (not NULL), so I
> don't understand the logic of the "if (!cusparsestruct->stream)" check.
> --Junchao Zhang
>
>
> On Mon, Sep 23, 2019 at 5:04 PM Mills, Richard Tran via petsc-dev <
> petsc-dev@mcs.anl.gov> wrote:
>
>> In MatMultAdd_SeqAIJCUSPARSE, before Junchao's changes, towards the end
>> of the function it had
>>
>>   if (!yy) { /* MatMult */
>>     if (!cusparsestruct->stream) {
>>       ierr = WaitForGPU();CHKERRCUDA(ierr);
>>     }
>>   }
>>
>> I assume we don't need the logic that does this only in the MatMult()
>> (no-add) case, and should just do it all the time, for the purposes of
>> timing if for no other reason. Is there some reason NOT to do this,
>> because of worries about the effects that these WaitForGPU() invocations
>> might have on performance?
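>>
>> (Concretely, I mean dropping both tests and always blocking at that
>> point, i.e. something like the following; just a sketch, not a tested
>> patch:)
>>
>>   ierr = WaitForGPU();CHKERRCUDA(ierr);  /* always wait, whether or not yy or the stream is set */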
>>
>> I notice other problems in aijcusparse.cu now that I look closer. In
>> MatMultTransposeAdd_SeqAIJCUSPARSE(), I see that we have GPU timing calls
>> around the cusparse_csr_spmv(), but no WaitForGPU() inside the timed
>> region. I believe this is another place where we get a meaningless
>> timing. It looks like we need a WaitForGPU() there, and then maybe
>> another one inside the timed region that handles the scatter. (I don't
>> know whether that part happens asynchronously or not.) But do we
>> potentially want two WaitForGPU() calls in one function, just to help
>> with getting timings? I don't have a good idea of how much overhead this
>> adds.
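>>
>> (Concretely, inside MatMultTransposeAdd_SeqAIJCUSPARSE I am picturing
>> something like the sketch below; the actual cusparse_csr_spmv() arguments
>> are elided and the exact placement may need adjusting:)
>>
>>   ierr = PetscLogGpuTimeBegin();CHKERRQ(ierr);
>>   stat = cusparse_csr_spmv(/* ... actual arguments ... */);CHKERRCUSPARSE(stat);
>>   ierr = WaitForGPU();CHKERRCUDA(ierr);   /* block so the timer sees the SpMV itself, not just the launch */
>>   ierr = PetscLogGpuTimeEnd();CHKERRQ(ierr);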
>>
>> --Richard
>>
>> On 9/21/19 12:03 PM, Zhang, Junchao via petsc-dev wrote:
>>
>> I made the following changes:
>> 1) In MatMultAdd_SeqAIJCUSPARSE, use this code sequence at the end
>>   ierr = WaitForGPU();CHKERRCUDA(ierr);
>>   ierr = PetscLogGpuTimeEnd();CHKERRQ(ierr);
>>   ierr = PetscLogGpuFlops(2.0*a->nz);CHKERRQ(ierr);
>>   PetscFunctionReturn(0);
>> 2) In MatMult_MPIAIJCUSPARSE, use the following code sequence. The old
>> code swapped the first two lines. Since MatMultAdd_SeqAIJCUSPARSE is
>> blocking when -log_view is used, I changed the order to get better
>> overlap of communication and computation.
>>   ierr = VecScatterBegin(a->Mvctx,xx,a->lvec,INSERT_VALUES,SCATTER_FORWARD);CHKERRQ(ierr);
>>   ierr = (*a->A->ops->mult)(a->A,xx,yy);CHKERRQ(ierr);
>>   ierr = VecScatterEnd(a->Mvctx,xx,a->lvec,INSERT_VALUES,SCATTER_FORWARD);CHKERRQ(ierr);
>>   ierr = (*a->B->ops->multadd)(a->B,a->lvec,yy,yy);CHKERRQ(ierr);
>> 3) Log time directly in the test code so that we also know the execution
>> time without -log_view (and hence without the CUDA synchronization that
>> logging adds). I manually calculated the Total Mflop/s for these cases
>> for easy comparison.
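>>
>> (The timing in the test code is conceptually along these lines; this is a
>> hypothetical sketch rather than the exact code, and the explicit device
>> synchronization is spelled out only for illustration:)
>>
>>   ierr = MPI_Barrier(PETSC_COMM_WORLD);CHKERRQ(ierr);
>>   ierr = PetscTime(&t0);CHKERRQ(ierr);
>>   for (i=0; i<100; i++) {
>>     ierr = MatMult(A,x,y);CHKERRQ(ierr);
>>   }
>>   cerr = cudaDeviceSynchronize();CHKERRCUDA(cerr); /* make sure all GPU work has completed */
>>   ierr = MPI_Barrier(PETSC_COMM_WORLD);CHKERRQ(ierr);
>>   ierr = PetscTime(&t1);CHKERRQ(ierr);
>>   /* elapsed time for the 100 MatMult() calls is t1 - t0 */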
>>
>> <<Note the CPU versions are copied from yesterday's results>>
>>
>>
>> ------------------------------------------------------------------------------------------------------------------------
>> Event                Count      Time (sec)     Flop                              --- Global ---  --- Stage ----  Total   GPU    - CpuToGpu -   - GpuToCpu - GPU
>>                    Max Ratio  Max     Ratio   Max  Ratio  Mess   AvgLen  Reduct  %T %F %M %L %R  %T %F %M %L %R Mflop/s Mflop/s Count   Size   Count   Size  %F
>> ---------------------------------------------------------------------------------------------------------------------------------------------------------------
>> 6 MPI ranks
>> MatMult              100 1.0 1.1895e+01 1.0 9.63e+09 1.1 2.8e+03 2.2e+05 0.0e+00 24 99 97 18  0 100100100100  0  4743       0      0 0.00e+00    0 0.00e+00  0
>> VecScatterBegin      100 1.0 4.9145e-02 3.0 0.00e+00 0.0 2.8e+03 2.2e+05 0.0e+00  0  0 97 18  0   0  0100100  0     0       0      0 0.00e+00    0 0.00e+00  0
>> VecScatterEnd        100 1.0 2.9441e+00 133 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  3  0  0  0  0  13  0  0  0  0     0       0      0 0.00e+00    0 0.00e+00  0
>>
>> 24 MPI ranks
>> MatMult              100 1.0 3.1431e+00 1.0 2.63e+09 1.2 1.9e+04 5.9e+04 0.0e+00  8 99 97 25  0 100100100100  0 17948       0      0 0.00e+00    0 0.00e+00  0
>> VecScatterBegin      100 1.0 2.0583e-02 2.3 0.00e+00 0.0 1.9e+04 5.9e+04 0.0e+00  0  0 97 25  0   0  0100100  0     0       0      0 0.00e+00    0 0.00e+00  0
>> VecScatterEnd        100 1.0 1.0639e+00 50.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  2  0  0  0  0  19  0  0  0  0     0       0      0 0.00e+00    0 0.00e+00  0
>>
>> 42 MPI ranks
>> MatMult              100 1.0 2.0519e+00 1.0 1.52e+09 1.3 3.5e+04 4.1e+04 0.0e+00 23 99 97 30  0 100100100100  0 27493       0      0 0.00e+00    0 0.00e+00  0
>> VecScatterBegin      100 1.0 2.0971e-02 3.4 0.00e+00 0.0 3.5e+04 4.1e+04 0.0e+00  0  0 97 30  0   1  0100100  0     0       0      0 0.00e+00    0 0.00e+00  0
>> VecScatterEnd        100 1.0 8.5184e-01 62.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  6  0  0  0  0  24  0  0  0  0     0       0      0 0.00e+00    0 0.00e+00  0
>>
>> 6 MPI ranks + 6 GPUs + regular SF + log_view
>> MatMult              100 1.0 1.6863e-01 1.0 9.66e+09 1.1 2.8e+03 2.2e+05 0.0e+00  0 99 97 18  0 100100100100  0 335743   629278  100 1.02e+02  100 2.69e+02 100
>> VecScatterBegin      100 1.0 5.0157e-02 1.6 0.00e+00 0.0 2.8e+03 2.2e+05 0.0e+00  0  0 97 18  0  24  0100100  0     0       0      0 0.00e+00  100 2.69e+02  0
>> VecScatterEnd        100 1.0 4.9155e-02 2.5 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0  20  0  0  0  0     0       0      0 0.00e+00    0 0.00e+00  0
>> VecCUDACopyTo        100 1.0 9.5078e-03 2.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   4  0  0  0  0     0       0    100 1.02e+02    0 0.00e+00  0
>> VecCopyFromSome      100 1.0 2.8485e-02 1.4 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0  14  0  0  0  0     0       0      0 0.00e+00  100 2.69e+02  0
>>
>> 6 MPI ranks + 6 GPUs + regular SF + No log_view
>> MatMult:             100 1.0 1.4180e-01                                                                        399268
>>
>> 6 MPI ranks + 6 GPUs + CUDA-aware SF + log_view
>> MatMult              100 1.0 1.1053e-01 1.0 9.66e+09 1.1 2.8e+03 2.2e+05 0.0e+00  1 99 97 18  0 100100100100  0 512224   642075    0 0.00e+00    0 0.00e+00 100
>> VecScatterBegin      100 1.0 8.3418e-03 1.5 0.00e+00 0.0 2.8e+03 2.2e+05 0.0e+00  0  0 97 18  0   6  0100100  0     0       0      0 0.00e+00    0 0.00e+00  0
>> VecScatterEnd        100 1.0 2.2619e-02 1.6 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0  16  0  0  0  0     0       0      0 0.00e+00    0 0.00e+00  0
>>
>> 6 MPI ranks + 6 GPUs + CUDA-aware SF + No log_view
>> MatMult:             100 1.0 9.8344e-02                                                                        575717
>>
>> 24 MPI ranks + 6 GPUs + regular SF + log_view
>> MatMult              100 1.0 1.1572e-01 1.0 2.63e+09 1.2 1.9e+04 5.9e+04 0.0e+00  0 99 97 25  0 100100100100  0 489223   708601  100 4.61e+01  100 6.72e+01 100
>> VecScatterBegin      100 1.0 2.0641e-02 2.0 0.00e+00 0.0 1.9e+04 5.9e+04 0.0e+00  0  0 97 25  0  13  0100100  0     0       0      0 0.00e+00  100 6.72e+01  0
>> VecScatterEnd        100 1.0 6.8114e-02 5.6 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0  38  0  0  0  0     0       0      0 0.00e+00    0 0.00e+00  0
>> VecCUDACopyTo        100 1.0 6.6646e-03 2.5 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   3  0  0  0  0     0       0    100 4.61e+01    0 0.00e+00  0
>> VecCopyFromSome      100 1.0 1.0546e-02 1.7 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   7  0  0  0  0     0       0      0 0.00e+00  100 6.72e+01  0
>>
>> 24 MPI ranks + 6 GPUs + regular SF + No log_view
>> MatMult:             100 1.0 9.8254e-02                                                                        576201
>>
>> 24 MPI ranks + 6 GPUs + CUDA-aware SF + log_view
>> MatMult              100 1.0 1.1602e-01 1.0 2.63e+09 1.2 1.9e+04 5.9e+04 0.0e+00  1 99 97 25  0 100100100100  0 487956   707524    0 0.00e+00    0 0.00e+00 100
>> VecScatterBegin      100 1.0 2.7088e-02 7.0 0.00e+00 0.0 1.9e+04 5.9e+04 0.0e+00  0  0 97 25  0   8  0100100  0     0       0      0 0.00e+00    0 0.00e+00  0
>> VecScatterEnd        100 1.0 8.4262e-02 3.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  1  0  0  0  0  52  0  0  0  0     0       0      0 0.00e+00    0 0.00e+00  0
>>
>> 24 MPI ranks + 6 GPUs + CUDA-aware SF + No log_view
>> MatMult:             100 1.0 1.0397e-01                                                                        544510
>>
