Thanks, you made several changes here, including switches with the workvector size. I guess I should import this logic to the transpose method(s), except for the yy==NULL branches ...
MatMult_ calls MatMultAdd with yy=0, but the transpose version have their own code. MatMultTranspose_SeqAIJCUSPARSE is very simple. Thanks again, Mark On Wed, Jul 10, 2019 at 9:22 AM Stefano Zampini <stefano.zamp...@gmail.com> wrote: > Mark, > > if the difference is on lvec, I suspect the bug has to do with compressed > row storage. I have fixed a similar bug in MatMult. > you want to check cusparsestruct->workVector->size() against A->cmap->n. > > Stefano > > Il giorno mer 10 lug 2019 alle ore 15:54 Mark Adams via petsc-dev < > petsc-dev@mcs.anl.gov> ha scritto: > >> >> >> On Wed, Jul 10, 2019 at 1:13 AM Smith, Barry F. <bsm...@mcs.anl.gov> >> wrote: >> >>> >>> ierr = VecGetLocalSize(xx,&nt);CHKERRQ(ierr); >>> if (nt != A->rmap->n) >>> SETERRQ2(PETSC_COMM_SELF,PETSC_ERR_ARG_SIZ,"Incompatible partition of A >>> (%D) and xx (%D)",A->rmap->n,nt); >>> ierr = VecScatterInitializeForGPU(a->Mvctx,xx);CHKERRQ(ierr); >>> ierr = (*a->B->ops->multtranspose)(a->B,xx,a->lvec);CHKERRQ(ierr); >>> >>> So the xx on the GPU appears ok? >> >> >> The norm is correct and ... >> >> >>> The a->B appears ok? >> >> >> yes >> >> >>> But on process 1 the result a->lvec is wrong? >>> >> >> yes >> >> >>> How do you look at the a->lvec? Do you copy it to the CPU and print it? >>> >> >> I use Vec[Mat]ViewFromOptions. Oh, that has not been implemented so I >> should copy it. Maybe I should make a CUDA version of these methods? >> >> >>> >>> ierr = (*a->A->ops->multtranspose)(a->A,xx,yy);CHKERRQ(ierr); >>> ierr = >>> VecScatterBegin(a->Mvctx,a->lvec,yy,ADD_VALUES,SCATTER_REVERSE);CHKERRQ(ierr); >>> ierr = >>> VecScatterEnd(a->Mvctx,a->lvec,yy,ADD_VALUES,SCATTER_REVERSE);CHKERRQ(ierr); >>> ierr = VecScatterFinalizeForGPU(a->Mvctx);CHKERRQ(ierr); >>> >>> Digging around in MatMultTranspose_SeqAIJCUSPARSE doesn't help? >> >> >> This is where I have been digging around an printing stuff. >> >> >>> >>> Are you sure the problem isn't related to the "stream business"? >>> >> >> I don't know what that is but I have played around with adding >> cudaDeviceSynchronize >> >> >>> >>> /* This multiplication sequence is different sequence >>> than the CPU version. In particular, the diagonal block >>> multiplication kernel is launched in one stream. Then, >>> in a separate stream, the data transfers from DeviceToHost >>> (with MPI messaging in between), then HostToDevice are >>> launched. Once the data transfer stream is synchronized, >>> to ensure messaging is complete, the MatMultAdd kernel >>> is launched in the original (MatMult) stream to protect >>> against race conditions. >>> >>> This sequence should only be called for GPU computation. */ >>> >>> Note this comment isn't right and appears to be cut and paste from >>> somewhere else, since there is no MatMult() nor MatMultAdd kernel here? >>> >> >> Yes, I noticed this. Same as MatMult and not correct here. >> >> >>> >>> Anyway to "turn off the stream business" and see if the result is then >>> correct? >> >> >> How do you do that? I'm looking at docs on streams but not sure how its >> used here. >> >> >>> Perhaps the stream business was done correctly for MatMult() but was >>> never right for MatMultTranspose()? >>> >>> Barry >>> >>> BTW: Unrelated comment, the code >>> >>> ierr = VecSet(yy,0);CHKERRQ(ierr); >>> ierr = VecCUDAGetArrayWrite(yy,&yarray);CHKERRQ(ierr); >>> >>> has an unneeded ierr = VecSet(yy,0);CHKERRQ(ierr); here. >>> VecCUDAGetArrayWrite() requires that you ignore the values in yy and set >>> them all yourself so setting them to zero before calling >>> VecCUDAGetArrayWrite() does nothing except waste time. >>> >>> >> OK, I'll get rid of it. >> >> >>> >>> > On Jul 9, 2019, at 3:16 PM, Mark Adams via petsc-dev < >>> petsc-dev@mcs.anl.gov> wrote: >>> > >>> > I am stumped with this GPU bug(s). Maybe someone has an idea. >>> > >>> > I did find a bug in the cuda transpose mat-vec that cuda-memcheck >>> detected, but I still have differences between the GPU and CPU transpose >>> mat-vec. I've got it down to a very simple test: bicg/none on a tiny mesh >>> with two processors. It works on one processor or with cg/none. So it is >>> the transpose mat-vec. >>> > >>> > I see that the result of the off-diagonal (a->lvec) is different only >>> proc 1. I instrumented MatMultTranspose_MPIAIJ[CUSPARSE] with norms of mat >>> and vec and printed out matlab vectors. Below is the CPU output and then >>> the GPU with a view of the scatter object, which is identical as you can >>> see. >>> > >>> > The matlab B matrix and xx vector are identical. Maybe the GPU copy is >>> wrong ... >>> > >>> > The only/first difference between CPU and GPU is a->lvec (the off >>> diagonal contribution)on processor 1. (you can see the norms are >>> different). Here is the diff on the process 1 a->lvec vector (all values >>> are off). >>> > >>> > Any thoughts would be appreciated, >>> > Mark >>> > >>> > 15:30 1 /gpfs/alpine/scratch/adams/geo127$ diff lvgpu.m lvcpu.m >>> > 2,12c2,12 >>> > < % type: seqcuda >>> > < Vec_0x53738630_0 = [ >>> > < 9.5702137431412879e+00 >>> > < 2.1970298791152253e+01 >>> > < 4.5422290209190646e+00 >>> > < 2.0185031807270226e+00 >>> > < 4.2627312508573375e+01 >>> > < 1.0889191983882025e+01 >>> > < 1.6038202417695462e+01 >>> > < 2.7155672033607665e+01 >>> > < 6.2540357853223556e+00 >>> > --- >>> > > % type: seq >>> > > Vec_0x3a546440_0 = [ >>> > > 4.5565851251714653e+00 >>> > > 1.0460532998971189e+01 >>> > > 2.1626531807270220e+00 >>> > > 9.6105288923182408e-01 >>> > > 2.0295782656035659e+01 >>> > > 5.1845791066529463e+00 >>> > > 7.6361340020576058e+00 >>> > > 1.2929401011659799e+01 >>> > > 2.9776812928669392e+00 >>> > >>> > 15:15 130 /gpfs/alpine/scratch/adams/geo127$ jsrun -n 1 -c 2 -a 2 -g >>> 1 ./ex56 -cells 2,2,1 >>> > [0] 27 global equations, 9 vertices >>> > [0] 27 equations in vector, 9 vertices >>> > 0 SNES Function norm 1.223958326481e+02 >>> > 0 KSP Residual norm 1.223958326481e+02 >>> > [0] |x|= 1.223958326481e+02 |a->lvec|= 1.773965489475e+01 |B|= >>> 1.424708937136e+00 >>> > [1] |x|= 1.223958326481e+02 |a->lvec|= 2.844171413778e+01 |B|= >>> 1.424708937136e+00 >>> > [1] 1) |yy|= 2.007423334680e+02 >>> > [0] 1) |yy|= 2.007423334680e+02 >>> > [0] 2) |yy|= 1.957605719265e+02 >>> > [1] 2) |yy|= 1.957605719265e+02 >>> > [1] Number sends = 1; Number to self = 0 >>> > [1] 0 length = 9 to whom 0 >>> > Now the indices for all remote sends (in order by process sent to) >>> > [1] 9 >>> > [1] 10 >>> > [1] 11 >>> > [1] 12 >>> > [1] 13 >>> > [1] 14 >>> > [1] 15 >>> > [1] 16 >>> > [1] 17 >>> > [1] Number receives = 1; Number from self = 0 >>> > [1] 0 length 9 from whom 0 >>> > Now the indices for all remote receives (in order by process received >>> from) >>> > [1] 0 >>> > [1] 1 >>> > [1] 2 >>> > [1] 3 >>> > [1] 4 >>> > [1] 5 >>> > [1] 6 >>> > [1] 7 >>> > [1] 8 >>> > 1 KSP Residual norm 8.199932342150e+01 >>> > Linear solve did not converge due to DIVERGED_ITS iterations 1 >>> > Nonlinear solve did not converge due to DIVERGED_LINEAR_SOLVE >>> iterations 0 >>> > >>> > >>> > 15:19 /gpfs/alpine/scratch/adams/geo127$ jsrun -n 1 -c 2 -a 2 -g 1 >>> ./ex56 -cells 2,2,1 -ex56_dm_mat_type aijcusparse -ex56_dm_vec_type cuda >>> > [0] 27 global equations, 9 vertices >>> > [0] 27 equations in vector, 9 vertices >>> > 0 SNES Function norm 1.223958326481e+02 >>> > 0 KSP Residual norm 1.223958326481e+02 >>> > [0] |x|= 1.223958326481e+02 |a->lvec|= 1.773965489475e+01 |B|= >>> 1.424708937136e+00 >>> > [1] |x|= 1.223958326481e+02 |a->lvec|= 5.973624458725e+01 |B|= >>> 1.424708937136e+00 >>> > [0] 1) |yy|= 2.007423334680e+02 >>> > [1] 1) |yy|= 2.007423334680e+02 >>> > [0] 2) |yy|= 1.953571867633e+02 >>> > [1] 2) |yy|= 1.953571867633e+02 >>> > [1] Number sends = 1; Number to self = 0 >>> > [1] 0 length = 9 to whom 0 >>> > Now the indices for all remote sends (in order by process sent to) >>> > [1] 9 >>> > [1] 10 >>> > [1] 11 >>> > [1] 12 >>> > [1] 13 >>> > [1] 14 >>> > [1] 15 >>> > [1] 16 >>> > [1] 17 >>> > [1] Number receives = 1; Number from self = 0 >>> > [1] 0 length 9 from whom 0 >>> > Now the indices for all remote receives (in order by process received >>> from) >>> > [1] 0 >>> > [1] 1 >>> > [1] 2 >>> > [1] 3 >>> > [1] 4 >>> > [1] 5 >>> > [1] 6 >>> > [1] 7 >>> > [1] 8 >>> > 1 KSP Residual norm 8.199932342150e+01 >>> > >>> >>> > > -- > Stefano >