> On Jun 2, 2021, at 2:06 AM, Stefano Zampini <stefano.zamp...@gmail.com> wrote:
>
> Barry
>
> If I remember correctly, the ctable business was a limitation of Mark's initial implementation, which I removed here: https://gitlab.com/petsc/petsc/-/merge_requests/3411/diffs?commit_id=36173707431f5583889720868beae9046e6e1fd2. There colmap was "int" already. We made a mistake letting this branch fall so far behind main. I will add my observations in the MR.

  Thanks; it was likely the merge that brought in the incorrect colmap. Yes, it was my fault the branch got neglected: I became too emotional over the configure issue with CUDA, declared that it needed a complete rewrite, and then forgot about the branch. I have now "fixed" the CUDA configure issue in an "ok" but hacky manner: after MPI is configured and changes the compilers (thus removing all the compiler flags), I force a rerun of parts of cuda.py to add back their contributions to the CUDA compiler flags. There is a circular dependency that this change resolves.
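  To make the colmap mix-up concrete (details in my quoted message below): cudaMemcpy takes void*, so nothing catches a PetscInt*/int* mismatch at compile time. A minimal sketch of the bug pattern, with hypothetical names rather than the actual PETSc structs:

    /* sketch only: hypothetical struct and function names */
    #include <cuda_runtime.h>
    #include <petscsys.h> /* PetscInt is 64-bit with --with-64-bit-indices */

    typedef struct {
      PetscInt *colmap; /* BUG: device field declared as PetscInt* ... */
    } DeviceMatSketch;

    void CopyColmapToGPU(DeviceMatSketch *d, const int *colmap_host, int n)
    {
      /* ... but allocated and filled as int; the void* interface lets this
         compile silently.  With 64-bit indices every device read of colmap
         then sees interleaved garbage.  Declaring the field int*, to match
         the array actually copied down, is the fix described below. */
      cudaMalloc((void **)&d->colmap, n * sizeof(int));
      cudaMemcpy(d->colmap, colmap_host, n * sizeof(int), cudaMemcpyHostToDevice);
    }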
  Barry

> On Wed, Jun 2, 2021 at 09:20 Barry Smith <bsm...@petsc.dev> wrote:
>
>   Switching the thread to petsc-dev since users don't need this.
>
>   Had a nasty failure in one of the CI jobs; it turned out the GPU colmap was declared as PetscInt* but was copied down from the CPU into GPU memory as int. The power of the void* interface. Thus ex5cu.cu, which was never CI tested, was crashing very strangely. By declaring it properly as int* I got the example to run with 64-bit indices. I submitted a new pipeline to see what else pops up.
>
>   I am a bit worried about the csr2csc data structures and code: the line declaring csr2csc_i inside the struct disappeared in the rebase on main. I stuck it back in, but I don't understand that code at all. I'll report further when the CI is done or crashes. I will also need to make sure it is compiled both pre and post PETSC_PKG_CUDA_VERSION_GE(11,0,0).
>
>   I will also investigate this CTABLE business. The GPU doesn't need to know or care whether the CPU is using a CTABLE; it just needs the colmap array format, which it gets from spreading out the garray anyway, to be copied to the GPU. So PETSc can likely be built with CTABLES as normal; this will improve the testing a lot.
>
>   Barry
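  The "spreading out" of garray amounts to something like the following sketch (illustrative names, not the actual MatMPIAIJ code): regardless of whether the host keeps colmap as a CTABLE hash table or a dense array, the copy sent to the device can always be a dense int array built from garray.

    /* sketch: build the dense device-format colmap from garray,
       independent of PETSC_USE_CTABLE on the host */
    #include <stdlib.h>
    #include <petscsys.h>

    static int *BuildDeviceColmap(const PetscInt *garray, int nghost, PetscInt Ncols)
    {
      int *colmap = (int *)calloc((size_t)Ncols, sizeof(int));
      /* garray[i] is the global index of the i-th off-diagonal (ghost) column;
         store local index + 1 so that 0 means "column not present" */
      for (int i = 0; i < nghost; i++) colmap[garray[i]] = i + 1;
      return colmap; /* this dense int array is what gets cudaMemcpy'd to the GPU */
    }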
>> On May 30, 2021, at 11:54 AM, Barry Smith <bsm...@petsc.dev> wrote:
>>
>>   I believe I have finally successfully rebased the branch barry/2020-11-11/cleanup-matsetvaluesdevice against main and cleaned up all the issues. Please read the commit message.
>>
>>   I have submitted a CI pipeline with ctables turned off temporarily for testing of MatSetValuesDevice(). If it works, hopefully Mark can run a few additional tests of his Landau code that are not in the usual testing to verify, and we can finally get the branch into main.
>>
>>   Mark,
>>
>>   Since this change is involved, it is likely your Landau mass matrix branch may not rebase cleanly. Let me know if you would like me to do the rebase and testing of your Landau mass matrix branch. I can get it ready to work with the results of barry/2020-11-11/cleanup-matsetvaluesdevice and then hand it back to you for further development.
>>
>>   Barry
>>
>>> On May 29, 2021, at 11:32 AM, Barry Smith <bsm...@petsc.dev> wrote:
>>>
>>>   I am working away on this branch, making some progress, and also cleaning things up with some small simplifications. I hope I can succeed: a bunch of stuff got moved around, some structs changed, and the merge could not handle some of this, so I have to do a good amount of code wrangling to fix it.
>>>
>>>   I'll let you know as I progress.
>>>
>>>   Barry
>>>
>>>> On May 28, 2021, at 10:53 PM, Barry Smith <bsm...@petsc.dev> wrote:
>>>>
>>>>   I have rebased and tried to fix everything. I am now fixing the issues with --download-openmpi and cuda; once that is done I will test, rebase with main again if needed, restart the MR, and get it into main.
>>>>
>>>>   Barry
>>>>
>>>>   I was stupid to let the MR lie fallow; I should have figured out a solution to the openmpi and cuda issue instead of punting and waiting for a dream fix.
>>>>
>>>>> On May 28, 2021, at 2:39 PM, Mark Adams <mfad...@lbl.gov> wrote:
>>>>>
>>>>> Thanks,
>>>>>
>>>>> I did not intend to make any (real) changes. The only thing that I did not intend to use from Barry's branch, which conflicted, was the help and comment block at the top of ex5cu.cu.
>>>>>
>>>>> * I ended up with two declarations of PetscSplitCSRDataStructure.
>>>>> * I added some includes to fix errors like this:
>>>>>     /ccs/home/adams/petsc/include/../src/mat/impls/aij/seq/seqcusparse/cusparsematimpl.h(263): error: incomplete type is not allowed
>>>>> * I ended up not having csr2csc_i in Mat_SeqAIJCUSPARSE, so I get:
>>>>>     /autofs/nccs-svm1_home1/adams/petsc/src/mat/impls/aij/seq/seqcusparse/aijcusparse.cu(1348): error: class "Mat_SeqAIJCUSPARSE" has no member "csr2csc_i"
>>>>>
>>>>> On Fri, May 28, 2021 at 3:13 PM Stefano Zampini <stefano.zamp...@gmail.com> wrote:
>>>>> I can take a quick look at it tomorrow; what are the main changes you made since then?
>>>>>
>>>>>> On May 28, 2021, at 9:51 PM, Mark Adams <mfad...@lbl.gov> wrote:
>>>>>>
>>>>>> I am getting messed up trying to resolve conflicts in rebasing over main. Is there a better way of doing this? Can I just tell git to use Barry's version and then test it? Or should I just try it again?
>>>>>>
>>>>>> On Fri, May 28, 2021 at 2:15 PM Mark Adams <mfad...@lbl.gov> wrote:
>>>>>> I am rebasing over main and it's a bit of a mess. I must have missed something. I get the error below; I think the _n_SplitCSRMat must be wrong.
>>>>>>
>>>>>> In file included from /autofs/nccs-svm1_home1/adams/petsc/src/vec/is/sf/impls/basic/sfbasic.c:128:0:
>>>>>> /ccs/home/adams/petsc/include/petscmat.h:1976:32: error: conflicting types for 'PetscSplitCSRDataStructure'
>>>>>>  typedef struct _n_SplitCSRMat *PetscSplitCSRDataStructure;
>>>>>>                                 ^~~~~~~~~~~~~~~~~~~~~~~~~~
>>>>>> /ccs/home/adams/petsc/include/petscmat.h:1922:31: note: previous declaration of 'PetscSplitCSRDataStructure' was here
>>>>>>  typedef struct _p_SplitCSRMat PetscSplitCSRDataStructure;
>>>>>>                                ^~~~~~~~~~~~~~~~~~~~~~~~~~
>>>>>>           CC arch-summit-opt-gnu-cuda/obj/vec/vec/impls/seq/dvec2.o
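  The compiler is enforcing C's rule that a typedef name may be redeclared only with an identical type. A reduced illustration of the conflict (not the actual petscmat.h):

    /* reduced illustration, not the actual header */
    struct _n_SplitCSRMat;
    typedef struct _n_SplitCSRMat *PetscSplitCSRDataStructure; /* pointer typedef */

    /* If a second, different declaration also survives the rebase, e.g.
         typedef struct _p_SplitCSRMat PetscSplitCSRDataStructure;
       it names a different type (a struct, not a pointer to one), so the
       compiler reports "conflicting types".  The fix is to delete one of
       the two so exactly one declaration of the name remains. */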
>>>>>> On Fri, May 28, 2021 at 1:50 PM Stefano Zampini <stefano.zamp...@gmail.com> wrote:
>>>>>> OpenMPI.py depends on cuda.py in that, if cuda is present, it configures using cuda. MPI.py and MPICH.py do not depend on cuda.py (MPICH only weakly: it adds a print if cuda is present). Since eventually the MPI distro will only need a hint to be configured with CUDA, why not remove the dependency altogether and add only a flag --download-openmpi-use-cuda?
>>>>>>
>>>>>>> On May 28, 2021, at 8:44 PM, Barry Smith <bsm...@petsc.dev> wrote:
>>>>>>>
>>>>>>>   Stefano, who has a far better memory than me, wrote:
>>>>>>>
>>>>>>>   > Or probably remove --download-openmpi? Or, just for the moment, why can't we just tell configure that mpi is a weak dependence of cuda.py, so that it will be forced to be configured later?
>>>>>>>
>>>>>>>   MPI.py depends on cuda.py, so we cannot also have cuda.py depend on MPI.py using the generic dependencies of configure/packages,
>>>>>>>
>>>>>>>   but perhaps we can just hardwire the rerunning of cuda.py when the MPI compilers are reset. I will try that now, and if I can get it to work we should be able to move those old fix branches along as MRs.
>>>>>>>
>>>>>>>   Barry
>>>>>>>
>>>>>>>> On May 28, 2021, at 12:41 PM, Mark Adams <mfad...@lbl.gov> wrote:
>>>>>>>>
>>>>>>>> OK, I will try to rebase and test Barry's branch.
>>>>>>>>
>>>>>>>> On Fri, May 28, 2021 at 1:26 PM Stefano Zampini <stefano.zamp...@gmail.com> wrote:
>>>>>>>> Yes, it is the branch I was using before force pushing to Barry's barry/2020-11-11/cleanup-matsetvaluesdevice. You can use either, I guess.
>>>>>>>>
>>>>>>>>> On May 28, 2021, at 8:25 PM, Mark Adams <mfad...@lbl.gov> wrote:
>>>>>>>>>
>>>>>>>>> Is this the correct branch? It conflicted with ex5cu, so I assume it is.
>>>>>>>>>
>>>>>>>>> stefanozampini/simplify-setvalues-device <https://gitlab.com/petsc/petsc/-/tree/stefanozampini/simplify-setvalues-device>
>>>>>>>>>
>>>>>>>>> On Fri, May 28, 2021 at 1:24 PM Mark Adams <mfad...@lbl.gov> wrote:
>>>>>>>>> I am fixing up the rebase of this branch over main.
>>>>>>>>>
>>>>>>>>> On Fri, May 28, 2021 at 1:16 PM Stefano Zampini <stefano.zamp...@gmail.com> wrote:
>>>>>>>>> Or probably remove --download-openmpi? Or, just for the moment, why can't we just tell configure that mpi is a weak dependence of cuda.py, so that it will be forced to be configured later?
>>>>>>>>>
>>>>>>>>>> On May 28, 2021, at 8:12 PM, Stefano Zampini <stefano.zamp...@gmail.com> wrote:
>>>>>>>>>>
>>>>>>>>>> That branch provides a fix for MatSetValuesDevice, but it never got merged because of the CI issues with --download-openmpi. We can probably try to skip the test in that specific configuration?
>>>>>>>>>>
>>>>>>>>>>> On May 28, 2021, at 7:45 PM, Barry Smith <bsm...@petsc.dev> wrote:
>>>>>>>>>>>
>>>>>>>>>>> ~/petsc/src/mat/tutorials (barry/2021-05-28/robustify-cuda-gencodearch-check=) arch-robustify-cuda-gencodearch-check
>>>>>>>>>>> $ ./ex5cu
>>>>>>>>>>> terminate called after throwing an instance of 'thrust::system::system_error'
>>>>>>>>>>>   what():  fill_n: failed to synchronize: cudaErrorIllegalAddress: an illegal memory access was encountered
>>>>>>>>>>> Aborted (core dumped)
>>>>>>>>>>>
>>>>>>>>>>> The test is marked: requires: cuda !define(PETSC_USE_CTABLE)
>>>>>>>>>>>
>>>>>>>>>>> CI does not test with CUDA and no ctable, so the code is still broken, as it was six months ago in the discussion Stefano pointed to. It is clear why; it's just that no one has had the time to clean things up.
>>>>>>>>>>>
>>>>>>>>>>> Barry
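  For context, ex5cu assembles the matrix from inside a CUDA kernel. A rough sketch of the pattern it exercises follows; this is from memory of the branch under discussion, so treat the header name, MatCUSPARSEGetDeviceMatWrite, MatSetValuesDevice, and their exact signatures as approximations rather than the settled API:

    /* rough sketch of the ex5cu pattern; names and signatures approximate */
    #include <petscaijdevice.h> /* assumed header for the device-side API */

    __global__ void InsertDiagonal(PetscSplitCSRDataStructure d_mat, PetscInt rstart, PetscInt rend)
    {
      PetscInt    i = rstart + blockIdx.x * blockDim.x + threadIdx.x;
      PetscScalar v = 1.0;
      if (i < rend) (void)MatSetValuesDevice(d_mat, 1, &i, 1, &i, &v, ADD_VALUES); /* insertion on the GPU */
    }

    /* host side (sketch): get the device view of the AIJCUSPARSE matrix,
       launch the kernel, then assemble as usual:

         PetscSplitCSRDataStructure d_mat;
         MatCUSPARSEGetDeviceMatWrite(A, &d_mat);
         InsertDiagonal<<<blocks, threads>>>(d_mat, rstart, rend);
         MatAssemblyBegin(A, MAT_FINAL_ASSEMBLY);
         MatAssemblyEnd(A, MAT_FINAL_ASSEMBLY);
    */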
>>>>>>>>>>>
>>>>>>>>>>>> On May 28, 2021, at 11:13 AM, Mark Adams <mfad...@lbl.gov> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>> On Fri, May 28, 2021 at 11:57 AM Stefano Zampini <stefano.zamp...@gmail.com> wrote:
>>>>>>>>>>>> If you are referring to your device set values, I guess it is not currently tested.
>>>>>>>>>>>>
>>>>>>>>>>>> No, there is a test for that (ex5cu). I have a user who is getting a segv in MatSetValues with aijcusparse. I suspect memory corruption, but I'm trying to cover all the bases. I have added a cuda test to ksp/ex56 that works; I can do an MR for it if such a test does not exist.
>>>>>>>>>>>>
>>>>>>>>>>>> See the discussions here: https://gitlab.com/petsc/petsc/-/merge_requests/3411. I started cleaning up the code to prepare for testing, but we never finished it: https://gitlab.com/petsc/petsc/-/commits/stefanozampini/simplify-setvalues-device/
>>>>>>>>>>>>
>>>>>>>>>>>>> On May 28, 2021, at 6:53 PM, Mark Adams <mfad...@lbl.gov> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>> Is there a test with MatSetValues and CUDA?
>
> --
> Stefano