I am working away on this branch, making some progress, and also cleaning things up with some small simplifications. I hope I can succeed: a bunch of stuff got moved around and some structs changed, and the merge could not handle some of this, so I have to do a good amount of code wrangling to fix it.
I'll let you know as I progress.

Barry

> On May 28, 2021, at 10:53 PM, Barry Smith <[email protected]> wrote:
>
> I have rebased and tried to fix everything. I am now fixing the issues of
> --download-openmpi and cuda; once that is done I will test, rebase with main
> again if needed, and restart the MR and get it into main.
>
> Barry
>
> I was stupid to let the MR lie fallow; I should have figured out a solution
> to the openmpi and cuda issue instead of punting and waiting for a dream fix.
>
>> On May 28, 2021, at 2:39 PM, Mark Adams <[email protected]> wrote:
>>
>> Thanks,
>>
>> I did not intend to make any (real) changes. The only thing that I did not
>> intend to use from Barry's branch, that conflicted, was the help and comment
>> block at the top of ex5cu.cu.
>>
>> * I ended up with two declarations of PetscSplitCSRDataStructure.
>> * I added some includes to fix errors like this:
>>   /ccs/home/adams/petsc/include/../src/mat/impls/aij/seq/seqcusparse/cusparsematimpl.h(263): error: incomplete type is not allowed
>> * I ended up not having csr2csc_i in Mat_SeqAIJCUSPARSE, so I get:
>>   /autofs/nccs-svm1_home1/adams/petsc/src/mat/impls/aij/seq/seqcusparse/aijcusparse.cu(1348): error: class "Mat_SeqAIJCUSPARSE" has no member "csr2csc_i"
>>
>> On Fri, May 28, 2021 at 3:13 PM Stefano Zampini <[email protected]> wrote:
>> I can take a quick look at it tomorrow. What are the main changes you made since then?
>>
>>> On May 28, 2021, at 9:51 PM, Mark Adams <[email protected]> wrote:
>>>
>>> I am getting messed up in trying to resolve conflicts in rebasing over main.
>>> Is there a better way of doing this?
>>> Can I just tell git to use Barry's version and then test it?
>>> Or should I just try it again?
>>>
>>> On Fri, May 28, 2021 at 2:15 PM Mark Adams <[email protected]> wrote:
>>> I am rebasing over main and it's a bit of a mess. I must have missed
>>> something. I get this; I think the _n_SplitCSRMat must be wrong.
>>>
>>> In file included from /autofs/nccs-svm1_home1/adams/petsc/src/vec/is/sf/impls/basic/sfbasic.c:128:0:
>>> /ccs/home/adams/petsc/include/petscmat.h:1976:32: error: conflicting types for 'PetscSplitCSRDataStructure'
>>>  typedef struct _n_SplitCSRMat *PetscSplitCSRDataStructure;
>>>                                 ^~~~~~~~~~~~~~~~~~~~~~~~~~
>>> /ccs/home/adams/petsc/include/petscmat.h:1922:31: note: previous declaration of 'PetscSplitCSRDataStructure' was here
>>>  typedef struct _p_SplitCSRMat PetscSplitCSRDataStructure;
>>>                                ^~~~~~~~~~~~~~~~~~~~~~~~~~
>>>   CC arch-summit-opt-gnu-cuda/obj/vec/vec/impls/seq/dvec2.o
>>>
>>> On Fri, May 28, 2021 at 1:50 PM Stefano Zampini <[email protected]> wrote:
>>> OpenMPI.py depends on cuda.py in that, if cuda is present, it configures using cuda.
>>> MPI.py and MPICH.py do not depend on cuda.py (MPICH only weakly: it adds a print if cuda is present).
>>> Since eventually the MPI distro will only need a hint to be configured with CUDA,
>>> why not remove the dependency altogether and add only a flag --download-openmpi-use-cuda?
>>>
>>>> On May 28, 2021, at 8:44 PM, Barry Smith <[email protected]> wrote:
>>>>
>>>> Stefano, who has a far better memory than me, wrote
>>>>
>>>> > Or probably remove --download-openmpi? Or, just for the moment, why
>>>> > can't we just tell configure that mpi is a weak dependence of cuda.py,
>>>> > so that it will be forced to be configured later?
>>>>
>>>> MPI.py depends on cuda.py, so we cannot also have cuda.py depend on
>>>> MPI.py using the generic dependencies of configure/packages,
>>>>
>>>> but perhaps we can just hardwire the rerunning of cuda.py when the MPI
>>>> compilers are reset. I will try that now, and if I can get it to work we
>>>> should be able to move those old fix branches along as an MR.
>>>>
>>>> Barry
>>>>
>>>>> On May 28, 2021, at 12:41 PM, Mark Adams <[email protected]> wrote:
>>>>>
>>>>> OK, I will try to rebase and test Barry's branch.
>>>>>
>>>>> On Fri, May 28, 2021 at 1:26 PM Stefano Zampini <[email protected]> wrote:
>>>>> Yes, it is the branch I was using before force pushing to Barry's
>>>>> barry/2020-11-11/cleanup-matsetvaluesdevice.
>>>>> You can use both, I guess.
>>>>>
>>>>>> On May 28, 2021, at 8:25 PM, Mark Adams <[email protected]> wrote:
>>>>>>
>>>>>> Is this the correct branch? It conflicted with ex5cu, so I assume it is.
>>>>>>
>>>>>> stefanozampini/simplify-setvalues-device
>>>>>> <https://gitlab.com/petsc/petsc/-/tree/stefanozampini/simplify-setvalues-device>
>>>>>>
>>>>>> On Fri, May 28, 2021 at 1:24 PM Mark Adams <[email protected]> wrote:
>>>>>> I am rebasing this branch over main and fixing it up.
>>>>>>
>>>>>> On Fri, May 28, 2021 at 1:16 PM Stefano Zampini <[email protected]> wrote:
>>>>>> Or probably remove --download-openmpi? Or, just for the moment, why
>>>>>> can't we just tell configure that mpi is a weak dependence of cuda.py,
>>>>>> so that it will be forced to be configured later?
>>>>>>
>>>>>>> On May 28, 2021, at 8:12 PM, Stefano Zampini <[email protected]> wrote:
>>>>>>>
>>>>>>> That branch provides a fix for MatSetValuesDevice, but it never got
>>>>>>> merged because of the CI issues with --download-openmpi. We can
>>>>>>> probably try to skip the test in that specific configuration?
>>>>>>>
>>>>>>>> On May 28, 2021, at 7:45 PM, Barry Smith <[email protected]> wrote:
>>>>>>>>
>>>>>>>> ~/petsc/src/mat/tutorials (barry/2021-05-28/robustify-cuda-gencodearch-check=) arch-robustify-cuda-gencodearch-check
>>>>>>>> $ ./ex5cu
>>>>>>>> terminate called after throwing an instance of 'thrust::system::system_error'
>>>>>>>>   what():  fill_n: failed to synchronize: cudaErrorIllegalAddress: an illegal memory access was encountered
>>>>>>>> Aborted (core dumped)
>>>>>>>>
>>>>>>>> requires: cuda !define(PETSC_USE_CTABLE)
>>>>>>>>
>>>>>>>> CI does not test with CUDA and no ctable. The code is still broken,
>>>>>>>> as it was six months ago in the discussion Stefano pointed to. It is
>>>>>>>> clear why: just no one has had the time to clean things up.
>>>>>>>>
>>>>>>>> Barry
>>>>>>>>
>>>>>>>>> On May 28, 2021, at 11:13 AM, Mark Adams <[email protected]> wrote:
>>>>>>>>>
>>>>>>>>> On Fri, May 28, 2021 at 11:57 AM Stefano Zampini <[email protected]> wrote:
>>>>>>>>> If you are referring to your device set values, I guess it is not currently tested.
>>>>>>>>>
>>>>>>>>> No. There is a test for that (ex5cu).
>>>>>>>>> I have a user that is getting a segv in MatSetValues with aijcusparse.
>>>>>>>>> I suspect there is memory corruption, but I'm trying to cover all the bases.
>>>>>>>>> I have added a cuda test to ksp/ex56 that works. I can do an MR for
>>>>>>>>> it if such a test does not exist.
>>>>>>>>>
>>>>>>>>> See the discussions here:
>>>>>>>>> https://gitlab.com/petsc/petsc/-/merge_requests/3411
>>>>>>>>> I started cleaning up the code to prepare for testing, but we never finished it:
>>>>>>>>> https://gitlab.com/petsc/petsc/-/commits/stefanozampini/simplify-setvalues-device/
>>>>>>>>>
>>>>>>>>>> On May 28, 2021, at 6:53 PM, Mark Adams <[email protected]> wrote:
>>>>>>>>>>
>>>>>>>>>> Is there a test with MatSetValues and CUDA?
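On Mark's question above about just telling git to use Barry's version while rebasing: a minimal sketch of two common ways to do that, assuming the branch being rebased is barry/2020-11-11/cleanup-matsetvaluesdevice and the upstream is main (both names taken from the thread; the conflicted file below is only an example). Note that during a rebase the ours/theirs labels are reversed: "theirs" is the commit being replayed from the branch, "ours" is the main side. Preferring one side wholesale can silently drop changes that came in from main, so the result still needs to be built and tested.

    # Option 1: while the rebase is stopped on a conflict, take the branch's
    # (Barry's) version of a particular file, then continue.
    git checkout barry/2020-11-11/cleanup-matsetvaluesdevice
    git rebase main
    git checkout --theirs -- include/petscmat.h   # example path, not necessarily the conflicted file
    git add include/petscmat.h
    git rebase --continue

    # Option 2: let the merge machinery prefer the branch side for every
    # conflicted hunk in one pass; non-conflicting changes from main are kept.
    git rebase -X theirs main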
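The "conflicting types for 'PetscSplitCSRDataStructure'" error quoted above is ordinary C: a typedef name cannot be redeclared with a different underlying type in the same translation unit, and after the merge petscmat.h ends up carrying both the old struct typedef and the new opaque-pointer typedef. A stripped-down sketch (the struct and typedef names mirror the error message; keeping only the pointer form is the generic C resolution, not necessarily what the branch itself does):

    /* What survives a bad conflict resolution (rejected by the compiler):
     *   typedef struct _p_SplitCSRMat  PetscSplitCSRDataStructure;    old declaration
     *   typedef struct _n_SplitCSRMat *PetscSplitCSRDataStructure;    new declaration
     * The two give the same name different types, hence
     * "error: conflicting types for 'PetscSplitCSRDataStructure'".
     *
     * Generic fix: keep exactly one of them, e.g. the opaque-handle form,
     * and make every use site consistent with it. */
    typedef struct _n_SplitCSRMat *PetscSplitCSRDataStructure;

    /* Callers can then pass the handle around without seeing the struct layout. */
    void MyDeviceSetup(PetscSplitCSRDataStructure d);   /* hypothetical prototype for illustration */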
