I am working away on this branch, making some progress, and also cleaning things up with some small simplifications. I hope I can succeed: a bunch of stuff got moved around and some structs changed, and the merge could not handle some of this, so I have to do a good amount of code wrangling to fix it.
I'll let you know as I progress.

Barry

> On May 28, 2021, at 10:53 PM, Barry Smith <[email protected]> wrote:
>
> I have rebased and tried to fix everything. I am now fixing the issues of
> --download-openmpi and cuda; once that is done I will test, rebase with main
> again if needed, and restart the MR and get it into main.
>
> Barry
>
> I was stupid to let the MR lie fallow; I should have figured out a solution
> to the openmpi and cuda issue instead of punting and waiting for a dream fix.
>
>> On May 28, 2021, at 2:39 PM, Mark Adams <[email protected]> wrote:
>>
>> Thanks,
>>
>> I did not intend to make any (real) changes. The only thing that I did not
>> intend to use from Barry's branch, that conflicted, was the help and comment
>> block at the top of ex5cu.cu.
>>
>> * I ended up with two declarations of PetscSplitCSRDataStructure.
>> * I added some includes to fix errors like this:
>>   /ccs/home/adams/petsc/include/../src/mat/impls/aij/seq/seqcusparse/cusparsematimpl.h(263): error: incomplete type is not allowed
>> * I ended up not having csr2csc_i in Mat_SeqAIJCUSPARSE, so I get:
>>   /autofs/nccs-svm1_home1/adams/petsc/src/mat/impls/aij/seq/seqcusparse/aijcusparse.cu(1348): error: class "Mat_SeqAIJCUSPARSE" has no member "csr2csc_i"
>>
>> On Fri, May 28, 2021 at 3:13 PM Stefano Zampini <[email protected]> wrote:
>> I can take a quick look at it tomorrow. What are the main changes you made since then?
>>
>>> On May 28, 2021, at 9:51 PM, Mark Adams <[email protected]> wrote:
>>>
>>> I am getting messed up in trying to resolve conflicts in rebasing over main.
>>> Is there a better way of doing this?
>>> Can I just tell git to use Barry's version and then test it?
>>> Or should I just try it again?
>>>
>>> On Fri, May 28, 2021 at 2:15 PM Mark Adams <[email protected]> wrote:
>>> I am rebasing over main and it's a bit of a mess. I must have missed
>>> something. I get this; I think the _n_SplitCSRMat must be wrong.
>>>
>>> In file included from /autofs/nccs-svm1_home1/adams/petsc/src/vec/is/sf/impls/basic/sfbasic.c:128:0:
>>> /ccs/home/adams/petsc/include/petscmat.h:1976:32: error: conflicting types for 'PetscSplitCSRDataStructure'
>>>  typedef struct _n_SplitCSRMat *PetscSplitCSRDataStructure;
>>>                                 ^~~~~~~~~~~~~~~~~~~~~~~~~~
>>> /ccs/home/adams/petsc/include/petscmat.h:1922:31: note: previous declaration of 'PetscSplitCSRDataStructure' was here
>>>  typedef struct _p_SplitCSRMat PetscSplitCSRDataStructure;
>>>                                ^~~~~~~~~~~~~~~~~~~~~~~~~~
>>>   CC arch-summit-opt-gnu-cuda/obj/vec/vec/impls/seq/dvec2.o
>>>
>>> On Fri, May 28, 2021 at 1:50 PM Stefano Zampini <[email protected]> wrote:
>>> OpenMPI.py depends on cuda.py in that, if cuda is present, it configures using cuda.
>>> MPI.py and MPICH.py do not depend on cuda.py (MPICH only weakly: it adds a print if cuda is present).
>>> Since eventually the MPI distro will only need a hint to be configured with CUDA,
>>> why not remove the dependency altogether and add only a flag --download-openmpi-use-cuda?
>>>
>>>> On May 28, 2021, at 8:44 PM, Barry Smith <[email protected]> wrote:
>>>>
>>>> Stefano, who has a far better memory than me, wrote
>>>>
>>>> > Or probably remove --download-openmpi? Or, just for the moment, why
>>>> > can't we just tell configure that mpi is a weak dependence of cuda.py,
>>>> > so that it will be forced to be configured later?
>>>>
>>>> MPI.py depends on cuda.py, so we cannot also have cuda.py depend on
>>>> MPI.py using the generic dependencies of configure/packages,
>>>>
>>>> but perhaps we can just hardwire the rerunning of cuda.py when the MPI
>>>> compilers are reset. I will try that now, and if I can get it to work we
>>>> should be able to move those old fix branches along as an MR.
>>>>
>>>> Barry
>>>>
>>>>> On May 28, 2021, at 12:41 PM, Mark Adams <[email protected]> wrote:
>>>>>
>>>>> OK, I will try to rebase and test Barry's branch.
>>>>>
>>>>> On Fri, May 28, 2021 at 1:26 PM Stefano Zampini <[email protected]> wrote:
>>>>> Yes, it is the branch I was using before force pushing to Barry's
>>>>> barry/2020-11-11/cleanup-matsetvaluesdevice.
>>>>> You can use both, I guess.
>>>>>
>>>>>> On May 28, 2021, at 8:25 PM, Mark Adams <[email protected]> wrote:
>>>>>>
>>>>>> Is this the correct branch? It conflicted with ex5cu, so I assume it is.
>>>>>>
>>>>>> stefanozampini/simplify-setvalues-device
>>>>>> <https://gitlab.com/petsc/petsc/-/tree/stefanozampini/simplify-setvalues-device>
>>>>>>
>>>>>> On Fri, May 28, 2021 at 1:24 PM Mark Adams <[email protected]> wrote:
>>>>>> I am rebasing this branch over main and fixing it up.
>>>>>>
>>>>>> On Fri, May 28, 2021 at 1:16 PM Stefano Zampini <[email protected]> wrote:
>>>>>> Or probably remove --download-openmpi? Or, just for the moment, why
>>>>>> can't we just tell configure that mpi is a weak dependence of cuda.py,
>>>>>> so that it will be forced to be configured later?
>>>>>>
>>>>>>> On May 28, 2021, at 8:12 PM, Stefano Zampini <[email protected]> wrote:
>>>>>>>
>>>>>>> That branch provides a fix for MatSetValuesDevice, but it never got
>>>>>>> merged because of the CI issues with --download-openmpi. We can
>>>>>>> probably try to skip the test in that specific configuration?
>>>>>>>
>>>>>>>> On May 28, 2021, at 7:45 PM, Barry Smith <[email protected]> wrote:
>>>>>>>>
>>>>>>>> ~/petsc/src/mat/tutorials (barry/2021-05-28/robustify-cuda-gencodearch-check=) arch-robustify-cuda-gencodearch-check
>>>>>>>> $ ./ex5cu
>>>>>>>> terminate called after throwing an instance of 'thrust::system::system_error'
>>>>>>>>   what():  fill_n: failed to synchronize: cudaErrorIllegalAddress: an illegal memory access was encountered
>>>>>>>> Aborted (core dumped)
>>>>>>>>
>>>>>>>> requires: cuda !define(PETSC_USE_CTABLE)
>>>>>>>>
>>>>>>>> CI does not test with CUDA and no ctable. The code is still broken,
>>>>>>>> as it was six months ago in the discussion Stefano pointed to. It is
>>>>>>>> clear why: just no one has had the time to clean things up.
>>>>>>>>
>>>>>>>> Barry
>>>>>>>>
>>>>>>>>> On May 28, 2021, at 11:13 AM, Mark Adams <[email protected]> wrote:
>>>>>>>>>
>>>>>>>>> On Fri, May 28, 2021 at 11:57 AM Stefano Zampini <[email protected]> wrote:
>>>>>>>>> If you are referring to your device set values, I guess it is not currently tested.
>>>>>>>>>
>>>>>>>>> No. There is a test for that (ex5cu).
>>>>>>>>> I have a user that is getting a segv in MatSetValues with aijcusparse.
>>>>>>>>> I suspect there is memory corruption, but I'm trying to cover all the bases.
>>>>>>>>> I have added a cuda test to ksp/ex56 that works. I can do an MR for
>>>>>>>>> it if such a test does not exist.
>>>>>>>>>
>>>>>>>>> See the discussions here:
>>>>>>>>> https://gitlab.com/petsc/petsc/-/merge_requests/3411
>>>>>>>>> I started cleaning up the code to prepare for testing, but we never finished it:
>>>>>>>>> https://gitlab.com/petsc/petsc/-/commits/stefanozampini/simplify-setvalues-device/
>>>>>>>>>
>>>>>>>>>> On May 28, 2021, at 6:53 PM, Mark Adams <[email protected]> wrote:
>>>>>>>>>>
>>>>>>>>>> Is there a test with MatSetValues and CUDA?
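On Mark's question above about just telling git to use Barry's version while rebasing: a minimal sketch of two common ways to do that, assuming the branch being rebased is barry/2020-11-11/cleanup-matsetvaluesdevice and the upstream is main (both names taken from the thread; the conflicted file below is only an example). Note that during a rebase the ours/theirs labels are reversed: "theirs" is the commit being replayed from the branch, "ours" is the main side. Preferring one side wholesale can silently drop changes that came in from main, so the result still needs to be built and tested.

    # Option 1: while the rebase is stopped on a conflict, take the branch's
    # (Barry's) version of a particular file, then continue.
    git checkout barry/2020-11-11/cleanup-matsetvaluesdevice
    git rebase main
    git checkout --theirs -- include/petscmat.h   # example path, not necessarily the conflicted file
    git add include/petscmat.h
    git rebase --continue

    # Option 2: let the merge machinery prefer the branch side for every
    # conflicted hunk in one pass; non-conflicting changes from main are kept.
    git rebase -X theirs main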
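The "conflicting types for 'PetscSplitCSRDataStructure'" error quoted above is ordinary C: a typedef name cannot be redeclared with a different underlying type in the same translation unit, and after the merge petscmat.h ends up carrying both the old struct typedef and the new opaque-pointer typedef. A stripped-down sketch (the struct and typedef names mirror the error message; keeping only the pointer form is the generic C resolution, not necessarily what the branch itself does):

    /* What survives a bad conflict resolution (rejected by the compiler):
     *   typedef struct _p_SplitCSRMat  PetscSplitCSRDataStructure;    old declaration
     *   typedef struct _n_SplitCSRMat *PetscSplitCSRDataStructure;    new declaration
     * The two give the same name different types, hence
     * "error: conflicting types for 'PetscSplitCSRDataStructure'".
     *
     * Generic fix: keep exactly one of them, e.g. the opaque-handle form,
     * and make every use site consistent with it. */
    typedef struct _n_SplitCSRMat *PetscSplitCSRDataStructure;

    /* Callers can then pass the handle around without seeing the struct layout. */
    void MyDeviceSetup(PetscSplitCSRDataStructure d);   /* hypothetical prototype for illustration */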
