> On Jun 2, 2021, at 2:06 AM, Stefano Zampini <stefano.zamp...@gmail.com> wrote:
>
> Barry
>
> If I remember correctly, the ctable business was a limitation of Mark's initial implementation, which I removed here: https://gitlab.com/petsc/petsc/-/merge_requests/3411/diffs?commit_id=36173707431f5583889720868beae9046e6e1fd2. There colmap was "int" already. We made a mistake letting this branch fall so far behind main. I will add my observations in the MR.

  Thanks; it was likely the merge that brought in the incorrect colmap. Yes, it was my fault the branch got neglected: I became too emotional over the configure issue with CUDA, declared that it needed a complete rewrite, and then forgot about the branch. I have now "fixed" the CUDA configure issue in an "ok" but hacky manner: after MPI is configured and changes the compilers (thus removing all the compiler flags), I force a rerun of parts of cuda.py to add back their contributions to the CUDA compiler flags. There is a circular dependency that this change resolves.
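  To make the colmap mix-up concrete (details in my quoted message below): cudaMemcpy takes void*, so nothing catches a PetscInt*/int* mismatch at compile time. A minimal sketch of the bug pattern, with hypothetical names rather than the actual PETSc structs:

    /* sketch only: hypothetical struct and function names */
    #include <cuda_runtime.h>
    #include <petscsys.h> /* PetscInt is 64-bit with --with-64-bit-indices */

    typedef struct {
      PetscInt *colmap; /* BUG: device field declared as PetscInt* ... */
    } DeviceMatSketch;

    void CopyColmapToGPU(DeviceMatSketch *d, const int *colmap_host, int n)
    {
      /* ... but allocated and filled as int; the void* interface lets this
         compile silently.  With 64-bit indices every device read of colmap
         then sees interleaved garbage.  Declaring the field int*, to match
         the array actually copied down, is the fix described below. */
      cudaMalloc((void **)&d->colmap, n * sizeof(int));
      cudaMemcpy(d->colmap, colmap_host, n * sizeof(int), cudaMemcpyHostToDevice);
    }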
  Barry

> On Wed, Jun 2, 2021 at 09:20 Barry Smith <bsm...@petsc.dev> wrote:
>
>   Switching the thread to petsc-dev since users don't need this.
>
>   Had a nasty failure in one of the CI jobs; it turned out the GPU colmap was declared as PetscInt* but was copied down from the CPU into GPU memory as int. The power of the void* interface. Thus ex5cu.cu, which was never CI tested, was crashing very strangely. By declaring it properly as int* I got the example to run with 64-bit indices. I submitted a new pipeline to see what else pops up.
>
>   I am a bit worried about the csr2csc data structures and code: the line declaring csr2csc_i inside the struct disappeared in the rebase on main. I stuck it back in, but I don't understand that code at all. I'll report further when the CI is done or crashes. I will also need to make sure it is compiled both pre and post PETSC_PKG_CUDA_VERSION_GE(11,0,0).
>
>   I will also investigate this CTABLE business. The GPU doesn't need to know or care whether the CPU is using a CTABLE; it just needs the colmap array format, which it gets from spreading out the garray anyway, to be copied to the GPU. So PETSc can likely be built with CTABLES as normal; this will improve the testing a lot.
>
>   Barry
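  The "spreading out" of garray amounts to something like the following sketch (illustrative names, not the actual MatMPIAIJ code): regardless of whether the host keeps colmap as a CTABLE hash table or a dense array, the copy sent to the device can always be a dense int array built from garray.

    /* sketch: build the dense device-format colmap from garray,
       independent of PETSC_USE_CTABLE on the host */
    #include <stdlib.h>
    #include <petscsys.h>

    static int *BuildDeviceColmap(const PetscInt *garray, int nghost, PetscInt Ncols)
    {
      int *colmap = (int *)calloc((size_t)Ncols, sizeof(int));
      /* garray[i] is the global index of the i-th off-diagonal (ghost) column;
         store local index + 1 so that 0 means "column not present" */
      for (int i = 0; i < nghost; i++) colmap[garray[i]] = i + 1;
      return colmap; /* this dense int array is what gets cudaMemcpy'd to the GPU */
    }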
>> On May 30, 2021, at 11:54 AM, Barry Smith <bsm...@petsc.dev> wrote:
>>
>>   I believe I have finally successfully rebased the branch barry/2020-11-11/cleanup-matsetvaluesdevice against main and cleaned up all the issues. Please read the commit message.
>>
>>   I have submitted a CI pipeline with ctables turned off temporarily for testing of MatSetValuesDevice(). If it works, hopefully Mark can run a few additional tests of his Landau code that are not in the usual testing to verify, and we can finally get the branch into main.
>>
>>   Mark,
>>
>>   Since this change is involved, it is likely your Landau mass matrix branch may not rebase cleanly. Let me know if you would like me to do the rebase and testing of your Landau mass matrix branch. I can get it ready to work with the results of barry/2020-11-11/cleanup-matsetvaluesdevice and then hand it back to you for further development.
>>
>>   Barry
>>
>>> On May 29, 2021, at 11:32 AM, Barry Smith <bsm...@petsc.dev> wrote:
>>>
>>>   I am working away on this branch, making some progress, and also cleaning things up with some small simplifications. I hope I can succeed: a bunch of stuff got moved around, some structs changed, and the merge could not handle some of this, so I have to do a good amount of code wrangling to fix it.
>>>
>>>   I'll let you know as I progress.
>>>
>>>   Barry
>>>
>>>> On May 28, 2021, at 10:53 PM, Barry Smith <bsm...@petsc.dev> wrote:
>>>>
>>>>   I have rebased and tried to fix everything. I am now fixing the issues with --download-openmpi and cuda; once that is done I will test, rebase with main again if needed, restart the MR, and get it into main.
>>>>
>>>>   Barry
>>>>
>>>>   I was stupid to let the MR lie fallow; I should have figured out a solution to the openmpi and cuda issue instead of punting and waiting for a dream fix.
>>>>
>>>>> On May 28, 2021, at 2:39 PM, Mark Adams <mfad...@lbl.gov> wrote:
>>>>>
>>>>> Thanks,
>>>>>
>>>>> I did not intend to make any (real) changes. The only thing that I did not intend to use from Barry's branch, which conflicted, was the help and comment block at the top of ex5cu.cu.
>>>>>
>>>>> * I ended up with two declarations of PetscSplitCSRDataStructure.
>>>>> * I added some includes to fix errors like this:
>>>>>     /ccs/home/adams/petsc/include/../src/mat/impls/aij/seq/seqcusparse/cusparsematimpl.h(263): error: incomplete type is not allowed
>>>>> * I ended up not having csr2csc_i in Mat_SeqAIJCUSPARSE, so I get:
>>>>>     /autofs/nccs-svm1_home1/adams/petsc/src/mat/impls/aij/seq/seqcusparse/aijcusparse.cu(1348): error: class "Mat_SeqAIJCUSPARSE" has no member "csr2csc_i"
>>>>>
>>>>> On Fri, May 28, 2021 at 3:13 PM Stefano Zampini <stefano.zamp...@gmail.com> wrote:
>>>>> I can take a quick look at it tomorrow; what are the main changes you made since then?
>>>>>
>>>>>> On May 28, 2021, at 9:51 PM, Mark Adams <mfad...@lbl.gov> wrote:
>>>>>>
>>>>>> I am getting messed up trying to resolve conflicts in rebasing over main. Is there a better way of doing this? Can I just tell git to use Barry's version and then test it? Or should I just try it again?
>>>>>>
>>>>>> On Fri, May 28, 2021 at 2:15 PM Mark Adams <mfad...@lbl.gov> wrote:
>>>>>> I am rebasing over main and it's a bit of a mess. I must have missed something. I get the error below; I think the _n_SplitCSRMat must be wrong.
>>>>>>
>>>>>> In file included from /autofs/nccs-svm1_home1/adams/petsc/src/vec/is/sf/impls/basic/sfbasic.c:128:0:
>>>>>> /ccs/home/adams/petsc/include/petscmat.h:1976:32: error: conflicting types for 'PetscSplitCSRDataStructure'
>>>>>>  typedef struct _n_SplitCSRMat *PetscSplitCSRDataStructure;
>>>>>>                                 ^~~~~~~~~~~~~~~~~~~~~~~~~~
>>>>>> /ccs/home/adams/petsc/include/petscmat.h:1922:31: note: previous declaration of 'PetscSplitCSRDataStructure' was here
>>>>>>  typedef struct _p_SplitCSRMat PetscSplitCSRDataStructure;
>>>>>>                                ^~~~~~~~~~~~~~~~~~~~~~~~~~
>>>>>>           CC arch-summit-opt-gnu-cuda/obj/vec/vec/impls/seq/dvec2.o
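  The compiler is enforcing C's rule that a typedef name may be redeclared only with an identical type. A reduced illustration of the conflict (not the actual petscmat.h):

    /* reduced illustration, not the actual header */
    struct _n_SplitCSRMat;
    typedef struct _n_SplitCSRMat *PetscSplitCSRDataStructure; /* pointer typedef */

    /* If a second, different declaration also survives the rebase, e.g.
         typedef struct _p_SplitCSRMat PetscSplitCSRDataStructure;
       it names a different type (a struct, not a pointer to one), so the
       compiler reports "conflicting types".  The fix is to delete one of
       the two so exactly one declaration of the name remains. */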
>>>>>> On Fri, May 28, 2021 at 1:50 PM Stefano Zampini <stefano.zamp...@gmail.com> wrote:
>>>>>> OpenMPI.py depends on cuda.py in that, if cuda is present, it configures using cuda. MPI.py and MPICH.py do not depend on cuda.py (MPICH only weakly: it adds a print if cuda is present). Since eventually the MPI distro will only need a hint to be configured with CUDA, why not remove the dependency altogether and add only a flag --download-openmpi-use-cuda?
>>>>>>
>>>>>>> On May 28, 2021, at 8:44 PM, Barry Smith <bsm...@petsc.dev> wrote:
>>>>>>>
>>>>>>>   Stefano, who has a far better memory than me, wrote:
>>>>>>>
>>>>>>>   > Or probably remove --download-openmpi? Or, just for the moment, why can't we just tell configure that mpi is a weak dependence of cuda.py, so that it will be forced to be configured later?
>>>>>>>
>>>>>>>   MPI.py depends on cuda.py, so we cannot also have cuda.py depend on MPI.py using the generic dependencies of configure/packages,
>>>>>>>
>>>>>>>   but perhaps we can just hardwire the rerunning of cuda.py when the MPI compilers are reset. I will try that now, and if I can get it to work we should be able to move those old fix branches along as MRs.
>>>>>>>
>>>>>>>   Barry
>>>>>>>
>>>>>>>> On May 28, 2021, at 12:41 PM, Mark Adams <mfad...@lbl.gov> wrote:
>>>>>>>>
>>>>>>>> OK, I will try to rebase and test Barry's branch.
>>>>>>>>
>>>>>>>> On Fri, May 28, 2021 at 1:26 PM Stefano Zampini <stefano.zamp...@gmail.com> wrote:
>>>>>>>> Yes, it is the branch I was using before force pushing to Barry's barry/2020-11-11/cleanup-matsetvaluesdevice. You can use either, I guess.
>>>>>>>>
>>>>>>>>> On May 28, 2021, at 8:25 PM, Mark Adams <mfad...@lbl.gov> wrote:
>>>>>>>>>
>>>>>>>>> Is this the correct branch? It conflicted with ex5cu, so I assume it is.
>>>>>>>>>
>>>>>>>>> stefanozampini/simplify-setvalues-device <https://gitlab.com/petsc/petsc/-/tree/stefanozampini/simplify-setvalues-device>
>>>>>>>>>
>>>>>>>>> On Fri, May 28, 2021 at 1:24 PM Mark Adams <mfad...@lbl.gov> wrote:
>>>>>>>>> I am fixing up the rebase of this branch over main.
>>>>>>>>>
>>>>>>>>> On Fri, May 28, 2021 at 1:16 PM Stefano Zampini <stefano.zamp...@gmail.com> wrote:
>>>>>>>>> Or probably remove --download-openmpi? Or, just for the moment, why can't we just tell configure that mpi is a weak dependence of cuda.py, so that it will be forced to be configured later?
>>>>>>>>>
>>>>>>>>>> On May 28, 2021, at 8:12 PM, Stefano Zampini <stefano.zamp...@gmail.com> wrote:
>>>>>>>>>>
>>>>>>>>>> That branch provides a fix for MatSetValuesDevice, but it never got merged because of the CI issues with --download-openmpi. We can probably try to skip the test in that specific configuration?
>>>>>>>>>>
>>>>>>>>>>> On May 28, 2021, at 7:45 PM, Barry Smith <bsm...@petsc.dev> wrote:
>>>>>>>>>>>
>>>>>>>>>>> ~/petsc/src/mat/tutorials (barry/2021-05-28/robustify-cuda-gencodearch-check=) arch-robustify-cuda-gencodearch-check
>>>>>>>>>>> $ ./ex5cu
>>>>>>>>>>> terminate called after throwing an instance of 'thrust::system::system_error'
>>>>>>>>>>>   what():  fill_n: failed to synchronize: cudaErrorIllegalAddress: an illegal memory access was encountered
>>>>>>>>>>> Aborted (core dumped)
>>>>>>>>>>>
>>>>>>>>>>> The test is marked: requires: cuda !define(PETSC_USE_CTABLE)
>>>>>>>>>>>
>>>>>>>>>>> CI does not test with CUDA and no ctable, so the code is still broken, as it was six months ago in the discussion Stefano pointed to. It is clear why; it's just that no one has had the time to clean things up.
>>>>>>>>>>>
>>>>>>>>>>> Barry
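  For context, ex5cu assembles the matrix from inside a CUDA kernel. A rough sketch of the pattern it exercises follows; this is from memory of the branch under discussion, so treat the header name, MatCUSPARSEGetDeviceMatWrite, MatSetValuesDevice, and their exact signatures as approximations rather than the settled API:

    /* rough sketch of the ex5cu pattern; names and signatures approximate */
    #include <petscaijdevice.h> /* assumed header for the device-side API */

    __global__ void InsertDiagonal(PetscSplitCSRDataStructure d_mat, PetscInt rstart, PetscInt rend)
    {
      PetscInt    i = rstart + blockIdx.x * blockDim.x + threadIdx.x;
      PetscScalar v = 1.0;
      if (i < rend) (void)MatSetValuesDevice(d_mat, 1, &i, 1, &i, &v, ADD_VALUES); /* insertion on the GPU */
    }

    /* host side (sketch): get the device view of the AIJCUSPARSE matrix,
       launch the kernel, then assemble as usual:

         PetscSplitCSRDataStructure d_mat;
         MatCUSPARSEGetDeviceMatWrite(A, &d_mat);
         InsertDiagonal<<<blocks, threads>>>(d_mat, rstart, rend);
         MatAssemblyBegin(A, MAT_FINAL_ASSEMBLY);
         MatAssemblyEnd(A, MAT_FINAL_ASSEMBLY);
    */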
>>>>>>>>>>>
>>>>>>>>>>>> On May 28, 2021, at 11:13 AM, Mark Adams <mfad...@lbl.gov> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>> On Fri, May 28, 2021 at 11:57 AM Stefano Zampini <stefano.zamp...@gmail.com> wrote:
>>>>>>>>>>>> If you are referring to your device set values, I guess it is not currently tested.
>>>>>>>>>>>>
>>>>>>>>>>>> No, there is a test for that (ex5cu). I have a user who is getting a segv in MatSetValues with aijcusparse. I suspect memory corruption, but I'm trying to cover all the bases. I have added a cuda test to ksp/ex56 that works; I can do an MR for it if such a test does not exist.
>>>>>>>>>>>>
>>>>>>>>>>>> See the discussions here: https://gitlab.com/petsc/petsc/-/merge_requests/3411. I started cleaning up the code to prepare for testing, but we never finished it: https://gitlab.com/petsc/petsc/-/commits/stefanozampini/simplify-setvalues-device/
>>>>>>>>>>>>
>>>>>>>>>>>>> On May 28, 2021, at 6:53 PM, Mark Adams <mfad...@lbl.gov> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>> Is there a test with MatSetValues and CUDA?
>
> --
> Stefano