It looks like MatAssemblyEnd is not setting up correctly in parallel. SEGV here (trace below). I'll take a look at what Stefano did.
#18 main () (at 0x00000000100019a8)
#17 MatMult (mat=0x2155a750, x=0x56dc29c0, y=0x5937b190) at /autofs/nccs-svm1_home1/adams/petsc/src/mat/interface/matrix.c:2448 (at 0x00002000005f4858)
#16 MatMult_MPIAIJCUSPARSE(_p_Mat*, _p_Vec*, _p_Vec*) () from /ccs/home/adams/petsc/arch-summit-opt64-gnu-cuda/lib/libpetsc.so.3.15 (at 0x000020000095e298)
#15 VecScatterBegin (sf=0x5937fbf0, x=0x56dc29c0, y=0x5937cd20, addv=<optimized out>, mode=<optimized out>) at /autofs/nccs-svm1_home1/adams/petsc/src/vec/is/sf/interface/vscat.c:1345 (at 0x00002000003a44fc)
#14 VecScatterBegin_Internal (sf=0x5937fbf0, x=0x56dc29c0, y=0x5937cd20, addv=INSERT_VALUES, mode=SCATTER_FORWARD) at /autofs/nccs-svm1_home1/adams/petsc/src/vec/is/sf/interface/vscat.c:72 (at 0x000020000039e9cc)
#13 PetscSFBcastWithMemTypeBegin (sf=0x5937fbf0, unit=0x200024529ed0, rootmtype=<optimized out>, rootdata=0x200076ea1e00, leafmtype=<optimized out>, leafdata=0x200076ea2200, op=0x200024539c70) at /autofs/nccs-svm1_home1/adams/petsc/src/vec/is/sf/interface/sf.c:1493 (at 0x0000200000396f04)
#12 PetscSFBcastBegin_Basic (sf=0x5937fbf0, unit=<optimized out>, rootmtype=<optimized out>, rootdata=0x200076ea1e00, leafmtype=<optimized out>, leafdata=0x200076ea2200, op=0x200024539c70) at /autofs/nccs-svm1_home1/adams/petsc/src/vec/is/sf/impls/basic/sfbasic.c:191 (at 0x00002000002de188)
#11 PetscSFLinkStartCommunication (direction=PETSCSF_ROOT2LEAF, link=<optimized out>, sf=0x5937fbf0) at /ccs/home/adams/petsc/include/../src/vec/is/sf/impls/basic/sfpack.h:267 (at 0x00002000002de188)
#10 PetscSFLinkStartRequests_MPI (sf=<optimized out>, link=0x5937f080, direction=<optimized out>) at /autofs/nccs-svm1_home1/adams/petsc/src/vec/is/sf/impls/basic/sfmpi.c:41 (at 0x00002000003850dc)
#9 PMPI_Startall () from /autofs/nccs-svm1_sw/summit/.swci/1-compute/opt/spack/20180914/linux-rhel7-ppc64le/gcc-6.4.0/spectrum-mpi-10.3.1.2-20200121-awz2q5brde7wgdqqw4ugalrkukeub4eb/container/../lib/libmpi_ibm.so.3 (at 0x0000200024493d98)
#8 mca_pml_pami_start () from /autofs/nccs-svm1_sw/summit/.swci/1-compute/opt/spack/20180914/linux-rhel7-ppc64le/gcc-6.4.0/spectrum-mpi-10.3.1.2-20200121-awz2q5brde7wgdqqw4ugalrkukeub4eb/container/../lib/spectrum_mpi/mca_pml_pami.so (at 0x00002000301ce6e0)
#7 pml_pami_persis_send_start () from /autofs/nccs-svm1_sw/summit/.swci/1-compute/opt/spack/20180914/linux-rhel7-ppc64le/gcc-6.4.0/spectrum-mpi-10.3.1.2-20200121-awz2q5brde7wgdqqw4ugalrkukeub4eb/container/../lib/spectrum_mpi/mca_pml_pami.so (at 0x00002000301ce29c)
#6 pml_pami_send () from /autofs/nccs-svm1_sw/summit/.swci/1-compute/opt/spack/20180914/linux-rhel7-ppc64le/gcc-6.4.0/spectrum-mpi-10.3.1.2-20200121-awz2q5brde7wgdqqw4ugalrkukeub4eb/container/../lib/spectrum_mpi/mca_pml_pami.so (at 0x00002000301cf69c)
#5 PAMI_Send_immediate () from /autofs/nccs-svm1_sw/summit/.swci/1-compute/opt/spack/20180914/linux-rhel7-ppc64le/gcc-6.4.0/spectrum-mpi-10.3.1.2-20200121-awz2q5brde7wgdqqw4ugalrkukeub4eb/container/../lib/pami_port/libpami.so.3 (at 0x0000200030395814)
#4 PAMI::Protocol::Send::Eager<PAMI::Device::Shmem::PacketModel<PAMI::Device::ShmemDevice<PAMI::Fifo::WrapFifo<PAMI::Fifo::FifoPacket<64u, 4096u>, PAMI::Counter::IndirectBounded<PAMI::Atomic::NativeAtomic>, 256u>, PAMI::Counter::Indirect<PAMI::Counter::Native>, PAMI::Device::Shmem::CMAShaddr, 256u, 512u> >, PAMI::Device::IBV::PacketModel<PAMI::Device::IBV::Device, true> >::EagerImpl<(PAMI::Protocol::Send::configuration_t)5, true>::immediate(pami_send_immediate_t*) () from /autofs/nccs-svm1_sw/summit/.swci/1-compute/opt/spack/20180914/linux-rhel7-ppc64le/gcc-6.4.0/spectrum-mpi-10.3.1.2-20200121-awz2q5brde7wgdqqw4ugalrkukeub4eb/container/../lib/pami_port/libpami.so.3 (at 0x0000200030457bac)
#3 PAMI::Protocol::Send::EagerSimple<PAMI::Device::Shmem::PacketModel<PAMI::Device::ShmemDevice<PAMI::Fifo::WrapFifo<PAMI::Fifo::FifoPacket<64u, 4096u>, PAMI::Counter::IndirectBounded<PAMI::Atomic::NativeAtomic>, 256u>, PAMI::Counter::Indirect<PAMI::Counter::Native>, PAMI::Device::Shmem::CMAShaddr, 256u, 512u> >, (PAMI::Protocol::Send::configuration_t)5>::immediate_impl(pami_send_immediate_t*) () from /autofs/nccs-svm1_sw/summit/.swci/1-compute/opt/spack/20180914/linux-rhel7-ppc64le/gcc-6.4.0/spectrum-mpi-10.3.1.2-20200121-awz2q5brde7wgdqqw4ugalrkukeub4eb/container/../lib/pami_port/libpami.so.3 (at 0x0000200030457824)
#2 bool PAMI::Device::Interface::PacketModel<PAMI::Device::Shmem::PacketModel<PAMI::Device::ShmemDevice<PAMI::Fifo::WrapFifo<PAMI::Fifo::FifoPacket<64u, 4096u>, PAMI::Counter::IndirectBounded<PAMI::Atomic::NativeAtomic>, 256u>, PAMI::Counter::Indirect<PAMI::Counter::Native>, PAMI::Device::Shmem::CMAShaddr, 256u, 512u> > >::postPacket<2u>(unsigned long, unsigned long, void*, unsigned long, iovec (&) [2u]) () from /autofs/nccs-svm1_sw/summit/.swci/1-compute/opt/spack/20180914/linux-rhel7-ppc64le/gcc-6.4.0/spectrum-mpi-10.3.1.2-20200121-awz2q5brde7wgdqqw4ugalrkukeub4eb/container/../lib/pami_port/libpami.so.3 (at 0x0000200030456c18)
#1 PAMI::Device::Shmem::Packet<PAMI::Fifo::FifoPacket<64u, 4096u> >::writePayload(PAMI::Fifo::FifoPacket<64u, 4096u>&, iovec*, unsigned long) () from /autofs/nccs-svm1_sw/summit/.swci/1-compute/opt/spack/20180914/linux-rhel7-ppc64le/gcc-6.4.0/spectrum-mpi-10.3.1.2-20200121-awz2q5brde7wgdqqw4ugalrkukeub4eb/container/../lib/pami_port/libpami.so.3 (at 0x0000200030435a7c)
#0 __memcpy_power7 () from /lib64/libc.so.6 (at 0x000020002463b804)

On Fri, May 28, 2021 at 12:45 PM Barry Smith <[email protected]> wrote:

>
> ~/petsc/src/mat/tutorials (barry/2021-05-28/robustify-cuda-gencodearch-check=) arch-robustify-cuda-gencodearch-check
> $ ./ex5cu
> terminate called after throwing an instance of 'thrust::system::system_error'
>   what(): fill_n: failed to synchronize: cudaErrorIllegalAddress: an illegal memory access was encountered
> Aborted (core dumped)
>
> requires: cuda !define(PETSC_USE_CTABLE)
>
> CI does not test with CUDA and no ctable. The code is still broken, as it was six months ago
> in the discussion Stefano pointed to. It is clear why; just no one has had the time to clean
> things up.
>
>   Barry
>
>
> On May 28, 2021, at 11:13 AM, Mark Adams <[email protected]> wrote:
>
> On Fri, May 28, 2021 at 11:57 AM Stefano Zampini <[email protected]> wrote:
>
>> If you are referring to your device set values, I guess it is not currently tested
>>
>
> No. There is a test for that (ex5cu).
> I have a user that is getting a segv in MatSetValues with aijcusparse. I suspect there is
> memory corruption, but I'm trying to cover all the bases. I have added a cuda test to
> ksp/ex56 that works. I can do an MR for it if such a test does not exist.
>
>> See the discussion here:
>> https://gitlab.com/petsc/petsc/-/merge_requests/3411
>> I started cleaning up the code to prepare for testing, but we never finished it:
>> https://gitlab.com/petsc/petsc/-/commits/stefanozampini/simplify-setvalues-device/
>>
>> On May 28, 2021, at 6:53 PM, Mark Adams <[email protected]> wrote:
>>
>> Is there a test with MatSetValues and CUDA?
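For anyone following along, below is a minimal, purely illustrative sketch (not ex5cu and not the ksp/ex56 addition mentioned above) of the path being discussed: MatSetValues into a matrix run as aijcusparse, MatAssemblyBegin/End, then MatMult, which is where the trace above dies inside VecScatterBegin/PetscSF. The matrix size, stencil, and test block are assumptions made up for the example.

/* Illustrative sketch only: a tiny 1D Laplacian assembled with MatSetValues
 * and applied with MatMult.  Run with -mat_type aijcusparse on a CUDA build
 * to exercise the same MatMult -> VecScatterBegin -> PetscSF path shown in
 * the trace above.  Sizes and the stencil are made up. */
#include <petscmat.h>

int main(int argc, char **argv)
{
  Mat            A;
  Vec            x, y;
  PetscInt       i, col, rstart, rend, N = 10;
  PetscScalar    v;
  PetscErrorCode ierr;

  ierr = PetscInitialize(&argc, &argv, NULL, NULL); if (ierr) return ierr;

  ierr = MatCreate(PETSC_COMM_WORLD, &A);CHKERRQ(ierr);
  ierr = MatSetSizes(A, PETSC_DECIDE, PETSC_DECIDE, N, N);CHKERRQ(ierr);
  ierr = MatSetType(A, MATAIJ);CHKERRQ(ierr);
  ierr = MatSetFromOptions(A);CHKERRQ(ierr); /* -mat_type aijcusparse selects the CUDA type */
  ierr = MatSetUp(A);CHKERRQ(ierr);          /* default preallocation; fine for a toy problem */

  /* each rank inserts the rows it owns */
  ierr = MatGetOwnershipRange(A, &rstart, &rend);CHKERRQ(ierr);
  for (i = rstart; i < rend; i++) {
    v = 2.0;
    ierr = MatSetValues(A, 1, &i, 1, &i, &v, INSERT_VALUES);CHKERRQ(ierr);
    if (i > 0)     { col = i - 1; v = -1.0; ierr = MatSetValues(A, 1, &i, 1, &col, &v, INSERT_VALUES);CHKERRQ(ierr); }
    if (i < N - 1) { col = i + 1; v = -1.0; ierr = MatSetValues(A, 1, &i, 1, &col, &v, INSERT_VALUES);CHKERRQ(ierr); }
  }
  ierr = MatAssemblyBegin(A, MAT_FINAL_ASSEMBLY);CHKERRQ(ierr);
  ierr = MatAssemblyEnd(A, MAT_FINAL_ASSEMBLY);CHKERRQ(ierr);

  /* MatMult uses the off-process scatter (PetscSF) built during assembly */
  ierr = MatCreateVecs(A, &x, &y);CHKERRQ(ierr);
  ierr = VecSet(x, 1.0);CHKERRQ(ierr);
  ierr = MatMult(A, x, y);CHKERRQ(ierr);

  ierr = VecDestroy(&x);CHKERRQ(ierr);
  ierr = VecDestroy(&y);CHKERRQ(ierr);
  ierr = MatDestroy(&A);CHKERRQ(ierr);
  ierr = PetscFinalize();
  return ierr;
}

/*TEST
   test:
      nsize: 2
      requires: cuda
      args: -mat_type aijcusparse
TEST*/

The requires: cuda directive in the test block is the same harness mechanism as the requires: cuda !define(PETSC_USE_CTABLE) line Barry quotes; ex5cu's extra !define(PETSC_USE_CTABLE) restriction is why, as he notes, no CI configuration currently runs it.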
