Try without Valgrind: put a CHKMEMQ; just before the call to LandauKokkosJacobian and another as the first line inside that function, then run with -malloc_debug. This is not the best way to find memory corruption, but it may be more useful in this case.
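To make the suggestion concrete, here is a minimal standalone C sketch of why bracketing the call with checks helps. It does not use PETSc: GuardedBuf, guard_intact, callee and first_failing_check are hypothetical names made up for illustration, and the single guard word is only a stand-in for the validation CHKMEMQ performs on PETSc's tracked allocations when malloc debugging is on. Whichever check fires first tells you whether memory was already corrupt before the call or was damaged inside it.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define GUARD_WORD 0xDEADBEEFu

/* Hypothetical tracked allocation: payload followed by one guard word. */
typedef struct {
  unsigned char *data;   /* start of the payload */
  size_t         nbytes; /* payload size, not counting the guard */
} GuardedBuf;

static int guarded_alloc(GuardedBuf *b, size_t nbytes)
{
  assert(b != NULL);
  unsigned guard = GUARD_WORD;
  b->data = malloc(nbytes + sizeof guard);
  if (!b->data) return -1;
  b->nbytes = nbytes;
  memset(b->data, 0, nbytes);
  memcpy(b->data + nbytes, &guard, sizeof guard); /* plant the guard */
  return 0;
}

/* Stand-in for a CHKMEMQ-style check: returns 1 if the guard is intact. */
static int guard_intact(const GuardedBuf *b, const char *where)
{
  unsigned guard;
  memcpy(&guard, b->data + b->nbytes, sizeof guard);
  if (guard != GUARD_WORD) {
    fprintf(stderr, "guard overwritten, detected %s\n", where);
    return 0;
  }
  return 1;
}

/* Writes n bytes into the payload; n > b->nbytes tramples the guard. */
static void fill(GuardedBuf *b, size_t n)
{
  memset(b->data, 0xFF, n);
}

/* The function under suspicion, with a check as its first line. */
static int callee(GuardedBuf *b, size_t n)
{
  if (!guard_intact(b, "at callee entry")) return 2;
  fill(b, n);
  if (!guard_intact(b, "after callee body")) return 3;
  return 0;
}

/* Returns the number of the first check that fires (0 if none, -1 on
 * allocation failure): 1 = before the call, 2 = callee entry,
 * 3 = after the callee body. */
int first_failing_check(size_t setup_bytes, size_t callee_bytes)
{
  GuardedBuf b;
  int        result;

  if (guarded_alloc(&b, 8)) return -1;
  fill(&b, setup_bytes);                 /* code that runs before the call */
  if (!guard_intact(&b, "before the call")) result = 1;
  else result = callee(&b, callee_bytes);
  free(b.data);
  return result;
}
```

With an 8-byte payload, first_failing_check(8, 8) returns 0, first_failing_check(9, 8) returns 1 (memory was already corrupt before the call), and first_failing_check(8, 9) returns 3 (the callee did the damage). In the real code, if the CHKMEMQ before the call passes and the one on the first line of LandauKokkosJacobian is never reached, that would suggest the problem is in the call itself rather than in earlier heap corruption.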
> On May 29, 2021, at 10:46 PM, Junchao Zhang <junchao.zh...@gmail.com> wrote:
>
> try gcc/6.4.0
> --Junchao Zhang
>
>
> On Sat, May 29, 2021 at 9:50 PM Mark Adams <mfad...@lbl.gov> wrote:
> And I get grief using gcc-8.1.1, with this error:
>
> /autofs/nccs-svm1_sw/summit/gcc/8.1.1/include/c++/8.1.1/type_traits(347): error: identifier "__ieee128" is undefined
>
> Any ideas?
>
> On Sat, May 29, 2021 at 10:39 PM Mark Adams <mfad...@lbl.gov> wrote:
> And valgrind sees this. I think the jump to the function is going to the wrong place.
> I'm giving up on PGI but can try newer versions of GCC. (What is the deal with the range of major releases, 4-10?)
> (As I said, this looks like an error that a user is getting, so I'd like to figure it out.)
>
>   0 SNES Function norm 4.974994975313e-03
> ==77820== Invalid read of size 4
> ==77820==    at 0x7E69068: LandauKokkosJacobian (in /gpfs/alpine/csc314/scratch/adams/petsc/arch-summit-opt-gnu-kokkos-notpl-cuda10/lib/libpetsc.so.3.015.0)
> ==77820==    by 0x7C598AF: LandauFormJacobian_Internal (plexland.c:212)
> ==77820==    by 0x7C728D3: LandauIJacobian (plexland.c:2107)
> ==77820==    by 0x7C8C26B: TSComputeIJacobian (ts.c:934)
> ==77820==    by 0x7E28337: SNESTSFormJacobian_Theta (theta.c:1007)
> ==77820==    by 0x7CBBFD3: SNESTSFormJacobian (ts.c:4415)
> ==77820==    by 0x7AD84BF: SNESComputeJacobian (snes.c:2824)
> ==77820==    by 0x7BA945B: SNESSolve_NEWTONLS (ls.c:222)
> ==77820==    by 0x7AF336F: SNESSolve (snes.c:4769)
> ==77820==    by 0x7E19D13: TSTheta_SNESSolve (theta.c:185)
> ==77820==    by 0x7E1A8B7: TSStep_Theta (theta.c:223)
> ==77820==    by 0x7CB093F: TSStep (ts.c:3571)
> ==77820==  Address 0x96fff690 is in a --- anonymous segment
> ==77820==
> [0]PETSC ERROR: ------------------------------------------------------------------------
> [0]PETSC ERROR: Caught signal number 11 SEGV: Segmentation Violation, probably memory access out of range
> [0]PETSC ERROR: Try option -start_in_debugger or -on_error_attach_debugger
> [0]PETSC ERROR: or see https://www.mcs.anl.gov/petsc/documentation/faq.html#valgrind
> [0]PETSC ERROR: or try http://valgrind.org on GNU/linux and Apple Mac OS X to find memory corruption errors
> [0]PETSC ERROR: likely location of problem given in stack below
> [0]PETSC ERROR: ---------------------  Stack Frames ------------------------------------
> [0]PETSC ERROR: The EXACT line numbers in the error traceback are not available.
> [0]PETSC ERROR: instead the line number of the start of the function is given.
> [0]PETSC ERROR: #1 LandauKokkosJacobian() at /gpfs/alpine/csc314/scratch/adams/petsc/src/ts/utils/dmplexlandau/kokkos/landau.kokkos.cxx:272
>
> On Sat, May 29, 2021 at 8:46 PM Mark Adams <mfad...@lbl.gov> wrote:
>
>
> On Sat, May 29, 2021 at 7:48 PM Barry Smith <bsm...@petsc.dev> wrote:
>
> I don't see why it is not running the Kokkos check. Here is the rule right below the CUDA rule that is apparently running.
>
> check_build:
>     -@echo "Running check examples to verify correct installation"
>     -@echo "Using PETSC_DIR=${PETSC_DIR} and PETSC_ARCH=${PETSC_ARCH}"
>     +@cd src/snes/tutorials >/dev/null; ${OMAKE_SELF} PETSC_ARCH=${PETSC_ARCH} PETSC_DIR=${PETSC_DIR} clean-legacy
>     +@cd src/snes/tutorials >/dev/null; ${OMAKE_SELF} PETSC_ARCH=${PETSC_ARCH} PETSC_DIR=${PETSC_DIR} testex19
>     +@if [ "${HYPRE_LIB}" != "" ] && [ "${PETSC_WITH_BATCH}" = "" ] && [ "${PETSC_SCALAR}" = "real" ]; then \
>         cd src/snes/tutorials >/dev/null; ${OMAKE_SELF} PETSC_ARCH=${PETSC_ARCH} PETSC_DIR=${PETSC_DIR} DIFF=${PETSC_DIR}/lib/petsc/bin/petscdiff runex19_hypre; \
>     fi;
>     +@if [ "${CUDA_LIB}" != "" ] && [ "${PETSC_WITH_BATCH}" = "" ] && [ "${PETSC_SCALAR}" = "real" ]; then \
>         cd src/snes/tutorials >/dev/null; ${OMAKE_SELF} PETSC_ARCH=${PETSC_ARCH} PETSC_DIR=${PETSC_DIR} DIFF=${PETSC_DIR}/lib/petsc/bin/petscdiff runex19_cuda; \
>     fi;
>     +@if [ "${KOKKOS_KERNELS_LIB}" != "" ] && [ "${PETSC_WITH_BATCH}" = "" ] && [ "${PETSC_SCALAR}" = "real" ] && [ "${PETSC_PRECISION}" = "double" ] && [ "${MPI_IS_MPIUNI}" = "0" ]; then \
>         cd src/snes/tutorials >/dev/null; ${OMAKE_SELF} PETSC_ARCH=${PETSC_ARCH} PETSC_DIR=${PETSC_DIR} DIFF=${PETSC_DIR}/lib/petsc/bin/petscdiff runex3k_kokkos; \
>     fi;
>
> Regarding the debugging: if it is just one MPI rank (or even more), GDB will trap the error and show the exact line of source code where the error occurred, and you can poke around at variables to see if they look corrupt or wrong (for example, a crazy address in a pointer). I don't know why your debugger is not giving more useful information.
>
>
> This is what I did (in DDT). It stopped at the function call and the data looked fine. I stepped into the call, but didn't get to it. The signal handler was called and I was dead.
> Maybe I did something in my branch. Can't see what, but I'll keep probing.
> Thanks,
>
> Barry
>
>
> > On May 29, 2021, at 2:16 PM, Mark Adams <mfad...@lbl.gov> wrote:
> >
> > I am running on Summit with Kokkos-CUDA and I am getting a segv that looks like some sort of a compile/link mismatch. I also have a user with a C++ code that is getting strange segvs when calling MatSetValues with CUDA (I know MatSetValues is not a cuSPARSE method, but that is the report that I have). I have no idea if these are related but they both involve C -- C++ calls ...
> >
> > I started with a clean build (attached) and I ran in DDT. DDT stopped at the call in plexland.c to the Kokkos Landau operator. I stepped into this function and then took this screenshot of the stack, with the Kokkos call and PETSc signal handler.
> >
> > Make check does not seem to be running Kokkos tests:
> >
> > 15:02 adams/landau-mass-opt *= /gpfs/alpine/csc314/scratch/adams/petsc$ make PETSC_DIR=/gpfs/alpine/csc314/scratch/adams/petsc PETSC_ARCH=arch-summit-opt-gnu-kokkos-notpl-cuda10 check
> > Running check examples to verify correct installation
> > Using PETSC_DIR=/gpfs/alpine/csc314/scratch/adams/petsc and PETSC_ARCH=arch-summit-opt-gnu-kokkos-notpl-cuda10
> > C/C++ example src/snes/tutorials/ex19 run successfully with 1 MPI process
> > C/C++ example src/snes/tutorials/ex19 run successfully with 2 MPI processes
> > C/C++ example src/snes/tutorials/ex19 run successfully with cuda
> > Completed test examples
> >
> > Also, I ran this AM with another branch that had not been rebased with main as recently as this branch (adams/landau-mass-opt).
> >
> > Any ideas?
> > <make.log><configure.log><Screen Shot 2021-05-29 at 2.51.00 PM.png>