On Sat, May 29, 2021 at 11:46 PM Junchao Zhang <junchao.zh...@gmail.com> wrote:
> try gcc/6.4.0
>

6.4.0 is the default and what I've been using. 6.4.0 builds and has worked, but I am now getting this segv (valgrind trace below) in adams/landau-mass-opt. My thinking is to try other versions.

> --Junchao Zhang
>
>
> On Sat, May 29, 2021 at 9:50 PM Mark Adams <mfad...@lbl.gov> wrote:
>
>> And I tried using gcc-8.1.1 and get this error:
>>
>> /autofs/nccs-svm1_sw/summit/gcc/8.1.1/include/c++/8.1.1/type_traits(347): error: identifier "__ieee128" is undefined
>>
>> Any ideas?
>>
>> On Sat, May 29, 2021 at 10:39 PM Mark Adams <mfad...@lbl.gov> wrote:
>>
>>> And valgrind sees this. I think the jump to the function is going to
>>> the wrong place.
>>> I'm giving up on PGI but can try newer versions of GCC. (what is the
>>> deal with the range of major releases, 4-10?)
>>> (as I said this looks like an error that a user is getting so I'd like
>>> to figure it out).
>>>
>>> 0 SNES Function norm 4.974994975313e-03
>>> ==77820== Invalid read of size 4
>>> ==77820==    at 0x7E69068: LandauKokkosJacobian (in /gpfs/alpine/csc314/scratch/adams/petsc/arch-summit-opt-gnu-kokkos-notpl-cuda10/lib/libpetsc.so.3.015.0)
>>> ==77820==    by 0x7C598AF: LandauFormJacobian_Internal (plexland.c:212)
>>> ==77820==    by 0x7C728D3: LandauIJacobian (plexland.c:2107)
>>> ==77820==    by 0x7C8C26B: TSComputeIJacobian (ts.c:934)
>>> ==77820==    by 0x7E28337: SNESTSFormJacobian_Theta (theta.c:1007)
>>> ==77820==    by 0x7CBBFD3: SNESTSFormJacobian (ts.c:4415)
>>> ==77820==    by 0x7AD84BF: SNESComputeJacobian (snes.c:2824)
>>> ==77820==    by 0x7BA945B: SNESSolve_NEWTONLS (ls.c:222)
>>> ==77820==    by 0x7AF336F: SNESSolve (snes.c:4769)
>>> ==77820==    by 0x7E19D13: TSTheta_SNESSolve (theta.c:185)
>>> ==77820==    by 0x7E1A8B7: TSStep_Theta (theta.c:223)
>>> ==77820==    by 0x7CB093F: TSStep (ts.c:3571)
>>> ==77820==  Address 0x96fff690 is in a --- anonymous segment
>>> ==77820==
>>> [0]PETSC ERROR: ------------------------------------------------------------------------
>>> [0]PETSC ERROR: Caught
>>> signal number 11 SEGV: Segmentation Violation, probably memory access out of range
>>> [0]PETSC ERROR: Try option -start_in_debugger or -on_error_attach_debugger
>>> [0]PETSC ERROR: or see https://www.mcs.anl.gov/petsc/documentation/faq.html#valgrind
>>> [0]PETSC ERROR: or try http://valgrind.org on GNU/linux and Apple Mac OS X to find memory corruption errors
>>> [0]PETSC ERROR: likely location of problem given in stack below
>>> [0]PETSC ERROR: --------------------- Stack Frames ------------------------------------
>>> [0]PETSC ERROR: The EXACT line numbers in the error traceback are not available.
>>> [0]PETSC ERROR: instead the line number of the start of the function is given.
>>> [0]PETSC ERROR: #1 LandauKokkosJacobian() at /gpfs/alpine/csc314/scratch/adams/petsc/src/ts/utils/dmplexlandau/kokkos/landau.kokkos.cxx:272
>>>
>>> On Sat, May 29, 2021 at 8:46 PM Mark Adams <mfad...@lbl.gov> wrote:
>>>
>>>>
>>>>
>>>> On Sat, May 29, 2021 at 7:48 PM Barry Smith <bsm...@petsc.dev> wrote:
>>>>
>>>>>
>>>>> I don't see why it is not running the Kokkos check. Here is the
>>>>> rule right below the CUDA rule that is apparently running.
>>>>>
>>>>> check_build:
>>>>>     -@echo "Running check examples to verify correct installation"
>>>>>     -@echo "Using PETSC_DIR=${PETSC_DIR} and PETSC_ARCH=${PETSC_ARCH}"
>>>>>     +@cd src/snes/tutorials >/dev/null; ${OMAKE_SELF} PETSC_ARCH=${PETSC_ARCH} PETSC_DIR=${PETSC_DIR} clean-legacy
>>>>>     +@cd src/snes/tutorials >/dev/null; ${OMAKE_SELF} PETSC_ARCH=${PETSC_ARCH} PETSC_DIR=${PETSC_DIR} testex19
>>>>>     +@if [ "${HYPRE_LIB}" != "" ] && [ "${PETSC_WITH_BATCH}" = "" ] && [ "${PETSC_SCALAR}" = "real" ]; then \
>>>>>         cd src/snes/tutorials >/dev/null; ${OMAKE_SELF} PETSC_ARCH=${PETSC_ARCH} PETSC_DIR=${PETSC_DIR} DIFF=${PETSC_DIR}/lib/petsc/bin/petscdiff runex19_hypre; \
>>>>>     fi;
>>>>>     +@if [ "${CUDA_LIB}" != "" ] && [ "${PETSC_WITH_BATCH}" = "" ] && [ "${PETSC_SCALAR}" = "real" ]; then \
>>>>>         cd src/snes/tutorials >/dev/null; ${OMAKE_SELF} PETSC_ARCH=${PETSC_ARCH} PETSC_DIR=${PETSC_DIR} DIFF=${PETSC_DIR}/lib/petsc/bin/petscdiff runex19_cuda; \
>>>>>     fi;
>>>>>     +@if [ "${KOKKOS_KERNELS_LIB}" != "" ] && [ "${PETSC_WITH_BATCH}" = "" ] && [ "${PETSC_SCALAR}" = "real" ] && [ "${PETSC_PRECISION}" = "double" ] && [ "${MPI_IS_MPIUNI}" = "0" ]; then \
>>>>>         cd src/snes/tutorials >/dev/null; ${OMAKE_SELF} PETSC_ARCH=${PETSC_ARCH} PETSC_DIR=${PETSC_DIR} DIFF=${PETSC_DIR}/lib/petsc/bin/petscdiff runex3k_kokkos; \
>>>>>     fi;
>>>>>
>>>>> Regarding the debugging, if it is just one MPI rank (or even more),
>>>>> GDB will trap the error and show the exact line of source code
>>>>> where the error occurred, and you can poke around at variables to see if
>>>>> they look corrupt or wrong (for example, a crazy address in a pointer).
>>>>> I don't know why your debugger is not giving more useful information.
>>>>>
>>>>>
>>>> This is what I did (in DDT). It stopped at the function call and the
>>>> data looked fine. I stepped into the call, but didn't get to it. The signal
>>>> handler was called and I was dead.
>>>> Maybe I did something in my branch. Can't see what, but I'll keep probing.
>>>> Thanks,
>>>>
>>>>
>>>>> Barry
>>>>>
>>>>>
>>>>> > On May 29, 2021, at 2:16 PM, Mark Adams <mfad...@lbl.gov> wrote:
>>>>> >
>>>>> > I am running on Summit with Kokkos-CUDA and I am getting a segv that
>>>>> > looks like some sort of a compile/link mismatch. I also have a user with a
>>>>> > C++ code that is getting strange segvs when calling MatSetValues with CUDA
>>>>> > (I know MatSetValues is not a cuSPARSE method, but that is the report that
>>>>> > I have). I have no idea if these are related but they both involve C --
>>>>> > C++ calls ...
>>>>> >
>>>>> > I started with a clean build (attached) and I ran in DDT. DDT
>>>>> > stopped at the call in plexland.c to the KokkosLandau operator. I stepped
>>>>> > into this function and then took this screenshot of the stack, with the
>>>>> > Kokkos call and PETSc signal handler.
>>>>> >
>>>>> > Make check does not seem to be running Kokkos tests:
>>>>> >
>>>>> > 15:02 adams/landau-mass-opt *= /gpfs/alpine/csc314/scratch/adams/petsc$ make PETSC_DIR=/gpfs/alpine/csc314/scratch/adams/petsc PETSC_ARCH=arch-summit-opt-gnu-kokkos-notpl-cuda10 check
>>>>> > Running check examples to verify correct installation
>>>>> > Using PETSC_DIR=/gpfs/alpine/csc314/scratch/adams/petsc and PETSC_ARCH=arch-summit-opt-gnu-kokkos-notpl-cuda10
>>>>> > C/C++ example src/snes/tutorials/ex19 run successfully with 1 MPI process
>>>>> > C/C++ example src/snes/tutorials/ex19 run successfully with 2 MPI processes
>>>>> > C/C++ example src/snes/tutorials/ex19 run successfully with cuda
>>>>> > Completed test examples
>>>>> >
>>>>> > Also, I ran this AM with another branch that had not been rebased
>>>>> > with main as recently as this branch (adams/landau-mass-opt).
>>>>> >
>>>>> > Any ideas?
>>>>> > <make.log><configure.log><Screen Shot 2021-05-29 at 2.51.00 PM.png>
>>>>>
>>>>>
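
For reference, the Kokkos step in the check_build rule quoted above is guarded by five conditions, and it is keyed on KOKKOS_KERNELS_LIB rather than on a plain Kokkos library variable. One possibility worth checking (an assumption, not something confirmed in this thread) is that one of those variables does not have the expected value in this build. A minimal sketch of the same guard as a shell function follows; the function name and the sample values are hypothetical and are not part of PETSc.

```shell
#!/bin/sh
# Mirrors the guard on the Kokkos step of the quoted check_build rule.
# Prints "run" when runex3k_kokkos would be invoked, "skip" otherwise.
# Arguments, in order (values as they would be expanded by make):
#   $1 KOKKOS_KERNELS_LIB  $2 PETSC_WITH_BATCH  $3 PETSC_SCALAR
#   $4 PETSC_PRECISION     $5 MPI_IS_MPIUNI
kokkos_check_would_run() {
    if [ "$1" != "" ] && [ "$2" = "" ] && [ "$3" = "real" ] &&
       [ "$4" = "double" ] && [ "$5" = "0" ]; then
        echo run
    else
        echo skip
    fi
}

# Hypothetical values, for illustration only.
kokkos_check_would_run "-lkokkoskernels" "" real double 0   # all five hold
kokkos_check_would_run "" "" real double 0                  # empty KOKKOS_KERNELS_LIB
```

Feeding the function the actual values from the build's generated variables would show which of the five conditions, if any, is causing the Kokkos test to be skipped.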