I have used valgrind here. I did not run it on this MPI error. I will.
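For reference, the plan is to put the same failing configuration under valgrind. A sketch of what such a run could look like on Crusher (the node/rank counts and valgrind flags below are illustrative, and the batch account/time flags are omitted):

    srun -N 8 --ntasks-per-node=8 \
      valgrind --track-origins=yes --log-file=valgrind.%p.log \
      ./ex13 -dm_refine 5 -benchmark_it 10

With --log-file=valgrind.%p.log each rank writes its own log (%p expands to the process id). Note that valgrind only watches host memory; catching corruption in device buffers would need a ROCm-side tool, per Barry's "HIP variant of valgrind" comment below.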
On Wed, Jan 26, 2022 at 10:56 AM Barry Smith <bsm...@petsc.dev> wrote:

>    Any way to run with valgrind (or a HIP variant of valgrind)? It looks
> like a memory corruption issue, and tracking down exactly when the
> corruption begins is 3/4's of the way to finding the exact cause.
>
>    Are the crashes reproducible in the same place with identical runs?
>
> On Jan 26, 2022, at 10:46 AM, Mark Adams <mfad...@lbl.gov> wrote:
>
> I think it is an MPI bug. It works with GPU aware MPI turned off.
> I am sure Summit will be fine.
> We have had users fix this error by switching their MPI.
>
> On Wed, Jan 26, 2022 at 10:10 AM Junchao Zhang <junchao.zh...@gmail.com> wrote:
>
>> I don't know if this is due to bugs in the petsc/kokkos backend. See if you
>> can run on 6 nodes (48 MPI ranks). If it fails, then run the same problem on
>> Summit with 8 nodes to see if it still fails. If yes, it is likely a bug of
>> our own.
>>
>> --Junchao Zhang
>>
>> On Wed, Jan 26, 2022 at 8:44 AM Mark Adams <mfad...@lbl.gov> wrote:
>>
>>> I am not able to reproduce this with a small problem. Two nodes, or less
>>> refinement, works. This is from the 8 node test, the -dm_refine 5 version.
>>> I see that it comes from PtAP.
>>> This is on the fine grid. (I was thinking it could be on a reduced grid
>>> with idle processors, but no.)
>>>
>>> [15]PETSC ERROR: Argument out of range
>>> [15]PETSC ERROR: Key <= 0
>>> [15]PETSC ERROR: See https://petsc.org/release/faq/ for trouble shooting.
>>> [15]PETSC ERROR: Petsc Development GIT revision: v3.16.3-696-g46640c56cb  GIT Date: 2022-01-25 09:20:51 -0500
>>> [15]PETSC ERROR: /gpfs/alpine/csc314/scratch/adams/petsc/src/snes/tests/data/../ex13 on a arch-olcf-crusher named crusher020 by adams Wed Jan 26 08:35:47 2022
>>> [15]PETSC ERROR: Configure options --with-cc=cc --with-cxx=CC --with-fc=ftn --with-fortran-bindings=0 LIBS="-L/opt/cray/pe/mpich/8.1.12/gtl/lib -lmpi_gtl_hsa" --with-debugging=0 --COPTFLAGS="-g -O" --CXXOPTFLAGS="-g -O" --FOPTFLAGS=-g --with-mpiexec="srun -p batch -N 1 -A csc314_crusher -t 00:10:00" --with-hip --with-hipc=hipcc --download-hypre --with-hip-arch=gfx90a --download-kokkos --download-kokkos-kernels --with-kokkos-kernels-tpl=0 --download-p4est=1 --with-zlib-dir=/sw/crusher/spack-envs/base/opt/cray-sles15-zen3/cce-13.0.0/zlib-1.2.11-qx5p4iereg4sjvfi5uwk6jn56o6se2q4 PETSC_ARCH=arch-olcf-crusher
>>> [15]PETSC ERROR: #1 PetscTableFind() at /gpfs/alpine/csc314/scratch/adams/petsc/include/petscctable.h:131
>>> [15]PETSC ERROR: #2 MatSetUpMultiply_MPIAIJ() at /gpfs/alpine/csc314/scratch/adams/petsc/src/mat/impls/aij/mpi/mmaij.c:35
>>> [15]PETSC ERROR: #3 MatAssemblyEnd_MPIAIJ() at /gpfs/alpine/csc314/scratch/adams/petsc/src/mat/impls/aij/mpi/mpiaij.c:735
>>> [15]PETSC ERROR: #4 MatAssemblyEnd_MPIAIJKokkos() at /gpfs/alpine/csc314/scratch/adams/petsc/src/mat/impls/aij/mpi/kokkos/mpiaijkok.kokkos.cxx:14
>>> [15]PETSC ERROR: #5 MatAssemblyEnd() at /gpfs/alpine/csc314/scratch/adams/petsc/src/mat/interface/matrix.c:5678
>>> [15]PETSC ERROR: #6 MatSetMPIAIJKokkosWithSplitSeqAIJKokkosMatrices() at /gpfs/alpine/csc314/scratch/adams/petsc/src/mat/impls/aij/mpi/kokkos/mpiaijkok.kokkos.cxx:267
>>> [15]PETSC ERROR: #7 MatSetMPIAIJKokkosWithGlobalCSRMatrix() at /gpfs/alpine/csc314/scratch/adams/petsc/src/mat/impls/aij/mpi/kokkos/mpiaijkok.kokkos.cxx:825
>>> [15]PETSC ERROR: #8 MatProductSymbolic_MPIAIJKokkos() at /gpfs/alpine/csc314/scratch/adams/petsc/src/mat/impls/aij/mpi/kokkos/mpiaijkok.kokkos.cxx:1167
>>> [15]PETSC ERROR: #9 MatProductSymbolic() at /gpfs/alpine/csc314/scratch/adams/petsc/src/mat/interface/matproduct.c:825
>>> [15]PETSC ERROR: #10 MatPtAP() at /gpfs/alpine/csc314/scratch/adams/petsc/src/mat/interface/matrix.c:9656
>>> [15]PETSC ERROR: #11 PCGAMGCreateLevel_GAMG() at /gpfs/alpine/csc314/scratch/adams/petsc/src/ksp/pc/impls/gamg/gamg.c:87
>>> [15]PETSC ERROR: #12 PCSetUp_GAMG() at /gpfs/alpine/csc314/scratch/adams/petsc/src/ksp/pc/impls/gamg/gamg.c:663
>>> [15]PETSC ERROR: #13 PCSetUp() at /gpfs/alpine/csc314/scratch/adams/petsc/src/ksp/pc/interface/precon.c:1017
>>> [15]PETSC ERROR: #14 KSPSetUp() at /gpfs/alpine/csc314/scratch/adams/petsc/src/ksp/ksp/interface/itfunc.c:417
>>> [15]PETSC ERROR: #15 KSPSolve_Private() at /gpfs/alpine/csc314/scratch/adams/petsc/src/ksp/ksp/interface/itfunc.c:863
>>> [15]PETSC ERROR: #16 KSPSolve() at /gpfs/alpine/csc314/scratch/adams/petsc/src/ksp/ksp/interface/itfunc.c:1103
>>> [15]PETSC ERROR: #17 SNESSolve_KSPONLY() at /gpfs/alpine/csc314/scratch/adams/petsc/src/snes/impls/ksponly/ksponly.c:51
>>> [15]PETSC ERROR: #18 SNESSolve() at /gpfs/alpine/csc314/scratch/adams/petsc/src/snes/interface/snes.c:4810
>>> [15]PETSC ERROR: #19 main() at ex13.c:169
>>> [15]PETSC ERROR: PETSc Option Table entries:
>>> [15]PETSC ERROR: -benchmark_it 10
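The workaround mentioned above (GPU aware MPI turned off) is a runtime toggle in PETSc, so both cases can be tried with the same binary. A minimal sketch of the failing 8-node run and the workaround, using the -use_gpu_aware_mpi option that also appears later in this thread (batch account/time flags omitted; these are illustrative, not the exact job lines):

    # as above: communicate directly from device buffers (GPU-aware MPI on)
    srun -N 8 --ntasks-per-node=8 ./ex13 -dm_refine 5 -benchmark_it 10

    # workaround: have PETSc stage MPI buffers on the host
    srun -N 8 --ntasks-per-node=8 ./ex13 -dm_refine 5 -benchmark_it 10 -use_gpu_aware_mpi 0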
>>> On Wed, Jan 26, 2022 at 7:26 AM Mark Adams <mfad...@lbl.gov> wrote:
>>>
>>>> The GPU aware MPI is dying going from 1 to 8 nodes, 8 processes per node.
>>>> I will make a minimum reproducer, starting with 2 nodes, one process on
>>>> each node.
>>>>
>>>> On Tue, Jan 25, 2022 at 10:19 PM Barry Smith <bsm...@petsc.dev> wrote:
>>>>
>>>>>    So the MPI is killing you in going from 8 to 64. (The GPU flop rate
>>>>> scales almost perfectly, but the overall flop rate is only half of what
>>>>> it should be at 64.)
>>>>>
>>>>> On Jan 25, 2022, at 9:24 PM, Mark Adams <mfad...@lbl.gov> wrote:
>>>>>
>>>>> It looks like we have our instrumentation and job configuration in
>>>>> decent shape, so on to scaling with AMG.
>>>>> In using multiple nodes I got errors with table entries not found, which
>>>>> can be caused by a buggy MPI, and the problem does go away when I turn
>>>>> GPU aware MPI off.
>>>>> Jed's analysis, if I have this right, is that at *0.7T* flops we are at
>>>>> about 35% of theoretical peak wrt memory bandwidth.
>>>>> I run out of memory with the next step in this study (7 levels of
>>>>> refinement), with 2M equations per GPU. This seems low to me and we will
>>>>> see if we can fix this.
>>>>> So this 0.7 Tflops is with only 1/4 M equations, so 35% is not terrible.
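As a rough sanity check on that 35% figure (my own back-of-envelope, not Jed's actual analysis; the per-GCD bandwidth and flop-per-byte numbers are assumptions):

    8 GCDs/node x ~1.6 TB/s peak HBM bandwidth per GCD ~ 12.8 TB/s per node
    assume ~1 flop per 6 bytes moved for the sparse kernels dominating the solve
    bandwidth-limited ceiling ~ 12.8e12 B/s / 6 B per flop ~ 2.1 Tflop/s per node
    0.7 Tflop/s / 2.1 Tflop/s ~ 33%

so 0.7 Tflop/s on one node landing around a third of the bandwidth-limited peak is consistent with the number quoted above.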
>>>>> Here are the solve times with 001, 008 and 064 nodes, and 5 or 6 levels
>>>>> of refinement.
>>>>>
>>>>> out_001_kokkos_Crusher_5_1.txt:KSPSolve 10 1.0 1.2933e+00 1.0 4.13e+10 1.1 1.8e+05 8.4e+03 5.8e+02  3 87 86 78 48 100100100100100 248792 423857 6840 3.85e+02 6792 3.85e+02 100
>>>>> out_001_kokkos_Crusher_6_1.txt:KSPSolve 10 1.0 5.3667e+00 1.0 3.89e+11 1.0 2.1e+05 3.3e+04 6.7e+02  2 87 86 79 48 100100100100100 571572 *700002* 7920 1.74e+03 7920 1.74e+03 100
>>>>> out_008_kokkos_Crusher_5_1.txt:KSPSolve 10 1.0 1.9407e+00 1.0 4.94e+10 1.1 3.5e+06 6.2e+03 6.7e+02  5 87 86 79 47 100100100100100 1581096 3034723 7920 6.88e+02 7920 6.88e+02 100
>>>>> out_008_kokkos_Crusher_6_1.txt:KSPSolve 10 1.0 7.4478e+00 1.0 4.49e+11 1.0 4.1e+06 2.3e+04 7.6e+02  2 88 87 80 49 100100100100100 3798162 5557106 9367 3.02e+03 9359 3.02e+03 100
>>>>> out_064_kokkos_Crusher_5_1.txt:KSPSolve 10 1.0 2.4551e+00 1.0 5.40e+10 1.1 4.2e+07 5.4e+03 7.3e+02  5 88 87 80 47 100100100100100 11065887 23792978 8684 8.90e+02 8683 8.90e+02 100
>>>>> out_064_kokkos_Crusher_6_1.txt:KSPSolve 10 1.0 1.1335e+01 1.0 5.38e+11 1.0 5.4e+07 2.0e+04 9.1e+02  4 88 88 82 49 100100100100100 24130606 43326249 11249 4.26e+03 11249 4.26e+03 100
>>>>>
>>>>> On Tue, Jan 25, 2022 at 1:49 PM Mark Adams <mfad...@lbl.gov> wrote:
>>>>>
>>>>>>> Note that Mark's logs have been switching back and forth between
>>>>>>> -use_gpu_aware_mpi and changing number of ranks -- we won't have that
>>>>>>> information if we do manual timing hacks. This is going to be a routine
>>>>>>> thing we'll need on the mailing list and we need the provenance to go
>>>>>>> with it.
>>>>>>
>>>>>> GPU aware MPI crashes sometimes so to be safe, while debugging, I had
>>>>>> it off. It works fine here so it has been on in the last tests.
>>>>>> Here is a comparison.
>>>>>
>>>>> <tt.tar>
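For the provenance point above, one convenient pattern is to send each run's full log to its own file, since the -log_view summary records the option table for that run; something like the following (file names, refinement level, and rank counts are illustrative, following the naming used above):

    srun -N 1 --ntasks-per-node=8 ./ex13 -dm_refine 6 -benchmark_it 10 \
      -log_view :out_001_kokkos_Crusher_6_1.txt
    srun -N 1 --ntasks-per-node=8 ./ex13 -dm_refine 6 -benchmark_it 10 \
      -use_gpu_aware_mpi 0 -log_view :out_001_kokkos_Crusher_6_1_hostmpi.txt

    grep KSPSolve out_*_kokkos_Crusher_*_1.txt

Each log then records whether -use_gpu_aware_mpi was set, so timing lines like the ones above stay tied to the options and rank counts that produced them.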