On Thu, Sep 24, 2020 at 9:47 AM Chris Hewson <ch...@resfrac.com> wrote:
> After about a month of not having this issue pop up, it has come up again.
>
> We have been struggling with a similar PETSc error for a while now; the error message is as follows:
>
> [7]PETSC ERROR: PetscTableFind() line 132 in /home/chewson/petsc-3.13.3/include/petscctable.h key 24443 is greater than largest key allowed 23988

The full stack would be really useful here. I am guessing this happens on MatMult(), but I do not know.

  Thanks,

    Matt

> It is a particularly nasty bug as it doesn't reproduce itself when debugging and doesn't happen all the time with the same inputs either. The problem occurs after a long runtime of the code (12-40 hours) and we are using a KSP solve with KSPBCGS.
>
> The PETSc compilation options that are used are:
>
> --download-metis
> --download-mpich
> --download-mumps
> --download-parmetis
> --download-ptscotch
> --download-scalapack
> --download-suitesparse
> --prefix=/opt/anl/petsc-3.13.3
> --with-debugging=0
> --with-mpi=1
> COPTFLAGS=-O3 -march=haswell -mtune=haswell
> CXXOPTFLAGS=-O3 -march=haswell -mtune=haswell
> FOPTFLAGS=-O3 -march=haswell -mtune=haswell
>
> This is being run across 8 processes and is failing consistently on the rank 7 process. We also use OpenMP outside of PETSc and the linear solve portion of the code. The rank 0 process is always using compute; during this the slave processes use an MPI_Wait call to wait for the collective parts of the code to be called again. All PETSc calls are done across all of the processes.
>
> We are using mpich version 3.3.2, downloaded with the petsc 3.13.3 package.
>
> At every PETSc call we are checking the error return from the function collectively to ensure that no errors have been returned from PETSc.
>
> Some possible causes that I can think of are as follows:
> 1. A memory leak causing a corruption, either in our program or in petsc or with one of the petsc objects. This seems unlikely as we have checked runs with the option -malloc_dump for PETSc and using valgrind.
> 2. Optimization flags set for the petsc compilation are causing variables that go out of scope to be optimized out.
> 3. We are giving the wrong number of elements for a process, or the value is changing during the simulation. This seems unlikely as there is nothing overly unique about these simulations and it's not reproducing itself.
> 4. An MPI channel or socket error causing an error in the collective values for PETSc.
>
> Any input on this issue would be greatly appreciated.
>
> *Chris Hewson*
> Senior Reservoir Simulation Engineer
> ResFrac
> +1.587.575.9792
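[For reference: a minimal sketch, not Chris's actual code, of the "check the error return collectively" pattern described above, assuming plain PETSc 3.13 error handling. Each rank reduces its local return code so that all ranks agree on whether to stop, rather than some ranks proceeding into the next collective call while others have already errored out.]

    #include <petscvec.h>

    /* Sketch: reduce the local PETSc return code across the communicator so
       every rank learns whether any rank failed. All ranks must call this,
       otherwise the MPI_Allreduce itself will hang. */
    static PetscErrorCode CheckErrCollective(MPI_Comm comm, PetscErrorCode ierr)
    {
      PetscMPIInt loc = (PetscMPIInt)ierr, glob = 0;
      MPI_Allreduce(&loc, &glob, 1, MPI_INT, MPI_MAX, comm);
      if (glob) SETERRQ(comm, PETSC_ERR_LIB, "a PETSc call failed on at least one rank");
      return 0;
    }

    int main(int argc, char **argv)
    {
      Vec            x = NULL;
      PetscErrorCode ierr;

      ierr = PetscInitialize(&argc, &argv, NULL, NULL); if (ierr) return ierr;
      ierr = VecCreate(PETSC_COMM_WORLD, &x);                 /* the call being checked */
      ierr = CheckErrCollective(PETSC_COMM_WORLD, ierr); CHKERRQ(ierr);
      ierr = VecDestroy(&x); CHKERRQ(ierr);
      ierr = PetscFinalize();
      return ierr;
    }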
> On Thu, Aug 13, 2020 at 4:21 PM Junchao Zhang <junchao.zh...@gmail.com> wrote:
>
>> That is a great idea. I'll figure it out.
>> --Junchao Zhang
>>
>> On Thu, Aug 13, 2020 at 4:31 PM Barry Smith <bsm...@petsc.dev> wrote:
>>
>>> Junchao,
>>>
>>> Is there any way in the PETSc configure to warn that an MPICH version is "bad" or "untrustworthy", or even the vaguer "we have suspicions about this version and recommend avoiding it"? A lot of time could be saved if others don't have to deal with the same problem.
>>>
>>> Maybe add arrays of suspect_versions for OpenMPI, MPICH, etc. and always check against that list and print a boxed warning at configure time? Better, you could somehow generalize it and put it in package.py for use by all packages; then any package could include lists of "suspect" versions. (There are definitely HDF5 versions that should be avoided :-)).
>>>
>>>   Barry
>>>
>>> On Aug 13, 2020, at 12:14 PM, Junchao Zhang <junchao.zh...@gmail.com> wrote:
>>>
>>> Thanks for the update. Let's assume it is a bug in MPI :)
>>> --Junchao Zhang
>>>
>>> On Thu, Aug 13, 2020 at 11:15 AM Chris Hewson <ch...@resfrac.com> wrote:
>>>
>>>> Just as an update to this, I can confirm that using the mpich version (3.3.2) downloaded with the petsc download solved this issue on my end.
>>>>
>>>> *Chris Hewson*
>>>> Senior Reservoir Simulation Engineer
>>>> ResFrac
>>>> +1.587.575.9792
>>>>
>>>> On Thu, Jul 23, 2020 at 5:58 PM Junchao Zhang <junchao.zh...@gmail.com> wrote:
>>>>
>>>>> On Mon, Jul 20, 2020 at 7:05 AM Barry Smith <bsm...@petsc.dev> wrote:
>>>>>
>>>>>> Is there a comprehensive MPI test suite (perhaps from MPICH)? Is there any way to run this full test suite under the problematic MPI and see if it detects any problems?
>>>>>>
>>>>>> If so, could someone add it to the FAQ in the debugging section?
>>>>>
>>>>> MPICH does have a test suite. It is in the subdir test/mpi of the downloaded mpich <http://www.mpich.org/static/downloads/3.3.2/mpich-3.3.2.tar.gz>. It annoyed me since it is not user-friendly. It might be helpful in catching bugs at very small scale. But if, say, I want to test allreduce on 1024 ranks with 100 doubles, I have to hack the test suite. Anyway, the instructions are below.
>>>>>
>>>>> For the purpose of petsc, under test/mpi one can configure it with
>>>>> $ ./configure CC=mpicc CXX=mpicxx FC=mpifort --enable-strictmpi --enable-threads=funneled --enable-fortran=f77,f90 --enable-fast --disable-spawn --disable-cxx --disable-ft-tests   // It is weird that I disabled cxx but still had to set CXX!
>>>>> $ make -k -j8   // -k is to keep going and ignore compilation errors, e.g., when building tests for MPICH extensions that are not in the MPI standard but your MPI is OpenMPI.
>>>>> $ // edit testlist, removing the lines mpi_t, rma, f77, impls. Those are sub-dirs containing tests for MPI routines Petsc does not rely on.
>>>>> $ make testings   (or directly ./runtests -tests=testlist)
>>>>>
>>>>> On a batch system,
>>>>> $ export MPITEST_BATCHDIR=`pwd`/btest   // specify a batch dir, say btest
>>>>> $ ./runtests -batch -mpiexec=mpirun -np=1024 -tests=testlist   // Use 1024 ranks if a test does not specify the number of processes.
>>>>> $ // It copies test binaries to the batch dir and generates a script runtests.batch there. Edit the script to fit your batch system, then submit a job and wait for it to finish.
>>>>> $ cd btest && ../checktests --ignorebogus
>>>>>
>>>>> PS: Fande, the fact that changing the MPI fixed your problem does not necessarily mean the old MPI has bugs. It is complicated. It could be a petsc bug. You need to provide us a code to reproduce your error. It does not matter if the code is big.
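[As an aside: the kind of targeted check Junchao mentions, e.g., allreduce on many ranks with 100 doubles, can also be written as a tiny stand-alone program. The following is only an illustrative sketch, not part of the MPICH test suite: it repeats an MPI_Allreduce whose exact result is known in advance and aborts on the first mismatch.]

    #include <mpi.h>
    #include <stdio.h>
    #include <stdlib.h>

    int main(int argc, char **argv)
    {
      int rank, size, iter, i, n = 100;
      double *in, *out;

      MPI_Init(&argc, &argv);
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);
      MPI_Comm_size(MPI_COMM_WORLD, &size);
      in  = (double *)malloc(n * sizeof(double));
      out = (double *)malloc(n * sizeof(double));
      for (i = 0; i < n; i++) in[i] = (double)(rank + i);

      for (iter = 0; iter < 10000; iter++) {
        MPI_Allreduce(in, out, n, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
        for (i = 0; i < n; i++) {
          /* sum over ranks of (rank + i) = size*i + size*(size-1)/2, exact in double here */
          double expect = (double)size * i + (double)size * (size - 1) / 2.0;
          if (out[i] != expect) {
            fprintf(stderr, "[%d] mismatch at iteration %d, entry %d: got %g, expected %g\n",
                    rank, iter, i, out[i], expect);
            MPI_Abort(MPI_COMM_WORLD, 1);
          }
        }
      }
      if (rank == 0) printf("allreduce check passed\n");
      free(in); free(out);
      MPI_Finalize();
      return 0;
    }

Run it with something like mpiexec -n 1024 ./a.out; a clean pass does not prove the MPI is correct, but a failure is a strong signal.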
>>>>>> Thanks
>>>>>>
>>>>>>   Barry
>>>>>>
>>>>>> On Jul 20, 2020, at 12:16 AM, Fande Kong <fdkong...@gmail.com> wrote:
>>>>>>
>>>>>> Trace could look like this:
>>>>>>
>>>>>> [640]PETSC ERROR: --------------------- Error Message --------------------------------------------------------------
>>>>>> [640]PETSC ERROR: Argument out of range
>>>>>> [640]PETSC ERROR: key 45226154 is greater than largest key allowed 740521
>>>>>> [640]PETSC ERROR: See https://www.mcs.anl.gov/petsc/documentation/faq.html for trouble shooting.
>>>>>> [640]PETSC ERROR: Petsc Release Version 3.13.3, unknown
>>>>>> [640]PETSC ERROR: ../../griffin-opt on a arch-moose named r6i5n18 by wangy2 Sun Jul 19 17:14:28 2020
>>>>>> [640]PETSC ERROR: Configure options --download-hypre=1 --with-debugging=no --with-shared-libraries=1 --download-fblaslapack=1 --download-metis=1 --download-ptscotch=1 --download-parmetis=1 --download-superlu_dist=1 --download-mumps=1 --download-scalapack=1 --download-slepc=1 --with-mpi=1 --with-cxx-dialect=C++11 --with-fortran-bindings=0 --with-sowing=0 --with-64-bit-indices --download-mumps=0
>>>>>> [640]PETSC ERROR: #1 PetscTableFind() line 132 in /home/wangy2/trunk/sawtooth/griffin/moose/petsc/include/petscctable.h
>>>>>> [640]PETSC ERROR: #2 MatSetUpMultiply_MPIAIJ() line 33 in /home/wangy2/trunk/sawtooth/griffin/moose/petsc/src/mat/impls/aij/mpi/mmaij.c
>>>>>> [640]PETSC ERROR: #3 MatAssemblyEnd_MPIAIJ() line 876 in /home/wangy2/trunk/sawtooth/griffin/moose/petsc/src/mat/impls/aij/mpi/mpiaij.c
>>>>>> [640]PETSC ERROR: #4 MatAssemblyEnd() line 5347 in /home/wangy2/trunk/sawtooth/griffin/moose/petsc/src/mat/interface/matrix.c
>>>>>> [640]PETSC ERROR: #5 MatPtAPNumeric_MPIAIJ_MPIXAIJ_allatonce() line 901 in /home/wangy2/trunk/sawtooth/griffin/moose/petsc/src/mat/impls/aij/mpi/mpiptap.c
>>>>>> [640]PETSC ERROR: #6 MatPtAPNumeric_MPIAIJ_MPIMAIJ_allatonce() line 3180 in /home/wangy2/trunk/sawtooth/griffin/moose/petsc/src/mat/impls/maij/maij.c
>>>>>> [640]PETSC ERROR: #7 MatProductNumeric_PtAP() line 704 in /home/wangy2/trunk/sawtooth/griffin/moose/petsc/src/mat/interface/matproduct.c
>>>>>> [640]PETSC ERROR: #8 MatProductNumeric() line 759 in /home/wangy2/trunk/sawtooth/griffin/moose/petsc/src/mat/interface/matproduct.c
>>>>>> [640]PETSC ERROR: #9 MatPtAP() line 9199 in /home/wangy2/trunk/sawtooth/griffin/moose/petsc/src/mat/interface/matrix.c
>>>>>> [640]PETSC ERROR: #10 MatGalerkin() line 10236 in /home/wangy2/trunk/sawtooth/griffin/moose/petsc/src/mat/interface/matrix.c
>>>>>> [640]PETSC ERROR: #11 PCSetUp_MG() line 745 in /home/wangy2/trunk/sawtooth/griffin/moose/petsc/src/ksp/pc/impls/mg/mg.c
>>>>>> [640]PETSC ERROR: #12 PCSetUp_HMG() line 220 in /home/wangy2/trunk/sawtooth/griffin/moose/petsc/src/ksp/pc/impls/hmg/hmg.c
>>>>>> [640]PETSC ERROR: #13 PCSetUp() line 898 in /home/wangy2/trunk/sawtooth/griffin/moose/petsc/src/ksp/pc/interface/precon.c
>>>>>> [640]PETSC ERROR: #14 KSPSetUp() line 376 in /home/wangy2/trunk/sawtooth/griffin/moose/petsc/src/ksp/ksp/interface/itfunc.c
>>>>>> [640]PETSC ERROR: #15 KSPSolve_Private() line 633 in /home/wangy2/trunk/sawtooth/griffin/moose/petsc/src/ksp/ksp/interface/itfunc.c
>>>>>> [640]PETSC ERROR: #16 KSPSolve() line 853 in /home/wangy2/trunk/sawtooth/griffin/moose/petsc/src/ksp/ksp/interface/itfunc.c
>>>>>> [640]PETSC ERROR: #17 SNESSolve_NEWTONLS() line 225 in /home/wangy2/trunk/sawtooth/griffin/moose/petsc/src/snes/impls/ls/ls.c
>>>>>> [640]PETSC ERROR: #18 SNESSolve() line 4519 in /home/wangy2/trunk/sawtooth/griffin/moose/petsc/src/snes/interface/snes.c
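[For what it's worth, in this trace the failing PetscTableFind() lookup appears to happen while MatAssemblyEnd() builds the off-process column map, so "key ... greater than largest key allowed" suggests a global column index larger than the matrix's column count reached assembly. A purely illustrative sketch — an assumption about where such an index could be caught early, not a known fix for this bug — is to range-check column indices before MatSetValues():]

    #include <petscmat.h>

    /* Illustrative only: validate global column indices before insertion.
       An out-of-range index here could otherwise surface much later as the
       PetscTableFind() "key ... greater than largest key allowed" error. */
    static PetscErrorCode SafeMatSetValuesRow(Mat A, PetscInt row, PetscInt ncols,
                                              const PetscInt *cols, const PetscScalar *vals)
    {
      PetscInt       M, N, j;
      PetscErrorCode ierr;

      ierr = MatGetSize(A, &M, &N); CHKERRQ(ierr);
      for (j = 0; j < ncols; j++) {
        if (cols[j] < 0 || cols[j] >= N)
          SETERRQ3(PETSC_COMM_SELF, PETSC_ERR_ARG_OUTOFRANGE,
                   "column index %D of row %D is outside [0,%D)", cols[j], row, N);
      }
      ierr = MatSetValues(A, 1, &row, ncols, cols, vals, INSERT_VALUES); CHKERRQ(ierr);
      return 0;
    }

Of course, if the corruption happens after the values have been handed to PETSc (e.g., in an MPI message during assembly), a check like this will not trigger.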
>>>>>> On Sun, Jul 19, 2020 at 11:13 PM Fande Kong <fdkong...@gmail.com> wrote:
>>>>>>
>>>>>>> I am not entirely sure what is happening, but we encountered similar issues recently. It was not reproducible. It might occur at different stages, and the errors could be weird ones other than the "ctable stuff." Our code was Valgrind clean, since every PR in moose needs to go through rigorous Valgrind checks before it reaches the devel branch. The errors happened when we used mvapich.
>>>>>>>
>>>>>>> We changed to HPE-MPT (a vendor-installed MPI), and then everything was smooth. Could you try a different MPI? It is better to try one provided by the system.
>>>>>>>
>>>>>>> We have not gotten to the bottom of this problem yet, but we at least know it is somehow MPI-related.
>>>>>>>
>>>>>>> Thanks,
>>>>>>>
>>>>>>> Fande,
>>>>>>>
>>>>>>> On Sun, Jul 19, 2020 at 3:28 PM Chris Hewson <ch...@resfrac.com> wrote:
>>>>>>>
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>> I am having a bug that is occurring in PETSc with the return string:
>>>>>>>>
>>>>>>>> [7]PETSC ERROR: PetscTableFind() line 132 in /home/chewson/petsc-3.13.2/include/petscctable.h key 7556 is greater than largest key allowed 5693
>>>>>>>>
>>>>>>>> This is using petsc-3.13.2, compiled and run using mpich with -O3, debugging turned off, and tuned to the haswell architecture, and it occurs either before or during a KSPBCGS solve/setup or during a MUMPS factorization solve (I haven't been able to replicate this issue with the same set of instructions etc.).
>>>>>>>>
>>>>>>>> This is a terrible way to ask a question, I know, and not very helpful from your side, but this is what I have from a user's run and I can't reproduce it on my end (either with the optimized compilation or with debugging turned on). This happens when the code has run for quite some time and is happening somewhat rarely.
>>>>>>>>
>>>>>>>> More than likely I am using a static variable (the code is written in C++) that I'm not updating when the matrix size is changing, or something silly like that, but any help or guidance on this would be appreciated.
>>>>>>>>
>>>>>>>> *Chris Hewson*
>>>>>>>> Senior Reservoir Simulation Engineer
>>>>>>>> ResFrac
>>>>>>>> +1.587.575.9792

--
What most experimenters take for granted before they begin their experiments is infinitely more interesting than any results to which their experiments lead.

-- Norbert Wiener

https://www.cse.buffalo.edu/~knepley/ <http://www.cse.buffalo.edu/~knepley/>