Hi Guys, Thanks for all of the prompt responses, very helpful and appreciated.
> By "when debugging", did you mean when configure petsc --with-debugging=1 COPTFLAGS=-O0 -g etc or when you attached a debugger?

Both. I have run with a debugger attached and detached, all compiled with the flags "--with-debugging=1 COPTFLAGS=-O0 -g".

> 1) Try OpenMPI (probably won't help, but worth trying)

Worth a try for sure.

> 2) Find which part of the simulation makes it non-deterministic. Is it the mesh partitioning (parmetis)? Then try to make it deterministic.

Good tip. It is the mesh partitioning and, along the lines of a question from Barry, the matrix size is changing. I will make this deterministic and give it a try.

> 3) Dump matrices, vectors, etc and see when it fails, you can quickly reproduce the error by reading in the intermediate data.

Also a great suggestion, will give it a try.

> The full stack would be really useful here. I am guessing this happens on MatMult(), but I do not know.

Agreed. I am currently running it so that the full stack will be produced, and am waiting for it to fail. I had compiled with all available optimizations on, but the downside of that shows up, of course, when there is a failure.

As a general question, roughly what is the performance impact on petsc of -O1/-O2/-O0 as opposed to -O3? And the performance impact of --with-debugging=1? Obviously this is problem/machine dependent; I am looking for rough guidance more than anything.

> Is the nonzero structure of your matrices changing or is it fixed for the entire simulation?

The nonzero structure is changing, although the structures are re-formed when this happens, and this happens thousands of times before the failure occurs.

> Does this particular run always crash at the same place? Similar place? Doesn't always crash?

It doesn't always crash, but other similar runs have crashed in different spots, which makes it difficult to resolve.

I am going to try out a few of the strategies suggested above and will let you know what comes of that.
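
One diagnostic along those lines, as a minimal sketch only: before assembly, each rank could check that every column index it is about to insert is below the current global column count, which is the invariant Barry describes in the quoted message below. The helper name `first_bad_column` and the CSR-style `row_ptr`/`col_idx` arrays are assumptions made for illustration; this is plain C, not PETSc code.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper, not part of PETSc.
 * Scans CSR-style arrays and returns the position in col_idx of the first
 * column index outside [0, ncols), or -1 if every entry is in range.
 * row_ptr has nrows + 1 entries; col_idx holds zero-based column indices. */
static long first_bad_column(const long *row_ptr, const long *col_idx,
                             long nrows, long ncols)
{
    assert(row_ptr != NULL);
    assert(nrows >= 0 && ncols >= 0);
    for (long i = 0; i < nrows; i++) {
        for (long k = row_ptr[i]; k < row_ptr[i + 1]; k++) {
            if (col_idx[k] < 0 || col_idx[k] >= ncols) {
                return k; /* first offending entry */
            }
        }
    }
    return -1; /* all column indices are valid */
}
```

Calling something like this on each rank with the column count that is current for that time step, and logging the rank, row and value when it returns something other than -1, should point at a stale size earlier than the failure inside PetscTableFind() does.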
*Chris Hewson*
Senior Reservoir Simulation Engineer
ResFrac
+1.587.575.9792


On Thu, Sep 24, 2020 at 11:05 AM Barry Smith <bsm...@petsc.dev> wrote:

>   Chris,
>
>   We realize how frustrating this type of problem is to deal with.
>
>   Here is the code:
>
>       ierr = PetscTableCreate(aij->B->rmap->n,mat->cmap->N+1,&gid1_lid1);CHKERRQ(ierr);
>       for (i=0; i<aij->B->rmap->n; i++) {
>         for (j=0; j<B->ilen[i]; j++) {
>           PetscInt data,gid1 = aj[B->i[i] + j] + 1;
>           ierr = PetscTableFind(gid1_lid1,gid1,&data);CHKERRQ(ierr);
>           if (!data) {
>             /* one based table */
>             ierr = PetscTableAdd(gid1_lid1,gid1,++ec,INSERT_VALUES);CHKERRQ(ierr);
>           }
>         }
>       }
>
>   It is simply looping over the rows of the sparse matrix putting the columns it finds into the hash table.
>
>   aj[B->i[i] + j] are the column entries, the number of columns in the matrix is mat->cmap->N so the column entries should always be less than the number of columns. The code is crashing on column entry 24443, which is larger than the number of columns, 23988.
>
>   So either the aj[B->i[i] + j] + 1 are incorrect or the mat->cmap->N is incorrect.
>
>   [640]PETSC ERROR: #3 MatAssemblyEnd_MPIAIJ() line 876 in /home/wangy2/trunk/sawtooth/griffin/moose/petsc/src/mat/impls/aij/mpi/mpiaij.c
>
>       if (!mat->was_assembled && mode == MAT_FINAL_ASSEMBLY) {
>         ierr = MatSetUpMultiply_MPIAIJ(mat);CHKERRQ(ierr);
>       }
>
>   Seems to indicate it is setting up a new multiply because it is either the first time into the algorithm or the nonzero structure changed on some rank requiring a new assembly process.
>
>   Is the nonzero structure of your matrices changing or is it fixed for the entire simulation?
>
>   Since the code has been running for a very long time already I have to conclude that this is not the first time through and so something has changed in the matrix?
>
>   I think we have to put more diagnostics into the library to provide more information before or at the time of the error detection.
>
>   Does this particular run always crash at the same place? Similar place? Doesn't always crash?
>
>   Barry
>
>
> On Sep 24, 2020, at 8:46 AM, Chris Hewson <ch...@resfrac.com> wrote:
>
> After about a month of not having this issue pop up, it has come up again.
>
> We have been struggling with a similar PETSc error for a while now. The error message is as follows:
>
> [7]PETSC ERROR: PetscTableFind() line 132 in /home/chewson/petsc-3.13.3/include/petscctable.h key 24443 is greater than largest key allowed 23988
>
> It is a particularly nasty bug as it doesn't reproduce itself when debugging and doesn't happen all the time with the same inputs either. The problem occurs after a long runtime of the code (12-40 hours) and we are using a ksp solve with KSPBCGS.
>
> The PETSc compilation options that are used are:
>
> --download-metis
> --download-mpich
> --download-mumps
> --download-parmetis
> --download-ptscotch
> --download-scalapack
> --download-suitesparse
> --prefix=/opt/anl/petsc-3.13.3
> --with-debugging=0
> --with-mpi=1
> COPTFLAGS=-O3 -march=haswell -mtune=haswell
> CXXOPTFLAGS=-O3 -march=haswell -mtune=haswell
> FOPTFLAGS=-O3 -march=haswell -mtune=haswell
>
> This is being run across 8 processes and is failing consistently on the rank 7 process. We also use openmp outside of PETSC and the linear solve portion of the code. The rank 0 process is always using compute; while it does, the slave processes use an MPI_Wait call to wait for the collective parts of the code to be called again. All PETSC calls are done across all of the processes.
>
> We are using mpich version 3.3.2, downloaded with the petsc 3.13.3 package.
>
> At every PETSC call we are checking the error return from the function collectively to ensure that no errors have been returned from PETSC.
>
> Some possible causes that I can think of are as follows:
>
> 1. Memory leak causing a corruption either in our program or in petsc or with one of the petsc objects. This seems unlikely as we have checked runs with the option -malloc_dump for PETSc and using valgrind.
>
> 2. Optimization flags set for petsc compilation are causing variables that go out of scope to be optimized out.
>
> 3. We are giving the wrong number of elements for a process or the value is changing during the simulation. This seems unlikely as there is nothing overly unique about these simulations and it's not reproducing itself.
>
> 4. An MPI channel or socket error causing an error in the collective values for PETSc.
>
> Any input on this issue would be greatly appreciated.
>
> *Chris Hewson*
> Senior Reservoir Simulation Engineer
> ResFrac
> +1.587.575.9792
>
>
> On Thu, Aug 13, 2020 at 4:21 PM Junchao Zhang <junchao.zh...@gmail.com> wrote:
>
>> That is a great idea. I'll figure it out.
>> --Junchao Zhang
>>
>>
>> On Thu, Aug 13, 2020 at 4:31 PM Barry Smith <bsm...@petsc.dev> wrote:
>>
>>>
>>> Junchao,
>>>
>>> Any way in the PETSc configure to warn that MPICH version is "bad" or "untrustworthy" or even the vague "we have suspicions about this version and recommend avoiding it"? A lot of time could be saved if others don't deal with the same problem.
>>>
>>> Maybe add arrays of suspect_versions for OpenMPI, MPICH, etc and always check against that list and print a boxed warning at configure time? Better, you could somehow generalize it and put it in package.py for use by all packages, then any package can include lists of "suspect" versions. (There are definitely HDF5 versions that should be avoided :-)).
>>>
>>> Barry
>>>
>>>
>>> On Aug 13, 2020, at 12:14 PM, Junchao Zhang <junchao.zh...@gmail.com> wrote:
>>>
>>> Thanks for the update.
>>> Let's assume it is a bug in MPI :)
>>> --Junchao Zhang
>>>
>>>
>>> On Thu, Aug 13, 2020 at 11:15 AM Chris Hewson <ch...@resfrac.com> wrote:
>>>
>>>> Just as an update to this, I can confirm that using the mpich version (3.3.2) downloaded with the petsc download solved this issue on my end.
>>>>
>>>> *Chris Hewson*
>>>> Senior Reservoir Simulation Engineer
>>>> ResFrac
>>>> +1.587.575.9792
>>>>
>>>>
>>>> On Thu, Jul 23, 2020 at 5:58 PM Junchao Zhang <junchao.zh...@gmail.com> wrote:
>>>>
>>>>> On Mon, Jul 20, 2020 at 7:05 AM Barry Smith <bsm...@petsc.dev> wrote:
>>>>>
>>>>>>
>>>>>> Is there a comprehensive MPI test suite (perhaps from MPICH)? Is there any way to run this full test suite under the problematic MPI and see if it detects any problems?
>>>>>>
>>>>>> If so, could someone add it to the FAQ in the debugging section?
>>>>>>
>>>>> MPICH does have a test suite. It is at the subdir test/mpi of downloaded mpich <http://www.mpich.org/static/downloads/3.3.2/mpich-3.3.2.tar.gz>. It annoyed me since it is not user-friendly. It might be helpful in catching bugs at very small scale. But say if I want to test allreduce on 1024 ranks on 100 doubles, I have to hack the test suite.
>>>>> Anyway, the instructions are here.
>>>>>
>>>>> For the purpose of petsc, under test/mpi one can configure it with
>>>>>   $ ./configure CC=mpicc CXX=mpicxx FC=mpifort --enable-strictmpi --enable-threads=funneled --enable-fortran=f77,f90 --enable-fast --disable-spawn --disable-cxx   // It is weird I disabled cxx but I had to set CXX!
>>>>>   $ make -k -j8   // -k is to keep going and ignore compilation errors, e.g., when building tests for MPICH extensions not in MPI standard, but your MPI is OpenMPI.
>>>>>   $ // edit testlist, remove lines mpi_t, rma, f77, impls. Those are sub-dirs containing tests for MPI routines Petsc does not rely on.
>>>>>   $ make testing   // or directly './runtests -tests=testlist'
>>>>>
>>>>> On a batch system,
>>>>>   $ export MPITEST_BATCHDIR=`pwd`/btest   // specify a batch dir, say btest
>>>>>   $ ./runtests -batch -mpiexec=mpirun -np=1024 -tests=testlist   // Use 1024 ranks if a test does not specify the number of processes.
>>>>>   $ // It copies test binaries to the batch dir and generates a script runtests.batch there. Edit the script to fit your batch system and then submit a job and wait for it to finish.
>>>>>   $ cd btest && ../checktests --ignorebogus
>>>>>
>>>>>
>>>>> PS: Fande, the fact that changing MPI fixed your problem does not necessarily mean the old MPI has bugs. It is complicated. It could be a petsc bug. You need to provide us a code to reproduce your error. It does not matter if the code is big.
>>>>>
>>>>>
>>>>>> Thanks
>>>>>>
>>>>>> Barry
>>>>>>
>>>>>>
>>>>>> On Jul 20, 2020, at 12:16 AM, Fande Kong <fdkong...@gmail.com> wrote:
>>>>>>
>>>>>> Trace could look like this:
>>>>>>
>>>>>> [640]PETSC ERROR: --------------------- Error Message --------------------------------------------------------------
>>>>>> [640]PETSC ERROR: Argument out of range
>>>>>> [640]PETSC ERROR: key 45226154 is greater than largest key allowed 740521
>>>>>> [640]PETSC ERROR: See https://www.mcs.anl.gov/petsc/documentation/faq.html for trouble shooting.
>>>>>> [640]PETSC ERROR: Petsc Release Version 3.13.3, unknown
>>>>>> [640]PETSC ERROR: ../../griffin-opt on a arch-moose named r6i5n18 by wangy2 Sun Jul 19 17:14:28 2020
>>>>>> [640]PETSC ERROR: Configure options --download-hypre=1 --with-debugging=no --with-shared-libraries=1 --download-fblaslapack=1 --download-metis=1 --download-ptscotch=1 --download-parmetis=1 --download-superlu_dist=1 --download-mumps=1 --download-scalapack=1 --download-slepc=1 --with-mpi=1 --with-cxx-dialect=C++11 --with-fortran-bindings=0 --with-sowing=0 --with-64-bit-indices --download-mumps=0
>>>>>> [640]PETSC ERROR: #1 PetscTableFind() line 132 in /home/wangy2/trunk/sawtooth/griffin/moose/petsc/include/petscctable.h
>>>>>> [640]PETSC ERROR: #2 MatSetUpMultiply_MPIAIJ() line 33 in /home/wangy2/trunk/sawtooth/griffin/moose/petsc/src/mat/impls/aij/mpi/mmaij.c
>>>>>> [640]PETSC ERROR: #3 MatAssemblyEnd_MPIAIJ() line 876 in /home/wangy2/trunk/sawtooth/griffin/moose/petsc/src/mat/impls/aij/mpi/mpiaij.c
>>>>>> [640]PETSC ERROR: #4 MatAssemblyEnd() line 5347 in /home/wangy2/trunk/sawtooth/griffin/moose/petsc/src/mat/interface/matrix.c
>>>>>> [640]PETSC ERROR: #5 MatPtAPNumeric_MPIAIJ_MPIXAIJ_allatonce() line 901 in /home/wangy2/trunk/sawtooth/griffin/moose/petsc/src/mat/impls/aij/mpi/mpiptap.c
>>>>>> [640]PETSC ERROR: #6 MatPtAPNumeric_MPIAIJ_MPIMAIJ_allatonce() line 3180 in /home/wangy2/trunk/sawtooth/griffin/moose/petsc/src/mat/impls/maij/maij.c
>>>>>> [640]PETSC ERROR: #7 MatProductNumeric_PtAP() line 704 in /home/wangy2/trunk/sawtooth/griffin/moose/petsc/src/mat/interface/matproduct.c
>>>>>> [640]PETSC ERROR: #8 MatProductNumeric() line 759 in /home/wangy2/trunk/sawtooth/griffin/moose/petsc/src/mat/interface/matproduct.c
>>>>>> [640]PETSC ERROR: #9 MatPtAP() line 9199 in /home/wangy2/trunk/sawtooth/griffin/moose/petsc/src/mat/interface/matrix.c
>>>>>> [640]PETSC ERROR: #10 MatGalerkin() line 10236 in /home/wangy2/trunk/sawtooth/griffin/moose/petsc/src/mat/interface/matrix.c
>>>>>> [640]PETSC ERROR: #11 PCSetUp_MG() line 745 in /home/wangy2/trunk/sawtooth/griffin/moose/petsc/src/ksp/pc/impls/mg/mg.c
>>>>>> [640]PETSC ERROR: #12 PCSetUp_HMG() line 220 in /home/wangy2/trunk/sawtooth/griffin/moose/petsc/src/ksp/pc/impls/hmg/hmg.c
>>>>>> [640]PETSC ERROR: #13 PCSetUp() line 898 in /home/wangy2/trunk/sawtooth/griffin/moose/petsc/src/ksp/pc/interface/precon.c
>>>>>> [640]PETSC ERROR: #14 KSPSetUp() line 376 in /home/wangy2/trunk/sawtooth/griffin/moose/petsc/src/ksp/ksp/interface/itfunc.c
>>>>>> [640]PETSC ERROR: #15 KSPSolve_Private() line 633 in /home/wangy2/trunk/sawtooth/griffin/moose/petsc/src/ksp/ksp/interface/itfunc.c
>>>>>> [640]PETSC ERROR: #16 KSPSolve() line 853 in /home/wangy2/trunk/sawtooth/griffin/moose/petsc/src/ksp/ksp/interface/itfunc.c
>>>>>> [640]PETSC ERROR: #17 SNESSolve_NEWTONLS() line 225 in /home/wangy2/trunk/sawtooth/griffin/moose/petsc/src/snes/impls/ls/ls.c
>>>>>> [640]PETSC ERROR: #18 SNESSolve() line 4519 in /home/wangy2/trunk/sawtooth/griffin/moose/petsc/src/snes/interface/snes.c
>>>>>>
>>>>>> On Sun, Jul 19, 2020 at 11:13 PM Fande Kong <fdkong...@gmail.com> wrote:
>>>>>>
>>>>>>> I am not entirely sure what is happening, but we encountered similar issues recently. It was not reproducible. It might occur at different stages, and the errors could be weird ones other than "ctable stuff." Our code was Valgrind clean since every PR in moose needs to go through rigorous Valgrind checks before it reaches the devel branch. The errors happened when we used mvapich.
>>>>>>>
>>>>>>> We changed to use HPE-MPT (a vendor-installed MPI), then everything was smooth. Could you try a different MPI? It is better to try one provided by the system.
>>>>>>>
>>>>>>> We did not get to the bottom of this problem yet, but we at least know this is kind of MPI-related.
>>>>>>>
>>>>>>> Thanks,
>>>>>>>
>>>>>>> Fande,
>>>>>>>
>>>>>>>
>>>>>>> On Sun, Jul 19, 2020 at 3:28 PM Chris Hewson <ch...@resfrac.com> wrote:
>>>>>>>
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>> I am having a bug that is occurring in PETSC with the return string:
>>>>>>>>
>>>>>>>> [7]PETSC ERROR: PetscTableFind() line 132 in /home/chewson/petsc-3.13.2/include/petscctable.h key 7556 is greater than largest key allowed 5693
>>>>>>>>
>>>>>>>> This is using petsc-3.13.2, compiled and running using mpich with -O3 and debugging turned off, tuned to the haswell architecture, and occurring either before or during a KSPBCGS solve/setup or during a MUMPS factorization solve (I haven't been able to replicate this issue with the same set of instructions etc.).
>>>>>>>>
>>>>>>>> This is a terrible way to ask a question, I know, and not very helpful from your side, but this is what I have from a user's run and can't reproduce on my end (either with the optimization compilation or with debugging turned on). This happens when the code has run for quite some time and is happening somewhat rarely.
>>>>>>>>
>>>>>>>> More than likely I am using a static variable (code is written in c++) that I'm not updating when the matrix size is changing or something silly like that, but any help or guidance on this would be appreciated.
>>>>>>>>
>>>>>>>> *Chris Hewson*
>>>>>>>> Senior Reservoir Simulation Engineer
>>>>>>>> ResFrac
>>>>>>>> +1.587.575.9792