On Fri, 20 Sep 2019, Smith, Barry F. via petsc-dev wrote:

> > On Sep 19, 2019, at 9:11 PM, Balay, Satish <ba...@mcs.anl.gov> wrote:
> >
> > On Fri, 20 Sep 2019, Smith, Barry F. via petsc-dev wrote:
> >
> >> This should be reported on gitlab, not in email.
> >>
> >> Anyways, my interpretation is that the machine runs low on swap space, so the OS is killing things. Once Satish and I sat down and checked the system logs on one machine that had little swap, and we saw system messages about low swap at exactly the time the tests were killed. Satish is resistant to increasing swap; I don't know why. Other times we see these kills and they may not be due to swap, but then they are a mystery.
> >
> > That was on bsd.
> >
> > This machine has 8GB swap, which should be sufficient. And this issue [on this machine] was triggered only by this MR - which was weird.
>
>   Does it happen every time to the same examples?
I might have tried restarting a job once or twice - so yes. And 3 jobs [from a single pipeline] failed on this box.

>   If you log in and run that one test, does it happen?

I've only tried running the failing tests - and they ran fine. Didn't try 'make alltests' at that time.

Satish

>   If the MR is changing scatter code, could it have broken something?
>
>   We need to know why this is happening. Otherwise our test system will drive us nuts with errors we have no clue where they come from.
>
> >>> application called MPI_Abort(MPI_COMM_WORLD, 1) - process 1
>
>   So MPI thinks MPI_Abort is called with a return code of 1. PETSc calls MPI_Abort in a truckload of places and usually with a return code of 1. So the first thing that needs to be done is fix PETSc so each different call to MPI_Abort has a unique return code. Then, in theory at least, we know where it got aborted.
>
> include/petscerror.h:#define CHKERRABORT(comm,ierr) do {if (PetscUnlikely(ierr)) {PetscError(PETSC_COMM_SELF,__LINE__,PETSC_FUNCTION_NAME,__FILE__,ierr,PETSC_ERROR_REPEAT," ");MPI_Abort(comm,ierr);}} while (0)
> include/petscerror.h:   or CHKERRABORT(comm,n) to have MPI_Abort() returned immediately.
> src/contrib/fun3d/incomp/flow.c:  /*ierr = MPI_Abort(MPI_COMM_WORLD,1);*/
> src/docs/mpi.www.index:man:+MPI_Abort++MPI_Abort++++man+http://www.mpich.org/static/docs/latest/www3/MPI_Abort.html#MPI_Abort
> src/docs/mpi.www.index:man:+MPI_Abort++MPI_Abort++++man+http://www.mpich.org/static/docs/latest/www3/MPI_Abort.html#MPI_Abort
> src/docs/tao_tex/manual/part1.tex:application called MPI_Abort(MPI_COMM_WORLD, 73) - process 0
> src/docs/tex/manual/developers.tex:  \item \lstinline{PetscMPIAbortErrorHandler()}, which calls \lstinline{MPI_Abort()} after printing the error message; and
> src/snes/examples/tests/ex12f.F:      call MPI_Abort(PETSC_COMM_WORLD,0,ierr)
> src/snes/examples/tutorials/ex30.c:    MPI_Abort(PETSC_COMM_SELF,1);
> src/sys/error/adebug.c:    MPI_Abort(PETSC_COMM_WORLD,1);
> src/sys/error/err.c:   If this is called from the main() routine we call MPI_Abort() instead of
> src/sys/error/err.c:  if (ismain) MPI_Abort(PETSC_COMM_WORLD,(int)ierr);
> src/sys/error/errstop.c:  MPI_Abort(PETSC_COMM_WORLD,n);
> src/sys/error/fp.c:    MPI_Abort(PETSC_COMM_WORLD,0);
> src/sys/error/fp.c:    MPI_Abort(PETSC_COMM_WORLD,0);
> src/sys/error/fp.c:    MPI_Abort(PETSC_COMM_WORLD,0);
> src/sys/error/fp.c:    MPI_Abort(PETSC_COMM_WORLD,0);
> src/sys/error/fp.c:    MPI_Abort(PETSC_COMM_WORLD,0);
> src/sys/error/fp.c:    MPI_Abort(PETSC_COMM_WORLD,0);
> src/sys/error/signal.c:    if (ierr) MPI_Abort(PETSC_COMM_WORLD,0);
> src/sys/error/signal.c:  MPI_Abort(PETSC_COMM_WORLD,(int)ierr);
> src/sys/fsrc/somefort.F:!     when MPI_Abort() is called directly by CHKERRQ(ierr);
> src/sys/fsrc/somefort.F:      call MPI_Abort(comm,ierr,nierr)
> src/sys/ftn-custom/zutils.c:  MPI_Abort(PETSC_COMM_WORLD,1);
> src/sys/ftn-custom/zutils.c:  MPI_Abort(PETSC_COMM_WORLD,1);
> src/sys/logging/utils/stagelog.c:    MPI_Abort(MPI_COMM_WORLD, PETSC_ERR_SUP);
> src/sys/mpiuni/mpi.c:int MPI_Abort(MPI_Comm comm,int errorcode)
> src/sys/mpiuni/mpitime.c:  if (!QueryPerformanceCounter(&StartTime)) MPI_Abort(MPI_COMM_WORLD,1);
> src/sys/mpiuni/mpitime.c:  if (!QueryPerformanceFrequency(&PerfFreq)) MPI_Abort(MPI_COMM_WORLD,1);
> src/sys/mpiuni/mpitime.c:  if (!QueryPerformanceCounter(&CurTime)) MPI_Abort(MPI_COMM_WORLD,1);
> src/sys/objects/init.c:   in the debugger hence we call abort() instead of MPI_Abort().
> src/sys/objects/init.c:void Petsc_MPI_AbortOnError(MPI_Comm *comm,PetscMPIInt *flag,...)
> src/sys/objects/init.c:    if (ierr) MPI_Abort(*comm,*flag); /* hopeless so get out */
> src/sys/objects/init.c:    ierr = MPI_Comm_create_errhandler(Petsc_MPI_AbortOnError,&err_handler);CHKERRQ(ierr);
> src/sys/objects/pinit.c:    MPI_Abort(MPI_COMM_WORLD,1);
> src/sys/objects/pinit.c:    MPI_Abort(MPI_COMM_WORLD,1);
> src/sys/objects/pinit.c:    MPI_Abort(MPI_COMM_WORLD,1);
> src/sys/objects/pinit.c:    MPI_Abort(MPI_COMM_WORLD,1);
> src/ts/examples/tutorials/ex48.c:    if (dim < 2) {MPI_Abort(MPI_COMM_WORLD,1); return;} /* this is needed so that the clang static analyzer does not generate a warning about variables used by not set */
> src/vec/vec/examples/tests/ex32f.F:      call MPI_Abort(MPI_COMM_WORLD,0,ierr)
> src/vec/vec/interface/dlregisvec.c:    MPI_Abort(MPI_COMM_SELF,1);
> src/vec/vec/interface/dlregisvec.c:    MPI_Abort(MPI_COMM_SELF,1);
> src/vec/vec/utils/comb.c:    MPI_Abort(MPI_COMM_SELF,1);
> src/vec/vec/utils/comb.c:    MPI_Abort(MPI_COMM_SELF,1);
>
>   Junchao,
>
>   Maybe you could fix this and make an MR? I don't know how to organize the numbering. Should we have a central list of all numbers, with macros in petscerror.h like
>
> #define PETSC_MPI_ABORT_MPIU_MaxIndex_Local 10
>
>   etc?
>
>   Barry
>
> > Satish
> >
> >> You can retry the particular job by clicking on the little circle after the job name and see what happens the next time.
> >>
> >>   Barry
> >>
> >>   It may be that the -j and -l options for some systems need to be adjusted down slightly, and this will prevent these. Satish, can that be done in the examples/arch-ci* files with configure options, or in the runner files, or in the yaml file?
> >
> > configure has options --with-make-np --with-make-test-np --with-make-load
> >
> > Satish
> >
> >>> On Sep 19, 2019, at 5:00 PM, Zhang, Junchao <jczh...@mcs.anl.gov> wrote:
> >>>
> >>> All failed tests just said "application called MPI_Abort" and had no stack trace. They are not CUDA tests. I updated SF to avoid CUDA-related initialization if not needed. Let's see the new test result.
> >>>
> >>> not ok dm_impls_stag_tests-ex13_none_none_none_3d_par_stag_stencil_width-1
> >>> #   application called MPI_Abort(MPI_COMM_WORLD, 1) - process 1
> >>>
> >>> --Junchao Zhang
> >>>
> >>> On Thu, Sep 19, 2019 at 3:57 PM Smith, Barry F. <bsm...@mcs.anl.gov> wrote:
> >>>
> >>>   Failed? That means nothing; send a link or cut and paste the error.
> >>>
> >>>   It could be that since we have multiple separate tests running at the same time, they overload the GPU or cause some inconsistent behavior that doesn't appear every time the tests are run.
> >>>
> >>>   Barry
> >>>
> >>>   Maybe we need to sequentialize all the tests that use the GPUs. We just trust gnumake for the parallelism; maybe you could somehow add dependencies to get GNU make to achieve this?
> >>>
> >>>> On Sep 19, 2019, at 3:53 PM, Zhang, Junchao <jczh...@mcs.anl.gov> wrote:
> >>>>
> >>>> On Thu, Sep 19, 2019 at 3:24 PM Smith, Barry F. <bsm...@mcs.anl.gov> wrote:
> >>>>
> >>>>> On Sep 19, 2019, at 2:50 PM, Zhang, Junchao <jczh...@mcs.anl.gov> wrote:
> >>>>>
> >>>>> I saw your update.
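
For reference, a minimal sketch of the kind of central numbering Barry asks about just above. Only PETSC_MPI_ABORT_MPIU_MaxIndex_Local is taken from his message; the other macro names, the helper, and the demo harness are invented for illustration and are not existing PETSc API:

    /* Hypothetical central list (e.g. in petscerror.h): one unique code per
       MPI_Abort() call site, so that "application called MPI_Abort(MPI_COMM_WORLD, N)"
       identifies where N came from. Names below are illustrative only. */
    #include <mpi.h>
    #include <stdio.h>

    #define PETSC_MPI_ABORT_MPIU_MaxIndex_Local  10  /* the one example quoted above */
    #define PETSC_MPI_ABORT_VecInitializePackage 11  /* hypothetical */
    #define PETSC_MPI_ABORT_StageLogGetCurrent   12  /* hypothetical */

    /* A call site would then pass its own code instead of the generic 1. */
    static void demo_abort_site(MPI_Comm comm, int failed)
    {
      if (failed) {
        (void)fprintf(stderr, "Aborting with code %d\n", PETSC_MPI_ABORT_MPIU_MaxIndex_Local);
        MPI_Abort(comm, PETSC_MPI_ABORT_MPIU_MaxIndex_Local);
      }
    }

    int main(int argc, char **argv)
    {
      MPI_Init(&argc, &argv);
      demo_abort_site(MPI_COMM_WORLD, 0); /* pass 1 instead of 0 to trigger the abort */
      MPI_Finalize();
      return 0;
    }

The point is only that each call site aborts with its own integer, so the code reported in the CI log maps back to a single place in the source.
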
> >>>>> In PetscCUDAInitialize we have
> >>>>>
> >>>>>     /* First get the device count */
> >>>>>     err = cudaGetDeviceCount(&devCount);
> >>>>>
> >>>>>     /* next determine the rank and then set the device via a mod */
> >>>>>     ierr   = MPI_Comm_rank(comm,&rank);CHKERRQ(ierr);
> >>>>>     device = rank % devCount;
> >>>>>   }
> >>>>>   err = cudaSetDevice(device);
> >>>>>
> >>>>> If we rely on the first CUDA call to do initialization, how could CUDA know about this MPI stuff?
> >>>>
> >>>> It doesn't, so it does whatever it does (which may be dumb).
> >>>>
> >>>> Are you proposing something?
> >>>>
> >>>> No. My test failed in CI with -cuda_initialize 0 on frog but I could not reproduce it. I'm doing investigation.
> >>>>
> >>>>   Barry
> >>>>
> >>>>> --Junchao Zhang
> >>>>>
> >>>>> On Wed, Sep 18, 2019 at 11:42 PM Smith, Barry F. <bsm...@mcs.anl.gov> wrote:
> >>>>>
> >>>>>   Fixed the docs. Thanks for pointing out the lack of clarity.
> >>>>>
> >>>>>> On Sep 18, 2019, at 11:25 PM, Zhang, Junchao via petsc-dev <petsc-dev@mcs.anl.gov> wrote:
> >>>>>>
> >>>>>> Barry,
> >>>>>>
> >>>>>> I saw you added these in init.c
> >>>>>>
> >>>>>> +      -cuda_initialize - do the initialization in PetscInitialize()
> >>>>>>
> >>>>>>    Notes:
> >>>>>>      Initializing cuBLAS takes about 1/2 second, therefore it is done by default in PetscInitialize() before logging begins
> >>>>>>
> >>>>>> But I did not get it: otherwise, with -cuda_initialize 0, when will CUDA be initialized?
> >>>>>>
> >>>>>> --Junchao Zhang
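
For reference, a self-contained sketch of the rank-based device selection quoted from PetscCUDAInitialize above. It mirrors the quoted snippet but is not the actual PETSc routine; it assumes an MPI library and the CUDA runtime are available:

    /* Each MPI rank picks a GPU via rank % deviceCount, as in the quoted code. */
    #include <mpi.h>
    #include <cuda_runtime.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
      int rank, devCount = 0, device;

      MPI_Init(&argc, &argv);
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);

      /* First get the device count */
      if (cudaGetDeviceCount(&devCount) != cudaSuccess || devCount == 0) {
        fprintf(stderr, "rank %d: no usable CUDA device\n", rank);
        MPI_Abort(MPI_COMM_WORLD, 1);
      }

      /* Next set the device via a mod of the rank */
      device = rank % devCount;
      if (cudaSetDevice(device) != cudaSuccess) MPI_Abort(MPI_COMM_WORLD, 1);

      printf("rank %d bound to CUDA device %d of %d\n", rank, device, devCount);
      MPI_Finalize();
      return 0;
    }

As Barry notes in the exchange above, if initialization is instead left to the first CUDA call, that call knows nothing about the MPI rank, so no such rank-to-device assignment happens.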