> On Sep 19, 2019, at 9:11 PM, Balay, Satish <ba...@mcs.anl.gov> wrote:
> 
> On Fri, 20 Sep 2019, Smith, Barry F. via petsc-dev wrote:
> 
>> 
>>   This should be reported on gitlab, not in email.
>> 
>>   Anyway, my interpretation is that the machine runs low on swap space, so 
>> the OS is killing things. Once, Satish and I sat down and checked the system 
>> logs on one machine that had little swap, and we saw system messages about 
>> low swap at exactly the time the tests were killed. Satish is resistant to 
>> increasing swap; I don't know why. Other times we see these kills and they may 
>> not be due to swap, but then they are a mystery.
> 
> That was on the BSD machine.
> 
> This machine has 8 GB of swap, which should be sufficient. And this issue [on this 
> machine] was triggered only by this MR - which was weird.

   Does it happen every time with the same examples?

   If you log in and run that one test, does it happen?

   If the MR changes the scatter code, could it have broken something?

   We need to know why this is happening. Otherwise our test system will drive 
us nuts with errors whose origins we have no clue about.

  
>>> application called MPI_Abort(MPI_COMM_WORLD, 1) - process 1

  So MPI thinks MPI_Abort() was called with a return code of 1. PETSc calls 
MPI_Abort() in a truckload of places, usually with a return code of 1. So the 
first thing that needs to be done is to fix PETSc so that each call to 
MPI_Abort() has a unique return code. Then, in theory at least, we know where 
it aborted. Here is a grep of MPI_Abort in the source tree:

include/petscerror.h:#define CHKERRABORT(comm,ierr) do {if (PetscUnlikely(ierr)) {PetscError(PETSC_COMM_SELF,__LINE__,PETSC_FUNCTION_NAME,__FILE__,ierr,PETSC_ERROR_REPEAT," ");MPI_Abort(comm,ierr);}} while (0)
include/petscerror.h:    or CHKERRABORT(comm,n) to have MPI_Abort() returned immediately.
src/contrib/fun3d/incomp/flow.c:    /*ierr = MPI_Abort(MPI_COMM_WORLD,1);*/
src/docs/mpi.www.index:man:+MPI_Abort++MPI_Abort++++man+http://www.mpich.org/static/docs/latest/www3/MPI_Abort.html#MPI_Abort
src/docs/mpi.www.index:man:+MPI_Abort++MPI_Abort++++man+http://www.mpich.org/static/docs/latest/www3/MPI_Abort.html#MPI_Abort
src/docs/tao_tex/manual/part1.tex:application called MPI_Abort(MPI_COMM_WORLD, 73) - process 0
src/docs/tex/manual/developers.tex:  \item \lstinline{PetscMPIAbortErrorHandler()}, which calls \lstinline{MPI_Abort()} after printing the error message; and
src/snes/examples/tests/ex12f.F:        call MPI_Abort(PETSC_COMM_WORLD,0,ierr)
src/snes/examples/tutorials/ex30.c:  MPI_Abort(PETSC_COMM_SELF,1);
src/sys/error/adebug.c:  MPI_Abort(PETSC_COMM_WORLD,1);
src/sys/error/err.c:      If this is called from the main() routine we call MPI_Abort() instead of
src/sys/error/err.c:  if (ismain) MPI_Abort(PETSC_COMM_WORLD,(int)ierr);
src/sys/error/errstop.c:  MPI_Abort(PETSC_COMM_WORLD,n);
src/sys/error/fp.c:  MPI_Abort(PETSC_COMM_WORLD,0);
src/sys/error/fp.c:  MPI_Abort(PETSC_COMM_WORLD,0);
src/sys/error/fp.c:  MPI_Abort(PETSC_COMM_WORLD,0);
src/sys/error/fp.c:  MPI_Abort(PETSC_COMM_WORLD,0);
src/sys/error/fp.c:  MPI_Abort(PETSC_COMM_WORLD,0);
src/sys/error/fp.c:  MPI_Abort(PETSC_COMM_WORLD,0);
src/sys/error/signal.c:  if (ierr) MPI_Abort(PETSC_COMM_WORLD,0);
src/sys/error/signal.c:  MPI_Abort(PETSC_COMM_WORLD,(int)ierr);
src/sys/fsrc/somefort.F:!     when MPI_Abort() is called directly by CHKERRQ(ierr);
src/sys/fsrc/somefort.F:      call MPI_Abort(comm,ierr,nierr)
src/sys/ftn-custom/zutils.c:    MPI_Abort(PETSC_COMM_WORLD,1);
src/sys/ftn-custom/zutils.c:      MPI_Abort(PETSC_COMM_WORLD,1);
src/sys/logging/utils/stagelog.c:    MPI_Abort(MPI_COMM_WORLD, PETSC_ERR_SUP);
src/sys/mpiuni/mpi.c:int MPI_Abort(MPI_Comm comm,int errorcode)
src/sys/mpiuni/mpitime.c:    if (!QueryPerformanceCounter(&StartTime)) MPI_Abort(MPI_COMM_WORLD,1);
src/sys/mpiuni/mpitime.c:    if (!QueryPerformanceFrequency(&PerfFreq)) MPI_Abort(MPI_COMM_WORLD,1);
src/sys/mpiuni/mpitime.c:  if (!QueryPerformanceCounter(&CurTime)) MPI_Abort(MPI_COMM_WORLD,1);
src/sys/objects/init.c:  in the debugger hence we call abort() instead of MPI_Abort().
src/sys/objects/init.c:void Petsc_MPI_AbortOnError(MPI_Comm *comm,PetscMPIInt *flag,...)
src/sys/objects/init.c:  if (ierr) MPI_Abort(*comm,*flag); /* hopeless so get out */
src/sys/objects/init.c:      ierr = MPI_Comm_create_errhandler(Petsc_MPI_AbortOnError,&err_handler);CHKERRQ(ierr);
src/sys/objects/pinit.c:    MPI_Abort(MPI_COMM_WORLD,1);
src/sys/objects/pinit.c:    MPI_Abort(MPI_COMM_WORLD,1);
src/sys/objects/pinit.c:    MPI_Abort(MPI_COMM_WORLD,1);
src/sys/objects/pinit.c:    MPI_Abort(MPI_COMM_WORLD,1);
src/ts/examples/tutorials/ex48.c:  if (dim < 2) {MPI_Abort(MPI_COMM_WORLD,1); return;} /* this is needed so that the clang static analyzer does not generate a warning about variables used by not set */
src/vec/vec/examples/tests/ex32f.F:        call MPI_Abort(MPI_COMM_WORLD,0,ierr)
src/vec/vec/interface/dlregisvec.c:    MPI_Abort(MPI_COMM_SELF,1);
src/vec/vec/interface/dlregisvec.c:    MPI_Abort(MPI_COMM_SELF,1);
src/vec/vec/utils/comb.c:    MPI_Abort(MPI_COMM_SELF,1);
src/vec/vec/utils/comb.c:      MPI_Abort(MPI_COMM_SELF,1);

  Junchao,

     Maybe you could fix this and make an MR? I don't know how to organize the 
numbering. Should we have a central list of all the numbers, with macros in 
petscerror.h like 

#define PETSC_MPI_ABORT_MPIU_MaxIndex_Local 10 

etc.?
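
A minimal sketch of what that could look like, assuming we simply enumerate the 
call sites by hand (the macro names and values below are placeholders, not a 
concrete proposal):

    /* hypothetical additions to include/petscerror.h: one unique
       MPI_Abort() error code per call site; names/values are illustrative */
    #define PETSC_MPI_ABORT_MPIU_MaxIndex_Local 10
    #define PETSC_MPI_ABORT_PetscAttachDebugger 11

    /* a call site such as the one in src/sys/error/adebug.c would then change from
         MPI_Abort(PETSC_COMM_WORLD,1);
       to */
    MPI_Abort(PETSC_COMM_WORLD,PETSC_MPI_ABORT_PetscAttachDebugger);

Then the number MPI prints in "application called MPI_Abort(MPI_COMM_WORLD, N)" 
would point directly at the aborting call site.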

   Barry

> 
> Satish
> 
> 
>> 
>>   You can rerun the particular job by clicking on the little circle after 
>> the job name and see what happens the next time.
>> 
>>   Barry
>> 
>>   It may be that the -j and -l options for some systems need to be adjusted down 
>> slightly, and this would prevent these kills. Satish, can that be done in the 
>> examples/arch-ci* files with configure options, or in the runner files, or 
>> in the yaml file?
> 
> configure has the options --with-make-np, --with-make-test-np, and --with-make-load
> 
> Satish
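
   For reference, those could be passed on the configure command line roughly 
like this (the values here are only illustrative, not a recommendation):

    ./configure --with-make-np=4 --with-make-test-np=2 --with-make-load=6.0 ...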
> 
>> 
>> 
>> 
>>> On Sep 19, 2019, at 5:00 PM, Zhang, Junchao <jczh...@mcs.anl.gov> wrote:
>>> 
>>> All failed tests just said "application called MPI_Abort" and had no stack 
>>> trace. They are not CUDA tests. I updated SF to avoid CUDA-related 
>>> initialization if it is not needed. Let's see the new test result.
>>> not ok dm_impls_stag_tests-ex13_none_none_none_3d_par_stag_stencil_width-1
>>> #   application called MPI_Abort(MPI_COMM_WORLD, 1) - process 1
>>> 
>>> 
>>> --Junchao Zhang
>>> 
>>> 
>>> On Thu, Sep 19, 2019 at 3:57 PM Smith, Barry F. <bsm...@mcs.anl.gov> wrote:
>>> 
>>> Failed? That means nothing; send a link or cut and paste the error.
>>> 
>>> It could be that since we have multiple separate tests running at the same 
>>> time they overload the GPU or cause some inconsistent behavior that doesn't 
>>> appear every time the tests are run.
>>> 
>>>   Barry
>>> 
>>> Maybe we need to serialize all the tests that use the GPUs. We just 
>>> trust gnumake for the parallelism; maybe you could somehow add dependencies 
>>> to get gnu make to achieve this?
>>> 
>>> 
>>> 
>>> 
>>>> On Sep 19, 2019, at 3:53 PM, Zhang, Junchao <jczh...@mcs.anl.gov> wrote:
>>>> 
>>>> On Thu, Sep 19, 2019 at 3:24 PM Smith, Barry F. <bsm...@mcs.anl.gov> wrote:
>>>> 
>>>> 
>>>>> On Sep 19, 2019, at 2:50 PM, Zhang, Junchao <jczh...@mcs.anl.gov> wrote:
>>>>> 
>>>>> I saw your update. In PetscCUDAInitialize we have
>>>>> 
>>>>> 
>>>>>      /* First get the device count */
>>>>>      err   = cudaGetDeviceCount(&devCount);
>>>>> 
>>>>>      /* next determine the rank and then set the device via a mod */
>>>>>      ierr   = MPI_Comm_rank(comm,&rank);CHKERRQ(ierr);
>>>>>      device = rank % devCount;
>>>>>    }
>>>>>    err = cudaSetDevice(device);
>>>>> 
>>>>> If we rely on the first CUDA call to do the initialization, how could CUDA 
>>>>> know about this MPI stuff?
>>>> 
>>>>  It doesn't, so it does whatever it does (which may be dumb).
>>>> 
>>>>  Are you proposing something?
>>>> 
>>>> No. My test failed in CI with -cuda_initialize 0 on frog, but I could not 
>>>> reproduce it. I'm investigating.
>>>> 
>>>>  Barry
>>>> 
>>>>> 
>>>>> --Junchao Zhang
>>>>> 
>>>>> 
>>>>> 
>>>>> On Wed, Sep 18, 2019 at 11:42 PM Smith, Barry F. <bsm...@mcs.anl.gov> 
>>>>> wrote:
>>>>> 
>>>>>  Fixed the docs. Thanks for pointing out the lack of clarity
>>>>> 
>>>>> 
>>>>>> On Sep 18, 2019, at 11:25 PM, Zhang, Junchao via petsc-dev 
>>>>>> <petsc-dev@mcs.anl.gov> wrote:
>>>>>> 
>>>>>> Barry,
>>>>>> 
>>>>>> I saw you added these in init.c
>>>>>> 
>>>>>> +  -cuda_initialize - do the initialization in PetscInitialize()
>>>>>> 
>>>>>> Notes:
>>>>>>   Initializing cuBLAS takes about 1/2 second, therefore it is done by default 
>>>>>> in PetscInitialize() before logging begins
>>>>>> 
>>>>>> But I did not get it: otherwise, with -cuda_initialize 0, when will CUDA be 
>>>>>> initialized?
>>>>>> --Junchao Zhang
>>>>> 
>>> 
>> 
> 
