I have added this in the branch barry/add-mpiu_allreduce and made a pull request.
Thanks!

> On Oct 21, 2015, at 11:18 AM, Eric Chamberland
> <eric.chamberl...@giref.ulaval.ca> wrote:
>
> On 21/10/15 11:36 AM, Barry Smith wrote:
>>
>>> On Oct 21, 2015, at 10:28 AM, Eric Chamberland
>>> <eric.chamberl...@giref.ulaval.ca> wrote:
>>>
>>> On 21/10/15 10:56 AM, Barry Smith wrote:
>>>>
>>>>    I similarly saw the same behavior in the debugger that you reported
>>>> and was mystified. This is why I called it a nasty bug. The setting of a
>>>> different nonew flag meant that one process called the MPI_Allreduce()
>>>> at the end of MatAssemblyEnd_MPIAIJ()
>>>>
>>>>   if ((!mat->was_assembled && mode == MAT_FINAL_ASSEMBLY) || !((Mat_SeqAIJ*)(aij->A->data))->nonew) {
>>>>     PetscObjectState state = aij->A->nonzerostate + aij->B->nonzerostate;
>>>>     ierr = MPI_Allreduce(&state,&mat->nonzerostate,1,MPIU_INT64,MPI_SUM,PetscObjectComm((PetscObject)mat));CHKERRQ(ierr);
>>>>   }
>>>>
>>>> but the other process skipped this call. Thus the other process got to
>>>> the MPI_Allreduce() in PetscValidLogicalCollectiveEnum(), and data was
>>>> exchanged between two MPI_Allreduce() calls that serve very different
>>>> purposes. Since MPI_Allreduce() has no concept of tags, if one process
>>>> skips an Allreduce call, then the next Allreduce call gets matched
>>>> with it.
>>>
>>> ok! I understand now...
>>>
>>> Then, would wrapping these calls (in debug mode) with an MPI_Barrier
>>> before/after have prevented misleading the person doing the debugging?
>>> In our code, we do verify (in debug mode) that all processes make the
>>> same collective MPI calls, using a kind of barrier that communicates the
>>> line number and the file name (transformed into an int value), so that a
>>> process is blocked at the first wrongly matched MPI call...
>>
>>    This is a good idea. Do you have a C macro implementation you would
>> be willing to share? It would be trivial to add to PETSc if you had
>> something like
>>
>> #if debug
>> #define MPIU_Allreduce(.....)   macro that first checks function and line number
>> #else
>> #define MPIU_Allreduce MPI_Allreduce
>> #endif
>
> Yes/no: it is not really coded like this for us, since we use our own MPI
> classes, but you mostly already have the required function, which is
> PetscValidLogicalCollectiveInt... The "ints" you check are:
>
> [0] => __LINE__
> [1] => __FILE__ transformed into an int (we just sum the chars...)
> [2] => a static counter that is incremented (to catch that the very same
>        line+file is called the very same number of times -- an error that
>        would not be caught by comparing only [0] and [1]).
>
> So it could be something like this:
>
> #if debug
> #define MPIU_Allreduce(.....) \
>   PetscVerifyExecutionPoint(__LINE__,__FILE__,comm); \
>   MPI_Allreduce(.....)
> #else
> #define MPIU_Allreduce MPI_Allreduce
> #endif
>
> with:
>
> void PetscVerifyExecutionPoint(int line, const char *file, MPI_Comm comm)
> {
>   static int lCount = 0;
>   ++lCount;
>
>   /* better to communicate all 3 at the same time... */
>   SomewhatPetscValidLogicalCollectiveInt(file, comm);   /* file name summed into an int */
>   SomewhatPetscValidLogicalCollectiveInt(line, comm);
>   SomewhatPetscValidLogicalCollectiveInt(lCount, comm);
> }
>
> And all of this is not my idea, it is from Patrick... ;)
>
>>
>>>
>>>>
>>>>    The way I finally found the bug was to put a break point in
>>>> MPI_Allreduce() and run the program until the two processes were
>>>> calling Allreduce() from different places.
>>>
>>> okay... wow... you modified the MPI code?
>>
>>    No, I just ran both processes in the debugger with the break point,
>> doing a cont each time until they did not match.
>
> ok! I forgot we can do this... ;)
>
> Eric
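
A toy illustration of the matching-by-order behavior Barry describes. The program below is not PETSc code; reordering the calls on rank 1 merely stands in for the process that skipped the nonzerostate reduction. Since collectives on a communicator match purely by call order, both reductions complete without any MPI error, but each result mixes contributions intended for the other:

/* mismatch_demo.c -- toy illustration (not PETSc code) of how tag-less
 * MPI_Allreduce() calls match purely by order on a communicator.
 * Run: mpiexec -n 2 ./mismatch_demo
 */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
  int  rank;
  long state = 0, flag = 0;         /* reduction results */
  long myState = 100, myFlag = 1;   /* per-rank contributions */

  MPI_Init(&argc, &argv);
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  if (rank == 0) {
    MPI_Allreduce(&myState, &state, 1, MPI_LONG, MPI_SUM, MPI_COMM_WORLD);
    MPI_Allreduce(&myFlag,  &flag,  1, MPI_LONG, MPI_SUM, MPI_COMM_WORLD);
  } else {
    /* Same two calls, opposite order: with no tags, this rank's "flag"
     * reduction pairs with rank 0's "state" reduction, and vice versa. */
    MPI_Allreduce(&myFlag,  &flag,  1, MPI_LONG, MPI_SUM, MPI_COMM_WORLD);
    MPI_Allreduce(&myState, &state, 1, MPI_LONG, MPI_SUM, MPI_COMM_WORLD);
  }
  /* The intended results are state=200 and flag=2; instead both ranks
   * print state=101 flag=101, with no error reported by MPI. */
  printf("[%d] state=%ld flag=%ld\n", rank, state, flag);
  MPI_Finalize();
  return 0;
}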
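
And a minimal self-contained sketch of the debug-mode check discussed above. The names FileToInt, VerifyExecutionPoint, and CHECKED_ALLREDUCE are hypothetical stand-ins, not the actual implementation in barry/add-mpiu_allreduce: every wrapped reduction first agrees collectively on the file, line, and call count, and aborts at the first mismatch:

/* collective_check.c -- sketch of the debug-mode verification idea from
 * this thread.  All names (FileToInt, VerifyExecutionPoint,
 * CHECKED_ALLREDUCE) are hypothetical, not the actual PETSc ones.
 * Build: mpicc -DDEBUG collective_check.c -o collective_check
 * Run:   mpiexec -n 2 ./collective_check
 */
#include <mpi.h>
#include <stdio.h>

/* Turn the file name into an int by summing its chars, as described above. */
static int FileToInt(const char *file)
{
  int sum = 0;
  while (*file) sum += (unsigned char)*file++;
  return sum;
}

/* Collectively check that every rank reached this point from the same
 * file and line, the same number of times; abort on the first mismatch. */
static void VerifyExecutionPoint(int line, const char *file, MPI_Comm comm)
{
  static int callCount = 0;
  int        mine[3], lo[3], hi[3];

  mine[0] = FileToInt(file);
  mine[1] = line;
  mine[2] = ++callCount;
  MPI_Allreduce(mine, lo, 3, MPI_INT, MPI_MIN, comm);
  MPI_Allreduce(mine, hi, 3, MPI_INT, MPI_MAX, comm);
  if (lo[0] != hi[0] || lo[1] != hi[1] || lo[2] != hi[2]) {
    fprintf(stderr, "collective mismatch detected at %s:%d (call %d)\n",
            file, line, callCount);
    MPI_Abort(comm, 1);
  }
}

#if defined(DEBUG)
#define CHECKED_ALLREDUCE(sb, rb, n, t, op, comm) \
  (VerifyExecutionPoint(__LINE__, __FILE__, (comm)), \
   MPI_Allreduce((sb), (rb), (n), (t), (op), (comm)))
#else
#define CHECKED_ALLREDUCE MPI_Allreduce
#endif

int main(int argc, char **argv)
{
  int rank, in, sum;

  MPI_Init(&argc, &argv);
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  in = rank + 1;
  CHECKED_ALLREDUCE(&in, &sum, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD);
  if (!rank) printf("sum = %d\n", sum);
  MPI_Finalize();
  return 0;
}

If a rank skips a wrapped call entirely, the check's own reductions are the first to pair up wrongly, so the run typically fails at (or hangs near) the offending call instead of silently exchanging unrelated data farther downstream.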