I have added this in the branch barry/add-mpiu_allreduce and made a pull request.

  Thanks!

> On Oct 21, 2015, at 11:18 AM, Eric Chamberland 
> <eric.chamberl...@giref.ulaval.ca> wrote:
> 
> On 21/10/15 11:36 AM, Barry Smith wrote:
>> 
>>> On Oct 21, 2015, at 10:28 AM, Eric Chamberland 
>>> <eric.chamberl...@giref.ulaval.ca> wrote:
>>> 
>>> On 21/10/15 10:56 AM, Barry Smith wrote:
>>>> 
>>>>   I saw the same behavior in the debugger that you reported and was mystified.
>>>>   This is why I called it a nasty bug. The setting of different nonew flags 
>>>> meant that one process called the Allreduce() at the end of MatAssemblyEnd_MPIAIJ()
>>>> 
>>>>   if ((!mat->was_assembled && mode == MAT_FINAL_ASSEMBLY) || !((Mat_SeqAIJ*)(aij->A->data))->nonew) {
>>>>     PetscObjectState state = aij->A->nonzerostate + aij->B->nonzerostate;
>>>>     ierr = MPI_Allreduce(&state,&mat->nonzerostate,1,MPIU_INT64,MPI_SUM,PetscObjectComm((PetscObject)mat));CHKERRQ(ierr);
>>>>   }
>>>> 
>>>> but the other process skipped this call. The other process thus got to the 
>>>> MPI_Allreduce() in PetscValidLogicalCollectiveEnum(), and data was exchanged 
>>>> between two MPI_Allreduce() calls that serve very different purposes. Since 
>>>> MPI_Allreduce() doesn't have any concept of tags, if one process skips an 
>>>> Allreduce call then the next Allreduce call gets matched with it.
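
To see why this is so confusing, consider a tiny standalone program where one rank skips an Allreduce (a minimal sketch with made-up variable names, not PETSc code): with no tags, the Allreduce calls on a communicator simply pair up in order, so unrelated reductions exchange data with each other.

/* allreduce_mismatch.c -- illustrative only; build and run with e.g.
 *   mpicc allreduce_mismatch.c && mpiexec -n 2 ./a.out                     */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
  int       rank;
  long long state, nzstate = 0;   /* stand-in for the nonzerostate reduction   */
  long long flag, flagsum = 0;    /* stand-in for a later, unrelated reduction */

  MPI_Init(&argc, &argv);
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);

  /* Only rank 0 takes this branch (think: the ranks ended up with different
     nonew flags), so only rank 0 posts this Allreduce.                      */
  if (rank == 0) {
    state = 10;
    MPI_Allreduce(&state, &nzstate, 1, MPI_LONG_LONG, MPI_SUM, MPI_COMM_WORLD);
  }

  /* Every rank posts this one; on rank 1 it gets matched with rank 0's call
     above, so the two reductions mix their data.                            */
  flag = 100 + rank;
  MPI_Allreduce(&flag, &flagsum, 1, MPI_LONG_LONG, MPI_SUM, MPI_COMM_WORLD);

  /* Some later collective in the program keeps rank 0 from hanging, but it
     too gets the wrong partner.                                             */
  if (rank == 1) {
    state = 20;
    MPI_Allreduce(&state, &nzstate, 1, MPI_LONG_LONG, MPI_SUM, MPI_COMM_WORLD);
  }

  printf("[%d] nzstate=%lld flagsum=%lld\n", rank, nzstate, flagsum);
  MPI_Finalize();
  return 0;
}

Neither rank sees an error; both just come away with values computed from the wrong partner, which is why the symptom looks so mystifying in the debugger.
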
>>> 
>>> ok!  I understand now...
>>> 
>>> then, would wrapping these calls (in debug mode) with an MPI_Barrier 
>>> before/after have kept the person debugging from being misled?  In our 
>>> code, we do verify that all processes make the same collective MPI calls 
>>> with a kind of barrier (in debug mode) which communicates the line number 
>>> and the file name (transformed into an int value), so that a process is 
>>> blocked at the first wrongly matched MPI call...
>> 
>>     This is a good idea. Do you have a C macro implementation you would be 
>> willing to share? It would be trivial to add to PETSc if you had something like
>> 
>> #if debug
>> #define MPIU_Allreduce(.....) macro that first checks the function and line number
>> #else
>> #define MPIU_Allreduce MPI_Allreduce
>> #endif
>> 
> 
> Yes and no: it is not coded exactly like this for us, since we use our own MPI 
> classes, but you mostly already have the required function, which is 
> PetscValidLogicalCollectiveInt... except that the "ints" you check are:
> 
> [0] => __LINE__
> [1] => __FILE__ transformed into an int (we just sum the chars...)
> [2] => a static counter that is incremented (to check that the very same 
> line+file is reached the very same number of times -- a mismatch there is an 
> error that will not be caught if you compare only [0] and [1]).
> 
> So it could be something like this:
> 
> #if debug
> #define MPIU_Allreduce(.....) \
>   PetscVerifyExecutionPoint(__LINE__,__FILE__,comm); \
>   MPI_Allreduce(.....)
> #else
> #define MPIU_Allreduce MPI_Allreduce
> #endif
> 
> with:
> 
> void PetscVerifyExecutionPoint(int line, const char *file, MPI_Comm comm)
> {
>   static int lCount = 0;
>   int lFileAsInt = 0;
>   const char *p;
> 
>   ++lCount;
>   for (p = file; *p; ++p) lFileAsInt += *p;  /* __FILE__ summed into an int */
> 
>   /* better to communicate all 3 at the same time... */
>   SomewhatPetscValidLogicalCollectiveInt(lFileAsInt,comm);
>   SomewhatPetscValidLogicalCollectiveInt(line,comm);
>   SomewhatPetscValidLogicalCollectiveInt(lCount,comm);
> }
> 
> and all of this is not my idea, it is from Patrick... ;)
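
A compilable sketch of that scheme might look like the following. All names here (VerifyExecutionPoint, CHECKED_Allreduce, the DEBUG guard) are made up for illustration; this is neither the GIREF code nor what PETSc will ship, and comparing a MIN and a MAX reduction is just one possible way to check that the three values agree across ranks.

#include <mpi.h>
#include <stdio.h>

static void VerifyExecutionPoint(int line, const char *file, MPI_Comm comm)
{
  static int counter = 0;              /* how many checked collectives so far */
  int        vals[3], vmin[3], vmax[3], hash = 0, i;

  for (i = 0; file[i]; i++) hash += (unsigned char)file[i]; /* file name summed into an int */
  vals[0] = line;
  vals[1] = hash;
  vals[2] = ++counter;

  /* communicate all three values at the same time */
  MPI_Allreduce(vals, vmin, 3, MPI_INT, MPI_MIN, comm);
  MPI_Allreduce(vals, vmax, 3, MPI_INT, MPI_MAX, comm);
  for (i = 0; i < 3; i++) {
    if (vmin[i] != vmax[i]) {
      fprintf(stderr, "collective mismatch detected at %s:%d (checked call #%d)\n", file, line, counter);
      MPI_Abort(comm, 1);
    }
  }
}

#if defined(DEBUG)
#define CHECKED_Allreduce(sbuf, rbuf, n, dtype, op, comm) \
  (VerifyExecutionPoint(__LINE__, __FILE__, (comm)), \
   MPI_Allreduce((sbuf), (rbuf), (n), (dtype), (op), (comm)))
#else
#define CHECKED_Allreduce MPI_Allreduce
#endif

Because the comma expression evaluates to the MPI_Allreduce return code, error checking on the wrapped call still works; and since the check runs right before each wrapped collective, the rank whose control flow diverged aborts at the first mismatched call rather than at some arbitrary later one. The two extra reductions per call are the reason it only makes sense behind the debug guard.
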
> 
>> 
>>> 
>>> 
>>>> 
>>>>   The way I finally found the bug was to set a breakpoint in 
>>>> MPI_Allreduce() and run the program until the two processes were calling 
>>>> Allreduce() from different places.
>>> 
>>> okay... wow... you modified the MPI code?
>> 
>>    No, I just ran both processes in the debugger with the breakpoint set, doing 
>> a cont each time until the two call sites did not match.
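
That workflow looks roughly like this with gdb, one debugger per rank (the launch line is one common way to get a gdb per process; PETSc's -start_in_debugger option can open the debuggers for you, and the program name and options below are illustrative):

mpiexec -n 2 xterm -e gdb ./myapp      # one gdb window per rank
(gdb) break MPI_Allreduce              # in each window
(gdb) run
(gdb) backtrace                        # compare where each rank is calling from
(gdb) continue                         # repeat until the backtraces no longer match
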
> 
> ok!  I forgot we can do this... ;)
> 
> Eric
> 
