Permission to ask potentially dumb questions? In this code (src/vec/is/sf/impls/window/sfwindow.c in git master as of 7fb8eca36577947f0d4ded035522334534bc7e38):
static PetscErrorCode PetscSFFetchAndOpBegin_Window(PetscSF sf,MPI_Datatype unit,void *rootdata,const void *leafdata,void *leafupdate,MPI_Op op)
{
  PetscErrorCode     ierr;
  PetscInt           i,nranks;
  const PetscMPIInt  *ranks;
  const MPI_Datatype *mine,*remote;
  MPI_Win            win;

  PetscFunctionBegin;
  ierr = PetscSFGetRanks(sf,&nranks,&ranks,NULL,NULL,NULL);CHKERRQ(ierr);
  ierr = PetscSFWindowGetDataTypes(sf,unit,&mine,&remote);CHKERRQ(ierr);
  ierr = PetscSFWindowOpTranslate(&op);CHKERRQ(ierr);
  ierr = PetscSFGetWindow(sf,unit,rootdata,PETSC_FALSE,0,0,0,&win);CHKERRQ(ierr);
  for (i=0; i<sf->nranks; i++) {
    ierr = MPI_Win_lock(MPI_LOCK_EXCLUSIVE,sf->ranks[i],0,win);CHKERRQ(ierr);
    ierr = MPI_Get(leafupdate,1,mine[i],ranks[i],0,1,remote[i],win);CHKERRQ(ierr);
    ierr = MPI_Accumulate((void*)leafdata,1,mine[i],ranks[i],0,1,remote[i],op,win);CHKERRQ(ierr);
    ierr = MPI_Win_unlock(ranks[i],win);CHKERRQ(ierr);
  }
  PetscFunctionReturn(0);
}

I see the following issues:

1) Implementing an atomic read-modify-write using Lock->Get->Accumulate->Unlock is W.R.O.N.G. Get and Accumulate are NOT guaranteed to be ordered within any epoch. Ever. You appear to be leveraging the ordered-channel semantics of MPICH's ch3. You need to either use MPI-3 RMA's MPI_Get_accumulate (or MPI_Fetch_and_op), or use a mutex a la MPICH's ROMIO and ARMCI-MPI's MPI-2 port. See http://dx.doi.org/10.1109/CCGRID.2005.1558687 for algorithm details.

To be explicit, your code is equivalent to:

for (i=0; i<sf->nranks; i++) {
  ierr = MPI_Win_lock(MPI_LOCK_EXCLUSIVE,sf->ranks[i],0,win);CHKERRQ(ierr);
  int order = rand()%2;
  if (order) {
    ierr = MPI_Get(leafupdate,1,mine[i],ranks[i],0,1,remote[i],win);CHKERRQ(ierr);
    ierr = MPI_Accumulate((void*)leafdata,1,mine[i],ranks[i],0,1,remote[i],op,win);CHKERRQ(ierr);
  } else {
    ierr = MPI_Accumulate((void*)leafdata,1,mine[i],ranks[i],0,1,remote[i],op,win);CHKERRQ(ierr);
    ierr = MPI_Get(leafupdate,1,mine[i],ranks[i],0,1,remote[i],win);CHKERRQ(ierr);
  }
  ierr = MPI_Win_unlock(ranks[i],win);CHKERRQ(ierr);
}

2) 'sf->ranks[i]' is used in MPI_Win_lock, whereas the rest of the RMA calls use ranks[i]. I assume this is innocuous, but unless you have a mutex on 'sf', it is theoretically possible for sf->ranks[i] to be changed by another thread in a way that leads to an undefined or incorrect program. If you prohibit this behavior as part of the internal design contract, then it should be noted.

Best,

Jeff

-- 
Jeff Hammond
jeff.scie...@gmail.com
http://jeffhammond.github.io/
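P.S. To make the MPI-3 suggestion concrete, below is a minimal, untested sketch of how the body of this function could be written with MPI_Get_accumulate so that the fetch and the update happen as one atomic (per basic element) operation. It reuses the declarations from the quoted function and assumes the derived datatypes returned by PetscSFWindowGetDataTypes satisfy MPI_Get_accumulate's restriction that all basic components be of the same predefined datatype; I am not asserting this is how PETSc should structure the final fix.

/* Sketch only: fetch-and-op via MPI-3 MPI_Get_accumulate.
   Uses the local nranks/ranks from PetscSFGetRanks consistently (see point 2). */
PetscFunctionBegin;
ierr = PetscSFGetRanks(sf,&nranks,&ranks,NULL,NULL,NULL);CHKERRQ(ierr);
ierr = PetscSFWindowGetDataTypes(sf,unit,&mine,&remote);CHKERRQ(ierr);
ierr = PetscSFWindowOpTranslate(&op);CHKERRQ(ierr);
ierr = PetscSFGetWindow(sf,unit,rootdata,PETSC_FALSE,0,0,0,&win);CHKERRQ(ierr);
for (i=0; i<nranks; i++) {
  ierr = MPI_Win_lock(MPI_LOCK_EXCLUSIVE,ranks[i],0,win);CHKERRQ(ierr);
  /* Fetch the current root values into leafupdate and apply op with leafdata
     in a single call; the read and the accumulate cannot be reordered because
     MPI_Get_accumulate performs them atomically element by element. */
  ierr = MPI_Get_accumulate((void*)leafdata,1,mine[i],leafupdate,1,mine[i],ranks[i],0,1,remote[i],op,win);CHKERRQ(ierr);
  ierr = MPI_Win_unlock(ranks[i],win);CHKERRQ(ierr);
}
PetscFunctionReturn(0);

If the unit boils down to a single predefined datatype, MPI_Fetch_and_op expresses the same thing with less generality and may take a faster path in the implementation.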