Hi,

I'm trying to get a better understanding of how to coordinate (non-overlapping) local stores with remote puts when using passive-target synchronization for RMA.  I understand that the window must be locked for a local store, but can it be a shared lock?  In my example, each process atomically fetches and increments an index (indexBuf, exposed through indexWin) on a target process and then stores its rank into an array (dataBuf, exposed through dataWin) at that index on the target.  If the target is the local process, a local store is attempted:

/* indexWin exposes indexBuf, dataWin exposes dataBuf */
const int one = 1;
std::vector<int> myvals(numProc);
MPI_Win_lock_all(0, indexWin);
MPI_Win_lock_all(0, dataWin);
for (int tgtProc = 0; tgtProc != numProc; ++tgtProc)
  {
    // Atomically fetch the target's current index and increment it
    MPI_Fetch_and_op(&one, &myvals[tgtProc], MPI_INT, tgtProc, 0, MPI_SUM,
                     indexWin);
    // Wait until the fetched value has arrived locally before using it
    MPI_Win_flush_local(tgtProc, indexWin);
    // Put our rank into the right location on the target
    if (tgtProc == procID)
      {
        dataBuf[myvals[procID]] = procID;  // local store
      }
    else
      {
        MPI_Put(&procID, 1, MPI_INT, tgtProc, myvals[tgtProc], 1, MPI_INT,
                dataWin);
      }
  }
MPI_Win_flush_all(dataWin);  /* Force completion before the time synchronization below */
MPI_Barrier(MPI_COMM_WORLD);
/* Proceed with local loads and unlock windows later */

I believe this is valid for the unified memory model but would probably fail for the separate model (unless a separate-model implementation very cleverly merges the private and public window copies?)  Is this understanding correct?  And if I instead use MPI_Put for the local write, would it then be valid under both memory models?
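
For concreteness, here's what I imagine the put-based version of the local branch would look like (just a sketch; the rest of the loop is unchanged):

/* Self-targeted put instead of a local store; I believe this would be
   valid under both memory models */
MPI_Put(&procID, 1, MPI_INT, procID, myvals[procID], 1, MPI_INT, dataWin);

with completion still forced by the MPI_Win_flush_all(dataWin) after the loop.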

Another approach is per-target lock/unlock pairs.  I don't like this because it seems to involve excessive synchronization.  But if I really want to mix local stores and remote puts, is this the only way to do it with locks?

/* indexWin exposes indexBuf, dataWin exposes dataBuf */
const int one = 1;
std::vector<int> myvals(numProc);
for (int tgtProc = 0; tgtProc != numProc; ++tgtProc)
  {
    MPI_Win_lock(MPI_LOCK_SHARED, tgtProc, 0, indexWin);
    // Atomically fetch the target's current index and increment it
    MPI_Fetch_and_op(&one, &myvals[tgtProc], MPI_INT, tgtProc, 0, MPI_SUM,
                     indexWin);
    MPI_Win_unlock(tgtProc, indexWin);  // also completes the fetch locally
    // Put our rank into the right location on the target
    if (tgtProc == procID)
      {
        MPI_Win_lock(MPI_LOCK_EXCLUSIVE, tgtProc, 0, dataWin);
        dataBuf[myvals[procID]] = procID;  // local store
        MPI_Win_unlock(tgtProc, dataWin);  /*(A)*/
      }
    else
      {
        MPI_Win_lock(MPI_LOCK_SHARED, tgtProc, 0, dataWin);
        MPI_Put(&procID, 1, MPI_INT, tgtProc, myvals[tgtProc], 1, MPI_INT,
                dataWin);
        MPI_Win_unlock(tgtProc, dataWin);
      }
  }
/* Proceed with local loads */

I believe this is also valid for both memory models.  Any earlier access to the local window must have been followed by an unlock before the exclusive lock is acquired, which should have synchronized the private and public copies, and another synchronization should happen at the unlock marked (A).  Is that understanding correct?  If so, how does one ever get into a situation where MPI_Win_sync must be used?
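
The only situation I can think of is loading, inside a long-lived passive epoch, data that another process has put, with no intervening unlock to synchronize the copies.  A sketch of what I have in mind (the origin/target roles, value, and the barrier used for notification are my own additions):

MPI_Win_lock_all(0, dataWin);
if (procID == origin)
  {
    MPI_Put(&value, 1, MPI_INT, target, 0, 1, MPI_INT, dataWin);
    MPI_Win_flush(target, dataWin);  /* put is complete at the target */
  }
MPI_Barrier(MPI_COMM_WORLD);  /* tell the target the data has arrived */
if (procID == target)
  {
    MPI_Win_sync(dataWin);  /* separate model: make the public copy visible
                               to the private copy before a local load */
    int seen = dataBuf[0];
  }
MPI_Win_unlock_all(dataWin);

Is that the kind of case MPI_Win_sync is meant for?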

Final question.  In the first example, suppose there is a lot of computation in the loop and I want the MPI_Puts to make progress immediately.  Would it be sensible to follow each MPI_Put with an MPI_Win_flush_local to get things moving, or is it best to avoid any unnecessary synchronization?
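
That is, something like this in the else branch (sketch only):

MPI_Put(&procID, 1, MPI_INT, tgtProc, myvals[tgtProc], 1, MPI_INT, dataWin);
MPI_Win_flush_local(tgtProc, dataWin);  /* hoping this nudges the put onto
                                           the network before the long
                                           computation that follows */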

Thanks,
Stephen