Hi Stephen,

Let me try to answer your questions inline. (I don't have extensive experience with the separate model; in my experience most implementations support the unified model, with some exceptions.)

On 5/31/20 1:31 AM, Stephen Guzik via users wrote:
Hi,

I'm trying to get a better understanding of coordinating (non-overlapping) local stores with remote puts when using passive synchronization for RMA.  I understand that the window should be locked for a local store, but can it be a shared lock?

Yes. There is no reason why that cannot be a shared lock.

In my example, each process retrieves and increments an index (indexBuf and indexWin) from a target process and then stores its rank into an array (dataBuf and dataWin) at that index on the target. If the target is local, a local store is attempted:

/* indexWin on indexBuf, dataWin on dataBuf */
std::vector<int> myvals(numProc);
MPI_Win_lock_all(0, indexWin);
MPI_Win_lock_all(0, dataWin);
for (int tgtProc = 0; tgtProc != numProc; ++tgtProc)
  {
    MPI_Fetch_and_op(&one, &myvals[tgtProc], MPI_INT, tgtProc, 0, MPI_SUM,
                     indexWin);
    MPI_Win_flush_local(tgtProc, indexWin);
    // Put our rank into the right location of the target
    if (tgtProc == procID)
      {
        dataBuf[myvals[procID]] = procID;
      }
    else
      {
        MPI_Put(&procID, 1, MPI_INT, tgtProc, myvals[tgtProc], 1, MPI_INT,
                dataWin);
      }
  }
MPI_Win_flush_all(dataWin);  /* Force completion and time synchronization */
MPI_Barrier(MPI_COMM_WORLD);
/* Proceed with local loads and unlock windows later */

I believe this is valid for the unified memory model but would probably fail for the separate model (unless the separate model very cleverly merges the private and public windows?). Is this understanding correct? And if I instead use MPI_Put for the local write, then it should be valid for both memory models?

Yes, if you use RMA operations even for local memory, the code is valid for both memory models.

The MPI standard on page 455 (S3) states that "a store to process memory to a location in a window must not start once a put or accumulate update to that target window has started, until the put or accumulate update becomes visible in process memory." So there is no clever merging and it is up to the user to ensure that there are no puts and stores happening at the same time.
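
For illustration, here is a minimal sketch of the loop body rewritten so that the local case also goes through MPI_Put (keeping your variable names; the special case for tgtProc == procID simply disappears), which avoids mixing stores and puts on the same window:

/* Sketch: same loop body as above, but the local store is replaced by an
   MPI_Put targeting our own rank, so only RMA operations touch dataWin. */
MPI_Fetch_and_op(&one, &myvals[tgtProc], MPI_INT, tgtProc, 0, MPI_SUM,
                 indexWin);
MPI_Win_flush_local(tgtProc, indexWin);  /* myvals[tgtProc] is now valid */
MPI_Put(&procID, 1, MPI_INT, tgtProc, myvals[tgtProc], 1, MPI_INT, dataWin);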


Another approach is specific locks. I don't like this because it seems to involve excessive synchronization. But if I really want to mix local stores and remote puts, is this the only way using locks?

/* indexWin on indexBuf, dataWin on dataBuf */
std::vector<int> myvals(numProc);
for (int tgtProc = 0; tgtProc != numProc; ++tgtProc)
  {
    MPI_Win_lock(MPI_LOCK_SHARED, tgtProc, 0, indexWin);
    MPI_Fetch_and_op(&one, &myvals[tgtProc], MPI_INT, tgtProc, 0, MPI_SUM,
                     indexWin);
    MPI_Win_unlock(tgtProc, indexWin);
    // Put our rank into the right location of the target
    if (tgtProc == procID)
      {
        MPI_Win_lock(MPI_LOCK_EXCLUSIVE, tgtProc, 0, dataWin);
        dataBuf[myvals[procID]] = procID;
        MPI_Win_unlock(tgtProc, dataWin);  /* (A) */
      }
    else
      {
        MPI_Win_lock(MPI_LOCK_SHARED, tgtProc, 0, dataWin);
        MPI_Put(&procID, 1, MPI_INT, tgtProc, myvals[tgtProc], 1, MPI_INT,
                dataWin);
        MPI_Win_unlock(tgtProc, dataWin);
      }
  }
/* Proceed with local loads */

I believe this is also valid for both memory models? An unlock must have followed the last access to the local window before the exclusive lock is gained. That should have synchronized the window copies, and another synchronization should happen at (A). Is that understanding correct?

That is correct for both memory models, yes. It is likely to be slower, though, because each lock/unlock pair incurs synchronization overhead. You are better off using a put instead.

If you really want to use local stores you can query the window's MPI_WIN_MODEL attribute (whose value is MPI_WIN_UNIFIED or MPI_WIN_SEPARATE) and fall back to using puts only when the window uses the separate model.
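
A minimal sketch of that check (using your dataWin; error handling omitted):

/* Sketch: query the memory model of dataWin and only allow local stores
   when the window uses the unified model. */
int *model;
int flag;
MPI_Win_get_attr(dataWin, MPI_WIN_MODEL, &model, &flag);
bool storesAllowed = flag && (*model == MPI_WIN_UNIFIED);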

> If so, how does one ever get into a situation where MPI_Win_sync must be used?

You can think of a synchronization scheme where each process takes a shared lock on a window, stores data to a local location, calls MPI_Win_sync, and then signals to the other processes that the data is available, e.g., through a barrier or a send. In that case the processes keep the lock held and rely on non-RMA synchronization instead.
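
A rough sketch of that pattern (assuming a window "win" over a local buffer "buf" on every rank; rank 0 publishes a value and the other ranks read it):

/* Sketch: passive-target epoch held by everyone; rank 0 updates its own
   window memory with a plain store, makes it visible with MPI_Win_sync,
   and a barrier is the non-RMA signal that the data is ready. */
MPI_Win_lock_all(0, win);
if (rank == 0)
  {
    buf[0] = 42;              /* local store into the window memory */
    MPI_Win_sync(win);        /* synchronize private and public copies */
  }
MPI_Barrier(MPI_COMM_WORLD);  /* signal: the value is now available */
if (rank != 0)
  {
    int val;
    MPI_Get(&val, 1, MPI_INT, 0, 0, 1, MPI_INT, win);
    MPI_Win_flush_local(0, win);  /* complete the get before using val */
  }
MPI_Win_unlock_all(win);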


Final question. In the first example, let's say there is a lot of computation in the loop and I want the MPI_Puts to immediately make progress. Would it be sensible to follow the MPI_Put with an MPI_Win_flush_local to get things moving? Or is it best to avoid any unnecessary synchronization?

That is highly implementation-specific. Some implementations may buffer the puts and delay the transfer until the flush, some may initiate the transfer immediately, and some may treat a local flush the same as a regular flush. I would not make assumptions about the underlying implementation and would defer flushes as long as possible.
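
For reference, the variant you describe would look roughly like this inside the loop (keeping your variable names); whether it actually starts the transfer earlier is, as noted, up to the implementation:

/* Sketch: per-iteration variant, flushing the put locally right away.
   This only guarantees that &procID may be reused; the data transfer
   itself may or may not already be under way. */
MPI_Put(&procID, 1, MPI_INT, tgtProc, myvals[tgtProc], 1, MPI_INT, dataWin);
MPI_Win_flush_local(tgtProc, dataWin);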

Cheers
Joseph


Thanks,
Stephen
