[OMPI users] OSC UCX error using MPI_Win_allocate

2024-03-19 Thread Stephen Guzik via users

Hi,

For development purposes, I built and installed Open MPI 5.0.2 on my 
workstation.  As I understand it, to use OpenSHMEM one has to build with 
UCX support, so I configured with


./configure --build=x86_64-linux-gnu 
--prefix=/usr/local/openmpi/5.0.2_gcc-12.2.0 --with-ucx 
--with-pmix=internal --with-libevent=external --with-hwloc=external 
--enable-mpi-fortran=all --with-cuda=/usr/local/cuda 
--with-cuda-libdir=/usr/lib/x86_64-linux-gnu


All seems fine with both MPI and OpenSHMEM until I try to use 
MPI_Win_allocate, in which case I see the following error, once for each 
call:


osc_ucx_component.c:369  Error: OSC UCX component priority set inside 
component query failed


While the job still seems to run correctly, does anyone have advice on 
removing the error?
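
For reference, the windows in question come from plain MPI_Win_allocate 
calls, roughly like the sketch below (the size and names are placeholders, 
not my actual code):

int *buf = nullptr;
MPI_Win win;
MPI_Win_allocate(1024 * sizeof(int),   /* window size in bytes */
                 sizeof(int),          /* displacement unit */
                 MPI_INFO_NULL, MPI_COMM_WORLD, &buf, &win);
/* ... RMA operations on win ... */
MPI_Win_free(&win);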


Thanks,
Stephen

[OMPI users] Issues with MPI_Win_Create on Debian 11

2022-02-08 Thread Stephen Guzik via users

Hi all,

There are several bug reports on 4.1.x describing MPI_Win_create failing 
on various architectures.  I am seeing the same with 4.1.0-10, which is 
packaged for Debian 11, on a standard workstation where at least the 
vader, tcp, self, and sm components are identified (I am not sure which 
are being used).  From reading the bug reports, I learned two things: 
a) version 5 might fix this, and b) MPI_Win_create should have been 
deprecated.


Indeed, I changed all my MPI_Win_create calls to MPI_Win_allocate and 
everything works fine.  So I suppose that is the best solution, at least 
in my case.
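
For anyone making the same change, the swap was essentially mechanical; a 
rough sketch with a made-up buffer and count (not my actual code):

/* Before: expose a user-allocated buffer with MPI_Win_create */
int *buf = (int *) malloc(count * sizeof(int));
MPI_Win win;
MPI_Win_create(buf, count * sizeof(int), sizeof(int),
               MPI_INFO_NULL, MPI_COMM_WORLD, &win);

/* After: let MPI allocate the memory together with the window */
int *buf;
MPI_Win win;
MPI_Win_allocate(count * sizeof(int), sizeof(int),
                 MPI_INFO_NULL, MPI_COMM_WORLD, &buf, &win);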


I'm writing this email just in case anyone else encounters the same 
situation.


Stephen


[OMPI users] Coordinating (non-overlapping) local stores with remote puts when using passive RMA synchronization

2020-05-30 Thread Stephen Guzik via users

Hi,

I'm trying to get a better understanding of coordinating 
(non-overlapping) local stores with remote puts when using passive 
synchronization for RMA.  I understand that the window should be locked 
for a local store, but can it be a shared lock?  In my example, each 
process retrieves and increments an index (indexBuf and indexWin) on a 
target process and then stores its rank into an array (dataBuf and 
dataWin) at that index on the target.  If the target is local, a local 
store is attempted:


/* indexWin on indexBuf, dataWin on dataBuf */
std::vector<int> myvals(numProc);
const int one = 1;  /* increment applied to the index on the target */
MPI_Win_lock_all(0, indexWin);
MPI_Win_lock_all(0, dataWin);
for (int tgtProc = 0; tgtProc != numProc; ++tgtProc)
  {
    /* Fetch the current index from the target and add one to it there */
    MPI_Fetch_and_op(&one, &myvals[tgtProc], MPI_INT, tgtProc, 0,
                     MPI_SUM, indexWin);
    MPI_Win_flush_local(tgtProc, indexWin);
    // Put our rank into the right location of the target
    if (tgtProc == procID)
      {
        dataBuf[myvals[procID]] = procID;  /* local store */
      }
    else
      {
        MPI_Put(&procID, 1, MPI_INT, tgtProc, myvals[tgtProc], 1,
                MPI_INT, dataWin);
      }
  }
MPI_Win_flush_all(dataWin);  /* Force completion and time synchronization */
MPI_Barrier(MPI_COMM_WORLD);
/* Proceed with local loads and unlock windows later */

I believe this is valid for the unified memory model but would probably 
fail for the separate model (unless a separate-model implementation very 
cleverly merges the private and public copies of the window?).  Is this 
understanding correct?  And if I instead use MPI_Put for the local 
write, should it then be valid for both memory models?
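
Concretely, the variant I have in mind just replaces the local store in 
the first example with a put to my own rank inside the same epoch, e.g.:

    if (tgtProc == procID)
      {
        /* Target ourselves so the update goes through the window like any
           other RMA access, which I believe is safe in both memory models */
        MPI_Put(&procID, 1, MPI_INT, procID, myvals[procID], 1,
                MPI_INT, dataWin);
      }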


Another approach is individual locks on each target.  I don't like this 
because it seems to involve excessive synchronization.  But if I really 
want to mix local stores and remote puts, is this the only way to do it 
with locks?


/* indexWin on indexBuf, dataWin on dataBuf */
std::vector<int> myvals(numProc);
const int one = 1;  /* increment applied to the index on the target */
for (int tgtProc = 0; tgtProc != numProc; ++tgtProc)
  {
    MPI_Win_lock(MPI_LOCK_SHARED, tgtProc, 0, indexWin);
    MPI_Fetch_and_op(&one, &myvals[tgtProc], MPI_INT, tgtProc, 0,
                     MPI_SUM, indexWin);
    MPI_Win_unlock(tgtProc, indexWin);
    // Put our rank into the right location of the target
    if (tgtProc == procID)
      {
        MPI_Win_lock(MPI_LOCK_EXCLUSIVE, tgtProc, 0, dataWin);
        dataBuf[myvals[procID]] = procID;  /* local store */
        MPI_Win_unlock(tgtProc, dataWin);  /*(A)*/
      }
    else
      {
        MPI_Win_lock(MPI_LOCK_SHARED, tgtProc, 0, dataWin);
        MPI_Put(&procID, 1, MPI_INT, tgtProc, myvals[tgtProc], 1,
                MPI_INT, dataWin);
        MPI_Win_unlock(tgtProc, dataWin);
      }
  }
/* Proceed with local loads */

I believe this is also valid for both memory models?  An unlock must 
have followed the last access to the local window before the exclusive 
lock is gained.  That should have synchronized the public and private 
copies of the window, and another synchronization should happen at (A).  
Is that understanding correct?  If so, how does one ever get into a 
situation where MPI_Win_sync must be used?
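
The only situation I can picture is holding a lock on my own window while 
waiting for another rank's put to show up, where (in the separate model, 
as I understand it) MPI_Win_sync would be needed to refresh the private 
copy.  Something like this hypothetical polling loop, with flagBuf and 
flagWin made up for the illustration:

MPI_Win_lock(MPI_LOCK_SHARED, procID, 0, flagWin);
/* flagBuf is assumed to be a (volatile) int* attached to flagWin and
   initialized to 0; another rank sets it to 1 with MPI_Put */
while (flagBuf[0] == 0)
  {
    MPI_Win_sync(flagWin);  /* reconcile public and private window copies */
  }
MPI_Win_unlock(procID, flagWin);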


Final question.  In the first example, let's say there is a lot of 
computation in the loop and I want the MPI_Puts to make progress 
immediately.  Would it be sensible to follow each MPI_Put with an 
MPI_Win_flush_local to get things moving?  Or is it best to avoid any 
unnecessary synchronization?
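
In other words, something like this inside the loop of the first example:

        MPI_Put(&procID, 1, MPI_INT, tgtProc, myvals[tgtProc], 1,
                MPI_INT, dataWin);
        MPI_Win_flush_local(tgtProc, dataWin);  /* complete the put locally,
                                                   hoping it also gets the
                                                   transfer moving */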


Thanks,
Stephen