Thanks, George.  I think we’re on the same page.  I’d love for Nathan to jump 
in here, since I’m guessing he has opinions on this subject.  Once we reach 
consensus, Wei or I will submit a PR to clarify the BTL documentation.

Brian

On 9/29/21, 7:40 AM, "George Bosilca" <bosi...@icl.utk.edu> wrote:


Brian,

My comment was mainly about the BTL code. MPI_Win_fence does not require remote 
completion; the call only guarantees that all outbound operations have been 
locally completed, and that all inbound operations from other sources on the 
process are also complete. I agree with you on the Win_flush implementation we 
have: it only guarantees the first part, and assumes the barrier will drain the 
network of all pending messages.

You're right: the current implementation assumes that MPI_Barrier, being more 
synchronizing and requiring more messages to be exchanged between the 
participants, makes it more likely that all pending messages have reached 
their destination, even with overtaking.

  George.


On Tue, Sep 28, 2021 at 10:36 PM Barrett, Brian <bbarr...@amazon.com> wrote:
George –

Is your comment about the code path referring to the BTL code or the OSC RDMA 
code?  The OSC code seems to expect remote completion, at least for the fence 
operation.  Fence is implemented as a btl flush followed by a window-wide 
barrier.  There’s no ordering specified between the RDMA operations completed 
by the flush and the send messages in the collective, so overtaking is 
possible.  Given that the BTL and the UCX PML (or OFI MTL or whatever) are 
likely using different QPs, ordering of the packets is doubtful.

Like you, we saw that many BTLs appear to only guarantee local completion with 
flush().  So the question is which one is broken (and then we’ll have to figure 
out how to fix…).

Brian

On 9/28/21, 7:11 PM, "devel on behalf of George Bosilca via devel" 
<devel-boun...@lists.open-mpi.org on behalf of devel@lists.open-mpi.org> wrote:


Based on my high-level understanding of the code path and according to the UCX 
implementation of the flush, the required level of completion is local.

  George.


On Tue, Sep 28, 2021 at 19:26 Zhang, Wei via devel 
<devel@lists.open-mpi.org> wrote:
Dear All,

I have a question regarding the completion semantics of btl_flush.

In opal/mca/btl/btl.h,

https://github.com/open-mpi/ompi/blob/4828663537e952e3d7cbf8fbf5359f16fdcaaade/opal/mca/btl/btl.h#L1146

the comment about btl_flush says:


 * This function returns when all outstanding RDMA (put, get, atomic) operations
 * that were started prior to the flush call have completed.

However, it is not clear to me what “complete” actually means. Does it mean 
“local completion” (the action on the RDMA initiator side has completed), or 
does it mean “remote completion” (the action on the RDMA remote side has 
completed)? We are interested in this because, for many RDMA BTLs, “local 
completion” is not the same as “remote completion”.

From the way btl_flush is used in osc/rdma’s fence operation (which is a call 
to flush followed by an MPI_Barrier), we think that btl_flush should mean 
remote completion, but we would like clarification from the community.

Sincerely,

Wei Zhang
