George –

Is your comment about the code path referring to the BTL code or the OSC RDMA 
code?  The OSC code seems to expect remote completion, at least for the fence 
operation.  Fence is implemented as a btl flush followed by a window-wide 
barrier.  There’s no ordering specified between the RDMA operations completed 
by the flush and the send messages in the collective, so overtaking is 
possible.  Given that the BTL and the UCX PML (or OFI MTL or whatever) are 
likely using different QPs, ordering of the packets is doubtful.
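
To make the overtaking concern concrete, here is a toy model in C. It is only a sketch and not Open MPI code: the names toy_flush_mode_t and toy_fence_then_read are hypothetical, and the "worst case" ordering (the barrier finishing before the put lands) is an assumption chosen to show the race, not something measured on a real BTL.

```c
#include <stdbool.h>

/* Hypothetical flush semantics; neither name exists in Open MPI. */
typedef enum { TOY_FLUSH_LOCAL, TOY_FLUSH_REMOTE } toy_flush_mode_t;

/*
 * Model one fence epoch: a put is in flight, the initiator calls
 * flush and then a barrier, and the target reads its window as soon
 * as the barrier completes. Returns the value the target observes.
 */
int toy_fence_then_read(toy_flush_mode_t mode)
{
    int window = 0;          /* target-side window memory, initially 0 */
    const int payload = 42;  /* data carried by the outstanding put */
    bool landed = false;     /* has the put been written at the target? */

    /* flush: remote completion waits until the data is at the target. */
    if (mode == TOY_FLUSH_REMOTE) {
        window = payload;
        landed = true;
    }
    /* With local completion, flush returns once the initiator is done
     * with its buffer; the put may still be on the wire. */

    /* barrier: assumed to travel on a different QP, so in the worst
     * case it completes before the put lands. The target reads now. */
    int observed = window;

    /* The put finally lands, too late for the read above. */
    if (!landed) {
        window = payload;
    }
    (void)window;

    return observed;
}
```

In this model, toy_fence_then_read(TOY_FLUSH_LOCAL) returns the stale value 0, while toy_fence_then_read(TOY_FLUSH_REMOTE) returns 42, which is the behavior the fence appears to rely on.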

Like you, we saw that many BTLs appear to only guarantee local completion with 
flush().  So the question is which one is broken (and then we’ll have to figure 
out how to fix…).

Brian

On 9/28/21, 7:11 PM, "devel on behalf of George Bosilca via devel" 
<devel-boun...@lists.open-mpi.org on behalf of devel@lists.open-mpi.org> wrote:


Based on my high-level understanding of the code path and according to the UCX 
implementation of the flush, the required level of completion is local.

  George.


On Tue, Sep 28, 2021 at 19:26 Zhang, Wei via devel 
<devel@lists.open-mpi.org> wrote:
Dear All,

I have a question regarding the completion semantics of btl_flush.

In opal/mca/btl/btl.h,

https://github.com/open-mpi/ompi/blob/4828663537e952e3d7cbf8fbf5359f16fdcaaade/opal/mca/btl/btl.h#L1146

the comment about btl_flush says:


* This function returns when all outstanding RDMA (put, get, atomic) operations
* that were started prior to the flush call have completed.

However, it is not clear to me what “completed” actually means here. Does it 
mean local completion (the operation has completed on the RDMA initiator 
side), or does it mean remote completion (the operation has completed on the 
RDMA target side)? We are interested in this because for many RDMA BTLs, 
local completion does not imply remote completion.

From the way btl_flush is used in osc/rdma’s fence operation (a call to 
flush followed by an MPI_Barrier), we think that btl_flush should mean remote 
completion, but we would like clarification from the community.

Sincerely,

Wei Zhang
