Re: [OMPI devel] Question regarding the completion of btl_flush

Barrett, Brian via devel Wed, 29 Sep 2021 11:08:36 -0700

Thanks, George.  I think we’re on the same page.  I’d love for Nathan to jump 
in here, since I’m guessing he has opinions on this subject.  Once we reach 
consensus, Wei or I will submit a PR to clarify the BTL documentation.

Brian

On 9/29/21, 7:40 AM, "George Bosilca" <bosi...@icl.utk.edu> wrote:

Brian,

My comment was mainly about the BTL code. MPI_Win_fence does not require remote
completion; the call only guarantees that all outbound operations have been
locally completed, and that all inbound operations from other sources on the
process are also complete. I agree with you on the Win_flush implementation we
have: it only guarantees the first part, and assumes the barrier will drain the
network of all pending messages.

You're right: the current implementation assumes that MPI_Barrier, being more
synchronizing and requiring more messages to be exchanged between the
participants, makes it more likely that all pending messages have reached
their destination, even with overtaking.

George.

On Tue, Sep 28, 2021 at 10:36 PM Barrett, Brian <bbarr...@amazon.com> wrote:
George –

Is your comment about the code path referring to the BTL code or the OSC RDMA
code? The OSC code seems to expect remote completion, at least for the fence
operation. Fence is implemented as a btl flush followed by a window-wide
barrier. There’s no ordering specified between the RDMA operations completed
by the flush and the send messages in the collective, so overtaking is
possible. Given that the BTL and the UCX PML (or OFI MTL or whatever) are
likely using different QPs, ordering of the packets is doubtful.
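The overtaking concern above can be illustrated with a toy timeline. This is only a sketch under stated assumptions, not Open MPI code: `fence_timeline`, `model_fence`, and `barrier_overtakes_put` are hypothetical names, times are abstract ticks, the barrier is reduced to a single message, and flush() is assumed to return at local completion.

```c
/* Toy timeline model of "flush, then barrier" (hypothetical, not Open MPI). */
typedef struct {
    int put_local_done;   /* tick at which the initiator sees the put complete */
    int put_remote_done;  /* tick at which the data is visible at the target */
    int barrier_arrival;  /* tick at which the barrier message reaches the target */
} fence_timeline;

/* Assumptions: the put is posted at tick 0 on the RDMA channel; flush()
 * returns at local completion; the barrier message is sent right after
 * flush() returns, on a separate channel with its own latency. */
fence_timeline model_fence(int local_latency, int rdma_latency, int send_latency)
{
    fence_timeline t;
    t.put_local_done = local_latency;
    t.put_remote_done = rdma_latency;
    t.barrier_arrival = t.put_local_done + send_latency;
    return t;
}

/* Returns 1 if the barrier message reaches the target before the put data,
 * i.e. the target could leave the fence and read stale memory. */
int barrier_overtakes_put(fence_timeline t)
{
    return t.barrier_arrival < t.put_remote_done;
}
```

In this model, `model_fence(1, 10, 2)` gives a barrier arrival at tick 3 while the put lands at tick 10, so the barrier overtakes the put. With no ordering between the two channels, nothing rules that case out when flush() only means local completion.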

Like you, we saw that many BTLs appear to only guarantee local completion with
flush(). So the question is which one is broken (and then we’ll have to figure
out how to fix…).

Brian

On 9/28/21, 7:11 PM, "devel on behalf of George Bosilca via devel"
<devel-boun...@lists.open-mpi.org on behalf of devel@lists.open-mpi.org> wrote:

Based on my high-level understanding of the code path and according to the UCX
implementation of the flush, the required level of completion is local.

George.

On Tue, Sep 28, 2021 at 19:26 Zhang, Wei via devel
<devel@lists.open-mpi.org> wrote:
Dear All,

I have a question regarding the completion semantics of btl_flush.

In opal/mca/btl/btl.h,

https://github.com/open-mpi/ompi/blob/4828663537e952e3d7cbf8fbf5359f16fdcaaade/opal/mca/btl/btl.h#L1146

the comment about btl_flush says:

* This function returns when all outstanding RDMA (put, get, atomic) operations
* that were started prior to the flush call have completed.

However, it is not clear to me what “completed” actually means. Does it mean
“local completion” (the action on the RDMA initiator side has completed), or
does it mean “remote completion” (the action on the RDMA remote side has
completed)? We are interested in this because, for many RDMA BTLs, “local
completion” does not imply “remote completion”.
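To make the two readings concrete, here is a minimal sketch of a flush predicate parameterized by the required completion level. It is not the BTL interface: `completion_level`, `rdma_op`, and `flush_may_return` are hypothetical names, and the model assumes that remote completion implies local completion.

```c
#include <stddef.h>

/* Hypothetical completion levels, ordered so that REMOTE implies LOCAL. */
typedef enum {
    COMPLETE_NONE = 0,    /* operation posted, nothing known yet */
    COMPLETE_LOCAL = 1,   /* initiator side has completed */
    COMPLETE_REMOTE = 2   /* target side has completed */
} completion_level;

/* One outstanding RDMA operation (put, get, or atomic) in the toy model. */
typedef struct {
    completion_level reached;
} rdma_op;

/* Returns 1 when every outstanding operation has reached at least the
 * required level, i.e. when a flush with that semantic could return. */
int flush_may_return(const rdma_op *ops, size_t n, completion_level required)
{
    for (size_t i = 0; i < n; i++) {
        if (ops[i].reached < required) {
            return 0;
        }
    }
    return 1;
}
```

With one operation at `COMPLETE_REMOTE` and another at `COMPLETE_LOCAL`, the predicate holds for `COMPLETE_LOCAL` but not for `COMPLETE_REMOTE`. That gap is exactly what the btl.h comment leaves unspecified.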

From the way btl_flush is used in osc/rdma’s fence operation (which is a call
to flush followed by an MPI_Barrier), we think that btl_flush should mean
remote completion, but we would like clarification from the community.

Sincerely,

Wei Zhang
