George – Is your comment about the code path referring to the BTL code or the OSC RDMA code? The OSC code seems to expect remote completion, at least for the fence operation. Fence is implemented as a btl flush followed by a window-wide barrier. There’s no ordering specified between the RDMA operations completed by the flush and the send messages in the collective, so overtaking is possible. Given that the BTL and the UCX PML (or OFI MTL or whatever) are likely using different QPs, ordering of the packets is doubtful.
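To make the overtaking concern concrete, here is a toy model of a fence built as flush plus barrier. It is a sketch only, not Open MPI code: the `Channel` class, the `fence` function, and the `flush_waits_for_remote` flag are hypothetical names invented for illustration. It assumes the put and the barrier message travel on two independent queues with no ordering between them, and that the network happens to deliver the barrier message first.

```python
from collections import deque


class Channel:
    """One independent in-order queue (a stand-in for a single QP)."""

    def __init__(self):
        self.in_flight = deque()

    def post(self, msg):
        self.in_flight.append(msg)

    def deliver_one(self):
        # Deliver the oldest pending message, or None if nothing is pending.
        return self.in_flight.popleft() if self.in_flight else None


def fence(flush_waits_for_remote):
    """Return the value the target reads right after it leaves the barrier."""
    target_memory = {"x": 0}
    rdma = Channel()  # carries the put (hypothetical BTL traffic)
    pml = Channel()   # carries the barrier message (hypothetical PML traffic)

    def apply(msg):
        # Land a put in target memory; ignore None (nothing was pending).
        if msg is not None and msg[0] == "put":
            target_memory[msg[1]] = msg[2]

    # Origin: issue the put, then flush.
    rdma.post(("put", "x", 42))
    if flush_waits_for_remote:
        # Remote completion: flush returns only once the put has landed.
        apply(rdma.deliver_one())
    # Otherwise (local completion) flush returns while the put is in flight.

    # Origin enters the barrier; its message uses the other channel.
    pml.post(("barrier",))

    # Assumed delivery order: the barrier message arrives first, which is
    # allowed because the two channels are not ordered with each other.
    assert pml.deliver_one() == ("barrier",)
    observed = target_memory["x"]  # target leaves the barrier and reads x

    apply(rdma.deliver_one())  # the put lands late (no-op if already landed)
    return observed


print(fence(flush_waits_for_remote=False))  # 0: barrier overtook the put
print(fence(flush_waits_for_remote=True))   # 42: put landed before barrier
```

In this model the fence only gives the target a consistent view when flush implies remote completion, which is the behavior the OSC code appears to rely on.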
Like you, we saw that many BTLs appear to only guarantee local completion with flush(). So the question is which one is broken (and then we’ll have to figure out how to fix…).

Brian

On 9/28/21, 7:11 PM, "devel on behalf of George Bosilca via devel" <devel-boun...@lists.open-mpi.org on behalf of devel@lists.open-mpi.org> wrote:

Based on my high-level understanding of the code path and according to the UCX implementation of the flush, the required level of completion is local.

George.

On Tue, Sep 28, 2021 at 19:26 Zhang, Wei via devel <devel@lists.open-mpi.org> wrote:

Dear All,

I have a question regarding the completion semantics of btl_flush. In opal/mca/btl/btl.h,

https://github.com/open-mpi/ompi/blob/4828663537e952e3d7cbf8fbf5359f16fdcaaade/opal/mca/btl/btl.h#L1146

the comment about btl_flush says:

* This function returns when all outstanding RDMA (put, get, atomic) operations
* that were started prior to the flush call have completed.

However, it is not clear to me what “complete” actually means. Does it mean local completion (the action on the RDMA initiator side has completed), or does it mean remote completion (the action on the RDMA remote side has completed)?

We are interested in this because for many RDMA BTLs, local completion is not the same as remote completion. From the way btl_flush is used in osc/rdma’s fence operation (a call to flush followed by an MPI_Barrier), we think that btl_flush should mean remote completion, but we would like clarification from the community.

Sincerely,

Wei Zhang