On May 24, 2007, at 2:48 PM, George Bosilca wrote:
I see the problem this patch tries to solve, but I fail to understand the implementation. The patch affects every PML and BTL in the code base by adding one more argument to some of the most frequently called functions, yet only one BTL (openib) seems to use it while all the others ignore it completely. Moreover, there already seems to be a very similar mechanism based on the MCA_BTL_DES_FLAGS_PRIORITY flag, which the PML can set in the btl_descriptor. So what is the difference between the additional argument and correct usage of the MCA_BTL_DES_FLAGS_PRIORITY flag?
The problem is that MCA_BTL_DES_FLAGS_PRIORITY was meant to indicate that a fragment is higher priority, but this fragment isn't higher priority. It simply needs to be ordered w.r.t. a previous fragment, an RDMA in this case. That said, we could have just added an RDMA FIN flag, but in my opinion that would mix protocol a bit too much between the BTL and the PML. With this fix, the BTL can assign an order tag to any descriptor if it wishes; the order tag is only valid after a call to btl_send or btl_put/get. The tag can then be used later to request another descriptor that will enforce ordering. The semantics here are clear, and the BTL doesn't have to do anything (w.r.t. assigning a valid order tag) if it doesn't wish to. These were the clearest semantics I could come up with that still allow numerous implementations at the BTL level. For example, even specifying an RDMA FIN flag directly to the BTL would restrict the BTL further than these semantics do: all RDMAs would have to be sent on the same endpoint/QP, because all the PML could indicate is that a FIN is being sent, and the BTL wouldn't have the context to know which RDMA the FIN belonged to and hence couldn't easily enforce ordering.
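To make the order-tag semantics concrete, here is a minimal, simulated sketch in C. It is not the code from the changeset: only MCA_BTL_NO_ORDER is a name taken from this thread, and its value, the sketch_* types and functions, and the two-queue round-robin policy are all assumptions made for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* MCA_BTL_NO_ORDER is named in the thread; the value 255 is an assumption. */
#define MCA_BTL_NO_ORDER 255
#define SKETCH_NUM_QPS   2      /* hypothetical number of send queues */

/* Hypothetical descriptor: only the fields needed to show ordering. */
typedef struct sketch_des {
    uint8_t order;  /* order tag; meaningful only after the descriptor is posted */
    int     qp;     /* queue the descriptor was posted on, -1 until posted */
} sketch_des_t;

static int sketch_next_qp = 0;  /* round-robin cursor for unordered posts */

/* Allocate a descriptor, optionally requesting ordering w.r.t. an earlier one. */
static sketch_des_t sketch_alloc(uint8_t order)
{
    sketch_des_t des;
    des.order = order;
    des.qp = -1;
    return des;
}

/* Stand-in for btl_send / btl_put: honor a requested order tag, otherwise
 * pick any queue, then publish the tag of the queue that was used. */
static void sketch_post(sketch_des_t *des)
{
    if (des->order != MCA_BTL_NO_ORDER) {
        assert(des->order < SKETCH_NUM_QPS);
        des->qp = des->order;                       /* same queue => ordered */
    } else {
        des->qp = sketch_next_qp;
        sketch_next_qp = (sketch_next_qp + 1) % SKETCH_NUM_QPS;
    }
    des->order = (uint8_t)des->qp;                  /* tag is valid from here on */
}

/* PML-side pattern: RDMA first, then a FIN ordered behind it.
 * Returns 1 when both ended up on the same queue. */
static int sketch_put_then_fin(void)
{
    sketch_des_t rdma = sketch_alloc(MCA_BTL_NO_ORDER);
    sketch_post(&rdma);                             /* the RDMA operation */

    sketch_des_t fin = sketch_alloc(rdma.order);    /* request ordering */
    sketch_post(&fin);                              /* the FIN message */

    return fin.qp == rdma.qp;
}
```

In this sketch a BTL that has nothing to order would simply leave the tag at MCA_BTL_NO_ORDER after posting, and the PML would pass that value straight back without any special casing.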
The only reason OpenIB is the only one to use this new functionality is that I haven't had a chance to fix up udapl, which I plan to do next week. Note that GM semantics expose a similar problem (ordering is only guaranteed for messages of the same priority), but Myrinet doesn't buffer like some of the IB/iWARP stuff can, so we won't see it there.
These semantics also allow a number of optimizations. For example, the BTL no longer has to give a local completion callback on an RDMA, as the FIN message can be used for local completion of both.
I am also looking at adding a BTL_PUT_IMMEDIATE, which provides remote completion via an active message tag callback along with 64 bits of data. This would allow us to bypass the FIN entirely if the network supports it, which would be useful for MX, for example. OpenIB also supports a similar mechanism, but there are problems that would need to be addressed, as OpenIB only delivers 32 bits with the remote completion.
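As a rough illustration of the put-with-immediate idea, the following simulated C sketch delivers a payload and then runs a tag callback on the "remote" side with 64 bits of data, so no separate FIN is needed. BTL_PUT_IMMEDIATE does not exist yet; every sketch_* name, the callback table, and the use of memcpy as the transfer are assumptions for illustration only.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define SKETCH_MAX_TAGS 8   /* hypothetical size of the tag callback table */

/* Hypothetical active-message callback: receives the 64 bits of immediate data. */
typedef void (*sketch_am_cb_t)(uint64_t imm_data);

static sketch_am_cb_t sketch_cb_table[SKETCH_MAX_TAGS];
static uint64_t sketch_completed_req = 0;   /* last request reported complete */

/* Example callback: treat the immediate data as the completed request handle. */
static void sketch_rdma_done_cb(uint64_t imm_data)
{
    sketch_completed_req = imm_data;
}

/* Register a callback for a tag; returns 0 on success, -1 on a bad tag. */
static int sketch_register(uint8_t tag, sketch_am_cb_t cb)
{
    if (tag >= SKETCH_MAX_TAGS) {
        return -1;
    }
    sketch_cb_table[tag] = cb;
    return 0;
}

/* Simulated put-with-immediate: the data lands first, then the tag callback
 * fires with the immediate data, signalling remote completion. */
static int sketch_put_immediate(void *dst, const void *src, size_t len,
                                uint8_t tag, uint64_t imm_data)
{
    if (tag >= SKETCH_MAX_TAGS || sketch_cb_table[tag] == NULL) {
        return -1;
    }
    memcpy(dst, src, len);
    sketch_cb_table[tag](imm_data);
    return 0;
}

/* The 32-bit limitation: a 64-bit value such as a pointer-sized request
 * handle cannot always be carried as-is in a 32-bit immediate. */
static int sketch_fits_imm32(uint64_t imm_data)
{
    return imm_data <= UINT32_MAX;
}
```

The last helper only shows why a 32-bit immediate is awkward; how that would actually be addressed for OpenIB is left open here.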
- Galen
george.

On May 24, 2007, at 3:51 PM, gship...@osl.iu.edu wrote:
Author: gshipman
Date: 2007-05-24 15:51:26 EDT (Thu, 24 May 2007)
New Revision: 14768
URL: https://svn.open-mpi.org/trac/ompi/changeset/14768

Log:
Add optional ordering to the BTL interface.
This is required to tighten up the BTL semantics. Ordering is not guaranteed, but, if the BTL returns an order tag in a descriptor (other than MCA_BTL_NO_ORDER) then we may request another descriptor that will obey ordering w.r.t. the other descriptor.
This will allow sane behavior for RDMA networks, where local completion of an RDMA operation on the active side does not imply remote completion on the passive side. If we send a FIN message after local completion and the FIN is not ordered w.r.t. the RDMA operation then badness may occur as the passive side may now try to deregister the memory and the RDMA operation may still be pending on the passive side.
Note that this has no impact on networks that don't suffer from this limitation as the ORDER tag can simply always be specified as MCA_BTL_NO_ORDER.
Text files modified:
   trunk/ompi/mca/bml/bml.h | 29 +++++++++++++++--------
   trunk/ompi/mca/btl/btl.h | 10 ++++++++
   trunk/ompi/mca/btl/gm/btl_gm.c | 8 ++++++
   trunk/ompi/mca/btl/gm/btl_gm.h | 3 ++
   trunk/ompi/mca/btl/mx/btl_mx.c | 8 ++++++
   trunk/ompi/mca/btl/mx/btl_mx.h | 3 ++
   trunk/ompi/mca/btl/openib/btl_openib.c | 49 ++++++++++++++++++++++++++++++++++++++-
   trunk/ompi/mca/btl/openib/btl_openib.h | 3 ++
   trunk/ompi/mca/btl/openib/btl_openib_endpoint.c | 7 +++--
   trunk/ompi/mca/btl/openib/btl_openib_frag.c | 7 +++++
   trunk/ompi/mca/btl/portals/btl_portals.c | 8 ++++-
   trunk/ompi/mca/btl/portals/btl_portals.h | 3 ++
   trunk/ompi/mca/btl/self/btl_self.c | 3 ++
   trunk/ompi/mca/btl/self/btl_self.h | 3 ++
   trunk/ompi/mca/btl/sm/btl_sm.c | 2 +
   trunk/ompi/mca/btl/sm/btl_sm.h | 2 +
   trunk/ompi/mca/btl/tcp/btl_tcp.c | 6 ++++
   trunk/ompi/mca/btl/tcp/btl_tcp.h | 3 ++
   trunk/ompi/mca/btl/template/btl_template.c | 8 ++++-
   trunk/ompi/mca/btl/template/btl_template.h | 3 ++
   trunk/ompi/mca/btl/template/btl_template_component.c | 10 ++++---
   trunk/ompi/mca/btl/udapl/btl_udapl.c | 11 ++++++--
   trunk/ompi/mca/btl/udapl/btl_udapl.h | 3 ++
   trunk/ompi/mca/btl/udapl/btl_udapl_component.c | 17 ++++++++----
   trunk/ompi/mca/osc/rdma/osc_rdma_data_move.c | 3 ++
   trunk/ompi/mca/pml/dr/pml_dr.h | 6 ++--
   trunk/ompi/mca/pml/dr/pml_dr_sendreq.c | 12 ++++++++-
   trunk/ompi/mca/pml/dr/pml_dr_sendreq.h | 3 +
   trunk/ompi/mca/pml/ob1/pml_ob1.c | 17 ++++++++-----
   trunk/ompi/mca/pml/ob1/pml_ob1.h | 44 +++++------------------------------
   trunk/ompi/mca/pml/ob1/pml_ob1_recvreq.c | 14 +++++++----
   trunk/ompi/mca/pml/ob1/pml_ob1_sendreq.c | 28 ++++++++++++++++------
   32 files changed, 241 insertions(+), 95 deletions(-)

Diff not shown due to size (53504 bytes).
To see the diff, run the following command:

        svn diff -r 14767:14768 --no-diff-deleted

_______________________________________________
svn mailing list
s...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/svn
_______________________________________________
devel mailing list
de...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/devel