Template Version: @(#)sac_nextcase %I% %G% SMI
This information is Copyright 2008 Sun Microsystems
1. Introduction
1.1. Project/Component Working Name:
IB Support for Relaxed Ordering
1.2. Name of Document Author/Supplier:
Author: Lida Horn
1.3. Date of This Document:
10 December, 2008
4. Technical Description
PSARC members: please at least look at the "**IMPORTANT NOTE**" below
and comment. If this fasttrack needs to be derailed, having an early
indication is appreciated, so we don't waste time.
IB Support for Relaxed Ordering
-------------------------------
4.1 Background
Certain Sun platforms have introduced support for PCI Relaxed Ordering
(RO) for increased performance (see PSARC/2006/157). In theory, if
InfiniBand (IB) applications are written as recommended by the IB
specification [1] (that is, they use IB completions to determine when
operations are done), the use of RO should have no impact on how IB
applications are written.
However, for many years, Message Passing Interface (MPI)
implementations for IB have knowingly violated these recommendations
in pursuit of maximum performance (which is considered very important
in this segment of the HPC market). In particular, it has been a
common practice to rely on "memory polling" where MPI waits for the
last bytes in a transfer to flip to know when an operation is
done. (This also relies on knowing which bytes will be written last.)
This practice shaves off the time necessary to deliver the IB
completion.
Of course, RO can create a problem with memory polling, since the last
transfer bytes are not necessarily delivered last (when DMA operations
consist of multiple transfer units). This situation introduces the
potential for corruption. It should be pointed out that exposure to
this potential danger hasn't been a problem because IB cards were not
certified on RO platforms. However, now that we are trying to bring IB
MPI support to RO platforms, this problem must be confronted. This
case introduces mechanisms for IB memory polling clients to operate
safely in the face of PCI Relaxed Ordering.
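The hazard can be illustrated with a small simulation. This is only a
sketch under stated assumptions: the buffer size, the transfer unit
size, the zero sentinel and all function names are hypothetical, and
memcpy() stands in for DMA delivery of one transfer unit. It shows
that a poller watching the last byte sees "done" too early when the
final unit is delivered first.

```c
#include <stddef.h>
#include <string.h>

#define UNIT_BYTES 4                    /* hypothetical transfer unit size */
#define NUNITS     3                    /* units per simulated DMA operation */
#define BUF_LEN    (UNIT_BYTES * NUNITS)

/* Simulate delivery of one transfer unit from src into dst. */
static void
deliver_unit(unsigned char *dst, const unsigned char *src, size_t unit)
{
	memcpy(dst + unit * UNIT_BYTES, src + unit * UNIT_BYTES, UNIT_BYTES);
}

/* Memory polling: treat the transfer as done once the last byte flips. */
static int
poll_last_byte(const unsigned char *buf, unsigned char sentinel)
{
	return (buf[BUF_LEN - 1] != sentinel);
}

/*
 * Deliver the units in the given order. Return 1 if the poller would
 * report "done" while the buffer is still incomplete, 0 otherwise.
 */
static int
premature_done(const size_t order[NUNITS])
{
	unsigned char src[BUF_LEN], dst[BUF_LEN];
	size_t i;

	memset(src, 0xAB, BUF_LEN);     /* payload */
	memset(dst, 0x00, BUF_LEN);     /* sentinel value is 0 */

	for (i = 0; i < NUNITS; i++) {
		deliver_unit(dst, src, order[i]);
		if (poll_last_byte(dst, 0x00) &&
		    memcmp(dst, src, BUF_LEN) != 0)
			return (1);
	}
	return (0);
}
```

With the delivery order {0, 1, 2} the poller only fires once the whole
buffer is present; with {2, 0, 1} it fires after the first delivery,
while the first two units are still stale.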
4.2 Proposal
This proposal makes changes/additions to two components used by memory
polling IB clients: uDAPL (PSARC/2003/145 and follow-on cases) and the
InfiniBand Transport Framework (IBTF, PSARC/2002/132 and follow-on
cases). In uDAPL, changes are made to the client interface. In IBTF,
changes are made to the Transport Interface (TI) for ULPs and Channel
Interface (CI) for HCA drivers. The IBTF changes are part of the v3
IBTF ABI first introduced in PSARC/2008/630.
All interface changes and additions in this proposal have a
micro/patch binding; note, however, the issue with the uDAPL
interfaces described below.
IBTF Transport Interface (ON Consolidation Private)
    ibt_mr_flags_t: IBT_MR_DISABLE_RO flag - disable RO on registration
IBTF Channel Interface (ON Consolidation Private)
    ibt_mr_flags_t: IBT_MR_DISABLE_RO flag - disable RO on registration
uDAPL changes (incompatible change to Committed interface, see below)
    dat_ia_open(): RO_AWARE_ prefix for ia_name_ptr
    dat_lmr_create(): new DAT_MEM_TYPE_SO_VIRTUAL memory "type"
**IMPORTANT NOTE**: both dat_ia_open() and dat_lmr_create() are part
of the published standards for uDAPL [2]. The original uDAPL case
(PSARC/2003/145), implementing uDAPL 1.1, exports these interfaces
as "Standard", which now translates to "Committed". In
PSARC/2004/399, our uDAPL was upgraded to 1.2. Usually a change
to these interfaces could only occur in a major release. However, as
noted in the Interface Taxonomy, exemptions are possible for reasons
such as possible data corruption inherent in the interface, which we
believe to be the case here. We believe the impact of this
incompatible change to be mitigated for two reasons:
  * This issue only arises on RO platforms not previously certified
    for IB cards. uDAPL therefore continues to work as before on
    previously supported platforms; only new usage on RO platforms
    has to contend with this issue.
  * Current usage of uDAPL is limited (almost exclusively a few
    MPIs); most OS bypass clients now use Open Fabric User Verbs
    (to which Sun also intends to transition eventually). For the
    same reason, it is unlikely we could get a change made to the
    uDAPL interface: because of the declining interest in uDAPL, the
    DAT Collaborative no longer meets to continue uDAPL spec
    development.
Copies of all modified/added man pages are in the materials directory
(see section 4.3 below). Change bars highlight modifications.
A. uDAPL RO-Aware Clients
One key concept is for uDAPL to understand when (newer) clients have
code to handle RO platforms (i.e. are "RO Aware"). Clients signal this
awareness by prefixing the "interface adapter" name argument to
dat_ia_open() with the string "RO_AWARE_". Clients which do not do
this (legacy ones) are assumed to be unaware of RO issues.
uDAPL code determines whether a platform is using RO by using the
platform name returned through utsname.h(3HEAD), which is the same
information used by uname(1). If a client is unaware of RO and the
platform is using RO, then dat_ia_open() will return the
DAT_INVALID_PARAMETER error (DAT sub-type: DAT_INVALID_RO_COOKIE). The
sub-type distinguishes an RO mismatch failure from other causes. This
error blocks an unaware client from proceeding and possibly suffering
corruption. In all other cases, any "RO_AWARE_" prefix is removed
and regular dat_ia_open() behavior is followed.
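The decision described above might be sketched as follows. This is
not the uDAPL implementation: the function name, the result codes and
the platform_uses_ro parameter are hypothetical stand-ins (the real
call returns DAT_RETURN values and derives the platform from the
utsname information), and "ibd0" in the usage note is only an
illustrative adapter name.

```c
#include <stddef.h>
#include <string.h>

#define RO_AWARE_PREFIX "RO_AWARE_"

/* Hypothetical result codes standing in for DAT_RETURN values. */
enum ro_check {
	RO_CHECK_OK = 0,        /* proceed with regular dat_ia_open() */
	RO_CHECK_MISMATCH = 1   /* unaware client on an RO platform */
};

/*
 * Decide whether a client may open the adapter. On success,
 * *real_name points at the adapter name with any prefix removed.
 */
static enum ro_check
ro_check_ia_name(const char *ia_name, int platform_uses_ro,
    const char **real_name)
{
	size_t plen = strlen(RO_AWARE_PREFIX);
	int aware = (strncmp(ia_name, RO_AWARE_PREFIX, plen) == 0);

	/* Block legacy clients on RO platforms to avoid corruption. */
	if (!aware && platform_uses_ro)
		return (RO_CHECK_MISMATCH);

	/* In all other cases, strip the prefix if it is present. */
	*real_name = aware ? ia_name + plen : ia_name;
	return (RO_CHECK_OK);
}
```

Under these assumptions, "ibd0" is rejected on an RO platform, while
"RO_AWARE_ibd0" is accepted on any platform and opened as "ibd0".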
You may well ask whether it is necessary to be so drastic as to block
such clients. Couldn't we set strong ordering under the covers for the
non "RO_AWARE_" clients on RO platforms? We could, but it would hurt
the performance of other, innocent clients on an RO platform. (The
strict ordering PCI-E ops force all the RO ops ahead of them to
complete first, and so slow down everyone.) So on balance, with the
platform guys weighing in strongly on this issue, we think it's better
to stop RO-ignorant clients in their tracks on an RO platform. From a
call generator view, the complaining MPI people are likely to be
greatly outnumbered by the pool of potential innocent victims (who
won't even know someone else did something wrong). And a call from an
innocent victim will be hard to diagnose, and a waste of time, as they
did nothing wrong.
Man page changes: dat_ia_open(3DAT)
B. uDAPL LMR RO flag
An RO Aware client may use Local Memory Regions (LMRs) which are
specified to be either strict ordering only (not RO and suitable for
memory polling) or allowed to be RO. The possible use of RO is
considered the default, so we signal the non-RO case with a new memory
"type" DAT_MEM_TYPE_SO_VIRTUAL (SO = strong ordering). For clients
which are not "RO_AWARE_", the regular DAT_MEM_TYPE_VIRTUAL is
implicitly converted to DAT_MEM_TYPE_SO_VIRTUAL, because they may be
memory polling.
Strictly speaking this appears to be redundant, since such non RO
Aware clients only get to run on non-RO platforms (due to the
"RO_AWARE_" check). But it is a hedge against a bug in determining
whether a platform uses RO, and ensures no corruption in that
scenario. If we incorrectly allowed an RO-ignorant client to run on
an RO platform, we would potentially suffer performance issues, but
at least there would be no corruption.
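The implicit conversion for unaware clients can be sketched as below.
The enum and function names are hypothetical stand-ins for the
DAT_MEM_TYPE_VIRTUAL and DAT_MEM_TYPE_SO_VIRTUAL values; this is an
illustration of the rule, not the uDAPL code.

```c
/* Hypothetical stand-ins for the two DAT memory types discussed. */
enum mem_type {
	MEM_TYPE_VIRTUAL,       /* RO allowed (the default) */
	MEM_TYPE_SO_VIRTUAL     /* strong ordering only */
};

/*
 * Return the memory type actually used for an LMR. Clients that did
 * not declare themselves RO Aware may be memory polling, so their
 * regular virtual regions are forced to strong ordering.
 */
static enum mem_type
effective_mem_type(enum mem_type requested, int client_ro_aware)
{
	if (!client_ro_aware && requested == MEM_TYPE_VIRTUAL)
		return (MEM_TYPE_SO_VIRTUAL);
	return (requested);
}
```

An RO Aware client keeps whichever type it asked for, so it can mix
polled (strong ordering) and completion-driven (RO allowed) regions.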
The same DAT_MEM_TYPE_SO_VIRTUAL type is also passed out through
dat_lmr_query(), when memory attributes are queried.
Man page changes: dat_lmr_create(3DAT)
Note that the dat_lmr_query(3DAT) man page is not changed, as the
memory types are defined in the dat_lmr_create(3DAT) man page.
C. IBTF non-RO Memory Region flags
A uDAPL LMR create call ultimately results in calls to the IBTF
function ibt_register_mr() (TI) and the HCA driver (*ibc_register_mr)()
entry point (CI). All the registration calls now accept a new
IBT_MR_DISABLE_RO flag for strict ordering only (not RO and suitable
for memory polling). Since the recommended programming model of using
IB completions in the IB spec [1] is compatible with RO, the default
(no flag) means "RO allowed".
New registration flag in ibt_mr_flags_t:
IBT_MR_DISABLE_RO = (1 << 14) /* no flag == RO allowed */
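How a strong ordering LMR request might translate into registration
flags is sketched below. Only the IBT_MR_DISABLE_RO value is taken
from this proposal; the enum, the function name and the use of a
plain uint32_t for the flags word are assumptions made to keep the
example self-contained.

```c
#include <stdint.h>

#define IBT_MR_DISABLE_RO (1u << 14)   /* no flag == RO allowed */

/* Hypothetical stand-ins for the two DAT memory types discussed. */
enum mem_type {
	MEM_TYPE_VIRTUAL,       /* RO allowed (the default) */
	MEM_TYPE_SO_VIRTUAL     /* strong ordering only */
};

/*
 * Build the registration flags for a memory region: set the
 * disable-RO bit only for strong ordering regions, and leave the
 * other (caller supplied) flag bits untouched.
 */
static uint32_t
lmr_to_mr_flags(enum mem_type type, uint32_t base_flags)
{
	if (type == MEM_TYPE_SO_VIRTUAL)
		return (base_flags | IBT_MR_DISABLE_RO);
	return (base_flags & ~IBT_MR_DISABLE_RO);
}
```

Because the absence of the flag means "RO allowed", existing callers
that follow the completion-based model need no change.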
Man page changes: ibt_mr_attr_t(9S)
4.3 Summary of changes by man page
Man page                 Disposition    Reason for change
---------------------------------------------------------
dat_ia_open.3dat         changed        A
dat_lmr_create.3dat      changed        B
ibt_mr_attr_t.9s         changed        C
4.4 References
[1] InfiniBand Architecture Specification Volume 1, Release
1.2.1. InfiniBand Trade Association, 2007.
http://www.infinibandta.org/members/spec/V1r1_2_1.Release_12062007.zip
(requires IBTA member login)
[2] uDAPL: User Direct Access Programming Library, Version 1.2. DAT
Collaborative, 2004.
http://www.datcollaborative.org/udapl12_091504.zip
6. Resources and Schedule
6.4. Steering Committee requested information
6.4.1. Consolidation C-team Name:
ON
6.5. ARC review type: FastTrack
6.6. ARC Exposure: open