Template Version: @(#)sac_nextcase 1.68 02/23/09 SMI
This information is Copyright 2009 Sun Microsystems
1. Introduction
    1.1. Project/Component Working Name:
         IPoIB Connected Mode
    1.2. Name of Document Author/Supplier:
         Author: Kevin Ge
    1.3  Date of This Document:
         30 October, 2009
4. Technical Description
A. Overview
-----------

This case proposes changes to the Solaris kernel to provide support for
"Connected Mode" in the IPoIB driver ibd(7D) (described in [1] and [2]).

The InfiniBand Architecture [3] defines multiple "transport service
types", including Unreliable Datagram (UD), Reliable Connected (RC) and
Unreliable Connected (UC). The current ibd driver (based on [4]) runs in
"Datagram Mode" over the UD transport service type. Connected Mode
(described in [5]) can use either UC or RC. This IPoIB-CM project uses
RC in order to inter-operate with Linux, which also uses RC.

The main advantage of Connected Mode is better performance (higher
throughput and lower CPU utilization), which comes from using very large
MTUs (see section B.2 for more discussion). Its disadvantage is that it
can consume more resources, especially when scaling up to a large
cluster, because it uses a separate InfiniBand connection to each
destination.

Note that this case covers only the changes necessary to support the
IPoIB driver running in Connected Mode over RC. Other enhancements are
outside the scope of this case.

A micro/patch binding is asserted for this proposal.

B. Connected Mode IPoIB driver
------------------------------

The revised ibd(7D) driver will support both Connected and Datagram
mode. The features of the current Datagram mode ibd driver will be
inherited. The remainder of this section discusses the interface
additions for the Connected Mode capable driver.

B.1 Switching between datagram and connected mode

The existing ibd driver in OpenSolaris and Solaris 10 does not ship with
a driver .conf file. However, the Connected Mode support described in
this case introduces a new parameter 'enable_rc' that may be set via the
ibd driver .conf file. This parameter specifies whether each ibd
instance defaults to using Connected Mode over RC or not.

    # 1: unicast packets will be sent over Reliable Connected Mode
    # 0: unicast packets will be sent over Unreliable Datagram Mode
    #
    # Each element in the list below maps to the corresponding ibd
    # instance; the first element is for ibd instance 0, the second
    # element is for instance 1 and so on.
    #
    enable_rc=1,1,0,0;

Please note that Connected Mode support in IPoIB is optional as per [5].
If Connected Mode is not available for a remote node, the ibd driver
will automatically use Datagram mode for that destination. The only
meaning of 'enable_rc' is therefore to decide whether to try Connected
Mode first, and whether to advertise it as a capability supported by
this instance.

The default value of 'enable_rc' for each instance is 0. Hence, without
an ibd.conf file, Datagram mode will be used.

We intend to ship a driver .conf file for ibd in ONNV (and hence
OpenSolaris) with enable_rc set to all ones (enabling Connected Mode by
default on all instances) for the best performance.

However, for Solaris 10, we have received business guidance to take an
"opt-in" approach due to a desire for greater stability in established
enterprise environments. We will do this by not shipping the .conf file.
Solaris 10 will therefore use Datagram mode by default, and it will take
an explicit administrator action (setting enable_rc) to make Solaris 10
use Connected Mode.

OFED (Linux IB) originally made Connected Mode opt-in too, but later
made it the default. We do not intend to make it the default in
Solaris 10 at a later date. However, Solaris Next, being descended from
ONNV, will have it as the default.

An edited ibd(7D) manpage documenting this change is in the materials
directory.

B.2 Change of default MTU size

Connected Mode, by virtue of using the RC transport service type, offers
link MTUs of up to 2^31-4 octets. Thus, the use of Connected Mode can
offer benefits by supporting very large MTUs.

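The MTU figures discussed in this section can be checked with a few
lines of arithmetic. The sketch below is illustrative only and is not
part of the driver: the constant names are hypothetical, and the values
are the ones this case states (the RC limit above, plus the UD, TCP/IP
and OFED figures given in the remainder of this section).

```python
# MTU figures quoted in section B.2, in octets.
# All names here are hypothetical, chosen only for this illustration.
RC_MAX_MTU = 2**31 - 4            # upper bound offered by Connected Mode over RC
UD_MAX_MTU = 4 * 1024 - 4         # Datagram Mode limit (4K-4)
UD_COMMON_MTU = 2 * 1024 - 4      # commonly offered Datagram Mode MTU (2K-4)
IP_MAX_USEFUL_MTU = 64 * 1024 - 1   # largest size worth offering to TCP/IP (64K-1)
CM_DEFAULT_MTU = 64 * 1024 - 16   # OFED-compatible Connected Mode default (64K-16)


def mtu_ratio(large: int, small: int) -> float:
    """Return how many times larger one MTU is than another."""
    return large / small


if __name__ == "__main__":
    for name, value in [
        ("RC maximum", RC_MAX_MTU),
        ("UD maximum", UD_MAX_MTU),
        ("UD common", UD_COMMON_MTU),
        ("TCP/IP useful maximum", IP_MAX_USEFUL_MTU),
        ("Connected Mode default", CM_DEFAULT_MTU),
    ]:
        print(f"{name}: {value} octets")
    print(f"default CM / common UD: {mtu_ratio(CM_DEFAULT_MTU, UD_COMMON_MTU):.2f}")
```

With these values, the Connected Mode default is roughly 32 times the
commonly offered Datagram Mode MTU, and it is a multiple of 16, which is
consistent with the alignment reason given for the OFED choice.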
Datagram Mode using UD is limited to 4092 (4K-4) octets, though commonly
only 2044 (2K-4) is offered.

Due to the limits of the TCP/IP protocol, it makes sense to offer only
up to 65535 (64K-1) bytes. OFED (i.e. Linux IB) uses a 65520 (64K-16)
byte MTU for alignment reasons. To inter-operate with OFED at the best
performance, we also adopt 65520 as the default MTU for Connected Mode.

C. Interfaces
-------------

+-------------------------------------------------------------------+
|                        Interfaces Exported                        |
+---------------------------+------------------+--------------------+
| Interface Name            | Classification   | Comment            |
+---------------------------+------------------+--------------------+
|/kernel/drv/ibd.conf*      | Uncommitted      | Configuration file |
+---------------------------+------------------+--------------------+
* = only for OpenSolaris

D. References
-------------

[1] IP over InfiniBand, PSARC/2001/289
[2] IPoIB Conversion to GLDv3, PSARC/2007/636
[3] InfiniBand Architecture Specification Volume 1, Release 1.2.1,
    InfiniBand Trade Association, 2007.
    http://www.infinibandta.org/content/pages.php?pg=technology_download
[4] Transmission of IP over InfiniBand (IPoIB), RFC 4391, IETF,
    http://www.ietf.org/rfc/rfc4391.txt
[5] IP over InfiniBand: Connected Mode, RFC 4755, IETF,
    http://www.ietf.org/rfc/rfc4755.txt

6. Resources and Schedule
    6.4. Steering Committee requested information
        6.4.1. Consolidation C-team Name:
               ON
    6.5. ARC review type: FastTrack
    6.6. ARC Exposure: open
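
Illustrative note on section B.1: the per-instance 'enable_rc' list and
the fallback to Datagram mode can be modelled in a few lines. This is a
minimal sketch and not the driver's implementation. All function names
are hypothetical, and the treatment of an instance number beyond the end
of the list (falling back to the documented default of 0) is an
assumption made for this example.

```python
from typing import List


def parse_enable_rc(value: str) -> List[int]:
    """Parse the right-hand side of a property such as 'enable_rc=1,1,0,0;'."""
    items = value.strip().rstrip(";").split(",")
    return [int(item) for item in items if item.strip()]


def instance_enable_rc(flags: List[int], instance: int) -> int:
    """Return the setting for one ibd instance.

    Assumption: an instance not covered by the list gets the default, 0.
    """
    return flags[instance] if instance < len(flags) else 0


def select_mode(enable_rc: int, peer_supports_cm: bool) -> str:
    """Pick the mode used for unicast traffic to one destination.

    Connected Mode is tried only when it is enabled locally and the
    remote node supports it; otherwise Datagram mode is used.
    """
    if enable_rc == 1 and peer_supports_cm:
        return "connected"
    return "datagram"


if __name__ == "__main__":
    flags = parse_enable_rc("1,1,0,0;")
    for instance in range(5):
        setting = instance_enable_rc(flags, instance)
        print(instance, setting, select_mode(setting, peer_supports_cm=True))
```

The sketch keeps the per-instance lookup separate from the
per-destination decision because 'enable_rc' only expresses a
preference: the mode actually used still depends on what the remote node
supports.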