Template Version: @(#)sac_nextcase 1.70 03/30/10 SMI
This information is Copyright (c) 2010, Oracle and/or its affiliates. All 
rights reserved.
1. Introduction
    1.1. Project/Component Working Name:
         IBTF 2010.Q2 Enhancements
    1.2. Name of Document Author/Supplier:
         Author:  William Taylor
    1.3. Date of This Document:
        23 June, 2010
4. Technical Description

IBTF 2010.Q2 Enhancements
-------------------------

Table of Contents:
  4.1 Background
  4.2 Proposal
      A. CQ Interrupt Enhancements
      B. "DMA" Memory Regions
      C. Zone Addressing
      D. Corrected RDMA Limits
      E. Kernel Map of User Buf Flag      
      F. Corrected Memory Management Error
      G. "clone" Flag Restrictions
      H. Version 4 ABI Change
  4.3 Man Page Summary


4.1 Background

A number of upcoming InfiniBand projects require new capabilities from
our Solaris InfiniBand (IB) support. This case introduces new and
modified interfaces for our Solaris InfiniBand Transport Framework
(IBTF, PSARC/2002/132 and follow-on cases). The main subjects of this
case are CQ interrupt enhancements, "DMA" memory regions, and zone
addressing. The following minor items are also included:
corrected RDMA limits, kernel map of user bufs flag, corrected memory
management error, "clone" flag restrictions, and the version 4 ABI
change for IBTF.

The first consumer for CQ interrupt enhancements is expected to be
RDSv3 (PSARC/2010/043). The consumer of "DMA" memory regions is
Lustre. Zone addressing will initially be consumed by OFUV Kernel
Components (PSARC/2009/421).


4.2 Proposal

The proposal is to make additions to the IBTF Transport Interface (TI)
for IB Upper Level Protocol (ULP) clients and the Channel Interface
(CI) for HCA drivers.

All interface changes and additions in the proposal have a micro/patch
binding.


Transport Interface (ON Consolidation Private)

  ibt_free_cq_sched(): revise arguments
  ibt_free_srcip_info(): new function to free ibt_srcip_info_t results
  ibt_get_src_ip(): revise arguments
  ibt_query_cq_handler_id(): new function to query CQ handler ID attributes
  ibt_register_dma_mr(): new function for defining DMA MR

  ibt_alt_ip_path_attr_t: add new zoneid_t field
  ibt_cq_attr_t: add cq_hid field
  ibt_cq_flags_t: add IBT_CQ_HID flag
  ibt_cq_handler_attr_t: output for ibt_query_cq_handler_id()
  ibt_cq_sched_attr_t: new input struct to ibt_alloc_cq_sched()
  ibt_dmr_attr_t: new struct to define DMA MR attributes
  ibt_hca_attr_t: add hca_dip, hca_conn_rdma_read_sgl_sz,
    hca_conn_rdma_write_sgl_sz fields
  ibt_hca_flags2_t: add IBT_HCA2_DMA_MR
  ibt_iov_attr_t: add iov_alt_lkey field
  ibt_iov_flags_t: add IBT_IOV_ALT_LKEY, IBT_IOV_USER_BUF flags
  ibt_ip_path_attr_t: add new zoneid_t field
  ibt_mr_flags_t: add IBT_MR_USER_BUF
  ibt_va_flags_t: add IBT_VA_USER_BUF
  ibt_srcip_attr_t: input to ibt_get_src_ip()
  ibt_srcip_info_t: output of ibt_get_src_ip()
  ibt_status_t: add IBT_CQ_SCHED_INVALID, IBT_CQ_NO_SCHED_GROUP,
      IBT_CQ_HID_INVALID, IBT_WC_MEM_MGT_OP_ERR enum values
  ibt_version_t: add IBTI_V4
  ibt_wc_status_t: add IBT_WC_MEM_MGT_OP_ERR

Channel Interface (ON Consolidation Private)

  ibc_operations_t: add ibc_query_cq_handler_id and ibc_register_dma_mr,
      revise ibc_alloc_cq_sched and ibc_free_cq_sched
  ibc_sched_hdl_t: new type for CQ sched object
  ibt_cq_attr_t: add cq_hid field
  ibt_cq_flags_t: add IBT_CQ_HID flag
  ibt_cq_handler_attr_t: output for ibc_query_cq_handler_id
  ibt_cq_sched_attr_t: new input struct to ibc_alloc_cq_sched
  ibt_dmr_attr_t: new struct to define DMA MR attributes
  ibt_hca_attr_t: add hca_dip, hca_conn_rdma_read_sgl_sz,
    hca_conn_rdma_write_sgl_sz fields
  ibt_hca_flags2_t: add IBT_HCA2_DMA_MR
  ibt_iov_attr_t: add iov_alt_lkey field
  ibt_iov_flags_t: add IBT_IOV_ALT_LKEY, IBT_IOV_USER_BUF flags
  ibt_mr_flags_t: add IBT_MR_USER_BUF
  ibt_va_flags_t: add IBT_VA_USER_BUF 
  ibt_status_t: add IBT_CQ_SCHED_INVALID, IBT_CQ_NO_SCHED_GROUP,
      IBT_CQ_HID_INVALID, IBT_WC_MEM_MGT_OP_ERR enum values
  ibc_version_t: add IBCI_V4
  ibt_wc_status_t: add IBT_WC_MEM_MGT_OP_ERR

Because these changes cause binary incompatibility, we are also
incrementing our IBTF binary interface version number to four.

Copies of all new and modified man pages are in the materials
directory (see section 4.3 below).


A. CQ Interrupt Enhancements

PSARC/2009/060 introduced the IB version of MSI-X ("Multiple
Completion Handlers") to support RSS. The new modifications add the
ability to specify a particular interrupt or "group" and expose DDI
interrupt information.

Previously, the assignment of MSI-X interrupts was implicit and
implementation-dependent (the implementation assigned the available
HCA interrupts round-robin). This can lead to
unintended (and random) sharing of interrupts between unrelated CQs
and applications. We propose two ways to improve the situation.

First, we add a way to specify the particular vector to use when
creating a CQ. Using this mechanism allows certain related functions
(e.g. send and receive of same QP) to be handled on the same interrupt
without requiring the CQs (and the handling code) to be combined.

Add flag and field to ibt_cq_attr_t:
  IBT_CQ_HID            = 1 << 3        /* new flag in ibt_cq_flags_t */
  ibt_cq_handler_id_t   cq_hid;         /* new field in ibt_cq_attr_t */

Add error code to ibt_status_t:
  IBT_CQ_HID_INVALID    = 552,  /* CQ Handler ID invalid */
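
As a minimal sketch of how a client might place the send and receive
CQs of one QP on the same handler, consider the following. The struct
is a simplified stand-in, not the real ibt_cq_attr_t: the cq_size and
cq_flags field names and the helper function are assumptions made for
illustration. Only IBT_CQ_HID and cq_hid come from this case.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in for the IBTF handler ID type (assumed to be integral). */
typedef unsigned int ibt_cq_handler_id_t;

typedef enum ibt_cq_flags_e {
    IBT_CQ_NO_FLAGS = 0,
    IBT_CQ_HID      = 1 << 3    /* value given in this case */
} ibt_cq_flags_t;

/* Simplified stand-in; the real ibt_cq_attr_t has other fields. */
typedef struct example_cq_attr_s {
    unsigned int        cq_size;    /* assumed field name */
    ibt_cq_flags_t      cq_flags;   /* assumed field name */
    ibt_cq_handler_id_t cq_hid;     /* new field from this case */
} example_cq_attr_t;

/* Hypothetical helper: request a specific handler for a CQ. */
static void
example_bind_cq_to_handler(example_cq_attr_t *attr, ibt_cq_handler_id_t hid)
{
    assert(attr != NULL);
    /* cq_hid is only meaningful when IBT_CQ_HID is set. */
    attr->cq_flags = (ibt_cq_flags_t)(attr->cq_flags | IBT_CQ_HID);
    attr->cq_hid = hid;
}
```

Calling the helper with the same handler ID for two attribute structs
before each CQ allocation keeps both CQs on one interrupt without
merging them.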

The second mechanism allows a CQ to be assigned an interrupt
from a pre-reserved set ("group"). This feature provides a way to
separate the interrupts of different ULPs. The interface for this
feature is a modification of the previously defined (but unused)
"CQ sched" functions.

Add new flags to ibt_cq_sched_flags_t:
                                        /* if named group not found: */
  IBT_CQS_EXACT_SCHED_GROUP = 1 << 1,   /*   return error */
  IBT_CQS_SCHED_GROUP       = 1 << 2,   /*   return default group instead */

Revise ibt_cq_sched_attr_t:
  ibt_cq_sched_flags_t  cqs_flags;
  char                  *cqs_group_name;

Revise function signatures:
  ibt_status_t ibt_free_cq_sched(ibt_hca_hdl_t hca_hdl,         /* TI */
      ibt_sched_hdl_t sched_hdl); 

  ibt_status_t (*ibc_alloc_cq_sched)(ibc_hca_hdl_t hca,         /* CI */
      ibt_cq_sched_attr_t *attr, ibc_sched_hdl_t *sched_hdl_p); 

  ibt_status_t (*ibc_free_cq_sched)(ibc_hca_hdl_t hca,
      ibc_sched_hdl_t sched_hdl); 

Add error codes to ibt_status_t:
  IBT_CQ_SCHED_INVALID  = 550,  /* Invalid CQ Sched Handle */
  IBT_CQ_NO_SCHED_GROUP = 551,  /* Schedule group not found */
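
The intended difference between the two group flags can be sketched
as a lookup routine. This is an illustration only, not the HCA driver
implementation: the group table, the default index, and the function
name are hypothetical. The behavior when neither flag is set is not
stated in this case, so the sketch simply falls back to the default
group.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

typedef enum {
    IBT_SUCCESS           = 0,
    IBT_CQ_NO_SCHED_GROUP = 551   /* value given in this case */
} ibt_status_t;                   /* stand-in with two values only */

typedef enum {
    IBT_CQS_EXACT_SCHED_GROUP = 1 << 1,
    IBT_CQS_SCHED_GROUP       = 1 << 2
} ibt_cq_sched_flags_t;

typedef struct ibt_cq_sched_attr_s {
    ibt_cq_sched_flags_t  cqs_flags;
    char                  *cqs_group_name;
} ibt_cq_sched_attr_t;

#define EXAMPLE_DEFAULT_GROUP 0   /* hypothetical default group index */

/* Hypothetical lookup: find the named group or apply the flag policy. */
static ibt_status_t
example_resolve_group(const char *const *groups, size_t ngroups,
    const ibt_cq_sched_attr_t *attr, size_t *index_p)
{
    size_t i;

    assert(attr != NULL && attr->cqs_group_name != NULL && index_p != NULL);
    for (i = 0; i < ngroups; i++) {
        if (strcmp(groups[i], attr->cqs_group_name) == 0) {
            *index_p = i;
            return (IBT_SUCCESS);
        }
    }
    /* Named group not found: EXACT means fail, otherwise use default. */
    if (attr->cqs_flags & IBT_CQS_EXACT_SCHED_GROUP)
        return (IBT_CQ_NO_SCHED_GROUP);
    *index_p = EXAMPLE_DEFAULT_GROUP;
    return (IBT_SUCCESS);
}
```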

We also add a function that returns the DDI information for the
particular interrupt used by a CQ handler:

Add the dev_info_t in ibt_hca_attr_t:
  dev_info_t    *hca_dip; 

Add new ibt_cq_handler_attr_t:
  dev_info_t            *cha_dip;
  ibt_intr_handle_t     cha_ih;

Add new functions:
  ibt_status_t ibt_query_cq_handler_id(ibt_hca_hdl_t hca_hdl,   /* TI */
      ibt_cq_handler_id_t hid, ibt_cq_handler_attr_t *attrs);

  ibt_status_t (*ibc_query_cq_handler_id)(ibc_hca_hdl_t hca,    /* CI */
      ibt_cq_handler_id_t hid, ibt_cq_handler_attr_t *attrs); 

Changed man pages: ibci.9, ibti.9, ibc_alloc_cq.9e,
ibc_alloc_cq_sched.9e, ibc_modify_cq.9e, ibc_query_cq.9e,
ibc_query_cq_handler_id.9e, ibt_alloc_cq.9f, ibt_alloc_cq_sched.9f,
ibt_modify_cq.9f, ibt_query_cq.9f, ibt_query_cq_handler_id.9f,
ibc_operations_t.9s, ibt_cq_sched_attr_t.9s, ibt_hca_attr_t.9s


B. "DMA" Memory Regions

Linux (OFED) has long provided a way to define a memory region which
can use physical addresses (ib_get_dma_mr). We now provide a similar
API to aid in the porting of Lustre from Linux.

Add capability flag to ibt_hca_flags2_t to indicate support:
  IBT_HCA2_DMA_MR       = 1 << 11       /* DMA MR */

Add operation to define memory region by physical address and length:
  ibt_status_t ibt_register_dma_mr(ibt_hca_hdl_t hca_hdl,       /* TI */
      ibt_pd_hdl_t pd, ibt_dmr_attr_t *mem_attr,
      ibt_mr_hdl_t *mr_hdl_p, ibt_mr_desc_t *mem_desc); 

  ibt_status_t (*ibc_register_dma_mr)(ibc_hca_hdl_t hca,        /* CI */
      ibc_pd_hdl_t pd, ibt_dmr_attr_t *attr_p, void *ibtl_reserved,
      ibc_mr_hdl_t *mr_p, ibt_mr_desc_t *mem_desc);

Add ability to use other L_Keys (i.e. from register_dma_mr) with 
map_mem_iov mapping function:
  IBT_IOV_ALT_LKEY = (1 << 3)   /* new flag in ibt_iov_flags_t */
  ibt_lkey_t iov_alt_lkey;      /* new field in ibt_iov_attr_t */
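
A client would be expected to test the capability flag before
describing a DMA MR. The sketch below shows that check under stated
assumptions: the hca_flags2 field name and the dmr_paddr and dmr_len
fields are guesses for illustration (this case does not list the
members of ibt_dmr_attr_t), and the helper is hypothetical. The real
registration would then go through ibt_register_dma_mr().

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef enum {
    IBT_HCA2_DMA_MR = 1 << 11     /* value given in this case */
} ibt_hca_flags2_t;

/* Stand-in holding only the capability word (field name assumed). */
typedef struct example_hca_attr_s {
    ibt_hca_flags2_t hca_flags2;
} example_hca_attr_t;

/* Stand-in for ibt_dmr_attr_t; both field names are assumptions. */
typedef struct example_dmr_attr_s {
    uint64_t dmr_paddr;   /* starting physical address */
    uint64_t dmr_len;     /* length in bytes */
} example_dmr_attr_t;

/*
 * Hypothetical helper: fill in the DMA MR description only if the HCA
 * advertises support. Returns 1 on success, 0 if unsupported.
 */
static int
example_prepare_dma_mr(const example_hca_attr_t *hca, uint64_t paddr,
    uint64_t len, example_dmr_attr_t *out)
{
    assert(hca != NULL && out != NULL);
    if ((hca->hca_flags2 & IBT_HCA2_DMA_MR) == 0)
        return (0);
    out->dmr_paddr = paddr;
    out->dmr_len = len;
    return (1);
}
```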

Changed man pages: ibci.9, ibti.9, ibc_map_mem_iov.9e,
ibc_register_dma_mr.9e, ibt_map_mem_iov.9f, ibt_register_dma_mr.9f,
ibc_operations_t.9s, ibt_dmr_attr_t.9s, ibt_hca_attr_t.9s


C. Zone Addressing

This first increment of zones capability for IB concentrates on IP
addressing issues when multiple IP stacks exist. When querying for an
IB path by IP address, we add information about the zone ID (so we can
use the correct IP stack).

New fields for zoneid_t in path queries:
  zoneid_t      ipa_zoneid;     /* added to ibt_ip_path_attr_t */
  zoneid_t      apa_zoneid;     /* added to ibt_alt_ip_path_attr_t */

When dispatching connection requests, we also need to distinguish
among zones which have IPoIB instances using the same port/P_Key
combination. To do this, we revise ibt_get_src_ip() to also allow zone
information on input and output:

  typedef struct ibt_srcip_attr_s { 
      ib_gid_t        sip_gid;        /* REQUIRED: Local Port GID */ 
      zoneid_t        sip_zoneid;     /* Zero means Global Zone */ 
      ib_pkey_t       sip_pkey;       /* Optional */ 
      sa_family_t     sip_family;     /* Optional : IPv4 or IPv6 */ 
  } ibt_srcip_attr_t; 

  /* ip_flag : Flag to indicate whether list has any duplicate records. */
  #define IBT_IPADDR_NO_FLAGS     0
  #define IBT_IPADDR_DUPLICATE    1
 
  typedef struct ibt_srcip_info_s {
      ibt_ip_addr_t   ip_addr;
      zoneid_t        ip_zoneid;      /* ZoneId of this ip-addr */
      uint_t          ip_flag;        /* Flag to indicate any gotchas */
  } ibt_srcip_info_t;
 
  ibt_status_t ibt_get_src_ip(ibt_srcip_attr_t *srcip_attr,
       ibt_srcip_info_t **src_info_p, uint_t *entries_p);

   void ibt_free_srcip_info(ibt_srcip_info_t *src_info, uint_t entries);
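
As an illustration of how a caller might consume the returned list,
the sketch below selects the first entry that belongs to a given zone.
The types are stand-ins with an example_ prefix (the address type is
reduced to raw bytes), the helper is hypothetical, and skipping entries
flagged IBT_IPADDR_DUPLICATE is an assumption about how a caller might
treat that flag. In real use the list would come from ibt_get_src_ip()
and be released with ibt_free_srcip_info(list, entries).

```c
#include <assert.h>
#include <stddef.h>

typedef int example_zoneid_t;     /* stand-in for zoneid_t */

typedef struct example_ip_addr_s {
    unsigned char bytes[16];      /* stand-in for ibt_ip_addr_t */
} example_ip_addr_t;

#define IBT_IPADDR_NO_FLAGS     0
#define IBT_IPADDR_DUPLICATE    1

/* Mirrors the layout of ibt_srcip_info_t shown above. */
typedef struct example_srcip_info_s {
    example_ip_addr_t   ip_addr;
    example_zoneid_t    ip_zoneid;
    unsigned int        ip_flag;
} example_srcip_info_t;

/* Hypothetical helper: first non-duplicate entry for the given zone. */
static const example_srcip_info_t *
example_pick_srcip(const example_srcip_info_t *list, unsigned int entries,
    example_zoneid_t zone)
{
    unsigned int i;

    assert(list != NULL || entries == 0);
    for (i = 0; i < entries; i++) {
        if (list[i].ip_zoneid == zone &&
            list[i].ip_flag != IBT_IPADDR_DUPLICATE)
            return (&list[i]);
    }
    return (NULL);    /* no usable address in that zone */
}
```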

Changed man pages: ibti.9, ibt_get_src_ip.9f,
ibt_alt_ip_path_attr_t.9s, ibt_ip_path_attr_t.9s


D. Corrected RDMA Limits

The "detailed" WQE sizes defined in PSARC/2008/726 did not take
into account adapter RDMA-Read limits. We remove the misleading RDMA
"sgl overhead" quantity (hca_conn_rdma_sgl_overhead) and instead give
explicit RDMA-Read and RDMA-Write SGL limits. Added to ibt_hca_attr_t:

  uint_t        hca_conn_rdma_read_sgl_sz;      /* max RDMA-R SGL len */
  uint_t        hca_conn_rdma_write_sgl_sz;     /* max RDMA-W SGL len */
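
With explicit limits, a ULP can work out how many work requests a
transfer needs for each direction. The following is a sketch under
assumptions: the struct carries only the two new fields, and the
operation enum and helper are hypothetical names, not part of the
IBTF interfaces.

```c
#include <assert.h>
#include <stddef.h>

/* Only the two fields added by this case; the real struct has more. */
typedef struct example_hca_attr_s {
    unsigned int hca_conn_rdma_read_sgl_sz;   /* max RDMA-R SGL len */
    unsigned int hca_conn_rdma_write_sgl_sz;  /* max RDMA-W SGL len */
} example_hca_attr_t;

typedef enum {
    EXAMPLE_RDMA_READ,
    EXAMPLE_RDMA_WRITE
} example_rdma_op_t;

/*
 * Hypothetical helper: number of work requests needed to cover nsegs
 * scatter/gather entries, given the per-direction SGL limit.
 * Returns 0 if the HCA reports a zero limit for that direction.
 */
static unsigned int
example_rdma_wrs_needed(const example_hca_attr_t *hca, example_rdma_op_t op,
    unsigned int nsegs)
{
    unsigned int limit;

    assert(hca != NULL);
    limit = (op == EXAMPLE_RDMA_READ) ?
        hca->hca_conn_rdma_read_sgl_sz : hca->hca_conn_rdma_write_sgl_sz;
    if (limit == 0)
        return (0);
    /* Round up without overflowing: whole WRs plus one for any remainder. */
    return (nsegs / limit + (nsegs % limit != 0));
}
```

For example, if an adapter reported a read limit of 4 and a write
limit of 8, a 10-entry SGL would need three RDMA-Read requests but
only two RDMA-Write requests.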

Changed man pages: ibt_hca_attr_t.9s


E. Kernel Map of User Buf Flag

When a "buf" is registered, it might refer to an area of kernel or
user address space. We add an explicit flag for the case of user
address space so platform optimizations can be done during
registration (e.g. turn on relaxed ordering). Added to mr_flags in
ibt_smr_attr_t:

    IBT_IOV_USER_BUF    = (1 << 3)      /* added to ibt_iov_flags_t */
    IBT_MR_USER_BUF     = (1 << 15)     /* added to ibt_mr_flags_t */
    IBT_VA_USER_BUF     = (1 << 6)      /* added to ibt_va_flags_t */

Changed man pages: ibc_map_mem_area.9e, ibc_map_mem_iov.9e, 
ibt_map_mem_area.9f, ibt_map_mem_iov.9f, ibt_smr_attr_t.9s


F. Corrected Memory Management Error

In the IB spec Release 1.2, the meaning of the "bind" memory window
completion error was generalized as the "memory management"
error. This increased scope allowed the error to include cases such as
problems with memory registration by WR (PSARC/2009/060).

Rename error as "Memory management operation" error:
  #define IBT_WC_MEM_MGT_OP_ERR 15      /* bind plus 1.2 memory extensions */ 

But also provide an alternate name for backward compatibility:
  #define IBT_WC_MEM_WIN_BIND_ERR       IBT_WC_MEM_MGT_OP_ERR 
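
Because the old name is defined in terms of the new one, existing
code that tests for the bind error keeps compiling and now also
matches the wider class of errors. A small sketch, in which the
counting helper is hypothetical and the status values are plain ints
rather than the real ibt_wc_status_t:

```c
#include <assert.h>
#include <stddef.h>

#define IBT_WC_MEM_MGT_OP_ERR   15      /* bind plus 1.2 memory extensions */
#define IBT_WC_MEM_WIN_BIND_ERR IBT_WC_MEM_MGT_OP_ERR   /* old name */

/*
 * Hypothetical legacy-style helper: counts completions that failed
 * with the error, still spelled with the old name.
 */
static size_t
example_count_bind_errors(const int *statuses, size_t n)
{
    size_t i, count = 0;

    assert(statuses != NULL || n == 0);
    for (i = 0; i < n; i++) {
        if (statuses[i] == IBT_WC_MEM_WIN_BIND_ERR)
            count++;
    }
    return (count);
}
```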

Changed man pages: ibc_wc_status_t.9s, ibt_wc_status_t.9s


G. "clone" Flag Restrictions

When allocation of QP ranges was defined in PSARC/2009/060, most of the
attributes were inherited from the definition of the single QP
allocation. However, we have decided not to allow the IBT_ACHAN_CLONE
flag (which copies the attributes from another QP) for QP range
allocation. We have not found a real need for this flag.

Changed man pages: ibt_alloc_ud_channel_range.9f


H. Version 4 ABI Change

Because of the binary changes in this case, our interface version is
incremented to version 4.

Change the TI version number in enum ibt_version_e (ibt_version_t):
  IBTI_V4 = 4   /* TI interface version */

Change the CI version number in enum ibc_version_e (ibc_version_t):
  IBCI_V4 = 4   /* CI interface version */

Changed man pages: ibc_hca_info_t.9s, ibt_clnt_modinfo_t.9s


4.3 Man Page Summary

Man Page                        Disposition     Reasons for change
(sorted by section)                             (section of 4.2)
------------------------------------------------------------------
ibci.9                          changed         A, B
ibti.9                          changed         A, B, C

ibc_alloc_cq.9e                 changed         A
ibc_alloc_cq_sched.9e           changed         A
ibc_map_mem_area.9e             changed         E
ibc_map_mem_iov.9e              changed         B, E
ibc_modify_cq.9e                changed         A
ibc_query_cq.9e                 changed         A
ibc_query_cq_handler_id.9e      new             A
ibc_register_dma_mr.9e          new             B

ibt_alloc_cq.9f                 changed         A
ibt_alloc_cq_sched.9f           changed         A
ibt_alloc_ud_channel_range.9f   changed         G       
ibt_get_src_ip.9f               changed         C
ibt_map_mem_area.9f             changed         E
ibt_map_mem_iov.9f              changed         B, E
ibt_modify_cq.9f                changed         A
ibt_query_cq.9f                 changed         A
ibt_query_cq_handler_id.9f      new             A
ibt_register_dma_mr.9f          new             B

ibc_hca_info_t.9s               changed         H
ibc_operations_t.9s             changed         A, B
ibc_wc_status_t.9s              changed         F
ibt_alt_ip_path_attr_t.9s       changed         C
ibt_clnt_modinfo_t.9s           changed         H
ibt_cq_sched_attr_t.9s          new             A
ibt_dmr_attr_t.9s               new             B
ibt_hca_attr_t.9s               changed         A, B, D
ibt_ip_path_attr_t.9s           changed         C
ibt_smr_attr_t.9s               changed         E
ibt_wc_status_t.9s              changed         F


6. Resources and Schedule
    6.4. Steering Committee requested information
        6.4.1. Consolidation C-team Name:
                ON
    6.5. ARC review type: FastTrack
    6.6. ARC Exposure: open
