On Tue, 2015-05-26 at 14:03 +0000, Wan, Kaike wrote:
> I. Introduction
> 
> After posting our design to the mailing list, we received comments concerning
> various aspects of the design from Sean Hefty, Ira Weiny, Jason Gunthorpe, and
> Doug Ledford. Thank you all for the help.
> 
> The main issues are listed below:
> 1. Extensibility: the design should be flexible and readily extended to other
>    applications;
> 2. Multiple data records: a query can return multiple data records (e.g.,
>    multiple pathrecords);
> 3. Existing code: the design should use existing code as much as possible;
> 4. Various query points in the kernel: what are the requirements (parameters,
>    expected results) for the various queries that may exist in the kernel
>    (IPoIB, RDMA CM, etc.)?
> 
> As our subject title indicates, we are trying to design a way for the kernel
> to query a local user-space service, more specifically, for the ib_sa module
> to send a pathrecord query to a local user-space SA cache. If anyone has
> information or requirements for other kernel query points, we would be happy
> to hear about them.
> 
> In our previous design, we created a data header to contain various 
> information about the query and
> response:
> 
> struct ib_nl_data_hdr {
>       __u8    version;
>       __u8    opcode;
>       __u16   status;
>       __u16   type;
>       __u16   reserved;
>       __u32   flags;
>       __u32   length;
> };
> 
> This was modeled after the ibacm messages and the message layout is 
> diagrammed below:
> 
>   +----------------+
>   | netlink header |
>   +----------------+
>   |  Data header   |
>   +----------------+
>   |      Data      |
>   +----------------+
> 
> The design was extensible, but it did not make full use of the netlink
> message header.
> 
> In this version of the design, we will make full use of the netlink header 
> and the existing attribute
> interface, as detailed below.
> 
> II. Message layout
> 
> The general message layout is shown here:
> 
> 
>   +----------------+
>   | netlink header |
>   +----------------+
>   |  Attribute 1   |
>   +----------------+
>   |  Attribute 2   |
>   +----------------+
>   |       ...      |
>   +----------------+
>   |  Attribute N   |
>   +----------------+
> 
> The number of attributes present in a request or response varies. As shown,
> there is no new data header to describe either the request or the response.
> The netlink header and the various attributes are described in the sections
> below.
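
To make the layout concrete, here is a minimal user-space sketch in C that builds a buffer of this shape: one netlink header followed by aligned attributes. It assumes the standard definitions from <linux/netlink.h>. The helper names ls_msg_init() and ls_msg_put_attr() are hypothetical and are not part of the proposal.

```c
#include <linux/netlink.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: start an empty message at buf (must be 4-byte aligned). */
static struct nlmsghdr *ls_msg_init(void *buf, size_t cap, __u16 type, __u32 seq)
{
	struct nlmsghdr *nlh = buf;

	if (cap < (size_t)NLMSG_HDRLEN)
		return NULL;
	memset(nlh, 0, NLMSG_HDRLEN);
	nlh->nlmsg_len = NLMSG_HDRLEN;	/* header only so far */
	nlh->nlmsg_type = type;
	nlh->nlmsg_seq = seq;
	return nlh;
}

/*
 * Hypothetical helper: append one attribute after the current end of the
 * message. Returns 0 on success, -1 if the attribute does not fit.
 */
static int ls_msg_put_attr(struct nlmsghdr *nlh, size_t cap, __u16 type,
			   const void *data, size_t len)
{
	size_t off = NLMSG_ALIGN(nlh->nlmsg_len);
	size_t attr_len = (size_t)NLA_HDRLEN + len;
	size_t padded = NLA_ALIGN(attr_len);
	struct nlattr *nla;

	if (attr_len > 0xffff || off + padded > cap)
		return -1;

	nla = (struct nlattr *)((char *)nlh + off);
	nla->nla_len = (__u16)attr_len;	/* header + payload, padding excluded */
	nla->nla_type = type;
	if (len)
		memcpy((char *)nla + NLA_HDRLEN, data, len);
	/* Zero the alignment padding so no stale bytes leave the process. */
	memset((char *)nla + attr_len, 0, padded - attr_len);

	nlh->nlmsg_len = (__u32)(off + padded);
	return 0;
}
```

Each call to ls_msg_put_attr() grows nlmsg_len by the padded attribute size, so the receiver can walk the attributes without a separate data header.
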
> 
> III. Netlink protocol, multicast group, and kernel client
> 
> This design is targeted to the NETLINK_RDMA protocol, and a new multicast 
> group RDMA_NL_GROUP_LS is
> added for the local service:
> 
> enum {
>       RDMA_NL_GROUP_CM = 1,
>       RDMA_NL_GROUP_IWPM,
>       RDMA_NL_GROUP_LS,
>       RDMA_NL_NUM_GROUPS
> };
> 
> In addition, each kernel client should define a client index so that the
> common rdma code can route the response to the right client. For this purpose,
> we define the RDMA_NL_SA client for the ib_sa module:
> 
> enum {
>       RDMA_NL_RDMA_CM = 1,
>       RDMA_NL_NES,
>       RDMA_NL_C4IW,
>       RDMA_NL_SA,
>       RDMA_NL_NUM_CLIENTS
> };
> 
> As mentioned previously, each query point in the kernel should have its own 
> client index.
> 
> IV. Netlink message header
> 
> The netlink header is copied here:
> 
> struct nlmsghdr {
>       __u32           nlmsg_len;      /* Length of message including header */
>       __u16           nlmsg_type;     /* Message content */
>       __u16           nlmsg_flags;    /* Additional flags */
>       __u32           nlmsg_seq;      /* Sequence number */
>       __u32           nlmsg_pid;      /* Sending process port ID */
> };
> 
> The message type for rdma clients is also copied below:
> 
> #define RDMA_NL_GET_TYPE(client, op) ((client << 10) + op)
> 
> More clearly:
> 
>     Bits      Description
>    --------------------------
>     15-10       Client index
>     09-00       Opcode
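
The bit split in the table can be checked with a short sketch. The helper names below (ls_nl_make_type() and friends) are hypothetical, chosen for illustration only. The client value 4 is what RDMA_NL_SA works out to in the enum above.

```c
#include <stdint.h>

#define LS_NL_OP_BITS	10				/* bits 09-00 */
#define LS_NL_OP_MASK	((1u << LS_NL_OP_BITS) - 1)	/* 0x03ff */

/* Pack a client index (bits 15-10) and an opcode (bits 09-00). */
static inline uint16_t ls_nl_make_type(unsigned int client, unsigned int op)
{
	return (uint16_t)((client << LS_NL_OP_BITS) | (op & LS_NL_OP_MASK));
}

/* Recover the client index from nlmsg_type. */
static inline unsigned int ls_nl_type_client(uint16_t type)
{
	return type >> LS_NL_OP_BITS;
}

/* Recover the opcode from nlmsg_type. */
static inline unsigned int ls_nl_type_op(uint16_t type)
{
	return type & LS_NL_OP_MASK;
}
```

With six bits for the client and ten for the opcode, client 4 with opcode 1 encodes as 0x1001.
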
> 
> As described previously, a netlink message is routed by protocol
> (NETLINK_RDMA), multicast group (RDMA_NL_GROUP_LS), and client (encoded in the
> nlmsg_type field for rdma messages). Therefore, the opcode (encoded in
> nlmsg_type), the sequence number (nlmsg_seq), and the additional flags
> (nlmsg_flags) are all local to the client. This matters when we define these
> fields, because their values can overlap between different clients.
> 
> (1) Opcode
> 
> The opcodes for the local service SA client are defined below:
> 
> enum {
>       RDMA_NL_LS_OP_RESOLVE = 0,
>       RDMA_NL_LS_OP_SET_TIMEOUT,
>       RDMA_NL_LS_NUM_OPS
> };
> 
> The RESOLVE opcode is used by ib_sa to send a pathrecord query to the
> user-space application, while the SET_TIMEOUT opcode can be used by the
> user-space application to set the netlink timeout value for the kernel client.
> Additional opcodes can be added if necessary.
> 
> It should be emphasized that the opcode is client specific, so opcode values
> can overlap between different clients. The 10 opcode bits (up to 1024 opcodes
> per client) should therefore be large enough for the various requests.
> 
> (2) nlmsg_flags
> 
> The flags field is again client specific. The lower byte (bits 7-0) is
> generally reserved, and the upper bits can be used to define request-specific
> flags:
> 
> #define RDMA_NL_LS_F_OK       0x0100  /* Success response */
> #define RDMA_NL_LS_F_ERR      0x0200  /* Failed response */
> 
> These two bits can be used to indicate whether a message is a response. If
> the ERR bit is set, an error code can be carried in a status attribute, as
> described below.
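
A receiver could classify a message from these two bits roughly as follows. This is a sketch only: the enum ls_msg_kind and ls_classify() are hypothetical names, and treating "both bits set" as malformed is an assumption that the proposal does not state.

```c
#include <stdint.h>

#define RDMA_NL_LS_F_OK		0x0100	/* Success response */
#define RDMA_NL_LS_F_ERR	0x0200	/* Failed response */

/* Hypothetical classification of a message by its response bits. */
enum ls_msg_kind {
	LS_MSG_REQUEST,		/* neither bit set */
	LS_MSG_RESP_OK,		/* only RDMA_NL_LS_F_OK set */
	LS_MSG_RESP_ERR,	/* only RDMA_NL_LS_F_ERR set */
	LS_MSG_MALFORMED	/* both set: assumed invalid */
};

static enum ls_msg_kind ls_classify(uint16_t nlmsg_flags)
{
	/* Only the two response bits are examined; bits 7-0 are ignored. */
	switch (nlmsg_flags & (RDMA_NL_LS_F_OK | RDMA_NL_LS_F_ERR)) {
	case 0:
		return LS_MSG_REQUEST;
	case RDMA_NL_LS_F_OK:
		return LS_MSG_RESP_OK;
	case RDMA_NL_LS_F_ERR:
		return LS_MSG_RESP_ERR;
	default:
		return LS_MSG_MALFORMED;
	}
}
```
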
> 
> (3) Attribute type
> 
> Request parameters and response data records can be embedded in attributes.
> 
> The attribute header is copied here:
> 
> struct nlattr {
>       __u16           nla_len;
>       __u16           nla_type;
> };
> 
> Each attribute is preceded by the attribute header and followed by attribute 
> specific data.
> 
> Note that the attribute type is request (opcode) specific and can therefore
> be reused with a different meaning for different requests if needed.
> 
> For the ib_sa RESOLVE query, the following attribute types are defined:
> 
> enum {
>       LS_NLA_TYPE_STATUS = 0,
>       LS_NLA_TYPE_ADDRESS,
>       LS_NLA_TYPE_PATH_RECORD,
>       LS_NLA_TYPE_MAX
> };
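
Since every attribute carries its own length and type, a receiver can scan the message for a given type. The sketch below assumes the standard <linux/netlink.h> definitions and a message laid out as a netlink header followed directly by attributes. ls_find_attr() is a hypothetical helper name.

```c
#include <linux/netlink.h>
#include <stddef.h>

enum {
	LS_NLA_TYPE_STATUS = 0,
	LS_NLA_TYPE_ADDRESS,
	LS_NLA_TYPE_PATH_RECORD,
	LS_NLA_TYPE_MAX
};

/*
 * Hypothetical helper: return the first attribute of the given type,
 * or NULL if the message has none or is truncated.
 */
static const struct nlattr *ls_find_attr(const struct nlmsghdr *nlh, __u16 type)
{
	const char *pos = (const char *)nlh + NLMSG_HDRLEN;
	size_t rem;

	if (nlh->nlmsg_len < (__u32)NLMSG_HDRLEN)
		return NULL;
	rem = nlh->nlmsg_len - NLMSG_HDRLEN;

	while (rem >= (size_t)NLA_HDRLEN) {
		const struct nlattr *nla = (const struct nlattr *)pos;
		size_t step;

		/* Reject attributes that are too short or overrun the message. */
		if (nla->nla_len < NLA_HDRLEN || nla->nla_len > rem)
			break;
		if ((nla->nla_type & NLA_TYPE_MASK) == type)
			return nla;

		/* Attributes start on NLA_ALIGNTO (4-byte) boundaries. */
		step = NLA_ALIGN(nla->nla_len);
		if (step > rem)
			break;
		pos += step;
		rem -= step;
	}
	return NULL;
}
```

A response carrying several pathrecords would simply repeat LS_NLA_TYPE_PATH_RECORD attributes; the same loop can be continued from the returned attribute to visit each one.
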
> 
> (4) Status attribute
> 
> The status attribute is mostly used to carry an error code when the
> RDMA_NL_LS_F_ERR bit is set in the nlmsg_flags field of the netlink message
> header. If the response is a success, there is no need to include this
> attribute in the response data (although including it is not an error).
> 
> enum {
>       LS_NLA_STATUS_SUCCESS = 0,
>       LS_NLA_STATUS_INVAL,
>       LS_NLA_STATUS_ENODATA,
>       LS_NLA_STATUS_MAX
> };
> 
> struct rdma_nla_ls_status {
>       __u32           status;
> };
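
Reading the status out of such an attribute might look like the sketch below. It assumes the status value is in host byte order and that values at or above LS_NLA_STATUS_MAX are rejected; neither point is specified in the proposal. ls_read_status() is a hypothetical helper name.

```c
#include <linux/netlink.h>
#include <stddef.h>
#include <string.h>

enum {
	LS_NLA_STATUS_SUCCESS = 0,
	LS_NLA_STATUS_INVAL,
	LS_NLA_STATUS_ENODATA,
	LS_NLA_STATUS_MAX
};

struct rdma_nla_ls_status {
	__u32		status;
};

/*
 * Hypothetical helper: copy the status code out of a status attribute.
 * Returns 0 on success, -1 if the attribute is missing, too short, or
 * carries an unknown status value.
 */
static int ls_read_status(const struct nlattr *nla, __u32 *status)
{
	struct rdma_nla_ls_status st;

	if (!nla || nla->nla_len < (size_t)NLA_HDRLEN + sizeof(st))
		return -1;

	/* memcpy avoids assuming anything about payload alignment. */
	memcpy(&st, (const char *)nla + NLA_HDRLEN, sizeof(st));
	if (st.status >= LS_NLA_STATUS_MAX)
		return -1;

	*status = st.status;
	return 0;
}
```
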
> 
> (5) Address attribute
> 
> This attribute is normally included in the RESOLVE request.
> 
> enum {
>       LS_NLA_ADDR_F_SRC               = 1,
>       LS_NLA_ADDR_F_DST               = (1<<1),
>       LS_NLA_ADDR_F_HOSTNAME          = (1<<2),
>       LS_NLA_ADDR_F_IPV4              = (1<<3),
>       LS_NLA_ADDR_F_IPV6              = (1<<4)
> };
> 
> struct rdma_nla_ls_addr {
>       __u32           flags;
>       __u32           addr[0];
> };
> 
> The address can be a hostname (string), an IPv4 address, or an IPv6 address.
> The source and destination flags are also defined.
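
As an illustration, the payload of an address attribute could be filled in as sketched below. The flag values are copied from the enum above, assuming LS_NLA_ADDR_F_HOSTNAME is intended to be bit 2. The length rules (4 bytes for IPv4, 16 for IPv6, a NUL-terminated string for a hostname) are assumptions, as is the hypothetical helper name ls_fill_addr(). The trailing array is written as a C99 flexible array member.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

enum {
	LS_NLA_ADDR_F_SRC	= 1,
	LS_NLA_ADDR_F_DST	= (1 << 1),
	LS_NLA_ADDR_F_HOSTNAME	= (1 << 2),
	LS_NLA_ADDR_F_IPV4	= (1 << 3),
	LS_NLA_ADDR_F_IPV6	= (1 << 4)
};

struct rdma_nla_ls_addr {
	uint32_t	flags;
	uint32_t	addr[];		/* address bytes follow the flags */
};

/*
 * Hypothetical helper: write flags plus address bytes into buf.
 * Returns the payload length in bytes, or 0 if the input is rejected.
 */
static size_t ls_fill_addr(void *buf, size_t cap, uint32_t flags,
			   const void *addr, size_t addr_len)
{
	struct rdma_nla_ls_addr *a = buf;
	size_t need = sizeof(*a) + addr_len;

	if (!addr || addr_len == 0 || need > cap)
		return 0;
	/* Assumed length rules for each address format. */
	if ((flags & LS_NLA_ADDR_F_IPV4) && addr_len != 4)
		return 0;
	if ((flags & LS_NLA_ADDR_F_IPV6) && addr_len != 16)
		return 0;
	if ((flags & LS_NLA_ADDR_F_HOSTNAME) &&
	    ((const char *)addr)[addr_len - 1] != '\0')
		return 0;

	a->flags = flags;
	memcpy(a->addr, addr, addr_len);
	return need;
}
```
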
> 
> (6) Pathrecord attribute
> 
> This attribute can be included in both the RESOLVE request and response.
> 
> enum {
>       LS_NLA_PATH_F_GMP               = 1,
>       LS_NLA_PATH_F_PRIMARY           = (1<<1),
>       LS_NLA_PATH_F_ALTERNATE         = (1<<2),
>       LS_NLA_PATH_F_OUTBOUND          = (1<<3),
>       LS_NLA_PATH_F_INBOUND           = (1<<4),
>       LS_NLA_PATH_F_INBOUND_REVERSE   = (1<<5),
>       LS_NLA_PATH_F_BIDIRECTIONAL     = LS_NLA_PATH_F_OUTBOUND |
>                                         LS_NLA_PATH_F_INBOUND_REVERSE,
>       LS_NLA_PATH_F_USER              = (1<<6)
> };
> 
> struct rdma_nla_ls_path_rec {
>       __u32   flags;
>       __u32   path_rec[0];
> };
> 
> The format of the pathrecord is indicated by the flags, and the data is
> contained in path_rec[]. For example, when LS_NLA_PATH_F_USER is set, the
> format is struct ib_user_path_rec.
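
The flag tests a consumer would perform can be sketched as follows. The values are copied from the enum above, assuming LS_NLA_PATH_F_USER is intended to be bit 6 and that the bidirectional flag is the OR of the outbound and inbound-reverse bits. The two helper names are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

enum {
	LS_NLA_PATH_F_GMP		= 1,
	LS_NLA_PATH_F_PRIMARY		= (1 << 1),
	LS_NLA_PATH_F_ALTERNATE		= (1 << 2),
	LS_NLA_PATH_F_OUTBOUND		= (1 << 3),
	LS_NLA_PATH_F_INBOUND		= (1 << 4),
	LS_NLA_PATH_F_INBOUND_REVERSE	= (1 << 5),
	LS_NLA_PATH_F_BIDIRECTIONAL	= LS_NLA_PATH_F_OUTBOUND |
					  LS_NLA_PATH_F_INBOUND_REVERSE,
	LS_NLA_PATH_F_USER		= (1 << 6)
};

/* A path is bidirectional only if both component bits are set. */
static bool ls_path_is_bidirectional(uint32_t flags)
{
	return (flags & LS_NLA_PATH_F_BIDIRECTIONAL) ==
	       LS_NLA_PATH_F_BIDIRECTIONAL;
}

/* True when path_rec[] should be read as struct ib_user_path_rec. */
static bool ls_path_is_user_format(uint32_t flags)
{
	return (flags & LS_NLA_PATH_F_USER) != 0;
}
```

Because LS_NLA_PATH_F_BIDIRECTIONAL is a two-bit mask, a plain non-zero test would also match a path that is only outbound, which is why the helper compares against the full mask.
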
> 
> V. Summary
> 
> This design is flexible and extensible, and it can be enhanced to address
> other kernel query points. It uses the existing netlink message header and
> attribute interface, and a message can contain multiple attribute records.
> 
> 
> 
> Changes since v1:
> -- Completely revised the design to use the netlink header and attribute
> interface.

On the face of it, this is a much improved design.

-- 
Doug Ledford <dledf...@redhat.com>
              GPG KeyID: 0E572FDD


Reply via email to