On Tue, 2015-05-26 at 14:03 +0000, Wan, Kaike wrote:
> I. Introduction
>
> After posting our design to the mailing list, we received comments
> concerning various aspects of the design from Sean Hefty, Ira Weiny,
> Jason Gunthorpe, and Doug Ledford. Thank you all for the help.
>
> The main issues are listed below:
> 1. Extensibility: the design should be flexible and readily extended to
>    other applications;
> 2. Multiple data records: a query can return multiple data records
>    (e.g., multiple pathrecords);
> 3. Existing code: the design should use existing code as much as possible;
> 4. Various query points in the kernel: what are the requirements
>    (parameters, expected results) for the various queries that may exist
>    in the kernel (IPoIB, RDMA CM, etc.).
>
> As our subject title indicates, we are trying to design for the kernel to
> query a local user-space service, more specifically, for the ib_sa module
> to send a pathrecord query to a local user-space SA cache. If anyone has
> information or requirements for other kernel query points, we will be
> happy to know.
>
> In our previous design, we created a data header to contain various
> information about the query and response:
>
>     struct ib_nl_data_hdr {
>         __u8  version;
>         __u8  opcode;
>         __u16 status;
>         __u16 type;
>         __u16 reserved;
>         __u32 flags;
>         __u32 length;
>     };
>
> This was modeled after the ibacm messages, and the message layout is
> diagrammed below:
>
>     +----------------+
>     | netlink header |
>     +----------------+
>     |  Data header   |
>     +----------------+
>     |      Data      |
>     +----------------+
>
> The design was extensible, but suffered from the fact that it did not make
> full use of the netlink message header.
>
> In this version of the design, we will make full use of the netlink header
> and the existing attribute interface, as detailed below.
>
> II. Message layout
>
> The general message layout is shown here:
>
>     +----------------+
>     | netlink header |
>     +----------------+
>     |  Attribute 1   |
>     +----------------+
>     |  Attribute 2   |
>     +----------------+
>     |      ...       |
>     +----------------+
>     |  Attribute N   |
>     +----------------+
>
> The number of attributes present in the request/response varies. As shown,
> there is no new data header to describe either the request or the
> response. The netlink header and various attributes will be described
> later.
>
> III. Netlink protocol, multicast group, and kernel client
>
> This design is targeted to the NETLINK_RDMA protocol, and a new multicast
> group RDMA_NL_GROUP_LS is added for the local service:
>
>     enum {
>         RDMA_NL_GROUP_CM = 1,
>         RDMA_NL_GROUP_IWPM,
>         RDMA_NL_GROUP_LS,
>         RDMA_NL_NUM_GROUPS
>     };
>
> In addition, each kernel client should define a client index so that the
> common rdma code can route the response to the right client. For this
> purpose, we define the RDMA_NL_SA client for the ib_sa module:
>
>     enum {
>         RDMA_NL_RDMA_CM = 1,
>         RDMA_NL_NES,
>         RDMA_NL_C4IW,
>         RDMA_NL_SA,
>         RDMA_NL_NUM_CLIENTS
>     };
>
> As mentioned previously, each query point in the kernel should have its
> own client index.
>
> IV. Netlink message header
>
> The netlink header is copied here:
>
>     struct nlmsghdr {
>         __u32 nlmsg_len;    /* Length of message including header */
>         __u16 nlmsg_type;   /* Message content */
>         __u16 nlmsg_flags;  /* Additional flags */
>         __u32 nlmsg_seq;    /* Sequence number */
>         __u32 nlmsg_pid;    /* Sending process port ID */
>     };
>
> The message type for rdma clients is also copied below:
>
>     #define RDMA_NL_GET_TYPE(client, op) ((client << 10) + op)
>
> More clearly:
>
>     Bits    Description
>     --------------------------
>     15-10   Client index
>     09-00   Opcode
>
> As described previously, a netlink message is routed by protocol
> (NETLINK_RDMA), multicast group (RDMA_NL_GROUP_LS), and client (encoded
> in the nlmsg_type field for rdma messages).
> Therefore, the opcode (encoded in nlmsg_type), the sequence number
> (nlmsg_seq), and the additional flags (nlmsg_flags) are all local to the
> client. This is important when we define these fields, as they can
> overlap for different clients.
>
> (1) Opcode
>
> The opcode for the local service SA client is defined below:
>
>     enum {
>         RDMA_NL_LS_OP_RESOLVE = 0,
>         RDMA_NL_LS_OP_SET_TIMEOUT,
>         RDMA_NL_LS_NUM_OPS
>     };
>
> The RESOLVE opcode is used by ib_sa to send a pathrecord query to the
> user-space application, while the SET_TIMEOUT opcode can be used by the
> user-space application to set the netlink timeout value for the kernel
> client. Additional opcodes can be added if necessary.
>
> It should be emphasized that the opcode is client specific, so opcode
> values can overlap between different clients. Therefore, the 10 bits
> should be large enough for various requests.
>
> (2) nlmsg_flags
>
> The flags field is again client specific. However, the lower byte
> (bits 7-0) is generally reserved, and the upper bits can be used to define
> request-specific flags:
>
>     #define RDMA_NL_LS_F_OK   0x0100  /* Success response */
>     #define RDMA_NL_LS_F_ERR  0x0200  /* Failed response */
>
> These two bits can be used to indicate whether a message is a response. If
> the status is ERR, an error code can be contained in a status attribute,
> as described below.
>
> (3) Attribute type
>
> Request parameters and response data records can be embedded in
> attributes.
>
> The attribute header is copied here:
>
>     struct nlattr {
>         __u16 nla_len;
>         __u16 nla_type;
>     };
>
> Each attribute is preceded by the attribute header and followed by
> attribute-specific data.
>
> Note that the attribute type is request (opcode) specific and therefore
> could be overloaded for different requests if needed.
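
A minimal sketch of the nlmsg_type and nlmsg_flags conventions described above, in plain user-space C. The enum and flag values are copied from the proposal; the helpers ls_type_client(), ls_type_op(), and ls_is_response() are hypothetical names introduced only for illustration, and the macro is parenthesized more defensively than the quoted one.

```c
#include <stdint.h>

/* Values copied from the proposal above. */
enum { RDMA_NL_RDMA_CM = 1, RDMA_NL_NES, RDMA_NL_C4IW, RDMA_NL_SA,
       RDMA_NL_NUM_CLIENTS };
enum { RDMA_NL_LS_OP_RESOLVE = 0, RDMA_NL_LS_OP_SET_TIMEOUT,
       RDMA_NL_LS_NUM_OPS };

#define RDMA_NL_LS_F_OK   0x0100  /* Success response */
#define RDMA_NL_LS_F_ERR  0x0200  /* Failed response */

/* Same layout as the quoted macro: client in bits 15-10, opcode in bits 9-0. */
#define RDMA_NL_GET_TYPE(client, op) (((client) << 10) + (op))

/* Hypothetical helper: extract the client index (bits 15-10). */
static inline uint16_t ls_type_client(uint16_t type)
{
    return (uint16_t)(type >> 10);
}

/* Hypothetical helper: extract the opcode (bits 9-0). */
static inline uint16_t ls_type_op(uint16_t type)
{
    return (uint16_t)(type & 0x03ff);
}

/* Hypothetical helper: a message is a response if either OK or ERR is set. */
static inline int ls_is_response(uint16_t flags)
{
    return (flags & (RDMA_NL_LS_F_OK | RDMA_NL_LS_F_ERR)) != 0;
}
```

With these values, RDMA_NL_GET_TYPE(RDMA_NL_SA, RDMA_NL_LS_OP_RESOLVE) is 0x1000, since RDMA_NL_SA is 4 and 4 << 10 is 0x1000. Another client could reuse opcode 0 without a clash because its client index occupies different high bits.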
>
> For the ib_sa RESOLVE query, the following attribute types are defined:
>
>     enum {
>         LS_NLA_TYPE_STATUS = 0,
>         LS_NLA_TYPE_ADDRESS,
>         LS_NLA_TYPE_PATH_RECORD,
>         LS_NLA_TYPE_MAX
>     };
>
> (4) Status attribute
>
> The status attribute is mostly used to carry an error code if the
> RDMA_NL_LS_F_ERR bit in the nlmsg_flags field of the netlink message
> header is set. If the response indicates success, there is no need to
> include this attribute in the response data (including it is not an
> error, either).
>
>     enum {
>         LS_NLA_STATUS_SUCCESS = 0,
>         LS_NLA_STATUS_INVAL,
>         LS_NLA_STATUS_ENODATA,
>         LS_NLA_STATUS_MAX
>     };
>
>     struct rdma_nla_ls_status {
>         __u32 status;
>     };
>
> (5) Address attribute
>
> This attribute is normally included in the RESOLVE request.
>
>     enum {
>         LS_NLA_ADDR_F_SRC      = 1,
>         LS_NLA_ADDR_F_DST      = (1<<1),
>         LS_NLA_ADDR_F_HOSTNAME = (1<<2),
>         LS_NLA_ADDR_F_IPV4     = (1<<3),
>         LS_NLA_ADDR_F_IPV6     = (1<<4)
>     };
>
>     struct rdma_nla_ls_addr {
>         __u32 flags;
>         __u32 addr[0];
>     };
>
> The address can be a hostname (string), an IPv4 address, or an IPv6
> address. The source and destination flags are also defined.
>
> (6) Pathrecord attribute
>
> This attribute can be included in both the RESOLVE request and response.
>
>     enum {
>         LS_NLA_PATH_F_GMP             = 1,
>         LS_NLA_PATH_F_PRIMARY         = (1<<1),
>         LS_NLA_PATH_F_ALTERNATE       = (1<<2),
>         LS_NLA_PATH_F_OUTBOUND        = (1<<3),
>         LS_NLA_PATH_F_INBOUND         = (1<<4),
>         LS_NLA_PATH_F_INBOUND_REVERSE = (1<<5),
>         LS_NLA_PATH_F_BIDIRECTIONAL   = LS_NLA_PATH_F_OUTBOUND |
>                                         LS_NLA_PATH_F_INBOUND_REVERSE,
>         LS_NLA_PATH_F_USER            = (1<<6)
>     };
>
>     struct rdma_nla_ls_path_rec {
>         __u32 flags;
>         __u32 path_rec[0];
>     };
>
> The format of the pathrecord is indicated by the flags, and the data is
> contained in path_rec[]. For example, when LS_NLA_PATH_F_USER is set, the
> format is struct ib_user_path_rec.
>
> V. Summary
>
> This design is flexible and extensible, and can be readily enhanced to
> address various kernel query points. It uses the existing netlink message
> header and attribute interface, and a single message can contain multiple
> attribute records.
>
> Changes since v1:
> -- Completely revised the design to use the netlink header and attribute
>    interface.
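
A rough sketch of how the attribute layout above could be packed and walked in a flat buffer. It assumes the usual netlink attribute convention: nla_len counts the 4-byte header plus the payload, and each attribute is padded to a 4-byte boundary. struct ls_nlattr is a local copy of the quoted struct nlattr, and ls_put_attr() and ls_count_attrs() are hypothetical helpers, not part of the proposal or of any kernel API.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Local copy of the attribute header quoted in the proposal. */
struct ls_nlattr {
    uint16_t nla_len;   /* header + payload, excluding padding */
    uint16_t nla_type;
};

/* Attribute types and address flags copied from the proposal. */
enum { LS_NLA_TYPE_STATUS = 0, LS_NLA_TYPE_ADDRESS,
       LS_NLA_TYPE_PATH_RECORD, LS_NLA_TYPE_MAX };
enum { LS_NLA_ADDR_F_SRC = 1, LS_NLA_ADDR_F_DST = (1 << 1),
       LS_NLA_ADDR_F_HOSTNAME = (1 << 2), LS_NLA_ADDR_F_IPV4 = (1 << 3),
       LS_NLA_ADDR_F_IPV6 = (1 << 4) };

/* Assumption: attributes are aligned to 4 bytes, as in standard netlink. */
#define LS_NLA_ALIGN(len) (((size_t)(len) + 3) & ~(size_t)3)

/* Append one attribute at offset off; returns the new offset, or 0 if the
 * attribute does not fit in a buffer of cap bytes. */
static size_t ls_put_attr(uint8_t *buf, size_t off, size_t cap,
                          uint16_t type, const void *data, size_t len)
{
    struct ls_nlattr hdr;
    size_t total = sizeof(hdr) + len;
    size_t padded = LS_NLA_ALIGN(total);

    if (total > UINT16_MAX || off > cap || padded > cap - off)
        return 0;
    hdr.nla_len = (uint16_t)total;
    hdr.nla_type = type;
    memset(buf + off, 0, padded);            /* zero the padding bytes too */
    memcpy(buf + off, &hdr, sizeof(hdr));
    if (len)
        memcpy(buf + off + sizeof(hdr), data, len);
    return off + padded;
}

/* Count the attributes of a given type, e.g. the pathrecords in a response. */
static int ls_count_attrs(const uint8_t *buf, size_t len, uint16_t type)
{
    size_t off = 0;
    int n = 0;

    while (off < len && len - off >= sizeof(struct ls_nlattr)) {
        struct ls_nlattr hdr;

        memcpy(&hdr, buf + off, sizeof(hdr));
        if (hdr.nla_len < sizeof(hdr) || hdr.nla_len > len - off)
            break;                           /* malformed or truncated */
        if (hdr.nla_type == type)
            n++;
        off += LS_NLA_ALIGN(hdr.nla_len);
    }
    return n;
}
```

A RESOLVE request would then carry one LS_NLA_TYPE_ADDRESS attribute whose payload is the flags word followed by the address, and a response could carry several LS_NLA_TYPE_PATH_RECORD attributes, which is how the layout addresses the multiple-data-records requirement.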

On the face of it, this is a much improved design.

-- 
Doug Ledford <dledf...@redhat.com>
GPG KeyID: 0E572FDD