This is version 2 of the proposal, addressing comments from version 1.

Changelog:
- Use oflags to make API smaller
- Clarify sharing semantics
- Add documentation

This is the API proposal for support of the SRC (scalable reliable connected)
protocol extension in libibverbs.

This adds APIs to:
- manage SRC domains
- share SRC domains between processes, by means of creating a 1:1 association
  between an SRC domain and an inode.

Notes:
- The inode is specified by means of a file descriptor; this makes it possible
  for the user to manage file creation/deletion in the most flexible manner
  (e.g. tmpfile can be used).
- I envision implementing this sharing mechanism in the kernel by means of a
  per-device tree, with the inode as key and the domain object as value.

Please comment.

Signed-off-by: Michael S. Tsirkin <[EMAIL PROTECTED]>

----

diff --git a/SRC.txt b/SRC.txt
new file mode 100644
index 0000000..3881477
--- /dev/null
+++ b/SRC.txt
@@ -0,0 +1,133 @@
+Here's some documentation on Scalable Reliable Connections.
+
+			* * *
+
+SRC is an extension supported by recent Mellanox hardware
+which is geared toward reducing the number of QPs
+required for all-to-all communication on systems
+with a high number of jobs per node.
+
+===================================================================
+Motivation:
+===================================================================
+Given N nodes with J jobs per node, number of QPs required
+for all-to-all communication is:
+
+With RC:
+	O((N * J) ^ 2)
+
+	Since each job out of O(N * J) jobs must create a single QP
+	to communicate with each one of O(N * J) other jobs.
+
+With SRC:
+	O(N ^ 2 * J)
+
+	This is achieved by using a single send queue (per job, out of O(N * J) jobs)
+	to send data to all J jobs running on a specific node (out of O(N) nodes).
+	Hardware uses new "SRQ number" field in packet header to
+	multiplex receive WRs and WCs to private memory of each job.
+
+This is a similar idea to IB RD.
+Q: Why not use RD then?
+A: Because no hardware supports it.
+
+Details:
+
+===================================================================
+Verbs extension:
+===================================================================
+
+- There is a new transport/QP type "SRC".
+- There is a new object type "SRC domain"
+- Each SRQ gets new (optional) attributes:
+	SRC domain
+	SRC SRQ number
+	SRC CQ
+  SRQ must have either all 3 of these or none of these attributes
+
+- QPs of type SRC have all the same attributes as regular RC QPs
+  connected to SRQ, except that:
+	A. Each SRC QP has a new required attribute "SRC domain"
+	B. SRC QPs do *not* have "SRQ" attribute
+	   (do not have a specific SRQ associated with them)
+
+===================================================================
+Protocol extension:
+===================================================================
+SRC QP behaviour: Requestor
+- Post send WR for this QP type is extended with SRQ number field
+  This number is sent as part of packet header
+- SRC Packets follow rules for RC packets on the wire, exactly
+  What is different is their handling at the responder side
+
+SRC QP behaviour: Responder
+Each incoming packet passes transport checks with respect
+to the SRC QP, following RC rules, exactly.
+
+After this, SRQ number in packet header is used to look up
+a specific SRQ. SRC domain of the resulting SRQ must be equal
+to SRC domain of the QP, otherwise a NAK is sent,
+and QP moves to error state.
+
+If the SRC domains match, receive WR and receive WC processing
+are as follows:
+
+- RC Send
+	- Rather than using SRQ to which the QP is attached,
+	  SRQ is looked up by SRQ number in the packet.
+	  Receive WR is taken from this SRQ.
+	- Completions are generated on the CQ specified in the SRQ
+
+- RDMA/Atomic
+	- Rather than using PD to which the QP is attached,
+	  SRQ is looked up by SRQ number in the packet.
+	  PD of this SRQ is used for protection checks.
+
+===================================================================
+Pseudo code:
+===================================================================
+
+Consider again a setup where there are N nodes with J jobs per node.
+All N * J jobs need to perform all-to-all communication.
+Using RC QPs, this would call for O((N * J) ^ 2) QPs.
+Here is how SRC can be used to reduce the number of QPs to O(N ^ 2 * J).
+
+At startup:
+1. All jobs on each node share a single SRC domain
+2. Each job creates a CQ for receive WCs
+3. Each job creates a SRQ attached to this CQ and to the shared domain
+
+When job j1 needs to transmit to job j2 on remote node n for the first time:
+1. Test: does job j1 have an existing connection to some job on node n?
+	- If no:
+		j1 creates an SRC QP qp1 (send QP)
+			qp1 is only used to post send WRs
+		j2 creates an SRC QP qp2
+			qp2 is part of SRC domain
+			qp2 is only used to do transport checks:
+			neither send nor receive WRs are posted on qp2
+		j1 and j2 create a connection between qp1 and qp2
+	- If yes:
+		let qp1 be the QP which belongs to j1 and is connected
+		to some qp on node n
+
+2. j1 gets SRQ number from j2
+3. j1 can now use QP qp1 from step 1
+   and SRQ number from step 2 to send data to j2
+
+Cleanup:
+When job j1 does not need to communicate to any jobs on node n,
+it disconnects qp1 from qp2, and asks j2 to destroy qp2.
+
+===================================================================
+
+Resources used (CQs are ignored below):
+Each node:
+- An SRC domain - to the total of N domains
+- A Receive QP for each (remote) job - to the total of N * (N * J) recv QPs
+
+Each job:
+- A SRQ - to the total of N * J SRQs
+- A send QP for each (remote) node - to the total of N * (N * J) send QPs
+
+===================================================================
diff --git a/include/infiniband/verbs.h b/include/infiniband/verbs.h
index acc1b82..d18475a 100644
--- a/include/infiniband/verbs.h
+++ b/include/infiniband/verbs.h
@@ -142,6 +142,7 @@ struct ibv_device_attr {
 	uint16_t		max_pkeys;
 	uint8_t			local_ca_ack_delay;
 	uint8_t			phys_port_cnt;
+	int			max_src_domain;
 };
 
 enum ibv_mtu {
@@ -370,6 +371,11 @@ struct ibv_ah_attr {
 	uint8_t			port_num;
 };
 
+struct ibv_src_domain {
+	struct ibv_context     *context;
+	uint32_t		handle;
+};
+
 enum ibv_srq_attr_mask {
 	IBV_SRQ_MAX_WR	= 1 << 0,
 	IBV_SRQ_LIMIT	= 1 << 1
@@ -389,7 +395,8 @@ struct ibv_srq_init_attr {
 enum ibv_qp_type {
 	IBV_QPT_RC = 2,
 	IBV_QPT_UC,
-	IBV_QPT_UD
+	IBV_QPT_UD,
+	IBV_QPT_SRC
 };
 
 struct ibv_qp_cap {
@@ -408,6 +415,7 @@ struct ibv_qp_init_attr {
 	struct ibv_qp_cap	cap;
 	enum ibv_qp_type	qp_type;
 	int			sq_sig_all;
+	struct ibv_src_domain  *src_domain;
 };
 
 enum ibv_qp_attr_mask {
@@ -526,6 +534,7 @@ struct ibv_send_wr {
 			uint32_t	remote_qkey;
 		} ud;
 	} wr;
+	uint32_t		src_remote_srq_num;
 };
 
 struct ibv_recv_wr {
@@ -553,6 +562,10 @@ struct ibv_srq {
 	pthread_mutex_t		mutex;
 	pthread_cond_t		cond;
 	uint32_t		events_completed;
+
+	uint32_t		src_srq_num;
+	struct ibv_src_domain  *src_domain;
+	struct ibv_cq	       *src_cq;
 };
 
 struct ibv_qp {
@@ -570,6 +583,8 @@ struct ibv_qp {
 	pthread_mutex_t		mutex;
 	pthread_cond_t		cond;
 	uint32_t		events_completed;
+
+	struct ibv_src_domain  *src_domain;
 };
 
 struct ibv_comp_channel {
@@ -912,6 +927,25 @@ struct ibv_srq *ibv_create_srq(struct ibv_pd *pd,
 			       struct ibv_srq_init_attr *srq_init_attr);
 
 /**
+ * ibv_create_src_srq - Creates a SRQ associated with the specified protection
+ *   domain and src domain.
+ * @pd: The protection domain associated with the SRQ.
+ * @src_domain: The SRC domain associated with the SRQ.
+ * @src_cq: CQ to report completions for SRC packets on.
+ *
+ * @srq_init_attr: A list of initial attributes required to create the SRQ.
+ *
+ * srq_attr->max_wr and srq_attr->max_sge are read to determine the
+ * requested size of the SRQ, and set to the actual values allocated
+ * on return.  If ibv_create_src_srq() succeeds, then max_wr and max_sge
+ * will always be at least as large as the requested values.
+ */
+struct ibv_srq *ibv_create_src_srq(struct ibv_pd *pd,
+				   struct ibv_src_domain *src_domain,
+				   struct ibv_cq *src_cq,
+				   struct ibv_srq_init_attr *srq_init_attr);
+
+/**
  * ibv_modify_srq - Modifies the attributes for the specified SRQ.
  * @srq: The SRQ to modify.
  * @srq_attr: On input, specifies the SRQ attributes to modify.  On output,
@@ -1074,6 +1108,42 @@ int ibv_detach_mcast(struct ibv_qp *qp, union ibv_gid *gid, uint16_t lid);
  */
 int ibv_fork_init(void);
 
+/**
+ * ibv_open_src_domain - open an SRC domain
+ *	Returns a reference to an SRC domain.
+ *
+ * @context: Device context
+ * @fd: descriptor for inode associated with the domain
+ *	If fd == -1, no inode is associated with the domain; in this case,
+ *	the only legal value for oflag is O_CREAT
+ *
+ * @oflag: oflag values are constructed by OR-ing flags from the following list
+ *
+ * O_CREAT
+ *	If a domain belonging to device named by context is already associated
+ *	with the inode, this flag has no effect, except as noted under O_EXCL
+ *	below. Otherwise, a new SRC domain is created and is associated with
+ *	inode specified by fd.
+ *
+ * O_EXCL
+ *	If O_EXCL and O_CREAT are set, open will fail if a domain associated with
+ *	the inode exists. The check for the existence of the domain and creation
+ *	of the domain if it does not exist is atomic with respect to other
+ *	processes executing open with fd naming the same inode.
+ */
+struct ibv_src_domain *ibv_open_src_domain(struct ibv_context *context,
+					   int fd, int oflag);
+
+/**
+ * ibv_close_src_domain - close an SRC domain
+ *	If this is the last reference, destroys the domain.
+ *
+ * @d: reference to SRC domain to close
+ *
+ * close is implicitly performed at process exit.
+ */
+int ibv_close_src_domain(struct ibv_src_domain *d);
+
 END_C_DECLS
 
 # undef __attribute_const
--
MST
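
For discussion, the open/close semantics documented for ibv_open_src_domain()
and ibv_close_src_domain() above can be modelled in user space roughly as
follows. This is only a sketch under stated assumptions, not the proposed
implementation: every model_* name is hypothetical and does not exist in
libibverbs, a small fixed array stands in for the per-device tree, and the
behaviour when no domain exists and O_CREAT is not set (fail with ENOENT) is
my assumption, since the documentation above does not spell it out.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/stat.h>
#include <sys/types.h>

/* Hypothetical model only: none of these names exist in libibverbs. */
#define MODEL_MAX_DOMAINS 16

struct model_domain {
	int   in_use;
	int   has_inode;	/* 0 for a private domain opened with fd == -1 */
	dev_t dev;		/* (dev, ino) identifies the inode */
	ino_t ino;
	int   refcnt;
};

/* Stand-in for the per-device tree keyed by inode. */
static struct model_domain model_table[MODEL_MAX_DOMAINS];

static struct model_domain *model_find(dev_t dev, ino_t ino)
{
	for (size_t i = 0; i < MODEL_MAX_DOMAINS; i++) {
		struct model_domain *d = &model_table[i];

		if (d->in_use && d->has_inode && d->dev == dev && d->ino == ino)
			return d;
	}
	return NULL;
}

static struct model_domain *model_alloc(void)
{
	for (size_t i = 0; i < MODEL_MAX_DOMAINS; i++) {
		struct model_domain *d = &model_table[i];

		if (!d->in_use) {
			d->in_use = 1;
			d->has_inode = 0;
			d->refcnt = 1;
			return d;
		}
	}
	errno = ENOMEM;
	return NULL;
}

struct model_domain *model_open_src_domain(int fd, int oflag)
{
	struct model_domain *d;
	struct stat st;

	if (fd == -1) {
		/* No inode: the only legal oflag value is O_CREAT. */
		if (oflag != O_CREAT) {
			errno = EINVAL;
			return NULL;
		}
		return model_alloc();
	}

	if (fstat(fd, &st) != 0)
		return NULL;	/* errno was set by fstat() */

	d = model_find(st.st_dev, st.st_ino);
	if (d) {
		if ((oflag & O_CREAT) && (oflag & O_EXCL)) {
			errno = EEXIST;
			return NULL;
		}
		d->refcnt++;	/* O_CREAT alone has no effect here */
		return d;
	}

	if (!(oflag & O_CREAT)) {
		errno = ENOENT;	/* assumption, not stated in the proposal */
		return NULL;
	}

	d = model_alloc();
	if (d) {
		d->has_inode = 1;
		d->dev = st.st_dev;
		d->ino = st.st_ino;
	}
	return d;
}

int model_close_src_domain(struct model_domain *d)
{
	if (!d || !d->in_use) {
		errno = EINVAL;
		return -1;
	}
	assert(d->refcnt > 0);
	if (--d->refcnt == 0)
		d->in_use = 0;	/* last reference: destroy the domain */
	return 0;
}
```

Two different descriptors that name the same inode reach the same domain in
this model, which is the sharing property the proposal relies on. The model is
single-process and takes no lock, so it does not show the cross-process
atomicity of the O_CREAT | O_EXCL check; in the envisioned kernel
implementation that would have to be provided by the per-device tree.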