On Mon, Dec 24, 2007 at 11:49:37PM +0000, Tang, Changqing wrote:
>
> > -----Original Message-----
> > From: Pavel Shamis (Pasha) [mailto:pa...@dev.mellanox.co.il]
> > Sent: Monday, December 24, 2007 8:03 AM
> > To: Tang, Changqing
> > Cc: Jack Morgenstein; Roland Dreier; gene...@lists.openfabrics.org;
> > Open MPI Developers; mvapich-disc...@cse.ohio-state.edu
> > Subject: Re: [ofa-general] [RFC] XRC -- make receiving XRC QP
> > independent of any one user process
> >
> > Hi CQ,
> > Tang, Changqing wrote:
> > > If I have an MPI server process on a node, many other MPI
> > > client processes will dynamically connect/disconnect with
> > > the server. The server uses the same XRC domain.
> > >
> > > Will this cause the "kernel" QPs to accumulate for such an
> > > application? We want the server to run 365 days a year.
> > >
> > I have a question about the scenario above. Do you call
> > MPI disconnect on both ends (server/client) before the
> > client exits (and must we do it?)
>
> Yes, both ends will call disconnect. But for us, MPI_Comm_disconnect()
> is not a collective call, it is just a local operation.

But the spec says that MPI_Comm_disconnect() is a collective call:
http://www.mpi-forum.org/docs/mpi-20-html/node114.htm#Node114
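To make the collective semantics concrete, here is a minimal sketch of the
server side of a dynamically established intercommunicator. The pairing of
the two MPI_Comm_disconnect() calls is the point; the port exchange and
message traffic are placeholders.

    /* server.c: accepts one client, then disconnects collectively */
    #include <mpi.h>

    int main(int argc, char **argv)
    {
        char port_name[MPI_MAX_PORT_NAME];
        MPI_Comm client;

        MPI_Init(&argc, &argv);
        MPI_Open_port(MPI_INFO_NULL, port_name);
        /* port_name must reach the client out of band (file, name server, ...) */
        MPI_Comm_accept(port_name, MPI_INFO_NULL, 0, MPI_COMM_SELF, &client);

        /* ... serve requests over 'client' ... */

        /* Collective over 'client': completes only together with the
         * matching MPI_Comm_disconnect() on the client side, after all
         * pending communication on the communicator has drained. */
        MPI_Comm_disconnect(&client);

        MPI_Close_port(port_name);
        MPI_Finalize();
        return 0;
    }

The client mirrors this with MPI_Comm_connect(port_name, ...) followed by
its own MPI_Comm_disconnect(). A purely local "disconnect" would let one
side exit while the other still holds connection state, which is exactly
the resource-accumulation concern raised above.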
>
> --CQ
>
> >
> > Regards,
> > Pasha.
> >
> > > Thanks.
> > > --CQ
> > >
> > >> -----Original Message-----
> > >> From: Pavel Shamis (Pasha) [mailto:pa...@dev.mellanox.co.il]
> > >> Sent: Thursday, December 20, 2007 9:15 AM
> > >> To: Jack Morgenstein
> > >> Cc: Tang, Changqing; Roland Dreier; gene...@lists.openfabrics.org;
> > >> Open MPI Developers; mvapich-disc...@cse.ohio-state.edu
> > >> Subject: Re: [ofa-general] [RFC] XRC -- make receiving XRC QP
> > >> independent of any one user process
> > >>
> > >> Adding the Open MPI and MVAPICH communities to the thread.
> > >>
> > >> Pasha (Pavel Shamis)
> > >>
> > >> Jack Morgenstein wrote:
> > >>> Background: see the "XRC Cleanup order issue" thread at
> > >>> http://lists.openfabrics.org/pipermail/general/2007-December/043935.html
> > >>> (a userspace process which created the receiving XRC QP on a given
> > >>> host dies before other processes which still need to receive XRC
> > >>> messages on their SRQs which are "paired" with the now-destroyed
> > >>> receiving XRC QP.)
> > >>>
> > >>> Solution: add a userspace verb (as part of the XRC suite) which
> > >>> enables the user process to create an XRC QP owned by the kernel --
> > >>> which belongs to the required XRC domain. This QP will be destroyed
> > >>> when the XRC domain is closed (i.e., as part of an
> > >>> ibv_close_xrc_domain call, but only when the domain's reference
> > >>> count goes to zero). A lifecycle sketch follows this message.
> > >>>
> > >>> Below, I give the new userspace API for this function. Any feedback
> > >>> will be appreciated. This API will be implemented in the upcoming
> > >>> OFED 1.3 release, so we need feedback ASAP.
> > >>>
> > >>> Notes:
> > >>> 1. There is no query or destroy verb for this QP. There is also no
> > >>>    userspace object for the QP. Userspace has ONLY the raw QP
> > >>>    number to use when creating the (X)RC connection.
> > >>> 2. Since the QP is "owned" by kernel space, async events for this
> > >>>    QP are also handled in kernel space (i.e., reported in
> > >>>    /var/log/messages). There are no completion events for the QP,
> > >>>    since it does not send, and all receive completions are reported
> > >>>    in the XRC SRQ's CQ. If this QP enters the error state, the
> > >>>    remote QP which sends will start receiving RETRY_EXCEEDED
> > >>>    errors, so the application will be aware of the failure.
> > >>>
> > >>> - Jack
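A minimal lifecycle sketch of the above, assuming the ibv_open_xrc_domain()
and ibv_close_xrc_domain() verbs from the same OFED 1.3 XRC series (their
exact signatures may differ); open_shared_domain() is a hypothetical helper
and the file path is a placeholder whose inode merely gives every process
on the host the same domain identity:

    #include <fcntl.h>
    #include <infiniband/verbs.h>

    /* Every process that opens the domain bumps its reference count; the
     * kernel-owned receive QP allocated via ibv_alloc_xrc_rcv_qp (below)
     * is destroyed only when the last ibv_close_xrc_domain() drops the
     * count to zero, not when the creating process exits. */
    struct ibv_xrc_domain *open_shared_domain(struct ibv_context *ctx)
    {
        int fd = open("/tmp/my_xrc_domain", O_CREAT | O_RDWR, 0600);
        if (fd < 0)
            return NULL;
        return ibv_open_xrc_domain(ctx, fd, O_CREAT);
    }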
> > >>> =====================================================================
> > >>> /**
> > >>>  * ibv_alloc_xrc_rcv_qp - creates an XRC QP serving as a
> > >>>  *      receive-side-only QP, and moves the created QP through the
> > >>>  *      RESET->INIT and INIT->RTR transitions. (The RTR->RTS
> > >>>  *      transition is not needed, since this QP does no sending.)
> > >>>  *      The sending XRC QP uses this QP as its destination, while
> > >>>  *      specifying an XRC SRQ for actually receiving the
> > >>>  *      transmissions and generating all completions on the
> > >>>  *      receiving side.
> > >>>  *
> > >>>  *      This QP is created in kernel space, and persists until the
> > >>>  *      XRC domain is closed (i.e., its reference count goes to
> > >>>  *      zero).
> > >>>  *
> > >>>  * @pd: protection domain to use. At the lower layer, this provides
> > >>>  *      access to the userspace object.
> > >>>  * @xrc_domain: XRC domain to use for the QP.
> > >>>  * @attr: modify-qp attributes needed to bring the QP to RTR.
> > >>>  * @attr_mask: bitmap indicating which attributes are provided in
> > >>>  *      the attr struct; used for validity checking.
> > >>>  * @xrc_rcv_qpn: qp_num of the created QP (on success). To be
> > >>>  *      passed to the remote node; the remote node will use
> > >>>  *      xrc_rcv_qpn in ibv_post_send when sending to XRC SRQs on
> > >>>  *      this host in the same XRC domain.
> > >>>  *
> > >>>  * RETURNS: success (0), or a (negative) error value.
> > >>>  */
> > >>> int ibv_alloc_xrc_rcv_qp(struct ibv_pd *pd,
> > >>>                          struct ibv_xrc_domain *xrc_domain,
> > >>>                          struct ibv_qp_attr *attr,
> > >>>                          enum ibv_qp_attr_mask attr_mask,
> > >>>                          uint32_t *xrc_rcv_qpn);
> > >>>
> > >>> Notes:
> > >>>
> > >>> 1. Although the kernel creates the QP in the kernel's own PD, we
> > >>>    still need the PD parameter to determine the device.
> > >>>
> > >>> 2. I chose to use struct ibv_qp_attr, which is used in modify-QP,
> > >>>    rather than create a new structure for this purpose. This also
> > >>>    guards against API changes in the event that during development
> > >>>    I notice that more modify-qp parameters must be specified for
> > >>>    this operation to work.
> > >>>
> > >>> 3. Table of the ibv_qp_attr parameters, showing which values to set
> > >>>    (a filled-in sketch follows the table):
> > >>>
> > >>>    struct ibv_qp_attr {
> > >>>        enum ibv_qp_state  qp_state;            Not needed
> > >>>        enum ibv_qp_state  cur_qp_state;        Not needed
> > >>>            -- Driver starts from RESET and takes the QP to RTR.
> > >>>        enum ibv_mtu       path_mtu;            Yes
> > >>>        enum ibv_mig_state path_mig_state;      Yes
> > >>>        uint32_t           qkey;                Yes
> > >>>        uint32_t           rq_psn;              Yes
> > >>>        uint32_t           sq_psn;              Not needed
> > >>>        uint32_t           dest_qp_num;         Yes
> > >>>            -- this is the remote side's QP for the RC connection.
> > >>>        int                qp_access_flags;     Yes
> > >>>        struct ibv_qp_cap  cap;                 Need only XRC domain.
> > >>>            Other caps will use hard-coded values:
> > >>>                max_send_wr = 1;
> > >>>                max_recv_wr = 0;
> > >>>                max_send_sge = 1;
> > >>>                max_recv_sge = 0;
> > >>>                max_inline_data = 0;
> > >>>        struct ibv_ah_attr ah_attr;             Yes
> > >>>        struct ibv_ah_attr alt_ah_attr;         Optional
> > >>>        uint16_t           pkey_index;          Yes
> > >>>        uint16_t           alt_pkey_index;      Optional
> > >>>        uint8_t            en_sqd_async_notify; Not needed (no SQ)
> > >>>        uint8_t            sq_draining;         Not needed (no SQ)
> > >>>        uint8_t            max_rd_atomic;       Not needed (no SQ)
> > >>>        uint8_t            max_dest_rd_atomic;  Yes
> > >>>            -- total max outstanding RDMAs expected for ALL SRQ
> > >>>            destinations using this receive QP (if you are only
> > >>>            using SENDs, this value can be 0).
> > >>>        uint8_t            min_rnr_timer;       Default - 0
> > >>>        uint8_t            port_num;            Yes
> > >>>        uint8_t            timeout;             Yes
> > >>>        uint8_t            retry_cnt;           Yes
> > >>>        uint8_t            rnr_retry;           Yes
> > >>>        uint8_t            alt_port_num;        Optional
> > >>>        uint8_t            alt_timeout;         Optional
> > >>>    };
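To make the table concrete, a hedged sketch of filling the struct for the
call. fill_xrc_rcv_attr() is a hypothetical helper, and all numeric values
(LID, QPN, PSN, port, pkey index) are placeholders that a real application
exchanges out of band:

    #include <string.h>
    #include <infiniband/verbs.h>

    static void fill_xrc_rcv_attr(struct ibv_qp_attr *attr,
                                  uint32_t remote_qpn, uint16_t remote_lid)
    {
        memset(attr, 0, sizeof(*attr));

        /* RESET->INIT parameters */
        attr->qp_access_flags = IBV_ACCESS_REMOTE_WRITE |
                                IBV_ACCESS_REMOTE_READ;
        attr->pkey_index = 0;
        attr->port_num   = 1;

        /* INIT->RTR parameters */
        attr->path_mtu           = IBV_MTU_2048;
        attr->dest_qp_num        = remote_qpn; /* sending side's QP number */
        attr->rq_psn             = 0;
        attr->min_rnr_timer      = 0;          /* table default */
        attr->max_dest_rd_atomic = 4;          /* 0 if only SENDs are used */
        attr->ah_attr.dlid          = remote_lid;
        attr->ah_attr.sl            = 0;
        attr->ah_attr.src_path_bits = 0;
        attr->ah_attr.port_num      = 1;

        /* marked "Yes" in the table above */
        attr->timeout   = 14;
        attr->retry_cnt = 7;
        attr->rnr_retry = 7;
    }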
> > >>> 4. Attribute mask bits to set (combined in the sketch that follows
> > >>>    this message):
> > >>>    For the RESET-to-INIT transition:
> > >>>        IB_QP_ACCESS_FLAGS | IB_QP_PKEY_INDEX | IB_QP_PORT
> > >>>    For the INIT-to-RTR transition:
> > >>>        IB_QP_AV | IB_QP_PATH_MTU | IB_QP_DEST_QPN |
> > >>>        IB_QP_RQ_PSN | IB_QP_MIN_RNR_TIMER
> > >>>    If you are using RDMA or atomics, also set:
> > >>>        IB_QP_MAX_DEST_RD_ATOMIC
> > >>
> > >> --
> > >> Pavel Shamis (Pasha)
> > >> Mellanox Technologies
> >
> > --
> > Pavel Shamis (Pasha)
> > Mellanox Technologies
>
> _______________________________________________
> general mailing list
> gene...@lists.openfabrics.org
> http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general
>
> To unsubscribe, please visit
> http://openib.org/mailman/listinfo/openib-general

--
Gleb.
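Completing the sketch from note 3: a hedged end-to-end example that ORs
note 4's mask bits together and calls the proposed verb. IBV_QP_* are the
userspace spellings assumed to correspond to the IB_QP_* bits listed in
the quoted spec; create_kernel_rcv_qp() is hypothetical, and
fill_xrc_rcv_attr() is the helper from the earlier sketch.

    int create_kernel_rcv_qp(struct ibv_pd *pd,
                             struct ibv_xrc_domain *xrc_domain,
                             uint32_t remote_qpn, uint16_t remote_lid,
                             uint32_t *rcv_qpn)
    {
        struct ibv_qp_attr attr;
        enum ibv_qp_attr_mask mask =
            /* RESET->INIT */
            IBV_QP_ACCESS_FLAGS | IBV_QP_PKEY_INDEX | IBV_QP_PORT |
            /* INIT->RTR */
            IBV_QP_AV | IBV_QP_PATH_MTU | IBV_QP_DEST_QPN |
            IBV_QP_RQ_PSN | IBV_QP_MIN_RNR_TIMER |
            /* only needed if RDMA or atomics are used */
            IBV_QP_MAX_DEST_RD_ATOMIC;

        fill_xrc_rcv_attr(&attr, remote_qpn, remote_lid);

        /* On success, *rcv_qpn is the raw QP number to hand to the remote
         * (sending) side; there is no userspace QP object to query or
         * destroy, and the QP lives until the XRC domain's reference
         * count reaches zero. */
        return ibv_alloc_xrc_rcv_qp(pd, xrc_domain, &attr, mask, rcv_qpn);
    }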