Re: [PATCH v3 12/23] staging/rdma/hfi1: Macro code clean up
On Mon, Oct 26, 2015 at 10:28:38AM -0400, ira.we...@intel.com wrote:
> From: Mitko Haralanov
>
> Clean up the context and sdma macros and move them to a more logical place in
> hfi.h
>
> Signed-off-by: Mitko Haralanov
> Signed-off-by: Ira Weiny
> ---
>  drivers/staging/rdma/hfi1/hfi.h | 22 ++
>  1 file changed, 10 insertions(+), 12 deletions(-)
>
> diff --git a/drivers/staging/rdma/hfi1/hfi.h b/drivers/staging/rdma/hfi1/hfi.h
> index a35213e9b500..41ad9a30149b 100644
> --- a/drivers/staging/rdma/hfi1/hfi.h
> +++ b/drivers/staging/rdma/hfi1/hfi.h
> @@ -1104,6 +1104,16 @@ struct hfi1_filedata {
>  	int rec_cpu_num;
>  };
>
> +/* for use in system calls, where we want to know device type, etc. */
> +#define fp_to_fd(fp) ((struct hfi1_filedata *)(fp)->private_data)
> +#define ctxt_fp(fp) (fp_to_fd((fp))->uctxt)
> +#define subctxt_fp(fp) (fp_to_fd((fp))->subctxt)
> +#define tidcursor_fp(fp) (fp_to_fd((fp))->tidcursor)
> +#define user_sdma_pkt_fp(fp) (fp_to_fd((fp))->pq)
> +#define user_sdma_comp_fp(fp) (fp_to_fd((fp))->cq)
> +#define notifier_fp(fp) (fp_to_fd((fp))->mn)
> +#define rb_fp(fp) (fp_to_fd((fp))->tid_rb_root)

Ick, no, don't do this, just spell it all out (odds are you will see that
you can make the code simpler...)

If you don't know what "cq" or "pq" are, then name them properly.

These need to be all removed.

thanks,

greg k-h
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
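Greg's point is that accessor macros like `subctxt_fp(fp)` hide what is just a cast and a field dereference. A minimal userspace stand-in (the struct layouts here are simplified illustrations, not the real hfi1 types) of the spelled-out style being asked for:

```c
#include <assert.h>
#include <stddef.h>

/* Simplified stand-ins mirroring the field names in the quoted patch. */
struct hfi1_filedata {
	int subctxt;
	void *uctxt;
};

struct file {
	void *private_data;	/* points at a struct hfi1_filedata */
};

/*
 * Macro style from the patch under review:
 *	#define subctxt_fp(fp) (fp_to_fd((fp))->subctxt)
 * Spelled-out style: take one typed local, then use plain field access.
 * The access pattern is now visible at the call site and greppable.
 */
static int get_subctxt(struct file *fp)
{
	struct hfi1_filedata *fd = fp->private_data;

	return fd->subctxt;
}
```

The spelled-out version costs one local variable per function but makes every use of `private_data` explicit, which is what usually reveals the simplifications Greg alludes to.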
Re: [PATCH v3 14/23] staging/rdma/hfi1: Implement Expected Receive TID caching
On Mon, Oct 26, 2015 at 10:28:40AM -0400, ira.we...@intel.com wrote:
> From: Mitko Haralanov
>
> Expected receives work by user-space libraries (PSM) calling into the
> driver with information about the user's receive buffer and have the
> driver DMA-map that buffer and program the HFI to receive data directly
> into it.
>
> This is an expensive operation as it requires the driver to pin the pages
> which the user's buffer maps to, DMA-map them, and then program the HFI.
>
> When the receive is complete, user-space libraries have to call into the
> driver again so the buffer is removed from the HFI, un-mapped, and the
> pages unpinned.
>
> All of these operations are expensive, considering that a lot of
> applications (especially micro-benchmarks) use the same buffer over and
> over.
>
> In order to get better performance for user-space applications, it is
> highly beneficial that they don't continuously call into the driver to
> register and unregister the same buffer. Rather, they can register the
> buffer and cache it for future work. The buffer can be unregistered when
> it is freed by the user.
>
> This change implements such buffer caching by making use of the kernel's
> MMU notifier API. User-space libraries call into the driver only when
> they need to register a new buffer.
>
> Once a buffer is registered, it stays programmed into the HFI until the
> kernel notifies the driver that the buffer has been freed by the user.
> At that time, the user-space library is notified and it can do the
> necessary work to remove the buffer from its cache.
>
> Buffers which have been invalidated by the kernel are not automatically
> removed from the HFI and do not have their pages unpinned. Buffers are
> only completely removed when the user-space libraries call into the
> driver to free them. This is done to ensure that any ongoing transfers
> into that buffer are complete. This is important when a buffer is not
> completely freed but rather shrunk; the user-space library could still
> have uncompleted transfers into the remaining buffer.
>
> With this feature, it is important that systems are set up with
> reasonable limits for the amount of lockable memory. Keeping the limit
> at "unlimited" (as we've done up to this point) may result in jobs being
> killed by the kernel's OOM killer due to them taking up excessive
> amounts of memory.
>
> Reviewed-by: Arthur Kepner
> Reviewed-by: Dennis Dalessandro
> Signed-off-by: Mitko Haralanov
> Signed-off-by: Ira Weiny
>
> ---
> Changes from V2:
> 	Fix random Kconfig 0-day build error
> 	Fix leak of random memory to user space caught by Dan Carpenter
> 	Separate out pointer bug fix into a previous patch
> 	Change error checks in case statement per Dan's comments
>
>  drivers/staging/rdma/hfi1/Kconfig        |    1 +
>  drivers/staging/rdma/hfi1/Makefile       |    2 +-
>  drivers/staging/rdma/hfi1/common.h       |   15 +-
>  drivers/staging/rdma/hfi1/file_ops.c     |  490 ++---
>  drivers/staging/rdma/hfi1/hfi.h          |   43 +-
>  drivers/staging/rdma/hfi1/init.c         |    5 +-
>  drivers/staging/rdma/hfi1/trace.h        |  132 ++--
>  drivers/staging/rdma/hfi1/user_exp_rcv.c | 1171 ++
>  drivers/staging/rdma/hfi1/user_exp_rcv.h |   82 +++
>  drivers/staging/rdma/hfi1/user_pages.c   |  110 +--
>  drivers/staging/rdma/hfi1/user_sdma.c    |   13 +
>  drivers/staging/rdma/hfi1/user_sdma.h    |   10 +-
>  include/uapi/rdma/hfi/hfi1_user.h        |   42 +-
>  13 files changed, 1481 insertions(+), 635 deletions(-)
>  create mode 100644 drivers/staging/rdma/hfi1/user_exp_rcv.c
>  create mode 100644 drivers/staging/rdma/hfi1/user_exp_rcv.h

This is way too big to review properly, please break it up into reviewable
chunks.

thanks,

greg k-h
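The caching scheme in the commit message hinges on the kernel's MMU notifier API: the driver registers a notifier on the user's address space and gets a callback when pages backing a cached buffer are unmapped. A minimal sketch of that registration, using the API as it looked in this kernel era (v4.3-style signatures); the `tid_cache` type and `hfi1_`-prefixed names here are hypothetical illustrations, not the actual hfi1 implementation:

```c
#include <linux/mmu_notifier.h>
#include <linux/sched.h>

/* Hypothetical per-context state; the real driver would hang this off
 * its file data. */
struct tid_cache {
	struct mmu_notifier mn;
};

/* Called when [start, end) is being unmapped from the user's address
 * space.  Per the commit message, the driver would mark overlapping
 * cached buffers invalid here -- but not unpin them yet, since a
 * transfer may still be in flight. */
static void hfi1_invalidate_range_start(struct mmu_notifier *mn,
					struct mm_struct *mm,
					unsigned long start,
					unsigned long end)
{
	struct tid_cache *cache = container_of(mn, struct tid_cache, mn);

	/* ... look up and invalidate cached TID entries in [start, end),
	 * e.g. in cache's buffer tree ... */
}

static const struct mmu_notifier_ops hfi1_mn_ops = {
	.invalidate_range_start = hfi1_invalidate_range_start,
};

static int hfi1_cache_init(struct tid_cache *cache)
{
	cache->mn.ops = &hfi1_mn_ops;
	/* Tie the notifier to the calling process's address space. */
	return mmu_notifier_register(&cache->mn, current->mm);
}
```

The key property the driver gets from this is asynchrony: user space never has to call in to say "this buffer was freed"; the kernel tells the driver, which can then notify the library.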
Re: [PATCH v3 19/23] staging/rdma/hfi: modify workqueue for parallelism
On Mon, Oct 26, 2015 at 10:28:45AM -0400, ira.we...@intel.com wrote:
> From: Mike Marciniszyn
>
> The workqueue is currently single threaded per port, which is ok for a
> small number of SDMA engines.
>
> For hfi1, there are up to 16 SDMA engines that can be fed descriptors in
> parallel.
>
> This patch:
> - Converts to use alloc_workqueue
> - Changes the workqueue limit from 1 to num_sdma
> - Makes the queue WQ_CPU_INTENSIVE and WQ_HIGHPRI
> - The sdma_engine now has a cpu that is initialized
>   as the MSI-X vectors are setup
> - Adjusts the post send logic to call a new scheduler
>   that doesn't get the s_lock
> - The new and old workqueue schedule now pass a cpu
> - post send now uses the new scheduler
> - RC/UC QPs now pre-compute the sc, sde
> - The sde wq is eliminated since the new hfi1_wq is
>   multi-threaded

When you have to start enumerating all of the different things that your
patch does, that's a huge hint that you need to break it up into smaller
pieces.

Please break this up, it's not acceptable as-is.

thanks,

greg k-h
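The workqueue conversion the patch describes boils down to two API changes: `alloc_workqueue()` with a `max_active` of `num_sdma` instead of a single-threaded queue, and `queue_work_on()` to a CPU chosen from the engine's MSI-X vector. A hedged sketch of just those two pieces (function names hypothetical, not the hfi1 code):

```c
#include <linux/workqueue.h>

/* Hypothetical: one multi-threaded, high-priority queue per port,
 * sized so every SDMA engine's work item can run concurrently. */
static struct workqueue_struct *
hfi1_alloc_port_wq(int port, unsigned int num_sdma)
{
	return alloc_workqueue("hfi1_%d",
			       WQ_HIGHPRI | WQ_CPU_INTENSIVE,
			       num_sdma,	/* max_active: was 1 */
			       port);
}

/* Each engine remembers the CPU its MSI-X vector was bound to and
 * queues its work there, so up to 16 engines are fed in parallel. */
static void hfi1_sched_engine(struct workqueue_struct *wq,
			      struct work_struct *work, int cpu)
{
	queue_work_on(cpu, wq, work);
}
```

`WQ_CPU_INTENSIVE` keeps the descriptor-feeding work from being counted against the per-CPU concurrency accounting, and `WQ_HIGHPRI` runs it from high-priority worker pools, both reasonable for latency-sensitive SDMA feeding.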
Re: [PATCH v3 23/23] staging/rdma/hfi1: Update driver version string to 0.9-294
On Mon, Oct 26, 2015 at 10:28:49AM -0400, ira.we...@intel.com wrote:
> From: Jubin John
>
> Signed-off-by: Jubin John
> Signed-off-by: Ira Weiny
> ---
>  drivers/staging/rdma/hfi1/common.h | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/drivers/staging/rdma/hfi1/common.h b/drivers/staging/rdma/hfi1/common.h
> index 7809093eb55e..5dd92720faae 100644
> --- a/drivers/staging/rdma/hfi1/common.h
> +++ b/drivers/staging/rdma/hfi1/common.h
> @@ -205,7 +205,7 @@
>   * to the driver itself, not the software interfaces it supports.
>   */
>  #ifndef HFI1_DRIVER_VERSION_BASE
> -#define HFI1_DRIVER_VERSION_BASE "0.9-248"
> +#define HFI1_DRIVER_VERSION_BASE "0.9-294"

Patches like this make no sense at all, please drop it and only use the
kernel version. Trust me, it's going to get messy really fast (hint, it
already did...)

greg k-h
Re: [PATCH 1/4 v2] staging: ipath: ipath_driver: Use setup_timer
On Sun, Oct 25, 2015 at 01:21:11PM +0200, Leon Romanovsky wrote:
> On Sun, Oct 25, 2015 at 12:17 PM, Muhammad Falak R Wani wrote:
> Please follow standard naming convention for the patches.
> It should be [PATCH v2 1/4] and not [PATCH 1/4 v2].

Does this matter? It's in a thread so it sorts fine either way.

regards,
dan carpenter
[PATCH 2/2] iser-target: Remove explicit mlx4 work-around
The driver now exposes sufficient limits so we can avoid having an mlx4
specific work-around.

Signed-off-by: Sagi Grimberg
---
 drivers/infiniband/ulp/isert/ib_isert.c | 10 ++
 1 files changed, 2 insertions(+), 8 deletions(-)

diff --git a/drivers/infiniband/ulp/isert/ib_isert.c b/drivers/infiniband/ulp/isert/ib_isert.c
index 96336a9..303cea7 100644
--- a/drivers/infiniband/ulp/isert/ib_isert.c
+++ b/drivers/infiniband/ulp/isert/ib_isert.c
@@ -141,14 +141,8 @@ isert_create_qp(struct isert_conn *isert_conn,
 	attr.recv_cq = comp->cq;
 	attr.cap.max_send_wr = ISERT_QP_MAX_REQ_DTOS;
 	attr.cap.max_recv_wr = ISERT_QP_MAX_RECV_DTOS + 1;
-	/*
-	 * FIXME: Use devattr.max_sge - 2 for max_send_sge as
-	 * work-around for RDMA_READs with ConnectX-2.
-	 *
-	 * Also, still make sure to have at least two SGEs for
-	 * outgoing control PDU responses.
-	 */
-	attr.cap.max_send_sge = max(2, device->ib_device->max_sge - 2);
+	attr.cap.max_send_sge = min(device->ib_device->max_sge,
+				    device->ib_device->max_sge_rd);
 	isert_conn->max_sge = attr.cap.max_send_sge;
 	attr.cap.max_recv_sge = 1;
-- 
1.7.1
[PATCH 0/2] Expose max_sge_rd correctly
This addresses a specific mlx4 issue where the max_sge_rd is actually
smaller than max_sge (rdma reads with max_sge entries complete with
error).

The second patch removes the explicit work-around from the iser target
code.

This applies on top of Christoph's device attributes modification.

Sagi Grimberg (2):
  mlx4: Expose correct max_sge_rd limit
  iser-target: Remove explicit mlx4 work-around

 drivers/infiniband/hw/mlx4/main.c       |  3 ++-
 drivers/infiniband/ulp/isert/ib_isert.c | 10 ++
 2 files changed, 4 insertions(+), 9 deletions(-)
[PATCH 1/2] mlx4: Expose correct max_sge_rd limit
mlx4 devices (ConnectX-2, ConnectX-3) cannot issue max_sge entries in a
single RDMA_READ request (it results in a completion error). Thus, expose
a lower max_sge_rd to avoid this issue.

Signed-off-by: Sagi Grimberg
---
 drivers/infiniband/hw/mlx4/main.c | 3 ++-
 1 files changed, 2 insertions(+), 1 deletions(-)

diff --git a/drivers/infiniband/hw/mlx4/main.c b/drivers/infiniband/hw/mlx4/main.c
index 3889723..46305dc 100644
--- a/drivers/infiniband/hw/mlx4/main.c
+++ b/drivers/infiniband/hw/mlx4/main.c
@@ -499,7 +499,8 @@ static int mlx4_ib_init_device_flags(struct ib_device *ibdev)
 	ibdev->max_qp_wr = dev->dev->caps.max_wqes - MLX4_IB_SQ_MAX_SPARE;
 	ibdev->max_sge = min(dev->dev->caps.max_sq_sg,
 			     dev->dev->caps.max_rq_sg);
-	ibdev->max_sge_rd = ibdev->max_sge;
+	/* reserve 2 sge slots for rdma reads */
+	ibdev->max_sge_rd = ibdev->max_sge - 2;
 	ibdev->max_cq = dev->dev->quotas.cq;
 	ibdev->max_cqe = dev->dev->caps.max_cqes;
 	ibdev->max_mr = dev->dev->quotas.mpt;
-- 
1.7.1
Re: [PATCH 1/4 v2] staging: ipath: ipath_driver: Use setup_timer
On Tue, Oct 27, 2015 at 11:19 AM, Dan Carpenter wrote:
> On Sun, Oct 25, 2015 at 01:21:11PM +0200, Leon Romanovsky wrote:
>> On Sun, Oct 25, 2015 at 12:17 PM, Muhammad Falak R Wani wrote:
>> Please follow standard naming convention for the patches.
>> It should be [PATCH v2 1/4] and not [PATCH 1/4 v2].
>
> Does this matter? It's in a thread so it sorts fine either way.

It would be wise if people read the guides and followed the examples.

[1] https://www.kernel.org/doc/Documentation/SubmittingPatches

>
> regards,
> dan carpenter
Re: [PATCH 1/4 v2] staging: ipath: ipath_driver: Use setup_timer
On Tue, Oct 27, 2015 at 11:45:18AM +0200, Leon Romanovsky wrote:
> On Tue, Oct 27, 2015 at 11:19 AM, Dan Carpenter wrote:
> > On Sun, Oct 25, 2015 at 01:21:11PM +0200, Leon Romanovsky wrote:
> >> On Sun, Oct 25, 2015 at 12:17 PM, Muhammad Falak R Wani wrote:
> >> Please follow standard naming convention for the patches.
> >> It should be [PATCH v2 1/4] and not [PATCH 1/4 v2].
> >
> > Does this matter? It's in a thread so it sorts fine either way.
> It will be wise if people read guides and follow examples.
>
> [1] https://www.kernel.org/doc/Documentation/SubmittingPatches

That document doesn't really specify one way or the other. And even if
it did, then why would you care? Stop being so picky for no reason.

regards,
dan carpenter
[PATCH libibverbs] Expose QP block self multicast loopback creation flag
Add a QP creation flag which indicates that the QP will not receive self
multicast loopback traffic.

ibv_cmd_create_qp_ex was already defined but could not get extended; add
ibv_cmd_create_qp_ex2, which follows the extension scheme and hence could
be extended in the future for more features.

Signed-off-by: Eran Ben Elisha
Reviewed-by: Moshe Lazer
---
Hi Doug,
This is the user space equivalent for the loopback prevention patches
that were accepted into 4.4 ib-next.

 include/infiniband/driver.h   |   9 ++
 include/infiniband/kern-abi.h |  53 +++
 include/infiniband/verbs.h    |   9 +-
 src/cmd.c                     | 200 ++
 src/libibverbs.map            |   1 +
 5 files changed, 200 insertions(+), 72 deletions(-)

diff --git a/include/infiniband/driver.h b/include/infiniband/driver.h
index 8227df0..b7f1fae 100644
--- a/include/infiniband/driver.h
+++ b/include/infiniband/driver.h
@@ -179,6 +179,15 @@ int ibv_cmd_create_qp_ex(struct ibv_context *context,
 			 struct ibv_qp_init_attr_ex *attr_ex,
 			 struct ibv_create_qp *cmd, size_t cmd_size,
 			 struct ibv_create_qp_resp *resp, size_t resp_size);
+int ibv_cmd_create_qp_ex2(struct ibv_context *context,
+			  struct verbs_qp *qp, int vqp_sz,
+			  struct ibv_qp_init_attr_ex *qp_attr,
+			  struct ibv_create_qp_ex *cmd,
+			  size_t cmd_core_size,
+			  size_t cmd_size,
+			  struct ibv_create_qp_resp_ex *resp,
+			  size_t resp_core_size,
+			  size_t resp_size);
 int ibv_cmd_open_qp(struct ibv_context *context,
 		    struct verbs_qp *qp, int vqp_sz,
 		    struct ibv_qp_open_attr *attr,
diff --git a/include/infiniband/kern-abi.h b/include/infiniband/kern-abi.h
index 800c5ab..2278f63 100644
--- a/include/infiniband/kern-abi.h
+++ b/include/infiniband/kern-abi.h
@@ -110,6 +110,8 @@ enum {
 enum {
 	IB_USER_VERBS_CMD_QUERY_DEVICE_EX = IB_USER_VERBS_CMD_EXTENDED_MASK |
 					    IB_USER_VERBS_CMD_QUERY_DEVICE,
+	IB_USER_VERBS_CMD_CREATE_QP_EX = IB_USER_VERBS_CMD_EXTENDED_MASK |
+					 IB_USER_VERBS_CMD_CREATE_QP,
 	IB_USER_VERBS_CMD_CREATE_FLOW = IB_USER_VERBS_CMD_EXTENDED_MASK +
 					IB_USER_VERBS_CMD_THRESHOLD,
 	IB_USER_VERBS_CMD_DESTROY_FLOW
@@ -527,28 +529,35 @@ struct ibv_kern_qp_attr {
 	__u8	reserved[5];
 };

+#define IBV_CREATE_QP_COMMON	\
+	__u64 user_handle;	\
+	__u32 pd_handle;	\
+	__u32 send_cq_handle;	\
+	__u32 recv_cq_handle;	\
+	__u32 srq_handle;	\
+	__u32 max_send_wr;	\
+	__u32 max_recv_wr;	\
+	__u32 max_send_sge;	\
+	__u32 max_recv_sge;	\
+	__u32 max_inline_data;	\
+	__u8  sq_sig_all;	\
+	__u8  qp_type;		\
+	__u8  is_srq;		\
+	__u8  reserved
+
 struct ibv_create_qp {
 	__u32 command;
 	__u16 in_words;
 	__u16 out_words;
 	__u64 response;
-	__u64 user_handle;
-	__u32 pd_handle;
-	__u32 send_cq_handle;
-	__u32 recv_cq_handle;
-	__u32 srq_handle;
-	__u32 max_send_wr;
-	__u32 max_recv_wr;
-	__u32 max_send_sge;
-	__u32 max_recv_sge;
-	__u32 max_inline_data;
-	__u8  sq_sig_all;
-	__u8  qp_type;
-	__u8  is_srq;
-	__u8  reserved;
+	IBV_CREATE_QP_COMMON;
 	__u64 driver_data[0];
 };

+struct ibv_create_qp_common {
+	IBV_CREATE_QP_COMMON;
+};
+
 struct ibv_open_qp {
 	__u32 command;
 	__u16 in_words;
@@ -574,6 +583,19 @@ struct ibv_create_qp_resp {
 	__u32 reserved;
 };

+struct ibv_create_qp_ex {
+	struct ex_hdr hdr;
+	struct ibv_create_qp_common base;
+	__u32 comp_mask;
+	__u32 create_flags;
+};
+
+struct ibv_create_qp_resp_ex {
+	struct ibv_create_qp_resp base;
+	__u32 comp_mask;
+	__u32 response_length;
+};
+
 struct ibv_qp_dest {
 	__u8  dgid[16];
 	__u32 flow_label;
@@ -1031,7 +1053,8 @@ enum {
 	IB_USER_VERBS_CMD_OPEN_QP_V2 = -1,
 	IB_USER_VERBS_CMD_CREATE_FLOW_V2 = -1,
 	IB_USER_VERBS_CMD_DESTROY_FLOW_V2 = -1,
-	IB_USER_VERBS_CMD_QUERY_DEVICE_EX_V2 = -1
+	IB_USER_VERBS_CMD_QUERY_DEVICE_EX_V2 = -1,
+	IB_USER_VERBS_CMD_CREATE_QP_EX_V2 = -1,
 };

 struct ibv_modify_srq_v3 {
diff --git a/include/infiniband/verbs.h b/include/infiniband/verbs.h
index ae22768..941e5dc 100644
--- a/include/infiniband/verbs.h
+++ b/include/infiniband/verbs.h
@@ -580,7 +580,12 @@ struct ibv_qp_init_attr {
 enum ibv_qp_init_attr_mask {
 	IBV_QP_INIT_ATTR_PD   = 1 << 0,
 	IBV_QP_INIT_ATTR_XRCD = 1 << 1,
-	IBV_QP_INIT_ATTR_RESE
[PATCH libmlx4] Add support for ibv_cmd_create_qp_ex2
Add an extension verb mlx4_cmd_create_qp_ex that follows the standard
extension verb mechanism. This function is called from mlx4_create_qp_ex
but supports the extension verbs functions and stores the creation flags.
In addition, check that the comp_mask values of struct ibv_qp_init_attr_ex
are valid.

Signed-off-by: Eran Ben Elisha
Signed-off-by: Yishai Hadas
---
 src/mlx4-abi.h | 18 ++
 src/verbs.c    | 51 +++
 2 files changed, 65 insertions(+), 4 deletions(-)

diff --git a/src/mlx4-abi.h b/src/mlx4-abi.h
index b48f6fc..ac21fa8 100644
--- a/src/mlx4-abi.h
+++ b/src/mlx4-abi.h
@@ -111,4 +111,22 @@ struct mlx4_create_qp {
 	__u8 reserved[5];
 };

+struct mlx4_create_qp_drv_ex {
+	__u64 buf_addr;
+	__u64 db_addr;
+	__u8  log_sq_bb_count;
+	__u8  log_sq_stride;
+	__u8  sq_no_prefetch;	/* was reserved in ABI 2 */
+	__u8  reserved[5];
+};
+
+struct mlx4_create_qp_ex {
+	struct ibv_create_qp_ex		ibv_cmd;
+	struct mlx4_create_qp_drv_ex	drv_ex;
+};
+
+struct mlx4_create_qp_resp_ex {
+	struct ibv_create_qp_resp_ex	ibv_resp;
+};
+
 #endif /* MLX4_ABI_H */
diff --git a/src/verbs.c b/src/verbs.c
index 2cb1f8a..2cf240d 100644
--- a/src/verbs.c
+++ b/src/verbs.c
@@ -458,6 +458,43 @@ int mlx4_destroy_srq(struct ibv_srq *srq)
 	return 0;
 }

+static int mlx4_cmd_create_qp_ex(struct ibv_context *context,
+				 struct ibv_qp_init_attr_ex *attr,
+				 struct mlx4_create_qp *cmd,
+				 struct mlx4_qp *qp)
+{
+	struct mlx4_create_qp_ex cmd_ex;
+	struct mlx4_create_qp_resp_ex resp;
+	int ret;
+
+	memset(&cmd_ex, 0, sizeof(cmd_ex));
+	memcpy(&cmd_ex.ibv_cmd.base, &cmd->ibv_cmd.user_handle,
+	       offsetof(typeof(cmd->ibv_cmd), is_srq) +
+	       sizeof(cmd->ibv_cmd.is_srq) -
+	       offsetof(typeof(cmd->ibv_cmd), user_handle));
+
+	memcpy(&cmd_ex.drv_ex, &cmd->buf_addr,
+	       offsetof(typeof(*cmd), sq_no_prefetch) +
+	       sizeof(cmd->sq_no_prefetch) - sizeof(cmd->ibv_cmd));
+
+	ret = ibv_cmd_create_qp_ex2(context, &qp->verbs_qp,
+				    sizeof(qp->verbs_qp), attr,
+				    &cmd_ex.ibv_cmd, sizeof(cmd_ex.ibv_cmd),
+				    sizeof(cmd_ex), &resp.ibv_resp,
+				    sizeof(resp.ibv_resp), sizeof(resp));
+	return ret;
+}
+
+enum {
+	MLX4_CREATE_QP_SUP_COMP_MASK = (IBV_QP_INIT_ATTR_PD |
+					IBV_QP_INIT_ATTR_XRCD |
+					IBV_QP_INIT_ATTR_CREATE_FLAGS),
+};
+
+enum {
+	MLX4_CREATE_QP_EX2_COMP_MASK = (IBV_QP_INIT_ATTR_CREATE_FLAGS),
+};
+
 struct ibv_qp *mlx4_create_qp_ex(struct ibv_context *context,
 				 struct ibv_qp_init_attr_ex *attr)
 {
@@ -474,6 +511,9 @@ struct ibv_qp *mlx4_create_qp_ex(struct ibv_context *context,
 	    attr->cap.max_inline_data > 1024)
 		return NULL;

+	if (attr->comp_mask & ~MLX4_CREATE_QP_SUP_COMP_MASK)
+		return NULL;
+
 	qp = calloc(1, sizeof *qp);
 	if (!qp)
 		return NULL;
@@ -529,12 +569,15 @@ struct ibv_qp *mlx4_create_qp_ex(struct ibv_context *context,
 		; /* nothing */
 	cmd.sq_no_prefetch = 0;	/* OK for ABI 2: just a reserved field */
 	memset(cmd.reserved, 0, sizeof cmd.reserved);
 	pthread_mutex_lock(&to_mctx(context)->qp_table_mutex);
-	ret = ibv_cmd_create_qp_ex(context, &qp->verbs_qp,
-				   sizeof(qp->verbs_qp), attr,
-				   &cmd.ibv_cmd, sizeof cmd, &resp, sizeof resp);
+	if (attr->comp_mask & MLX4_CREATE_QP_EX2_COMP_MASK)
+		ret = mlx4_cmd_create_qp_ex(context, attr, &cmd, qp);
+	else
+		ret = ibv_cmd_create_qp_ex(context, &qp->verbs_qp,
+					   sizeof(qp->verbs_qp), attr,
+					   &cmd.ibv_cmd, sizeof(cmd), &resp,
+					   sizeof(resp));
 	if (ret)
 		goto err_rq_db;
-- 
1.8.3.1
Re: [PATCH libibverbs] Expose QP block self multicast loopback creation flag
On 10/27/2015 2:53 PM, Eran Ben Elisha wrote:
> Add QP creation flag which indicates that the QP will not receive self
> multicast loopback traffic.
>
> ibv_cmd_create_qp_ex was already defined but could not get extended, add
> ibv_cmd_create_qp_ex2 which follows the extension scheme and hence could
> be extendible in the future for more features.
>
> Signed-off-by: Eran Ben Elisha
> Reviewed-by: Moshe Lazer

Eran,

If there's a V1, I would use "Add QP creation flags, support blocking self
multicast loopback" for the title, b/c this better reflects what the patch
is doing.

Or.
Re: [PATCH] iser-target: Remove an unused variable
On 22/10/2015 21:14, Bart Van Assche wrote:
> Detected this by compiling with W=1.
>
> Signed-off-by: Bart Van Assche
> Cc: Sagi Grimberg

FWIW,

Reviewed-by: Sagi Grimberg
Re: [PATCH] IB/iser: Remove an unused variable
> Detected this by compiling with W=1.
>
> Signed-off-by: Bart Van Assche
> Cc: Sagi Grimberg

FWIW,

Reviewed-by: Sagi Grimberg
Re: merge struct ib_device_attr into struct ib_device V2
Did we converge on this?

Just a heads up to Doug, this conflicts with

  [PATCH v4 11/16] xprtrdma: Pre-allocate Work Requests for backchannel

but it's trivial to sort out...
Re: [PATCH 1/4 v2] staging: ipath: ipath_driver: Use setup_timer
On October 27, 2015 4:40:42 PM GMT+05:30, Dan Carpenter wrote:
>On Tue, Oct 27, 2015 at 11:45:18AM +0200, Leon Romanovsky wrote:
>> On Tue, Oct 27, 2015 at 11:19 AM, Dan Carpenter wrote:
>> > On Sun, Oct 25, 2015 at 01:21:11PM +0200, Leon Romanovsky wrote:
>> >> On Sun, Oct 25, 2015 at 12:17 PM, Muhammad Falak R Wani wrote:
>> >> Please follow standard naming convention for the patches.
>> >> It should be [PATCH v2 1/4] and not [PATCH 1/4 v2].
>> >
>> > Does this matter? It's in a thread so it sorts fine either way.
>> It will be wise if people read guides and follow examples.
>>
>> [1] https://www.kernel.org/doc/Documentation/SubmittingPatches
>
>That document doesn't really specify one way or the other. And even if
>it did then why would you care? Stop being so picky for no reason.
>
>regards,
>dan carpenter

Sorry, my bad. Won't repeat such mistakes.
--
mfrw
RE: [PATCH 0/2] Expose max_sge_rd correctly
> -----Original Message-----
> From: Sagi Grimberg [mailto:sa...@mellanox.com]
> Sent: Tuesday, October 27, 2015 4:41 AM
> To: linux-rdma@vger.kernel.org; target-de...@vger.kernel.org
> Cc: Steve Wise; Nicholas A. Bellinger; Or Gerlitz; Doug Ledford
> Subject: [PATCH 0/2] Expose max_sge_rd correctly
>
> This addresses a specific mlx4 issue where the max_sge_rd
> is actually smaller than max_sge (rdma reads with max_sge
> entries completes with error).
>
> The second patch removes the explicit work-around from the
> iser target code.
>
> This applies on top of Christoph's device attributes modification.

Looks correct to me. Series

Reviewed-by: Steve Wise
Re: [PATCH 1/2] mlx4: Expose correct max_sge_rd limit
On 10/27/2015 11:40 AM, Sagi Grimberg wrote:
> mlx4 devices (ConnectX-2, ConnectX-3) can not issue max_sge in a single
> RDMA_READ request (resulting in a completion error). Thus, expose lower
> max_sge_rd to avoid this issue.

Sagi,

I can hear your pain when wearing the iser target driver maintainer hat.
Still, this patch is currently a pure WA b/c we didn't do RCA (Root Cause
Analysis). Lets wait for the RCA (which might yield the same patch, BTW)
and keep suffering in LIO.

Or.

> Signed-off-by: Sagi Grimberg
> ---
>  drivers/infiniband/hw/mlx4/main.c | 3 ++-
>  1 files changed, 2 insertions(+), 1 deletions(-)
>
> diff --git a/drivers/infiniband/hw/mlx4/main.c b/drivers/infiniband/hw/mlx4/main.c
> index 3889723..46305dc 100644
> --- a/drivers/infiniband/hw/mlx4/main.c
> +++ b/drivers/infiniband/hw/mlx4/main.c
> @@ -499,7 +499,8 @@ static int mlx4_ib_init_device_flags(struct ib_device *ibdev)
>  	ibdev->max_qp_wr = dev->dev->caps.max_wqes - MLX4_IB_SQ_MAX_SPARE;
>  	ibdev->max_sge = min(dev->dev->caps.max_sq_sg,
>  			     dev->dev->caps.max_rq_sg);
> -	ibdev->max_sge_rd = ibdev->max_sge;
> +	/* reserve 2 sge slots for rdma reads */
> +	ibdev->max_sge_rd = ibdev->max_sge - 2;
>  	ibdev->max_cq = dev->dev->quotas.cq;
>  	ibdev->max_cqe = dev->dev->caps.max_cqes;
>  	ibdev->max_mr = dev->dev->quotas.mpt;
[PATCH] IB/sa: replace GFP_KERNEL with GFP_ATOMIC
Replace GFP_KERNEL with GFP_ATOMIC, as code holding a spinlock must be
atomic. GFP_KERNEL may sleep and can cause a deadlock, whereas GFP_ATOMIC
may fail but avoids the deadlock.

Signed-off-by: Saurabh Sengar
---
 drivers/infiniband/core/sa_query.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/infiniband/core/sa_query.c b/drivers/infiniband/core/sa_query.c
index 8c014b3..cd1f911 100644
--- a/drivers/infiniband/core/sa_query.c
+++ b/drivers/infiniband/core/sa_query.c
@@ -526,7 +526,7 @@ static int ib_nl_send_msg(struct ib_sa_query *query)
 	if (len <= 0)
 		return -EMSGSIZE;

-	skb = nlmsg_new(len, GFP_KERNEL);
+	skb = nlmsg_new(len, GFP_ATOMIC);
 	if (!skb)
 		return -ENOMEM;

@@ -544,7 +544,7 @@ static int ib_nl_send_msg(struct ib_sa_query *query)
 	/* Repair the nlmsg header length */
 	nlmsg_end(skb, nlh);

-	ret = ibnl_multicast(skb, nlh, RDMA_NL_GROUP_LS, GFP_KERNEL);
+	ret = ibnl_multicast(skb, nlh, RDMA_NL_GROUP_LS, GFP_ATOMIC);
 	if (!ret)
 		ret = len;
 	else
-- 
1.9.1
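The rule the patch applies: an allocation made while a spinlock is held must not sleep, so the sleeping `GFP_KERNEL` flag has to become `GFP_ATOMIC`, which never sleeps but may return NULL. A schematic kernel-style sketch of the pattern (the lock and function names here are hypothetical, not the sa_query code):

```c
#include <linux/slab.h>
#include <linux/spinlock.h>

static DEFINE_SPINLOCK(query_lock);

static int queue_item(size_t len)
{
	void *buf;

	spin_lock(&query_lock);
	/*
	 * GFP_KERNEL here could sleep waiting for memory reclaim while
	 * we hold query_lock; sleeping in atomic context is a bug, and
	 * if reclaim ever needs this lock it deadlocks.  GFP_ATOMIC
	 * allocates from emergency reserves without sleeping, so the
	 * NULL case must be handled instead.
	 */
	buf = kmalloc(len, GFP_ATOMIC);
	if (!buf) {
		spin_unlock(&query_lock);
		return -ENOMEM;
	}
	/* ... fill and hand off buf ... */
	kfree(buf);
	spin_unlock(&query_lock);
	return 0;
}
```

The trade-off is exactly as the commit message says: `GFP_ATOMIC` trades "may sleep" for "may fail", so every caller needs a `-ENOMEM` path.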
Re: [PATCH 1/2] mlx4: Expose correct max_sge_rd limit
On 27/10/2015 16:39, Or Gerlitz wrote:
> On 10/27/2015 11:40 AM, Sagi Grimberg wrote:
>> mlx4 devices (ConnectX-2, ConnectX-3) can not issue max_sge in a single
>> RDMA_READ request (resulting in a completion error). Thus, expose lower
>> max_sge_rd to avoid this issue.
>
> Sagi,
>
> Still, this patch is currently pure WA b/c we didn't do RCA (Root Cause
> Analysis)

Hey Or,

So from my discussions with the HW folks, an RDMA_READ wqe cannot exceed
512B. The wqe control segment is 16 bytes, the rdma segment is 12 bytes
(rkey + raddr), and each sge is 16 bytes, so the computation is:

	(512B - 16B - 12B) / 16B = 30

The reason is that the HW needs to fetch the rdma_read wqe on the RX path
(rdma_read response) and it has a limited buffer at that point.

Perhaps a dedicated #define for that is needed here. I'll add that in the
change log in v1.

Cheers,
Sagi.
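The arithmetic above can be checked directly from the sizes quoted in the mail (512-byte WQE buffer, 16-byte control segment, 12-byte RDMA segment of rkey + raddr, 16 bytes per SGE):

```c
/* Max scatter/gather entries that fit in one RDMA_READ WQE, using the
 * segment sizes quoted in the thread.  Integer division floors the
 * result, which is what matters: a partial SGE slot is unusable. */
static int max_read_sges(int wqe_bytes, int ctrl_bytes,
			 int raddr_bytes, int sge_bytes)
{
	return (wqe_bytes - ctrl_bytes - raddr_bytes) / sge_bytes;
}
```

With the quoted numbers, `max_read_sges(512, 16, 12, 16)` gives 30, which lines up with the later observation that isert's historical `max_sge - 2` work-around yielded 30 on mlx4's `max_sge` of 32.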
Re: [PATCH 1/2] mlx4: Expose correct max_sge_rd limit
On 10/27/2015 6:03 PM, Sagi Grimberg wrote:
> So from my discussions with the HW folks an RDMA_READ wqe cannot exceed
> 512B. The wqe control segment is 16 bytes, the rdma section is 12 bytes
> (rkey + raddr) and each sge is 16 bytes so the computation is:
> (512B-16B-12B)/16B = 30.

But AFAIR, the magic number was 28... how does this go hand in hand with
your findings?
Re: [PATCH 1/2] mlx4: Expose correct max_sge_rd limit
> But AFAIR, the magic number was 28... how does this go hand in hand
> with your findings?

mlx4 max_sge is 32, and isert does max_sge - 2 = 30. So it always used
30... and I've run it reliably with this for a while now.

This thing existed before I was involved, so I might not be familiar with
all the details...

Sagi.
Re: [PATCH 1/2] mlx4: Expose correct max_sge_rd limit
On 10/27/2015 02:40 AM, Sagi Grimberg wrote:
> mlx4 devices (ConnectX-2, ConnectX-3) can not issue max_sge in a single
> RDMA_READ request (resulting in a completion error). Thus, expose lower
> max_sge_rd to avoid this issue.
>
> Signed-off-by: Sagi Grimberg
> ---
>  drivers/infiniband/hw/mlx4/main.c | 3 ++-
>  1 files changed, 2 insertions(+), 1 deletions(-)
>
> diff --git a/drivers/infiniband/hw/mlx4/main.c b/drivers/infiniband/hw/mlx4/main.c
> index 3889723..46305dc 100644
> --- a/drivers/infiniband/hw/mlx4/main.c
> +++ b/drivers/infiniband/hw/mlx4/main.c
> @@ -499,7 +499,8 @@ static int mlx4_ib_init_device_flags(struct ib_device *ibdev)
>  	ibdev->max_qp_wr = dev->dev->caps.max_wqes - MLX4_IB_SQ_MAX_SPARE;
>  	ibdev->max_sge = min(dev->dev->caps.max_sq_sg,
>  			     dev->dev->caps.max_rq_sg);
> -	ibdev->max_sge_rd = ibdev->max_sge;
> +	/* reserve 2 sge slots for rdma reads */
> +	ibdev->max_sge_rd = ibdev->max_sge - 2;
>  	ibdev->max_cq = dev->dev->quotas.cq;
>  	ibdev->max_cqe = dev->dev->caps.max_cqes;
>  	ibdev->max_mr = dev->dev->quotas.mpt;

Hello Sagi,

Is this the same issue as what has been discussed in
http://www.spinics.net/lists/linux-rdma/msg21799.html ?

Thanks,

Bart.
Re: [PATCH 1/2] mlx4: Expose correct max_sge_rd limit
> Hello Sagi,
>
> Is this the same issue as what has been discussed in
> http://www.spinics.net/lists/linux-rdma/msg21799.html ?

Looks like it. I think this patch addresses the issue, but lets CC Eli to
comment in case I'm missing something.

Thanks for digging this up...

Sagi.
[PATCH v1 0/2] Handle mlx4 max_sge_rd correctly
This addresses a specific mlx4 issue where max_sge_rd is actually smaller than max_sge (rdma reads with max_sge entries complete with an error). The second patch removes the explicit work-around from the iser target code. Changes from v0: - Used a dedicated enumeration MLX4_MAX_SGE_RD and added a root cause analysis to the patch change log. - Fixed isert qp creation to use max_sge but construct rdma work requests with the minimum of max_sge and max_sge_rd, as non-rdma sends (login rsp) take 2 sges (and some devices have max_sge_rd = 1). Sagi Grimberg (2): mlx4: Expose correct max_sge_rd limit iser-target: Remove explicit mlx4 work-around drivers/infiniband/hw/mlx4/main.c |2 +- drivers/infiniband/ulp/isert/ib_isert.c | 13 +++-- include/linux/mlx4/device.h | 11 +++ 3 files changed, 15 insertions(+), 11 deletions(-)
[PATCH v1 2/2] iser-target: Remove explicit mlx4 work-around
The driver now exposes sufficient limits, so we can avoid the mlx4-specific work-around. Signed-off-by: Sagi Grimberg --- drivers/infiniband/ulp/isert/ib_isert.c | 13 +++-- 1 files changed, 3 insertions(+), 10 deletions(-) diff --git a/drivers/infiniband/ulp/isert/ib_isert.c b/drivers/infiniband/ulp/isert/ib_isert.c index 96336a9..eb985f9 100644 --- a/drivers/infiniband/ulp/isert/ib_isert.c +++ b/drivers/infiniband/ulp/isert/ib_isert.c @@ -141,16 +141,9 @@ isert_create_qp(struct isert_conn *isert_conn, attr.recv_cq = comp->cq; attr.cap.max_send_wr = ISERT_QP_MAX_REQ_DTOS; attr.cap.max_recv_wr = ISERT_QP_MAX_RECV_DTOS + 1; - /* -* FIXME: Use devattr.max_sge - 2 for max_send_sge as -* work-around for RDMA_READs with ConnectX-2. -* -* Also, still make sure to have at least two SGEs for -* outgoing control PDU responses. -*/ - attr.cap.max_send_sge = max(2, device->ib_device->max_sge - 2); - isert_conn->max_sge = attr.cap.max_send_sge; - + attr.cap.max_send_sge = device->ib_device->max_sge; + isert_conn->max_sge = min(device->ib_device->max_sge, + device->ib_device->max_sge_rd); attr.cap.max_recv_sge = 1; attr.sq_sig_type = IB_SIGNAL_REQ_WR; attr.qp_type = IB_QPT_RC; -- 1.7.1
[PATCH v1 1/2] mlx4: Expose correct max_sge_rd limit
mlx4 devices (ConnectX-2, ConnectX-3) have a limitation where rdma read work queue entries cannot exceed 512 bytes. An rdma_read wqe needs to fit in 512 bytes: - wqe control segment (16 bytes) - rdma segment (12 bytes) - scatter elements (16 bytes each) So max_sge_rd should be: (512 - 16 - 12) / 16 = 30. Signed-off-by: Sagi Grimberg --- drivers/infiniband/hw/mlx4/main.c |2 +- include/linux/mlx4/device.h | 11 +++ 2 files changed, 12 insertions(+), 1 deletions(-) diff --git a/drivers/infiniband/hw/mlx4/main.c b/drivers/infiniband/hw/mlx4/main.c index 3889723..d8453f1 100644 --- a/drivers/infiniband/hw/mlx4/main.c +++ b/drivers/infiniband/hw/mlx4/main.c @@ -499,7 +499,7 @@ static int mlx4_ib_init_device_flags(struct ib_device *ibdev) ibdev->max_qp_wr = dev->dev->caps.max_wqes - MLX4_IB_SQ_MAX_SPARE; ibdev->max_sge = min(dev->dev->caps.max_sq_sg, dev->dev->caps.max_rq_sg); - ibdev->max_sge_rd = ibdev->max_sge; + ibdev->max_sge_rd = MLX4_MAX_SGE_RD; ibdev->max_cq = dev->dev->quotas.cq; ibdev->max_cqe = dev->dev->caps.max_cqes; ibdev->max_mr = dev->dev->quotas.mpt; diff --git a/include/linux/mlx4/device.h b/include/linux/mlx4/device.h index baad4cb..90c12f0 100644 --- a/include/linux/mlx4/device.h +++ b/include/linux/mlx4/device.h @@ -425,6 +425,17 @@ enum { }; enum { + /* +* Max wqe size for rdma read is 512 bytes, so this +* limits our max_sge_rd as the wqe needs to fit: +* - ctrl segment (16 bytes) +* - rdma segment (12 bytes) +* - scatter elements (16 bytes each) +*/ + MLX4_MAX_SGE_RD = (512 - 16 - 12) / 16 +}; + +enum { MLX4_DEV_PMC_SUBTYPE_GUID_INFO = 0x14, MLX4_DEV_PMC_SUBTYPE_PORT_INFO = 0x15, MLX4_DEV_PMC_SUBTYPE_PKEY_TABLE = 0x16, -- 1.7.1
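The 512-byte WQE budget can be checked in a couple of lines of C. The derivation mirrors the patch's MLX4_MAX_SGE_RD, but the helper enum names here are illustrative, not part of the kernel headers:

```c
#include <assert.h>

/* An mlx4 RDMA read WQE must fit in 512 bytes:
 * 16-byte control segment + 12-byte RDMA segment,
 * leaving the remainder for 16-byte scatter entries. */
enum {
	MLX4_RD_WQE_BYTES   = 512,
	MLX4_CTRL_SEG_BYTES = 16,
	MLX4_RDMA_SEG_BYTES = 12,
	MLX4_SGE_BYTES      = 16,
	MLX4_MAX_SGE_RD     = (MLX4_RD_WQE_BYTES - MLX4_CTRL_SEG_BYTES -
			       MLX4_RDMA_SEG_BYTES) / MLX4_SGE_BYTES,
};
```

This also explains the empirical value seen earlier in the thread: (512 - 16 - 12) / 16 = 484 / 16 = 30 with integer division, matching isert's old max_sge - 2 = 30 on mlx4.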
[PATCH libibverbs 0/5] Completion timestamping
Hi Doug, This series adds completion timestamps for libibverbs. In order to do so, we add an extensible poll cq. The problem with extending the WC is that you could run out of the current cache line when adding new features and degrade performance. This is solved by introducing a custom WC. The user creates a CQ using ibv_create_cq_ex, stating which WC fields should be returned by this CQ. When the user calls ibv_poll_cq_ex, this custom WC is returned. The field order and sizes are declared in advance (we avoid alignment padding by ordering the fields from 64-bit fields down to 8-bit fields). Each WC has a wc_flags field representing which fields are valid in this WC. The vendor drivers could optimize those calls extensively. Completion timestamps are added on top of the extended ibv_create_cq_ex and ibv_poll_cq_ex verbs. The user should call ibv_create_cq_ex stating that this CQ should support reporting completion timestamps. ibv_poll_cq_ex reports this raw completion timestamp value with every completion. In the future, a verb like the following could be added in order to transform this time into system time: ibv_get_timestamp(struct ibv_context *context, uint64_t raw_time, struct timespec *ts, int flags); The timestamp mask (number of supported bits) and the HCA's frequency are given by the ibv_query_device_ex verb. We also give the user the ability to read the HCA's current clock. This is done via ibv_query_values_ex. This verb could be extended in the future for other interesting information. 
Thanks, Matan Matan Barak (5): Add ibv_poll_cq_ex verb Add timestamp_mask and hca_core_clock to ibv_query_device_ex Add support for extended ibv_create_cq Add completion timestamp support for ibv_poll_cq_ex Add ibv_query_values_ex Makefile.am | 6 +- examples/devinfo.c| 10 ++ include/infiniband/compiler.h | 89 include/infiniband/driver.h | 9 ++ include/infiniband/kern-abi.h | 26 +++- include/infiniband/verbs.h| 318 ++ man/ibv_create_cq_ex.3| 71 ++ man/ibv_poll_cq_ex.3 | 173 +++ man/ibv_query_device_ex.3 | 6 +- src/cmd.c | 63 + src/device.c | 44 ++ src/ibverbs.h | 12 ++ src/libibverbs.map| 1 + 13 files changed, 822 insertions(+), 6 deletions(-) create mode 100644 include/infiniband/compiler.h create mode 100644 man/ibv_create_cq_ex.3 create mode 100644 man/ibv_poll_cq_ex.3 -- 2.1.0
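The cover letter's point about field ordering is easiest to see with a concrete layout. The struct below is a hypothetical custom WC (illustrative only, not part of the proposed API) for a CQ that requested the timestamp, byte_len, qp_num and sl fields: because members are laid out in descending size order (64-bit first, 8-bit last), every member is naturally aligned and no padding appears between them, so both library and application can compute each field's offset purely from the set of requested wc_flags:

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical custom WC for wc_flags =
 * COMPLETION_TIMESTAMP | BYTE_LEN | QP_NUM | SL.
 * Descending field sizes mean no internal padding. */
struct custom_wc {
	uint64_t completion_timestamp;	/* 64-bit fields first */
	uint32_t byte_len;
	uint32_t qp_num;
	uint8_t  sl;			/* 8-bit fields last */
};
```

Dropping a field simply shifts every later offset down by that field's size; no alignment holes ever appear.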
[PATCH libibverbs 1/5] Add ibv_poll_cq_ex verb
This is an extension verb for ibv_poll_cq. It allows the user to poll the cq for specific wc fields only, while allowing to extend the wc. The verb calls the provider in order to fill the WC with the required information. Signed-off-by: Matan Barak --- Makefile.am | 5 +- include/infiniband/compiler.h | 89 + include/infiniband/verbs.h| 215 ++ man/ibv_poll_cq_ex.3 | 171 + 4 files changed, 478 insertions(+), 2 deletions(-) create mode 100644 include/infiniband/compiler.h create mode 100644 man/ibv_poll_cq_ex.3 diff --git a/Makefile.am b/Makefile.am index c85e98a..339bcec 100644 --- a/Makefile.am +++ b/Makefile.am @@ -44,7 +44,8 @@ libibverbsincludedir = $(includedir)/infiniband libibverbsinclude_HEADERS = include/infiniband/arch.h include/infiniband/driver.h \ include/infiniband/kern-abi.h include/infiniband/opcode.h include/infiniband/verbs.h \ -include/infiniband/sa-kern-abi.h include/infiniband/sa.h include/infiniband/marshall.h +include/infiniband/sa-kern-abi.h include/infiniband/sa.h include/infiniband/marshall.h \ +include/infiniband/compiler.h man_MANS = man/ibv_asyncwatch.1 man/ibv_devices.1 man/ibv_devinfo.1\ man/ibv_rc_pingpong.1 man/ibv_uc_pingpong.1 man/ibv_ud_pingpong.1 \ @@ -63,7 +64,7 @@ man_MANS = man/ibv_asyncwatch.1 man/ibv_devices.1 man/ibv_devinfo.1 \ man/ibv_req_notify_cq.3 man/ibv_resize_cq.3 man/ibv_rate_to_mbps.3 \ man/ibv_create_qp_ex.3 man/ibv_create_srq_ex.3 man/ibv_open_xrcd.3 \ man/ibv_get_srq_num.3 man/ibv_open_qp.3 \ -man/ibv_query_device_ex.3 +man/ibv_query_device_ex.3 man/ibv_poll_cq_ex.3 DEBIAN = debian/changelog debian/compat debian/control debian/copyright \ debian/ibverbs-utils.install debian/libibverbs1.install \ diff --git a/include/infiniband/compiler.h b/include/infiniband/compiler.h new file mode 100644 index 000..b4bab98 --- /dev/null +++ b/include/infiniband/compiler.h @@ -0,0 +1,89 @@ +/* + * Copyright (c) 2015 Mellanox, Ltd. All rights reserved. 
+ * + * This software is available to you under a choice of one of two + * licenses. You may choose to be licensed under the terms of the GNU + * General Public License (GPL) Version 2, available from the file + * COPYING in the main directory of this source tree, or the + * OpenIB.org BSD license below: + * + * Redistribution and use in source and binary forms, with or + * without modification, are permitted provided that the following + * conditions are met: + * + * - Redistributions of source code must retain the above + *copyright notice, this list of conditions and the following + *disclaimer. + * + * - Redistributions in binary form must reproduce the above + *copyright notice, this list of conditions and the following + *disclaimer in the documentation and/or other materials + *provided with the distribution. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + */ + +#ifndef _COMPILER_ +#define _COMPILER_ + +#if (__GNUC__ > 3 || (__GNUC__ == 3 && __GNUC_MINOR__ >= 4)) +#define ibv_popcount64 __builtin_popcountll +#endif + +#ifndef __has_builtin + #define __has_builtin(x) 0 /* Compatibility with non-clang compilers. */ +#endif + +#if __has_builtin(__builtin_popcountll) && !defined(ibv_popcount64) + #define ibv_popcount64 __builtin_popcountll +#endif + +#ifndef ibv_popcount64 +/* From FreeBSD: + * Copyright (C) 1995, 1996, 1997, and 1998 WIDE Project. + * All rights reserved. 
+ * + * Redistribution and use in source and binary forms, with or without + * modification, are permitted provided that the following conditions + * are met: + * 1. Redistributions of source code must retain the above copyright + *notice, this list of conditions and the following disclaimer. + * 2. Redistributions in binary form must reproduce the above copyright + *notice, this list of conditions and the following disclaimer in the + *documentation and/or other materials provided with the distribution. + * 3. Neither the name of the project nor the names of its contributors + *may be used to endorse or promote products derived from this software + *without specific prior written permission. + * + * THIS SOFTWARE IS PROVIDED BY THE PROJECT AND CONTRIBUTORS ``AS IS'' AND + * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE + * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITN
[PATCH libibverbs 4/5] Add completion timestamp support for ibv_poll_cq_ex
Add support for raw completion timestamp through ibv_poll_cq_ex. Signed-off-by: Matan Barak --- include/infiniband/verbs.h | 7 ++- man/ibv_poll_cq_ex.3 | 2 ++ 2 files changed, 8 insertions(+), 1 deletion(-) diff --git a/include/infiniband/verbs.h b/include/infiniband/verbs.h index f80126a..3d66726 100644 --- a/include/infiniband/verbs.h +++ b/include/infiniband/verbs.h @@ -391,6 +391,7 @@ enum ibv_wc_flags_ex { IBV_WC_EX_WITH_SLID = 1 << 7, IBV_WC_EX_WITH_SL = 1 << 8, IBV_WC_EX_WITH_DLID_PATH_BITS = 1 << 9, + IBV_WC_EX_WITH_COMPLETION_TIMESTAMP = 1 << 10, }; enum { @@ -409,6 +410,10 @@ enum { }; /* fields order in wc_ex + * // Raw timestamp of completion. A raw timestamp is implementation + * // defined and can not be relied upon to have any ordering value + * // between more than one HCA or driver. + * uint64_tcompletion_timestamp; * uint32_tbyte_len, * uint32_timm_data; // in network byte order * uint32_tqp_num; @@ -420,7 +425,7 @@ enum { */ enum { - IBV_WC_EX_WITH_64BIT_FIELDS = 0 + IBV_WC_EX_WITH_64BIT_FIELDS = IBV_WC_EX_WITH_COMPLETION_TIMESTAMP }; enum { diff --git a/man/ibv_poll_cq_ex.3 b/man/ibv_poll_cq_ex.3 index 8f336bc..3eb9bc0 100644 --- a/man/ibv_poll_cq_ex.3 +++ b/man/ibv_poll_cq_ex.3 @@ -54,12 +54,14 @@ IBV_WC_EX_WITH_PKEY_INDEX = 1 << 6, /* The returned wc_ex contain IBV_WC_EX_WITH_SLID = 1 << 7, /* The returned wc_ex contains slid field */ IBV_WC_EX_WITH_SL = 1 << 8, /* The returned wc_ex contains sl field */ IBV_WC_EX_WITH_DLID_PATH_BITS = 1 << 9, /* The returned wc_ex contains dlid_path_bits field */ +IBV_WC_EX_WITH_COMPLETION_TIMESTAMP = 1 << 10, /* The returned wc_ex contains completion_timestmap field */ .in -8 }; .fi wc_flags describes which of the fields in buffer[0] have a valid value. The order of these fields and sizes are always the following: .nf +uint64_tcompletion_timestamp; /* Raw timestamp of completion. Implementation defined. 
Can't be relied upon to have any ordering value between more than one driver/hca */ uint32_tbyte_len, uint32_timm_data; /* in network byte order */ uint32_tqp_num; -- 2.1.0
[PATCH libibverbs 5/5] Add ibv_query_values_ex
Add an extension verb to query certain values of device. Currently, only IBV_VALUES_HW_CLOCK is supported, but this verb could support other flags like IBV_VALUES_TEMP_SENSOR, IBV_VALUES_CORE_FREQ, etc. This extension verb only calls the provider. The provider has to query this value somehow and mark the queried values in comp_mask. Signed-off-by: Matan Barak --- include/infiniband/verbs.h | 33 + 1 file changed, 33 insertions(+) diff --git a/include/infiniband/verbs.h b/include/infiniband/verbs.h index 3d66726..4829dac 100644 --- a/include/infiniband/verbs.h +++ b/include/infiniband/verbs.h @@ -1234,6 +1234,16 @@ struct ibv_create_cq_attr_ex { uint32_tflags; }; +enum ibv_values_mask { + IBV_VALUES_MASK_RAW_CLOCK = 1 << 0, + IBV_VALUES_MASK_RESERVED= 1 << 1 +}; + +struct ibv_values_ex { + uint32_tcomp_mask; + struct timespec raw_clock; +}; + enum verbs_context_mask { VERBS_CONTEXT_XRCD = 1 << 0, VERBS_CONTEXT_SRQ = 1 << 1, @@ -1250,6 +1260,8 @@ struct ibv_poll_cq_ex_attr { struct verbs_context { /* "grows up" - new fields go here */ + int (*query_values)(struct ibv_context *context, + struct ibv_values_ex *values); struct ibv_cq *(*create_cq_ex)(struct ibv_context *context, struct ibv_create_cq_attr_ex *); void *priv; @@ -1730,6 +1742,27 @@ ibv_create_qp_ex(struct ibv_context *context, struct ibv_qp_init_attr_ex *qp_ini } /** + * ibv_query_values_ex - Get current @q_values of device, + * @q_values is mask (Or's bits of enum ibv_values_mask) of the attributes + * we need to query. 
+ */ +static inline int +ibv_query_values_ex(struct ibv_context *context, + struct ibv_values_ex *values) +{ + struct verbs_context *vctx; + + vctx = verbs_get_ctx_op(context, query_values); + if (!vctx) + return ENOSYS; + + if (values->comp_mask & ~(IBV_VALUES_MASK_RESERVED - 1)) + return EINVAL; + + return vctx->query_values(context, values); +} + +/** * ibv_query_device_ex - Get extended device properties */ static inline int -- 2.1.0
[PATCH libibverbs 3/5] Add support for extended ibv_create_cq
Adding ibv_create_cq_ex. This extended verbs follows the extension verbs scheme and hence could be extendible in the future for more features. The new command supports creation flags with timestamp. Signed-off-by: Matan Barak --- Makefile.am | 3 +- include/infiniband/driver.h | 9 ++ include/infiniband/kern-abi.h | 24 +-- include/infiniband/verbs.h| 63 ++ man/ibv_create_cq_ex.3| 71 +++ src/cmd.c | 42 + src/device.c | 44 +++ src/ibverbs.h | 5 +++ src/libibverbs.map| 1 + 9 files changed, 259 insertions(+), 3 deletions(-) create mode 100644 man/ibv_create_cq_ex.3 diff --git a/Makefile.am b/Makefile.am index 339bcec..b6399d6 100644 --- a/Makefile.am +++ b/Makefile.am @@ -64,7 +64,8 @@ man_MANS = man/ibv_asyncwatch.1 man/ibv_devices.1 man/ibv_devinfo.1 \ man/ibv_req_notify_cq.3 man/ibv_resize_cq.3 man/ibv_rate_to_mbps.3 \ man/ibv_create_qp_ex.3 man/ibv_create_srq_ex.3 man/ibv_open_xrcd.3 \ man/ibv_get_srq_num.3 man/ibv_open_qp.3 \ -man/ibv_query_device_ex.3 man/ibv_poll_cq_ex.3 +man/ibv_query_device_ex.3 man/ibv_poll_cq_ex.3 \ +man/ibv_create_cq_ex.3 DEBIAN = debian/changelog debian/compat debian/control debian/copyright \ debian/ibverbs-utils.install debian/libibverbs1.install \ diff --git a/include/infiniband/driver.h b/include/infiniband/driver.h index 8227df0..0d53554 100644 --- a/include/infiniband/driver.h +++ b/include/infiniband/driver.h @@ -144,6 +144,15 @@ int ibv_cmd_create_cq(struct ibv_context *context, int cqe, int comp_vector, struct ibv_cq *cq, struct ibv_create_cq *cmd, size_t cmd_size, struct ibv_create_cq_resp *resp, size_t resp_size); +int ibv_cmd_create_cq_ex(struct ibv_context *context, +struct ibv_create_cq_attr_ex *cq_attr, +struct ibv_cq *cq, +struct ibv_create_cq_ex *cmd, +size_t cmd_core_size, +size_t cmd_size, +struct ibv_create_cq_resp_ex *resp, +size_t resp_core_size, +size_t resp_size); int ibv_cmd_poll_cq(struct ibv_cq *cq, int ne, struct ibv_wc *wc); int ibv_cmd_req_notify_cq(struct ibv_cq *cq, int solicited_only); #define 
IBV_CMD_RESIZE_CQ_HAS_RESP_PARAMS diff --git a/include/infiniband/kern-abi.h b/include/infiniband/kern-abi.h index cce6ade..b2dcda6 100644 --- a/include/infiniband/kern-abi.h +++ b/include/infiniband/kern-abi.h @@ -110,9 +110,11 @@ enum { enum { IB_USER_VERBS_CMD_QUERY_DEVICE_EX = IB_USER_VERBS_CMD_EXTENDED_MASK | IB_USER_VERBS_CMD_QUERY_DEVICE, + IB_USER_VERBS_CMD_CREATE_CQ_EX = IB_USER_VERBS_CMD_EXTENDED_MASK | + IB_USER_VERBS_CMD_CREATE_CQ, IB_USER_VERBS_CMD_CREATE_FLOW = IB_USER_VERBS_CMD_EXTENDED_MASK + IB_USER_VERBS_CMD_THRESHOLD, - IB_USER_VERBS_CMD_DESTROY_FLOW + IB_USER_VERBS_CMD_DESTROY_FLOW, }; /* @@ -400,6 +402,23 @@ struct ibv_create_cq_resp { __u32 cqe; }; +struct ibv_create_cq_ex { + struct ex_hdr hdr; + __u64 user_handle; + __u32 cqe; + __u32 comp_vector; + __s32 comp_channel; + __u32 comp_mask; + __u32 flags; + __u32 reserved; +}; + +struct ibv_create_cq_resp_ex { + struct ibv_create_cq_resp base; + __u32 comp_mask; + __u32 response_length; +}; + struct ibv_kern_wc { __u64 wr_id; __u32 status; @@ -1033,7 +1052,8 @@ enum { IB_USER_VERBS_CMD_OPEN_QP_V2 = -1, IB_USER_VERBS_CMD_CREATE_FLOW_V2 = -1, IB_USER_VERBS_CMD_DESTROY_FLOW_V2 = -1, - IB_USER_VERBS_CMD_QUERY_DEVICE_EX_V2 = -1 + IB_USER_VERBS_CMD_QUERY_DEVICE_EX_V2 = -1, + IB_USER_VERBS_CMD_CREATE_CQ_EX_V2 = -1, }; struct ibv_modify_srq_v3 { diff --git a/include/infiniband/verbs.h b/include/infiniband/verbs.h index 51b880b..f80126a 100644 --- a/include/infiniband/verbs.h +++ b/include/infiniband/verbs.h @@ -1193,6 +1193,42 @@ struct ibv_context { void *abi_compat; }; +enum ibv_create_cq_attr { + IBV_CREATE_CQ_ATTR_FLAGS= 1 << 0, + IBV_CREATE_CQ_ATTR_RESERVED = 1 << 1 +}; + +enum ibv_create_cq_attr_flags { + IBV_CREATE_CQ_ATTR_COMPLETION_TIMESTAMP = 1 << 0, +}; + +struct ibv_create_cq_attr_ex { + /* Minimum number of entries required for CQ */ + int cqe; + /* Consumer-supplied context returned for completion events */ +
[PATCH libibverbs 2/5] Add timestamp_mask and hca_core_clock to ibv_query_device_ex
The fields timestamp_mask and hca_core_clock were added to the extended version of ibv_query_device verb. timestamp_mask represents the allowed mask of the timestamp. Users could infer the accuracy of the reported possible timestamp. hca_core_clock represents the frequency of the HCA (in HZ). Since timestamp and reading the HCA's core clock is given in hardware cycles, knowing the frequency is mandatory in order to convert this number into seconds. Signed-off-by: Matan Barak --- examples/devinfo.c| 10 ++ include/infiniband/kern-abi.h | 2 ++ include/infiniband/verbs.h| 28 +++- man/ibv_query_device_ex.3 | 6 -- src/cmd.c | 21 + src/ibverbs.h | 7 +++ 6 files changed, 59 insertions(+), 15 deletions(-) diff --git a/examples/devinfo.c b/examples/devinfo.c index a8de982..0af8c3b 100644 --- a/examples/devinfo.c +++ b/examples/devinfo.c @@ -339,6 +339,16 @@ static int print_hca_cap(struct ibv_device *ib_dev, uint8_t ib_port) printf("\tlocal_ca_ack_delay:\t\t%d\n", device_attr.orig_attr.local_ca_ack_delay); print_odp_caps(&device_attr.odp_caps); + if (device_attr.completion_timestamp_mask) + printf("\tcompletion timestamp_mask:\t\t\t0x%016lx\n", + device_attr.completion_timestamp_mask); + else + printf("\tcompletion_timestamp_mask not supported\n"); + + if (device_attr.hca_core_clock) + printf("\thca_core_clock:\t\t\t%lukHZ\n", device_attr.hca_core_clock); + else + printf("\tcore clock not supported\n"); } for (port = 1; port <= device_attr.orig_attr.phys_port_cnt; ++port) { diff --git a/include/infiniband/kern-abi.h b/include/infiniband/kern-abi.h index 800c5ab..cce6ade 100644 --- a/include/infiniband/kern-abi.h +++ b/include/infiniband/kern-abi.h @@ -267,6 +267,8 @@ struct ibv_query_device_resp_ex { __u32 comp_mask; __u32 response_length; struct ibv_odp_caps_resp odp_caps; + __u64 timestamp_mask; + __u64 hca_core_clock; }; struct ibv_query_port { diff --git a/include/infiniband/verbs.h b/include/infiniband/verbs.h index 479bfca..51b880b 100644 --- 
a/include/infiniband/verbs.h +++ b/include/infiniband/verbs.h @@ -204,6 +204,8 @@ struct ibv_device_attr_ex { struct ibv_device_attr orig_attr; uint32_tcomp_mask; struct ibv_odp_caps odp_caps; + uint64_tcompletion_timestamp_mask; + uint64_thca_core_clock; }; enum ibv_mtu { @@ -378,6 +380,19 @@ struct ibv_wc { uint8_t dlid_path_bits; }; +enum ibv_wc_flags_ex { + IBV_WC_EX_GRH = 1 << 0, + IBV_WC_EX_IMM = 1 << 1, + IBV_WC_EX_WITH_BYTE_LEN = 1 << 2, + IBV_WC_EX_WITH_IMM = 1 << 3, + IBV_WC_EX_WITH_QP_NUM = 1 << 4, + IBV_WC_EX_WITH_SRC_QP = 1 << 5, + IBV_WC_EX_WITH_PKEY_INDEX = 1 << 6, + IBV_WC_EX_WITH_SLID = 1 << 7, + IBV_WC_EX_WITH_SL = 1 << 8, + IBV_WC_EX_WITH_DLID_PATH_BITS = 1 << 9, +}; + enum { IBV_WC_FEATURE_FLAGS = IBV_WC_EX_GRH | IBV_WC_EX_IMM }; @@ -393,19 +408,6 @@ enum { IBV_WC_EX_WITH_DLID_PATH_BITS }; -enum ibv_wc_flags_ex { - IBV_WC_EX_GRH = 1 << 0, - IBV_WC_EX_IMM = 1 << 1, - IBV_WC_EX_WITH_BYTE_LEN = 1 << 2, - IBV_WC_EX_WITH_IMM = 1 << 3, - IBV_WC_EX_WITH_QP_NUM = 1 << 4, - IBV_WC_EX_WITH_SRC_QP = 1 << 5, - IBV_WC_EX_WITH_PKEY_INDEX = 1 << 6, - IBV_WC_EX_WITH_SLID = 1 << 7, - IBV_WC_EX_WITH_SL = 1 << 8, - IBV_WC_EX_WITH_DLID_PATH_BITS = 1 << 9, -}; - /* fields order in wc_ex * uint32_tbyte_len, * uint32_timm_data; // in network byte order diff --git a/man/ibv_query_device_ex.3 b/man/ibv_query_device_ex.3 index 1f483d2..db12c2b 100644 --- a/man/ibv_query_device_ex.3 +++ b/man/ibv_query_device_ex.3 @@ -22,8 +22,10 @@ is a pointer to an ibv_device_attr_ex struct, as defined in struct ibv_device_attr_ex { .in +8 struct ibv_device_attr orig_attr; -uint32_t comp_mask; /* Compatibility mask that defines which of the following variables are valid */ -struct ibv_odp_capsodp_caps; /* On-Demand Paging capabilities */ +uint32_t comp_mask; /* Compatibility mask that defines which of the following variables are valid */ +struct ibv_odp_capsodp_caps; /* On-Demand Paging capabilities */ +uint64_t completion_timestamp_mas
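Since both the timestamp and the core clock are reported in hardware cycles, the consumer ultimately needs a cycles-to-time conversion. The helper below is a minimal sketch, not part of the proposed API; note the series itself is ambiguous about units (the commit text says Hz, while the devinfo hunk prints kHZ), so this sketch assumes kHz and the scale factor would change for a device reporting plain Hz:

```c
#include <stdint.h>

/* Convert a delta of raw timestamp cycles to nanoseconds.
 * Assumption: hca_core_clock is in kHz, matching the devinfo print;
 * a 78125 kHz clock means 78125 cycles elapse per millisecond. */
static uint64_t cycles_to_ns(uint64_t cycle_delta, uint64_t hca_core_clock_khz)
{
	/* cycles / (kHz * 1000) seconds == cycles * 1000000 / kHz ns */
	return cycle_delta * 1000000ULL / hca_core_clock_khz;
}
```

A real consumer would also mask the raw values with completion_timestamp_mask and handle counter wraparound before taking the delta.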
[PATCH libibverbs 3/7] Implement ibv_poll_cq_ex extension verb
Add an implementation for verb_poll_cq extension verb. This patch implements the new API via the standard function mlx4_poll_one. Signed-off-by: Matan Barak --- src/cq.c| 307 ++-- src/mlx4.c | 1 + src/mlx4.h | 4 + src/verbs.c | 1 + 4 files changed, 284 insertions(+), 29 deletions(-) diff --git a/src/cq.c b/src/cq.c index 32c9070..c86e824 100644 --- a/src/cq.c +++ b/src/cq.c @@ -52,6 +52,7 @@ enum { }; enum { + CQ_CONTINUE = 1, CQ_OK = 0, CQ_EMPTY= -1, CQ_POLL_ERR = -2 @@ -121,7 +122,9 @@ static void update_cons_index(struct mlx4_cq *cq) *cq->set_ci_db = htonl(cq->cons_index & 0xff); } -static void mlx4_handle_error_cqe(struct mlx4_err_cqe *cqe, struct ibv_wc *wc) +static void mlx4_handle_error_cqe(struct mlx4_err_cqe *cqe, + enum ibv_wc_status *status, + enum ibv_wc_opcode *vendor_err) { if (cqe->syndrome == MLX4_CQE_SYNDROME_LOCAL_QP_OP_ERR) printf(PFX "local QP operation err " @@ -133,64 +136,68 @@ static void mlx4_handle_error_cqe(struct mlx4_err_cqe *cqe, struct ibv_wc *wc) switch (cqe->syndrome) { case MLX4_CQE_SYNDROME_LOCAL_LENGTH_ERR: - wc->status = IBV_WC_LOC_LEN_ERR; + *status = IBV_WC_LOC_LEN_ERR; break; case MLX4_CQE_SYNDROME_LOCAL_QP_OP_ERR: - wc->status = IBV_WC_LOC_QP_OP_ERR; + *status = IBV_WC_LOC_QP_OP_ERR; break; case MLX4_CQE_SYNDROME_LOCAL_PROT_ERR: - wc->status = IBV_WC_LOC_PROT_ERR; + *status = IBV_WC_LOC_PROT_ERR; break; case MLX4_CQE_SYNDROME_WR_FLUSH_ERR: - wc->status = IBV_WC_WR_FLUSH_ERR; + *status = IBV_WC_WR_FLUSH_ERR; break; case MLX4_CQE_SYNDROME_MW_BIND_ERR: - wc->status = IBV_WC_MW_BIND_ERR; + *status = IBV_WC_MW_BIND_ERR; break; case MLX4_CQE_SYNDROME_BAD_RESP_ERR: - wc->status = IBV_WC_BAD_RESP_ERR; + *status = IBV_WC_BAD_RESP_ERR; break; case MLX4_CQE_SYNDROME_LOCAL_ACCESS_ERR: - wc->status = IBV_WC_LOC_ACCESS_ERR; + *status = IBV_WC_LOC_ACCESS_ERR; break; case MLX4_CQE_SYNDROME_REMOTE_INVAL_REQ_ERR: - wc->status = IBV_WC_REM_INV_REQ_ERR; + *status = IBV_WC_REM_INV_REQ_ERR; break; case MLX4_CQE_SYNDROME_REMOTE_ACCESS_ERR: - 
wc->status = IBV_WC_REM_ACCESS_ERR; + *status = IBV_WC_REM_ACCESS_ERR; break; case MLX4_CQE_SYNDROME_REMOTE_OP_ERR: - wc->status = IBV_WC_REM_OP_ERR; + *status = IBV_WC_REM_OP_ERR; break; case MLX4_CQE_SYNDROME_TRANSPORT_RETRY_EXC_ERR: - wc->status = IBV_WC_RETRY_EXC_ERR; + *status = IBV_WC_RETRY_EXC_ERR; break; case MLX4_CQE_SYNDROME_RNR_RETRY_EXC_ERR: - wc->status = IBV_WC_RNR_RETRY_EXC_ERR; + *status = IBV_WC_RNR_RETRY_EXC_ERR; break; case MLX4_CQE_SYNDROME_REMOTE_ABORTED_ERR: - wc->status = IBV_WC_REM_ABORT_ERR; + *status = IBV_WC_REM_ABORT_ERR; break; default: - wc->status = IBV_WC_GENERAL_ERR; + *status = IBV_WC_GENERAL_ERR; break; } - wc->vendor_err = cqe->vendor_err; + *vendor_err = cqe->vendor_err; } -static int mlx4_poll_one(struct mlx4_cq *cq, -struct mlx4_qp **cur_qp, -struct ibv_wc *wc) +static inline int mlx4_handle_cq(struct mlx4_cq *cq, +struct mlx4_qp **cur_qp, +uint64_t *wc_wr_id, +enum ibv_wc_status *wc_status, +uint32_t *wc_vendor_err, +struct mlx4_cqe **pcqe, +uint32_t *pqpn, +int *pis_send) { struct mlx4_wq *wq; struct mlx4_cqe *cqe; struct mlx4_srq *srq; uint32_t qpn; - uint32_t g_mlpath_rqpn; - uint16_t wqe_index; int is_error; int is_send; + uint16_t wqe_index; cqe = next_cqe_sw(cq); if (!cqe) @@ -201,7 +208,7 @@ static int mlx4_poll_one(struct mlx4_cq *cq, ++cq->cons_index; - VALGRIND_MAKE_MEM_DEFINED(cqe, sizeof *cqe); + VALGRIND_MAKE_MEM_DEFINED(cqe, sizeof(*cqe)); /* * Make sure we read CQ entry contents after we've checked the @@ -210,7 +217,6 @@ st
[PATCH libibverbs 5/7] Add support for ibv_query_values_ex
Adding mlx4_query_values as implementation for ibv_query_values_ex. mlx4_query_values follows the standard extension verb mechanism. This function supports reading the hwclock via mmaping the required space from kernel. Signed-off-by: Matan Barak --- src/mlx4.c | 36 src/mlx4.h | 3 +++ src/verbs.c | 45 + 3 files changed, 84 insertions(+) diff --git a/src/mlx4.c b/src/mlx4.c index cc1211f..6d66cf0 100644 --- a/src/mlx4.c +++ b/src/mlx4.c @@ -116,6 +116,28 @@ static struct ibv_context_ops mlx4_ctx_ops = { .detach_mcast = ibv_cmd_detach_mcast }; +static int mlx4_map_internal_clock(struct mlx4_device *dev, + struct ibv_context *ibv_ctx) +{ + struct mlx4_context *context = to_mctx(ibv_ctx); + void *hca_clock_page; + + hca_clock_page = mmap(NULL, dev->page_size, PROT_READ, MAP_SHARED, + ibv_ctx->cmd_fd, dev->page_size * 3); + + if (hca_clock_page == MAP_FAILED) { + fprintf(stderr, PFX + "Warning: Timestamp available,\n" + "but failed to mmap() hca core clock page, errno=%d.\n", + errno); + return -1; + } + + context->hca_core_clock = hca_clock_page + + context->core_clock_offset % dev->page_size; + return 0; +} + static int mlx4_init_context(struct verbs_device *v_device, struct ibv_context *ibv_ctx, int cmd_fd) { @@ -127,6 +149,10 @@ static int mlx4_init_context(struct verbs_device *v_device, __u16 bf_reg_size; struct mlx4_device *dev = to_mdev(&v_device->device); struct verbs_context *verbs_ctx = verbs_get_ctx(ibv_ctx); + struct ibv_query_device_ex_input input_query_device = {.comp_mask = 0}; + struct ibv_device_attr_ex dev_attrs; + uint32_tdev_attrs_comp_mask; + int err; /* memory footprint of mlx4_context and verbs_context share * struct ibv_context. 
@@ -194,6 +220,12 @@ static int mlx4_init_context(struct verbs_device *v_device, context->bf_buf_size = 0; } + context->hca_core_clock = NULL; + err = _mlx4_query_device_ex(ibv_ctx, &input_query_device, &dev_attrs, + sizeof(dev_attrs), &dev_attrs_comp_mask); + if (!err && dev_attrs_comp_mask & QUERY_DEVICE_RESP_MASK_TIMESTAMP) + mlx4_map_internal_clock(dev, ibv_ctx); + pthread_spin_init(&context->uar_lock, PTHREAD_PROCESS_PRIVATE); ibv_ctx->ops = mlx4_ctx_ops; @@ -210,6 +242,7 @@ static int mlx4_init_context(struct verbs_device *v_device, verbs_set_ctx_op(verbs_ctx, query_device_ex, mlx4_query_device_ex); verbs_set_ctx_op(verbs_ctx, create_cq_ex, mlx4_create_cq_ex); verbs_set_ctx_op(verbs_ctx, poll_cq_ex, mlx4_poll_cq_ex); + verbs_set_ctx_op(verbs_ctx, query_values, mlx4_query_values); return 0; @@ -223,6 +256,9 @@ static void mlx4_uninit_context(struct verbs_device *v_device, munmap(context->uar, to_mdev(&v_device->device)->page_size); if (context->bf_page) munmap(context->bf_page, to_mdev(&v_device->device)->page_size); + if (context->hca_core_clock) + munmap(context->hca_core_clock - context->core_clock_offset, + to_mdev(&v_device->device)->page_size); } diff --git a/src/mlx4.h b/src/mlx4.h index 2465298..8e1935d 100644 --- a/src/mlx4.h +++ b/src/mlx4.h @@ -199,6 +199,7 @@ struct mlx4_context { enum ibv_port_cap_flags caps; } port_query_cache[MLX4_PORTS_NUM]; uint64_tcore_clock_offset; + void *hca_core_clock; }; struct mlx4_buf { @@ -403,6 +404,8 @@ int _mlx4_query_device_ex(struct ibv_context *context, int mlx4_query_device_ex(struct ibv_context *context, const struct ibv_query_device_ex_input *input, struct ibv_device_attr_ex *attr, size_t attr_size); +int mlx4_query_values(struct ibv_context *context, + struct ibv_values_ex *values); int mlx4_query_port(struct ibv_context *context, uint8_t port, struct ibv_port_attr *attr); diff --git a/src/verbs.c b/src/verbs.c index a8d6bd7..843ca1e 100644 --- a/src/verbs.c +++ b/src/verbs.c @@ -114,6 +114,51 @@ int 
mlx4_query_device_ex(struct ibv_context *context, return _mlx4_query_device_ex(context, input, attr, attr_size, NULL); } +#define READL(ptr) (*((uint32_t *)(ptr))) +static int mlx4_read_clock(struct ibv_context *context, uint64_t *cycles) +{ + unsigned int clockhi, clocklo, clockhi1; + int i; + struct mlx4_context *ctx = to_mctx(context); + + if (!ctx->hca_core_clo
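mlx4_read_clock (truncated above) uses the standard technique for snapshotting a 64-bit free-running counter that is only readable as two 32-bit words: read the high word, then the low word, then the high word again, and retry if the high word changed in between, since the low word may have wrapped mid-read. A standalone sketch of the pattern over ordinary memory rather than the mmap'ed clock page:

```c
#include <stdint.h>

/* Snapshot a 64-bit counter stored as separate {hi, lo} 32-bit words
 * that a concurrent writer may update. Retry until the high word is
 * stable across the low-word read so both halves match one instant. */
static uint64_t read_split_counter(const volatile uint32_t *hi_word,
				   const volatile uint32_t *lo_word)
{
	uint32_t hi, lo;

	do {
		hi = *hi_word;
		lo = *lo_word;
	} while (hi != *hi_word);	/* hi moved: lo may be stale */

	return ((uint64_t)hi << 32) | lo;
}
```

Without the retry, a reader could pair a pre-wrap low word with a post-wrap high word and report a time roughly 2^32 cycles in the future.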
[PATCH libibverbs 4/7] Add timestamp support to extended poll_cq verb
Adding support to the extended version of poll_cq verb to read completion timestamp. Reading timestamp isn't supported with reading IBV_WC_EX_WITH_SL and IBV_WC_EX_WITH_SLID. Signed-off-by: Matan Barak --- src/cq.c| 10 ++ src/mlx4.h | 25 - src/verbs.c | 3 ++- 3 files changed, 32 insertions(+), 6 deletions(-) diff --git a/src/cq.c b/src/cq.c index c86e824..7f40f12 100644 --- a/src/cq.c +++ b/src/cq.c @@ -399,6 +399,16 @@ static inline int _mlx4_poll_one_ex(struct mlx4_cq *cq, if (err != CQ_CONTINUE) return err; + if (wc_flags & IBV_WC_EX_WITH_COMPLETION_TIMESTAMP) { + uint16_t timestamp_0_15 = cqe->timestamp_0_7 | + cqe->timestamp_8_15 << 8; + + wc_flags_out |= IBV_WC_EX_WITH_COMPLETION_TIMESTAMP; + *wc_buffer.b64++ = (((uint64_t)ntohl(cqe->timestamp_16_47) ++ !timestamp_0_15) << 16) | + (uint64_t)timestamp_0_15; + } + if (is_send) { switch (cqe->owner_sr_opcode & MLX4_CQE_OPCODE_MASK) { case MLX4_OPCODE_RDMA_WRITE_IMM: diff --git a/src/mlx4.h b/src/mlx4.h index e22f879..2465298 100644 --- a/src/mlx4.h +++ b/src/mlx4.h @@ -312,14 +312,29 @@ struct mlx4_cqe { uint32_tvlan_my_qpn; uint32_timmed_rss_invalid; uint32_tg_mlpath_rqpn; - uint8_t sl_vid; - uint8_t reserved1; - uint16_trlid; - uint32_tstatus; + union { + struct { + union { + struct { + uint8_t sl_vid; + uint8_t reserved1; + uint16_t rlid; + }; + uint32_t timestamp_16_47; + }; + uint32_t status; + }; + struct { + uint16_t reserved2; + uint8_t smac[6]; + }; + }; uint32_tbyte_cnt; uint16_twqe_index; uint16_tchecksum; - uint8_t reserved3[3]; + uint8_t reserved3; + uint8_t timestamp_8_15; + uint8_t timestamp_0_7; uint8_t owner_sr_opcode; }; diff --git a/src/verbs.c b/src/verbs.c index 0dcdc87..a8d6bd7 100644 --- a/src/verbs.c +++ b/src/verbs.c @@ -286,7 +286,8 @@ enum { }; enum { - CREATE_CQ_SUPPORTED_WC_FLAGS = IBV_WC_STANDARD_FLAGS + CREATE_CQ_SUPPORTED_WC_FLAGS = IBV_WC_STANDARD_FLAGS| + IBV_WC_EX_WITH_COMPLETION_TIMESTAMP }; static struct ibv_cq *create_cq(struct ibv_context *context, -- 2.1.0 -- To 
unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH libibverbs 1/7] Add support for extended version of ibv_query_device
The new mlx4_query_device_ex implementation uses the extended version of libibverbs/uverbs query_device command. In addition, it reads the hca_core_clock offset in the bar from the vendor specific part of ibv_query_device_ex command. Signed-off-by: Matan Barak --- src/mlx4-abi.h | 13 + src/mlx4.c | 1 + src/mlx4.h | 8 src/verbs.c| 54 +++--- 4 files changed, 73 insertions(+), 3 deletions(-) diff --git a/src/mlx4-abi.h b/src/mlx4-abi.h index b48f6fc..b348ce3 100644 --- a/src/mlx4-abi.h +++ b/src/mlx4-abi.h @@ -111,4 +111,17 @@ struct mlx4_create_qp { __u8reserved[5]; }; +enum query_device_resp_mask { + QUERY_DEVICE_RESP_MASK_TIMESTAMP = 1UL << 0, +}; + +struct query_device_ex_resp { + struct ibv_query_device_resp_ex core; + struct { + uint32_t comp_mask; + uint32_t response_length; + uint64_t hca_core_clock_offset; + }; +}; + #endif /* MLX4_ABI_H */ diff --git a/src/mlx4.c b/src/mlx4.c index c30f4bf..d41dff0 100644 --- a/src/mlx4.c +++ b/src/mlx4.c @@ -207,6 +207,7 @@ static int mlx4_init_context(struct verbs_device *v_device, verbs_set_ctx_op(verbs_ctx, open_qp, mlx4_open_qp); verbs_set_ctx_op(verbs_ctx, ibv_create_flow, ibv_cmd_create_flow); verbs_set_ctx_op(verbs_ctx, ibv_destroy_flow, ibv_cmd_destroy_flow); + verbs_set_ctx_op(verbs_ctx, query_device_ex, mlx4_query_device_ex); return 0; diff --git a/src/mlx4.h b/src/mlx4.h index 519d8f4..0f643bc 100644 --- a/src/mlx4.h +++ b/src/mlx4.h @@ -198,6 +198,7 @@ struct mlx4_context { uint8_t link_layer; enum ibv_port_cap_flags caps; } port_query_cache[MLX4_PORTS_NUM]; + uint64_tcore_clock_offset; }; struct mlx4_buf { @@ -378,6 +379,13 @@ void mlx4_free_db(struct mlx4_context *context, enum mlx4_db_type type, uint32_t int mlx4_query_device(struct ibv_context *context, struct ibv_device_attr *attr); +int _mlx4_query_device_ex(struct ibv_context *context, + const struct ibv_query_device_ex_input *input, + struct ibv_device_attr_ex *attr, size_t attr_size, + uint32_t *comp_mask); +int mlx4_query_device_ex(struct ibv_context 
*context, +const struct ibv_query_device_ex_input *input, +struct ibv_device_attr_ex *attr, size_t attr_size); int mlx4_query_port(struct ibv_context *context, uint8_t port, struct ibv_port_attr *attr); diff --git a/src/verbs.c b/src/verbs.c index 2cb1f8a..e93114b 100644 --- a/src/verbs.c +++ b/src/verbs.c @@ -45,6 +45,14 @@ #include "mlx4-abi.h" #include "wqe.h" +static void parse_raw_fw_ver(uint64_t raw_fw_ver, unsigned *major, +unsigned *minor, unsigned *sub_minor) +{ + *major = (raw_fw_ver >> 32) & 0x; + *minor = (raw_fw_ver >> 16) & 0x; + *sub_minor = raw_fw_ver & 0x; +} + int mlx4_query_device(struct ibv_context *context, struct ibv_device_attr *attr) { struct ibv_query_device cmd; @@ -56,9 +64,7 @@ int mlx4_query_device(struct ibv_context *context, struct ibv_device_attr *attr) if (ret) return ret; - major = (raw_fw_ver >> 32) & 0x; - minor = (raw_fw_ver >> 16) & 0x; - sub_minor = raw_fw_ver & 0x; + parse_raw_fw_ver(raw_fw_ver, &major, &minor, &sub_minor); snprintf(attr->fw_ver, sizeof attr->fw_ver, "%d.%d.%03d", major, minor, sub_minor); @@ -66,6 +72,48 @@ int mlx4_query_device(struct ibv_context *context, struct ibv_device_attr *attr) return 0; } +int _mlx4_query_device_ex(struct ibv_context *context, + const struct ibv_query_device_ex_input *input, + struct ibv_device_attr_ex *attr, size_t attr_size, + uint32_t *comp_mask) +{ + struct ibv_query_device_ex cmd; + struct query_device_ex_resp resp; + uint64_t raw_fw_ver; + unsigned major, minor, sub_minor; + int ret; + + memset(&resp, 0, sizeof(resp)); + + ret = ibv_cmd_query_device_ex(context, input, attr, attr_size, + &raw_fw_ver, &cmd, sizeof(cmd), + sizeof(cmd), &resp.core, + sizeof(resp.core), sizeof(resp)); + if (ret) + return ret; + + parse_raw_fw_ver(raw_fw_ver, &major, &minor, &sub_minor); + + snprintf(attr->orig_attr.fw_ver, sizeof(attr->orig_attr.fw_ver), +"%d.%d.%03d", major, minor, sub_minor); + + if (resp.comp_mask & QUERY_DEVICE_RESP_MASK_TIMESTAMP) + to_mctx(context)->core_clock_offset = + re
[PATCH libibverbs 6/7] Add support for different poll_one_ex functions
In order to optimize the poll_one extended verb for different wc_flags, add support for poll_one_ex callback function. Signed-off-by: Matan Barak --- src/cq.c| 5 +++-- src/mlx4.h | 5 + src/verbs.c | 1 + 3 files changed, 9 insertions(+), 2 deletions(-) diff --git a/src/cq.c b/src/cq.c index 7f40f12..1f2d572 100644 --- a/src/cq.c +++ b/src/cq.c @@ -601,7 +601,8 @@ int mlx4_poll_cq_ex(struct ibv_cq *ibcq, int npolled; int err = CQ_OK; unsigned int ne = attr->max_entries; - uint64_t wc_flags = cq->wc_flags; + int (*poll_fn)(struct mlx4_cq *cq, struct mlx4_qp **cur_qp, + struct ibv_wc_ex **wc_ex) = cq->mlx4_poll_one; if (attr->comp_mask) return -EINVAL; @@ -609,7 +610,7 @@ int mlx4_poll_cq_ex(struct ibv_cq *ibcq, pthread_spin_lock(&cq->lock); for (npolled = 0; npolled < ne; ++npolled) { - err = _mlx4_poll_one_ex(cq, &qp, &wc, wc_flags); + err = poll_fn(cq, &qp, &wc); if (err != CQ_OK) break; } diff --git a/src/mlx4.h b/src/mlx4.h index 8e1935d..46a18d6 100644 --- a/src/mlx4.h +++ b/src/mlx4.h @@ -215,6 +215,8 @@ struct mlx4_pd { struct mlx4_cq { struct ibv_cq ibv_cq; uint64_twc_flags; + int (*mlx4_poll_one)(struct mlx4_cq *cq, struct mlx4_qp **cur_qp, +struct ibv_wc_ex **wc_ex); struct mlx4_buf buf; struct mlx4_buf resize_buf; pthread_spinlock_t lock; @@ -432,6 +434,9 @@ int mlx4_poll_cq(struct ibv_cq *cq, int ne, struct ibv_wc *wc); int mlx4_poll_cq_ex(struct ibv_cq *ibcq, struct ibv_wc_ex *wc, struct ibv_poll_cq_ex_attr *attr); +int mlx4_poll_one_ex(struct mlx4_cq *cq, +struct mlx4_qp **cur_qp, +struct ibv_wc_ex **pwc_ex); int mlx4_arm_cq(struct ibv_cq *cq, int solicited); void mlx4_cq_event(struct ibv_cq *cq); void __mlx4_cq_clean(struct mlx4_cq *cq, uint32_t qpn, struct mlx4_srq *srq); diff --git a/src/verbs.c b/src/verbs.c index 843ca1e..62908c1 100644 --- a/src/verbs.c +++ b/src/verbs.c @@ -432,6 +432,7 @@ static struct ibv_cq *create_cq(struct ibv_context *context, if (ret) goto err_db; + cq->mlx4_poll_one = mlx4_poll_one_ex; cq->creation_flags = 
cmd_e.ibv_cmd.flags; cq->wc_flags = cq_attr->wc_flags; cq->cqn = resp.cqn; -- 2.1.0 -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH libibverbs 7/7] Optimize ibv_poll_cq_ex for common scenarios
The current ibv_poll_cq_ex mechanism needs to query every field for its existence. In order to avoid this penalty at runtime, add optimized functions for special cases. Signed-off-by: Matan Barak --- configure.ac | 17 src/cq.c | 268 ++- src/mlx4.h | 20 - src/verbs.c | 10 +-- 4 files changed, 271 insertions(+), 44 deletions(-) diff --git a/configure.ac b/configure.ac index 6e98f20..9dbbb4b 100644 --- a/configure.ac +++ b/configure.ac @@ -45,6 +45,23 @@ AC_CHECK_MEMBER([struct verbs_context.ibv_create_flow], [], [AC_MSG_ERROR([libmlx4 requires libibverbs >= 1.2.0])], [[#include ]]) +AC_MSG_CHECKING("always inline") +CFLAGS_BAK="$CFLAGS" +CFLAGS="$CFLAGS -Werror" +AC_COMPILE_IFELSE([AC_LANG_PROGRAM([[ + static inline int f(void) + __attribute((always_inline)); + static inline int f(void) + { + return 1; + } +]],[[ + int a = f(); + a = a; +]])], [AC_MSG_RESULT([yes]) AC_DEFINE([HAVE_ALWAYS_INLINE], [1], [Define if __attribute((always_inline)).])], +[AC_MSG_RESULT([no])]) +CFLAGS="$CFLAGS_BAK" + dnl Checks for typedefs, structures, and compiler characteristics. 
AC_C_CONST AC_CHECK_SIZEOF(long) diff --git a/src/cq.c b/src/cq.c index 1f2d572..56c0fa4 100644 --- a/src/cq.c +++ b/src/cq.c @@ -377,10 +377,22 @@ union wc_buffer { uint64_t*b64; }; +#define IS_IN_WC_FLAGS(yes, no, maybe, flag) (((yes) & (flag)) ||\ + (!((no) & (flag)) && \ + ((maybe) & (flag static inline int _mlx4_poll_one_ex(struct mlx4_cq *cq, struct mlx4_qp **cur_qp, struct ibv_wc_ex **pwc_ex, - uint64_t wc_flags) + uint64_t wc_flags, + uint64_t yes_wc_flags, + uint64_t no_wc_flags) + ALWAYS_INLINE; +static inline int _mlx4_poll_one_ex(struct mlx4_cq *cq, + struct mlx4_qp **cur_qp, + struct ibv_wc_ex **pwc_ex, + uint64_t wc_flags, + uint64_t wc_flags_yes, + uint64_t wc_flags_no) { struct mlx4_cqe *cqe; uint32_t qpn; @@ -392,14 +404,14 @@ static inline int _mlx4_poll_one_ex(struct mlx4_cq *cq, uint64_t wc_flags_out = 0; wc_buffer.b64 = (uint64_t *)&wc_ex->buffer; - wc_ex->wc_flags = 0; wc_ex->reserved = 0; err = mlx4_handle_cq(cq, cur_qp, &wc_ex->wr_id, &wc_ex->status, &wc_ex->vendor_err, &cqe, &qpn, &is_send); if (err != CQ_CONTINUE) return err; - if (wc_flags & IBV_WC_EX_WITH_COMPLETION_TIMESTAMP) { + if (IS_IN_WC_FLAGS(wc_flags_yes, wc_flags_no, wc_flags, + IBV_WC_EX_WITH_COMPLETION_TIMESTAMP)) { uint16_t timestamp_0_15 = cqe->timestamp_0_7 | cqe->timestamp_8_15 << 8; @@ -415,80 +427,101 @@ static inline int _mlx4_poll_one_ex(struct mlx4_cq *cq, wc_flags_out |= IBV_WC_EX_IMM; case MLX4_OPCODE_RDMA_WRITE: wc_ex->opcode= IBV_WC_RDMA_WRITE; - if (wc_flags & IBV_WC_EX_WITH_BYTE_LEN) + if (IS_IN_WC_FLAGS(wc_flags_yes, wc_flags_no, wc_flags, + IBV_WC_EX_WITH_BYTE_LEN)) wc_buffer.b32++; - if (wc_flags & IBV_WC_EX_WITH_IMM) + if (IS_IN_WC_FLAGS(wc_flags_yes, wc_flags_no, wc_flags, + IBV_WC_EX_WITH_IMM)) wc_buffer.b32++; break; case MLX4_OPCODE_SEND_IMM: wc_flags_out |= IBV_WC_EX_IMM; case MLX4_OPCODE_SEND: wc_ex->opcode= IBV_WC_SEND; - if (wc_flags & IBV_WC_EX_WITH_BYTE_LEN) + if (IS_IN_WC_FLAGS(wc_flags_yes, wc_flags_no, wc_flags, + IBV_WC_EX_WITH_BYTE_LEN)) 
wc_buffer.b32++; - if (wc_flags & IBV_WC_EX_WITH_IMM) + if (IS_IN_WC_FLAGS(wc_flags_yes, wc_flags_no, wc_flags, + IBV_WC_EX_WITH_IMM)) wc_buffer.b32++; break; case MLX4_OPCODE_RDMA_READ: wc_ex->opcode= IBV_WC_RDMA_READ; - if (wc_flags & IBV_WC_EX_WITH_BYTE_LEN) { + if (IS_IN_WC_FLAGS(wc_flags_yes, wc_fla
[PATCH libibverbs 0/7] Completion timestamping
Hi Yishai, This series adds support for completion timestamps. To support this feature, several extended verbs were implemented (as instructed in libibverbs). ibv_query_device_ex was extended to support reading the hca_core_clock and timestamp mask. The same verb was extended with vendor-specific data which is used to map the HCA's free-running clock register. When libmlx4 initializes, it tries to mmap this free-running clock register. This mapping is used to implement ibv_query_values_ex efficiently. To support CQ completion timestamp reporting, we implement the ibv_create_cq_ex verb. This verb is used both for creating a CQ which supports timestamps and for stating which fields should be returned via the WC. Returning this data is done by implementing ibv_poll_cq_ex. We query the CQ's requested wc_flags for every field the user has requested and populate it according to the carried network operation and WC status. Last but not least, ibv_poll_cq_ex was optimized to eliminate the if statements and or-operations for common combinations of WC fields. This is done by inlining and using a custom poll_one_ex function for these fields. Thanks, Matan Matan Barak (7): Add support for extended version of ibv_query_device Add support for ibv_create_cq_ex Implement ibv_poll_cq_ex extension verb Add timestamp support to extended poll_cq verb Add support for ibv_query_values_ex Add support for different poll_one_ex functions Optimize ibv_poll_cq_ex for common scenarios configure.ac | 17 ++ src/cq.c | 512 + src/mlx4-abi.h | 25 +++ src/mlx4.c | 39 + src/mlx4.h | 64 +++- src/verbs.c| 219 +--- 6 files changed, 823 insertions(+), 53 deletions(-) -- 2.1.0 -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH libibverbs 2/7] Add support for ibv_create_cq_ex
Add an extension verb mlx4_create_cq_ex that follows the standard extension verb mechanism. This function is similar to mlx4_create_cq but supports the extension verbs functions and stores the creation flags for later use (for example, timestamp flag is used in poll_cq). The function fails if the user passes unsupported WC attributes. Signed-off-by: Matan Barak --- src/mlx4-abi.h | 12 ++ src/mlx4.c | 1 + src/mlx4.h | 3 ++ src/verbs.c| 117 + 4 files changed, 117 insertions(+), 16 deletions(-) diff --git a/src/mlx4-abi.h b/src/mlx4-abi.h index b348ce3..9b765e4 100644 --- a/src/mlx4-abi.h +++ b/src/mlx4-abi.h @@ -72,12 +72,24 @@ struct mlx4_create_cq { __u64 db_addr; }; +struct mlx4_create_cq_ex { + struct ibv_create_cq_ex ibv_cmd; + __u64 buf_addr; + __u64 db_addr; +}; + struct mlx4_create_cq_resp { struct ibv_create_cq_resp ibv_resp; __u32 cqn; __u32 reserved; }; +struct mlx4_create_cq_resp_ex { + struct ibv_create_cq_resp_exibv_resp; + __u32 cqn; + __u32 reserved; +}; + struct mlx4_resize_cq { struct ibv_resize_cqibv_cmd; __u64 buf_addr; diff --git a/src/mlx4.c b/src/mlx4.c index d41dff0..9cfd013 100644 --- a/src/mlx4.c +++ b/src/mlx4.c @@ -208,6 +208,7 @@ static int mlx4_init_context(struct verbs_device *v_device, verbs_set_ctx_op(verbs_ctx, ibv_create_flow, ibv_cmd_create_flow); verbs_set_ctx_op(verbs_ctx, ibv_destroy_flow, ibv_cmd_destroy_flow); verbs_set_ctx_op(verbs_ctx, query_device_ex, mlx4_query_device_ex); + verbs_set_ctx_op(verbs_ctx, create_cq_ex, mlx4_create_cq_ex); return 0; diff --git a/src/mlx4.h b/src/mlx4.h index 0f643bc..91eb79c 100644 --- a/src/mlx4.h +++ b/src/mlx4.h @@ -222,6 +222,7 @@ struct mlx4_cq { uint32_t *arm_db; int arm_sn; int cqe_size; + int creation_flags; }; struct mlx4_srq { @@ -402,6 +403,8 @@ int mlx4_dereg_mr(struct ibv_mr *mr); struct ibv_cq *mlx4_create_cq(struct ibv_context *context, int cqe, struct ibv_comp_channel *channel, int comp_vector); +struct ibv_cq *mlx4_create_cq_ex(struct ibv_context *context, +struct 
ibv_create_cq_attr_ex *cq_attr); int mlx4_alloc_cq_buf(struct mlx4_device *dev, struct mlx4_buf *buf, int nent, int entry_size); int mlx4_resize_cq(struct ibv_cq *cq, int cqe); diff --git a/src/verbs.c b/src/verbs.c index e93114b..3290b86 100644 --- a/src/verbs.c +++ b/src/verbs.c @@ -272,19 +272,69 @@ int align_queue_size(int req) return nent; } -struct ibv_cq *mlx4_create_cq(struct ibv_context *context, int cqe, - struct ibv_comp_channel *channel, - int comp_vector) +enum cmd_type { + MLX4_CMD_TYPE_BASIC, + MLX4_CMD_TYPE_EXTENDED +}; + +enum { + CREATE_CQ_SUPPORTED_COMP_MASK = IBV_CREATE_CQ_ATTR_FLAGS +}; + +enum { + CREATE_CQ_SUPPORTED_FLAGS = IBV_CREATE_CQ_ATTR_COMPLETION_TIMESTAMP +}; + +enum { + CREATE_CQ_SUPPORTED_WC_FLAGS = IBV_WC_STANDARD_FLAGS +}; + +static struct ibv_cq *create_cq(struct ibv_context *context, + struct ibv_create_cq_attr_ex *cq_attr, + enum cmd_type cmd_type) { - struct mlx4_create_cq cmd; - struct mlx4_create_cq_resp resp; - struct mlx4_cq*cq; - intret; - struct mlx4_context *mctx = to_mctx(context); + struct mlx4_create_cq cmd; + struct mlx4_create_cq_excmd_e; + struct mlx4_create_cq_resp resp; + struct mlx4_create_cq_resp_ex resp_e; + struct mlx4_cq *cq; + int ret; + struct mlx4_context *mctx = to_mctx(context); + struct ibv_create_cq_attr_excq_attr_e; + int cqe; /* Sanity check CQ size before proceeding */ - if (cqe > 0x3f) + if (cq_attr->cqe > 0x3f) + return NULL; + + if (cq_attr->comp_mask & ~CREATE_CQ_SUPPORTED_COMP_MASK) { + errno = EINVAL; return NULL; + } + + if (cq_attr->comp_mask & IBV_CREATE_CQ_ATTR_FLAGS && + cq_attr->flags & ~CREATE_CQ_SUPPORTED_FLAGS) { + errno = EINVAL; + return NULL; + } + + if (cq_attr->wc_flags & ~CREATE_CQ_SUPPORTED_WC_FLAGS) { + errno = ENOTSUP; + return
Re: [PATCH libibverbs 6/7] Add support for different poll_one_ex functions
On Tue, Oct 27, 2015 at 6:52 PM, Matan Barak wrote: > In order to opitimize the poll_one extended verb for different > wc_flags, add support for poll_one_ex callback function. > > Signed-off-by: Matan Barak > --- > src/cq.c| 5 +++-- > src/mlx4.h | 5 + > src/verbs.c | 1 + > 3 files changed, 9 insertions(+), 2 deletions(-) > > diff --git a/src/cq.c b/src/cq.c > index 7f40f12..1f2d572 100644 > --- a/src/cq.c > +++ b/src/cq.c > @@ -601,7 +601,8 @@ int mlx4_poll_cq_ex(struct ibv_cq *ibcq, > int npolled; > int err = CQ_OK; > unsigned int ne = attr->max_entries; > - uint64_t wc_flags = cq->wc_flags; > + int (*poll_fn)(struct mlx4_cq *cq, struct mlx4_qp **cur_qp, > + struct ibv_wc_ex **wc_ex) = cq->mlx4_poll_one; > > if (attr->comp_mask) > return -EINVAL; > @@ -609,7 +610,7 @@ int mlx4_poll_cq_ex(struct ibv_cq *ibcq, > pthread_spin_lock(&cq->lock); > > for (npolled = 0; npolled < ne; ++npolled) { > - err = _mlx4_poll_one_ex(cq, &qp, &wc, wc_flags); > + err = poll_fn(cq, &qp, &wc); > if (err != CQ_OK) > break; > } > diff --git a/src/mlx4.h b/src/mlx4.h > index 8e1935d..46a18d6 100644 > --- a/src/mlx4.h > +++ b/src/mlx4.h > @@ -215,6 +215,8 @@ struct mlx4_pd { > struct mlx4_cq { > struct ibv_cq ibv_cq; > uint64_twc_flags; > + int (*mlx4_poll_one)(struct mlx4_cq *cq, struct mlx4_qp **cur_qp, > +struct ibv_wc_ex **wc_ex); > struct mlx4_buf buf; > struct mlx4_buf resize_buf; > pthread_spinlock_t lock; > @@ -432,6 +434,9 @@ int mlx4_poll_cq(struct ibv_cq *cq, int ne, struct ibv_wc > *wc); > int mlx4_poll_cq_ex(struct ibv_cq *ibcq, > struct ibv_wc_ex *wc, > struct ibv_poll_cq_ex_attr *attr); > +int mlx4_poll_one_ex(struct mlx4_cq *cq, > +struct mlx4_qp **cur_qp, > +struct ibv_wc_ex **pwc_ex); > int mlx4_arm_cq(struct ibv_cq *cq, int solicited); > void mlx4_cq_event(struct ibv_cq *cq); > void __mlx4_cq_clean(struct mlx4_cq *cq, uint32_t qpn, struct mlx4_srq *srq); > diff --git a/src/verbs.c b/src/verbs.c > index 843ca1e..62908c1 100644 > --- a/src/verbs.c > +++ b/src/verbs.c > @@ 
-432,6 +432,7 @@ static struct ibv_cq *create_cq(struct ibv_context > *context, > if (ret) > goto err_db; > > + cq->mlx4_poll_one = mlx4_poll_one_ex; > cq->creation_flags = cmd_e.ibv_cmd.flags; > cq->wc_flags = cq_attr->wc_flags; > cq->cqn = resp.cqn; > -- > 2.1.0 > > -- > To unsubscribe from this list: send the line "unsubscribe linux-rdma" in > the body of a message to majord...@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html This should have libmlx4 prefix. -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH libibverbs 1/7] Add support for extended version of ibv_query_device
On Tue, Oct 27, 2015 at 6:52 PM, Matan Barak wrote: > The new mlx4_query_device_ex implementation uses the > extended version of libibverbs/uverbs query_device command. > In addition, it reads the hca_core_clock offset in the bar > from the vendor specific part of ibv_query_device_ex command. > > Signed-off-by: Matan Barak > --- > src/mlx4-abi.h | 13 + > src/mlx4.c | 1 + > src/mlx4.h | 8 > src/verbs.c| 54 +++--- > 4 files changed, 73 insertions(+), 3 deletions(-) > > diff --git a/src/mlx4-abi.h b/src/mlx4-abi.h > index b48f6fc..b348ce3 100644 > --- a/src/mlx4-abi.h > +++ b/src/mlx4-abi.h > @@ -111,4 +111,17 @@ struct mlx4_create_qp { > __u8reserved[5]; > }; > > +enum query_device_resp_mask { > + QUERY_DEVICE_RESP_MASK_TIMESTAMP = 1UL << 0, > +}; > + > +struct query_device_ex_resp { > + struct ibv_query_device_resp_ex core; > + struct { > + uint32_t comp_mask; > + uint32_t response_length; > + uint64_t hca_core_clock_offset; > + }; > +}; > + > #endif /* MLX4_ABI_H */ > diff --git a/src/mlx4.c b/src/mlx4.c > index c30f4bf..d41dff0 100644 > --- a/src/mlx4.c > +++ b/src/mlx4.c > @@ -207,6 +207,7 @@ static int mlx4_init_context(struct verbs_device > *v_device, > verbs_set_ctx_op(verbs_ctx, open_qp, mlx4_open_qp); > verbs_set_ctx_op(verbs_ctx, ibv_create_flow, ibv_cmd_create_flow); > verbs_set_ctx_op(verbs_ctx, ibv_destroy_flow, ibv_cmd_destroy_flow); > + verbs_set_ctx_op(verbs_ctx, query_device_ex, mlx4_query_device_ex); > > return 0; > > diff --git a/src/mlx4.h b/src/mlx4.h > index 519d8f4..0f643bc 100644 > --- a/src/mlx4.h > +++ b/src/mlx4.h > @@ -198,6 +198,7 @@ struct mlx4_context { > uint8_t link_layer; > enum ibv_port_cap_flags caps; > } port_query_cache[MLX4_PORTS_NUM]; > + uint64_tcore_clock_offset; > }; > > struct mlx4_buf { > @@ -378,6 +379,13 @@ void mlx4_free_db(struct mlx4_context *context, enum > mlx4_db_type type, uint32_t > > int mlx4_query_device(struct ibv_context *context, >struct ibv_device_attr *attr); > +int _mlx4_query_device_ex(struct ibv_context 
*context, > + const struct ibv_query_device_ex_input *input, > + struct ibv_device_attr_ex *attr, size_t attr_size, > + uint32_t *comp_mask); > +int mlx4_query_device_ex(struct ibv_context *context, > +const struct ibv_query_device_ex_input *input, > +struct ibv_device_attr_ex *attr, size_t attr_size); > int mlx4_query_port(struct ibv_context *context, uint8_t port, > struct ibv_port_attr *attr); > > diff --git a/src/verbs.c b/src/verbs.c > index 2cb1f8a..e93114b 100644 > --- a/src/verbs.c > +++ b/src/verbs.c > @@ -45,6 +45,14 @@ > #include "mlx4-abi.h" > #include "wqe.h" > > +static void parse_raw_fw_ver(uint64_t raw_fw_ver, unsigned *major, > +unsigned *minor, unsigned *sub_minor) > +{ > + *major = (raw_fw_ver >> 32) & 0x; > + *minor = (raw_fw_ver >> 16) & 0x; > + *sub_minor = raw_fw_ver & 0x; > +} > + > int mlx4_query_device(struct ibv_context *context, struct ibv_device_attr > *attr) > { > struct ibv_query_device cmd; > @@ -56,9 +64,7 @@ int mlx4_query_device(struct ibv_context *context, struct > ibv_device_attr *attr) > if (ret) > return ret; > > - major = (raw_fw_ver >> 32) & 0x; > - minor = (raw_fw_ver >> 16) & 0x; > - sub_minor = raw_fw_ver & 0x; > + parse_raw_fw_ver(raw_fw_ver, &major, &minor, &sub_minor); > > snprintf(attr->fw_ver, sizeof attr->fw_ver, > "%d.%d.%03d", major, minor, sub_minor); > @@ -66,6 +72,48 @@ int mlx4_query_device(struct ibv_context *context, struct > ibv_device_attr *attr) > return 0; > } > > +int _mlx4_query_device_ex(struct ibv_context *context, > + const struct ibv_query_device_ex_input *input, > + struct ibv_device_attr_ex *attr, size_t attr_size, > + uint32_t *comp_mask) > +{ > + struct ibv_query_device_ex cmd; > + struct query_device_ex_resp resp; > + uint64_t raw_fw_ver; > + unsigned major, minor, sub_minor; > + int ret; > + > + memset(&resp, 0, sizeof(resp)); > + > + ret = ibv_cmd_query_device_ex(context, input, attr, attr_size, > + &raw_fw_ver, &cmd, sizeof(cmd), > + sizeof(cmd), &resp.core, > + sizeof(resp.core), 
sizeof(resp)); > + if (ret) > + return ret; > + > + parse_raw_fw_ver(raw_fw_ver, &major,
Re: [PATCH libibverbs 4/7] Add timestamp support to extended poll_cq verb
On Tue, Oct 27, 2015 at 6:52 PM, Matan Barak wrote: > Adding support to the extended version of poll_cq verb to read > completion timestamp. Reading timestamp isn't supported with reading > IBV_WC_EX_WITH_SL and IBV_WC_EX_WITH_SLID. > > Signed-off-by: Matan Barak > --- > src/cq.c| 10 ++ > src/mlx4.h | 25 - > src/verbs.c | 3 ++- > 3 files changed, 32 insertions(+), 6 deletions(-) > > diff --git a/src/cq.c b/src/cq.c > index c86e824..7f40f12 100644 > --- a/src/cq.c > +++ b/src/cq.c > @@ -399,6 +399,16 @@ static inline int _mlx4_poll_one_ex(struct mlx4_cq *cq, > if (err != CQ_CONTINUE) > return err; > > + if (wc_flags & IBV_WC_EX_WITH_COMPLETION_TIMESTAMP) { > + uint16_t timestamp_0_15 = cqe->timestamp_0_7 | > + cqe->timestamp_8_15 << 8; > + > + wc_flags_out |= IBV_WC_EX_WITH_COMPLETION_TIMESTAMP; > + *wc_buffer.b64++ = (((uint64_t)ntohl(cqe->timestamp_16_47) > ++ !timestamp_0_15) << 16) | > + (uint64_t)timestamp_0_15; > + } > + > if (is_send) { > switch (cqe->owner_sr_opcode & MLX4_CQE_OPCODE_MASK) { > case MLX4_OPCODE_RDMA_WRITE_IMM: > diff --git a/src/mlx4.h b/src/mlx4.h > index e22f879..2465298 100644 > --- a/src/mlx4.h > +++ b/src/mlx4.h > @@ -312,14 +312,29 @@ struct mlx4_cqe { > uint32_tvlan_my_qpn; > uint32_timmed_rss_invalid; > uint32_tg_mlpath_rqpn; > - uint8_t sl_vid; > - uint8_t reserved1; > - uint16_trlid; > - uint32_tstatus; > + union { > + struct { > + union { > + struct { > + uint8_t sl_vid; > + uint8_t reserved1; > + uint16_t rlid; > + }; > + uint32_t timestamp_16_47; > + }; > + uint32_t status; > + }; > + struct { > + uint16_t reserved2; > + uint8_t smac[6]; > + }; > + }; > uint32_tbyte_cnt; > uint16_twqe_index; > uint16_tchecksum; > - uint8_t reserved3[3]; > + uint8_t reserved3; > + uint8_t timestamp_8_15; > + uint8_t timestamp_0_7; > uint8_t owner_sr_opcode; > }; > > diff --git a/src/verbs.c b/src/verbs.c > index 0dcdc87..a8d6bd7 100644 > --- a/src/verbs.c > +++ b/src/verbs.c > @@ -286,7 +286,8 @@ enum { > }; > > enum { > - 
CREATE_CQ_SUPPORTED_WC_FLAGS = IBV_WC_STANDARD_FLAGS > + CREATE_CQ_SUPPORTED_WC_FLAGS = IBV_WC_STANDARD_FLAGS| > + IBV_WC_EX_WITH_COMPLETION_TIMESTAMP > }; > > static struct ibv_cq *create_cq(struct ibv_context *context, > -- > 2.1.0 > > -- > To unsubscribe from this list: send the line "unsubscribe linux-rdma" in > the body of a message to majord...@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html This should have libmlx4 prefix. -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH libibverbs 2/7] Add support for ibv_create_cq_ex
On Tue, Oct 27, 2015 at 6:52 PM, Matan Barak wrote: > Add an extension verb mlx4_create_cq_ex that follows the > standard extension verb mechanism. > This function is similar to mlx4_create_cq but supports the > extension verbs functions and stores the creation flags > for later use (for example, timestamp flag is used in poll_cq). > The function fails if the user passes unsupported WC attributes. > > Signed-off-by: Matan Barak > --- > src/mlx4-abi.h | 12 ++ > src/mlx4.c | 1 + > src/mlx4.h | 3 ++ > src/verbs.c| 117 > + > 4 files changed, 117 insertions(+), 16 deletions(-) > > diff --git a/src/mlx4-abi.h b/src/mlx4-abi.h > index b348ce3..9b765e4 100644 > --- a/src/mlx4-abi.h > +++ b/src/mlx4-abi.h > @@ -72,12 +72,24 @@ struct mlx4_create_cq { > __u64 db_addr; > }; > > +struct mlx4_create_cq_ex { > + struct ibv_create_cq_ex ibv_cmd; > + __u64 buf_addr; > + __u64 db_addr; > +}; > + > struct mlx4_create_cq_resp { > struct ibv_create_cq_resp ibv_resp; > __u32 cqn; > __u32 reserved; > }; > > +struct mlx4_create_cq_resp_ex { > + struct ibv_create_cq_resp_exibv_resp; > + __u32 cqn; > + __u32 reserved; > +}; > + > struct mlx4_resize_cq { > struct ibv_resize_cqibv_cmd; > __u64 buf_addr; > diff --git a/src/mlx4.c b/src/mlx4.c > index d41dff0..9cfd013 100644 > --- a/src/mlx4.c > +++ b/src/mlx4.c > @@ -208,6 +208,7 @@ static int mlx4_init_context(struct verbs_device > *v_device, > verbs_set_ctx_op(verbs_ctx, ibv_create_flow, ibv_cmd_create_flow); > verbs_set_ctx_op(verbs_ctx, ibv_destroy_flow, ibv_cmd_destroy_flow); > verbs_set_ctx_op(verbs_ctx, query_device_ex, mlx4_query_device_ex); > + verbs_set_ctx_op(verbs_ctx, create_cq_ex, mlx4_create_cq_ex); > > return 0; > > diff --git a/src/mlx4.h b/src/mlx4.h > index 0f643bc..91eb79c 100644 > --- a/src/mlx4.h > +++ b/src/mlx4.h > @@ -222,6 +222,7 @@ struct mlx4_cq { > uint32_t *arm_db; > int arm_sn; > int cqe_size; > + int creation_flags; > }; > > struct mlx4_srq { > @@ -402,6 +403,8 @@ int mlx4_dereg_mr(struct ibv_mr *mr); > struct 
ibv_cq *mlx4_create_cq(struct ibv_context *context, int cqe, >struct ibv_comp_channel *channel, >int comp_vector); > +struct ibv_cq *mlx4_create_cq_ex(struct ibv_context *context, > +struct ibv_create_cq_attr_ex *cq_attr); > int mlx4_alloc_cq_buf(struct mlx4_device *dev, struct mlx4_buf *buf, int > nent, > int entry_size); > int mlx4_resize_cq(struct ibv_cq *cq, int cqe); > diff --git a/src/verbs.c b/src/verbs.c > index e93114b..3290b86 100644 > --- a/src/verbs.c > +++ b/src/verbs.c > @@ -272,19 +272,69 @@ int align_queue_size(int req) > return nent; > } > > -struct ibv_cq *mlx4_create_cq(struct ibv_context *context, int cqe, > - struct ibv_comp_channel *channel, > - int comp_vector) > +enum cmd_type { > + MLX4_CMD_TYPE_BASIC, > + MLX4_CMD_TYPE_EXTENDED > +}; > + > +enum { > + CREATE_CQ_SUPPORTED_COMP_MASK = IBV_CREATE_CQ_ATTR_FLAGS > +}; > + > +enum { > + CREATE_CQ_SUPPORTED_FLAGS = IBV_CREATE_CQ_ATTR_COMPLETION_TIMESTAMP > +}; > + > +enum { > + CREATE_CQ_SUPPORTED_WC_FLAGS = IBV_WC_STANDARD_FLAGS > +}; > + > +static struct ibv_cq *create_cq(struct ibv_context *context, > + struct ibv_create_cq_attr_ex *cq_attr, > + enum cmd_type cmd_type) > { > - struct mlx4_create_cq cmd; > - struct mlx4_create_cq_resp resp; > - struct mlx4_cq*cq; > - intret; > - struct mlx4_context *mctx = to_mctx(context); > + struct mlx4_create_cq cmd; > + struct mlx4_create_cq_excmd_e; > + struct mlx4_create_cq_resp resp; > + struct mlx4_create_cq_resp_ex resp_e; > + struct mlx4_cq *cq; > + int ret; > + struct mlx4_context *mctx = to_mctx(context); > + struct ibv_create_cq_attr_excq_attr_e; > + int cqe; > > /* Sanity check CQ size before proceeding */ > - if (cqe > 0x3f) > + if (cq_attr->cqe > 0x3f) > + return NULL; > + > + if (cq_attr->comp_mask & ~CREATE_CQ_SUPPORTED_COMP_MASK) { > + errno = EINVAL; > return NULL; > + } > +
Re: [PATCH libibverbs 7/7] Optimize ibv_poll_cq_ex for common scenarios
On Tue, Oct 27, 2015 at 6:52 PM, Matan Barak wrote: > The current ibv_poll_cq_ex mechanism needs to query every field > for its existence. In order to avoid this penalty at runtime, > add optimized functions for special cases. > > Signed-off-by: Matan Barak > --- > configure.ac | 17 > src/cq.c | 268 > ++- > src/mlx4.h | 20 - > src/verbs.c | 10 +-- > 4 files changed, 271 insertions(+), 44 deletions(-) > > diff --git a/configure.ac b/configure.ac > index 6e98f20..9dbbb4b 100644 > --- a/configure.ac > +++ b/configure.ac > @@ -45,6 +45,23 @@ AC_CHECK_MEMBER([struct verbs_context.ibv_create_flow], [], > [AC_MSG_ERROR([libmlx4 requires libibverbs >= 1.2.0])], > [[#include ]]) > > +AC_MSG_CHECKING("always inline") > +CFLAGS_BAK="$CFLAGS" > +CFLAGS="$CFLAGS -Werror" > +AC_COMPILE_IFELSE([AC_LANG_PROGRAM([[ > + static inline int f(void) > + __attribute((always_inline)); > + static inline int f(void) > + { > + return 1; > + } > +]],[[ > + int a = f(); > + a = a; > +]])], [AC_MSG_RESULT([yes]) AC_DEFINE([HAVE_ALWAYS_INLINE], [1], [Define if > __attribute((always_inline)).])], > +[AC_MSG_RESULT([no])]) > +CFLAGS="$CFLAGS_BAK" > + > dnl Checks for typedefs, structures, and compiler characteristics. 
Re: [PATCH libibverbs 5/7] Add support for ibv_query_values_ex
On Tue, Oct 27, 2015 at 6:52 PM, Matan Barak wrote:
Re: [PATCH libibverbs 3/7] Implement ibv_poll_cq_ex extension verb
On Tue, Oct 27, 2015 at 6:52 PM, Matan Barak wrote:
[PATCH v1 libmlx4 0/7] Completion timestamping
Hi Yishai,

This series adds support for completion timestamps. In order to support this feature, several extended verbs were implemented (as instructed in libibverbs).

ibv_query_device_ex was extended to support reading the hca_core_clock and the timestamp mask. The same verb was extended with vendor-dependent data which is used in order to map the HCA's free-running clock register. When libmlx4 initializes, it tries to mmap this free-running clock register. This mapping is used in order to implement ibv_query_values_ex efficiently.

In order to support CQ completion timestamp reporting, we implement the ibv_create_cq_ex verb. This verb is used both for creating a CQ which supports timestamps and for stating which fields should be returned via the WC. Returning this data is done by implementing ibv_poll_cq_ex. We query the CQ's requested wc_flags for every field the user has requested and populate it according to the carried network operation and WC status.

Last but not least, ibv_poll_cq_ex was optimized in order to eliminate the if statements and OR operations for common combinations of WC fields. This is done by inlining and using a custom poll_one_ex function for each of these combinations.

Thanks,
Matan

Changes from v0:
 * Changed patch-set to correct prefix.

Matan Barak (7):
  Add support for extended version of ibv_query_device
  Add support for ibv_create_cq_ex
  Implement ibv_poll_cq_ex extension verb
  Add timestamp support to extended poll_cq verb
  Add support for ibv_query_values_ex
  Add support for different poll_one_ex functions
  Optimize ibv_poll_cq_ex for common scenarios

 configure.ac | 17 ++
 src/cq.c | 512 +
 src/mlx4-abi.h | 25 +++
 src/mlx4.c | 39 +
 src/mlx4.h | 64 +++-
 src/verbs.c| 219 +---
 6 files changed, 823 insertions(+), 53 deletions(-)
-- 
2.1.0
-- 
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH v1 libmlx4 4/7] Add timestamp support to extended poll_cq verb
Adding support to the extended version of poll_cq verb to read completion timestamp. Reading timestamp isn't supported with reading IBV_WC_EX_WITH_SL and IBV_WC_EX_WITH_SLID. Signed-off-by: Matan Barak --- src/cq.c| 10 ++ src/mlx4.h | 25 - src/verbs.c | 3 ++- 3 files changed, 32 insertions(+), 6 deletions(-) diff --git a/src/cq.c b/src/cq.c index c86e824..7f40f12 100644 --- a/src/cq.c +++ b/src/cq.c @@ -399,6 +399,16 @@ static inline int _mlx4_poll_one_ex(struct mlx4_cq *cq, if (err != CQ_CONTINUE) return err; + if (wc_flags & IBV_WC_EX_WITH_COMPLETION_TIMESTAMP) { + uint16_t timestamp_0_15 = cqe->timestamp_0_7 | + cqe->timestamp_8_15 << 8; + + wc_flags_out |= IBV_WC_EX_WITH_COMPLETION_TIMESTAMP; + *wc_buffer.b64++ = (((uint64_t)ntohl(cqe->timestamp_16_47) ++ !timestamp_0_15) << 16) | + (uint64_t)timestamp_0_15; + } + if (is_send) { switch (cqe->owner_sr_opcode & MLX4_CQE_OPCODE_MASK) { case MLX4_OPCODE_RDMA_WRITE_IMM: diff --git a/src/mlx4.h b/src/mlx4.h index e22f879..2465298 100644 --- a/src/mlx4.h +++ b/src/mlx4.h @@ -312,14 +312,29 @@ struct mlx4_cqe { uint32_tvlan_my_qpn; uint32_timmed_rss_invalid; uint32_tg_mlpath_rqpn; - uint8_t sl_vid; - uint8_t reserved1; - uint16_trlid; - uint32_tstatus; + union { + struct { + union { + struct { + uint8_t sl_vid; + uint8_t reserved1; + uint16_t rlid; + }; + uint32_t timestamp_16_47; + }; + uint32_t status; + }; + struct { + uint16_t reserved2; + uint8_t smac[6]; + }; + }; uint32_tbyte_cnt; uint16_twqe_index; uint16_tchecksum; - uint8_t reserved3[3]; + uint8_t reserved3; + uint8_t timestamp_8_15; + uint8_t timestamp_0_7; uint8_t owner_sr_opcode; }; diff --git a/src/verbs.c b/src/verbs.c index 0dcdc87..a8d6bd7 100644 --- a/src/verbs.c +++ b/src/verbs.c @@ -286,7 +286,8 @@ enum { }; enum { - CREATE_CQ_SUPPORTED_WC_FLAGS = IBV_WC_STANDARD_FLAGS + CREATE_CQ_SUPPORTED_WC_FLAGS = IBV_WC_STANDARD_FLAGS| + IBV_WC_EX_WITH_COMPLETION_TIMESTAMP }; static struct ibv_cq *create_cq(struct ibv_context *context, -- 2.1.0 -- To 
unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH v1 libmlx4 7/7] Optimize ibv_poll_cq_ex for common scenarios
The current ibv_poll_cq_ex mechanism needs to query every field for its existence. In order to avoid this penalty at runtime, add optimized functions for special cases. Signed-off-by: Matan Barak --- configure.ac | 17 src/cq.c | 268 ++- src/mlx4.h | 20 - src/verbs.c | 10 +-- 4 files changed, 271 insertions(+), 44 deletions(-) diff --git a/configure.ac b/configure.ac index 6e98f20..9dbbb4b 100644 --- a/configure.ac +++ b/configure.ac @@ -45,6 +45,23 @@ AC_CHECK_MEMBER([struct verbs_context.ibv_create_flow], [], [AC_MSG_ERROR([libmlx4 requires libibverbs >= 1.2.0])], [[#include ]]) +AC_MSG_CHECKING("always inline") +CFLAGS_BAK="$CFLAGS" +CFLAGS="$CFLAGS -Werror" +AC_COMPILE_IFELSE([AC_LANG_PROGRAM([[ + static inline int f(void) + __attribute((always_inline)); + static inline int f(void) + { + return 1; + } +]],[[ + int a = f(); + a = a; +]])], [AC_MSG_RESULT([yes]) AC_DEFINE([HAVE_ALWAYS_INLINE], [1], [Define if __attribute((always_inline)).])], +[AC_MSG_RESULT([no])]) +CFLAGS="$CFLAGS_BAK" + dnl Checks for typedefs, structures, and compiler characteristics. 
AC_C_CONST AC_CHECK_SIZEOF(long) diff --git a/src/cq.c b/src/cq.c index 1f2d572..56c0fa4 100644 --- a/src/cq.c +++ b/src/cq.c @@ -377,10 +377,22 @@ union wc_buffer { uint64_t*b64; }; +#define IS_IN_WC_FLAGS(yes, no, maybe, flag) (((yes) & (flag)) ||\ + (!((no) & (flag)) && \ + ((maybe) & (flag static inline int _mlx4_poll_one_ex(struct mlx4_cq *cq, struct mlx4_qp **cur_qp, struct ibv_wc_ex **pwc_ex, - uint64_t wc_flags) + uint64_t wc_flags, + uint64_t yes_wc_flags, + uint64_t no_wc_flags) + ALWAYS_INLINE; +static inline int _mlx4_poll_one_ex(struct mlx4_cq *cq, + struct mlx4_qp **cur_qp, + struct ibv_wc_ex **pwc_ex, + uint64_t wc_flags, + uint64_t wc_flags_yes, + uint64_t wc_flags_no) { struct mlx4_cqe *cqe; uint32_t qpn; @@ -392,14 +404,14 @@ static inline int _mlx4_poll_one_ex(struct mlx4_cq *cq, uint64_t wc_flags_out = 0; wc_buffer.b64 = (uint64_t *)&wc_ex->buffer; - wc_ex->wc_flags = 0; wc_ex->reserved = 0; err = mlx4_handle_cq(cq, cur_qp, &wc_ex->wr_id, &wc_ex->status, &wc_ex->vendor_err, &cqe, &qpn, &is_send); if (err != CQ_CONTINUE) return err; - if (wc_flags & IBV_WC_EX_WITH_COMPLETION_TIMESTAMP) { + if (IS_IN_WC_FLAGS(wc_flags_yes, wc_flags_no, wc_flags, + IBV_WC_EX_WITH_COMPLETION_TIMESTAMP)) { uint16_t timestamp_0_15 = cqe->timestamp_0_7 | cqe->timestamp_8_15 << 8; @@ -415,80 +427,101 @@ static inline int _mlx4_poll_one_ex(struct mlx4_cq *cq, wc_flags_out |= IBV_WC_EX_IMM; case MLX4_OPCODE_RDMA_WRITE: wc_ex->opcode= IBV_WC_RDMA_WRITE; - if (wc_flags & IBV_WC_EX_WITH_BYTE_LEN) + if (IS_IN_WC_FLAGS(wc_flags_yes, wc_flags_no, wc_flags, + IBV_WC_EX_WITH_BYTE_LEN)) wc_buffer.b32++; - if (wc_flags & IBV_WC_EX_WITH_IMM) + if (IS_IN_WC_FLAGS(wc_flags_yes, wc_flags_no, wc_flags, + IBV_WC_EX_WITH_IMM)) wc_buffer.b32++; break; case MLX4_OPCODE_SEND_IMM: wc_flags_out |= IBV_WC_EX_IMM; case MLX4_OPCODE_SEND: wc_ex->opcode= IBV_WC_SEND; - if (wc_flags & IBV_WC_EX_WITH_BYTE_LEN) + if (IS_IN_WC_FLAGS(wc_flags_yes, wc_flags_no, wc_flags, + IBV_WC_EX_WITH_BYTE_LEN)) 
wc_buffer.b32++; - if (wc_flags & IBV_WC_EX_WITH_IMM) + if (IS_IN_WC_FLAGS(wc_flags_yes, wc_flags_no, wc_flags, + IBV_WC_EX_WITH_IMM)) wc_buffer.b32++; break; case MLX4_OPCODE_RDMA_READ: wc_ex->opcode= IBV_WC_RDMA_READ; - if (wc_flags & IBV_WC_EX_WITH_BYTE_LEN) { + if (IS_IN_WC_FLAGS(wc_flags_yes, wc_fla
[PATCH v1 libmlx4 2/7] Add support for ibv_create_cq_ex
Add an extension verb mlx4_create_cq_ex that follows the standard extension verb mechanism. This function is similar to mlx4_create_cq but supports the extension verbs functions and stores the creation flags for later use (for example, timestamp flag is used in poll_cq). The function fails if the user passes unsupported WC attributes. Signed-off-by: Matan Barak --- src/mlx4-abi.h | 12 ++ src/mlx4.c | 1 + src/mlx4.h | 3 ++ src/verbs.c| 117 + 4 files changed, 117 insertions(+), 16 deletions(-) diff --git a/src/mlx4-abi.h b/src/mlx4-abi.h index b348ce3..9b765e4 100644 --- a/src/mlx4-abi.h +++ b/src/mlx4-abi.h @@ -72,12 +72,24 @@ struct mlx4_create_cq { __u64 db_addr; }; +struct mlx4_create_cq_ex { + struct ibv_create_cq_ex ibv_cmd; + __u64 buf_addr; + __u64 db_addr; +}; + struct mlx4_create_cq_resp { struct ibv_create_cq_resp ibv_resp; __u32 cqn; __u32 reserved; }; +struct mlx4_create_cq_resp_ex { + struct ibv_create_cq_resp_exibv_resp; + __u32 cqn; + __u32 reserved; +}; + struct mlx4_resize_cq { struct ibv_resize_cqibv_cmd; __u64 buf_addr; diff --git a/src/mlx4.c b/src/mlx4.c index d41dff0..9cfd013 100644 --- a/src/mlx4.c +++ b/src/mlx4.c @@ -208,6 +208,7 @@ static int mlx4_init_context(struct verbs_device *v_device, verbs_set_ctx_op(verbs_ctx, ibv_create_flow, ibv_cmd_create_flow); verbs_set_ctx_op(verbs_ctx, ibv_destroy_flow, ibv_cmd_destroy_flow); verbs_set_ctx_op(verbs_ctx, query_device_ex, mlx4_query_device_ex); + verbs_set_ctx_op(verbs_ctx, create_cq_ex, mlx4_create_cq_ex); return 0; diff --git a/src/mlx4.h b/src/mlx4.h index 0f643bc..91eb79c 100644 --- a/src/mlx4.h +++ b/src/mlx4.h @@ -222,6 +222,7 @@ struct mlx4_cq { uint32_t *arm_db; int arm_sn; int cqe_size; + int creation_flags; }; struct mlx4_srq { @@ -402,6 +403,8 @@ int mlx4_dereg_mr(struct ibv_mr *mr); struct ibv_cq *mlx4_create_cq(struct ibv_context *context, int cqe, struct ibv_comp_channel *channel, int comp_vector); +struct ibv_cq *mlx4_create_cq_ex(struct ibv_context *context, +struct 
ibv_create_cq_attr_ex *cq_attr); int mlx4_alloc_cq_buf(struct mlx4_device *dev, struct mlx4_buf *buf, int nent, int entry_size); int mlx4_resize_cq(struct ibv_cq *cq, int cqe); diff --git a/src/verbs.c b/src/verbs.c index e93114b..3290b86 100644 --- a/src/verbs.c +++ b/src/verbs.c @@ -272,19 +272,69 @@ int align_queue_size(int req) return nent; } -struct ibv_cq *mlx4_create_cq(struct ibv_context *context, int cqe, - struct ibv_comp_channel *channel, - int comp_vector) +enum cmd_type { + MLX4_CMD_TYPE_BASIC, + MLX4_CMD_TYPE_EXTENDED +}; + +enum { + CREATE_CQ_SUPPORTED_COMP_MASK = IBV_CREATE_CQ_ATTR_FLAGS +}; + +enum { + CREATE_CQ_SUPPORTED_FLAGS = IBV_CREATE_CQ_ATTR_COMPLETION_TIMESTAMP +}; + +enum { + CREATE_CQ_SUPPORTED_WC_FLAGS = IBV_WC_STANDARD_FLAGS +}; + +static struct ibv_cq *create_cq(struct ibv_context *context, + struct ibv_create_cq_attr_ex *cq_attr, + enum cmd_type cmd_type) { - struct mlx4_create_cq cmd; - struct mlx4_create_cq_resp resp; - struct mlx4_cq*cq; - intret; - struct mlx4_context *mctx = to_mctx(context); + struct mlx4_create_cq cmd; + struct mlx4_create_cq_excmd_e; + struct mlx4_create_cq_resp resp; + struct mlx4_create_cq_resp_ex resp_e; + struct mlx4_cq *cq; + int ret; + struct mlx4_context *mctx = to_mctx(context); + struct ibv_create_cq_attr_excq_attr_e; + int cqe; /* Sanity check CQ size before proceeding */ - if (cqe > 0x3f) + if (cq_attr->cqe > 0x3f) + return NULL; + + if (cq_attr->comp_mask & ~CREATE_CQ_SUPPORTED_COMP_MASK) { + errno = EINVAL; return NULL; + } + + if (cq_attr->comp_mask & IBV_CREATE_CQ_ATTR_FLAGS && + cq_attr->flags & ~CREATE_CQ_SUPPORTED_FLAGS) { + errno = EINVAL; + return NULL; + } + + if (cq_attr->wc_flags & ~CREATE_CQ_SUPPORTED_WC_FLAGS) { + errno = ENOTSUP; + return
[PATCH v1 libmlx4 1/7] Add support for extended version of ibv_query_device
The new mlx4_query_device_ex implementation uses the extended version of libibverbs/uverbs query_device command. In addition, it reads the hca_core_clock offset in the bar from the vendor specific part of ibv_query_device_ex command. Signed-off-by: Matan Barak --- src/mlx4-abi.h | 13 + src/mlx4.c | 1 + src/mlx4.h | 8 src/verbs.c| 54 +++--- 4 files changed, 73 insertions(+), 3 deletions(-) diff --git a/src/mlx4-abi.h b/src/mlx4-abi.h index b48f6fc..b348ce3 100644 --- a/src/mlx4-abi.h +++ b/src/mlx4-abi.h @@ -111,4 +111,17 @@ struct mlx4_create_qp { __u8reserved[5]; }; +enum query_device_resp_mask { + QUERY_DEVICE_RESP_MASK_TIMESTAMP = 1UL << 0, +}; + +struct query_device_ex_resp { + struct ibv_query_device_resp_ex core; + struct { + uint32_t comp_mask; + uint32_t response_length; + uint64_t hca_core_clock_offset; + }; +}; + #endif /* MLX4_ABI_H */ diff --git a/src/mlx4.c b/src/mlx4.c index c30f4bf..d41dff0 100644 --- a/src/mlx4.c +++ b/src/mlx4.c @@ -207,6 +207,7 @@ static int mlx4_init_context(struct verbs_device *v_device, verbs_set_ctx_op(verbs_ctx, open_qp, mlx4_open_qp); verbs_set_ctx_op(verbs_ctx, ibv_create_flow, ibv_cmd_create_flow); verbs_set_ctx_op(verbs_ctx, ibv_destroy_flow, ibv_cmd_destroy_flow); + verbs_set_ctx_op(verbs_ctx, query_device_ex, mlx4_query_device_ex); return 0; diff --git a/src/mlx4.h b/src/mlx4.h index 519d8f4..0f643bc 100644 --- a/src/mlx4.h +++ b/src/mlx4.h @@ -198,6 +198,7 @@ struct mlx4_context { uint8_t link_layer; enum ibv_port_cap_flags caps; } port_query_cache[MLX4_PORTS_NUM]; + uint64_tcore_clock_offset; }; struct mlx4_buf { @@ -378,6 +379,13 @@ void mlx4_free_db(struct mlx4_context *context, enum mlx4_db_type type, uint32_t int mlx4_query_device(struct ibv_context *context, struct ibv_device_attr *attr); +int _mlx4_query_device_ex(struct ibv_context *context, + const struct ibv_query_device_ex_input *input, + struct ibv_device_attr_ex *attr, size_t attr_size, + uint32_t *comp_mask); +int mlx4_query_device_ex(struct ibv_context 
*context, +const struct ibv_query_device_ex_input *input, +struct ibv_device_attr_ex *attr, size_t attr_size); int mlx4_query_port(struct ibv_context *context, uint8_t port, struct ibv_port_attr *attr); diff --git a/src/verbs.c b/src/verbs.c index 2cb1f8a..e93114b 100644 --- a/src/verbs.c +++ b/src/verbs.c @@ -45,6 +45,14 @@ #include "mlx4-abi.h" #include "wqe.h" +static void parse_raw_fw_ver(uint64_t raw_fw_ver, unsigned *major, +unsigned *minor, unsigned *sub_minor) +{ + *major = (raw_fw_ver >> 32) & 0x; + *minor = (raw_fw_ver >> 16) & 0x; + *sub_minor = raw_fw_ver & 0x; +} + int mlx4_query_device(struct ibv_context *context, struct ibv_device_attr *attr) { struct ibv_query_device cmd; @@ -56,9 +64,7 @@ int mlx4_query_device(struct ibv_context *context, struct ibv_device_attr *attr) if (ret) return ret; - major = (raw_fw_ver >> 32) & 0x; - minor = (raw_fw_ver >> 16) & 0x; - sub_minor = raw_fw_ver & 0x; + parse_raw_fw_ver(raw_fw_ver, &major, &minor, &sub_minor); snprintf(attr->fw_ver, sizeof attr->fw_ver, "%d.%d.%03d", major, minor, sub_minor); @@ -66,6 +72,48 @@ int mlx4_query_device(struct ibv_context *context, struct ibv_device_attr *attr) return 0; } +int _mlx4_query_device_ex(struct ibv_context *context, + const struct ibv_query_device_ex_input *input, + struct ibv_device_attr_ex *attr, size_t attr_size, + uint32_t *comp_mask) +{ + struct ibv_query_device_ex cmd; + struct query_device_ex_resp resp; + uint64_t raw_fw_ver; + unsigned major, minor, sub_minor; + int ret; + + memset(&resp, 0, sizeof(resp)); + + ret = ibv_cmd_query_device_ex(context, input, attr, attr_size, + &raw_fw_ver, &cmd, sizeof(cmd), + sizeof(cmd), &resp.core, + sizeof(resp.core), sizeof(resp)); + if (ret) + return ret; + + parse_raw_fw_ver(raw_fw_ver, &major, &minor, &sub_minor); + + snprintf(attr->orig_attr.fw_ver, sizeof(attr->orig_attr.fw_ver), +"%d.%d.%03d", major, minor, sub_minor); + + if (resp.comp_mask & QUERY_DEVICE_RESP_MASK_TIMESTAMP) + to_mctx(context)->core_clock_offset = + re
[PATCH v1 libmlx4 5/7] Add support for ibv_query_values_ex
Adding mlx4_query_values as implementation for ibv_query_values_ex. mlx4_query_values follows the standard extension verb mechanism. This function supports reading the hwclock via mmaping the required space from kernel. Signed-off-by: Matan Barak --- src/mlx4.c | 36 src/mlx4.h | 3 +++ src/verbs.c | 45 + 3 files changed, 84 insertions(+) diff --git a/src/mlx4.c b/src/mlx4.c index cc1211f..6d66cf0 100644 --- a/src/mlx4.c +++ b/src/mlx4.c @@ -116,6 +116,28 @@ static struct ibv_context_ops mlx4_ctx_ops = { .detach_mcast = ibv_cmd_detach_mcast }; +static int mlx4_map_internal_clock(struct mlx4_device *dev, + struct ibv_context *ibv_ctx) +{ + struct mlx4_context *context = to_mctx(ibv_ctx); + void *hca_clock_page; + + hca_clock_page = mmap(NULL, dev->page_size, PROT_READ, MAP_SHARED, + ibv_ctx->cmd_fd, dev->page_size * 3); + + if (hca_clock_page == MAP_FAILED) { + fprintf(stderr, PFX + "Warning: Timestamp available,\n" + "but failed to mmap() hca core clock page, errno=%d.\n", + errno); + return -1; + } + + context->hca_core_clock = hca_clock_page + + context->core_clock_offset % dev->page_size; + return 0; +} + static int mlx4_init_context(struct verbs_device *v_device, struct ibv_context *ibv_ctx, int cmd_fd) { @@ -127,6 +149,10 @@ static int mlx4_init_context(struct verbs_device *v_device, __u16 bf_reg_size; struct mlx4_device *dev = to_mdev(&v_device->device); struct verbs_context *verbs_ctx = verbs_get_ctx(ibv_ctx); + struct ibv_query_device_ex_input input_query_device = {.comp_mask = 0}; + struct ibv_device_attr_ex dev_attrs; + uint32_tdev_attrs_comp_mask; + int err; /* memory footprint of mlx4_context and verbs_context share * struct ibv_context. 
@@ -194,6 +220,12 @@ static int mlx4_init_context(struct verbs_device *v_device, context->bf_buf_size = 0; } + context->hca_core_clock = NULL; + err = _mlx4_query_device_ex(ibv_ctx, &input_query_device, &dev_attrs, + sizeof(dev_attrs), &dev_attrs_comp_mask); + if (!err && dev_attrs_comp_mask & QUERY_DEVICE_RESP_MASK_TIMESTAMP) + mlx4_map_internal_clock(dev, ibv_ctx); + pthread_spin_init(&context->uar_lock, PTHREAD_PROCESS_PRIVATE); ibv_ctx->ops = mlx4_ctx_ops; @@ -210,6 +242,7 @@ static int mlx4_init_context(struct verbs_device *v_device, verbs_set_ctx_op(verbs_ctx, query_device_ex, mlx4_query_device_ex); verbs_set_ctx_op(verbs_ctx, create_cq_ex, mlx4_create_cq_ex); verbs_set_ctx_op(verbs_ctx, poll_cq_ex, mlx4_poll_cq_ex); + verbs_set_ctx_op(verbs_ctx, query_values, mlx4_query_values); return 0; @@ -223,6 +256,9 @@ static void mlx4_uninit_context(struct verbs_device *v_device, munmap(context->uar, to_mdev(&v_device->device)->page_size); if (context->bf_page) munmap(context->bf_page, to_mdev(&v_device->device)->page_size); + if (context->hca_core_clock) + munmap(context->hca_core_clock - context->core_clock_offset, + to_mdev(&v_device->device)->page_size); } diff --git a/src/mlx4.h b/src/mlx4.h index 2465298..8e1935d 100644 --- a/src/mlx4.h +++ b/src/mlx4.h @@ -199,6 +199,7 @@ struct mlx4_context { enum ibv_port_cap_flags caps; } port_query_cache[MLX4_PORTS_NUM]; uint64_tcore_clock_offset; + void *hca_core_clock; }; struct mlx4_buf { @@ -403,6 +404,8 @@ int _mlx4_query_device_ex(struct ibv_context *context, int mlx4_query_device_ex(struct ibv_context *context, const struct ibv_query_device_ex_input *input, struct ibv_device_attr_ex *attr, size_t attr_size); +int mlx4_query_values(struct ibv_context *context, + struct ibv_values_ex *values); int mlx4_query_port(struct ibv_context *context, uint8_t port, struct ibv_port_attr *attr); diff --git a/src/verbs.c b/src/verbs.c index a8d6bd7..843ca1e 100644 --- a/src/verbs.c +++ b/src/verbs.c @@ -114,6 +114,51 @@ int 
mlx4_query_device_ex(struct ibv_context *context, return _mlx4_query_device_ex(context, input, attr, attr_size, NULL); } +#define READL(ptr) (*((uint32_t *)(ptr))) +static int mlx4_read_clock(struct ibv_context *context, uint64_t *cycles) +{ + unsigned int clockhi, clocklo, clockhi1; + int i; + struct mlx4_context *ctx = to_mctx(context); + + if (!ctx->hca_core_clo
[PATCH v1 libmlx4 6/7] Add support for different poll_one_ex functions
In order to optimize the poll_one extended verb for different wc_flags, add support for a poll_one_ex callback function. Signed-off-by: Matan Barak --- src/cq.c| 5 +++-- src/mlx4.h | 5 + src/verbs.c | 1 + 3 files changed, 9 insertions(+), 2 deletions(-) diff --git a/src/cq.c b/src/cq.c index 7f40f12..1f2d572 100644 --- a/src/cq.c +++ b/src/cq.c @@ -601,7 +601,8 @@ int mlx4_poll_cq_ex(struct ibv_cq *ibcq, int npolled; int err = CQ_OK; unsigned int ne = attr->max_entries; - uint64_t wc_flags = cq->wc_flags; + int (*poll_fn)(struct mlx4_cq *cq, struct mlx4_qp **cur_qp, + struct ibv_wc_ex **wc_ex) = cq->mlx4_poll_one; if (attr->comp_mask) return -EINVAL; @@ -609,7 +610,7 @@ int mlx4_poll_cq_ex(struct ibv_cq *ibcq, pthread_spin_lock(&cq->lock); for (npolled = 0; npolled < ne; ++npolled) { - err = _mlx4_poll_one_ex(cq, &qp, &wc, wc_flags); + err = poll_fn(cq, &qp, &wc); if (err != CQ_OK) break; } diff --git a/src/mlx4.h b/src/mlx4.h index 8e1935d..46a18d6 100644 --- a/src/mlx4.h +++ b/src/mlx4.h @@ -215,6 +215,8 @@ struct mlx4_cq { struct ibv_cq ibv_cq; uint64_twc_flags; + int (*mlx4_poll_one)(struct mlx4_cq *cq, struct mlx4_qp **cur_qp, +struct ibv_wc_ex **wc_ex); struct mlx4_buf buf; struct mlx4_buf resize_buf; pthread_spinlock_t lock; @@ -432,6 +434,9 @@ int mlx4_poll_cq(struct ibv_cq *cq, int ne, struct ibv_wc *wc); int mlx4_poll_cq_ex(struct ibv_cq *ibcq, struct ibv_wc_ex *wc, struct ibv_poll_cq_ex_attr *attr); +int mlx4_poll_one_ex(struct mlx4_cq *cq, +struct mlx4_qp **cur_qp, +struct ibv_wc_ex **pwc_ex); int mlx4_arm_cq(struct ibv_cq *cq, int solicited); void mlx4_cq_event(struct ibv_cq *cq); void __mlx4_cq_clean(struct mlx4_cq *cq, uint32_t qpn, struct mlx4_srq *srq); diff --git a/src/verbs.c b/src/verbs.c index 843ca1e..62908c1 100644 --- a/src/verbs.c +++ b/src/verbs.c @@ -432,6 +432,7 @@ static struct ibv_cq *create_cq(struct ibv_context *context, if (ret) goto err_db; + cq->mlx4_poll_one = mlx4_poll_one_ex; cq->creation_flags =
cmd_e.ibv_cmd.flags; cq->wc_flags = cq_attr->wc_flags; cq->cqn = resp.cqn; -- 2.1.0
[PATCH v1 libmlx4 3/7] Implement ibv_poll_cq_ex extension verb
Add an implementation for verb_poll_cq extension verb. This patch implements the new API via the standard function mlx4_poll_one. Signed-off-by: Matan Barak --- src/cq.c| 307 ++-- src/mlx4.c | 1 + src/mlx4.h | 4 + src/verbs.c | 1 + 4 files changed, 284 insertions(+), 29 deletions(-) diff --git a/src/cq.c b/src/cq.c index 32c9070..c86e824 100644 --- a/src/cq.c +++ b/src/cq.c @@ -52,6 +52,7 @@ enum { }; enum { + CQ_CONTINUE = 1, CQ_OK = 0, CQ_EMPTY= -1, CQ_POLL_ERR = -2 @@ -121,7 +122,9 @@ static void update_cons_index(struct mlx4_cq *cq) *cq->set_ci_db = htonl(cq->cons_index & 0xff); } -static void mlx4_handle_error_cqe(struct mlx4_err_cqe *cqe, struct ibv_wc *wc) +static void mlx4_handle_error_cqe(struct mlx4_err_cqe *cqe, + enum ibv_wc_status *status, + enum ibv_wc_opcode *vendor_err) { if (cqe->syndrome == MLX4_CQE_SYNDROME_LOCAL_QP_OP_ERR) printf(PFX "local QP operation err " @@ -133,64 +136,68 @@ static void mlx4_handle_error_cqe(struct mlx4_err_cqe *cqe, struct ibv_wc *wc) switch (cqe->syndrome) { case MLX4_CQE_SYNDROME_LOCAL_LENGTH_ERR: - wc->status = IBV_WC_LOC_LEN_ERR; + *status = IBV_WC_LOC_LEN_ERR; break; case MLX4_CQE_SYNDROME_LOCAL_QP_OP_ERR: - wc->status = IBV_WC_LOC_QP_OP_ERR; + *status = IBV_WC_LOC_QP_OP_ERR; break; case MLX4_CQE_SYNDROME_LOCAL_PROT_ERR: - wc->status = IBV_WC_LOC_PROT_ERR; + *status = IBV_WC_LOC_PROT_ERR; break; case MLX4_CQE_SYNDROME_WR_FLUSH_ERR: - wc->status = IBV_WC_WR_FLUSH_ERR; + *status = IBV_WC_WR_FLUSH_ERR; break; case MLX4_CQE_SYNDROME_MW_BIND_ERR: - wc->status = IBV_WC_MW_BIND_ERR; + *status = IBV_WC_MW_BIND_ERR; break; case MLX4_CQE_SYNDROME_BAD_RESP_ERR: - wc->status = IBV_WC_BAD_RESP_ERR; + *status = IBV_WC_BAD_RESP_ERR; break; case MLX4_CQE_SYNDROME_LOCAL_ACCESS_ERR: - wc->status = IBV_WC_LOC_ACCESS_ERR; + *status = IBV_WC_LOC_ACCESS_ERR; break; case MLX4_CQE_SYNDROME_REMOTE_INVAL_REQ_ERR: - wc->status = IBV_WC_REM_INV_REQ_ERR; + *status = IBV_WC_REM_INV_REQ_ERR; break; case MLX4_CQE_SYNDROME_REMOTE_ACCESS_ERR: - 
wc->status = IBV_WC_REM_ACCESS_ERR; + *status = IBV_WC_REM_ACCESS_ERR; break; case MLX4_CQE_SYNDROME_REMOTE_OP_ERR: - wc->status = IBV_WC_REM_OP_ERR; + *status = IBV_WC_REM_OP_ERR; break; case MLX4_CQE_SYNDROME_TRANSPORT_RETRY_EXC_ERR: - wc->status = IBV_WC_RETRY_EXC_ERR; + *status = IBV_WC_RETRY_EXC_ERR; break; case MLX4_CQE_SYNDROME_RNR_RETRY_EXC_ERR: - wc->status = IBV_WC_RNR_RETRY_EXC_ERR; + *status = IBV_WC_RNR_RETRY_EXC_ERR; break; case MLX4_CQE_SYNDROME_REMOTE_ABORTED_ERR: - wc->status = IBV_WC_REM_ABORT_ERR; + *status = IBV_WC_REM_ABORT_ERR; break; default: - wc->status = IBV_WC_GENERAL_ERR; + *status = IBV_WC_GENERAL_ERR; break; } - wc->vendor_err = cqe->vendor_err; + *vendor_err = cqe->vendor_err; } -static int mlx4_poll_one(struct mlx4_cq *cq, -struct mlx4_qp **cur_qp, -struct ibv_wc *wc) +static inline int mlx4_handle_cq(struct mlx4_cq *cq, +struct mlx4_qp **cur_qp, +uint64_t *wc_wr_id, +enum ibv_wc_status *wc_status, +uint32_t *wc_vendor_err, +struct mlx4_cqe **pcqe, +uint32_t *pqpn, +int *pis_send) { struct mlx4_wq *wq; struct mlx4_cqe *cqe; struct mlx4_srq *srq; uint32_t qpn; - uint32_t g_mlpath_rqpn; - uint16_t wqe_index; int is_error; int is_send; + uint16_t wqe_index; cqe = next_cqe_sw(cq); if (!cqe) @@ -201,7 +208,7 @@ static int mlx4_poll_one(struct mlx4_cq *cq, ++cq->cons_index; - VALGRIND_MAKE_MEM_DEFINED(cqe, sizeof *cqe); + VALGRIND_MAKE_MEM_DEFINED(cqe, sizeof(*cqe)); /* * Make sure we read CQ entry contents after we've checked the @@ -210,7 +217,6 @@ st
RE: [PATCH 1/2] mlx4: Expose correct max_sge_rd limit
Just discussed the issue with Sagi. Sagi will follow up with a small correction. -Original Message- From: Sagi Grimberg [mailto:sa...@dev.mellanox.co.il] Sent: Tuesday, October 27, 2015 11:32 AM To: Bart Van Assche; linux-rdma@vger.kernel.org; target-de...@vger.kernel.org Cc: Steve Wise; Nicholas A. Bellinger; Or Gerlitz; Doug Ledford; Eli Cohen Subject: Re: [PATCH 1/2] mlx4: Expose correct max_sge_rd limit > Hello Sagi, > > Is this the same issue as what has been discussed in > http://www.spinics.net/lists/linux-rdma/msg21799.html ? Looks like it. I think this patch addresses this issue, but let's CC Eli to comment if I'm missing something. Thanks for digging this up... Sagi. -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] IB/sa: replace GFP_KERNEL with GFP_ATOMIC
On Tue, Oct 27, 2015 at 09:17:40PM +0530, Saurabh Sengar wrote: > replace GFP_KERNEL with GFP_ATOMIC, as code while holding a spinlock > should be atomic > GFP_KERNEL may sleep and can cause deadlock, where as GFP_ATOMIC may > fail but certainly avoids deadlock Great catch. Thanks! However, gfp_t is passed to send_mad and we should pass that down and use it. Compile tested only, suggestion below, Ira 14:09:12 > git di diff --git a/drivers/infiniband/core/sa_query.c b/drivers/infiniband/core/sa_query.c index 8c014b33d8e0..54d454042b28 100644 --- a/drivers/infiniband/core/sa_query.c +++ b/drivers/infiniband/core/sa_query.c @@ -512,7 +512,7 @@ static int ib_nl_get_path_rec_attrs_len(ib_sa_comp_mask comp_mask) return len; } -static int ib_nl_send_msg(struct ib_sa_query *query) +static int ib_nl_send_msg(struct ib_sa_query *query, gfp_t gfp_mask) { struct sk_buff *skb = NULL; struct nlmsghdr *nlh; @@ -526,7 +526,7 @@ static int ib_nl_send_msg(struct ib_sa_query *query) if (len <= 0) return -EMSGSIZE; - skb = nlmsg_new(len, GFP_KERNEL); + skb = nlmsg_new(len, gfp_mask); if (!skb) return -ENOMEM; @@ -544,7 +544,7 @@ static int ib_nl_send_msg(struct ib_sa_query *query) /* Repair the nlmsg header length */ nlmsg_end(skb, nlh); - ret = ibnl_multicast(skb, nlh, RDMA_NL_GROUP_LS, GFP_KERNEL); + ret = ibnl_multicast(skb, nlh, RDMA_NL_GROUP_LS, gfp_mask); if (!ret) ret = len; else @@ -553,7 +553,7 @@ static int ib_nl_send_msg(struct ib_sa_query *query) return ret; } -static int ib_nl_make_request(struct ib_sa_query *query) +static int ib_nl_make_request(struct ib_sa_query *query, gfp_t gfp_mask) { unsigned long flags; unsigned long delay; @@ -563,7 +563,7 @@ static int ib_nl_make_request(struct ib_sa_query *query) query->seq = (u32)atomic_inc_return(&ib_nl_sa_request_seq); spin_lock_irqsave(&ib_nl_request_lock, flags); - ret = ib_nl_send_msg(query); + ret = ib_nl_send_msg(query, gfp_mask); if (ret <= 0) { ret = -EIO; goto request_out; @@ -1105,7 +1105,7 @@ static int 
send_mad(struct ib_sa_query *query, int timeout_ms, gfp_t gfp_mask) if (query->flags & IB_SA_ENABLE_LOCAL_SERVICE) { if (!ibnl_chk_listeners(RDMA_NL_GROUP_LS)) { - if (!ib_nl_make_request(query)) + if (!ib_nl_make_request(query, gfp_mask)) return id; } ib_sa_disable_local_svc(query); -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] IB/sa: replace GFP_KERNEL with GFP_ATOMIC
On Tue, Oct 27, 2015 at 02:12:36PM -0400, ira.weiny wrote: > On Tue, Oct 27, 2015 at 09:17:40PM +0530, Saurabh Sengar wrote: > > replace GFP_KERNEL with GFP_ATOMIC, as code while holding a spinlock > > should be atomic > > GFP_KERNEL may sleep and can cause deadlock, where as GFP_ATOMIC may > > fail but certainly avoids deadlock > > Great catch. Thanks! > > However, gfp_t is passed to send_mad and we should pass that down and use it. > spin_lock_irqsave(&ib_nl_request_lock, flags); > - ret = ib_nl_send_msg(query); > + ret = ib_nl_send_msg(query, gfp_mask); A spin lock is guaranteed held around ib_nl_send_msg, so its allocations have to be atomic; we can't use gfp_mask here. I do wonder if it is a good idea to call ib_nl_send_msg with a spinlock held, though. Would be nice to see that go away. Jason
Re: [PATCH] IB/sa: replace GFP_KERNEL with GFP_ATOMIC
On Tue, Oct 27, 2015 at 12:16:52PM -0600, Jason Gunthorpe wrote: > On Tue, Oct 27, 2015 at 02:12:36PM -0400, ira.weiny wrote: > > On Tue, Oct 27, 2015 at 09:17:40PM +0530, Saurabh Sengar wrote: > > > replace GFP_KERNEL with GFP_ATOMIC, as code while holding a spinlock > > > should be atomic > > > GFP_KERNEL may sleep and can cause deadlock, where as GFP_ATOMIC may > > > fail but certainly avoids deadlock > > > > Great catch. Thanks! > > > > However, gfp_t is passed to send_mad and we should pass that down and use > > it. > > > spin_lock_irqsave(&ib_nl_request_lock, flags); > > - ret = ib_nl_send_msg(query); > > + ret = ib_nl_send_msg(query, gfp_mask); > > A spin lock is guarenteed held around ib_nl_send_msg, so it's > allocations have to be atomic, can't use gfp_mask here.. > > I do wonder if it is a good idea to call ib_nl_send_msg with a spinlock > held though.. Would be nice to see that go away. Ah, yea my bad. Ira > > Jason -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH libibverbs] Expose QP block self multicast loopback creation flag
On Tue, Oct 27, 2015 at 02:53:01PM +0200, Eran Ben Elisha wrote: ... > +enum ibv_qp_create_flags { > + IBV_QP_CREATE_BLOCK_SELF_MCAST_LB = 1 << 1, > }; > I'm sure that I'm missing something important, but why did it start from shift 1 and not shift 0?
Re: [PATCH] IB/sa: replace GFP_KERNEL with GFP_ATOMIC
On Tue, Oct 27, 2015 at 06:56:50PM +, Wan, Kaike wrote: > > I do wonder if it is a good idea to call ib_nl_send_msg with a spinlock held > > though.. Would be nice to see that go away. > > We have to hold the lock to protect against a race condition that a > quick response will try to free the request from the > ib_nl_request_list before we even put it on the list. Put it on the list first? Use a kref? Doesn't look like a big deal to clean this up. Jason
Re: [PATCH v3 12/23] staging/rdma/hfi1: Macro code clean up
On Tue, Oct 27, 2015 at 05:19:10PM +0900, Greg KH wrote: > On Mon, Oct 26, 2015 at 10:28:38AM -0400, ira.we...@intel.com wrote: > > From: Mitko Haralanov > > > > Clean up the context and sdma macros and move them to a more logical place > > in > > hfi.h > > > > Signed-off-by: Mitko Haralanov > > Signed-off-by: Ira Weiny > > --- > > drivers/staging/rdma/hfi1/hfi.h | 22 ++ > > 1 file changed, 10 insertions(+), 12 deletions(-) > > > > diff --git a/drivers/staging/rdma/hfi1/hfi.h > > b/drivers/staging/rdma/hfi1/hfi.h > > index a35213e9b500..41ad9a30149b 100644 > > --- a/drivers/staging/rdma/hfi1/hfi.h > > +++ b/drivers/staging/rdma/hfi1/hfi.h > > @@ -1104,6 +1104,16 @@ struct hfi1_filedata { > > int rec_cpu_num; > > }; > > > > +/* for use in system calls, where we want to know device type, etc. */ > > +#define fp_to_fd(fp) ((struct hfi1_filedata *)(fp)->private_data) > > +#define ctxt_fp(fp) (fp_to_fd((fp))->uctxt) > > +#define subctxt_fp(fp) (fp_to_fd((fp))->subctxt) > > +#define tidcursor_fp(fp) (fp_to_fd((fp))->tidcursor) > > +#define user_sdma_pkt_fp(fp) (fp_to_fd((fp))->pq) > > +#define user_sdma_comp_fp(fp) (fp_to_fd((fp))->cq) > > +#define notifier_fp(fp) (fp_to_fd((fp))->mn) > > +#define rb_fp(fp) (fp_to_fd((fp))->tid_rb_root) > > Ick, no, don't do this, just spell it all out (odds are you will see tht > you can make the code simpler...) If you don't know what "cq" or "pq" > are, then name them properly. > > These need to be all removed. Ok. Can I add the removal of these macros to the TODO list and get this patch accepted in the interm? Many of the patches I am queueing up to submit as well as one in this series do not apply cleanly without this change. It will be much easier if I can get everything applied and then do a global clean up of these macros after the fact. 
Thanks, Ira
Re: [PATCH v3 23/23] staging/rdma/hfi1: Update driver version string to 0.9-294
On Tue, Oct 27, 2015 at 05:46:41PM +0900, Greg KH wrote: > On Mon, Oct 26, 2015 at 10:28:49AM -0400, ira.we...@intel.com wrote: > > From: Jubin John > > > > Signed-off-by: Jubin John > > Signed-off-by: Ira Weiny > > --- > > drivers/staging/rdma/hfi1/common.h | 2 +- > > 1 file changed, 1 insertion(+), 1 deletion(-) > > > > diff --git a/drivers/staging/rdma/hfi1/common.h > > b/drivers/staging/rdma/hfi1/common.h > > index 7809093eb55e..5dd92720faae 100644 > > --- a/drivers/staging/rdma/hfi1/common.h > > +++ b/drivers/staging/rdma/hfi1/common.h > > @@ -205,7 +205,7 @@ > > * to the driver itself, not the software interfaces it supports. > > */ > > #ifndef HFI1_DRIVER_VERSION_BASE > > -#define HFI1_DRIVER_VERSION_BASE "0.9-248" > > +#define HFI1_DRIVER_VERSION_BASE "0.9-294" > > Patches like this make no sense at all, please drop it and only use the > kernel version. What do you mean by "only use the kernel version"? Do you mean #define HFI1_DRIVER_VERSION_BASE UTS_RELEASE Or just remove the macro entirely? > > Trust me, it's going to get messy really fast (hint, it > already did...) Did I base this on the wrong tree? Not sure how this could have messed you up. Thanks, Ira -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v3 12/23] staging/rdma/hfi1: Macro code clean up
On Tue, Oct 27, 2015 at 04:51:15PM -0400, ira.weiny wrote: > On Tue, Oct 27, 2015 at 05:19:10PM +0900, Greg KH wrote: > > On Mon, Oct 26, 2015 at 10:28:38AM -0400, ira.we...@intel.com wrote: > > > From: Mitko Haralanov > > > > > > Clean up the context and sdma macros and move them to a more logical > > > place in > > > hfi.h > > > > > > Signed-off-by: Mitko Haralanov > > > Signed-off-by: Ira Weiny > > > --- > > > drivers/staging/rdma/hfi1/hfi.h | 22 ++ > > > 1 file changed, 10 insertions(+), 12 deletions(-) > > > > > > diff --git a/drivers/staging/rdma/hfi1/hfi.h > > > b/drivers/staging/rdma/hfi1/hfi.h > > > index a35213e9b500..41ad9a30149b 100644 > > > --- a/drivers/staging/rdma/hfi1/hfi.h > > > +++ b/drivers/staging/rdma/hfi1/hfi.h > > > @@ -1104,6 +1104,16 @@ struct hfi1_filedata { > > > int rec_cpu_num; > > > }; > > > > > > +/* for use in system calls, where we want to know device type, etc. */ > > > +#define fp_to_fd(fp) ((struct hfi1_filedata *)(fp)->private_data) > > > +#define ctxt_fp(fp) (fp_to_fd((fp))->uctxt) > > > +#define subctxt_fp(fp) (fp_to_fd((fp))->subctxt) > > > +#define tidcursor_fp(fp) (fp_to_fd((fp))->tidcursor) > > > +#define user_sdma_pkt_fp(fp) (fp_to_fd((fp))->pq) > > > +#define user_sdma_comp_fp(fp) (fp_to_fd((fp))->cq) > > > +#define notifier_fp(fp) (fp_to_fd((fp))->mn) > > > +#define rb_fp(fp) (fp_to_fd((fp))->tid_rb_root) > > > > Ick, no, don't do this, just spell it all out (odds are you will see tht > > you can make the code simpler...) If you don't know what "cq" or "pq" > > are, then name them properly. > > > > These need to be all removed. > > Ok. > > Can I add the removal of these macros to the TODO list and get this patch > accepted in the interm? Nope, sorry, why would I accept a known-problem patch? Would you do such a thing? > Many of the patches I am queueing up to submit as well as one in this series > do > not apply cleanly without this change. 
> It will be much easier if I can get > everything applied and then do a global clean up of these macros after the > fact. But you would have no incentive to do that if I take this patch now :) thanks, greg k-h
Re: [PATCH v3 23/23] staging/rdma/hfi1: Update driver version string to 0.9-294
On Tue, Oct 27, 2015 at 05:00:22PM -0400, ira.weiny wrote: > On Tue, Oct 27, 2015 at 05:46:41PM +0900, Greg KH wrote: > > On Mon, Oct 26, 2015 at 10:28:49AM -0400, ira.we...@intel.com wrote: > > > From: Jubin John > > > > > > Signed-off-by: Jubin John > > > Signed-off-by: Ira Weiny > > > --- > > > drivers/staging/rdma/hfi1/common.h | 2 +- > > > 1 file changed, 1 insertion(+), 1 deletion(-) > > > > > > diff --git a/drivers/staging/rdma/hfi1/common.h > > > b/drivers/staging/rdma/hfi1/common.h > > > index 7809093eb55e..5dd92720faae 100644 > > > --- a/drivers/staging/rdma/hfi1/common.h > > > +++ b/drivers/staging/rdma/hfi1/common.h > > > @@ -205,7 +205,7 @@ > > > * to the driver itself, not the software interfaces it supports. > > > */ > > > #ifndef HFI1_DRIVER_VERSION_BASE > > > -#define HFI1_DRIVER_VERSION_BASE "0.9-248" > > > +#define HFI1_DRIVER_VERSION_BASE "0.9-294" > > > > Patches like this make no sense at all, please drop it and only use the > > kernel version. > > What do you mean by "only use the kernel version"? Do you mean > > #define HFI1_DRIVER_VERSION_BASE UTS_RELEASE > > Or just remove the macro entirely? Remove it entirely, it's pointless and makes no sense for in-kernel code. > > Trust me, it's going to get messy really fast (hint, it > > already did...) > > Did I base this on the wrong tree? Not sure how this could have messed you > up. Nope, the patch applied just fine, but think about it, I didn't take all of the patches you sent me, so what exactly does that version number now represent? Hint, absolutely nothing, or even worse, something completely wrong :) thanks, greg k-h -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 0/7] Fix an infinite loop in the SRP initiator
Submitting a SCSI request through the SG_IO mechanism with a scatterlist that is longer than what is supported by the SRP initiator triggers an infinite loop. This patch series fixes that behavior. The individual patches in this series are as follows:
0001-IB-srp-Fix-a-spelling-error.patch
0002-IB-srp-Document-srp_map_data-return-value.patch
0003-IB-srp-Rename-work-request-ID-labels.patch
0004-IB-srp-Fix-a-potential-queue-overflow-in-an-error-pa.patch
0005-IB-srp-Fix-srp_map_data-error-paths.patch
0006-IB-srp-Introduce-target-mr_pool_size.patch
0007-IB-srp-Avoid-that-mapping-failure-triggers-an-infini.patch
[PATCH 2/7] IB/srp: Document srp_map_data() return value
Signed-off-by: Bart Van Assche Cc: Sagi Grimberg Cc: Sebastian Parschauer --- drivers/infiniband/ulp/srp/ib_srp.c | 9 + 1 file changed, 9 insertions(+) diff --git a/drivers/infiniband/ulp/srp/ib_srp.c b/drivers/infiniband/ulp/srp/ib_srp.c index 1c94d93..c1faf70 100644 --- a/drivers/infiniband/ulp/srp/ib_srp.c +++ b/drivers/infiniband/ulp/srp/ib_srp.c @@ -1497,6 +1497,15 @@ out: return ret; } +/** + * srp_map_data() - map SCSI data buffer onto an SRP request + * @scmnd: SCSI command to map + * @ch: SRP RDMA channel + * @req: SRP request + * + * Returns the length in bytes of the SRP_CMD IU or a negative value if + * mapping failed. + */ static int srp_map_data(struct scsi_cmnd *scmnd, struct srp_rdma_ch *ch, struct srp_request *req) { -- 2.1.4 -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 1/7] IB/srp: Fix a spelling error
Signed-off-by: Bart Van Assche Cc: Sagi Grimberg Cc: Sebastian Parschauer --- drivers/infiniband/ulp/srp/ib_srp.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/drivers/infiniband/ulp/srp/ib_srp.c b/drivers/infiniband/ulp/srp/ib_srp.c index d01395b..1c94d93 100644 --- a/drivers/infiniband/ulp/srp/ib_srp.c +++ b/drivers/infiniband/ulp/srp/ib_srp.c @@ -1407,7 +1407,7 @@ static int srp_map_sg_entry(struct srp_map_state *state, /* * If the last entry of the MR wasn't a full page, then we need to * close it out and start a new one -- we can only merge at page -* boundries. +* boundaries. */ ret = 0; if (len != dev->mr_page_size) -- 2.1.4 -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 7/7] IB/srp: Avoid that mapping failure triggers an infinite loop
Signed-off-by: Bart Van Assche Cc: Sagi Grimberg Cc: Sebastian Parschauer --- drivers/infiniband/ulp/srp/ib_srp.c | 2 ++ 1 file changed, 2 insertions(+) diff --git a/drivers/infiniband/ulp/srp/ib_srp.c b/drivers/infiniband/ulp/srp/ib_srp.c index 47c3a72..59d3ff9 100644 --- a/drivers/infiniband/ulp/srp/ib_srp.c +++ b/drivers/infiniband/ulp/srp/ib_srp.c @@ -1666,6 +1666,8 @@ map_complete: unmap: srp_unmap_data(scmnd, ch, req, true); + if (ret == -ENOMEM && req->nmdesc >= target->mr_pool_size) + ret = -E2BIG; return ret; } -- 2.1.4
[PATCH 6/7] IB/srp: Introduce target->mr_pool_size
This patch does not change any functionality. Signed-off-by: Bart Van Assche Cc: Sagi Grimberg Cc: Sebastian Parschauer --- drivers/infiniband/ulp/srp/ib_srp.c | 6 +++--- drivers/infiniband/ulp/srp/ib_srp.h | 1 + 2 files changed, 4 insertions(+), 3 deletions(-) diff --git a/drivers/infiniband/ulp/srp/ib_srp.c b/drivers/infiniband/ulp/srp/ib_srp.c index fb6b654..47c3a72 100644 --- a/drivers/infiniband/ulp/srp/ib_srp.c +++ b/drivers/infiniband/ulp/srp/ib_srp.c @@ -315,7 +315,7 @@ static struct ib_fmr_pool *srp_alloc_fmr_pool(struct srp_target_port *target) struct ib_fmr_pool_param fmr_param; memset(&fmr_param, 0, sizeof(fmr_param)); - fmr_param.pool_size = target->scsi_host->can_queue; + fmr_param.pool_size = target->mr_pool_size; fmr_param.dirty_watermark = fmr_param.pool_size / 4; fmr_param.cache = 1; fmr_param.max_pages_per_fmr = dev->max_pages_per_mr; @@ -449,8 +449,7 @@ static struct srp_fr_pool *srp_alloc_fr_pool(struct srp_target_port *target) { struct srp_device *dev = target->srp_host->srp_dev; - return srp_create_fr_pool(dev->dev, dev->pd, - target->scsi_host->can_queue, + return srp_create_fr_pool(dev->dev, dev->pd, target->mr_pool_size, dev->max_pages_per_mr); } @@ -3247,6 +3246,7 @@ static ssize_t srp_create_target(struct device *dev, } target_host->sg_tablesize = target->sg_tablesize; + target->mr_pool_size = target->scsi_host->can_queue; target->indirect_size = target->sg_tablesize * sizeof (struct srp_direct_buf); target->max_iu_len = sizeof (struct srp_cmd) + diff --git a/drivers/infiniband/ulp/srp/ib_srp.h b/drivers/infiniband/ulp/srp/ib_srp.h index 1c6a715..af084f7 100644 --- a/drivers/infiniband/ulp/srp/ib_srp.h +++ b/drivers/infiniband/ulp/srp/ib_srp.h @@ -205,6 +205,7 @@ struct srp_target_port { chartarget_name[32]; unsigned intscsi_id; unsigned intsg_tablesize; + int mr_pool_size; int queue_size; int req_ring_size; int comp_vector; -- 2.1.4 -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to 
majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 5/7] IB/srp: Fix srp_map_data() error paths
Ensure that req->nmdesc is set correctly in srp_map_sg() if mapping fails. Avoid that mapping failure causes a memory descriptor leak. Report srp_map_sg() failure to the caller. Signed-off-by: Bart Van Assche Cc: Sagi Grimberg Cc: Sebastian Parschauer --- drivers/infiniband/ulp/srp/ib_srp.c | 15 +++ 1 file changed, 11 insertions(+), 4 deletions(-) diff --git a/drivers/infiniband/ulp/srp/ib_srp.c b/drivers/infiniband/ulp/srp/ib_srp.c index 6d17fe2..fb6b654 100644 --- a/drivers/infiniband/ulp/srp/ib_srp.c +++ b/drivers/infiniband/ulp/srp/ib_srp.c @@ -1473,7 +1473,6 @@ static int srp_map_sg(struct srp_map_state *state, struct srp_rdma_ch *ch, } } - req->nmdesc = state->nmdesc; ret = 0; out: @@ -1594,7 +1593,10 @@ static int srp_map_data(struct scsi_cmnd *scmnd, struct srp_rdma_ch *ch, target->indirect_size, DMA_TO_DEVICE); memset(&state, 0, sizeof(state)); - srp_map_sg(&state, ch, req, scat, count); + ret = srp_map_sg(&state, ch, req, scat, count); + req->nmdesc = state.nmdesc; + if (ret < 0) + goto unmap; /* We've mapped the request, now pull as much of the indirect * descriptor table as we can into the command buffer. 
If this @@ -1617,7 +1619,8 @@ static int srp_map_data(struct scsi_cmnd *scmnd, struct srp_rdma_ch *ch, !target->allow_ext_sg)) { shost_printk(KERN_ERR, target->scsi_host, "Could not fit S/G list into SRP_CMD\n"); - return -EIO; + ret = -EIO; + goto unmap; } count = min(state.ndesc, target->cmd_sg_cnt); @@ -1635,7 +1638,7 @@ static int srp_map_data(struct scsi_cmnd *scmnd, struct srp_rdma_ch *ch, ret = srp_map_idb(ch, req, state.gen.next, state.gen.end, idb_len, &idb_rkey); if (ret < 0) - return ret; + goto unmap; req->nmdesc++; } else { idb_rkey = target->global_mr->rkey; @@ -1661,6 +1664,10 @@ map_complete: cmd->buf_fmt = fmt; return len; + +unmap: + srp_unmap_data(scmnd, ch, req, true); + return ret; } /* -- 2.1.4 -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 3/7] IB/srp: Rename work request ID labels
Work request IDs in the SRP initiator driver are either a pointer or a value that is not a valid pointer. Since the local invalidate and fast registration work requests IDs are not used as masks drop the suffix "mask" from their name. Signed-off-by: Bart Van Assche Cc: Sagi Grimberg Cc: Sebastian Parschauer --- drivers/infiniband/ulp/srp/ib_srp.c | 8 drivers/infiniband/ulp/srp/ib_srp.h | 7 +++ 2 files changed, 7 insertions(+), 8 deletions(-) diff --git a/drivers/infiniband/ulp/srp/ib_srp.c b/drivers/infiniband/ulp/srp/ib_srp.c index c1faf70..1aa9a4c 100644 --- a/drivers/infiniband/ulp/srp/ib_srp.c +++ b/drivers/infiniband/ulp/srp/ib_srp.c @@ -1049,7 +1049,7 @@ static int srp_inv_rkey(struct srp_rdma_ch *ch, u32 rkey) struct ib_send_wr *bad_wr; struct ib_send_wr wr = { .opcode = IB_WR_LOCAL_INV, - .wr_id = LOCAL_INV_WR_ID_MASK, + .wr_id = LOCAL_INV_WR_ID, .next = NULL, .num_sge= 0, .send_flags = 0, @@ -1325,7 +1325,7 @@ static int srp_map_finish_fr(struct srp_map_state *state, memset(&wr, 0, sizeof(wr)); wr.opcode = IB_WR_FAST_REG_MR; - wr.wr_id = FAST_REG_WR_ID_MASK; + wr.wr_id = FAST_REG_WR_ID; wr.wr.fast_reg.iova_start = state->base_dma_addr; wr.wr.fast_reg.page_list = desc->frpl; wr.wr.fast_reg.page_list_len = state->npages; @@ -1940,11 +1940,11 @@ static void srp_handle_qp_err(u64 wr_id, enum ib_wc_status wc_status, } if (ch->connected && !target->qp_in_error) { - if (wr_id & LOCAL_INV_WR_ID_MASK) { + if (wr_id == LOCAL_INV_WR_ID) { shost_printk(KERN_ERR, target->scsi_host, PFX "LOCAL_INV failed with status %s (%d)\n", ib_wc_status_msg(wc_status), wc_status); - } else if (wr_id & FAST_REG_WR_ID_MASK) { + } else if (wr_id == FAST_REG_WR_ID) { shost_printk(KERN_ERR, target->scsi_host, PFX "FAST_REG_MR failed status %s (%d)\n", ib_wc_status_msg(wc_status), wc_status); diff --git a/drivers/infiniband/ulp/srp/ib_srp.h b/drivers/infiniband/ulp/srp/ib_srp.h index 3608f2e..1c6a715 100644 --- a/drivers/infiniband/ulp/srp/ib_srp.h +++ 
b/drivers/infiniband/ulp/srp/ib_srp.h @@ -67,10 +67,9 @@ enum { SRP_MAX_PAGES_PER_MR= 512, - LOCAL_INV_WR_ID_MASK= 1, - FAST_REG_WR_ID_MASK = 2, - - SRP_LAST_WR_ID = 0xfffcU, + LOCAL_INV_WR_ID = 1, + FAST_REG_WR_ID = 2, + SRP_LAST_WR_ID = 3, }; enum srp_target_state { -- 2.1.4 -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 4/7] IB/srp: Fix a potential queue overflow in an error path
Wait until memory registration has finished in the srp_queuecommand() error path before invalidating memory regions to avoid a send queue overflow. Signed-off-by: Bart Van Assche Cc: Sagi Grimberg Cc: Sebastian Parschauer --- drivers/infiniband/ulp/srp/ib_srp.c | 41 ++--- 1 file changed, 34 insertions(+), 7 deletions(-) diff --git a/drivers/infiniband/ulp/srp/ib_srp.c b/drivers/infiniband/ulp/srp/ib_srp.c index 1aa9a4c..6d17fe2 100644 --- a/drivers/infiniband/ulp/srp/ib_srp.c +++ b/drivers/infiniband/ulp/srp/ib_srp.c @@ -1044,7 +1044,7 @@ static int srp_connect_ch(struct srp_rdma_ch *ch, bool multich) } } -static int srp_inv_rkey(struct srp_rdma_ch *ch, u32 rkey) +static int srp_inv_rkey(struct srp_rdma_ch *ch, u32 rkey, u32 send_flags) { struct ib_send_wr *bad_wr; struct ib_send_wr wr = { @@ -1052,16 +1052,32 @@ static int srp_inv_rkey(struct srp_rdma_ch *ch, u32 rkey) .wr_id = LOCAL_INV_WR_ID, .next = NULL, .num_sge= 0, - .send_flags = 0, + .send_flags = send_flags, .ex.invalidate_rkey = rkey, }; return ib_post_send(ch->qp, &wr, &bad_wr); } +static bool srp_wait_until_done(struct srp_rdma_ch *ch, int i, long timeout) +{ + WARN_ON_ONCE(timeout <= 0); + + for ( ; i > 0; i--) { + spin_lock_irq(&ch->lock); + srp_send_completion(ch->send_cq, ch); + spin_unlock_irq(&ch->lock); + + if (wait_for_completion_timeout(&ch->done, timeout) > 0) + return true; + } + return false; +} + static void srp_unmap_data(struct scsi_cmnd *scmnd, struct srp_rdma_ch *ch, - struct srp_request *req) + struct srp_request *req, + bool wait_for_first_unmap) { struct srp_target_port *target = ch->target; struct srp_device *dev = target->srp_host->srp_dev; @@ -1077,13 +1093,19 @@ static void srp_unmap_data(struct scsi_cmnd *scmnd, struct srp_fr_desc **pfr; for (i = req->nmdesc, pfr = req->fr_list; i > 0; i--, pfr++) { - res = srp_inv_rkey(ch, (*pfr)->mr->rkey); + res = srp_inv_rkey(ch, (*pfr)->mr->rkey, + wait_for_first_unmap ? 
+ IB_SEND_SIGNALED : 0); if (res < 0) { shost_printk(KERN_ERR, target->scsi_host, PFX "Queueing INV WR for rkey %#x failed (%d)\n", (*pfr)->mr->rkey, res); queue_work(system_long_wq, &target->tl_err_work); + } else if (wait_for_first_unmap) { + wait_for_first_unmap = false; + WARN_ON_ONCE(!srp_wait_until_done(ch, 10, + msecs_to_jiffies(100))); } } if (req->nmdesc) @@ -1144,7 +1166,7 @@ static void srp_free_req(struct srp_rdma_ch *ch, struct srp_request *req, { unsigned long flags; - srp_unmap_data(scmnd, ch, req); + srp_unmap_data(scmnd, ch, req, false); spin_lock_irqsave(&ch->lock, flags); ch->req_lim += req_lim_delta; @@ -1982,7 +2004,12 @@ static void srp_send_completion(struct ib_cq *cq, void *ch_ptr) struct srp_iu *iu; while (ib_poll_cq(cq, 1, &wc) > 0) { - if (likely(wc.status == IB_WC_SUCCESS)) { + if (unlikely(wc.wr_id == LOCAL_INV_WR_ID)) { + complete(&ch->done); + if (wc.status != IB_WC_SUCCESS) + srp_handle_qp_err(wc.wr_id, wc.status, true, + ch); + } else if (likely(wc.status == IB_WC_SUCCESS)) { iu = (struct srp_iu *) (uintptr_t) wc.wr_id; list_add(&iu->list, &ch->free_tx); } else { @@ -2084,7 +2111,7 @@ unlock_rport: return ret; err_unmap: - srp_unmap_data(scmnd, ch, req); + srp_unmap_data(scmnd, ch, req, true); err_iu: srp_put_tx_iu(ch, iu, SRP_IU_CMD); -- 2.1.4 -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 2/2] IB/mlx5: Publish mlx5 driver support for extended create QP
Signed-off-by: Eli Cohen --- drivers/infiniband/hw/mlx5/main.c |3 ++- 1 files changed, 2 insertions(+), 1 deletions(-) diff --git a/drivers/infiniband/hw/mlx5/main.c b/drivers/infiniband/hw/mlx5/main.c index f1ccd40..634de84 100644 --- a/drivers/infiniband/hw/mlx5/main.c +++ b/drivers/infiniband/hw/mlx5/main.c @@ -1385,7 +1385,8 @@ static void *mlx5_ib_add(struct mlx5_core_dev *mdev) (1ull << IB_USER_VERBS_CMD_CREATE_XSRQ) | (1ull << IB_USER_VERBS_CMD_OPEN_QP); dev->ib_dev.uverbs_ex_cmd_mask = - (1ull << IB_USER_VERBS_EX_CMD_QUERY_DEVICE); + (1ull << IB_USER_VERBS_EX_CMD_QUERY_DEVICE) | + (1ull << IB_USER_VERBS_EX_CMD_CREATE_QP); dev->ib_dev.query_device= mlx5_ib_query_device; dev->ib_dev.query_port = mlx5_ib_query_port; -- 1.7.1 -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 0/2] Loopback prevention support
Hi Doug, This two-patch series adds support for loopback prevention in mlx5_ib for userspace consumers. Eli *** Resending since I may have had some problem with my subscription so the *** patchset did not make it to the rdma list Eli Cohen (2): IB/mlx5: Add debug print to signify if block multicast is used IB/mlx5: Publish mlx5 driver support for extended create QP drivers/infiniband/hw/mlx5/main.c |3 ++- drivers/infiniband/hw/mlx5/qp.c |4 2 files changed, 6 insertions(+), 1 deletions(-)
[PATCH 1/2] IB/mlx5: Add debug print to signify if block multicast is used
Signed-off-by: Eli Cohen
---
 drivers/infiniband/hw/mlx5/qp.c |    4 ++++
 1 files changed, 4 insertions(+), 0 deletions(-)

diff --git a/drivers/infiniband/hw/mlx5/qp.c b/drivers/infiniband/hw/mlx5/qp.c
index 6f521a3..b80b2bd 100644
--- a/drivers/infiniband/hw/mlx5/qp.c
+++ b/drivers/infiniband/hw/mlx5/qp.c
@@ -1033,6 +1033,10 @@ static int create_qp_common(struct mlx5_ib_dev *dev, struct ib_pd *pd,
 
 	qp->mqp.event = mlx5_ib_qp_event;
 
+	/* QP related debug prints go here */
+	if (qp->flags & MLX5_IB_QP_BLOCK_MULTICAST_LOOPBACK)
+		mlx5_ib_dbg(dev, "QP 0x%x will block multicast\n", qp->mqp.qpn);
+
 	return 0;
 
 err_create:
--
1.7.1
[PATCH 02/25] IB/mthca, net/mlx4: remove counting semaphores
The mthca and mlx4 device drivers use the same method to switch between
polling and event-driven command mode, abusing two semaphores to create a
mutual exclusion between one polled command or multiple concurrent
event-driven commands.

Since we want to make counting semaphores go away, this patch replaces the
semaphore counting the event-driven commands with an open-coded wait queue,
which should be an equivalent transformation of the code, although it does
not make it any nicer.

As far as I can tell, there is a preexisting race condition regarding the
cmd->use_events flag, which is not protected by any lock. When this flag is
toggled while another command is being started, that command gets stuck
until the mode is toggled back.

A better solution that would solve the race condition and at the same time
improve the code readability would create a new locking primitive that
replaces both semaphores, like

static int __mlx4_use_events(struct mlx4_cmd *cmd)
{
	int ret = -EAGAIN;

	spin_lock(&cmd->lock);
	if (cmd->use_events && cmd->commands < cmd->max_commands) {
		cmd->commands++;
		ret = 1;
	} else if (!cmd->use_events && cmd->commands == 0) {
		cmd->commands = 1;
		ret = 0;
	}
	spin_unlock(&cmd->lock);

	return ret;
}

static bool mlx4_use_events(struct mlx4_cmd *cmd)
{
	int ret;

	wait_event(cmd->events_wq, (ret = __mlx4_use_events(cmd)) >= 0);

	return ret;
}

Cc: Roland Dreier
Cc: Eli Cohen
Cc: Yevgeny Petrilin
Cc: net...@vger.kernel.org
Cc: linux-rdma@vger.kernel.org
Signed-off-by: Arnd Bergmann

Conflicts:
	drivers/net/mlx4/cmd.c
	drivers/net/mlx4/mlx4.h
---
 drivers/infiniband/hw/mthca/mthca_cmd.c   | 12 ++++++++----
 drivers/infiniband/hw/mthca/mthca_dev.h   |  3 ++-
 drivers/net/ethernet/mellanox/mlx4/cmd.c  | 12 ++++++++----
 drivers/net/ethernet/mellanox/mlx4/mlx4.h |  3 ++-
 4 files changed, 20 insertions(+), 10 deletions(-)

diff --git a/drivers/infiniband/hw/mthca/mthca_cmd.c b/drivers/infiniband/hw/mthca/mthca_cmd.c
index 9d3e5c1ac60e..aad1852e8e10 100644
--- a/drivers/infiniband/hw/mthca/mthca_cmd.c
+++ b/drivers/infiniband/hw/mthca/mthca_cmd.c
@@ -417,7 +417,8 @@ static int mthca_cmd_wait(struct mthca_dev *dev,
 	int err = 0;
 	struct mthca_cmd_context *context;
 
-	down(&dev->cmd.event_sem);
+	wait_event(dev->cmd.event_wait,
+		   atomic_add_unless(&dev->cmd.commands, -1, 0));
 
 	spin_lock(&dev->cmd.context_lock);
 	BUG_ON(dev->cmd.free_head < 0);
@@ -459,7 +460,8 @@ out:
 	dev->cmd.free_head = context - dev->cmd.context;
 	spin_unlock(&dev->cmd.context_lock);
 
-	up(&dev->cmd.event_sem);
+	atomic_inc(&dev->cmd.commands);
+	wake_up(&dev->cmd.event_wait);
 
 	return err;
 }
@@ -571,7 +573,8 @@ int mthca_cmd_use_events(struct mthca_dev *dev)
 	dev->cmd.context[dev->cmd.max_cmds - 1].next = -1;
 	dev->cmd.free_head = 0;
 
-	sema_init(&dev->cmd.event_sem, dev->cmd.max_cmds);
+	init_waitqueue_head(&dev->cmd.event_wait);
+	atomic_set(&dev->cmd.commands, dev->cmd.max_cmds);
 	spin_lock_init(&dev->cmd.context_lock);
 
 	for (dev->cmd.token_mask = 1;
@@ -597,7 +600,8 @@ void mthca_cmd_use_polling(struct mthca_dev *dev)
 	dev->cmd.flags &= ~MTHCA_CMD_USE_EVENTS;
 
 	for (i = 0; i < dev->cmd.max_cmds; ++i)
-		down(&dev->cmd.event_sem);
+		wait_event(dev->cmd.event_wait,
+			   atomic_add_unless(&dev->cmd.commands, -1, 0));
 
 	kfree(dev->cmd.context);

diff --git a/drivers/infiniband/hw/mthca/mthca_dev.h b/drivers/infiniband/hw/mthca/mthca_dev.h
index 7e6a6d64ad4e..3055f5c12ac8 100644
--- a/drivers/infiniband/hw/mthca/mthca_dev.h
+++ b/drivers/infiniband/hw/mthca/mthca_dev.h
@@ -121,7 +121,8 @@ struct mthca_cmd {
 	struct pci_pool *pool;
 	struct mutex      hcr_mutex;
 	struct semaphore  poll_sem;
-	struct semaphore  event_sem;
+	wait_queue_head_t event_wait;
+	atomic_t          commands;
 	int               max_cmds;
 	spinlock_t        context_lock;
 	int               free_head;

diff --git a/drivers/net/ethernet/mellanox/mlx4/cmd.c b/drivers/net/ethernet/mellanox/mlx4/cmd.c
index 78f5a1a0b8c8..60134a4245ef 100644
--- a/drivers/net/ethernet/mellanox/mlx4/cmd.c
+++ b/drivers/net/ethernet/mellanox/mlx4/cmd.c
@@ -273,7 +273,8 @@ static int mlx4_cmd_wait(struct mlx4_dev *dev, u64 in_param, u64 *out_param,
 	struct mlx4_cmd_context *context;
 	int err = 0;
 
-	down(&cmd->event_sem);
+	wait_event(cmd->event_wait,
+		   atomic_add_unless(&cmd->commands, -1, 0));
 
 	spin_lock(&cmd->context_lock);
 	BUG_ON(cmd->free_head < 0);
@@ -305,7 +306,8 @@ out:
 	cmd->free_head = context - cmd->context;
RE: [PATCH] IB/sa: replace GFP_KERNEL with GFP_ATOMIC
> > On Tue, Oct 27, 2015 at 06:56:50PM +, Wan, Kaike wrote:
> > > > I do wonder if it is a good idea to call ib_nl_send_msg with a
> > > > spinlock held though.. Would be nice to see that go away.
> > >
> > > We have to hold the lock to protect against a race condition that a
> > > quick response will try to free the request from the
> > > ib_nl_request_list before we even put it on the list.
> >
> > Put it on the list first? Use a kref? Doesn't look like a big deal to
> > clean this up.
> >
> > Jason

Not sure what "Put it on the list first?" means. I think it is valid to
build the request, and if that succeeds, add it to the list and then send
it. That would solve the problem you mention above. Was that what you had
in mind, Jason?

I don't have time to work on this right now; not sure about Kaike. Until
we can remove the spinlock, the current proposed patch should be applied
in the interim.

Sorry for the noise before.

Reviewed-By: Ira Weiny