Re: [PATCH v3 12/23] staging/rdma/hfi1: Macro code clean up

2015-10-27 Thread Greg KH
On Mon, Oct 26, 2015 at 10:28:38AM -0400, ira.we...@intel.com wrote:
> From: Mitko Haralanov 
> 
> Clean up the context and sdma macros and move them to a more logical place in
> hfi.h
> 
> Signed-off-by: Mitko Haralanov 
> Signed-off-by: Ira Weiny 
> ---
>  drivers/staging/rdma/hfi1/hfi.h | 22 ++
>  1 file changed, 10 insertions(+), 12 deletions(-)
> 
> diff --git a/drivers/staging/rdma/hfi1/hfi.h b/drivers/staging/rdma/hfi1/hfi.h
> index a35213e9b500..41ad9a30149b 100644
> --- a/drivers/staging/rdma/hfi1/hfi.h
> +++ b/drivers/staging/rdma/hfi1/hfi.h
> @@ -1104,6 +1104,16 @@ struct hfi1_filedata {
>   int rec_cpu_num;
>  };
>  
> +/* for use in system calls, where we want to know device type, etc. */
> +#define fp_to_fd(fp) ((struct hfi1_filedata *)(fp)->private_data)
> +#define ctxt_fp(fp) (fp_to_fd((fp))->uctxt)
> +#define subctxt_fp(fp) (fp_to_fd((fp))->subctxt)
> +#define tidcursor_fp(fp) (fp_to_fd((fp))->tidcursor)
> +#define user_sdma_pkt_fp(fp) (fp_to_fd((fp))->pq)
> +#define user_sdma_comp_fp(fp) (fp_to_fd((fp))->cq)
> +#define notifier_fp(fp) (fp_to_fd((fp))->mn)
> +#define rb_fp(fp) (fp_to_fd((fp))->tid_rb_root)

Ick, no, don't do this, just spell it all out (odds are you will see that
you can make the code simpler...)  If you don't know what "cq" or "pq"
are, then name them properly.

These all need to be removed.

thanks,

greg k-h
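To illustrate what "spelling it all out" means in practice, a minimal sketch of
a call site follows (the function name is hypothetical and hfi.h, which defines
struct hfi1_filedata, is assumed to be included):

#include <linux/fs.h>

static int hfi1_example_ioctl(struct file *fp)
{
	struct hfi1_filedata *fd = fp->private_data;

	/* was: ctxt_fp(fp) -- the cast and the field access are now
	 * visible at the call site instead of hidden behind a macro */
	if (!fd || !fd->uctxt)
		return -EINVAL;

	/* ... operate on fd->uctxt and fd->subctxt directly ... */
	return 0;
}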


Re: [PATCH v3 14/23] staging/rdma/hfi1: Implement Expected Receive TID caching

2015-10-27 Thread Greg KH
On Mon, Oct 26, 2015 at 10:28:40AM -0400, ira.we...@intel.com wrote:
> From: Mitko Haralanov 
> 
> Expected receives work by user-space libraries (PSM) calling into the
> driver with information about the user's receive buffer and having the driver
> DMA-map that buffer and program the HFI to receive data directly into it.
> 
> This is an expensive operation as it requires the driver to pin the pages
> which the user's buffer maps to, DMA-map them, and then program the HFI.
> 
> When the receive is complete, user-space libraries have to call into the
> driver again so the buffer is removed from the HFI, un-mapped, and the
> pages unpinned.
> 
> All of these operations are expensive, considering that a lot of applications
> (especially micro-benchmarks) use the same buffer over and over.
> 
> In order to get better performance for user-space applications, it is highly
> beneficial that they don't continuously call into the driver to register and
> unregister the same buffer. Rather, they can register the buffer and cache it
> for future work. The buffer can be unregistered when it is freed by the user.
> 
> This change implements such buffer caching by making use of the kernel's MMU
> notifier API. User-space libraries call into the driver only when they need to
> register a new buffer.
> 
> Once a buffer is registered, it stays programmed into the HFI until the kernel
> notifies the driver that the buffer has been freed by the user. At that time,
> the user-space library is notified and it can do the necessary work to remove
> the buffer from its cache.
> 
> Buffers which have been invalidated by the kernel are not automatically removed
> from the HFI and do not have their pages unpinned. Buffers are only completely
> removed when the user-space libraries call into the driver to free them.  This
> is done to ensure that any ongoing transfers into that buffer are complete.
> This is important when a buffer is not completely freed but rather it is
> shrunk. The user-space library could still have uncompleted transfers into the
> remaining buffer.
> 
> With this feature, it is important that systems are set up with reasonable
> limits for the amount of lockable memory.  Keeping the limit at "unlimited"
> (as we've done up to this point) may result in jobs being killed by the
> kernel's OOM killer due to them taking up excessive amounts of memory.
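For context, the caching described above follows the usual kernel MMU-notifier
pattern, roughly sketched below (names are hypothetical and the callback
signature shown matches kernels of this era; it has changed in later kernels):

#include <linux/mmu_notifier.h>
#include <linux/sched.h>

struct tid_cache {
	struct mmu_notifier mn;
	/* ... tree of pinned/programmed user buffers ... */
};

static void tid_cache_invalidate_start(struct mmu_notifier *mn,
				       struct mm_struct *mm,
				       unsigned long start, unsigned long end)
{
	/* mark cached buffers overlapping [start, end) as invalid so the
	 * user-space library is told to evict them on its next call into
	 * the driver */
}

static const struct mmu_notifier_ops tid_cache_ops = {
	.invalidate_range_start = tid_cache_invalidate_start,
};

static int tid_cache_register(struct tid_cache *cache)
{
	cache->mn.ops = &tid_cache_ops;
	/* register against the calling process's address space */
	return mmu_notifier_register(&cache->mn, current->mm);
}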
> 
> Reviewed-by: Arthur Kepner 
> Reviewed-by: Dennis Dalessandro 
> Signed-off-by: Mitko Haralanov 
> Signed-off-by: Ira Weiny 
> 
> ---
> Changes from V2:
>   Fix random Kconfig 0-day build error
>   Fix leak of random memory to user space caught by Dan Carpenter
>   Separate out pointer bug fix into a previous patch
>   Change error checks in case statement per Dan's comments
> 
>  drivers/staging/rdma/hfi1/Kconfig|1 +
>  drivers/staging/rdma/hfi1/Makefile   |2 +-
>  drivers/staging/rdma/hfi1/common.h   |   15 +-
>  drivers/staging/rdma/hfi1/file_ops.c |  490 ++---
>  drivers/staging/rdma/hfi1/hfi.h  |   43 +-
>  drivers/staging/rdma/hfi1/init.c |5 +-
>  drivers/staging/rdma/hfi1/trace.h|  132 ++--
>  drivers/staging/rdma/hfi1/user_exp_rcv.c | 1171 ++
>  drivers/staging/rdma/hfi1/user_exp_rcv.h |   82 +++
>  drivers/staging/rdma/hfi1/user_pages.c   |  110 +--
>  drivers/staging/rdma/hfi1/user_sdma.c|   13 +
>  drivers/staging/rdma/hfi1/user_sdma.h|   10 +-
>  include/uapi/rdma/hfi/hfi1_user.h|   42 +-
>  13 files changed, 1481 insertions(+), 635 deletions(-)
>  create mode 100644 drivers/staging/rdma/hfi1/user_exp_rcv.c
>  create mode 100644 drivers/staging/rdma/hfi1/user_exp_rcv.h

This is way too big to review properly, please break it up into
reviewable chunks.

thanks,

greg k-h


Re: [PATCH v3 19/23] staging/rdma/hfi: modify workqueue for parallelism

2015-10-27 Thread Greg KH
On Mon, Oct 26, 2015 at 10:28:45AM -0400, ira.we...@intel.com wrote:
> From: Mike Marciniszyn 
> 
> The workqueue is currently single threaded per port, which is OK for a small
> number of SDMA engines.
> 
> For hfi1, there are up to 16 SDMA engines that can be fed descriptors in
> parallel.
> 
> This patch:
> - Converts to use alloc_workqueue
> - Changes the workqueue limit from 1 to num_sdma
> - Makes the queue WQ_CPU_INTENSIVE and WQ_HIGHPRI
> - The sdma_engine now has a cpu that is initialized
>   as the MSI-X vectors are set up
> - Adjusts the post send logic to call a new scheduler
>   that doesn't get the s_lock
> - The new and old workqueue schedule now pass a
>   cpu
> - post send now uses the new scheduler
> - RC/UC QPs now pre-compute the sc, sde
> - The sde wq is eliminated since the new hfi1_wq is
>   multi-threaded
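As a rough sketch of what the first two bullets amount to (names are
illustrative, not the patch itself):

#include <linux/workqueue.h>

static struct workqueue_struct *hfi1_alloc_port_wq(int port, int num_sdma)
{
	/*
	 * max_active = num_sdma: up to one work item per SDMA engine may
	 * run concurrently, instead of the old single-threaded-per-port
	 * workqueue.
	 */
	return alloc_workqueue("hfi1_wq%d",
			       WQ_CPU_INTENSIVE | WQ_HIGHPRI,
			       num_sdma, port);
}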

When you have to start enumerating all of the different things that your
patch does, that's a huge hint that you need to break it up into smaller
pieces.

Please break this up, it's not acceptable as-is.

thanks,

greg k-h


Re: [PATCH v3 23/23] staging/rdma/hfi1: Update driver version string to 0.9-294

2015-10-27 Thread Greg KH
On Mon, Oct 26, 2015 at 10:28:49AM -0400, ira.we...@intel.com wrote:
> From: Jubin John 
> 
> Signed-off-by: Jubin John 
> Signed-off-by: Ira Weiny 
> ---
>  drivers/staging/rdma/hfi1/common.h | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/drivers/staging/rdma/hfi1/common.h 
> b/drivers/staging/rdma/hfi1/common.h
> index 7809093eb55e..5dd92720faae 100644
> --- a/drivers/staging/rdma/hfi1/common.h
> +++ b/drivers/staging/rdma/hfi1/common.h
> @@ -205,7 +205,7 @@
>   * to the driver itself, not the software interfaces it supports.
>   */
>  #ifndef HFI1_DRIVER_VERSION_BASE
> -#define HFI1_DRIVER_VERSION_BASE "0.9-248"
> +#define HFI1_DRIVER_VERSION_BASE "0.9-294"

Patches like this make no sense at all, please drop it and only use the
kernel version.  Trust me, it's going to get messy really fast (hint, it
already did...)

greg k-h


Re: [PATCH 1/4 v2] staging: ipath: ipath_driver: Use setup_timer

2015-10-27 Thread Dan Carpenter
On Sun, Oct 25, 2015 at 01:21:11PM +0200, Leon Romanovsky wrote:
> On Sun, Oct 25, 2015 at 12:17 PM, Muhammad Falak R Wani
>  wrote:
> Please follow standard naming convention for the patches.
> It should be [PATCH v2 1/4] and not [PATCH 1/4 v2].

Does this matter?  It's in a thread so it sorts fine either way.

regards,
dan carpenter



[PATCH 2/2] iser-target: Remove explicit mlx4 work-around

2015-10-27 Thread Sagi Grimberg
The driver now exposes sufficient limits so we can
avoid having an mlx4-specific work-around.

Signed-off-by: Sagi Grimberg 
---
 drivers/infiniband/ulp/isert/ib_isert.c |   10 ++
 1 files changed, 2 insertions(+), 8 deletions(-)

diff --git a/drivers/infiniband/ulp/isert/ib_isert.c 
b/drivers/infiniband/ulp/isert/ib_isert.c
index 96336a9..303cea7 100644
--- a/drivers/infiniband/ulp/isert/ib_isert.c
+++ b/drivers/infiniband/ulp/isert/ib_isert.c
@@ -141,14 +141,8 @@ isert_create_qp(struct isert_conn *isert_conn,
attr.recv_cq = comp->cq;
attr.cap.max_send_wr = ISERT_QP_MAX_REQ_DTOS;
attr.cap.max_recv_wr = ISERT_QP_MAX_RECV_DTOS + 1;
-   /*
-* FIXME: Use devattr.max_sge - 2 for max_send_sge as
-* work-around for RDMA_READs with ConnectX-2.
-*
-* Also, still make sure to have at least two SGEs for
-* outgoing control PDU responses.
-*/
-   attr.cap.max_send_sge = max(2, device->ib_device->max_sge - 2);
+   attr.cap.max_send_sge = min(device->ib_device->max_sge,
+   device->ib_device->max_sge_rd);
isert_conn->max_sge = attr.cap.max_send_sge;
 
attr.cap.max_recv_sge = 1;
-- 
1.7.1



[PATCH 0/2] Expose max_sge_rd correctly

2015-10-27 Thread Sagi Grimberg
This addresses a specific mlx4 issue where the max_sge_rd
is actually smaller than max_sge (rdma reads with max_sge
entries complete with an error).

The second patch removes the explicit work-around from the
iser target code.

This applies on top of Christoph's device attributes modification.

Sagi Grimberg (2):
  mlx4: Expose correct max_sge_rd limit
  iser-target: Remove explicit mlx4 work-around

 drivers/infiniband/hw/mlx4/main.c   |3 ++-
 drivers/infiniband/ulp/isert/ib_isert.c |   10 ++
 2 files changed, 4 insertions(+), 9 deletions(-)



[PATCH 1/2] mlx4: Expose correct max_sge_rd limit

2015-10-27 Thread Sagi Grimberg
mlx4 devices (ConnectX-2, ConnectX-3) can not issue
max_sge in a single RDMA_READ request (resulting in
a completion error). Thus, expose lower max_sge_rd
to avoid this issue.

Signed-off-by: Sagi Grimberg 
---
 drivers/infiniband/hw/mlx4/main.c |3 ++-
 1 files changed, 2 insertions(+), 1 deletions(-)

diff --git a/drivers/infiniband/hw/mlx4/main.c 
b/drivers/infiniband/hw/mlx4/main.c
index 3889723..46305dc 100644
--- a/drivers/infiniband/hw/mlx4/main.c
+++ b/drivers/infiniband/hw/mlx4/main.c
@@ -499,7 +499,8 @@ static int mlx4_ib_init_device_flags(struct ib_device 
*ibdev)
ibdev->max_qp_wr   = dev->dev->caps.max_wqes - 
MLX4_IB_SQ_MAX_SPARE;
ibdev->max_sge = min(dev->dev->caps.max_sq_sg,
 dev->dev->caps.max_rq_sg);
-   ibdev->max_sge_rd  = ibdev->max_sge;
+   /* reserve 2 sge slots for rdma reads */
+   ibdev->max_sge_rd  = ibdev->max_sge - 2;
ibdev->max_cq  = dev->dev->quotas.cq;
ibdev->max_cqe = dev->dev->caps.max_cqes;
ibdev->max_mr  = dev->dev->quotas.mpt;
-- 
1.7.1



Re: [PATCH 1/4 v2] staging: ipath: ipath_driver: Use setup_timer

2015-10-27 Thread Leon Romanovsky
On Tue, Oct 27, 2015 at 11:19 AM, Dan Carpenter
 wrote:
> On Sun, Oct 25, 2015 at 01:21:11PM +0200, Leon Romanovsky wrote:
>> On Sun, Oct 25, 2015 at 12:17 PM, Muhammad Falak R Wani
>>  wrote:
>> Please follow standard naming convention for the patches.
>> It should be [PATCH v2 1/4] and not [PATCH 1/4 v2].
>
> Does this matter?  It's in a thread so it sorts fine either way.
It will be wise if people read guides and follow examples.

[1] https://www.kernel.org/doc/Documentation/SubmittingPatches


>
> regards,
> dan carpenter
>


Re: [PATCH 1/4 v2] staging: ipath: ipath_driver: Use setup_timer

2015-10-27 Thread Dan Carpenter
On Tue, Oct 27, 2015 at 11:45:18AM +0200, Leon Romanovsky wrote:
> On Tue, Oct 27, 2015 at 11:19 AM, Dan Carpenter
>  wrote:
> > On Sun, Oct 25, 2015 at 01:21:11PM +0200, Leon Romanovsky wrote:
> >> On Sun, Oct 25, 2015 at 12:17 PM, Muhammad Falak R Wani
> >>  wrote:
> >> Please follow standard naming convention for the patches.
> >> It should be [PATCH v2 1/4] and not [PATCH 1/4 v2].
> >
> > Does this matter?  It's in a thread so it sorts fine either way.
> It will be wise if people read guides and follow examples.
> 
> [1] https://www.kernel.org/doc/Documentation/SubmittingPatches

That document doesn't really specify one way or the other.  And even if
it did then why would you care?  Stop being so picky for no reason.

regards,
dan carpenter



[PATCH libibverbs] Expose QP block self multicast loopback creation flag

2015-10-27 Thread Eran Ben Elisha
Add a QP creation flag which indicates that the QP will not receive self
multicast loopback traffic.

ibv_cmd_create_qp_ex was already defined but could not be extended; add
ibv_cmd_create_qp_ex2, which follows the extension scheme and hence can
be extended in the future for more features.
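From the application side, a minimal usage sketch might look like the
following (the create_flags field and the IBV_QP_CREATE_BLOCK_SELF_MCAST_LB
flag name are assumptions based on this series; the rest of the QP setup is
abbreviated):

#include <infiniband/verbs.h>

static struct ibv_qp *create_noloop_ud_qp(struct ibv_context *ctx,
					  struct ibv_pd *pd,
					  struct ibv_cq *cq)
{
	struct ibv_qp_init_attr_ex attr = {
		.qp_type	= IBV_QPT_UD,
		.send_cq	= cq,
		.recv_cq	= cq,
		.cap		= { .max_send_wr = 16, .max_recv_wr = 16,
				    .max_send_sge = 1, .max_recv_sge = 1 },
		.comp_mask	= IBV_QP_INIT_ATTR_PD |
				  IBV_QP_INIT_ATTR_CREATE_FLAGS,
		.pd		= pd,
		/* assumed flag name for "block self multicast loopback" */
		.create_flags	= IBV_QP_CREATE_BLOCK_SELF_MCAST_LB,
	};

	/* a provider that does not support the flag rejects the request */
	return ibv_create_qp_ex(ctx, &attr);
}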

Signed-off-by: Eran Ben Elisha 
Reviewed-by: Moshe Lazer 
---
Hi Doug,
This is the user space equivalent of the loopback prevention patches that were
accepted into 4.4 ib-next.

 include/infiniband/driver.h   |   9 ++
 include/infiniband/kern-abi.h |  53 +++
 include/infiniband/verbs.h|   9 +-
 src/cmd.c | 200 ++
 src/libibverbs.map|   1 +
 5 files changed, 200 insertions(+), 72 deletions(-)

diff --git a/include/infiniband/driver.h b/include/infiniband/driver.h
index 8227df0..b7f1fae 100644
--- a/include/infiniband/driver.h
+++ b/include/infiniband/driver.h
@@ -179,6 +179,15 @@ int ibv_cmd_create_qp_ex(struct ibv_context *context,
 struct ibv_qp_init_attr_ex *attr_ex,
 struct ibv_create_qp *cmd, size_t cmd_size,
 struct ibv_create_qp_resp *resp, size_t resp_size);
+int ibv_cmd_create_qp_ex2(struct ibv_context *context,
+ struct verbs_qp *qp, int vqp_sz,
+ struct ibv_qp_init_attr_ex *qp_attr,
+ struct ibv_create_qp_ex *cmd,
+ size_t cmd_core_size,
+ size_t cmd_size,
+ struct ibv_create_qp_resp_ex *resp,
+ size_t resp_core_size,
+ size_t resp_size);
 int ibv_cmd_open_qp(struct ibv_context *context,
struct verbs_qp *qp,  int vqp_sz,
struct ibv_qp_open_attr *attr,
diff --git a/include/infiniband/kern-abi.h b/include/infiniband/kern-abi.h
index 800c5ab..2278f63 100644
--- a/include/infiniband/kern-abi.h
+++ b/include/infiniband/kern-abi.h
@@ -110,6 +110,8 @@ enum {
 enum {
IB_USER_VERBS_CMD_QUERY_DEVICE_EX = IB_USER_VERBS_CMD_EXTENDED_MASK |
IB_USER_VERBS_CMD_QUERY_DEVICE,
+   IB_USER_VERBS_CMD_CREATE_QP_EX = IB_USER_VERBS_CMD_EXTENDED_MASK |
+IB_USER_VERBS_CMD_CREATE_QP,
IB_USER_VERBS_CMD_CREATE_FLOW = IB_USER_VERBS_CMD_EXTENDED_MASK +
IB_USER_VERBS_CMD_THRESHOLD,
IB_USER_VERBS_CMD_DESTROY_FLOW
@@ -527,28 +529,35 @@ struct ibv_kern_qp_attr {
__u8reserved[5];
 };
 
+#define IBV_CREATE_QP_COMMON   \
+   __u64 user_handle;  \
+   __u32 pd_handle;\
+   __u32 send_cq_handle;   \
+   __u32 recv_cq_handle;   \
+   __u32 srq_handle;   \
+   __u32 max_send_wr;  \
+   __u32 max_recv_wr;  \
+   __u32 max_send_sge; \
+   __u32 max_recv_sge; \
+   __u32 max_inline_data;  \
+   __u8  sq_sig_all;   \
+   __u8  qp_type;  \
+   __u8  is_srq;   \
+   __u8  reserved
+
 struct ibv_create_qp {
__u32 command;
__u16 in_words;
__u16 out_words;
__u64 response;
-   __u64 user_handle;
-   __u32 pd_handle;
-   __u32 send_cq_handle;
-   __u32 recv_cq_handle;
-   __u32 srq_handle;
-   __u32 max_send_wr;
-   __u32 max_recv_wr;
-   __u32 max_send_sge;
-   __u32 max_recv_sge;
-   __u32 max_inline_data;
-   __u8  sq_sig_all;
-   __u8  qp_type;
-   __u8  is_srq;
-   __u8  reserved;
+   IBV_CREATE_QP_COMMON;
__u64 driver_data[0];
 };
 
+struct ibv_create_qp_common {
+   IBV_CREATE_QP_COMMON;
+};
+
 struct ibv_open_qp {
__u32 command;
__u16 in_words;
@@ -574,6 +583,19 @@ struct ibv_create_qp_resp {
__u32 reserved;
 };
 
+struct ibv_create_qp_ex {
+   struct ex_hdr   hdr;
+   struct ibv_create_qp_common base;
+   __u32 comp_mask;
+   __u32 create_flags;
+};
+
+struct ibv_create_qp_resp_ex {
+   struct ibv_create_qp_resp base;
+   __u32 comp_mask;
+   __u32 response_length;
+};
+
 struct ibv_qp_dest {
__u8  dgid[16];
__u32 flow_label;
@@ -1031,7 +1053,8 @@ enum {
IB_USER_VERBS_CMD_OPEN_QP_V2 = -1,
IB_USER_VERBS_CMD_CREATE_FLOW_V2 = -1,
IB_USER_VERBS_CMD_DESTROY_FLOW_V2 = -1,
-   IB_USER_VERBS_CMD_QUERY_DEVICE_EX_V2 = -1
+   IB_USER_VERBS_CMD_QUERY_DEVICE_EX_V2 = -1,
+   IB_USER_VERBS_CMD_CREATE_QP_EX_V2 = -1,
 };
 
 struct ibv_modify_srq_v3 {
diff --git a/include/infiniband/verbs.h b/include/infiniband/verbs.h
index ae22768..941e5dc 100644
--- a/include/infiniband/verbs.h
+++ b/include/infiniband/verbs.h
@@ -580,7 +580,12 @@ struct ibv_qp_init_attr {
 enum ibv_qp_init_attr_mask {
IBV_QP_INIT_ATTR_PD = 1 << 0,
IBV_QP_INIT_ATTR_XRCD   = 1 << 1,
-   IBV_QP_INIT_ATTR_RESE

[PATCH libmlx4] Add support for ibv_cmd_create_qp_ex2

2015-10-27 Thread Eran Ben Elisha
Add mlx4_cmd_create_qp_ex, which follows the standard
extension verb mechanism.
This function is called from mlx4_create_qp_ex, uses the
extension verb functions, and stores the creation flags.

In addition, check that the comp_mask values of struct
ibv_qp_init_attr_ex are valid.


Signed-off-by: Eran Ben Elisha 
Signed-off-by: Yishai Hadas 
---
 src/mlx4-abi.h | 18 ++
 src/verbs.c| 51 +++
 2 files changed, 65 insertions(+), 4 deletions(-)

diff --git a/src/mlx4-abi.h b/src/mlx4-abi.h
index b48f6fc..ac21fa8 100644
--- a/src/mlx4-abi.h
+++ b/src/mlx4-abi.h
@@ -111,4 +111,22 @@ struct mlx4_create_qp {
__u8reserved[5];
 };
 
+struct mlx4_create_qp_drv_ex {
+   __u64   buf_addr;
+   __u64   db_addr;
+   __u8log_sq_bb_count;
+   __u8log_sq_stride;
+   __u8sq_no_prefetch; /* was reserved in ABI 2 */
+   __u8reserved[5];
+};
+
+struct mlx4_create_qp_ex {
+   struct ibv_create_qp_ex ibv_cmd;
+   struct mlx4_create_qp_drv_exdrv_ex;
+};
+
+struct mlx4_create_qp_resp_ex {
+   struct ibv_create_qp_resp_exibv_resp;
+};
+
 #endif /* MLX4_ABI_H */
diff --git a/src/verbs.c b/src/verbs.c
index 2cb1f8a..2cf240d 100644
--- a/src/verbs.c
+++ b/src/verbs.c
@@ -458,6 +458,43 @@ int mlx4_destroy_srq(struct ibv_srq *srq)
return 0;
 }
 
+static int mlx4_cmd_create_qp_ex(struct ibv_context *context,
+struct ibv_qp_init_attr_ex *attr,
+struct mlx4_create_qp *cmd,
+struct mlx4_qp *qp)
+{
+   struct mlx4_create_qp_ex cmd_ex;
+   struct mlx4_create_qp_resp_ex resp;
+   int ret;
+
+   memset(&cmd_ex, 0, sizeof(cmd_ex));
+   memcpy(&cmd_ex.ibv_cmd.base, &cmd->ibv_cmd.user_handle,
+  offsetof(typeof(cmd->ibv_cmd), is_srq) +
+  sizeof(cmd->ibv_cmd.is_srq) -
+  offsetof(typeof(cmd->ibv_cmd), user_handle));
+
+   memcpy(&cmd_ex.drv_ex, &cmd->buf_addr,
+  offsetof(typeof(*cmd), sq_no_prefetch) +
+  sizeof(cmd->sq_no_prefetch) - sizeof(cmd->ibv_cmd));
+
+   ret = ibv_cmd_create_qp_ex2(context, &qp->verbs_qp,
+   sizeof(qp->verbs_qp), attr,
+   &cmd_ex.ibv_cmd, sizeof(cmd_ex.ibv_cmd),
+   sizeof(cmd_ex), &resp.ibv_resp,
+   sizeof(resp.ibv_resp), sizeof(resp));
+   return ret;
+}
+
+enum {
+   MLX4_CREATE_QP_SUP_COMP_MASK = (IBV_QP_INIT_ATTR_PD |
+   IBV_QP_INIT_ATTR_XRCD |
+   IBV_QP_INIT_ATTR_CREATE_FLAGS),
+};
+
+enum {
+   MLX4_CREATE_QP_EX2_COMP_MASK = (IBV_QP_INIT_ATTR_CREATE_FLAGS),
+};
+
 struct ibv_qp *mlx4_create_qp_ex(struct ibv_context *context,
 struct ibv_qp_init_attr_ex *attr)
 {
@@ -474,6 +511,9 @@ struct ibv_qp *mlx4_create_qp_ex(struct ibv_context 
*context,
attr->cap.max_inline_data > 1024)
return NULL;
 
+   if (attr->comp_mask & ~MLX4_CREATE_QP_SUP_COMP_MASK)
+   return NULL;
+
qp = calloc(1, sizeof *qp);
if (!qp)
return NULL;
@@ -529,12 +569,15 @@ struct ibv_qp *mlx4_create_qp_ex(struct ibv_context 
*context,
; /* nothing */
cmd.sq_no_prefetch = 0; /* OK for ABI 2: just a reserved field */
memset(cmd.reserved, 0, sizeof cmd.reserved);
-
pthread_mutex_lock(&to_mctx(context)->qp_table_mutex);
 
-   ret = ibv_cmd_create_qp_ex(context, &qp->verbs_qp,
-  sizeof(qp->verbs_qp), attr,
-  &cmd.ibv_cmd, sizeof cmd, &resp, sizeof 
resp);
+   if (attr->comp_mask & MLX4_CREATE_QP_EX2_COMP_MASK)
+   ret = mlx4_cmd_create_qp_ex(context, attr, &cmd, qp);
+   else
+   ret = ibv_cmd_create_qp_ex(context, &qp->verbs_qp,
+  sizeof(qp->verbs_qp), attr,
+  &cmd.ibv_cmd, sizeof(cmd), &resp,
+  sizeof(resp));
if (ret)
goto err_rq_db;
 
-- 
1.8.3.1



Re: [PATCH libibverbs] Expose QP block self multicast loopback creation flag

2015-10-27 Thread Or Gerlitz

On 10/27/2015 2:53 PM, Eran Ben Elisha wrote:

Add a QP creation flag which indicates that the QP will not receive self
multicast loopback traffic.

ibv_cmd_create_qp_ex was already defined but could not be extended; add
ibv_cmd_create_qp_ex2, which follows the extension scheme and hence can
be extended in the future for more features.

Signed-off-by: Eran Ben Elisha 
Reviewed-by: Moshe Lazer 



Eran,

If there's a V1, I would use

"Add QP creation flags, support blocking self multicast loopback"

for the title, b/c this better reflects what the patch is doing.



Or.


Re: [PATCH] iser-target: Remove an unused variable

2015-10-27 Thread Sagi Grimberg



On 22/10/2015 21:14, Bart Van Assche wrote:

Detected this by compiling with W=1.

Signed-off-by: Bart Van Assche 
Cc: Sagi Grimberg 


FWIW,

Reviewed-by: Sagi Grimberg 


Re: [PATCH] IB/iser: Remove an unused variable

2015-10-27 Thread Sagi Grimberg

Detected this by compiling with W=1.

Signed-off-by: Bart Van Assche 
Cc: Sagi Grimberg 


FWIW,

Reviewed-by: Sagi Grimberg 


Re: merge struct ib_device_attr into struct ib_device V2

2015-10-27 Thread Sagi Grimberg

Did we converge on this?

Just a heads up to Doug, this conflicts with
[PATCH v4 11/16] xprtrdma: Pre-allocate Work Requests for backchannel

but it's trivial to sort out...


Re: [PATCH 1/4 v2] staging: ipath: ipath_driver: Use setup_timer

2015-10-27 Thread Muhammad Falak R Wani


On October 27, 2015 4:40:42 PM GMT+05:30, Dan Carpenter 
 wrote:
>On Tue, Oct 27, 2015 at 11:45:18AM +0200, Leon Romanovsky wrote:
>> On Tue, Oct 27, 2015 at 11:19 AM, Dan Carpenter
>>  wrote:
>> > On Sun, Oct 25, 2015 at 01:21:11PM +0200, Leon Romanovsky wrote:
>> >> On Sun, Oct 25, 2015 at 12:17 PM, Muhammad Falak R Wani
>> >>  wrote:
>> >> Please follow standard naming convention for the patches.
>> >> It should be [PATCH v2 1/4] and not [PATCH 1/4 v2].
>> >
>> > Does this matter?  It's in a thread so it sorts fine either way.
>> It will be wise if people read guides and follow examples.
>> 
>> [1] https://www.kernel.org/doc/Documentation/SubmittingPatches
>
>That document doesn't really specify one way or the other.  And even if
>it did then why would you care?  Stop being so picky for no reason.
>
>regards,
>dan carpenter

Sorry, my bad. Won't repeat such mistakes.
-- 
mfrw


RE: [PATCH 0/2] Expose max_sge_rd correctly

2015-10-27 Thread Steve Wise


> -Original Message-
> From: Sagi Grimberg [mailto:sa...@mellanox.com]
> Sent: Tuesday, October 27, 2015 4:41 AM
> To: linux-rdma@vger.kernel.org; target-de...@vger.kernel.org
> Cc: Steve Wise; Nicholas A. Bellinger; Or Gerlitz; Doug Ledford
> Subject: [PATCH 0/2] Expose max_sge_rd correctly
> 
> This addresses a specific mlx4 issue where the max_sge_rd
> is actually smaller than max_sge (rdma reads with max_sge
> entries completes with error).
> 
> The second patch removes the explicit work-around from the
> iser target code.
> 
> This applies on top of Christoph's device attributes modification.
> 


Looks correct to me.

Series Reviewed-by: Steve Wise 



Re: [PATCH 1/2] mlx4: Expose correct max_sge_rd limit

2015-10-27 Thread Or Gerlitz

On 10/27/2015 11:40 AM, Sagi Grimberg wrote:

mlx4 devices (ConnectX-2, ConnectX-3) can not issue
max_sge in a single RDMA_READ request (resulting in
a completion error). Thus, expose lower max_sge_rd
to avoid this issue.


Sagi,

I can hear your pain when wearing the iser target driver maintainer hat.

Still, this patch is currently a pure WA b/c we didn't do RCA (Root Cause
Analysis).

Let's wait for the RCA (which might yield the same patch, BTW) and keep
suffering in LIO.


Or.




Signed-off-by: Sagi Grimberg 
---
  drivers/infiniband/hw/mlx4/main.c |3 ++-
  1 files changed, 2 insertions(+), 1 deletions(-)

diff --git a/drivers/infiniband/hw/mlx4/main.c 
b/drivers/infiniband/hw/mlx4/main.c
index 3889723..46305dc 100644
--- a/drivers/infiniband/hw/mlx4/main.c
+++ b/drivers/infiniband/hw/mlx4/main.c
@@ -499,7 +499,8 @@ static int mlx4_ib_init_device_flags(struct ib_device 
*ibdev)
ibdev->max_qp_wr= dev->dev->caps.max_wqes - 
MLX4_IB_SQ_MAX_SPARE;
ibdev->max_sge  = min(dev->dev->caps.max_sq_sg,
 dev->dev->caps.max_rq_sg);
-   ibdev->max_sge_rd   = ibdev->max_sge;
+   /* reserve 2 sge slots for rdma reads */
+   ibdev->max_sge_rd   = ibdev->max_sge - 2;
ibdev->max_cq   = dev->dev->quotas.cq;
ibdev->max_cqe  = dev->dev->caps.max_cqes;
ibdev->max_mr   = dev->dev->quotas.mpt;




[PATCH] IB/sa: replace GFP_KERNEL with GFP_ATOMIC

2015-10-27 Thread Saurabh Sengar
Replace GFP_KERNEL with GFP_ATOMIC, as code running while holding a spinlock
must be atomic.
GFP_KERNEL may sleep and can cause a deadlock, whereas GFP_ATOMIC may
fail but certainly avoids the deadlock.
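The rule being applied here, in a tiny self-contained sketch:

#include <linux/slab.h>
#include <linux/spinlock.h>

static DEFINE_SPINLOCK(example_lock);

static void *alloc_while_locked(size_t len)
{
	unsigned long flags;
	void *buf;

	spin_lock_irqsave(&example_lock, flags);
	/*
	 * GFP_KERNEL may sleep to reclaim memory; sleeping with a spinlock
	 * held can deadlock.  GFP_ATOMIC never sleeps -- it simply fails
	 * when memory is not immediately available, so the caller must
	 * handle a NULL return.
	 */
	buf = kmalloc(len, GFP_ATOMIC);
	spin_unlock_irqrestore(&example_lock, flags);
	return buf;
}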

Signed-off-by: Saurabh Sengar 
---
 drivers/infiniband/core/sa_query.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/infiniband/core/sa_query.c 
b/drivers/infiniband/core/sa_query.c
index 8c014b3..cd1f911 100644
--- a/drivers/infiniband/core/sa_query.c
+++ b/drivers/infiniband/core/sa_query.c
@@ -526,7 +526,7 @@ static int ib_nl_send_msg(struct ib_sa_query *query)
if (len <= 0)
return -EMSGSIZE;
 
-   skb = nlmsg_new(len, GFP_KERNEL);
+   skb = nlmsg_new(len, GFP_ATOMIC);
if (!skb)
return -ENOMEM;
 
@@ -544,7 +544,7 @@ static int ib_nl_send_msg(struct ib_sa_query *query)
/* Repair the nlmsg header length */
nlmsg_end(skb, nlh);
 
-   ret = ibnl_multicast(skb, nlh, RDMA_NL_GROUP_LS, GFP_KERNEL);
+   ret = ibnl_multicast(skb, nlh, RDMA_NL_GROUP_LS, GFP_ATOMIC);
if (!ret)
ret = len;
else
-- 
1.9.1



Re: [PATCH 1/2] mlx4: Expose correct max_sge_rd limit

2015-10-27 Thread Sagi Grimberg



On 27/10/2015 16:39, Or Gerlitz wrote:

On 10/27/2015 11:40 AM, Sagi Grimberg wrote:

mlx4 devices (ConnectX-2, ConnectX-3) can not issue
max_sge in a single RDMA_READ request (resulting in
a completion error). Thus, expose lower max_sge_rd
to avoid this issue.


Sagi,


Hey Or,


Still, this patch is currently pure WA b/c we didn't do RCA (Root Cause
Analysis)


So from my discussions with the HW folks a RDMA_READ wqe cannot exceed
512B. The wqe control segment is 16 bytes, the rdma section is 12 bytes
(rkey + raddr) and each sge is 16 bytes so the computation is:

(512B-16B-12B)/16B = 30.

The reason is that the HW needs to fetch the rdma_read wqe on the RX
path (rdma_read response) and it has a limited buffer at that point.

Perhaps a dedicated #define for that is needed here.

I'll add that in the change log in v1.

Cheers,
Sagi.


Re: [PATCH 1/2] mlx4: Expose correct max_sge_rd limit

2015-10-27 Thread Or Gerlitz

On 10/27/2015 6:03 PM, Sagi Grimberg wrote:

So from my discussions with the HW folks a RDMA_READ wqe cannot exceed
512B. The wqe control segment is 16 bytes, the rdma section is 12 bytes
(rkey + raddr) and each sge is 16 bytes so the computation is:

(512B-16B-12B)/16B = 30. 


But AFAIR, the magic number was 28... how does this go hand in hand with
your findings?



Re: [PATCH 1/2] mlx4: Expose correct max_sge_rd limit

2015-10-27 Thread Sagi Grimberg

But AFAIR, the magic number was 28... how this goes hand in hand with
your findings?


mlx4 max_sge is 32, and isert does max_sge - 2 = 30.
So it always used 30... and I've run it reliably with this for a while now.

This thing existed before I was involved, so I might not be familiar with
all the details...

Sagi.


Re: [PATCH 1/2] mlx4: Expose correct max_sge_rd limit

2015-10-27 Thread Bart Van Assche

On 10/27/2015 02:40 AM, Sagi Grimberg wrote:

mlx4 devices (ConnectX-2, ConnectX-3) can not issue
max_sge in a single RDMA_READ request (resulting in
a completion error). Thus, expose lower max_sge_rd
to avoid this issue.

Signed-off-by: Sagi Grimberg 
---
  drivers/infiniband/hw/mlx4/main.c |3 ++-
  1 files changed, 2 insertions(+), 1 deletions(-)

diff --git a/drivers/infiniband/hw/mlx4/main.c 
b/drivers/infiniband/hw/mlx4/main.c
index 3889723..46305dc 100644
--- a/drivers/infiniband/hw/mlx4/main.c
+++ b/drivers/infiniband/hw/mlx4/main.c
@@ -499,7 +499,8 @@ static int mlx4_ib_init_device_flags(struct ib_device 
*ibdev)
ibdev->max_qp_wr= dev->dev->caps.max_wqes - 
MLX4_IB_SQ_MAX_SPARE;
ibdev->max_sge  = min(dev->dev->caps.max_sq_sg,
 dev->dev->caps.max_rq_sg);
-   ibdev->max_sge_rd   = ibdev->max_sge;
+   /* reserve 2 sge slots for rdma reads */
+   ibdev->max_sge_rd   = ibdev->max_sge - 2;
ibdev->max_cq   = dev->dev->quotas.cq;
ibdev->max_cqe  = dev->dev->caps.max_cqes;
ibdev->max_mr   = dev->dev->quotas.mpt;


Hello Sagi,

Is this the same issue as what has been discussed in 
http://www.spinics.net/lists/linux-rdma/msg21799.html ?


Thanks,

Bart.


Re: [PATCH 1/2] mlx4: Expose correct max_sge_rd limit

2015-10-27 Thread Sagi Grimberg

Hello Sagi,

Is this the same issue as what has been discussed in
http://www.spinics.net/lists/linux-rdma/msg21799.html ?


Looks like it.

I think this patch addresses this issue, but let's CC Eli
to comment if I'm missing something.

Thanks for digging this up...

Sagi.


[PATCH v1 0/2] Handle mlx4 max_sge_rd correctly

2015-10-27 Thread Sagi Grimberg
This addresses a specific mlx4 issue where the max_sge_rd
is actually smaller than max_sge (rdma reads with max_sge
entries complete with an error).

The second patch removes the explicit work-around from the
iser target code.

Changes from v0:
- Used a dedicated enumeration MLX4_MAX_SGE_RD and added
  a root cause analysis to patch change log.

- Fixed isert qp creation to use max_sge but construct rdma
  work requests with the minimum of max_sge and max_sge_rd,
  as non-rdma sends (login rsp) take 2 sges (and some devices
  have max_sge_rd = 1).

Sagi Grimberg (2):
  mlx4: Expose correct max_sge_rd limit
  iser-target: Remove explicit mlx4 work-around

 drivers/infiniband/hw/mlx4/main.c   |2 +-
 drivers/infiniband/ulp/isert/ib_isert.c |   13 +++--
 include/linux/mlx4/device.h |   11 +++
 3 files changed, 15 insertions(+), 11 deletions(-)



[PATCH v1 2/2] iser-target: Remove explicit mlx4 work-around

2015-10-27 Thread Sagi Grimberg
The driver now exposes sufficient limits so we can
avoid having an mlx4-specific work-around.

Signed-off-by: Sagi Grimberg 
---
 drivers/infiniband/ulp/isert/ib_isert.c |   13 +++--
 1 files changed, 3 insertions(+), 10 deletions(-)

diff --git a/drivers/infiniband/ulp/isert/ib_isert.c 
b/drivers/infiniband/ulp/isert/ib_isert.c
index 96336a9..eb985f9 100644
--- a/drivers/infiniband/ulp/isert/ib_isert.c
+++ b/drivers/infiniband/ulp/isert/ib_isert.c
@@ -141,16 +141,9 @@ isert_create_qp(struct isert_conn *isert_conn,
attr.recv_cq = comp->cq;
attr.cap.max_send_wr = ISERT_QP_MAX_REQ_DTOS;
attr.cap.max_recv_wr = ISERT_QP_MAX_RECV_DTOS + 1;
-   /*
-* FIXME: Use devattr.max_sge - 2 for max_send_sge as
-* work-around for RDMA_READs with ConnectX-2.
-*
-* Also, still make sure to have at least two SGEs for
-* outgoing control PDU responses.
-*/
-   attr.cap.max_send_sge = max(2, device->ib_device->max_sge - 2);
-   isert_conn->max_sge = attr.cap.max_send_sge;
-
+   attr.cap.max_send_sge = device->ib_device->max_sge;
+   isert_conn->max_sge = min(device->ib_device->max_sge,
+ device->ib_device->max_sge_rd);
attr.cap.max_recv_sge = 1;
attr.sq_sig_type = IB_SIGNAL_REQ_WR;
attr.qp_type = IB_QPT_RC;
-- 
1.7.1



[PATCH v1 1/2] mlx4: Expose correct max_sge_rd limit

2015-10-27 Thread Sagi Grimberg
mlx4 devices (ConnectX-2, ConnectX-3) have a limitation
where rdma read work queue entries cannot exceed 512 bytes.
An rdma_read wqe needs to fit in 512 bytes:
- wqe control segment (16 bytes)
- rdma segment (12 bytes)
- scatter elements (16 bytes each)

So max_sge_rd should be: (512 - 16 - 12) / 16 = 30.

Signed-off-by: Sagi Grimberg 
---
 drivers/infiniband/hw/mlx4/main.c |2 +-
 include/linux/mlx4/device.h   |   11 +++
 2 files changed, 12 insertions(+), 1 deletions(-)

diff --git a/drivers/infiniband/hw/mlx4/main.c 
b/drivers/infiniband/hw/mlx4/main.c
index 3889723..d8453f1 100644
--- a/drivers/infiniband/hw/mlx4/main.c
+++ b/drivers/infiniband/hw/mlx4/main.c
@@ -499,7 +499,7 @@ static int mlx4_ib_init_device_flags(struct ib_device 
*ibdev)
ibdev->max_qp_wr   = dev->dev->caps.max_wqes - 
MLX4_IB_SQ_MAX_SPARE;
ibdev->max_sge = min(dev->dev->caps.max_sq_sg,
 dev->dev->caps.max_rq_sg);
-   ibdev->max_sge_rd  = ibdev->max_sge;
+   ibdev->max_sge_rd  = MLX4_MAX_SGE_RD;
ibdev->max_cq  = dev->dev->quotas.cq;
ibdev->max_cqe = dev->dev->caps.max_cqes;
ibdev->max_mr  = dev->dev->quotas.mpt;
diff --git a/include/linux/mlx4/device.h b/include/linux/mlx4/device.h
index baad4cb..90c12f0 100644
--- a/include/linux/mlx4/device.h
+++ b/include/linux/mlx4/device.h
@@ -425,6 +425,17 @@ enum {
 };
 
 enum {
+   /*
+* Max wqe size for rdma read is 512 bytes, so this
+* limits our max_sge_rd as the wqe needs to fit:
+* - ctrl segment (16 bytes)
+* - rdma segment (12 bytes)
+* - scatter elements (16 bytes each)
+*/
+   MLX4_MAX_SGE_RD = (512 - 16 - 12) / 16
+};
+
+enum {
MLX4_DEV_PMC_SUBTYPE_GUID_INFO   = 0x14,
MLX4_DEV_PMC_SUBTYPE_PORT_INFO   = 0x15,
MLX4_DEV_PMC_SUBTYPE_PKEY_TABLE  = 0x16,
-- 
1.7.1



[PATCH libibverbs 0/5] Completion timestamping

2015-10-27 Thread Matan Barak
Hi Doug,

This series adds completion timestamp for libibverbs.
In order to do so, we add an extensible poll cq. The problem with
extending the WC is that you could run out of the current cache
line when adding new features and degrade performance. This is solved
by introducing a custom WC.

The user creates a CQ using ibv_create_cq_ex, stating which WC fields
should be returned by this CQ. When the user calls ibv_poll_cq_ex,
this custom WC is returned. The fields orders and sizes are declared
in advanced (we avoid alignment rules by putting the fields starting
from the 64bit fields --> 8bit fields). Each WC has a wc_flags field
representing which fields are valid in this WC.
The vendor drivers could optimize those calls extensively.

Completion timestamp is added on top of these extended ibv_create_cq_ex
verb and ibv_poll_cq_ex verb. The user should call ibv_create_cq_ex
stating that this CQ should support reporting completion timestamp.
ibv_poll_cq_ex reports this raw completion timestamp value in every
packet.

In the future, a verb like the following could be added in order to
transform this time into system time:
ibv_get_timestamp(struct ibv_context *context, uint64_t raw_time,
  struct timespec *ts, int flags);

The timestamp mask (number of supported bits) and the HCA's frequency
are given in ibv_query_device_ex verb.

We also give the user an ability to read the HCA's current clock.
This is done via ibv_query_values_ex. This verb could be extended
in the future for other interesting information.

Thanks,
Matan

Matan Barak (5):
  Add ibv_poll_cq_ex verb
  Add timestamp_mask and hca_core_clock to ibv_query_device_ex
  Add support for extended ibv_create_cq
  Add completion timestamp support for ibv_poll_cq_ex
  Add ibv_query_values_ex

 Makefile.am   |   6 +-
 examples/devinfo.c|  10 ++
 include/infiniband/compiler.h |  89 
 include/infiniband/driver.h   |   9 ++
 include/infiniband/kern-abi.h |  26 +++-
 include/infiniband/verbs.h| 318 ++
 man/ibv_create_cq_ex.3|  71 ++
 man/ibv_poll_cq_ex.3  | 173 +++
 man/ibv_query_device_ex.3 |   6 +-
 src/cmd.c |  63 +
 src/device.c  |  44 ++
 src/ibverbs.h |  12 ++
 src/libibverbs.map|   1 +
 13 files changed, 822 insertions(+), 6 deletions(-)
 create mode 100644 include/infiniband/compiler.h
 create mode 100644 man/ibv_create_cq_ex.3
 create mode 100644 man/ibv_poll_cq_ex.3

-- 
2.1.0



[PATCH libibverbs 1/5] Add ibv_poll_cq_ex verb

2015-10-27 Thread Matan Barak
This is an extension verb for ibv_poll_cq. It allows the user to poll
the cq for specific wc fields only, while allowing the wc to be extended.
The verb calls the provider in order to fill the WC with the required
information.

Signed-off-by: Matan Barak 
---
 Makefile.am   |   5 +-
 include/infiniband/compiler.h |  89 +
 include/infiniband/verbs.h| 215 ++
 man/ibv_poll_cq_ex.3  | 171 +
 4 files changed, 478 insertions(+), 2 deletions(-)
 create mode 100644 include/infiniband/compiler.h
 create mode 100644 man/ibv_poll_cq_ex.3

diff --git a/Makefile.am b/Makefile.am
index c85e98a..339bcec 100644
--- a/Makefile.am
+++ b/Makefile.am
@@ -44,7 +44,8 @@ libibverbsincludedir = $(includedir)/infiniband
 
 libibverbsinclude_HEADERS = include/infiniband/arch.h 
include/infiniband/driver.h \
 include/infiniband/kern-abi.h include/infiniband/opcode.h 
include/infiniband/verbs.h \
-include/infiniband/sa-kern-abi.h include/infiniband/sa.h 
include/infiniband/marshall.h
+include/infiniband/sa-kern-abi.h include/infiniband/sa.h 
include/infiniband/marshall.h \
+include/infiniband/compiler.h
 
 man_MANS = man/ibv_asyncwatch.1 man/ibv_devices.1 man/ibv_devinfo.1\
 man/ibv_rc_pingpong.1 man/ibv_uc_pingpong.1 man/ibv_ud_pingpong.1  \
@@ -63,7 +64,7 @@ man_MANS = man/ibv_asyncwatch.1 man/ibv_devices.1 
man/ibv_devinfo.1   \
 man/ibv_req_notify_cq.3 man/ibv_resize_cq.3 man/ibv_rate_to_mbps.3  \
 man/ibv_create_qp_ex.3 man/ibv_create_srq_ex.3 man/ibv_open_xrcd.3  \
 man/ibv_get_srq_num.3 man/ibv_open_qp.3 \
-man/ibv_query_device_ex.3
+man/ibv_query_device_ex.3 man/ibv_poll_cq_ex.3
 
 DEBIAN = debian/changelog debian/compat debian/control debian/copyright \
 debian/ibverbs-utils.install debian/libibverbs1.install \
diff --git a/include/infiniband/compiler.h b/include/infiniband/compiler.h
new file mode 100644
index 000..b4bab98
--- /dev/null
+++ b/include/infiniband/compiler.h
@@ -0,0 +1,89 @@
+/*
+ * Copyright (c) 2015 Mellanox, Ltd.  All rights reserved.
+ *
+ * This software is available to you under a choice of one of two
+ * licenses.  You may choose to be licensed under the terms of the GNU
+ * General Public License (GPL) Version 2, available from the file
+ * COPYING in the main directory of this source tree, or the
+ * OpenIB.org BSD license below:
+ *
+ * Redistribution and use in source and binary forms, with or
+ * without modification, are permitted provided that the following
+ * conditions are met:
+ *
+ *  - Redistributions of source code must retain the above
+ *copyright notice, this list of conditions and the following
+ *disclaimer.
+ *
+ *  - Redistributions in binary form must reproduce the above
+ *copyright notice, this list of conditions and the following
+ *disclaimer in the documentation and/or other materials
+ *provided with the distribution.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+ * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+ * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+ * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
+ * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
+ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
+ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ * SOFTWARE.
+ */
+
+#ifndef _COMPILER_
+#define _COMPILER_
+
+#if (__GNUC__ > 3 || (__GNUC__ == 3 && __GNUC_MINOR__ >= 4))
+#define ibv_popcount64 __builtin_popcountll
+#endif
+
+#ifndef __has_builtin
+   #define __has_builtin(x) 0 /* Compatibility with non-clang compilers. */
+#endif
+
+#if __has_builtin(__builtin_popcountll) && !defined(ibv_popcount64)
+   #define ibv_popcount64  __builtin_popcountll
+#endif
+
+#ifndef ibv_popcount64
+/* From FreeBSD:
+ * Copyright (C) 1995, 1996, 1997, and 1998 WIDE Project.
+ * All rights reserved.
+ *
+ * Redistribution and use in source and binary forms, with or without
+ * modification, are permitted provided that the following conditions
+ * are met:
+ * 1. Redistributions of source code must retain the above copyright
+ *notice, this list of conditions and the following disclaimer.
+ * 2. Redistributions in binary form must reproduce the above copyright
+ *notice, this list of conditions and the following disclaimer in the
+ *documentation and/or other materials provided with the distribution.
+ * 3. Neither the name of the project nor the names of its contributors
+ *may be used to endorse or promote products derived from this software
+ *without specific prior written permission.
+ *
+ * THIS SOFTWARE IS PROVIDED BY THE PROJECT AND CONTRIBUTORS ``AS IS'' AND
+ * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
+ * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITN

[PATCH libibverbs 4/5] Add completion timestamp support for ibv_poll_cq_ex

2015-10-27 Thread Matan Barak
Add support for raw completion timestamp through ibv_poll_cq_ex.

Signed-off-by: Matan Barak 
---
 include/infiniband/verbs.h | 7 ++-
 man/ibv_poll_cq_ex.3   | 2 ++
 2 files changed, 8 insertions(+), 1 deletion(-)

diff --git a/include/infiniband/verbs.h b/include/infiniband/verbs.h
index f80126a..3d66726 100644
--- a/include/infiniband/verbs.h
+++ b/include/infiniband/verbs.h
@@ -391,6 +391,7 @@ enum ibv_wc_flags_ex {
IBV_WC_EX_WITH_SLID = 1 << 7,
IBV_WC_EX_WITH_SL   = 1 << 8,
IBV_WC_EX_WITH_DLID_PATH_BITS   = 1 << 9,
+   IBV_WC_EX_WITH_COMPLETION_TIMESTAMP = 1 << 10,
 };
 
 enum {
@@ -409,6 +410,10 @@ enum {
 };
 
 /* fields order in wc_ex
+ * // Raw timestamp of completion. A raw timestamp is implementation
+ * // defined and can not be relied upon to have any ordering value
+ * // between more than one HCA or driver.
+ * uint64_tcompletion_timestamp;
  * uint32_tbyte_len,
  * uint32_timm_data;   // in network byte order
  * uint32_tqp_num;
@@ -420,7 +425,7 @@ enum {
  */
 
 enum {
-   IBV_WC_EX_WITH_64BIT_FIELDS = 0
+   IBV_WC_EX_WITH_64BIT_FIELDS = IBV_WC_EX_WITH_COMPLETION_TIMESTAMP
 };
 
 enum {
diff --git a/man/ibv_poll_cq_ex.3 b/man/ibv_poll_cq_ex.3
index 8f336bc..3eb9bc0 100644
--- a/man/ibv_poll_cq_ex.3
+++ b/man/ibv_poll_cq_ex.3
@@ -54,12 +54,14 @@ IBV_WC_EX_WITH_PKEY_INDEX   = 1 << 6,  /* The 
returned wc_ex contain
 IBV_WC_EX_WITH_SLID = 1 << 7,  /* The returned wc_ex 
contains slid field */
 IBV_WC_EX_WITH_SL   = 1 << 8,  /* The returned wc_ex 
contains sl field */
 IBV_WC_EX_WITH_DLID_PATH_BITS   = 1 << 9,  /* The returned wc_ex 
contains dlid_path_bits field */
+IBV_WC_EX_WITH_COMPLETION_TIMESTAMP = 1 << 10, /* The returned wc_ex 
contains completion_timestmap field */
 .in -8
 };
 
 .fi
 wc_flags describes which of the fields in buffer[0] have a valid value. The 
order of these fields and sizes are always the following:
 .nf
+uint64_tcompletion_timestamp; /* Raw timestamp of completion. 
Implementation defined. Can't be relied upon to have any ordering value between 
more than one driver/hca */
 uint32_tbyte_len,
 uint32_timm_data; /* in network byte order */
 uint32_tqp_num;
-- 
2.1.0



[PATCH libibverbs 5/5] Add ibv_query_values_ex

2015-10-27 Thread Matan Barak
Add an extension verb to query certain values of the device.
Currently, only IBV_VALUES_HW_CLOCK is supported, but this
verb could support other flags like IBV_VALUES_TEMP_SENSOR,
IBV_VALUES_CORE_FREQ, etc.
This extension verb only calls the provider.
The provider has to query this value somehow and mark the queried
values in comp_mask.
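Based on the definitions in this patch, a minimal usage sketch would be:

#include <stdio.h>
#include <infiniband/verbs.h>

static void print_raw_hca_clock(struct ibv_context *ctx)
{
	struct ibv_values_ex values = {
		.comp_mask = IBV_VALUES_MASK_RAW_CLOCK,	/* what we want queried */
	};

	/* the provider marks in comp_mask which values it actually filled */
	if (!ibv_query_values_ex(ctx, &values) &&
	    (values.comp_mask & IBV_VALUES_MASK_RAW_CLOCK))
		printf("raw HCA clock: %lld.%09ld\n",
		       (long long)values.raw_clock.tv_sec,
		       values.raw_clock.tv_nsec);
}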

Signed-off-by: Matan Barak 
---
 include/infiniband/verbs.h | 33 +
 1 file changed, 33 insertions(+)

diff --git a/include/infiniband/verbs.h b/include/infiniband/verbs.h
index 3d66726..4829dac 100644
--- a/include/infiniband/verbs.h
+++ b/include/infiniband/verbs.h
@@ -1234,6 +1234,16 @@ struct ibv_create_cq_attr_ex {
uint32_tflags;
 };
 
+enum ibv_values_mask {
+   IBV_VALUES_MASK_RAW_CLOCK   = 1 << 0,
+   IBV_VALUES_MASK_RESERVED= 1 << 1
+};
+
+struct ibv_values_ex {
+   uint32_tcomp_mask;
+   struct timespec raw_clock;
+};
+
 enum verbs_context_mask {
VERBS_CONTEXT_XRCD  = 1 << 0,
VERBS_CONTEXT_SRQ   = 1 << 1,
@@ -1250,6 +1260,8 @@ struct ibv_poll_cq_ex_attr {
 
 struct verbs_context {
/*  "grows up" - new fields go here */
+   int (*query_values)(struct ibv_context *context,
+   struct ibv_values_ex *values);
struct ibv_cq *(*create_cq_ex)(struct ibv_context *context,
   struct ibv_create_cq_attr_ex *);
void *priv;
@@ -1730,6 +1742,27 @@ ibv_create_qp_ex(struct ibv_context *context, struct 
ibv_qp_init_attr_ex *qp_ini
 }
 
 /**
+ * ibv_query_values_ex - Get current @q_values of device,
+ * @q_values is mask (Or's bits of enum ibv_values_mask) of the attributes
+ * we need to query.
+ */
+static inline int
+ibv_query_values_ex(struct ibv_context *context,
+   struct ibv_values_ex *values)
+{
+   struct verbs_context *vctx;
+
+   vctx = verbs_get_ctx_op(context, query_values);
+   if (!vctx)
+   return ENOSYS;
+
+   if (values->comp_mask & ~(IBV_VALUES_MASK_RESERVED - 1))
+   return EINVAL;
+
+   return vctx->query_values(context, values);
+}
+
+/**
  * ibv_query_device_ex - Get extended device properties
  */
 static inline int
-- 
2.1.0



[PATCH libibverbs 3/5] Add support for extended ibv_create_cq

2015-10-27 Thread Matan Barak
Adding ibv_create_cq_ex. This extended verb follows
the extension verb scheme and hence could be
extended in the future with more features.
The new command supports creation flags, including a completion timestamp flag.
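A hedged usage sketch (only the cqe/comp_mask/flags fields shown are taken
from this patch; the remaining attr fields, such as the completion channel,
comp_vector and requested wc fields, are assumed from the extension-verb
convention and omitted here):

#include <infiniband/verbs.h>

static struct ibv_cq *create_timestamping_cq(struct ibv_context *ctx)
{
	struct ibv_create_cq_attr_ex attr = {
		.cqe	   = 256,
		.comp_mask = IBV_CREATE_CQ_ATTR_FLAGS,
		.flags	   = IBV_CREATE_CQ_ATTR_COMPLETION_TIMESTAMP,
	};

	/* returns NULL if, e.g., the provider does not support timestamping */
	return ibv_create_cq_ex(ctx, &attr);
}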

Signed-off-by: Matan Barak 
---
 Makefile.am   |  3 +-
 include/infiniband/driver.h   |  9 ++
 include/infiniband/kern-abi.h | 24 +--
 include/infiniband/verbs.h| 63 ++
 man/ibv_create_cq_ex.3| 71 +++
 src/cmd.c | 42 +
 src/device.c  | 44 +++
 src/ibverbs.h |  5 +++
 src/libibverbs.map|  1 +
 9 files changed, 259 insertions(+), 3 deletions(-)
 create mode 100644 man/ibv_create_cq_ex.3

diff --git a/Makefile.am b/Makefile.am
index 339bcec..b6399d6 100644
--- a/Makefile.am
+++ b/Makefile.am
@@ -64,7 +64,8 @@ man_MANS = man/ibv_asyncwatch.1 man/ibv_devices.1 
man/ibv_devinfo.1   \
 man/ibv_req_notify_cq.3 man/ibv_resize_cq.3 man/ibv_rate_to_mbps.3  \
 man/ibv_create_qp_ex.3 man/ibv_create_srq_ex.3 man/ibv_open_xrcd.3  \
 man/ibv_get_srq_num.3 man/ibv_open_qp.3 \
-man/ibv_query_device_ex.3 man/ibv_poll_cq_ex.3
+man/ibv_query_device_ex.3 man/ibv_poll_cq_ex.3 \
+man/ibv_create_cq_ex.3
 
 DEBIAN = debian/changelog debian/compat debian/control debian/copyright \
 debian/ibverbs-utils.install debian/libibverbs1.install \
diff --git a/include/infiniband/driver.h b/include/infiniband/driver.h
index 8227df0..0d53554 100644
--- a/include/infiniband/driver.h
+++ b/include/infiniband/driver.h
@@ -144,6 +144,15 @@ int ibv_cmd_create_cq(struct ibv_context *context, int cqe,
  int comp_vector, struct ibv_cq *cq,
  struct ibv_create_cq *cmd, size_t cmd_size,
  struct ibv_create_cq_resp *resp, size_t resp_size);
+int ibv_cmd_create_cq_ex(struct ibv_context *context,
+struct ibv_create_cq_attr_ex *cq_attr,
+struct ibv_cq *cq,
+struct ibv_create_cq_ex *cmd,
+size_t cmd_core_size,
+size_t cmd_size,
+struct ibv_create_cq_resp_ex *resp,
+size_t resp_core_size,
+size_t resp_size);
 int ibv_cmd_poll_cq(struct ibv_cq *cq, int ne, struct ibv_wc *wc);
 int ibv_cmd_req_notify_cq(struct ibv_cq *cq, int solicited_only);
 #define IBV_CMD_RESIZE_CQ_HAS_RESP_PARAMS
diff --git a/include/infiniband/kern-abi.h b/include/infiniband/kern-abi.h
index cce6ade..b2dcda6 100644
--- a/include/infiniband/kern-abi.h
+++ b/include/infiniband/kern-abi.h
@@ -110,9 +110,11 @@ enum {
 enum {
IB_USER_VERBS_CMD_QUERY_DEVICE_EX = IB_USER_VERBS_CMD_EXTENDED_MASK |
IB_USER_VERBS_CMD_QUERY_DEVICE,
+   IB_USER_VERBS_CMD_CREATE_CQ_EX = IB_USER_VERBS_CMD_EXTENDED_MASK |
+   IB_USER_VERBS_CMD_CREATE_CQ,
IB_USER_VERBS_CMD_CREATE_FLOW = IB_USER_VERBS_CMD_EXTENDED_MASK +
IB_USER_VERBS_CMD_THRESHOLD,
-   IB_USER_VERBS_CMD_DESTROY_FLOW
+   IB_USER_VERBS_CMD_DESTROY_FLOW,
 };
 
 /*
@@ -400,6 +402,23 @@ struct ibv_create_cq_resp {
__u32 cqe;
 };
 
+struct ibv_create_cq_ex {
+   struct ex_hdr   hdr;
+   __u64   user_handle;
+   __u32   cqe;
+   __u32   comp_vector;
+   __s32   comp_channel;
+   __u32   comp_mask;
+   __u32   flags;
+   __u32   reserved;
+};
+
+struct ibv_create_cq_resp_ex {
+   struct ibv_create_cq_resp   base;
+   __u32   comp_mask;
+   __u32   response_length;
+};
+
 struct ibv_kern_wc {
__u64  wr_id;
__u32  status;
@@ -1033,7 +1052,8 @@ enum {
IB_USER_VERBS_CMD_OPEN_QP_V2 = -1,
IB_USER_VERBS_CMD_CREATE_FLOW_V2 = -1,
IB_USER_VERBS_CMD_DESTROY_FLOW_V2 = -1,
-   IB_USER_VERBS_CMD_QUERY_DEVICE_EX_V2 = -1
+   IB_USER_VERBS_CMD_QUERY_DEVICE_EX_V2 = -1,
+   IB_USER_VERBS_CMD_CREATE_CQ_EX_V2 = -1,
 };
 
 struct ibv_modify_srq_v3 {
diff --git a/include/infiniband/verbs.h b/include/infiniband/verbs.h
index 51b880b..f80126a 100644
--- a/include/infiniband/verbs.h
+++ b/include/infiniband/verbs.h
@@ -1193,6 +1193,42 @@ struct ibv_context {
void   *abi_compat;
 };
 
+enum ibv_create_cq_attr {
+   IBV_CREATE_CQ_ATTR_FLAGS= 1 << 0,
+   IBV_CREATE_CQ_ATTR_RESERVED = 1 << 1
+};
+
+enum ibv_create_cq_attr_flags {
+   IBV_CREATE_CQ_ATTR_COMPLETION_TIMESTAMP = 1 << 0,
+};
+
+struct ibv_create_cq_attr_ex {
+   /* Minimum number of entries required for CQ */
+   int cqe;
+   /* Consumer-supplied context returned for completion events */
+   

[PATCH libibverbs 2/5] Add timestamp_mask and hca_core_clock to ibv_query_device_ex

2015-10-27 Thread Matan Barak
The fields timestamp_mask and hca_core_clock were added
to the extended version of the ibv_query_device verb.
timestamp_mask represents the valid bits of the timestamp;
users can infer from it the accuracy of the reported
timestamp.
hca_core_clock represents the frequency of the HCA (in HZ).
Since the timestamp and the HCA's core clock reading are given
in hardware cycles, knowing the frequency is mandatory in order
to convert these numbers into seconds.
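For example, converting a raw completion-timestamp delta into nanoseconds
might look like the sketch below (hca_core_clock is assumed to be in kHz
here, which is how devinfo prints it; adjust the math if the device really
reports Hz):

#include <stdint.h>
#include <infiniband/verbs.h>

static uint64_t raw_delta_to_ns(const struct ibv_device_attr_ex *attr,
				uint64_t start_raw, uint64_t end_raw)
{
	/* mask the subtraction so a wrap of the free-running counter
	 * still yields the right delta */
	uint64_t cycles = (end_raw - start_raw) &
			  attr->completion_timestamp_mask;

	/* cycles / kHz = milliseconds, so cycles * 1e6 / kHz = nanoseconds */
	return cycles * 1000000ULL / attr->hca_core_clock;
}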

Signed-off-by: Matan Barak 
---
 examples/devinfo.c| 10 ++
 include/infiniband/kern-abi.h |  2 ++
 include/infiniband/verbs.h| 28 +++-
 man/ibv_query_device_ex.3 |  6 --
 src/cmd.c | 21 +
 src/ibverbs.h |  7 +++
 6 files changed, 59 insertions(+), 15 deletions(-)

diff --git a/examples/devinfo.c b/examples/devinfo.c
index a8de982..0af8c3b 100644
--- a/examples/devinfo.c
+++ b/examples/devinfo.c
@@ -339,6 +339,16 @@ static int print_hca_cap(struct ibv_device *ib_dev, 
uint8_t ib_port)
printf("\tlocal_ca_ack_delay:\t\t%d\n", 
device_attr.orig_attr.local_ca_ack_delay);
 
print_odp_caps(&device_attr.odp_caps);
+   if (device_attr.completion_timestamp_mask)
+   printf("\tcompletion timestamp_mask:\t\t\t0x%016lx\n",
+  device_attr.completion_timestamp_mask);
+   else
+   printf("\tcompletion_timestamp_mask not supported\n");
+
+   if (device_attr.hca_core_clock)
+   printf("\thca_core_clock:\t\t\t%lukHZ\n", 
device_attr.hca_core_clock);
+   else
+   printf("\tcore clock not supported\n");
}
 
for (port = 1; port <= device_attr.orig_attr.phys_port_cnt; ++port) {
diff --git a/include/infiniband/kern-abi.h b/include/infiniband/kern-abi.h
index 800c5ab..cce6ade 100644
--- a/include/infiniband/kern-abi.h
+++ b/include/infiniband/kern-abi.h
@@ -267,6 +267,8 @@ struct ibv_query_device_resp_ex {
__u32 comp_mask;
__u32 response_length;
struct ibv_odp_caps_resp odp_caps;
+   __u64 timestamp_mask;
+   __u64 hca_core_clock;
 };
 
 struct ibv_query_port {
diff --git a/include/infiniband/verbs.h b/include/infiniband/verbs.h
index 479bfca..51b880b 100644
--- a/include/infiniband/verbs.h
+++ b/include/infiniband/verbs.h
@@ -204,6 +204,8 @@ struct ibv_device_attr_ex {
struct ibv_device_attr  orig_attr;
uint32_tcomp_mask;
struct ibv_odp_caps odp_caps;
+   uint64_tcompletion_timestamp_mask;
+   uint64_thca_core_clock;
 };
 
 enum ibv_mtu {
@@ -378,6 +380,19 @@ struct ibv_wc {
uint8_t dlid_path_bits;
 };
 
+enum ibv_wc_flags_ex {
+   IBV_WC_EX_GRH   = 1 << 0,
+   IBV_WC_EX_IMM   = 1 << 1,
+   IBV_WC_EX_WITH_BYTE_LEN = 1 << 2,
+   IBV_WC_EX_WITH_IMM  = 1 << 3,
+   IBV_WC_EX_WITH_QP_NUM   = 1 << 4,
+   IBV_WC_EX_WITH_SRC_QP   = 1 << 5,
+   IBV_WC_EX_WITH_PKEY_INDEX   = 1 << 6,
+   IBV_WC_EX_WITH_SLID = 1 << 7,
+   IBV_WC_EX_WITH_SL   = 1 << 8,
+   IBV_WC_EX_WITH_DLID_PATH_BITS   = 1 << 9,
+};
+
 enum {
IBV_WC_FEATURE_FLAGS = IBV_WC_EX_GRH | IBV_WC_EX_IMM
 };
@@ -393,19 +408,6 @@ enum {
 IBV_WC_EX_WITH_DLID_PATH_BITS
 };
 
-enum ibv_wc_flags_ex {
-   IBV_WC_EX_GRH   = 1 << 0,
-   IBV_WC_EX_IMM   = 1 << 1,
-   IBV_WC_EX_WITH_BYTE_LEN = 1 << 2,
-   IBV_WC_EX_WITH_IMM  = 1 << 3,
-   IBV_WC_EX_WITH_QP_NUM   = 1 << 4,
-   IBV_WC_EX_WITH_SRC_QP   = 1 << 5,
-   IBV_WC_EX_WITH_PKEY_INDEX   = 1 << 6,
-   IBV_WC_EX_WITH_SLID = 1 << 7,
-   IBV_WC_EX_WITH_SL   = 1 << 8,
-   IBV_WC_EX_WITH_DLID_PATH_BITS   = 1 << 9,
-};
-
 /* fields order in wc_ex
  * uint32_tbyte_len,
  * uint32_timm_data;   // in network byte order
diff --git a/man/ibv_query_device_ex.3 b/man/ibv_query_device_ex.3
index 1f483d2..db12c2b 100644
--- a/man/ibv_query_device_ex.3
+++ b/man/ibv_query_device_ex.3
@@ -22,8 +22,10 @@ is a pointer to an ibv_device_attr_ex struct, as defined in 

 struct ibv_device_attr_ex {
 .in +8
 struct ibv_device_attr orig_attr;
-uint32_t   comp_mask;  /* Compatibility mask that 
defines which of the following variables are valid */
-struct ibv_odp_capsodp_caps;   /* On-Demand Paging 
capabilities */
+uint32_t   comp_mask;  /* Compatibility mask that 
defines which of the following variables are valid */
+struct ibv_odp_capsodp_caps;   /* On-Demand Paging 
capabilities */
+uint64_t   completion_timestamp_mas

[PATCH libibverbs 3/7] Implement ibv_poll_cq_ex extension verb

2015-10-27 Thread Matan Barak
Add an implementation of the poll_cq extension verb.
This patch implements the new API on top of the standard
mlx4_poll_one function.
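
The shape of the refactoring, as a rough sketch: the common CQE handling
is factored out and reports whether the caller should keep filling in the
(regular or extended) work completion. Types below are simplified
stand-ins for the mlx4 ones, not the actual driver structures.

#include <stdint.h>

enum { CQ_CONTINUE = 1, CQ_OK = 0, CQ_EMPTY = -1, CQ_POLL_ERR = -2 };

struct cq;					/* placeholder type */
struct completion { uint64_t wr_id; int status; uint32_t vendor_err; };

/* Common front end shared by the legacy and extended pollers. */
static int handle_cq(struct cq *cq, struct completion *c)
{
	(void)cq;
	c->wr_id = 0;
	c->status = 0;
	c->vendor_err = 0;
	return CQ_CONTINUE;			/* or CQ_EMPTY / CQ_POLL_ERR */
}

/* Legacy poll_one: fill the fixed work completion layout. */
static int poll_one(struct cq *cq, struct completion *wc)
{
	int err = handle_cq(cq, wc);

	if (err != CQ_CONTINUE)
		return err;
	/* ... fill the remaining fixed fields ... */
	return CQ_OK;
}

/* Extended poll_one: same front end, then only the requested fields. */
static int poll_one_ex(struct cq *cq, struct completion *wc, uint64_t wc_flags)
{
	int err = handle_cq(cq, wc);

	if (err != CQ_CONTINUE)
		return err;
	(void)wc_flags;				/* ... append per wc_flags ... */
	return CQ_OK;
}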

Signed-off-by: Matan Barak 
---
 src/cq.c| 307 ++--
 src/mlx4.c  |   1 +
 src/mlx4.h  |   4 +
 src/verbs.c |   1 +
 4 files changed, 284 insertions(+), 29 deletions(-)

diff --git a/src/cq.c b/src/cq.c
index 32c9070..c86e824 100644
--- a/src/cq.c
+++ b/src/cq.c
@@ -52,6 +52,7 @@ enum {
 };
 
 enum {
+   CQ_CONTINUE =  1,
CQ_OK   =  0,
CQ_EMPTY= -1,
CQ_POLL_ERR = -2
@@ -121,7 +122,9 @@ static void update_cons_index(struct mlx4_cq *cq)
*cq->set_ci_db = htonl(cq->cons_index & 0xffffff);
 }
 
-static void mlx4_handle_error_cqe(struct mlx4_err_cqe *cqe, struct ibv_wc *wc)
+static void mlx4_handle_error_cqe(struct mlx4_err_cqe *cqe,
+ enum ibv_wc_status *status,
+ enum ibv_wc_opcode *vendor_err)
 {
if (cqe->syndrome == MLX4_CQE_SYNDROME_LOCAL_QP_OP_ERR)
printf(PFX "local QP operation err "
@@ -133,64 +136,68 @@ static void mlx4_handle_error_cqe(struct mlx4_err_cqe 
*cqe, struct ibv_wc *wc)
 
switch (cqe->syndrome) {
case MLX4_CQE_SYNDROME_LOCAL_LENGTH_ERR:
-   wc->status = IBV_WC_LOC_LEN_ERR;
+   *status = IBV_WC_LOC_LEN_ERR;
break;
case MLX4_CQE_SYNDROME_LOCAL_QP_OP_ERR:
-   wc->status = IBV_WC_LOC_QP_OP_ERR;
+   *status = IBV_WC_LOC_QP_OP_ERR;
break;
case MLX4_CQE_SYNDROME_LOCAL_PROT_ERR:
-   wc->status = IBV_WC_LOC_PROT_ERR;
+   *status = IBV_WC_LOC_PROT_ERR;
break;
case MLX4_CQE_SYNDROME_WR_FLUSH_ERR:
-   wc->status = IBV_WC_WR_FLUSH_ERR;
+   *status = IBV_WC_WR_FLUSH_ERR;
break;
case MLX4_CQE_SYNDROME_MW_BIND_ERR:
-   wc->status = IBV_WC_MW_BIND_ERR;
+   *status = IBV_WC_MW_BIND_ERR;
break;
case MLX4_CQE_SYNDROME_BAD_RESP_ERR:
-   wc->status = IBV_WC_BAD_RESP_ERR;
+   *status = IBV_WC_BAD_RESP_ERR;
break;
case MLX4_CQE_SYNDROME_LOCAL_ACCESS_ERR:
-   wc->status = IBV_WC_LOC_ACCESS_ERR;
+   *status = IBV_WC_LOC_ACCESS_ERR;
break;
case MLX4_CQE_SYNDROME_REMOTE_INVAL_REQ_ERR:
-   wc->status = IBV_WC_REM_INV_REQ_ERR;
+   *status = IBV_WC_REM_INV_REQ_ERR;
break;
case MLX4_CQE_SYNDROME_REMOTE_ACCESS_ERR:
-   wc->status = IBV_WC_REM_ACCESS_ERR;
+   *status = IBV_WC_REM_ACCESS_ERR;
break;
case MLX4_CQE_SYNDROME_REMOTE_OP_ERR:
-   wc->status = IBV_WC_REM_OP_ERR;
+   *status = IBV_WC_REM_OP_ERR;
break;
case MLX4_CQE_SYNDROME_TRANSPORT_RETRY_EXC_ERR:
-   wc->status = IBV_WC_RETRY_EXC_ERR;
+   *status = IBV_WC_RETRY_EXC_ERR;
break;
case MLX4_CQE_SYNDROME_RNR_RETRY_EXC_ERR:
-   wc->status = IBV_WC_RNR_RETRY_EXC_ERR;
+   *status = IBV_WC_RNR_RETRY_EXC_ERR;
break;
case MLX4_CQE_SYNDROME_REMOTE_ABORTED_ERR:
-   wc->status = IBV_WC_REM_ABORT_ERR;
+   *status = IBV_WC_REM_ABORT_ERR;
break;
default:
-   wc->status = IBV_WC_GENERAL_ERR;
+   *status = IBV_WC_GENERAL_ERR;
break;
}
 
-   wc->vendor_err = cqe->vendor_err;
+   *vendor_err = cqe->vendor_err;
 }
 
-static int mlx4_poll_one(struct mlx4_cq *cq,
-struct mlx4_qp **cur_qp,
-struct ibv_wc *wc)
+static inline int mlx4_handle_cq(struct mlx4_cq *cq,
+struct mlx4_qp **cur_qp,
+uint64_t *wc_wr_id,
+enum ibv_wc_status *wc_status,
+uint32_t *wc_vendor_err,
+struct mlx4_cqe **pcqe,
+uint32_t *pqpn,
+int *pis_send)
 {
struct mlx4_wq *wq;
struct mlx4_cqe *cqe;
struct mlx4_srq *srq;
uint32_t qpn;
-   uint32_t g_mlpath_rqpn;
-   uint16_t wqe_index;
int is_error;
int is_send;
+   uint16_t wqe_index;
 
cqe = next_cqe_sw(cq);
if (!cqe)
@@ -201,7 +208,7 @@ static int mlx4_poll_one(struct mlx4_cq *cq,
 
++cq->cons_index;
 
-   VALGRIND_MAKE_MEM_DEFINED(cqe, sizeof *cqe);
+   VALGRIND_MAKE_MEM_DEFINED(cqe, sizeof(*cqe));
 
/*
 * Make sure we read CQ entry contents after we've checked the
@@ -210,7 +217,6 @@ st

[PATCH libibverbs 5/7] Add support for ibv_query_values_ex

2015-10-27 Thread Matan Barak
Adding mlx4_query_values as the implementation of
ibv_query_values_ex. mlx4_query_values follows the
standard extension verb mechanism.
This function supports reading the hardware clock by
mmapping the required space from the kernel.
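
A minimal sketch of reading such a free-running 64-bit counter exposed as
two 32-bit words in a read-only mmap()ed page. The register layout (high
word first) and the re-read of the high word to catch a wrap are
assumptions modelled on the mlx4_read_clock()/mlx4_map_internal_clock()
code in this patch; endianness conversion is left out.

#include <stdint.h>

static uint64_t read_free_running_clock(const volatile uint32_t *clock_regs)
{
	uint32_t hi, lo, hi1;

	do {
		hi  = clock_regs[0];
		lo  = clock_regs[1];
		hi1 = clock_regs[0];
	} while (hi != hi1);		/* low word wrapped between the reads */

	return ((uint64_t)hi << 32) | lo;
}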

Signed-off-by: Matan Barak 
---
 src/mlx4.c  | 36 
 src/mlx4.h  |  3 +++
 src/verbs.c | 45 +
 3 files changed, 84 insertions(+)

diff --git a/src/mlx4.c b/src/mlx4.c
index cc1211f..6d66cf0 100644
--- a/src/mlx4.c
+++ b/src/mlx4.c
@@ -116,6 +116,28 @@ static struct ibv_context_ops mlx4_ctx_ops = {
.detach_mcast  = ibv_cmd_detach_mcast
 };
 
+static int mlx4_map_internal_clock(struct mlx4_device *dev,
+  struct ibv_context *ibv_ctx)
+{
+   struct mlx4_context *context = to_mctx(ibv_ctx);
+   void *hca_clock_page;
+
+   hca_clock_page = mmap(NULL, dev->page_size, PROT_READ, MAP_SHARED,
+ ibv_ctx->cmd_fd, dev->page_size * 3);
+
+   if (hca_clock_page == MAP_FAILED) {
+   fprintf(stderr, PFX
+   "Warning: Timestamp available,\n"
+   "but failed to mmap() hca core clock page, errno=%d.\n",
+   errno);
+   return -1;
+   }
+
+   context->hca_core_clock = hca_clock_page +
+   context->core_clock_offset % dev->page_size;
+   return 0;
+}
+
 static int mlx4_init_context(struct verbs_device *v_device,
struct ibv_context *ibv_ctx, int cmd_fd)
 {
@@ -127,6 +149,10 @@ static int mlx4_init_context(struct verbs_device *v_device,
__u16   bf_reg_size;
struct mlx4_device  *dev = to_mdev(&v_device->device);
struct verbs_context *verbs_ctx = verbs_get_ctx(ibv_ctx);
+   struct ibv_query_device_ex_input input_query_device = {.comp_mask = 0};
+   struct ibv_device_attr_ex   dev_attrs;
+   uint32_tdev_attrs_comp_mask;
+   int err;
 
/* memory footprint of mlx4_context and verbs_context share
* struct ibv_context.
@@ -194,6 +220,12 @@ static int mlx4_init_context(struct verbs_device *v_device,
context->bf_buf_size = 0;
}
 
+   context->hca_core_clock = NULL;
+   err = _mlx4_query_device_ex(ibv_ctx, &input_query_device, &dev_attrs,
+   sizeof(dev_attrs), &dev_attrs_comp_mask);
+   if (!err && dev_attrs_comp_mask & QUERY_DEVICE_RESP_MASK_TIMESTAMP)
+   mlx4_map_internal_clock(dev, ibv_ctx);
+
pthread_spin_init(&context->uar_lock, PTHREAD_PROCESS_PRIVATE);
ibv_ctx->ops = mlx4_ctx_ops;
 
@@ -210,6 +242,7 @@ static int mlx4_init_context(struct verbs_device *v_device,
verbs_set_ctx_op(verbs_ctx, query_device_ex, mlx4_query_device_ex);
verbs_set_ctx_op(verbs_ctx, create_cq_ex, mlx4_create_cq_ex);
verbs_set_ctx_op(verbs_ctx, poll_cq_ex, mlx4_poll_cq_ex);
+   verbs_set_ctx_op(verbs_ctx, query_values, mlx4_query_values);
 
return 0;
 
@@ -223,6 +256,9 @@ static void mlx4_uninit_context(struct verbs_device 
*v_device,
munmap(context->uar, to_mdev(&v_device->device)->page_size);
if (context->bf_page)
munmap(context->bf_page, to_mdev(&v_device->device)->page_size);
+   if (context->hca_core_clock)
+   munmap(context->hca_core_clock - context->core_clock_offset,
+  to_mdev(&v_device->device)->page_size);
 
 }
 
diff --git a/src/mlx4.h b/src/mlx4.h
index 2465298..8e1935d 100644
--- a/src/mlx4.h
+++ b/src/mlx4.h
@@ -199,6 +199,7 @@ struct mlx4_context {
enum ibv_port_cap_flags caps;
} port_query_cache[MLX4_PORTS_NUM];
uint64_tcore_clock_offset;
+   void   *hca_core_clock;
 };
 
 struct mlx4_buf {
@@ -403,6 +404,8 @@ int _mlx4_query_device_ex(struct ibv_context *context,
 int mlx4_query_device_ex(struct ibv_context *context,
 const struct ibv_query_device_ex_input *input,
 struct ibv_device_attr_ex *attr, size_t attr_size);
+int mlx4_query_values(struct ibv_context *context,
+ struct ibv_values_ex *values);
 int mlx4_query_port(struct ibv_context *context, uint8_t port,
 struct ibv_port_attr *attr);
 
diff --git a/src/verbs.c b/src/verbs.c
index a8d6bd7..843ca1e 100644
--- a/src/verbs.c
+++ b/src/verbs.c
@@ -114,6 +114,51 @@ int mlx4_query_device_ex(struct ibv_context *context,
return _mlx4_query_device_ex(context, input, attr, attr_size, NULL);
 }
 
+#define READL(ptr) (*((uint32_t *)(ptr)))
+static int mlx4_read_clock(struct ibv_context *context, uint64_t *cycles)
+{
+   unsigned int clockhi, clocklo, clockhi1;
+   int i;
+   struct mlx4_context *ctx = to_mctx(context);
+
+   if (!ctx->hca_core_clo

[PATCH libibverbs 4/7] Add timestamp support to extended poll_cq verb

2015-10-27 Thread Matan Barak
Add support to the extended version of the poll_cq verb for reading
the completion timestamp. Reading the timestamp is not supported
together with IBV_WC_EX_WITH_SL and IBV_WC_EX_WITH_SLID.
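
For reference, a sketch of how the 48-bit timestamp is put back together
from the pieces the mlx4 CQE carries (bits 0-7 and 8-15 as single bytes,
bits 16-47 inside a big-endian 32-bit word), following the struct mlx4_cqe
layout below. The in-tree code additionally nudges the upper bits when the
low 16 bits read back as zero; that detail is omitted here.

#include <stdint.h>
#include <arpa/inet.h>		/* ntohl */

static uint64_t cqe_timestamp(uint8_t ts_0_7, uint8_t ts_8_15,
			      uint32_t ts_16_47_be)
{
	uint16_t low = (uint16_t)ts_0_7 | ((uint16_t)ts_8_15 << 8);

	return ((uint64_t)ntohl(ts_16_47_be) << 16) | (uint64_t)low;
}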

Signed-off-by: Matan Barak 
---
 src/cq.c| 10 ++
 src/mlx4.h  | 25 -
 src/verbs.c |  3 ++-
 3 files changed, 32 insertions(+), 6 deletions(-)

diff --git a/src/cq.c b/src/cq.c
index c86e824..7f40f12 100644
--- a/src/cq.c
+++ b/src/cq.c
@@ -399,6 +399,16 @@ static inline int _mlx4_poll_one_ex(struct mlx4_cq *cq,
if (err != CQ_CONTINUE)
return err;
 
+   if (wc_flags & IBV_WC_EX_WITH_COMPLETION_TIMESTAMP) {
+   uint16_t timestamp_0_15 = cqe->timestamp_0_7 |
+   cqe->timestamp_8_15 << 8;
+
+   wc_flags_out |= IBV_WC_EX_WITH_COMPLETION_TIMESTAMP;
+   *wc_buffer.b64++ = (((uint64_t)ntohl(cqe->timestamp_16_47)
++ !timestamp_0_15) << 16) |
+  (uint64_t)timestamp_0_15;
+   }
+
if (is_send) {
switch (cqe->owner_sr_opcode & MLX4_CQE_OPCODE_MASK) {
case MLX4_OPCODE_RDMA_WRITE_IMM:
diff --git a/src/mlx4.h b/src/mlx4.h
index e22f879..2465298 100644
--- a/src/mlx4.h
+++ b/src/mlx4.h
@@ -312,14 +312,29 @@ struct mlx4_cqe {
uint32_tvlan_my_qpn;
uint32_timmed_rss_invalid;
uint32_tg_mlpath_rqpn;
-   uint8_t sl_vid;
-   uint8_t reserved1;
-   uint16_trlid;
-   uint32_tstatus;
+   union {
+   struct {
+   union {
+   struct {
+   uint8_t   sl_vid;
+   uint8_t   reserved1;
+   uint16_t  rlid;
+   };
+   uint32_t  timestamp_16_47;
+   };
+   uint32_t  status;
+   };
+   struct {
+   uint16_t reserved2;
+   uint8_t  smac[6];
+   };
+   };
uint32_tbyte_cnt;
uint16_twqe_index;
uint16_tchecksum;
-   uint8_t reserved3[3];
+   uint8_t reserved3;
+   uint8_t timestamp_8_15;
+   uint8_t timestamp_0_7;
uint8_t owner_sr_opcode;
 };
 
diff --git a/src/verbs.c b/src/verbs.c
index 0dcdc87..a8d6bd7 100644
--- a/src/verbs.c
+++ b/src/verbs.c
@@ -286,7 +286,8 @@ enum {
 };
 
 enum {
-   CREATE_CQ_SUPPORTED_WC_FLAGS = IBV_WC_STANDARD_FLAGS
+   CREATE_CQ_SUPPORTED_WC_FLAGS = IBV_WC_STANDARD_FLAGS|
+  IBV_WC_EX_WITH_COMPLETION_TIMESTAMP
 };
 
 static struct ibv_cq *create_cq(struct ibv_context *context,
-- 
2.1.0



[PATCH libibverbs 1/7] Add support for extended version of ibv_query_device

2015-10-27 Thread Matan Barak
The new mlx4_query_device_ex implementation uses the
extended version of the libibverbs/uverbs query_device command.
In addition, it reads the hca_core_clock offset in the BAR
from the vendor-specific part of the ibv_query_device_ex command.
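
A minimal caller-side sketch of what this enables, along the lines of the
devinfo hunk in the companion libibverbs series; the three-argument
ibv_query_device_ex() wrapper and the kHz unit for hca_core_clock are
taken from that series, and error handling is kept to a minimum.

#include <stdio.h>
#include <string.h>
#include <infiniband/verbs.h>

int main(void)
{
	struct ibv_device **list = ibv_get_device_list(NULL);
	struct ibv_context *ctx;
	struct ibv_device_attr_ex attr;

	if (!list || !list[0])
		return 1;
	ctx = ibv_open_device(list[0]);
	if (!ctx)
		return 1;

	memset(&attr, 0, sizeof(attr));
	if (!ibv_query_device_ex(ctx, NULL, &attr)) {
		if (attr.completion_timestamp_mask)
			printf("completion timestamp_mask: 0x%016llx\n",
			       (unsigned long long)attr.completion_timestamp_mask);
		else
			printf("completion timestamps not supported\n");
		if (attr.hca_core_clock)
			printf("hca_core_clock: %llu kHz\n",
			       (unsigned long long)attr.hca_core_clock);
	}

	ibv_close_device(ctx);
	ibv_free_device_list(list);
	return 0;
}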

Signed-off-by: Matan Barak 
---
 src/mlx4-abi.h | 13 +
 src/mlx4.c |  1 +
 src/mlx4.h |  8 
 src/verbs.c| 54 +++---
 4 files changed, 73 insertions(+), 3 deletions(-)

diff --git a/src/mlx4-abi.h b/src/mlx4-abi.h
index b48f6fc..b348ce3 100644
--- a/src/mlx4-abi.h
+++ b/src/mlx4-abi.h
@@ -111,4 +111,17 @@ struct mlx4_create_qp {
__u8reserved[5];
 };
 
+enum query_device_resp_mask {
+   QUERY_DEVICE_RESP_MASK_TIMESTAMP = 1UL << 0,
+};
+
+struct query_device_ex_resp {
+   struct ibv_query_device_resp_ex core;
+   struct {
+   uint32_t comp_mask;
+   uint32_t response_length;
+   uint64_t hca_core_clock_offset;
+   };
+};
+
 #endif /* MLX4_ABI_H */
diff --git a/src/mlx4.c b/src/mlx4.c
index c30f4bf..d41dff0 100644
--- a/src/mlx4.c
+++ b/src/mlx4.c
@@ -207,6 +207,7 @@ static int mlx4_init_context(struct verbs_device *v_device,
verbs_set_ctx_op(verbs_ctx, open_qp, mlx4_open_qp);
verbs_set_ctx_op(verbs_ctx, ibv_create_flow, ibv_cmd_create_flow);
verbs_set_ctx_op(verbs_ctx, ibv_destroy_flow, ibv_cmd_destroy_flow);
+   verbs_set_ctx_op(verbs_ctx, query_device_ex, mlx4_query_device_ex);
 
return 0;
 
diff --git a/src/mlx4.h b/src/mlx4.h
index 519d8f4..0f643bc 100644
--- a/src/mlx4.h
+++ b/src/mlx4.h
@@ -198,6 +198,7 @@ struct mlx4_context {
uint8_t link_layer;
enum ibv_port_cap_flags caps;
} port_query_cache[MLX4_PORTS_NUM];
+   uint64_tcore_clock_offset;
 };
 
 struct mlx4_buf {
@@ -378,6 +379,13 @@ void mlx4_free_db(struct mlx4_context *context, enum 
mlx4_db_type type, uint32_t
 
 int mlx4_query_device(struct ibv_context *context,
   struct ibv_device_attr *attr);
+int _mlx4_query_device_ex(struct ibv_context *context,
+ const struct ibv_query_device_ex_input *input,
+ struct ibv_device_attr_ex *attr, size_t attr_size,
+ uint32_t *comp_mask);
+int mlx4_query_device_ex(struct ibv_context *context,
+const struct ibv_query_device_ex_input *input,
+struct ibv_device_attr_ex *attr, size_t attr_size);
 int mlx4_query_port(struct ibv_context *context, uint8_t port,
 struct ibv_port_attr *attr);
 
diff --git a/src/verbs.c b/src/verbs.c
index 2cb1f8a..e93114b 100644
--- a/src/verbs.c
+++ b/src/verbs.c
@@ -45,6 +45,14 @@
 #include "mlx4-abi.h"
 #include "wqe.h"
 
+static void parse_raw_fw_ver(uint64_t raw_fw_ver, unsigned *major,
+unsigned *minor, unsigned *sub_minor)
+{
+   *major = (raw_fw_ver >> 32) & 0xffff;
+   *minor = (raw_fw_ver >> 16) & 0xffff;
+   *sub_minor = raw_fw_ver & 0xffff;
+}
+
 int mlx4_query_device(struct ibv_context *context, struct ibv_device_attr 
*attr)
 {
struct ibv_query_device cmd;
@@ -56,9 +64,7 @@ int mlx4_query_device(struct ibv_context *context, struct 
ibv_device_attr *attr)
if (ret)
return ret;
 
-   major = (raw_fw_ver >> 32) & 0xffff;
-   minor = (raw_fw_ver >> 16) & 0xffff;
-   sub_minor = raw_fw_ver & 0xffff;
+   parse_raw_fw_ver(raw_fw_ver, &major, &minor, &sub_minor);
 
snprintf(attr->fw_ver, sizeof attr->fw_ver,
 "%d.%d.%03d", major, minor, sub_minor);
@@ -66,6 +72,48 @@ int mlx4_query_device(struct ibv_context *context, struct 
ibv_device_attr *attr)
return 0;
 }
 
+int _mlx4_query_device_ex(struct ibv_context *context,
+ const struct ibv_query_device_ex_input *input,
+ struct ibv_device_attr_ex *attr, size_t attr_size,
+ uint32_t *comp_mask)
+{
+   struct ibv_query_device_ex cmd;
+   struct query_device_ex_resp resp;
+   uint64_t raw_fw_ver;
+   unsigned major, minor, sub_minor;
+   int ret;
+
+   memset(&resp, 0, sizeof(resp));
+
+   ret = ibv_cmd_query_device_ex(context, input, attr, attr_size,
+ &raw_fw_ver, &cmd, sizeof(cmd),
+ sizeof(cmd), &resp.core,
+ sizeof(resp.core), sizeof(resp));
+   if (ret)
+   return ret;
+
+   parse_raw_fw_ver(raw_fw_ver, &major, &minor, &sub_minor);
+
+   snprintf(attr->orig_attr.fw_ver, sizeof(attr->orig_attr.fw_ver),
+"%d.%d.%03d", major, minor, sub_minor);
+
+   if (resp.comp_mask & QUERY_DEVICE_RESP_MASK_TIMESTAMP)
+   to_mctx(context)->core_clock_offset =
+   re

[PATCH libibverbs 6/7] Add support for different poll_one_ex functions

2015-10-27 Thread Matan Barak
In order to optimize the extended poll_one verb for different
wc_flags values, add support for a poll_one_ex callback function.
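
The shape of the dispatch, as a rough sketch: the CQ stores a poll_one
callback chosen at creation time and the generic poll loop only calls
through the pointer. Types below are simplified stand-ins for the mlx4
ones.

#include <stddef.h>

struct cq;
typedef int (*poll_one_fn)(struct cq *cq, void *wc);

struct cq {
	poll_one_fn poll_one;	/* e.g. mlx4_poll_one_ex or a specialized variant */
	/* ... buffers, doorbells, lock ... */
};

static int poll_cq(struct cq *cq, void *wc, size_t wc_stride, int max_entries)
{
	int n;

	for (n = 0; n < max_entries; ++n) {
		int err = cq->poll_one(cq, (char *)wc + (size_t)n * wc_stride);

		if (err)	/* CQ_EMPTY / CQ_POLL_ERR in the real code */
			break;
	}
	return n;
}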

Signed-off-by: Matan Barak 
---
 src/cq.c| 5 +++--
 src/mlx4.h  | 5 +
 src/verbs.c | 1 +
 3 files changed, 9 insertions(+), 2 deletions(-)

diff --git a/src/cq.c b/src/cq.c
index 7f40f12..1f2d572 100644
--- a/src/cq.c
+++ b/src/cq.c
@@ -601,7 +601,8 @@ int mlx4_poll_cq_ex(struct ibv_cq *ibcq,
int npolled;
int err = CQ_OK;
unsigned int ne = attr->max_entries;
-   uint64_t wc_flags = cq->wc_flags;
+   int (*poll_fn)(struct mlx4_cq *cq, struct mlx4_qp **cur_qp,
+  struct ibv_wc_ex **wc_ex) = cq->mlx4_poll_one;
 
if (attr->comp_mask)
return -EINVAL;
@@ -609,7 +610,7 @@ int mlx4_poll_cq_ex(struct ibv_cq *ibcq,
pthread_spin_lock(&cq->lock);
 
for (npolled = 0; npolled < ne; ++npolled) {
-   err = _mlx4_poll_one_ex(cq, &qp, &wc, wc_flags);
+   err = poll_fn(cq, &qp, &wc);
if (err != CQ_OK)
break;
}
diff --git a/src/mlx4.h b/src/mlx4.h
index 8e1935d..46a18d6 100644
--- a/src/mlx4.h
+++ b/src/mlx4.h
@@ -215,6 +215,8 @@ struct mlx4_pd {
 struct mlx4_cq {
struct ibv_cq   ibv_cq;
uint64_twc_flags;
+   int (*mlx4_poll_one)(struct mlx4_cq *cq, struct mlx4_qp **cur_qp,
+struct ibv_wc_ex **wc_ex);
struct mlx4_buf buf;
struct mlx4_buf resize_buf;
pthread_spinlock_t  lock;
@@ -432,6 +434,9 @@ int mlx4_poll_cq(struct ibv_cq *cq, int ne, struct ibv_wc 
*wc);
 int mlx4_poll_cq_ex(struct ibv_cq *ibcq,
struct ibv_wc_ex *wc,
struct ibv_poll_cq_ex_attr *attr);
+int mlx4_poll_one_ex(struct mlx4_cq *cq,
+struct mlx4_qp **cur_qp,
+struct ibv_wc_ex **pwc_ex);
 int mlx4_arm_cq(struct ibv_cq *cq, int solicited);
 void mlx4_cq_event(struct ibv_cq *cq);
 void __mlx4_cq_clean(struct mlx4_cq *cq, uint32_t qpn, struct mlx4_srq *srq);
diff --git a/src/verbs.c b/src/verbs.c
index 843ca1e..62908c1 100644
--- a/src/verbs.c
+++ b/src/verbs.c
@@ -432,6 +432,7 @@ static struct ibv_cq *create_cq(struct ibv_context *context,
if (ret)
goto err_db;
 
+   cq->mlx4_poll_one = mlx4_poll_one_ex;
cq->creation_flags = cmd_e.ibv_cmd.flags;
cq->wc_flags = cq_attr->wc_flags;
cq->cqn = resp.cqn;
-- 
2.1.0



[PATCH libibverbs 7/7] Optimize ibv_poll_cq_ex for common scenarios

2015-10-27 Thread Matan Barak
The current ibv_poll_cq_ex mechanism needs to check at runtime
whether every field was requested. In order to avoid this penalty,
add optimized functions for common special cases.
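
A self-contained sketch of the technique (the flag names are placeholders,
not the real IBV_WC_EX_* values): when the yes/no masks are compile-time
constants in a specialized wrapper, an always_inline helper lets the
compiler discard the untaken branches entirely, which is the idea behind
the IS_IN_WC_FLAGS() macro and the per-combination poll_one variants
added here.

#include <stdint.h>

#define FLAG_TIMESTAMP	(1u << 0)
#define FLAG_QP_NUM	(1u << 1)

#define HAS_FLAG(yes, no, maybe, f) \
	(((yes) & (f)) || (!((no) & (f)) && ((maybe) & (f))))

static inline uint32_t poll_fields(uint32_t cq_flags, uint32_t yes, uint32_t no)
	__attribute__((always_inline));
static inline uint32_t poll_fields(uint32_t cq_flags, uint32_t yes, uint32_t no)
{
	uint32_t written = 0;

	if (HAS_FLAG(yes, no, cq_flags, FLAG_TIMESTAMP))
		written |= FLAG_TIMESTAMP;	/* would copy the timestamp */
	if (HAS_FLAG(yes, no, cq_flags, FLAG_QP_NUM))
		written |= FLAG_QP_NUM;		/* would copy the qp number */
	return written;
}

/* Generic entry point: keeps the runtime checks. */
uint32_t poll_fields_generic(uint32_t cq_flags)
{
	return poll_fields(cq_flags, 0, 0);
}

/* Timestamp-only entry point: compiles down to straight-line code. */
uint32_t poll_fields_timestamp_only(uint32_t cq_flags)
{
	return poll_fields(cq_flags, FLAG_TIMESTAMP, ~(uint32_t)FLAG_TIMESTAMP);
}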

Signed-off-by: Matan Barak 
---
 configure.ac |  17 
 src/cq.c | 268 ++-
 src/mlx4.h   |  20 -
 src/verbs.c  |  10 +--
 4 files changed, 271 insertions(+), 44 deletions(-)

diff --git a/configure.ac b/configure.ac
index 6e98f20..9dbbb4b 100644
--- a/configure.ac
+++ b/configure.ac
@@ -45,6 +45,23 @@ AC_CHECK_MEMBER([struct verbs_context.ibv_create_flow], [],
 [AC_MSG_ERROR([libmlx4 requires libibverbs >= 1.2.0])],
 [[#include ]])
 
+AC_MSG_CHECKING("always inline")
+CFLAGS_BAK="$CFLAGS"
+CFLAGS="$CFLAGS -Werror"
+AC_COMPILE_IFELSE([AC_LANG_PROGRAM([[
+   static inline int f(void)
+   __attribute((always_inline));
+   static inline int f(void)
+   {
+   return 1;
+   }
+]],[[
+   int a = f();
+   a = a;
+]])], [AC_MSG_RESULT([yes]) AC_DEFINE([HAVE_ALWAYS_INLINE], [1], [Define if 
__attribute((always_inline)).])],
+[AC_MSG_RESULT([no])])
+CFLAGS="$CFLAGS_BAK"
+
 dnl Checks for typedefs, structures, and compiler characteristics.
 AC_C_CONST
 AC_CHECK_SIZEOF(long)
diff --git a/src/cq.c b/src/cq.c
index 1f2d572..56c0fa4 100644
--- a/src/cq.c
+++ b/src/cq.c
@@ -377,10 +377,22 @@ union wc_buffer {
uint64_t*b64;
 };
 
+#define IS_IN_WC_FLAGS(yes, no, maybe, flag) (((yes) & (flag)) ||\
+ (!((no) & (flag)) && \
+  ((maybe) & (flag))))
 static inline int _mlx4_poll_one_ex(struct mlx4_cq *cq,
struct mlx4_qp **cur_qp,
struct ibv_wc_ex **pwc_ex,
-   uint64_t wc_flags)
+   uint64_t wc_flags,
+   uint64_t yes_wc_flags,
+   uint64_t no_wc_flags)
+   ALWAYS_INLINE;
+static inline int _mlx4_poll_one_ex(struct mlx4_cq *cq,
+   struct mlx4_qp **cur_qp,
+   struct ibv_wc_ex **pwc_ex,
+   uint64_t wc_flags,
+   uint64_t wc_flags_yes,
+   uint64_t wc_flags_no)
 {
struct mlx4_cqe *cqe;
uint32_t qpn;
@@ -392,14 +404,14 @@ static inline int _mlx4_poll_one_ex(struct mlx4_cq *cq,
uint64_t wc_flags_out = 0;
 
wc_buffer.b64 = (uint64_t *)&wc_ex->buffer;
-   wc_ex->wc_flags = 0;
wc_ex->reserved = 0;
err = mlx4_handle_cq(cq, cur_qp, &wc_ex->wr_id, &wc_ex->status,
 &wc_ex->vendor_err, &cqe, &qpn, &is_send);
if (err != CQ_CONTINUE)
return err;
 
-   if (wc_flags & IBV_WC_EX_WITH_COMPLETION_TIMESTAMP) {
+   if (IS_IN_WC_FLAGS(wc_flags_yes, wc_flags_no, wc_flags,
+  IBV_WC_EX_WITH_COMPLETION_TIMESTAMP)) {
uint16_t timestamp_0_15 = cqe->timestamp_0_7 |
cqe->timestamp_8_15 << 8;
 
@@ -415,80 +427,101 @@ static inline int _mlx4_poll_one_ex(struct mlx4_cq *cq,
wc_flags_out |= IBV_WC_EX_IMM;
case MLX4_OPCODE_RDMA_WRITE:
wc_ex->opcode= IBV_WC_RDMA_WRITE;
-   if (wc_flags & IBV_WC_EX_WITH_BYTE_LEN)
+   if (IS_IN_WC_FLAGS(wc_flags_yes, wc_flags_no, wc_flags,
+  IBV_WC_EX_WITH_BYTE_LEN))
wc_buffer.b32++;
-   if (wc_flags & IBV_WC_EX_WITH_IMM)
+   if (IS_IN_WC_FLAGS(wc_flags_yes, wc_flags_no, wc_flags,
+  IBV_WC_EX_WITH_IMM))
wc_buffer.b32++;
break;
case MLX4_OPCODE_SEND_IMM:
wc_flags_out |= IBV_WC_EX_IMM;
case MLX4_OPCODE_SEND:
wc_ex->opcode= IBV_WC_SEND;
-   if (wc_flags & IBV_WC_EX_WITH_BYTE_LEN)
+   if (IS_IN_WC_FLAGS(wc_flags_yes, wc_flags_no, wc_flags,
+  IBV_WC_EX_WITH_BYTE_LEN))
wc_buffer.b32++;
-   if (wc_flags & IBV_WC_EX_WITH_IMM)
+   if (IS_IN_WC_FLAGS(wc_flags_yes, wc_flags_no, wc_flags,
+  IBV_WC_EX_WITH_IMM))
wc_buffer.b32++;
break;
case MLX4_OPCODE_RDMA_READ:
wc_ex->opcode= IBV_WC_RDMA_READ;
-   if (wc_flags & IBV_WC_EX_WITH_BYTE_LEN) {
+   if (IS_IN_WC_FLAGS(wc_flags_yes, wc_fla

[PATCH libibverbs 0/7] Completion timestamping

2015-10-27 Thread Matan Barak
Hi Yishai,

This series adds support for completion timestamps. In order to
support this feature, several extended verbs were implemented
(as instructed in libibverbs).

ibv_query_device_ex was extended to support reading the
hca_core_clock and timestamp mask. The same verb was extended
with vendor-dependent data which is used in order to map the
HCA's free-running clock register.
When libmlx4 initializes, it tries to mmap this free-running
clock register. This mapping is used in order to implement
ibv_query_values_ex efficiently.

In order to support CQ completion timestamp reporting, we implement
the ibv_create_cq_ex verb. This verb is used both to create a CQ
which supports timestamps and to state which fields should be
returned via the WC. Returning this data is done by implementing
ibv_poll_cq_ex. For every field the user has requested we check the
CQ's requested wc_flags and populate it according to the network
operation carried and the WC status.

Last but not least, ibv_poll_cq_ex was optimized in order to eliminate
the if statements and OR operations for common combinations of wc
fields. This is done by inlining and using a custom poll_one_ex
function for these combinations.
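
As a rough caller-side sketch of the flow described above (names and
structure layout follow this series and the libibverbs series it builds
on, as posted; they are not necessarily the API that is finally merged,
so treat the field names and signatures as assumptions):

#include <stdio.h>
#include <string.h>
#include <infiniband/verbs.h>

static void timestamp_cq_example(struct ibv_context *ctx)
{
	struct ibv_create_cq_attr_ex cq_attr;
	struct ibv_cq *cq;

	memset(&cq_attr, 0, sizeof(cq_attr));	/* channel/comp_vector left at 0 */
	cq_attr.cqe = 256;
	cq_attr.comp_mask = IBV_CREATE_CQ_ATTR_FLAGS;
	cq_attr.flags = IBV_CREATE_CQ_ATTR_COMPLETION_TIMESTAMP;
	cq_attr.wc_flags = IBV_WC_STANDARD_FLAGS |
			   IBV_WC_EX_WITH_COMPLETION_TIMESTAMP;

	cq = ibv_create_cq_ex(ctx, &cq_attr);
	if (!cq) {
		fprintf(stderr, "create_cq_ex failed\n");
		return;
	}

	/*
	 * Polling would use ibv_poll_cq_ex(); each returned ibv_wc_ex
	 * carries wc_flags saying which optional fields were actually
	 * written into its trailing buffer, the completion timestamp
	 * among them:
	 *
	 *	struct ibv_poll_cq_ex_attr attr = { .max_entries = 16 };
	 *	n = ibv_poll_cq_ex(cq, wc_ex, &attr);
	 */

	ibv_destroy_cq(cq);
}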

Thanks,
Matan

Matan Barak (7):
  Add support for extended version of ibv_query_device
  Add support for ibv_create_cq_ex
  Implement ibv_poll_cq_ex extension verb
  Add timestamp support to extended poll_cq verb
  Add support for ibv_query_values_ex
  Add support for different poll_one_ex functions
  Optimize ibv_poll_cq_ex for common scenarios

 configure.ac   |  17 ++
 src/cq.c   | 512 +
 src/mlx4-abi.h |  25 +++
 src/mlx4.c |  39 +
 src/mlx4.h |  64 +++-
 src/verbs.c| 219 +---
 6 files changed, 823 insertions(+), 53 deletions(-)

-- 
2.1.0



[PATCH libibverbs 2/7] Add support for ibv_create_cq_ex

2015-10-27 Thread Matan Barak
Add an extension verb mlx4_create_cq_ex that follows the
standard extension verb mechanism.
This function is similar to mlx4_create_cq but supports the
extension verb attributes and stores the creation flags
for later use (for example, the timestamp flag is used in poll_cq).
The function fails if the user passes unsupported WC attributes.

Signed-off-by: Matan Barak 
---
 src/mlx4-abi.h |  12 ++
 src/mlx4.c |   1 +
 src/mlx4.h |   3 ++
 src/verbs.c| 117 +
 4 files changed, 117 insertions(+), 16 deletions(-)

diff --git a/src/mlx4-abi.h b/src/mlx4-abi.h
index b348ce3..9b765e4 100644
--- a/src/mlx4-abi.h
+++ b/src/mlx4-abi.h
@@ -72,12 +72,24 @@ struct mlx4_create_cq {
__u64   db_addr;
 };
 
+struct mlx4_create_cq_ex {
+   struct ibv_create_cq_ex ibv_cmd;
+   __u64   buf_addr;
+   __u64   db_addr;
+};
+
 struct mlx4_create_cq_resp {
struct ibv_create_cq_resp   ibv_resp;
__u32   cqn;
__u32   reserved;
 };
 
+struct mlx4_create_cq_resp_ex {
+   struct ibv_create_cq_resp_exibv_resp;
+   __u32   cqn;
+   __u32   reserved;
+};
+
 struct mlx4_resize_cq {
struct ibv_resize_cqibv_cmd;
__u64   buf_addr;
diff --git a/src/mlx4.c b/src/mlx4.c
index d41dff0..9cfd013 100644
--- a/src/mlx4.c
+++ b/src/mlx4.c
@@ -208,6 +208,7 @@ static int mlx4_init_context(struct verbs_device *v_device,
verbs_set_ctx_op(verbs_ctx, ibv_create_flow, ibv_cmd_create_flow);
verbs_set_ctx_op(verbs_ctx, ibv_destroy_flow, ibv_cmd_destroy_flow);
verbs_set_ctx_op(verbs_ctx, query_device_ex, mlx4_query_device_ex);
+   verbs_set_ctx_op(verbs_ctx, create_cq_ex, mlx4_create_cq_ex);
 
return 0;
 
diff --git a/src/mlx4.h b/src/mlx4.h
index 0f643bc..91eb79c 100644
--- a/src/mlx4.h
+++ b/src/mlx4.h
@@ -222,6 +222,7 @@ struct mlx4_cq {
uint32_t   *arm_db;
int arm_sn;
int cqe_size;
+   int creation_flags;
 };
 
 struct mlx4_srq {
@@ -402,6 +403,8 @@ int mlx4_dereg_mr(struct ibv_mr *mr);
 struct ibv_cq *mlx4_create_cq(struct ibv_context *context, int cqe,
   struct ibv_comp_channel *channel,
   int comp_vector);
+struct ibv_cq *mlx4_create_cq_ex(struct ibv_context *context,
+struct ibv_create_cq_attr_ex *cq_attr);
 int mlx4_alloc_cq_buf(struct mlx4_device *dev, struct mlx4_buf *buf, int nent,
  int entry_size);
 int mlx4_resize_cq(struct ibv_cq *cq, int cqe);
diff --git a/src/verbs.c b/src/verbs.c
index e93114b..3290b86 100644
--- a/src/verbs.c
+++ b/src/verbs.c
@@ -272,19 +272,69 @@ int align_queue_size(int req)
return nent;
 }
 
-struct ibv_cq *mlx4_create_cq(struct ibv_context *context, int cqe,
-  struct ibv_comp_channel *channel,
-  int comp_vector)
+enum cmd_type {
+   MLX4_CMD_TYPE_BASIC,
+   MLX4_CMD_TYPE_EXTENDED
+};
+
+enum {
+   CREATE_CQ_SUPPORTED_COMP_MASK = IBV_CREATE_CQ_ATTR_FLAGS
+};
+
+enum {
+   CREATE_CQ_SUPPORTED_FLAGS = IBV_CREATE_CQ_ATTR_COMPLETION_TIMESTAMP
+};
+
+enum {
+   CREATE_CQ_SUPPORTED_WC_FLAGS = IBV_WC_STANDARD_FLAGS
+};
+
+static struct ibv_cq *create_cq(struct ibv_context *context,
+   struct ibv_create_cq_attr_ex *cq_attr,
+   enum cmd_type cmd_type)
 {
-   struct mlx4_create_cq  cmd;
-   struct mlx4_create_cq_resp resp;
-   struct mlx4_cq*cq;
-   intret;
-   struct mlx4_context   *mctx = to_mctx(context);
+   struct mlx4_create_cq   cmd;
+   struct mlx4_create_cq_excmd_e;
+   struct mlx4_create_cq_resp  resp;
+   struct mlx4_create_cq_resp_ex   resp_e;
+   struct mlx4_cq  *cq;
+   int ret;
+   struct mlx4_context *mctx = to_mctx(context);
+   struct ibv_create_cq_attr_excq_attr_e;
+   int cqe;
 
/* Sanity check CQ size before proceeding */
-   if (cqe > 0x3fffff)
+   if (cq_attr->cqe > 0x3fffff)
+   return NULL;
+
+   if (cq_attr->comp_mask & ~CREATE_CQ_SUPPORTED_COMP_MASK) {
+   errno = EINVAL;
return NULL;
+   }
+
+   if (cq_attr->comp_mask & IBV_CREATE_CQ_ATTR_FLAGS &&
+   cq_attr->flags & ~CREATE_CQ_SUPPORTED_FLAGS) {
+   errno = EINVAL;
+   return NULL;
+   }
+
+   if (cq_attr->wc_flags & ~CREATE_CQ_SUPPORTED_WC_FLAGS) {
+   errno = ENOTSUP;
+   return 

Re: [PATCH libibverbs 6/7] Add support for different poll_one_ex functions

2015-10-27 Thread Matan Barak
On Tue, Oct 27, 2015 at 6:52 PM, Matan Barak  wrote:
> In order to opitimize the poll_one extended verb for different
> wc_flags, add support for poll_one_ex callback function.
>
> Signed-off-by: Matan Barak 
> ---
>  src/cq.c| 5 +++--
>  src/mlx4.h  | 5 +
>  src/verbs.c | 1 +
>  3 files changed, 9 insertions(+), 2 deletions(-)
>
> diff --git a/src/cq.c b/src/cq.c
> index 7f40f12..1f2d572 100644
> --- a/src/cq.c
> +++ b/src/cq.c
> @@ -601,7 +601,8 @@ int mlx4_poll_cq_ex(struct ibv_cq *ibcq,
> int npolled;
> int err = CQ_OK;
> unsigned int ne = attr->max_entries;
> -   uint64_t wc_flags = cq->wc_flags;
> +   int (*poll_fn)(struct mlx4_cq *cq, struct mlx4_qp **cur_qp,
> +  struct ibv_wc_ex **wc_ex) = cq->mlx4_poll_one;
>
> if (attr->comp_mask)
> return -EINVAL;
> @@ -609,7 +610,7 @@ int mlx4_poll_cq_ex(struct ibv_cq *ibcq,
> pthread_spin_lock(&cq->lock);
>
> for (npolled = 0; npolled < ne; ++npolled) {
> -   err = _mlx4_poll_one_ex(cq, &qp, &wc, wc_flags);
> +   err = poll_fn(cq, &qp, &wc);
> if (err != CQ_OK)
> break;
> }
> diff --git a/src/mlx4.h b/src/mlx4.h
> index 8e1935d..46a18d6 100644
> --- a/src/mlx4.h
> +++ b/src/mlx4.h
> @@ -215,6 +215,8 @@ struct mlx4_pd {
>  struct mlx4_cq {
> struct ibv_cq   ibv_cq;
> uint64_twc_flags;
> +   int (*mlx4_poll_one)(struct mlx4_cq *cq, struct mlx4_qp **cur_qp,
> +struct ibv_wc_ex **wc_ex);
> struct mlx4_buf buf;
> struct mlx4_buf resize_buf;
> pthread_spinlock_t  lock;
> @@ -432,6 +434,9 @@ int mlx4_poll_cq(struct ibv_cq *cq, int ne, struct ibv_wc 
> *wc);
>  int mlx4_poll_cq_ex(struct ibv_cq *ibcq,
> struct ibv_wc_ex *wc,
> struct ibv_poll_cq_ex_attr *attr);
> +int mlx4_poll_one_ex(struct mlx4_cq *cq,
> +struct mlx4_qp **cur_qp,
> +struct ibv_wc_ex **pwc_ex);
>  int mlx4_arm_cq(struct ibv_cq *cq, int solicited);
>  void mlx4_cq_event(struct ibv_cq *cq);
>  void __mlx4_cq_clean(struct mlx4_cq *cq, uint32_t qpn, struct mlx4_srq *srq);
> diff --git a/src/verbs.c b/src/verbs.c
> index 843ca1e..62908c1 100644
> --- a/src/verbs.c
> +++ b/src/verbs.c
> @@ -432,6 +432,7 @@ static struct ibv_cq *create_cq(struct ibv_context 
> *context,
> if (ret)
> goto err_db;
>
> +   cq->mlx4_poll_one = mlx4_poll_one_ex;
> cq->creation_flags = cmd_e.ibv_cmd.flags;
> cq->wc_flags = cq_attr->wc_flags;
> cq->cqn = resp.cqn;
> --
> 2.1.0
>

This should have libmlx4 prefix.


Re: [PATCH libibverbs 1/7] Add support for extended version of ibv_query_device

2015-10-27 Thread Matan Barak
On Tue, Oct 27, 2015 at 6:52 PM, Matan Barak  wrote:
> The new mlx4_query_device_ex implementation uses the
> extended version of libibverbs/uverbs query_device command.
> In addition, it reads the hca_core_clock offset in the bar
> from the vendor specific part of ibv_query_device_ex command.
>
> Signed-off-by: Matan Barak 
> ---
>  src/mlx4-abi.h | 13 +
>  src/mlx4.c |  1 +
>  src/mlx4.h |  8 
>  src/verbs.c| 54 +++---
>  4 files changed, 73 insertions(+), 3 deletions(-)
>
> diff --git a/src/mlx4-abi.h b/src/mlx4-abi.h
> index b48f6fc..b348ce3 100644
> --- a/src/mlx4-abi.h
> +++ b/src/mlx4-abi.h
> @@ -111,4 +111,17 @@ struct mlx4_create_qp {
> __u8reserved[5];
>  };
>
> +enum query_device_resp_mask {
> +   QUERY_DEVICE_RESP_MASK_TIMESTAMP = 1UL << 0,
> +};
> +
> +struct query_device_ex_resp {
> +   struct ibv_query_device_resp_ex core;
> +   struct {
> +   uint32_t comp_mask;
> +   uint32_t response_length;
> +   uint64_t hca_core_clock_offset;
> +   };
> +};
> +
>  #endif /* MLX4_ABI_H */
> diff --git a/src/mlx4.c b/src/mlx4.c
> index c30f4bf..d41dff0 100644
> --- a/src/mlx4.c
> +++ b/src/mlx4.c
> @@ -207,6 +207,7 @@ static int mlx4_init_context(struct verbs_device 
> *v_device,
> verbs_set_ctx_op(verbs_ctx, open_qp, mlx4_open_qp);
> verbs_set_ctx_op(verbs_ctx, ibv_create_flow, ibv_cmd_create_flow);
> verbs_set_ctx_op(verbs_ctx, ibv_destroy_flow, ibv_cmd_destroy_flow);
> +   verbs_set_ctx_op(verbs_ctx, query_device_ex, mlx4_query_device_ex);
>
> return 0;
>
> diff --git a/src/mlx4.h b/src/mlx4.h
> index 519d8f4..0f643bc 100644
> --- a/src/mlx4.h
> +++ b/src/mlx4.h
> @@ -198,6 +198,7 @@ struct mlx4_context {
> uint8_t link_layer;
> enum ibv_port_cap_flags caps;
> } port_query_cache[MLX4_PORTS_NUM];
> +   uint64_tcore_clock_offset;
>  };
>
>  struct mlx4_buf {
> @@ -378,6 +379,13 @@ void mlx4_free_db(struct mlx4_context *context, enum 
> mlx4_db_type type, uint32_t
>
>  int mlx4_query_device(struct ibv_context *context,
>struct ibv_device_attr *attr);
> +int _mlx4_query_device_ex(struct ibv_context *context,
> + const struct ibv_query_device_ex_input *input,
> + struct ibv_device_attr_ex *attr, size_t attr_size,
> + uint32_t *comp_mask);
> +int mlx4_query_device_ex(struct ibv_context *context,
> +const struct ibv_query_device_ex_input *input,
> +struct ibv_device_attr_ex *attr, size_t attr_size);
>  int mlx4_query_port(struct ibv_context *context, uint8_t port,
>  struct ibv_port_attr *attr);
>
> diff --git a/src/verbs.c b/src/verbs.c
> index 2cb1f8a..e93114b 100644
> --- a/src/verbs.c
> +++ b/src/verbs.c
> @@ -45,6 +45,14 @@
>  #include "mlx4-abi.h"
>  #include "wqe.h"
>
> +static void parse_raw_fw_ver(uint64_t raw_fw_ver, unsigned *major,
> +unsigned *minor, unsigned *sub_minor)
> +{
> +   *major = (raw_fw_ver >> 32) & 0xffff;
> +   *minor = (raw_fw_ver >> 16) & 0xffff;
> +   *sub_minor = raw_fw_ver & 0xffff;
> +}
> +
>  int mlx4_query_device(struct ibv_context *context, struct ibv_device_attr 
> *attr)
>  {
> struct ibv_query_device cmd;
> @@ -56,9 +64,7 @@ int mlx4_query_device(struct ibv_context *context, struct 
> ibv_device_attr *attr)
> if (ret)
> return ret;
>
> -   major = (raw_fw_ver >> 32) & 0xffff;
> -   minor = (raw_fw_ver >> 16) & 0xffff;
> -   sub_minor = raw_fw_ver & 0xffff;
> +   parse_raw_fw_ver(raw_fw_ver, &major, &minor, &sub_minor);
>
> snprintf(attr->fw_ver, sizeof attr->fw_ver,
>  "%d.%d.%03d", major, minor, sub_minor);
> @@ -66,6 +72,48 @@ int mlx4_query_device(struct ibv_context *context, struct 
> ibv_device_attr *attr)
> return 0;
>  }
>
> +int _mlx4_query_device_ex(struct ibv_context *context,
> + const struct ibv_query_device_ex_input *input,
> + struct ibv_device_attr_ex *attr, size_t attr_size,
> + uint32_t *comp_mask)
> +{
> +   struct ibv_query_device_ex cmd;
> +   struct query_device_ex_resp resp;
> +   uint64_t raw_fw_ver;
> +   unsigned major, minor, sub_minor;
> +   int ret;
> +
> +   memset(&resp, 0, sizeof(resp));
> +
> +   ret = ibv_cmd_query_device_ex(context, input, attr, attr_size,
> + &raw_fw_ver, &cmd, sizeof(cmd),
> + sizeof(cmd), &resp.core,
> + sizeof(resp.core), sizeof(resp));
> +   if (ret)
> +   return ret;
> +
> +   parse_raw_fw_ver(raw_fw_ver, &major, 

Re: [PATCH libibverbs 4/7] Add timestamp support to extended poll_cq verb

2015-10-27 Thread Matan Barak
On Tue, Oct 27, 2015 at 6:52 PM, Matan Barak  wrote:
> Adding support to the extended version of poll_cq verb to read
> completion timestamp. Reading timestamp isn't supported with reading
> IBV_WC_EX_WITH_SL and IBV_WC_EX_WITH_SLID.
>
> Signed-off-by: Matan Barak 
> ---
>  src/cq.c| 10 ++
>  src/mlx4.h  | 25 -
>  src/verbs.c |  3 ++-
>  3 files changed, 32 insertions(+), 6 deletions(-)
>
> diff --git a/src/cq.c b/src/cq.c
> index c86e824..7f40f12 100644
> --- a/src/cq.c
> +++ b/src/cq.c
> @@ -399,6 +399,16 @@ static inline int _mlx4_poll_one_ex(struct mlx4_cq *cq,
> if (err != CQ_CONTINUE)
> return err;
>
> +   if (wc_flags & IBV_WC_EX_WITH_COMPLETION_TIMESTAMP) {
> +   uint16_t timestamp_0_15 = cqe->timestamp_0_7 |
> +   cqe->timestamp_8_15 << 8;
> +
> +   wc_flags_out |= IBV_WC_EX_WITH_COMPLETION_TIMESTAMP;
> +   *wc_buffer.b64++ = (((uint64_t)ntohl(cqe->timestamp_16_47)
> ++ !timestamp_0_15) << 16) |
> +  (uint64_t)timestamp_0_15;
> +   }
> +
> if (is_send) {
> switch (cqe->owner_sr_opcode & MLX4_CQE_OPCODE_MASK) {
> case MLX4_OPCODE_RDMA_WRITE_IMM:
> diff --git a/src/mlx4.h b/src/mlx4.h
> index e22f879..2465298 100644
> --- a/src/mlx4.h
> +++ b/src/mlx4.h
> @@ -312,14 +312,29 @@ struct mlx4_cqe {
> uint32_tvlan_my_qpn;
> uint32_timmed_rss_invalid;
> uint32_tg_mlpath_rqpn;
> -   uint8_t sl_vid;
> -   uint8_t reserved1;
> -   uint16_trlid;
> -   uint32_tstatus;
> +   union {
> +   struct {
> +   union {
> +   struct {
> +   uint8_t   sl_vid;
> +   uint8_t   reserved1;
> +   uint16_t  rlid;
> +   };
> +   uint32_t  timestamp_16_47;
> +   };
> +   uint32_t  status;
> +   };
> +   struct {
> +   uint16_t reserved2;
> +   uint8_t  smac[6];
> +   };
> +   };
> uint32_tbyte_cnt;
> uint16_twqe_index;
> uint16_tchecksum;
> -   uint8_t reserved3[3];
> +   uint8_t reserved3;
> +   uint8_t timestamp_8_15;
> +   uint8_t timestamp_0_7;
> uint8_t owner_sr_opcode;
>  };
>
> diff --git a/src/verbs.c b/src/verbs.c
> index 0dcdc87..a8d6bd7 100644
> --- a/src/verbs.c
> +++ b/src/verbs.c
> @@ -286,7 +286,8 @@ enum {
>  };
>
>  enum {
> -   CREATE_CQ_SUPPORTED_WC_FLAGS = IBV_WC_STANDARD_FLAGS
> +   CREATE_CQ_SUPPORTED_WC_FLAGS = IBV_WC_STANDARD_FLAGS|
> +  IBV_WC_EX_WITH_COMPLETION_TIMESTAMP
>  };
>
>  static struct ibv_cq *create_cq(struct ibv_context *context,
> --
> 2.1.0
>

This should have libmlx4 prefix.


Re: [PATCH libibverbs 2/7] Add support for ibv_create_cq_ex

2015-10-27 Thread Matan Barak
On Tue, Oct 27, 2015 at 6:52 PM, Matan Barak  wrote:
> Add an extension verb mlx4_create_cq_ex that follows the
> standard extension verb mechanism.
> This function is similar to mlx4_create_cq but supports the
> extension verbs functions and stores the creation flags
> for later use (for example, timestamp flag is used in poll_cq).
> The function fails if the user passes unsupported WC attributes.
>
> Signed-off-by: Matan Barak 
> ---
>  src/mlx4-abi.h |  12 ++
>  src/mlx4.c |   1 +
>  src/mlx4.h |   3 ++
>  src/verbs.c| 117 
> +
>  4 files changed, 117 insertions(+), 16 deletions(-)
>
> diff --git a/src/mlx4-abi.h b/src/mlx4-abi.h
> index b348ce3..9b765e4 100644
> --- a/src/mlx4-abi.h
> +++ b/src/mlx4-abi.h
> @@ -72,12 +72,24 @@ struct mlx4_create_cq {
> __u64   db_addr;
>  };
>
> +struct mlx4_create_cq_ex {
> +   struct ibv_create_cq_ex ibv_cmd;
> +   __u64   buf_addr;
> +   __u64   db_addr;
> +};
> +
>  struct mlx4_create_cq_resp {
> struct ibv_create_cq_resp   ibv_resp;
> __u32   cqn;
> __u32   reserved;
>  };
>
> +struct mlx4_create_cq_resp_ex {
> +   struct ibv_create_cq_resp_exibv_resp;
> +   __u32   cqn;
> +   __u32   reserved;
> +};
> +
>  struct mlx4_resize_cq {
> struct ibv_resize_cqibv_cmd;
> __u64   buf_addr;
> diff --git a/src/mlx4.c b/src/mlx4.c
> index d41dff0..9cfd013 100644
> --- a/src/mlx4.c
> +++ b/src/mlx4.c
> @@ -208,6 +208,7 @@ static int mlx4_init_context(struct verbs_device 
> *v_device,
> verbs_set_ctx_op(verbs_ctx, ibv_create_flow, ibv_cmd_create_flow);
> verbs_set_ctx_op(verbs_ctx, ibv_destroy_flow, ibv_cmd_destroy_flow);
> verbs_set_ctx_op(verbs_ctx, query_device_ex, mlx4_query_device_ex);
> +   verbs_set_ctx_op(verbs_ctx, create_cq_ex, mlx4_create_cq_ex);
>
> return 0;
>
> diff --git a/src/mlx4.h b/src/mlx4.h
> index 0f643bc..91eb79c 100644
> --- a/src/mlx4.h
> +++ b/src/mlx4.h
> @@ -222,6 +222,7 @@ struct mlx4_cq {
> uint32_t   *arm_db;
> int arm_sn;
> int cqe_size;
> +   int creation_flags;
>  };
>
>  struct mlx4_srq {
> @@ -402,6 +403,8 @@ int mlx4_dereg_mr(struct ibv_mr *mr);
>  struct ibv_cq *mlx4_create_cq(struct ibv_context *context, int cqe,
>struct ibv_comp_channel *channel,
>int comp_vector);
> +struct ibv_cq *mlx4_create_cq_ex(struct ibv_context *context,
> +struct ibv_create_cq_attr_ex *cq_attr);
>  int mlx4_alloc_cq_buf(struct mlx4_device *dev, struct mlx4_buf *buf, int 
> nent,
>   int entry_size);
>  int mlx4_resize_cq(struct ibv_cq *cq, int cqe);
> diff --git a/src/verbs.c b/src/verbs.c
> index e93114b..3290b86 100644
> --- a/src/verbs.c
> +++ b/src/verbs.c
> @@ -272,19 +272,69 @@ int align_queue_size(int req)
> return nent;
>  }
>
> -struct ibv_cq *mlx4_create_cq(struct ibv_context *context, int cqe,
> -  struct ibv_comp_channel *channel,
> -  int comp_vector)
> +enum cmd_type {
> +   MLX4_CMD_TYPE_BASIC,
> +   MLX4_CMD_TYPE_EXTENDED
> +};
> +
> +enum {
> +   CREATE_CQ_SUPPORTED_COMP_MASK = IBV_CREATE_CQ_ATTR_FLAGS
> +};
> +
> +enum {
> +   CREATE_CQ_SUPPORTED_FLAGS = IBV_CREATE_CQ_ATTR_COMPLETION_TIMESTAMP
> +};
> +
> +enum {
> +   CREATE_CQ_SUPPORTED_WC_FLAGS = IBV_WC_STANDARD_FLAGS
> +};
> +
> +static struct ibv_cq *create_cq(struct ibv_context *context,
> +   struct ibv_create_cq_attr_ex *cq_attr,
> +   enum cmd_type cmd_type)
>  {
> -   struct mlx4_create_cq  cmd;
> -   struct mlx4_create_cq_resp resp;
> -   struct mlx4_cq*cq;
> -   intret;
> -   struct mlx4_context   *mctx = to_mctx(context);
> +   struct mlx4_create_cq   cmd;
> +   struct mlx4_create_cq_excmd_e;
> +   struct mlx4_create_cq_resp  resp;
> +   struct mlx4_create_cq_resp_ex   resp_e;
> +   struct mlx4_cq  *cq;
> +   int ret;
> +   struct mlx4_context *mctx = to_mctx(context);
> +   struct ibv_create_cq_attr_excq_attr_e;
> +   int cqe;
>
> /* Sanity check CQ size before proceeding */
> -   if (cqe > 0x3f)
> +   if (cq_attr->cqe > 0x3f)
> +   return NULL;
> +
> +   if (cq_attr->comp_mask & ~CREATE_CQ_SUPPORTED_COMP_MASK) {
> +   errno = EINVAL;
> return NULL;
> +   }
> +

Re: [PATCH libibverbs 7/7] Optimize ibv_poll_cq_ex for common scenarios

2015-10-27 Thread Matan Barak
On Tue, Oct 27, 2015 at 6:52 PM, Matan Barak  wrote:
> The current ibv_poll_cq_ex mechanism needs to query every field
> for its existence. In order to avoid this penalty at runtime,
> add optimized functions for special cases.
>
> Signed-off-by: Matan Barak 
> ---
>  configure.ac |  17 
>  src/cq.c | 268 
> ++-
>  src/mlx4.h   |  20 -
>  src/verbs.c  |  10 +--
>  4 files changed, 271 insertions(+), 44 deletions(-)
>
> diff --git a/configure.ac b/configure.ac
> index 6e98f20..9dbbb4b 100644
> --- a/configure.ac
> +++ b/configure.ac
> @@ -45,6 +45,23 @@ AC_CHECK_MEMBER([struct verbs_context.ibv_create_flow], [],
>  [AC_MSG_ERROR([libmlx4 requires libibverbs >= 1.2.0])],
>  [[#include ]])
>
> +AC_MSG_CHECKING("always inline")
> +CFLAGS_BAK="$CFLAGS"
> +CFLAGS="$CFLAGS -Werror"
> +AC_COMPILE_IFELSE([AC_LANG_PROGRAM([[
> +   static inline int f(void)
> +   __attribute((always_inline));
> +   static inline int f(void)
> +   {
> +   return 1;
> +   }
> +]],[[
> +   int a = f();
> +   a = a;
> +]])], [AC_MSG_RESULT([yes]) AC_DEFINE([HAVE_ALWAYS_INLINE], [1], [Define if 
> __attribute((always_inline)).])],
> +[AC_MSG_RESULT([no])])
> +CFLAGS="$CFLAGS_BAK"
> +
>  dnl Checks for typedefs, structures, and compiler characteristics.
>  AC_C_CONST
>  AC_CHECK_SIZEOF(long)
> diff --git a/src/cq.c b/src/cq.c
> index 1f2d572..56c0fa4 100644
> --- a/src/cq.c
> +++ b/src/cq.c
> @@ -377,10 +377,22 @@ union wc_buffer {
> uint64_t*b64;
>  };
>
> +#define IS_IN_WC_FLAGS(yes, no, maybe, flag) (((yes) & (flag)) ||\
> + (!((no) & (flag)) && \
> +  ((maybe) & (flag))))
>  static inline int _mlx4_poll_one_ex(struct mlx4_cq *cq,
> struct mlx4_qp **cur_qp,
> struct ibv_wc_ex **pwc_ex,
> -   uint64_t wc_flags)
> +   uint64_t wc_flags,
> +   uint64_t yes_wc_flags,
> +   uint64_t no_wc_flags)
> +   ALWAYS_INLINE;
> +static inline int _mlx4_poll_one_ex(struct mlx4_cq *cq,
> +   struct mlx4_qp **cur_qp,
> +   struct ibv_wc_ex **pwc_ex,
> +   uint64_t wc_flags,
> +   uint64_t wc_flags_yes,
> +   uint64_t wc_flags_no)
>  {
> struct mlx4_cqe *cqe;
> uint32_t qpn;
> @@ -392,14 +404,14 @@ static inline int _mlx4_poll_one_ex(struct mlx4_cq *cq,
> uint64_t wc_flags_out = 0;
>
> wc_buffer.b64 = (uint64_t *)&wc_ex->buffer;
> -   wc_ex->wc_flags = 0;
> wc_ex->reserved = 0;
> err = mlx4_handle_cq(cq, cur_qp, &wc_ex->wr_id, &wc_ex->status,
>  &wc_ex->vendor_err, &cqe, &qpn, &is_send);
> if (err != CQ_CONTINUE)
> return err;
>
> -   if (wc_flags & IBV_WC_EX_WITH_COMPLETION_TIMESTAMP) {
> +   if (IS_IN_WC_FLAGS(wc_flags_yes, wc_flags_no, wc_flags,
> +  IBV_WC_EX_WITH_COMPLETION_TIMESTAMP)) {
> uint16_t timestamp_0_15 = cqe->timestamp_0_7 |
> cqe->timestamp_8_15 << 8;
>
> @@ -415,80 +427,101 @@ static inline int _mlx4_poll_one_ex(struct mlx4_cq *cq,
> wc_flags_out |= IBV_WC_EX_IMM;
> case MLX4_OPCODE_RDMA_WRITE:
> wc_ex->opcode= IBV_WC_RDMA_WRITE;
> -   if (wc_flags & IBV_WC_EX_WITH_BYTE_LEN)
> +   if (IS_IN_WC_FLAGS(wc_flags_yes, wc_flags_no, 
> wc_flags,
> +  IBV_WC_EX_WITH_BYTE_LEN))
> wc_buffer.b32++;
> -   if (wc_flags & IBV_WC_EX_WITH_IMM)
> +   if (IS_IN_WC_FLAGS(wc_flags_yes, wc_flags_no, 
> wc_flags,
> +  IBV_WC_EX_WITH_IMM))
> wc_buffer.b32++;
> break;
> case MLX4_OPCODE_SEND_IMM:
> wc_flags_out |= IBV_WC_EX_IMM;
> case MLX4_OPCODE_SEND:
> wc_ex->opcode= IBV_WC_SEND;
> -   if (wc_flags & IBV_WC_EX_WITH_BYTE_LEN)
> +   if (IS_IN_WC_FLAGS(wc_flags_yes, wc_flags_no, 
> wc_flags,
> +  IBV_WC_EX_WITH_BYTE_LEN))
> wc_buffer.b32++;
> -   if (wc_flags & IBV_WC_EX_WITH_IMM)
> +   if (IS_IN_WC_FLAGS(wc_flags_yes, wc_flags_no, 
> wc_flags,
> +  IBV_WC_EX_WITH_IMM))
> wc_b

Re: [PATCH libibverbs 5/7] Add support for ibv_query_values_ex

2015-10-27 Thread Matan Barak
On Tue, Oct 27, 2015 at 6:52 PM, Matan Barak  wrote:
> Adding mlx4_query_values as implementation for
> ibv_query_values_ex. mlx4_query_values follows the
> standard extension verb mechanism.
> This function supports reading the hwclock via mmaping
> the required space from kernel.
>
> Signed-off-by: Matan Barak 
> ---
>  src/mlx4.c  | 36 
>  src/mlx4.h  |  3 +++
>  src/verbs.c | 45 +
>  3 files changed, 84 insertions(+)
>
> diff --git a/src/mlx4.c b/src/mlx4.c
> index cc1211f..6d66cf0 100644
> --- a/src/mlx4.c
> +++ b/src/mlx4.c
> @@ -116,6 +116,28 @@ static struct ibv_context_ops mlx4_ctx_ops = {
> .detach_mcast  = ibv_cmd_detach_mcast
>  };
>
> +static int mlx4_map_internal_clock(struct mlx4_device *dev,
> +  struct ibv_context *ibv_ctx)
> +{
> +   struct mlx4_context *context = to_mctx(ibv_ctx);
> +   void *hca_clock_page;
> +
> +   hca_clock_page = mmap(NULL, dev->page_size, PROT_READ, MAP_SHARED,
> + ibv_ctx->cmd_fd, dev->page_size * 3);
> +
> +   if (hca_clock_page == MAP_FAILED) {
> +   fprintf(stderr, PFX
> +   "Warning: Timestamp available,\n"
> +   "but failed to mmap() hca core clock page, 
> errno=%d.\n",
> +   errno);
> +   return -1;
> +   }
> +
> +   context->hca_core_clock = hca_clock_page +
> +   context->core_clock_offset % dev->page_size;
> +   return 0;
> +}
> +
>  static int mlx4_init_context(struct verbs_device *v_device,
> struct ibv_context *ibv_ctx, int cmd_fd)
>  {
> @@ -127,6 +149,10 @@ static int mlx4_init_context(struct verbs_device 
> *v_device,
> __u16   bf_reg_size;
> struct mlx4_device  *dev = to_mdev(&v_device->device);
> struct verbs_context *verbs_ctx = verbs_get_ctx(ibv_ctx);
> +   struct ibv_query_device_ex_input input_query_device = {.comp_mask = 
> 0};
> +   struct ibv_device_attr_ex   dev_attrs;
> +   uint32_tdev_attrs_comp_mask;
> +   int err;
>
> /* memory footprint of mlx4_context and verbs_context share
> * struct ibv_context.
> @@ -194,6 +220,12 @@ static int mlx4_init_context(struct verbs_device 
> *v_device,
> context->bf_buf_size = 0;
> }
>
> +   context->hca_core_clock = NULL;
> +   err = _mlx4_query_device_ex(ibv_ctx, &input_query_device, &dev_attrs,
> +   sizeof(dev_attrs), &dev_attrs_comp_mask);
> +   if (!err && dev_attrs_comp_mask & QUERY_DEVICE_RESP_MASK_TIMESTAMP)
> +   mlx4_map_internal_clock(dev, ibv_ctx);
> +
> pthread_spin_init(&context->uar_lock, PTHREAD_PROCESS_PRIVATE);
> ibv_ctx->ops = mlx4_ctx_ops;
>
> @@ -210,6 +242,7 @@ static int mlx4_init_context(struct verbs_device 
> *v_device,
> verbs_set_ctx_op(verbs_ctx, query_device_ex, mlx4_query_device_ex);
> verbs_set_ctx_op(verbs_ctx, create_cq_ex, mlx4_create_cq_ex);
> verbs_set_ctx_op(verbs_ctx, poll_cq_ex, mlx4_poll_cq_ex);
> +   verbs_set_ctx_op(verbs_ctx, query_values, mlx4_query_values);
>
> return 0;
>
> @@ -223,6 +256,9 @@ static void mlx4_uninit_context(struct verbs_device 
> *v_device,
> munmap(context->uar, to_mdev(&v_device->device)->page_size);
> if (context->bf_page)
> munmap(context->bf_page, 
> to_mdev(&v_device->device)->page_size);
> +   if (context->hca_core_clock)
> +   munmap(context->hca_core_clock - context->core_clock_offset,
> +  to_mdev(&v_device->device)->page_size);
>
>  }
>
> diff --git a/src/mlx4.h b/src/mlx4.h
> index 2465298..8e1935d 100644
> --- a/src/mlx4.h
> +++ b/src/mlx4.h
> @@ -199,6 +199,7 @@ struct mlx4_context {
> enum ibv_port_cap_flags caps;
> } port_query_cache[MLX4_PORTS_NUM];
> uint64_tcore_clock_offset;
> +   void   *hca_core_clock;
>  };
>
>  struct mlx4_buf {
> @@ -403,6 +404,8 @@ int _mlx4_query_device_ex(struct ibv_context *context,
>  int mlx4_query_device_ex(struct ibv_context *context,
>  const struct ibv_query_device_ex_input *input,
>  struct ibv_device_attr_ex *attr, size_t attr_size);
> +int mlx4_query_values(struct ibv_context *context,
> + struct ibv_values_ex *values);
>  int mlx4_query_port(struct ibv_context *context, uint8_t port,
>  struct ibv_port_attr *attr);
>
> diff --git a/src/verbs.c b/src/verbs.c
> index a8d6bd7..843ca1e 100644
> --- a/src/verbs.c
> +++ b/src/verbs.c
> @@ -114,6 +114,51 @@ int mlx4_query_device_ex(struct ibv_context *context,
> return _mlx4_query_device_ex(context, input, attr, attr_size, N

Re: [PATCH libibverbs 3/7] Implement ibv_poll_cq_ex extension verb

2015-10-27 Thread Matan Barak
On Tue, Oct 27, 2015 at 6:52 PM, Matan Barak  wrote:
> Add an implementation for verb_poll_cq extension verb.
> This patch implements the new API via the standard
> function mlx4_poll_one.
>
> Signed-off-by: Matan Barak 
> ---
>  src/cq.c| 307 
> ++--
>  src/mlx4.c  |   1 +
>  src/mlx4.h  |   4 +
>  src/verbs.c |   1 +
>  4 files changed, 284 insertions(+), 29 deletions(-)
>
> diff --git a/src/cq.c b/src/cq.c
> index 32c9070..c86e824 100644
> --- a/src/cq.c
> +++ b/src/cq.c
> @@ -52,6 +52,7 @@ enum {
>  };
>
>  enum {
> +   CQ_CONTINUE =  1,
> CQ_OK   =  0,
> CQ_EMPTY= -1,
> CQ_POLL_ERR = -2
> @@ -121,7 +122,9 @@ static void update_cons_index(struct mlx4_cq *cq)
> *cq->set_ci_db = htonl(cq->cons_index & 0xffffff);
>  }
>
> -static void mlx4_handle_error_cqe(struct mlx4_err_cqe *cqe, struct ibv_wc 
> *wc)
> +static void mlx4_handle_error_cqe(struct mlx4_err_cqe *cqe,
> + enum ibv_wc_status *status,
> + enum ibv_wc_opcode *vendor_err)
>  {
> if (cqe->syndrome == MLX4_CQE_SYNDROME_LOCAL_QP_OP_ERR)
> printf(PFX "local QP operation err "
> @@ -133,64 +136,68 @@ static void mlx4_handle_error_cqe(struct mlx4_err_cqe 
> *cqe, struct ibv_wc *wc)
>
> switch (cqe->syndrome) {
> case MLX4_CQE_SYNDROME_LOCAL_LENGTH_ERR:
> -   wc->status = IBV_WC_LOC_LEN_ERR;
> +   *status = IBV_WC_LOC_LEN_ERR;
> break;
> case MLX4_CQE_SYNDROME_LOCAL_QP_OP_ERR:
> -   wc->status = IBV_WC_LOC_QP_OP_ERR;
> +   *status = IBV_WC_LOC_QP_OP_ERR;
> break;
> case MLX4_CQE_SYNDROME_LOCAL_PROT_ERR:
> -   wc->status = IBV_WC_LOC_PROT_ERR;
> +   *status = IBV_WC_LOC_PROT_ERR;
> break;
> case MLX4_CQE_SYNDROME_WR_FLUSH_ERR:
> -   wc->status = IBV_WC_WR_FLUSH_ERR;
> +   *status = IBV_WC_WR_FLUSH_ERR;
> break;
> case MLX4_CQE_SYNDROME_MW_BIND_ERR:
> -   wc->status = IBV_WC_MW_BIND_ERR;
> +   *status = IBV_WC_MW_BIND_ERR;
> break;
> case MLX4_CQE_SYNDROME_BAD_RESP_ERR:
> -   wc->status = IBV_WC_BAD_RESP_ERR;
> +   *status = IBV_WC_BAD_RESP_ERR;
> break;
> case MLX4_CQE_SYNDROME_LOCAL_ACCESS_ERR:
> -   wc->status = IBV_WC_LOC_ACCESS_ERR;
> +   *status = IBV_WC_LOC_ACCESS_ERR;
> break;
> case MLX4_CQE_SYNDROME_REMOTE_INVAL_REQ_ERR:
> -   wc->status = IBV_WC_REM_INV_REQ_ERR;
> +   *status = IBV_WC_REM_INV_REQ_ERR;
> break;
> case MLX4_CQE_SYNDROME_REMOTE_ACCESS_ERR:
> -   wc->status = IBV_WC_REM_ACCESS_ERR;
> +   *status = IBV_WC_REM_ACCESS_ERR;
> break;
> case MLX4_CQE_SYNDROME_REMOTE_OP_ERR:
> -   wc->status = IBV_WC_REM_OP_ERR;
> +   *status = IBV_WC_REM_OP_ERR;
> break;
> case MLX4_CQE_SYNDROME_TRANSPORT_RETRY_EXC_ERR:
> -   wc->status = IBV_WC_RETRY_EXC_ERR;
> +   *status = IBV_WC_RETRY_EXC_ERR;
> break;
> case MLX4_CQE_SYNDROME_RNR_RETRY_EXC_ERR:
> -   wc->status = IBV_WC_RNR_RETRY_EXC_ERR;
> +   *status = IBV_WC_RNR_RETRY_EXC_ERR;
> break;
> case MLX4_CQE_SYNDROME_REMOTE_ABORTED_ERR:
> -   wc->status = IBV_WC_REM_ABORT_ERR;
> +   *status = IBV_WC_REM_ABORT_ERR;
> break;
> default:
> -   wc->status = IBV_WC_GENERAL_ERR;
> +   *status = IBV_WC_GENERAL_ERR;
> break;
> }
>
> -   wc->vendor_err = cqe->vendor_err;
> +   *vendor_err = cqe->vendor_err;
>  }
>
> -static int mlx4_poll_one(struct mlx4_cq *cq,
> -struct mlx4_qp **cur_qp,
> -struct ibv_wc *wc)
> +static inline int mlx4_handle_cq(struct mlx4_cq *cq,
> +struct mlx4_qp **cur_qp,
> +uint64_t *wc_wr_id,
> +enum ibv_wc_status *wc_status,
> +uint32_t *wc_vendor_err,
> +struct mlx4_cqe **pcqe,
> +uint32_t *pqpn,
> +int *pis_send)
>  {
> struct mlx4_wq *wq;
> struct mlx4_cqe *cqe;
> struct mlx4_srq *srq;
> uint32_t qpn;
> -   uint32_t g_mlpath_rqpn;
> -   uint16_t wqe_index;
> int is_error;
> int is_send;
> +   uint16_t wqe_index;
>
> cqe = next_cqe_sw(cq);
> if (!cqe)
> @@ -201,7 +

[PATCH v1 libmlx4 0/7] Completion timestamping

2015-10-27 Thread Matan Barak
Hi Yishai,

This series adds support for completion timestamps. In order to
support this feature, several extended verbs were implemented
(as instructed in libibverbs).

ibv_query_device_ex was extended to support reading the
hca_core_clock and timestamp mask. The same verb was extended
with vendor-dependent data which is used in order to map the
HCA's free-running clock register.
When libmlx4 initializes, it tries to mmap this free-running
clock register. This mapping is used in order to implement
ibv_query_values_ex efficiently.
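
A minimal sketch of the mapping step mentioned above, modelled on
mlx4_map_internal_clock() in patch 5: the kernel exposes the free-running
clock register through the verbs command fd at a fixed page index (page 3
on mlx4), and the byte offset inside that page comes from the
vendor-specific part of the query_device_ex response.

#include <stdint.h>
#include <sys/mman.h>

static volatile uint32_t *map_core_clock(int cmd_fd, long page_size,
					 uint64_t core_clock_offset)
{
	void *page = mmap(NULL, page_size, PROT_READ, MAP_SHARED,
			  cmd_fd, page_size * 3);

	if (page == MAP_FAILED)
		return NULL;

	return (volatile uint32_t *)((char *)page +
				     core_clock_offset % page_size);
}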

In order to support CQ completion timestamp reporting, we implement
the ibv_create_cq_ex verb. This verb is used both to create a CQ
which supports timestamps and to state which fields should be
returned via the WC. Returning this data is done by implementing
ibv_poll_cq_ex. For every field the user has requested we check the
CQ's requested wc_flags and populate it according to the network
operation carried and the WC status.

Last but not least, ibv_poll_cq_ex was optimized in order to eliminate
the if statements and OR operations for common combinations of wc
fields. This is done by inlining and using a custom poll_one_ex
function for these combinations.

Thanks,
Matan

Changes from v0:
* Changed patch-set to correct prefix.

Matan Barak (7):
  Add support for extended version of ibv_query_device
  Add support for ibv_create_cq_ex
  Implement ibv_poll_cq_ex extension verb
  Add timestamp support to extended poll_cq verb
  Add support for ibv_query_values_ex
  Add support for different poll_one_ex functions
  Optimize ibv_poll_cq_ex for common scenarios

 configure.ac   |  17 ++
 src/cq.c   | 512 +
 src/mlx4-abi.h |  25 +++
 src/mlx4.c |  39 +
 src/mlx4.h |  64 +++-
 src/verbs.c| 219 +---
 6 files changed, 823 insertions(+), 53 deletions(-)

-- 
2.1.0



[PATCH v1 libmlx4 4/7] Add timestamp support to extended poll_cq verb

2015-10-27 Thread Matan Barak
Add support to the extended version of the poll_cq verb for reading
the completion timestamp. Reading the timestamp is not supported
together with IBV_WC_EX_WITH_SL and IBV_WC_EX_WITH_SLID.

Signed-off-by: Matan Barak 
---
 src/cq.c| 10 ++
 src/mlx4.h  | 25 -
 src/verbs.c |  3 ++-
 3 files changed, 32 insertions(+), 6 deletions(-)

diff --git a/src/cq.c b/src/cq.c
index c86e824..7f40f12 100644
--- a/src/cq.c
+++ b/src/cq.c
@@ -399,6 +399,16 @@ static inline int _mlx4_poll_one_ex(struct mlx4_cq *cq,
if (err != CQ_CONTINUE)
return err;
 
+   if (wc_flags & IBV_WC_EX_WITH_COMPLETION_TIMESTAMP) {
+   uint16_t timestamp_0_15 = cqe->timestamp_0_7 |
+   cqe->timestamp_8_15 << 8;
+
+   wc_flags_out |= IBV_WC_EX_WITH_COMPLETION_TIMESTAMP;
+   *wc_buffer.b64++ = (((uint64_t)ntohl(cqe->timestamp_16_47)
++ !timestamp_0_15) << 16) |
+  (uint64_t)timestamp_0_15;
+   }
+
if (is_send) {
switch (cqe->owner_sr_opcode & MLX4_CQE_OPCODE_MASK) {
case MLX4_OPCODE_RDMA_WRITE_IMM:
diff --git a/src/mlx4.h b/src/mlx4.h
index e22f879..2465298 100644
--- a/src/mlx4.h
+++ b/src/mlx4.h
@@ -312,14 +312,29 @@ struct mlx4_cqe {
uint32_tvlan_my_qpn;
uint32_timmed_rss_invalid;
uint32_tg_mlpath_rqpn;
-   uint8_t sl_vid;
-   uint8_t reserved1;
-   uint16_trlid;
-   uint32_tstatus;
+   union {
+   struct {
+   union {
+   struct {
+   uint8_t   sl_vid;
+   uint8_t   reserved1;
+   uint16_t  rlid;
+   };
+   uint32_t  timestamp_16_47;
+   };
+   uint32_t  status;
+   };
+   struct {
+   uint16_t reserved2;
+   uint8_t  smac[6];
+   };
+   };
uint32_tbyte_cnt;
uint16_twqe_index;
uint16_tchecksum;
-   uint8_t reserved3[3];
+   uint8_t reserved3;
+   uint8_t timestamp_8_15;
+   uint8_t timestamp_0_7;
uint8_t owner_sr_opcode;
 };
 
diff --git a/src/verbs.c b/src/verbs.c
index 0dcdc87..a8d6bd7 100644
--- a/src/verbs.c
+++ b/src/verbs.c
@@ -286,7 +286,8 @@ enum {
 };
 
 enum {
-   CREATE_CQ_SUPPORTED_WC_FLAGS = IBV_WC_STANDARD_FLAGS
+   CREATE_CQ_SUPPORTED_WC_FLAGS = IBV_WC_STANDARD_FLAGS|
+  IBV_WC_EX_WITH_COMPLETION_TIMESTAMP
 };
 
 static struct ibv_cq *create_cq(struct ibv_context *context,
-- 
2.1.0

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH v1 libmlx4 7/7] Optimize ibv_poll_cq_ex for common scenarios

2015-10-27 Thread Matan Barak
The current ibv_poll_cq_ex mechanism needs to test every field for
its presence at runtime. In order to avoid this penalty, add
optimized poll functions for common special cases.
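
To make the specialization concrete, here is a hedged sketch of one such
variant; _mlx4_poll_one_ex() and the flag names come from the diff below,
while the wrapper itself is illustrative and need not match the variants
actually added by this patch:

/* Illustrative sketch: a poll_one variant that promises, at compile time,
 * that the timestamp is always requested and byte_len never is.  Because
 * wc_flags_yes/wc_flags_no are constants and _mlx4_poll_one_ex() is
 * always_inline, IS_IN_WC_FLAGS() folds to a constant for those flags and
 * the corresponding branches disappear. */
static int mlx4_poll_one_ex_ts_no_blen(struct mlx4_cq *cq,
				       struct mlx4_qp **cur_qp,
				       struct ibv_wc_ex **pwc_ex)
{
	return _mlx4_poll_one_ex(cq, cur_qp, pwc_ex, cq->wc_flags,
				 IBV_WC_EX_WITH_COMPLETION_TIMESTAMP,
				 IBV_WC_EX_WITH_BYTE_LEN);
}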

Signed-off-by: Matan Barak 
---
 configure.ac |  17 
 src/cq.c | 268 ++-
 src/mlx4.h   |  20 -
 src/verbs.c  |  10 +--
 4 files changed, 271 insertions(+), 44 deletions(-)

diff --git a/configure.ac b/configure.ac
index 6e98f20..9dbbb4b 100644
--- a/configure.ac
+++ b/configure.ac
@@ -45,6 +45,23 @@ AC_CHECK_MEMBER([struct verbs_context.ibv_create_flow], [],
 [AC_MSG_ERROR([libmlx4 requires libibverbs >= 1.2.0])],
 [[#include ]])
 
+AC_MSG_CHECKING("always inline")
+CFLAGS_BAK="$CFLAGS"
+CFLAGS="$CFLAGS -Werror"
+AC_COMPILE_IFELSE([AC_LANG_PROGRAM([[
+   static inline int f(void)
+   __attribute((always_inline));
+   static inline int f(void)
+   {
+   return 1;
+   }
+]],[[
+   int a = f();
+   a = a;
+]])], [AC_MSG_RESULT([yes]) AC_DEFINE([HAVE_ALWAYS_INLINE], [1], [Define if 
__attribute((always_inline)).])],
+[AC_MSG_RESULT([no])])
+CFLAGS="$CFLAGS_BAK"
+
 dnl Checks for typedefs, structures, and compiler characteristics.
 AC_C_CONST
 AC_CHECK_SIZEOF(long)
diff --git a/src/cq.c b/src/cq.c
index 1f2d572..56c0fa4 100644
--- a/src/cq.c
+++ b/src/cq.c
@@ -377,10 +377,22 @@ union wc_buffer {
uint64_t*b64;
 };
 
+#define IS_IN_WC_FLAGS(yes, no, maybe, flag) (((yes) & (flag)) ||\
+ (!((no) & (flag)) && \
+  ((maybe) & (flag))))
 static inline int _mlx4_poll_one_ex(struct mlx4_cq *cq,
struct mlx4_qp **cur_qp,
struct ibv_wc_ex **pwc_ex,
-   uint64_t wc_flags)
+   uint64_t wc_flags,
+   uint64_t yes_wc_flags,
+   uint64_t no_wc_flags)
+   ALWAYS_INLINE;
+static inline int _mlx4_poll_one_ex(struct mlx4_cq *cq,
+   struct mlx4_qp **cur_qp,
+   struct ibv_wc_ex **pwc_ex,
+   uint64_t wc_flags,
+   uint64_t wc_flags_yes,
+   uint64_t wc_flags_no)
 {
struct mlx4_cqe *cqe;
uint32_t qpn;
@@ -392,14 +404,14 @@ static inline int _mlx4_poll_one_ex(struct mlx4_cq *cq,
uint64_t wc_flags_out = 0;
 
wc_buffer.b64 = (uint64_t *)&wc_ex->buffer;
-   wc_ex->wc_flags = 0;
wc_ex->reserved = 0;
err = mlx4_handle_cq(cq, cur_qp, &wc_ex->wr_id, &wc_ex->status,
 &wc_ex->vendor_err, &cqe, &qpn, &is_send);
if (err != CQ_CONTINUE)
return err;
 
-   if (wc_flags & IBV_WC_EX_WITH_COMPLETION_TIMESTAMP) {
+   if (IS_IN_WC_FLAGS(wc_flags_yes, wc_flags_no, wc_flags,
+  IBV_WC_EX_WITH_COMPLETION_TIMESTAMP)) {
uint16_t timestamp_0_15 = cqe->timestamp_0_7 |
cqe->timestamp_8_15 << 8;
 
@@ -415,80 +427,101 @@ static inline int _mlx4_poll_one_ex(struct mlx4_cq *cq,
wc_flags_out |= IBV_WC_EX_IMM;
case MLX4_OPCODE_RDMA_WRITE:
wc_ex->opcode= IBV_WC_RDMA_WRITE;
-   if (wc_flags & IBV_WC_EX_WITH_BYTE_LEN)
+   if (IS_IN_WC_FLAGS(wc_flags_yes, wc_flags_no, wc_flags,
+  IBV_WC_EX_WITH_BYTE_LEN))
wc_buffer.b32++;
-   if (wc_flags & IBV_WC_EX_WITH_IMM)
+   if (IS_IN_WC_FLAGS(wc_flags_yes, wc_flags_no, wc_flags,
+  IBV_WC_EX_WITH_IMM))
wc_buffer.b32++;
break;
case MLX4_OPCODE_SEND_IMM:
wc_flags_out |= IBV_WC_EX_IMM;
case MLX4_OPCODE_SEND:
wc_ex->opcode= IBV_WC_SEND;
-   if (wc_flags & IBV_WC_EX_WITH_BYTE_LEN)
+   if (IS_IN_WC_FLAGS(wc_flags_yes, wc_flags_no, wc_flags,
+  IBV_WC_EX_WITH_BYTE_LEN))
wc_buffer.b32++;
-   if (wc_flags & IBV_WC_EX_WITH_IMM)
+   if (IS_IN_WC_FLAGS(wc_flags_yes, wc_flags_no, wc_flags,
+  IBV_WC_EX_WITH_IMM))
wc_buffer.b32++;
break;
case MLX4_OPCODE_RDMA_READ:
wc_ex->opcode= IBV_WC_RDMA_READ;
-   if (wc_flags & IBV_WC_EX_WITH_BYTE_LEN) {
+   if (IS_IN_WC_FLAGS(wc_flags_yes, wc_fla

[PATCH v1 libmlx4 2/7] Add support for ibv_create_cq_ex

2015-10-27 Thread Matan Barak
Add an extension verb, mlx4_create_cq_ex, that follows the
standard extension verb mechanism.
This function is similar to mlx4_create_cq but supports the
extension verb attributes and stores the creation flags
for later use (for example, the timestamp flag is used in poll_cq).
The function fails if the user passes unsupported WC attributes.

Signed-off-by: Matan Barak 
---
 src/mlx4-abi.h |  12 ++
 src/mlx4.c |   1 +
 src/mlx4.h |   3 ++
 src/verbs.c| 117 +
 4 files changed, 117 insertions(+), 16 deletions(-)

diff --git a/src/mlx4-abi.h b/src/mlx4-abi.h
index b348ce3..9b765e4 100644
--- a/src/mlx4-abi.h
+++ b/src/mlx4-abi.h
@@ -72,12 +72,24 @@ struct mlx4_create_cq {
__u64   db_addr;
 };
 
+struct mlx4_create_cq_ex {
+   struct ibv_create_cq_ex ibv_cmd;
+   __u64   buf_addr;
+   __u64   db_addr;
+};
+
 struct mlx4_create_cq_resp {
struct ibv_create_cq_resp   ibv_resp;
__u32   cqn;
__u32   reserved;
 };
 
+struct mlx4_create_cq_resp_ex {
+   struct ibv_create_cq_resp_exibv_resp;
+   __u32   cqn;
+   __u32   reserved;
+};
+
 struct mlx4_resize_cq {
struct ibv_resize_cqibv_cmd;
__u64   buf_addr;
diff --git a/src/mlx4.c b/src/mlx4.c
index d41dff0..9cfd013 100644
--- a/src/mlx4.c
+++ b/src/mlx4.c
@@ -208,6 +208,7 @@ static int mlx4_init_context(struct verbs_device *v_device,
verbs_set_ctx_op(verbs_ctx, ibv_create_flow, ibv_cmd_create_flow);
verbs_set_ctx_op(verbs_ctx, ibv_destroy_flow, ibv_cmd_destroy_flow);
verbs_set_ctx_op(verbs_ctx, query_device_ex, mlx4_query_device_ex);
+   verbs_set_ctx_op(verbs_ctx, create_cq_ex, mlx4_create_cq_ex);
 
return 0;
 
diff --git a/src/mlx4.h b/src/mlx4.h
index 0f643bc..91eb79c 100644
--- a/src/mlx4.h
+++ b/src/mlx4.h
@@ -222,6 +222,7 @@ struct mlx4_cq {
uint32_t   *arm_db;
int arm_sn;
int cqe_size;
+   int creation_flags;
 };
 
 struct mlx4_srq {
@@ -402,6 +403,8 @@ int mlx4_dereg_mr(struct ibv_mr *mr);
 struct ibv_cq *mlx4_create_cq(struct ibv_context *context, int cqe,
   struct ibv_comp_channel *channel,
   int comp_vector);
+struct ibv_cq *mlx4_create_cq_ex(struct ibv_context *context,
+struct ibv_create_cq_attr_ex *cq_attr);
 int mlx4_alloc_cq_buf(struct mlx4_device *dev, struct mlx4_buf *buf, int nent,
  int entry_size);
 int mlx4_resize_cq(struct ibv_cq *cq, int cqe);
diff --git a/src/verbs.c b/src/verbs.c
index e93114b..3290b86 100644
--- a/src/verbs.c
+++ b/src/verbs.c
@@ -272,19 +272,69 @@ int align_queue_size(int req)
return nent;
 }
 
-struct ibv_cq *mlx4_create_cq(struct ibv_context *context, int cqe,
-  struct ibv_comp_channel *channel,
-  int comp_vector)
+enum cmd_type {
+   MLX4_CMD_TYPE_BASIC,
+   MLX4_CMD_TYPE_EXTENDED
+};
+
+enum {
+   CREATE_CQ_SUPPORTED_COMP_MASK = IBV_CREATE_CQ_ATTR_FLAGS
+};
+
+enum {
+   CREATE_CQ_SUPPORTED_FLAGS = IBV_CREATE_CQ_ATTR_COMPLETION_TIMESTAMP
+};
+
+enum {
+   CREATE_CQ_SUPPORTED_WC_FLAGS = IBV_WC_STANDARD_FLAGS
+};
+
+static struct ibv_cq *create_cq(struct ibv_context *context,
+   struct ibv_create_cq_attr_ex *cq_attr,
+   enum cmd_type cmd_type)
 {
-   struct mlx4_create_cq  cmd;
-   struct mlx4_create_cq_resp resp;
-   struct mlx4_cq*cq;
-   intret;
-   struct mlx4_context   *mctx = to_mctx(context);
+   struct mlx4_create_cq   cmd;
+   struct mlx4_create_cq_excmd_e;
+   struct mlx4_create_cq_resp  resp;
+   struct mlx4_create_cq_resp_ex   resp_e;
+   struct mlx4_cq  *cq;
+   int ret;
+   struct mlx4_context *mctx = to_mctx(context);
+   struct ibv_create_cq_attr_excq_attr_e;
+   int cqe;
 
/* Sanity check CQ size before proceeding */
-   if (cqe > 0x3f)
+   if (cq_attr->cqe > 0x3f)
+   return NULL;
+
+   if (cq_attr->comp_mask & ~CREATE_CQ_SUPPORTED_COMP_MASK) {
+   errno = EINVAL;
return NULL;
+   }
+
+   if (cq_attr->comp_mask & IBV_CREATE_CQ_ATTR_FLAGS &&
+   cq_attr->flags & ~CREATE_CQ_SUPPORTED_FLAGS) {
+   errno = EINVAL;
+   return NULL;
+   }
+
+   if (cq_attr->wc_flags & ~CREATE_CQ_SUPPORTED_WC_FLAGS) {
+   errno = ENOTSUP;
+   return 

[PATCH v1 libmlx4 1/7] Add support for extended version of ibv_query_device

2015-10-27 Thread Matan Barak
The new mlx4_query_device_ex implementation uses the
extended version of the libibverbs/uverbs query_device command.
In addition, it reads the hca_core_clock offset within the BAR
from the vendor-specific part of the ibv_query_device_ex response.

Signed-off-by: Matan Barak 
---
 src/mlx4-abi.h | 13 +
 src/mlx4.c |  1 +
 src/mlx4.h |  8 
 src/verbs.c| 54 +++---
 4 files changed, 73 insertions(+), 3 deletions(-)

diff --git a/src/mlx4-abi.h b/src/mlx4-abi.h
index b48f6fc..b348ce3 100644
--- a/src/mlx4-abi.h
+++ b/src/mlx4-abi.h
@@ -111,4 +111,17 @@ struct mlx4_create_qp {
__u8reserved[5];
 };
 
+enum query_device_resp_mask {
+   QUERY_DEVICE_RESP_MASK_TIMESTAMP = 1UL << 0,
+};
+
+struct query_device_ex_resp {
+   struct ibv_query_device_resp_ex core;
+   struct {
+   uint32_t comp_mask;
+   uint32_t response_length;
+   uint64_t hca_core_clock_offset;
+   };
+};
+
 #endif /* MLX4_ABI_H */
diff --git a/src/mlx4.c b/src/mlx4.c
index c30f4bf..d41dff0 100644
--- a/src/mlx4.c
+++ b/src/mlx4.c
@@ -207,6 +207,7 @@ static int mlx4_init_context(struct verbs_device *v_device,
verbs_set_ctx_op(verbs_ctx, open_qp, mlx4_open_qp);
verbs_set_ctx_op(verbs_ctx, ibv_create_flow, ibv_cmd_create_flow);
verbs_set_ctx_op(verbs_ctx, ibv_destroy_flow, ibv_cmd_destroy_flow);
+   verbs_set_ctx_op(verbs_ctx, query_device_ex, mlx4_query_device_ex);
 
return 0;
 
diff --git a/src/mlx4.h b/src/mlx4.h
index 519d8f4..0f643bc 100644
--- a/src/mlx4.h
+++ b/src/mlx4.h
@@ -198,6 +198,7 @@ struct mlx4_context {
uint8_t link_layer;
enum ibv_port_cap_flags caps;
} port_query_cache[MLX4_PORTS_NUM];
+   uint64_tcore_clock_offset;
 };
 
 struct mlx4_buf {
@@ -378,6 +379,13 @@ void mlx4_free_db(struct mlx4_context *context, enum 
mlx4_db_type type, uint32_t
 
 int mlx4_query_device(struct ibv_context *context,
   struct ibv_device_attr *attr);
+int _mlx4_query_device_ex(struct ibv_context *context,
+ const struct ibv_query_device_ex_input *input,
+ struct ibv_device_attr_ex *attr, size_t attr_size,
+ uint32_t *comp_mask);
+int mlx4_query_device_ex(struct ibv_context *context,
+const struct ibv_query_device_ex_input *input,
+struct ibv_device_attr_ex *attr, size_t attr_size);
 int mlx4_query_port(struct ibv_context *context, uint8_t port,
 struct ibv_port_attr *attr);
 
diff --git a/src/verbs.c b/src/verbs.c
index 2cb1f8a..e93114b 100644
--- a/src/verbs.c
+++ b/src/verbs.c
@@ -45,6 +45,14 @@
 #include "mlx4-abi.h"
 #include "wqe.h"
 
+static void parse_raw_fw_ver(uint64_t raw_fw_ver, unsigned *major,
+unsigned *minor, unsigned *sub_minor)
+{
+   *major = (raw_fw_ver >> 32) & 0x;
+   *minor = (raw_fw_ver >> 16) & 0x;
+   *sub_minor = raw_fw_ver & 0x;
+}
+
 int mlx4_query_device(struct ibv_context *context, struct ibv_device_attr 
*attr)
 {
struct ibv_query_device cmd;
@@ -56,9 +64,7 @@ int mlx4_query_device(struct ibv_context *context, struct 
ibv_device_attr *attr)
if (ret)
return ret;
 
-   major = (raw_fw_ver >> 32) & 0x;
-   minor = (raw_fw_ver >> 16) & 0x;
-   sub_minor = raw_fw_ver & 0x;
+   parse_raw_fw_ver(raw_fw_ver, &major, &minor, &sub_minor);
 
snprintf(attr->fw_ver, sizeof attr->fw_ver,
 "%d.%d.%03d", major, minor, sub_minor);
@@ -66,6 +72,48 @@ int mlx4_query_device(struct ibv_context *context, struct 
ibv_device_attr *attr)
return 0;
 }
 
+int _mlx4_query_device_ex(struct ibv_context *context,
+ const struct ibv_query_device_ex_input *input,
+ struct ibv_device_attr_ex *attr, size_t attr_size,
+ uint32_t *comp_mask)
+{
+   struct ibv_query_device_ex cmd;
+   struct query_device_ex_resp resp;
+   uint64_t raw_fw_ver;
+   unsigned major, minor, sub_minor;
+   int ret;
+
+   memset(&resp, 0, sizeof(resp));
+
+   ret = ibv_cmd_query_device_ex(context, input, attr, attr_size,
+ &raw_fw_ver, &cmd, sizeof(cmd),
+ sizeof(cmd), &resp.core,
+ sizeof(resp.core), sizeof(resp));
+   if (ret)
+   return ret;
+
+   parse_raw_fw_ver(raw_fw_ver, &major, &minor, &sub_minor);
+
+   snprintf(attr->orig_attr.fw_ver, sizeof(attr->orig_attr.fw_ver),
+"%d.%d.%03d", major, minor, sub_minor);
+
+   if (resp.comp_mask & QUERY_DEVICE_RESP_MASK_TIMESTAMP)
+   to_mctx(context)->core_clock_offset =
+   re

[PATCH v1 libmlx4 5/7] Add support for ibv_query_values_ex

2015-10-27 Thread Matan Barak
Add mlx4_query_values as the implementation of
ibv_query_values_ex. mlx4_query_values follows the
standard extension verb mechanism.
This function supports reading the HW clock by mmap()ing
the required page from the kernel.
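
The clock register is 64 bits wide but is read as two 32-bit words, so the
read must guard against the low word wrapping between the two accesses.
Since the diff below is truncated by the archive, here is a hedged sketch
of that high/low/high retry pattern; the variable names mirror the diff,
while the retry bound and the ntohl() conversion are assumptions:

#include <stdint.h>
#include <arpa/inet.h>

/* Illustrative sketch: read the free-running HCA clock as high/low/high
 * and retry if the high word changed in between. */
static int read_hca_core_clock(const volatile void *hca_core_clock,
			       uint64_t *cycles)
{
	uint32_t clockhi, clocklo, clockhi1;
	int i;

	for (i = 0; i < 10; i++) {
		clockhi  = ntohl(*(const volatile uint32_t *)hca_core_clock);
		clocklo  = ntohl(*((const volatile uint32_t *)hca_core_clock + 1));
		clockhi1 = ntohl(*(const volatile uint32_t *)hca_core_clock);
		if (clockhi == clockhi1) {
			*cycles = ((uint64_t)clockhi << 32) | clocklo;
			return 0;
		}
	}

	return -1;	/* high word kept changing; give up */
}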

Signed-off-by: Matan Barak 
---
 src/mlx4.c  | 36 
 src/mlx4.h  |  3 +++
 src/verbs.c | 45 +
 3 files changed, 84 insertions(+)

diff --git a/src/mlx4.c b/src/mlx4.c
index cc1211f..6d66cf0 100644
--- a/src/mlx4.c
+++ b/src/mlx4.c
@@ -116,6 +116,28 @@ static struct ibv_context_ops mlx4_ctx_ops = {
.detach_mcast  = ibv_cmd_detach_mcast
 };
 
+static int mlx4_map_internal_clock(struct mlx4_device *dev,
+  struct ibv_context *ibv_ctx)
+{
+   struct mlx4_context *context = to_mctx(ibv_ctx);
+   void *hca_clock_page;
+
+   hca_clock_page = mmap(NULL, dev->page_size, PROT_READ, MAP_SHARED,
+ ibv_ctx->cmd_fd, dev->page_size * 3);
+
+   if (hca_clock_page == MAP_FAILED) {
+   fprintf(stderr, PFX
+   "Warning: Timestamp available,\n"
+   "but failed to mmap() hca core clock page, errno=%d.\n",
+   errno);
+   return -1;
+   }
+
+   context->hca_core_clock = hca_clock_page +
+   context->core_clock_offset % dev->page_size;
+   return 0;
+}
+
 static int mlx4_init_context(struct verbs_device *v_device,
struct ibv_context *ibv_ctx, int cmd_fd)
 {
@@ -127,6 +149,10 @@ static int mlx4_init_context(struct verbs_device *v_device,
__u16   bf_reg_size;
struct mlx4_device  *dev = to_mdev(&v_device->device);
struct verbs_context *verbs_ctx = verbs_get_ctx(ibv_ctx);
+   struct ibv_query_device_ex_input input_query_device = {.comp_mask = 0};
+   struct ibv_device_attr_ex   dev_attrs;
+   uint32_tdev_attrs_comp_mask;
+   int err;
 
/* memory footprint of mlx4_context and verbs_context share
* struct ibv_context.
@@ -194,6 +220,12 @@ static int mlx4_init_context(struct verbs_device *v_device,
context->bf_buf_size = 0;
}
 
+   context->hca_core_clock = NULL;
+   err = _mlx4_query_device_ex(ibv_ctx, &input_query_device, &dev_attrs,
+   sizeof(dev_attrs), &dev_attrs_comp_mask);
+   if (!err && dev_attrs_comp_mask & QUERY_DEVICE_RESP_MASK_TIMESTAMP)
+   mlx4_map_internal_clock(dev, ibv_ctx);
+
pthread_spin_init(&context->uar_lock, PTHREAD_PROCESS_PRIVATE);
ibv_ctx->ops = mlx4_ctx_ops;
 
@@ -210,6 +242,7 @@ static int mlx4_init_context(struct verbs_device *v_device,
verbs_set_ctx_op(verbs_ctx, query_device_ex, mlx4_query_device_ex);
verbs_set_ctx_op(verbs_ctx, create_cq_ex, mlx4_create_cq_ex);
verbs_set_ctx_op(verbs_ctx, poll_cq_ex, mlx4_poll_cq_ex);
+   verbs_set_ctx_op(verbs_ctx, query_values, mlx4_query_values);
 
return 0;
 
@@ -223,6 +256,9 @@ static void mlx4_uninit_context(struct verbs_device 
*v_device,
munmap(context->uar, to_mdev(&v_device->device)->page_size);
if (context->bf_page)
munmap(context->bf_page, to_mdev(&v_device->device)->page_size);
+   if (context->hca_core_clock)
+   munmap(context->hca_core_clock - context->core_clock_offset,
+  to_mdev(&v_device->device)->page_size);
 
 }
 
diff --git a/src/mlx4.h b/src/mlx4.h
index 2465298..8e1935d 100644
--- a/src/mlx4.h
+++ b/src/mlx4.h
@@ -199,6 +199,7 @@ struct mlx4_context {
enum ibv_port_cap_flags caps;
} port_query_cache[MLX4_PORTS_NUM];
uint64_tcore_clock_offset;
+   void   *hca_core_clock;
 };
 
 struct mlx4_buf {
@@ -403,6 +404,8 @@ int _mlx4_query_device_ex(struct ibv_context *context,
 int mlx4_query_device_ex(struct ibv_context *context,
 const struct ibv_query_device_ex_input *input,
 struct ibv_device_attr_ex *attr, size_t attr_size);
+int mlx4_query_values(struct ibv_context *context,
+ struct ibv_values_ex *values);
 int mlx4_query_port(struct ibv_context *context, uint8_t port,
 struct ibv_port_attr *attr);
 
diff --git a/src/verbs.c b/src/verbs.c
index a8d6bd7..843ca1e 100644
--- a/src/verbs.c
+++ b/src/verbs.c
@@ -114,6 +114,51 @@ int mlx4_query_device_ex(struct ibv_context *context,
return _mlx4_query_device_ex(context, input, attr, attr_size, NULL);
 }
 
+#define READL(ptr) (*((uint32_t *)(ptr)))
+static int mlx4_read_clock(struct ibv_context *context, uint64_t *cycles)
+{
+   unsigned int clockhi, clocklo, clockhi1;
+   int i;
+   struct mlx4_context *ctx = to_mctx(context);
+
+   if (!ctx->hca_core_clo

[PATCH v1 libmlx4 6/7] Add support for different poll_one_ex functions

2015-10-27 Thread Matan Barak
In order to optimize the extended poll_one verb for different
wc_flags, add support for a poll_one_ex callback function.

Signed-off-by: Matan Barak 
---
 src/cq.c| 5 +++--
 src/mlx4.h  | 5 +
 src/verbs.c | 1 +
 3 files changed, 9 insertions(+), 2 deletions(-)

diff --git a/src/cq.c b/src/cq.c
index 7f40f12..1f2d572 100644
--- a/src/cq.c
+++ b/src/cq.c
@@ -601,7 +601,8 @@ int mlx4_poll_cq_ex(struct ibv_cq *ibcq,
int npolled;
int err = CQ_OK;
unsigned int ne = attr->max_entries;
-   uint64_t wc_flags = cq->wc_flags;
+   int (*poll_fn)(struct mlx4_cq *cq, struct mlx4_qp **cur_qp,
+  struct ibv_wc_ex **wc_ex) = cq->mlx4_poll_one;
 
if (attr->comp_mask)
return -EINVAL;
@@ -609,7 +610,7 @@ int mlx4_poll_cq_ex(struct ibv_cq *ibcq,
pthread_spin_lock(&cq->lock);
 
for (npolled = 0; npolled < ne; ++npolled) {
-   err = _mlx4_poll_one_ex(cq, &qp, &wc, wc_flags);
+   err = poll_fn(cq, &qp, &wc);
if (err != CQ_OK)
break;
}
diff --git a/src/mlx4.h b/src/mlx4.h
index 8e1935d..46a18d6 100644
--- a/src/mlx4.h
+++ b/src/mlx4.h
@@ -215,6 +215,8 @@ struct mlx4_pd {
 struct mlx4_cq {
struct ibv_cq   ibv_cq;
uint64_twc_flags;
+   int (*mlx4_poll_one)(struct mlx4_cq *cq, struct mlx4_qp **cur_qp,
+struct ibv_wc_ex **wc_ex);
struct mlx4_buf buf;
struct mlx4_buf resize_buf;
pthread_spinlock_t  lock;
@@ -432,6 +434,9 @@ int mlx4_poll_cq(struct ibv_cq *cq, int ne, struct ibv_wc 
*wc);
 int mlx4_poll_cq_ex(struct ibv_cq *ibcq,
struct ibv_wc_ex *wc,
struct ibv_poll_cq_ex_attr *attr);
+int mlx4_poll_one_ex(struct mlx4_cq *cq,
+struct mlx4_qp **cur_qp,
+struct ibv_wc_ex **pwc_ex);
 int mlx4_arm_cq(struct ibv_cq *cq, int solicited);
 void mlx4_cq_event(struct ibv_cq *cq);
 void __mlx4_cq_clean(struct mlx4_cq *cq, uint32_t qpn, struct mlx4_srq *srq);
diff --git a/src/verbs.c b/src/verbs.c
index 843ca1e..62908c1 100644
--- a/src/verbs.c
+++ b/src/verbs.c
@@ -432,6 +432,7 @@ static struct ibv_cq *create_cq(struct ibv_context *context,
if (ret)
goto err_db;
 
+   cq->mlx4_poll_one = mlx4_poll_one_ex;
cq->creation_flags = cmd_e.ibv_cmd.flags;
cq->wc_flags = cq_attr->wc_flags;
cq->cqn = resp.cqn;
-- 
2.1.0

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH v1 libmlx4 3/7] Implement ibv_poll_cq_ex extension verb

2015-10-27 Thread Matan Barak
Add an implementation of the poll_cq_ex extension verb.
This patch implements the new API on top of the standard
mlx4_poll_one flow.
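
For orientation, a hedged sketch of how the legacy poll path layers on the
new common helper; the helper name and return codes are taken from the diff
below, while the body is illustrative rather than the actual patch:

/* Illustrative sketch: mlx4_handle_cq() fills the fields shared by
 * struct ibv_wc and struct ibv_wc_ex and hands back the raw CQE so the
 * caller can do the opcode-specific parsing. */
static int mlx4_poll_one_sketch(struct mlx4_cq *cq, struct mlx4_qp **cur_qp,
				struct ibv_wc *wc)
{
	struct mlx4_cqe *cqe;
	uint32_t qpn;
	int is_send;
	int err;

	err = mlx4_handle_cq(cq, cur_qp, &wc->wr_id, &wc->status,
			     &wc->vendor_err, &cqe, &qpn, &is_send);
	if (err != CQ_CONTINUE)
		return err;

	/* opcode-, qpn- and direction-specific parsing of the CQE would
	 * follow here, exactly as in the existing mlx4_poll_one() */
	(void)cqe; (void)qpn; (void)is_send;

	return CQ_OK;
}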

Signed-off-by: Matan Barak 
---
 src/cq.c| 307 ++--
 src/mlx4.c  |   1 +
 src/mlx4.h  |   4 +
 src/verbs.c |   1 +
 4 files changed, 284 insertions(+), 29 deletions(-)

diff --git a/src/cq.c b/src/cq.c
index 32c9070..c86e824 100644
--- a/src/cq.c
+++ b/src/cq.c
@@ -52,6 +52,7 @@ enum {
 };
 
 enum {
+   CQ_CONTINUE =  1,
CQ_OK   =  0,
CQ_EMPTY= -1,
CQ_POLL_ERR = -2
@@ -121,7 +122,9 @@ static void update_cons_index(struct mlx4_cq *cq)
*cq->set_ci_db = htonl(cq->cons_index & 0xff);
 }
 
-static void mlx4_handle_error_cqe(struct mlx4_err_cqe *cqe, struct ibv_wc *wc)
+static void mlx4_handle_error_cqe(struct mlx4_err_cqe *cqe,
+ enum ibv_wc_status *status,
+ enum ibv_wc_opcode *vendor_err)
 {
if (cqe->syndrome == MLX4_CQE_SYNDROME_LOCAL_QP_OP_ERR)
printf(PFX "local QP operation err "
@@ -133,64 +136,68 @@ static void mlx4_handle_error_cqe(struct mlx4_err_cqe 
*cqe, struct ibv_wc *wc)
 
switch (cqe->syndrome) {
case MLX4_CQE_SYNDROME_LOCAL_LENGTH_ERR:
-   wc->status = IBV_WC_LOC_LEN_ERR;
+   *status = IBV_WC_LOC_LEN_ERR;
break;
case MLX4_CQE_SYNDROME_LOCAL_QP_OP_ERR:
-   wc->status = IBV_WC_LOC_QP_OP_ERR;
+   *status = IBV_WC_LOC_QP_OP_ERR;
break;
case MLX4_CQE_SYNDROME_LOCAL_PROT_ERR:
-   wc->status = IBV_WC_LOC_PROT_ERR;
+   *status = IBV_WC_LOC_PROT_ERR;
break;
case MLX4_CQE_SYNDROME_WR_FLUSH_ERR:
-   wc->status = IBV_WC_WR_FLUSH_ERR;
+   *status = IBV_WC_WR_FLUSH_ERR;
break;
case MLX4_CQE_SYNDROME_MW_BIND_ERR:
-   wc->status = IBV_WC_MW_BIND_ERR;
+   *status = IBV_WC_MW_BIND_ERR;
break;
case MLX4_CQE_SYNDROME_BAD_RESP_ERR:
-   wc->status = IBV_WC_BAD_RESP_ERR;
+   *status = IBV_WC_BAD_RESP_ERR;
break;
case MLX4_CQE_SYNDROME_LOCAL_ACCESS_ERR:
-   wc->status = IBV_WC_LOC_ACCESS_ERR;
+   *status = IBV_WC_LOC_ACCESS_ERR;
break;
case MLX4_CQE_SYNDROME_REMOTE_INVAL_REQ_ERR:
-   wc->status = IBV_WC_REM_INV_REQ_ERR;
+   *status = IBV_WC_REM_INV_REQ_ERR;
break;
case MLX4_CQE_SYNDROME_REMOTE_ACCESS_ERR:
-   wc->status = IBV_WC_REM_ACCESS_ERR;
+   *status = IBV_WC_REM_ACCESS_ERR;
break;
case MLX4_CQE_SYNDROME_REMOTE_OP_ERR:
-   wc->status = IBV_WC_REM_OP_ERR;
+   *status = IBV_WC_REM_OP_ERR;
break;
case MLX4_CQE_SYNDROME_TRANSPORT_RETRY_EXC_ERR:
-   wc->status = IBV_WC_RETRY_EXC_ERR;
+   *status = IBV_WC_RETRY_EXC_ERR;
break;
case MLX4_CQE_SYNDROME_RNR_RETRY_EXC_ERR:
-   wc->status = IBV_WC_RNR_RETRY_EXC_ERR;
+   *status = IBV_WC_RNR_RETRY_EXC_ERR;
break;
case MLX4_CQE_SYNDROME_REMOTE_ABORTED_ERR:
-   wc->status = IBV_WC_REM_ABORT_ERR;
+   *status = IBV_WC_REM_ABORT_ERR;
break;
default:
-   wc->status = IBV_WC_GENERAL_ERR;
+   *status = IBV_WC_GENERAL_ERR;
break;
}
 
-   wc->vendor_err = cqe->vendor_err;
+   *vendor_err = cqe->vendor_err;
 }
 
-static int mlx4_poll_one(struct mlx4_cq *cq,
-struct mlx4_qp **cur_qp,
-struct ibv_wc *wc)
+static inline int mlx4_handle_cq(struct mlx4_cq *cq,
+struct mlx4_qp **cur_qp,
+uint64_t *wc_wr_id,
+enum ibv_wc_status *wc_status,
+uint32_t *wc_vendor_err,
+struct mlx4_cqe **pcqe,
+uint32_t *pqpn,
+int *pis_send)
 {
struct mlx4_wq *wq;
struct mlx4_cqe *cqe;
struct mlx4_srq *srq;
uint32_t qpn;
-   uint32_t g_mlpath_rqpn;
-   uint16_t wqe_index;
int is_error;
int is_send;
+   uint16_t wqe_index;
 
cqe = next_cqe_sw(cq);
if (!cqe)
@@ -201,7 +208,7 @@ static int mlx4_poll_one(struct mlx4_cq *cq,
 
++cq->cons_index;
 
-   VALGRIND_MAKE_MEM_DEFINED(cqe, sizeof *cqe);
+   VALGRIND_MAKE_MEM_DEFINED(cqe, sizeof(*cqe));
 
/*
 * Make sure we read CQ entry contents after we've checked the
@@ -210,7 +217,6 @@ st

RE: [PATCH 1/2] mlx4: Expose correct max_sge_rd limit

2015-10-27 Thread Eli Cohen
Just discussed the issue with Sagi. Sagi will follow up with a small correction.

-Original Message-
From: Sagi Grimberg [mailto:sa...@dev.mellanox.co.il] 
Sent: Tuesday, October 27, 2015 11:32 AM
To: Bart Van Assche; linux-rdma@vger.kernel.org; target-de...@vger.kernel.org
Cc: Steve Wise; Nicholas A. Bellinger; Or Gerlitz; Doug Ledford; Eli Cohen
Subject: Re: [PATCH 1/2] mlx4: Expose correct max_sge_rd limit

> Hello Sagi,
>
> Is this the same issue as what has been discussed in 
> http://www.spinics.net/lists/linux-rdma/msg21799.html ?

Looks like it.

I think this patch addresses this issue, but let's CC Eli to comment if I'm
missing something.

Thanks for digging this up...

Sagi.
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] IB/sa: replace GFP_KERNEL with GFP_ATOMIC

2015-10-27 Thread ira.weiny
On Tue, Oct 27, 2015 at 09:17:40PM +0530, Saurabh Sengar wrote:
> replace GFP_KERNEL with GFP_ATOMIC, as code while holding a spinlock
> should be atomic
> GFP_KERNEL may sleep and can cause deadlock, where as GFP_ATOMIC may
> fail but certainly avoids deadlock

Great catch.  Thanks!

However, gfp_t is passed to send_mad and we should pass that down and use it.

Compile tested only, suggestion below,
Ira


14:09:12 > git di
diff --git a/drivers/infiniband/core/sa_query.c
b/drivers/infiniband/core/sa_query.c
index 8c014b33d8e0..54d454042b28 100644
--- a/drivers/infiniband/core/sa_query.c
+++ b/drivers/infiniband/core/sa_query.c
@@ -512,7 +512,7 @@ static int ib_nl_get_path_rec_attrs_len(ib_sa_comp_mask 
comp_mask)
return len;
 }
 
-static int ib_nl_send_msg(struct ib_sa_query *query)
+static int ib_nl_send_msg(struct ib_sa_query *query, gfp_t gfp_mask)
 {
struct sk_buff *skb = NULL;
struct nlmsghdr *nlh;
@@ -526,7 +526,7 @@ static int ib_nl_send_msg(struct ib_sa_query *query)
if (len <= 0)
return -EMSGSIZE;
 
-   skb = nlmsg_new(len, GFP_KERNEL);
+   skb = nlmsg_new(len, gfp_mask);
if (!skb)
return -ENOMEM;
 
@@ -544,7 +544,7 @@ static int ib_nl_send_msg(struct ib_sa_query *query)
/* Repair the nlmsg header length */
nlmsg_end(skb, nlh);
 
-   ret = ibnl_multicast(skb, nlh, RDMA_NL_GROUP_LS, GFP_KERNEL);
+   ret = ibnl_multicast(skb, nlh, RDMA_NL_GROUP_LS, gfp_mask);
if (!ret)
ret = len;
else
@@ -553,7 +553,7 @@ static int ib_nl_send_msg(struct ib_sa_query *query)
return ret;
 }
 
-static int ib_nl_make_request(struct ib_sa_query *query)
+static int ib_nl_make_request(struct ib_sa_query *query, gfp_t gfp_mask)
 {
unsigned long flags;
unsigned long delay;
@@ -563,7 +563,7 @@ static int ib_nl_make_request(struct ib_sa_query *query)
query->seq = (u32)atomic_inc_return(&ib_nl_sa_request_seq);
 
spin_lock_irqsave(&ib_nl_request_lock, flags);
-   ret = ib_nl_send_msg(query);
+   ret = ib_nl_send_msg(query, gfp_mask);
if (ret <= 0) {
ret = -EIO;
goto request_out;
@@ -1105,7 +1105,7 @@ static int send_mad(struct ib_sa_query *query, int 
timeout_ms, gfp_t gfp_mask)
 
if (query->flags & IB_SA_ENABLE_LOCAL_SERVICE) {
if (!ibnl_chk_listeners(RDMA_NL_GROUP_LS)) {
-   if (!ib_nl_make_request(query))
+   if (!ib_nl_make_request(query, gfp_mask))
return id;
}
ib_sa_disable_local_svc(query);

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] IB/sa: replace GFP_KERNEL with GFP_ATOMIC

2015-10-27 Thread Jason Gunthorpe
On Tue, Oct 27, 2015 at 02:12:36PM -0400, ira.weiny wrote:
> On Tue, Oct 27, 2015 at 09:17:40PM +0530, Saurabh Sengar wrote:
> > replace GFP_KERNEL with GFP_ATOMIC, as code while holding a spinlock
> > should be atomic
> > GFP_KERNEL may sleep and can cause deadlock, where as GFP_ATOMIC may
> > fail but certainly avoids deadlock
> 
> Great catch.  Thanks!
> 
> However, gfp_t is passed to send_mad and we should pass that down and use it.

> spin_lock_irqsave(&ib_nl_request_lock, flags);
> -   ret = ib_nl_send_msg(query);
> +   ret = ib_nl_send_msg(query, gfp_mask);

A spin lock is guaranteed to be held around ib_nl_send_msg, so its
allocations have to be atomic; we can't use gfp_mask here...

I do wonder if it is a good idea to call ib_nl_send_msg with a spinlock
held, though. Would be nice to see that go away.

Jason
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] IB/sa: replace GFP_KERNEL with GFP_ATOMIC

2015-10-27 Thread ira.weiny
On Tue, Oct 27, 2015 at 12:16:52PM -0600, Jason Gunthorpe wrote:
> On Tue, Oct 27, 2015 at 02:12:36PM -0400, ira.weiny wrote:
> > On Tue, Oct 27, 2015 at 09:17:40PM +0530, Saurabh Sengar wrote:
> > > replace GFP_KERNEL with GFP_ATOMIC, as code while holding a spinlock
> > > should be atomic
> > > GFP_KERNEL may sleep and can cause deadlock, where as GFP_ATOMIC may
> > > fail but certainly avoids deadlock
> > 
> > Great catch.  Thanks!
> > 
> > However, gfp_t is passed to send_mad and we should pass that down and use 
> > it.
> 
> > spin_lock_irqsave(&ib_nl_request_lock, flags);
> > -   ret = ib_nl_send_msg(query);
> > +   ret = ib_nl_send_msg(query, gfp_mask);
> 
> A spin lock is guarenteed held around ib_nl_send_msg, so it's
> allocations have to be atomic, can't use gfp_mask here..
> 
> I do wonder if it is a good idea to call ib_nl_send_msg with a spinlock
> held though.. Would be nice to see that go away.

Ah, yea my bad.

Ira

> 
> Jason
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH libibverbs] Expose QP block self multicast loopback creation flag

2015-10-27 Thread Leon Romanovsky
On Tue, Oct 27, 2015 at 02:53:01PM +0200, Eran Ben Elisha wrote:
...
> +enum ibv_qp_create_flags {
> + IBV_QP_CREATE_BLOCK_SELF_MCAST_LB   = 1 << 1,
>  };
>  
I'm sure that I'm missing something important, but why did it start
from shift 1 and not shift 0?

> --
> To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] IB/sa: replace GFP_KERNEL with GFP_ATOMIC

2015-10-27 Thread Jason Gunthorpe
On Tue, Oct 27, 2015 at 06:56:50PM +, Wan, Kaike wrote:
 
> > I do wonder if it is a good idea to call ib_nl_send_msg with a spinlock held
> > though.. Would be nice to see that go away.
> 
> We have to hold the lock to protect against a race condition that a
> quick response will try to free the request from the
> ib_nl_request_list before we even put it on the list.

Put it on the list first? Use a kref? Doesn't look like a big deal to
clean this up.
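
Roughly the shape being suggested, as a hedged sketch only; the
query->ref and query->list fields and the ib_nl_remove_request() and
ib_nl_release_query() helpers are hypothetical, not the actual sa_query
code:

/* Illustrative only: queue the request under the lock, then send outside
 * it.  A response racing with the send merely drops the list's reference;
 * the last kref_put() frees the query. */
static int ib_nl_make_request(struct ib_sa_query *query, gfp_t gfp_mask)
{
	unsigned long flags;
	int ret;

	kref_init(&query->ref);			/* caller's reference */
	kref_get(&query->ref);			/* list's reference */

	spin_lock_irqsave(&ib_nl_request_lock, flags);
	list_add_tail(&query->list, &ib_nl_request_list);
	spin_unlock_irqrestore(&ib_nl_request_lock, flags);

	ret = ib_nl_send_msg(query, gfp_mask);	/* may sleep now */
	if (ret <= 0) {
		ib_nl_remove_request(query);	/* unlink + drop list ref */
		ret = -EIO;
	} else {
		ret = 0;
	}

	kref_put(&query->ref, ib_nl_release_query);
	return ret;
}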

Jason
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v3 12/23] staging/rdma/hfi1: Macro code clean up

2015-10-27 Thread ira.weiny
On Tue, Oct 27, 2015 at 05:19:10PM +0900, Greg KH wrote:
> On Mon, Oct 26, 2015 at 10:28:38AM -0400, ira.we...@intel.com wrote:
> > From: Mitko Haralanov 
> > 
> > Clean up the context and sdma macros and move them to a more logical place 
> > in
> > hfi.h
> > 
> > Signed-off-by: Mitko Haralanov 
> > Signed-off-by: Ira Weiny 
> > ---
> >  drivers/staging/rdma/hfi1/hfi.h | 22 ++
> >  1 file changed, 10 insertions(+), 12 deletions(-)
> > 
> > diff --git a/drivers/staging/rdma/hfi1/hfi.h 
> > b/drivers/staging/rdma/hfi1/hfi.h
> > index a35213e9b500..41ad9a30149b 100644
> > --- a/drivers/staging/rdma/hfi1/hfi.h
> > +++ b/drivers/staging/rdma/hfi1/hfi.h
> > @@ -1104,6 +1104,16 @@ struct hfi1_filedata {
> > int rec_cpu_num;
> >  };
> >  
> > +/* for use in system calls, where we want to know device type, etc. */
> > +#define fp_to_fd(fp) ((struct hfi1_filedata *)(fp)->private_data)
> > +#define ctxt_fp(fp) (fp_to_fd((fp))->uctxt)
> > +#define subctxt_fp(fp) (fp_to_fd((fp))->subctxt)
> > +#define tidcursor_fp(fp) (fp_to_fd((fp))->tidcursor)
> > +#define user_sdma_pkt_fp(fp) (fp_to_fd((fp))->pq)
> > +#define user_sdma_comp_fp(fp) (fp_to_fd((fp))->cq)
> > +#define notifier_fp(fp) (fp_to_fd((fp))->mn)
> > +#define rb_fp(fp) (fp_to_fd((fp))->tid_rb_root)
> 
> Ick, no, don't do this, just spell it all out (odds are you will see tht
> you can make the code simpler...)  If you don't know what "cq" or "pq"
> are, then name them properly.
> 
> These need to be all removed.

Ok.

Can I add the removal of these macros to the TODO list and get this patch
accepted in the interim?

Many of the patches I am queueing up to submit, as well as one in this
series, do not apply cleanly without this change.  It will be much easier
if I can get everything applied and then do a global cleanup of these
macros after the fact.

Thanks,
Ira

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v3 23/23] staging/rdma/hfi1: Update driver version string to 0.9-294

2015-10-27 Thread ira.weiny
On Tue, Oct 27, 2015 at 05:46:41PM +0900, Greg KH wrote:
> On Mon, Oct 26, 2015 at 10:28:49AM -0400, ira.we...@intel.com wrote:
> > From: Jubin John 
> > 
> > Signed-off-by: Jubin John 
> > Signed-off-by: Ira Weiny 
> > ---
> >  drivers/staging/rdma/hfi1/common.h | 2 +-
> >  1 file changed, 1 insertion(+), 1 deletion(-)
> > 
> > diff --git a/drivers/staging/rdma/hfi1/common.h 
> > b/drivers/staging/rdma/hfi1/common.h
> > index 7809093eb55e..5dd92720faae 100644
> > --- a/drivers/staging/rdma/hfi1/common.h
> > +++ b/drivers/staging/rdma/hfi1/common.h
> > @@ -205,7 +205,7 @@
> >   * to the driver itself, not the software interfaces it supports.
> >   */
> >  #ifndef HFI1_DRIVER_VERSION_BASE
> > -#define HFI1_DRIVER_VERSION_BASE "0.9-248"
> > +#define HFI1_DRIVER_VERSION_BASE "0.9-294"
> 
> Patches like this make no sense at all, please drop it and only use the
> kernel version.

What do you mean by "only use the kernel version"?  Do you mean

#define HFI1_DRIVER_VERSION_BASE UTS_RELEASE
 
Or just remove the macro entirely?

>
> Trust me, it's going to get messy really fast (hint, it
> already did...)

Did I base this on the wrong tree?  Not sure how this could have messed you up.

Thanks,
Ira

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v3 12/23] staging/rdma/hfi1: Macro code clean up

2015-10-27 Thread Greg KH
On Tue, Oct 27, 2015 at 04:51:15PM -0400, ira.weiny wrote:
> On Tue, Oct 27, 2015 at 05:19:10PM +0900, Greg KH wrote:
> > On Mon, Oct 26, 2015 at 10:28:38AM -0400, ira.we...@intel.com wrote:
> > > From: Mitko Haralanov 
> > > 
> > > Clean up the context and sdma macros and move them to a more logical 
> > > place in
> > > hfi.h
> > > 
> > > Signed-off-by: Mitko Haralanov 
> > > Signed-off-by: Ira Weiny 
> > > ---
> > >  drivers/staging/rdma/hfi1/hfi.h | 22 ++
> > >  1 file changed, 10 insertions(+), 12 deletions(-)
> > > 
> > > diff --git a/drivers/staging/rdma/hfi1/hfi.h 
> > > b/drivers/staging/rdma/hfi1/hfi.h
> > > index a35213e9b500..41ad9a30149b 100644
> > > --- a/drivers/staging/rdma/hfi1/hfi.h
> > > +++ b/drivers/staging/rdma/hfi1/hfi.h
> > > @@ -1104,6 +1104,16 @@ struct hfi1_filedata {
> > >   int rec_cpu_num;
> > >  };
> > >  
> > > +/* for use in system calls, where we want to know device type, etc. */
> > > +#define fp_to_fd(fp) ((struct hfi1_filedata *)(fp)->private_data)
> > > +#define ctxt_fp(fp) (fp_to_fd((fp))->uctxt)
> > > +#define subctxt_fp(fp) (fp_to_fd((fp))->subctxt)
> > > +#define tidcursor_fp(fp) (fp_to_fd((fp))->tidcursor)
> > > +#define user_sdma_pkt_fp(fp) (fp_to_fd((fp))->pq)
> > > +#define user_sdma_comp_fp(fp) (fp_to_fd((fp))->cq)
> > > +#define notifier_fp(fp) (fp_to_fd((fp))->mn)
> > > +#define rb_fp(fp) (fp_to_fd((fp))->tid_rb_root)
> > 
> > Ick, no, don't do this, just spell it all out (odds are you will see tht
> > you can make the code simpler...)  If you don't know what "cq" or "pq"
> > are, then name them properly.
> > 
> > These need to be all removed.
> 
> Ok.
> 
> Can I add the removal of these macros to the TODO list and get this patch
> accepted in the interm?

Nope, sorry, why would I accept a known-problem patch?  Would you do
such a thing?

> Many of the patches I am queueing up to submit as well as one in this series 
> do
> not apply cleanly without this change.  It will be much easier if I can get
> everything applied and then do a global clean up of these macros after the
> fact.

But you would have no incentive to do that if I take this patch now :)

thanks,

greg k-h
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v3 23/23] staging/rdma/hfi1: Update driver version string to 0.9-294

2015-10-27 Thread Greg KH
On Tue, Oct 27, 2015 at 05:00:22PM -0400, ira.weiny wrote:
> On Tue, Oct 27, 2015 at 05:46:41PM +0900, Greg KH wrote:
> > On Mon, Oct 26, 2015 at 10:28:49AM -0400, ira.we...@intel.com wrote:
> > > From: Jubin John 
> > > 
> > > Signed-off-by: Jubin John 
> > > Signed-off-by: Ira Weiny 
> > > ---
> > >  drivers/staging/rdma/hfi1/common.h | 2 +-
> > >  1 file changed, 1 insertion(+), 1 deletion(-)
> > > 
> > > diff --git a/drivers/staging/rdma/hfi1/common.h 
> > > b/drivers/staging/rdma/hfi1/common.h
> > > index 7809093eb55e..5dd92720faae 100644
> > > --- a/drivers/staging/rdma/hfi1/common.h
> > > +++ b/drivers/staging/rdma/hfi1/common.h
> > > @@ -205,7 +205,7 @@
> > >   * to the driver itself, not the software interfaces it supports.
> > >   */
> > >  #ifndef HFI1_DRIVER_VERSION_BASE
> > > -#define HFI1_DRIVER_VERSION_BASE "0.9-248"
> > > +#define HFI1_DRIVER_VERSION_BASE "0.9-294"
> > 
> > Patches like this make no sense at all, please drop it and only use the
> > kernel version.
> 
> What do you mean by "only use the kernel version"?  Do you mean
> 
> #define HFI1_DRIVER_VERSION_BASE UTS_RELEASE
>  
> Or just remove the macro entirely?

Remove it entirely, it's pointless and makes no sense for in-kernel
code.

> > Trust me, it's going to get messy really fast (hint, it
> > already did...)
> 
> Did I base this on the wrong tree?  Not sure how this could have messed you 
> up.

Nope, the patch applied just fine, but think about it, I didn't take all
of the patches you sent me, so what exactly does that version number now
represent?  Hint, absolutely nothing, or even worse, something
completely wrong :)

thanks,

greg k-h
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 0/7] Fix an infinite loop in the SRP initiator

2015-10-27 Thread Bart Van Assche
Submitting a SCSI request through the SG_IO mechanism with a scatterlist 
that is longer than what is supported by the SRP initiator triggers an 
infinite loop. This patch series fixes that behavior.


The individual patches in this series are as follows:

0001-IB-srp-Fix-a-spelling-error.patch
0002-IB-srp-Document-srp_map_data-return-value.patch
0003-IB-srp-Rename-work-request-ID-labels.patch
0004-IB-srp-Fix-a-potential-queue-overflow-in-an-error-pa.patch
0005-IB-srp-Fix-srp_map_data-error-paths.patch
0006-IB-srp-Introduce-target-mr_pool_size.patch
0007-IB-srp-Avoid-that-mapping-failure-triggers-an-infini.patch
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 2/7] IB/srp: Document srp_map_data() return value

2015-10-27 Thread Bart Van Assche
Signed-off-by: Bart Van Assche 
Cc: Sagi Grimberg 
Cc: Sebastian Parschauer 
---
 drivers/infiniband/ulp/srp/ib_srp.c | 9 +
 1 file changed, 9 insertions(+)

diff --git a/drivers/infiniband/ulp/srp/ib_srp.c 
b/drivers/infiniband/ulp/srp/ib_srp.c
index 1c94d93..c1faf70 100644
--- a/drivers/infiniband/ulp/srp/ib_srp.c
+++ b/drivers/infiniband/ulp/srp/ib_srp.c
@@ -1497,6 +1497,15 @@ out:
return ret;
 }
 
+/**
+ * srp_map_data() - map SCSI data buffer onto an SRP request
+ * @scmnd: SCSI command to map
+ * @ch: SRP RDMA channel
+ * @req: SRP request
+ *
+ * Returns the length in bytes of the SRP_CMD IU or a negative value if
+ * mapping failed.
+ */
 static int srp_map_data(struct scsi_cmnd *scmnd, struct srp_rdma_ch *ch,
struct srp_request *req)
 {
-- 
2.1.4

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 1/7] IB/srp: Fix a spelling error

2015-10-27 Thread Bart Van Assche
Signed-off-by: Bart Van Assche 
Cc: Sagi Grimberg 
Cc: Sebastian Parschauer 
---
 drivers/infiniband/ulp/srp/ib_srp.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/infiniband/ulp/srp/ib_srp.c 
b/drivers/infiniband/ulp/srp/ib_srp.c
index d01395b..1c94d93 100644
--- a/drivers/infiniband/ulp/srp/ib_srp.c
+++ b/drivers/infiniband/ulp/srp/ib_srp.c
@@ -1407,7 +1407,7 @@ static int srp_map_sg_entry(struct srp_map_state *state,
/*
 * If the last entry of the MR wasn't a full page, then we need to
 * close it out and start a new one -- we can only merge at page
-* boundries.
+* boundaries.
 */
ret = 0;
if (len != dev->mr_page_size)
-- 
2.1.4

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 7/7] IB/srp: Avoid that mapping failure triggers an infinite loop

2015-10-27 Thread Bart Van Assche
Signed-off-by: Bart Van Assche 
Cc: Sagi Grimberg 
Cc: Sebastian Parschauer 
---
 drivers/infiniband/ulp/srp/ib_srp.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/drivers/infiniband/ulp/srp/ib_srp.c 
b/drivers/infiniband/ulp/srp/ib_srp.c
index 47c3a72..59d3ff9 100644
--- a/drivers/infiniband/ulp/srp/ib_srp.c
+++ b/drivers/infiniband/ulp/srp/ib_srp.c
@@ -1666,6 +1666,8 @@ map_complete:
 
 unmap:
srp_unmap_data(scmnd, ch, req, true);
+   if (ret == -ENOMEM && req->nmdesc >= target->mr_pool_size)
+   ret = -E2BIG;
return ret;
 }
 
-- 
2.1.4

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 6/7] IB/srp: Introduce target->mr_pool_size

2015-10-27 Thread Bart Van Assche
This patch does not change any functionality.

Signed-off-by: Bart Van Assche 
Cc: Sagi Grimberg 
Cc: Sebastian Parschauer 
---
 drivers/infiniband/ulp/srp/ib_srp.c | 6 +++---
 drivers/infiniband/ulp/srp/ib_srp.h | 1 +
 2 files changed, 4 insertions(+), 3 deletions(-)

diff --git a/drivers/infiniband/ulp/srp/ib_srp.c 
b/drivers/infiniband/ulp/srp/ib_srp.c
index fb6b654..47c3a72 100644
--- a/drivers/infiniband/ulp/srp/ib_srp.c
+++ b/drivers/infiniband/ulp/srp/ib_srp.c
@@ -315,7 +315,7 @@ static struct ib_fmr_pool *srp_alloc_fmr_pool(struct 
srp_target_port *target)
struct ib_fmr_pool_param fmr_param;
 
memset(&fmr_param, 0, sizeof(fmr_param));
-   fmr_param.pool_size = target->scsi_host->can_queue;
+   fmr_param.pool_size = target->mr_pool_size;
fmr_param.dirty_watermark   = fmr_param.pool_size / 4;
fmr_param.cache = 1;
fmr_param.max_pages_per_fmr = dev->max_pages_per_mr;
@@ -449,8 +449,7 @@ static struct srp_fr_pool *srp_alloc_fr_pool(struct 
srp_target_port *target)
 {
struct srp_device *dev = target->srp_host->srp_dev;
 
-   return srp_create_fr_pool(dev->dev, dev->pd,
- target->scsi_host->can_queue,
+   return srp_create_fr_pool(dev->dev, dev->pd, target->mr_pool_size,
  dev->max_pages_per_mr);
 }
 
@@ -3247,6 +3246,7 @@ static ssize_t srp_create_target(struct device *dev,
}
 
target_host->sg_tablesize = target->sg_tablesize;
+   target->mr_pool_size = target->scsi_host->can_queue;
target->indirect_size = target->sg_tablesize *
sizeof (struct srp_direct_buf);
target->max_iu_len = sizeof (struct srp_cmd) +
diff --git a/drivers/infiniband/ulp/srp/ib_srp.h 
b/drivers/infiniband/ulp/srp/ib_srp.h
index 1c6a715..af084f7 100644
--- a/drivers/infiniband/ulp/srp/ib_srp.h
+++ b/drivers/infiniband/ulp/srp/ib_srp.h
@@ -205,6 +205,7 @@ struct srp_target_port {
chartarget_name[32];
unsigned intscsi_id;
unsigned intsg_tablesize;
+   int mr_pool_size;
int queue_size;
int req_ring_size;
int comp_vector;
-- 
2.1.4

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 5/7] IB/srp: Fix srp_map_data() error paths

2015-10-27 Thread Bart Van Assche
Ensure that req->nmdesc is set correctly even if srp_map_sg() fails.
Avoid a memory descriptor leak when mapping fails. Report srp_map_sg()
failure to the caller.

Signed-off-by: Bart Van Assche 
Cc: Sagi Grimberg 
Cc: Sebastian Parschauer 
---
 drivers/infiniband/ulp/srp/ib_srp.c | 15 +++
 1 file changed, 11 insertions(+), 4 deletions(-)

diff --git a/drivers/infiniband/ulp/srp/ib_srp.c 
b/drivers/infiniband/ulp/srp/ib_srp.c
index 6d17fe2..fb6b654 100644
--- a/drivers/infiniband/ulp/srp/ib_srp.c
+++ b/drivers/infiniband/ulp/srp/ib_srp.c
@@ -1473,7 +1473,6 @@ static int srp_map_sg(struct srp_map_state *state, struct 
srp_rdma_ch *ch,
}
}
 
-   req->nmdesc = state->nmdesc;
ret = 0;
 
 out:
@@ -1594,7 +1593,10 @@ static int srp_map_data(struct scsi_cmnd *scmnd, struct 
srp_rdma_ch *ch,
   target->indirect_size, DMA_TO_DEVICE);
 
memset(&state, 0, sizeof(state));
-   srp_map_sg(&state, ch, req, scat, count);
+   ret = srp_map_sg(&state, ch, req, scat, count);
+   req->nmdesc = state.nmdesc;
+   if (ret < 0)
+   goto unmap;
 
/* We've mapped the request, now pull as much of the indirect
 * descriptor table as we can into the command buffer. If this
@@ -1617,7 +1619,8 @@ static int srp_map_data(struct scsi_cmnd *scmnd, struct 
srp_rdma_ch *ch,
!target->allow_ext_sg)) {
shost_printk(KERN_ERR, target->scsi_host,
 "Could not fit S/G list into SRP_CMD\n");
-   return -EIO;
+   ret = -EIO;
+   goto unmap;
}
 
count = min(state.ndesc, target->cmd_sg_cnt);
@@ -1635,7 +1638,7 @@ static int srp_map_data(struct scsi_cmnd *scmnd, struct 
srp_rdma_ch *ch,
ret = srp_map_idb(ch, req, state.gen.next, state.gen.end,
  idb_len, &idb_rkey);
if (ret < 0)
-   return ret;
+   goto unmap;
req->nmdesc++;
} else {
idb_rkey = target->global_mr->rkey;
@@ -1661,6 +1664,10 @@ map_complete:
cmd->buf_fmt = fmt;
 
return len;
+
+unmap:
+   srp_unmap_data(scmnd, ch, req, true);
+   return ret;
 }
 
 /*
-- 
2.1.4

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 3/7] IB/srp: Rename work request ID labels

2015-10-27 Thread Bart Van Assche
Work request IDs in the SRP initiator driver are either a pointer or a
value that is not a valid pointer. Since the local invalidate and fast
registration work request IDs are not used as masks, drop the "_MASK"
suffix from their names.

Signed-off-by: Bart Van Assche 
Cc: Sagi Grimberg 
Cc: Sebastian Parschauer 
---
 drivers/infiniband/ulp/srp/ib_srp.c | 8 
 drivers/infiniband/ulp/srp/ib_srp.h | 7 +++
 2 files changed, 7 insertions(+), 8 deletions(-)

diff --git a/drivers/infiniband/ulp/srp/ib_srp.c 
b/drivers/infiniband/ulp/srp/ib_srp.c
index c1faf70..1aa9a4c 100644
--- a/drivers/infiniband/ulp/srp/ib_srp.c
+++ b/drivers/infiniband/ulp/srp/ib_srp.c
@@ -1049,7 +1049,7 @@ static int srp_inv_rkey(struct srp_rdma_ch *ch, u32 rkey)
struct ib_send_wr *bad_wr;
struct ib_send_wr wr = {
.opcode = IB_WR_LOCAL_INV,
-   .wr_id  = LOCAL_INV_WR_ID_MASK,
+   .wr_id  = LOCAL_INV_WR_ID,
.next   = NULL,
.num_sge= 0,
.send_flags = 0,
@@ -1325,7 +1325,7 @@ static int srp_map_finish_fr(struct srp_map_state *state,
 
memset(&wr, 0, sizeof(wr));
wr.opcode = IB_WR_FAST_REG_MR;
-   wr.wr_id = FAST_REG_WR_ID_MASK;
+   wr.wr_id = FAST_REG_WR_ID;
wr.wr.fast_reg.iova_start = state->base_dma_addr;
wr.wr.fast_reg.page_list = desc->frpl;
wr.wr.fast_reg.page_list_len = state->npages;
@@ -1940,11 +1940,11 @@ static void srp_handle_qp_err(u64 wr_id, enum 
ib_wc_status wc_status,
}
 
if (ch->connected && !target->qp_in_error) {
-   if (wr_id & LOCAL_INV_WR_ID_MASK) {
+   if (wr_id == LOCAL_INV_WR_ID) {
shost_printk(KERN_ERR, target->scsi_host, PFX
 "LOCAL_INV failed with status %s (%d)\n",
 ib_wc_status_msg(wc_status), wc_status);
-   } else if (wr_id & FAST_REG_WR_ID_MASK) {
+   } else if (wr_id == FAST_REG_WR_ID) {
shost_printk(KERN_ERR, target->scsi_host, PFX
 "FAST_REG_MR failed status %s (%d)\n",
 ib_wc_status_msg(wc_status), wc_status);
diff --git a/drivers/infiniband/ulp/srp/ib_srp.h 
b/drivers/infiniband/ulp/srp/ib_srp.h
index 3608f2e..1c6a715 100644
--- a/drivers/infiniband/ulp/srp/ib_srp.h
+++ b/drivers/infiniband/ulp/srp/ib_srp.h
@@ -67,10 +67,9 @@ enum {
 
SRP_MAX_PAGES_PER_MR= 512,
 
-   LOCAL_INV_WR_ID_MASK= 1,
-   FAST_REG_WR_ID_MASK = 2,
-
-   SRP_LAST_WR_ID  = 0xfffcU,
+   LOCAL_INV_WR_ID = 1,
+   FAST_REG_WR_ID  = 2,
+   SRP_LAST_WR_ID  = 3,
 };
 
 enum srp_target_state {
-- 
2.1.4

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 4/7] IB/srp: Fix a potential queue overflow in an error path

2015-10-27 Thread Bart Van Assche
Wait until memory registration has finished in the srp_queuecommand()
error path before invalidating memory regions to avoid a send queue
overflow.

Signed-off-by: Bart Van Assche 
Cc: Sagi Grimberg 
Cc: Sebastian Parschauer 
---
 drivers/infiniband/ulp/srp/ib_srp.c | 41 ++---
 1 file changed, 34 insertions(+), 7 deletions(-)

diff --git a/drivers/infiniband/ulp/srp/ib_srp.c 
b/drivers/infiniband/ulp/srp/ib_srp.c
index 1aa9a4c..6d17fe2 100644
--- a/drivers/infiniband/ulp/srp/ib_srp.c
+++ b/drivers/infiniband/ulp/srp/ib_srp.c
@@ -1044,7 +1044,7 @@ static int srp_connect_ch(struct srp_rdma_ch *ch, bool 
multich)
}
 }
 
-static int srp_inv_rkey(struct srp_rdma_ch *ch, u32 rkey)
+static int srp_inv_rkey(struct srp_rdma_ch *ch, u32 rkey, u32 send_flags)
 {
struct ib_send_wr *bad_wr;
struct ib_send_wr wr = {
@@ -1052,16 +1052,32 @@ static int srp_inv_rkey(struct srp_rdma_ch *ch, u32 
rkey)
.wr_id  = LOCAL_INV_WR_ID,
.next   = NULL,
.num_sge= 0,
-   .send_flags = 0,
+   .send_flags = send_flags,
.ex.invalidate_rkey = rkey,
};
 
return ib_post_send(ch->qp, &wr, &bad_wr);
 }
 
+static bool srp_wait_until_done(struct srp_rdma_ch *ch, int i, long timeout)
+{
+   WARN_ON_ONCE(timeout <= 0);
+
+   for ( ; i > 0; i--) {
+   spin_lock_irq(&ch->lock);
+   srp_send_completion(ch->send_cq, ch);
+   spin_unlock_irq(&ch->lock);
+
+   if (wait_for_completion_timeout(&ch->done, timeout) > 0)
+   return true;
+   }
+   return false;
+}
+
 static void srp_unmap_data(struct scsi_cmnd *scmnd,
   struct srp_rdma_ch *ch,
-  struct srp_request *req)
+  struct srp_request *req,
+  bool wait_for_first_unmap)
 {
struct srp_target_port *target = ch->target;
struct srp_device *dev = target->srp_host->srp_dev;
@@ -1077,13 +1093,19 @@ static void srp_unmap_data(struct scsi_cmnd *scmnd,
struct srp_fr_desc **pfr;
 
for (i = req->nmdesc, pfr = req->fr_list; i > 0; i--, pfr++) {
-   res = srp_inv_rkey(ch, (*pfr)->mr->rkey);
+   res = srp_inv_rkey(ch, (*pfr)->mr->rkey,
+  wait_for_first_unmap ?
+  IB_SEND_SIGNALED : 0);
if (res < 0) {
shost_printk(KERN_ERR, target->scsi_host, PFX
  "Queueing INV WR for rkey %#x failed (%d)\n",
  (*pfr)->mr->rkey, res);
queue_work(system_long_wq,
   &target->tl_err_work);
+   } else if (wait_for_first_unmap) {
+   wait_for_first_unmap = false;
+   WARN_ON_ONCE(!srp_wait_until_done(ch, 10,
+   msecs_to_jiffies(100)));
}
}
if (req->nmdesc)
@@ -1144,7 +1166,7 @@ static void srp_free_req(struct srp_rdma_ch *ch, struct 
srp_request *req,
 {
unsigned long flags;
 
-   srp_unmap_data(scmnd, ch, req);
+   srp_unmap_data(scmnd, ch, req, false);
 
spin_lock_irqsave(&ch->lock, flags);
ch->req_lim += req_lim_delta;
@@ -1982,7 +2004,12 @@ static void srp_send_completion(struct ib_cq *cq, void 
*ch_ptr)
struct srp_iu *iu;
 
while (ib_poll_cq(cq, 1, &wc) > 0) {
-   if (likely(wc.status == IB_WC_SUCCESS)) {
+   if (unlikely(wc.wr_id == LOCAL_INV_WR_ID)) {
+   complete(&ch->done);
+   if (wc.status != IB_WC_SUCCESS)
+   srp_handle_qp_err(wc.wr_id, wc.status, true,
+ ch);
+   } else if (likely(wc.status == IB_WC_SUCCESS)) {
iu = (struct srp_iu *) (uintptr_t) wc.wr_id;
list_add(&iu->list, &ch->free_tx);
} else {
@@ -2084,7 +2111,7 @@ unlock_rport:
return ret;
 
 err_unmap:
-   srp_unmap_data(scmnd, ch, req);
+   srp_unmap_data(scmnd, ch, req, true);
 
 err_iu:
srp_put_tx_iu(ch, iu, SRP_IU_CMD);
-- 
2.1.4

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 2/2] IB/mlx5: Publish mlx5 driver support for extended create QP

2015-10-27 Thread Eli Cohen
Signed-off-by: Eli Cohen 
---
 drivers/infiniband/hw/mlx5/main.c |3 ++-
 1 files changed, 2 insertions(+), 1 deletions(-)

diff --git a/drivers/infiniband/hw/mlx5/main.c 
b/drivers/infiniband/hw/mlx5/main.c
index f1ccd40..634de84 100644
--- a/drivers/infiniband/hw/mlx5/main.c
+++ b/drivers/infiniband/hw/mlx5/main.c
@@ -1385,7 +1385,8 @@ static void *mlx5_ib_add(struct mlx5_core_dev *mdev)
(1ull << IB_USER_VERBS_CMD_CREATE_XSRQ) |
(1ull << IB_USER_VERBS_CMD_OPEN_QP);
dev->ib_dev.uverbs_ex_cmd_mask =
-   (1ull << IB_USER_VERBS_EX_CMD_QUERY_DEVICE);
+   (1ull << IB_USER_VERBS_EX_CMD_QUERY_DEVICE) |
+   (1ull << IB_USER_VERBS_EX_CMD_CREATE_QP);
 
dev->ib_dev.query_device= mlx5_ib_query_device;
dev->ib_dev.query_port  = mlx5_ib_query_port;
-- 
1.7.1

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 0/2] Loopback prevention support

2015-10-27 Thread Eli Cohen
Hi Doug,

This two-patch series adds support for loopback prevention in mlx5_ib for
userspace consumers.

Eli

*** Resending since I may have had some problem with my subscription, so the
*** patch set did not make it to the rdma list

Eli Cohen (2):
  IB/mlx5: Add debug print to signify if block multicast is used
  IB/mlx5: Publish mlx5 driver support for extended create QP

 drivers/infiniband/hw/mlx5/main.c |3 ++-
 drivers/infiniband/hw/mlx5/qp.c   |4 
 2 files changed, 6 insertions(+), 1 deletions(-)

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 1/2] IB/mlx5: Add debug print to signify if block multicast is used

2015-10-27 Thread Eli Cohen
Signed-off-by: Eli Cohen 
---
 drivers/infiniband/hw/mlx5/qp.c |4 
 1 files changed, 4 insertions(+), 0 deletions(-)

diff --git a/drivers/infiniband/hw/mlx5/qp.c b/drivers/infiniband/hw/mlx5/qp.c
index 6f521a3..b80b2bd 100644
--- a/drivers/infiniband/hw/mlx5/qp.c
+++ b/drivers/infiniband/hw/mlx5/qp.c
@@ -1033,6 +1033,10 @@ static int create_qp_common(struct mlx5_ib_dev *dev, 
struct ib_pd *pd,
 
qp->mqp.event = mlx5_ib_qp_event;
 
+   /* QP related debug prints go here */
+   if (qp->flags & MLX5_IB_QP_BLOCK_MULTICAST_LOOPBACK)
+   mlx5_ib_dbg(dev, "QP 0x%x will block multicast\n", qp->mqp.qpn);
+
return 0;
 
 err_create:
-- 
1.7.1

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 02/25] IB/mthca, net/mlx4: remove counting semaphores

2015-10-27 Thread Arnd Bergmann
The mthca and mlx4 device drivers use the same method
to switch between polling and event-driven command mode,
abusing two semaphores to create mutual exclusion between
a single polled command and multiple concurrent event-driven
commands.

Since we want to make counting semaphores go away, this
patch replaces the semaphore counting the event-driven
commands with an open-coded wait-queue, which should
be an equivalent transformation of the code, although
it does not make it any nicer.
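
Viewed in isolation, the mapping the diff below applies is the usual one for
open-coding a counting semaphore with a wait-queue and an atomic counter:

/* down(&cmd->event_sem): wait until a command slot can be claimed;
 * atomic_add_unless() only decrements (and returns true) when the
 * counter is not already zero.
 */
wait_event(cmd->event_wait,
           atomic_add_unless(&cmd->commands, -1, 0));

/* up(&cmd->event_sem): give the slot back and wake a blocked caller */
atomic_inc(&cmd->commands);
wake_up(&cmd->event_wait);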

As far as I can tell, there is a preexisting race condition
regarding the cmd->use_events flag, which is not protected
by any lock. When this flag is toggled while another command
is being started, that command gets stuck until the mode is
toggled back.

A better solution, one that would close the race condition and
improve the code's readability at the same time, would be to
introduce a new locking primitive that replaces both semaphores,
along these lines:
static int __mlx4_use_events(struct mlx4_cmd *cmd)
{
        int ret = -EAGAIN;

        spin_lock(&cmd->lock);
        if (cmd->use_events && cmd->commands < cmd->max_commands) {
                /* event mode: claim one of the available command slots */
                cmd->commands++;
                ret = 1;
        } else if (!cmd->use_events && cmd->commands == 0) {
                /* polling mode: only a single command may be outstanding */
                cmd->commands = 1;
                ret = 0;
        }
        spin_unlock(&cmd->lock);

        return ret;
}

static bool mlx4_use_events(struct mlx4_cmd *cmd)
{
        int ret;

        wait_event(cmd->events_wq, (ret = __mlx4_use_events(cmd)) >= 0);
        return ret;
}
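
The sketch above only covers claiming a slot; the matching release helper
(purely illustrative, reusing the same hypothetical cmd->lock, cmd->commands
and cmd->events_wq fields) would roughly be:

static void mlx4_unuse_events(struct mlx4_cmd *cmd)
{
        spin_lock(&cmd->lock);
        cmd->commands--;        /* free the slot taken in __mlx4_use_events() */
        spin_unlock(&cmd->lock);
        wake_up(&cmd->events_wq);
}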

Cc: Roland Dreier 
Cc: Eli Cohen 
Cc: Yevgeny Petrilin 
Cc: net...@vger.kernel.org
Cc: linux-rdma@vger.kernel.org
Signed-off-by: Arnd Bergmann 

Conflicts:

drivers/net/mlx4/cmd.c
drivers/net/mlx4/mlx4.h
---
 drivers/infiniband/hw/mthca/mthca_cmd.c   | 12 
 drivers/infiniband/hw/mthca/mthca_dev.h   |  3 ++-
 drivers/net/ethernet/mellanox/mlx4/cmd.c  | 12 
 drivers/net/ethernet/mellanox/mlx4/mlx4.h |  3 ++-
 4 files changed, 20 insertions(+), 10 deletions(-)

diff --git a/drivers/infiniband/hw/mthca/mthca_cmd.c b/drivers/infiniband/hw/mthca/mthca_cmd.c
index 9d3e5c1ac60e..aad1852e8e10 100644
--- a/drivers/infiniband/hw/mthca/mthca_cmd.c
+++ b/drivers/infiniband/hw/mthca/mthca_cmd.c
@@ -417,7 +417,8 @@ static int mthca_cmd_wait(struct mthca_dev *dev,
int err = 0;
struct mthca_cmd_context *context;
 
-   down(&dev->cmd.event_sem);
+   wait_event(dev->cmd.event_wait,
+  atomic_add_unless(&dev->cmd.commands, -1, 0));
 
spin_lock(&dev->cmd.context_lock);
BUG_ON(dev->cmd.free_head < 0);
@@ -459,7 +460,8 @@ out:
dev->cmd.free_head = context - dev->cmd.context;
spin_unlock(&dev->cmd.context_lock);
 
-   up(&dev->cmd.event_sem);
+   atomic_inc(&dev->cmd.commands);
+   wake_up(&dev->cmd.event_wait);
return err;
 }
 
@@ -571,7 +573,8 @@ int mthca_cmd_use_events(struct mthca_dev *dev)
dev->cmd.context[dev->cmd.max_cmds - 1].next = -1;
dev->cmd.free_head = 0;
 
-   sema_init(&dev->cmd.event_sem, dev->cmd.max_cmds);
+   init_waitqueue_head(&dev->cmd.event_wait);
+   atomic_set(&dev->cmd.commands, dev->cmd.max_cmds);
spin_lock_init(&dev->cmd.context_lock);
 
for (dev->cmd.token_mask = 1;
@@ -597,7 +600,8 @@ void mthca_cmd_use_polling(struct mthca_dev *dev)
dev->cmd.flags &= ~MTHCA_CMD_USE_EVENTS;
 
for (i = 0; i < dev->cmd.max_cmds; ++i)
-   down(&dev->cmd.event_sem);
+   wait_event(dev->cmd.event_wait,
+  atomic_add_unless(&dev->cmd.commands, -1, 0));
 
kfree(dev->cmd.context);
 
diff --git a/drivers/infiniband/hw/mthca/mthca_dev.h b/drivers/infiniband/hw/mthca/mthca_dev.h
index 7e6a6d64ad4e..3055f5c12ac8 100644
--- a/drivers/infiniband/hw/mthca/mthca_dev.h
+++ b/drivers/infiniband/hw/mthca/mthca_dev.h
@@ -121,7 +121,8 @@ struct mthca_cmd {
struct pci_pool  *pool;
struct mutex  hcr_mutex;
struct semaphore  poll_sem;
-   struct semaphore  event_sem;
+   wait_queue_head_t event_wait;
+   atomic_t  commands;
int   max_cmds;
spinlock_tcontext_lock;
int   free_head;
diff --git a/drivers/net/ethernet/mellanox/mlx4/cmd.c b/drivers/net/ethernet/mellanox/mlx4/cmd.c
index 78f5a1a0b8c8..60134a4245ef 100644
--- a/drivers/net/ethernet/mellanox/mlx4/cmd.c
+++ b/drivers/net/ethernet/mellanox/mlx4/cmd.c
@@ -273,7 +273,8 @@ static int mlx4_cmd_wait(struct mlx4_dev *dev, u64 in_param, u64 *out_param,
struct mlx4_cmd_context *context;
int err = 0;
 
-   down(&cmd->event_sem);
+   wait_event(cmd->event_wait,
+  atomic_add_unless(&cmd->commands, -1, 0));
 
spin_lock(&cmd->context_lock);
BUG_ON(cmd->free_head < 0);
@@ -305,7 +306,8 @@ out:
cmd->free_head = context - cmd->context;

RE: [PATCH] IB/sa: replace GFP_KERNEL with GFP_ATOMIC

2015-10-27 Thread Weiny, Ira
> 
> On Tue, Oct 27, 2015 at 06:56:50PM +, Wan, Kaike wrote:
> 
> > > I do wonder if it is a good idea to call ib_nl_send_msg with a
> > > spinlock held though.. Would be nice to see that go away.
> >
> > We have to hold the lock to protect against a race condition that a
> > quick response will try to free the request from the
> > ib_nl_request_list before we even put it on the list.
> 
> Put is on the list first? Use a kref? Doesn't look like a big deal to clean 
> this up.
> 
> Jason

Not sure what "Put is on the list first?" means.  I think it is valid to build
the request, and if that succeeds, add it to the list and then send it.  That
would solve the problem you mention above.  Was that what you had in mind,
Jason?  A rough sketch of that ordering is below.
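
To make that concrete, a rough sketch of the ordering (the kref, the lock name,
the release callback and the error path are illustrative assumptions, not the
current ib_sa code):

/* Illustrative sketch only: hypothetical helper, hypothetical query->ref
 * kref, query->list member, ib_nl_request_lock and ib_nl_release_request();
 * not the existing ib_sa implementation.
 */
static int ib_nl_submit_request(struct ib_sa_query *query, gfp_t gfp_mask)
{
        unsigned long flags;
        int ret;

        kref_get(&query->ref);          /* dropped by the response handler */

        /* publish the request before sending, so a fast response can
         * already find it on ib_nl_request_list
         */
        spin_lock_irqsave(&ib_nl_request_lock, flags);
        list_add_tail(&query->list, &ib_nl_request_list);
        spin_unlock_irqrestore(&ib_nl_request_lock, flags);

        /* the actual send no longer needs to happen under the spinlock */
        ret = ib_nl_send_msg(query, gfp_mask); /* signature simplified for the sketch */
        if (ret < 0) {
                /* unwind if the send failed and nobody consumed the entry */
                spin_lock_irqsave(&ib_nl_request_lock, flags);
                if (!list_empty(&query->list))
                        list_del_init(&query->list);
                spin_unlock_irqrestore(&ib_nl_request_lock, flags);
                kref_put(&query->ref, ib_nl_release_request);
        }
        return ret;
}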

I don't have time to work on this right now, and I'm not sure about Kaike.
Until we can remove the spinlock, the current proposed patch should be applied
in the interim.  Sorry for the noise before.

Reviewed-By: Ira Weiny 
