Re: [PATCH RFC 2/2] IB/mlx5: Implement Fast Indirect Memory Registration Feature

2014-10-13 Thread Bart Van Assche

On 10/07/14 16:48, Sagi Grimberg wrote:

-static __be64 *mr_align(__be64 *ptr, int align)
+static void *mr_align(void *ptr, int align)
  {
unsigned long mask = align - 1;

-   return (__be64 *)(((unsigned long)ptr + mask) & ~mask);
+   return (void *)(((unsigned long)ptr + mask) & ~mask);
  }


Maybe I'm missing something, but why doesn't this function use the 
ALIGN() macro defined in <linux/kernel.h> ? Are you aware that a type 
with the name uintptr_t has been defined in <linux/types.h> that is 
intended for casting a pointer to an int ?


> +struct ib_indir_reg_list *
> +mlx5_ib_alloc_indir_reg_list(struct ib_device *device,
> +   unsigned int max_indir_list_len)

There are three k[zc]alloc() calls in this function. Can these be 
combined into one without increasing the chance too much that this will 
increase the order of the allocation ?


> +  dsize = sizeof(*mirl->klms) * max_indir_list_len;
> +  mirl->mapped_ilist = kzalloc(dsize + MLX5_UMR_ALIGN - 1,
> +GFP_KERNEL);

The value of the constant MLX5_UMR_ALIGN is 2048. Does that mean this 
code will always allocate 2 KB more than needed ? Is this really the 
minimum alignment required for this data structure ? How likely is it 
that this code will trigger a higher order allocation ?


> +  if (unlikely((*seg == qp->sq.qend)))

A minor comment: one pair of parentheses can be left out from the above 
code.


> +  mlx5_ib_warn(dev, "\n");

Will the warning generated by this code allow a user to find out which 
code triggered that warning ?


Bart.

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH RFC 1/2] IB/core: Introduce Fast Indirect Memory Registration verbs API

2014-10-13 Thread Bart Van Assche

On 10/07/14 16:48, Sagi Grimberg wrote:

In order to support that we provide the user with an interface
to pass a scattered list of buffers to the IB core layer called
ib_indir_reg_list and provide a new send work request opcode
called IB_WR_REG_INDIR_MR. We extend wr union with a new type of
memory registration called indir_reg where the user can place the
relevant information to perform such a memory registration.

The verbs user is expected to perform these steps:
0. Make sure that the device supports Indirect memory registration via
ib_device_cap_flag IB_DEVICE_INDIR_REGISTRATION and make sure
that ib_device_attr max_indir_reg_mr_list_len suffices for the
expected scatterlist length

1. Allocate a memory region with IB_MR_INDIRECT_REG creation flag
This is done via ib_create_mr() with mr_init_attr.flags = IB_MR_INDIRECT_REG

2. Allocate an ib_indir_reg_list structure to hold the scattered buffers
pointers. This is done via new ib_alloc_indir_reg_list() verb

3. Populate the scattered buffers in ib_indir_reg_list.sg_list

4. Post a work request with a new opcode IB_WR_REG_INDIR_MR and
provide the populated ib_indir_reg_list

5. Perform data transfer

6. Get completion of kind IB_WC_REG_INDIR_MR (if requested)

7. Free indirect MR and ib_indir_reg_list via
ib_destroy_mr() and ib_free_indir_reg_list()
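Strung together, the seven steps read roughly as follows. This is a non-compilable sketch: the ib_* indirect-registration verbs exist only in this RFC, error handling is elided, and the exact ib_mr_init_attr contents beyond the flags field are assumed.

```c
/* sketch only -- proposed API, not mainline */
struct ib_mr_init_attr mr_attr = { .flags = IB_MR_INDIRECT_REG };
struct ib_mr *mr = ib_create_mr(pd, &mr_attr);			/* step 1 */
struct ib_indir_reg_list *irl =
	ib_alloc_indir_reg_list(device, sg_len);		/* step 2 */

/* step 3: fill irl->sg_list[0..sg_len-1] with the buffer addresses */

struct ib_send_wr wr = { .opcode = IB_WR_REG_INDIR_MR };	/* step 4 */
wr.wr.indir_reg.indir_list = irl;
wr.wr.indir_reg.indir_list_len = sg_len;
wr.wr.indir_reg.mkey = mr->rkey;
ib_post_send(qp, &wr, &bad_wr);

/* step 5: perform the data transfer */
/* step 6: poll the CQ for an IB_WC_REG_INDIR_MR completion */

ib_free_indir_reg_list(irl);					/* step 7 */
ib_destroy_mr(mr);
```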


Hello Sagi,

Is there documentation available somewhere about the order in which an 
HCA must execute an indirect memory registration request relative to 
other work requests, similar to the "Work Request Operation Ordering" 
table in the InfiniBand specification ? I think such documentation is 
needed to ensure consistent behavior across HCA models.


Bart.



Re: [PATCH 1/1] IB/iser: Remove hard coded values for cqe and send_wr

2014-10-13 Thread Jayamohan Kallickal

>On Tue, Oct 7, 2014 at 10:58 PM, Sagi Grimberg wrote:
>>On 10/8/2014 3:41 AM, Jay Kallickal wrote:
>>From: Jayamohan Kallickal 

>>  This patch allows the underlying hardware to choose
>>values other than  hard coded max values for cqe and send_wr
>>while preventing them from exceeding max supported values.

>Hi Minh and Jayamohan,

>So I agree that we would want to take device capabilities into account
>here, but we need to be able to adjust scsi_cmds_max (can_queue) in case
>the max wqe supported is lower than scsi_cmds_max * num_posts_per_cmd.


I feel we should be fine as long as we support the max_cmds of 128 supported by 
the open-iscsi layer.

I see the iser layer uses a value of 512 though it only serves as a limit 
check. 


>So generally I agree with this approach, but we need to take care of
>stuff later when the session is created.

>One more thing, this is not rebased on the latest iser patches please
>send v1 on top of: http://marc.info/?l=linux-rdma&m=141216135013146&w=2

Yes, I will recreate the patch and send on top of this


Resending:
 Looks like my response got dropped by the filters.

Am travelling. Minh is working on the patches and will respond.
-Jay


[PATCH 3/3] RDMA/cxgb4: Report the actual address of the remote connecting peer

2014-10-13 Thread Tatyana Nikolova
From: Steve Wise 

When iWARP port mapping is being done, the passive side of a connection
only knows the mapped address/port of the peer.  So now query the IWPM
to get the actual address/port of the peer.

Also setup the passive side endpoint to correctly display the actual
and mapped addresses for the new connection.

Signed-off-by: Steve Wise 
---

 drivers/infiniband/hw/cxgb4/cm.c |   54 +++---
 drivers/infiniband/hw/cxgb4/device.c |1 +
 2 files changed, 51 insertions(+), 4 deletions(-)


diff --git a/drivers/infiniband/hw/cxgb4/cm.c b/drivers/infiniband/hw/cxgb4/cm.c
index c2fb71c..f6c7b32 100644
--- a/drivers/infiniband/hw/cxgb4/cm.c
+++ b/drivers/infiniband/hw/cxgb4/cm.c
@@ -580,6 +580,22 @@ static void c4iw_record_pm_msg(struct c4iw_ep *ep,
sizeof(ep->com.mapped_remote_addr));
 }
 
+static int get_remote_addr(struct c4iw_ep *ep)
+{
+   int ret;
+
+   print_addr(&ep->com, __func__, "get_remote_addr");
+
+   ret = iwpm_get_remote_info(&ep->com.mapped_local_addr,
+  &ep->com.mapped_remote_addr,
+  &ep->com.remote_addr, RDMA_NL_C4IW);
+   if (ret)
+   pr_info(MOD "Unable to find remote peer addr info - err %d\n",
+   ret);
+
+   return ret;
+}
+
 static void best_mtu(const unsigned short *mtus, unsigned short mtu,
 unsigned int *idx, int use_ts)
 {
@@ -2343,27 +2359,57 @@ static int pass_accept_req(struct c4iw_dev *dev, struct sk_buff *skb)
state_set(&child_ep->com, CONNECTING);
child_ep->com.dev = dev;
child_ep->com.cm_id = NULL;
+
+   /*
+* The mapped_local and mapped_remote addresses get setup with
+* the actual 4-tuple.  The local address will be based on the
+* actual local address of the connection, but on the port number
+* of the parent listening endpoint.  The remote address is
+* setup based on a query to the IWPM since we don't know what it
+* originally was before mapping.  If no mapping was done, then
+* mapped_remote == remote, and mapped_local == local.
+*/
if (iptype == 4) {
struct sockaddr_in *sin = (struct sockaddr_in *)
-   &child_ep->com.local_addr;
+   &child_ep->com.mapped_local_addr;
+
sin->sin_family = PF_INET;
sin->sin_port = local_port;
sin->sin_addr.s_addr = *(__be32 *)local_ip;
-   sin = (struct sockaddr_in *)&child_ep->com.remote_addr;
+
+   sin = (struct sockaddr_in *)&child_ep->com.local_addr;
+   sin->sin_family = PF_INET;
+   sin->sin_port = ((struct sockaddr_in *)
+&parent_ep->com.local_addr)->sin_port;
+   sin->sin_addr.s_addr = *(__be32 *)local_ip;
+
+   sin = (struct sockaddr_in *)&child_ep->com.mapped_remote_addr;
sin->sin_family = PF_INET;
sin->sin_port = peer_port;
sin->sin_addr.s_addr = *(__be32 *)peer_ip;
} else {
struct sockaddr_in6 *sin6 = (struct sockaddr_in6 *)
-   &child_ep->com.local_addr;
+   &child_ep->com.mapped_local_addr;
+
sin6->sin6_family = PF_INET6;
sin6->sin6_port = local_port;
memcpy(sin6->sin6_addr.s6_addr, local_ip, 16);
-   sin6 = (struct sockaddr_in6 *)&child_ep->com.remote_addr;
+
+   sin6 = (struct sockaddr_in6 *)&child_ep->com.local_addr;
+   sin6->sin6_family = PF_INET6;
+   sin6->sin6_port = ((struct sockaddr_in6 *)
+  &parent_ep->com.local_addr)->sin6_port;
+   memcpy(sin6->sin6_addr.s6_addr, local_ip, 16);
+
+   sin6 = (struct sockaddr_in6 *)&child_ep->com.mapped_remote_addr;
sin6->sin6_family = PF_INET6;
sin6->sin6_port = peer_port;
memcpy(sin6->sin6_addr.s6_addr, peer_ip, 16);
}
+   memcpy(&child_ep->com.remote_addr, &child_ep->com.mapped_remote_addr,
+  sizeof(child_ep->com.remote_addr));
+   get_remote_addr(child_ep);
+
c4iw_get_ep(&parent_ep->com);
child_ep->parent_ep = parent_ep;
child_ep->tos = GET_POPEN_TOS(ntohl(req->tos_stid));
diff --git a/drivers/infiniband/hw/cxgb4/device.c b/drivers/infiniband/hw/cxgb4/device.c
index 72f1f05..4c0f238 100644
--- a/drivers/infiniband/hw/cxgb4/device.c
+++ b/drivers/infiniband/hw/cxgb4/device.c
@@ -93,6 +93,7 @@ static struct ibnl_client_cbs c4iw_nl_cb_table[] = {
[RDMA_NL_IWPM_ADD_MAPPING] = {.dump = iwpm_add_mapping_cb},
[RDMA_NL_IWPM_QUERY_MAPPING] = {.dump = iwpm_add_and_query_mapping_cb},
[RDMA_NL_IWPM_HANDLE_ERR] = {.dump = iwpm_mapping_error_cb},
+   [RDMA_NL_IWPM_REMOTE_INFO] = {.dump = iwpm_remote_info_cb},

[PATCH 2/3] RDMA/nes: Report the actual address of the remote connecting peer

2014-10-13 Thread Tatyana Nikolova
Get the actual (non-mapped) ip/tcp address of the remote connecting peer
from the port mapper and report the address info to the user space application
at the time of connection establishment

Signed-off-by: Tatyana Nikolova 
---
 drivers/infiniband/hw/nes/nes.c|1 +
 drivers/infiniband/hw/nes/nes_cm.c |   65 ++-
 2 files changed, 49 insertions(+), 17 deletions(-)

diff --git a/drivers/infiniband/hw/nes/nes.c b/drivers/infiniband/hw/nes/nes.c
index 3b2a6dc..9f9d5c5 100644
--- a/drivers/infiniband/hw/nes/nes.c
+++ b/drivers/infiniband/hw/nes/nes.c
@@ -116,6 +116,7 @@ static struct ibnl_client_cbs nes_nl_cb_table[] = {
[RDMA_NL_IWPM_REG_PID] = {.dump = iwpm_register_pid_cb},
[RDMA_NL_IWPM_ADD_MAPPING] = {.dump = iwpm_add_mapping_cb},
[RDMA_NL_IWPM_QUERY_MAPPING] = {.dump = iwpm_add_and_query_mapping_cb},
+   [RDMA_NL_IWPM_REMOTE_INFO] = {.dump = iwpm_remote_info_cb},
[RDMA_NL_IWPM_HANDLE_ERR] = {.dump = iwpm_mapping_error_cb},
[RDMA_NL_IWPM_MAPINFO] = {.dump = iwpm_mapping_info_cb},
[RDMA_NL_IWPM_MAPINFO_NUM] = {.dump = iwpm_ack_mapping_info_cb}
diff --git a/drivers/infiniband/hw/nes/nes_cm.c b/drivers/infiniband/hw/nes/nes_cm.c
index 6f09a72..785c2fa 100644
--- a/drivers/infiniband/hw/nes/nes_cm.c
+++ b/drivers/infiniband/hw/nes/nes_cm.c
@@ -596,27 +596,52 @@ static void nes_form_reg_msg(struct nes_vnic *nesvnic,
memcpy(pm_msg->if_name, nesvnic->netdev->name, IWPM_IFNAME_SIZE);
 }
 
+static void record_sockaddr_info(struct sockaddr_storage *addr_info,
+   nes_addr_t *ip_addr, u16 *port_num)
+{
+   struct sockaddr_in *in_addr = (struct sockaddr_in *)addr_info;
+
+   if (in_addr->sin_family == AF_INET) {
+   *ip_addr = ntohl(in_addr->sin_addr.s_addr);
+   *port_num = ntohs(in_addr->sin_port);
+   }
+}
+
 /*
  * nes_record_pm_msg - Save the received mapping info
  */
 static void nes_record_pm_msg(struct nes_cm_info *cm_info,
struct iwpm_sa_data *pm_msg)
 {
-   struct sockaddr_in *mapped_loc_addr =
-   (struct sockaddr_in *)&pm_msg->mapped_loc_addr;
-   struct sockaddr_in *mapped_rem_addr =
-   (struct sockaddr_in *)&pm_msg->mapped_rem_addr;
-
-   if (mapped_loc_addr->sin_family == AF_INET) {
-   cm_info->mapped_loc_addr =
-   ntohl(mapped_loc_addr->sin_addr.s_addr);
-   cm_info->mapped_loc_port = ntohs(mapped_loc_addr->sin_port);
-   }
-   if (mapped_rem_addr->sin_family == AF_INET) {
-   cm_info->mapped_rem_addr =
-   ntohl(mapped_rem_addr->sin_addr.s_addr);
-   cm_info->mapped_rem_port = ntohs(mapped_rem_addr->sin_port);
-   }
+   record_sockaddr_info(&pm_msg->mapped_loc_addr,
+   &cm_info->mapped_loc_addr, &cm_info->mapped_loc_port);
+
+   record_sockaddr_info(&pm_msg->mapped_rem_addr,
+   &cm_info->mapped_rem_addr, &cm_info->mapped_rem_port);
+}
+
+/*
+ * nes_get_reminfo - Get the address info of the remote connecting peer
+ */
+static int nes_get_remote_addr(struct nes_cm_node *cm_node)
+{
+   struct sockaddr_storage mapped_loc_addr, mapped_rem_addr;
+   struct sockaddr_storage remote_addr;
+   int ret;
+
+   nes_create_sockaddr(htonl(cm_node->mapped_loc_addr),
+   htons(cm_node->mapped_loc_port), &mapped_loc_addr);
+   nes_create_sockaddr(htonl(cm_node->mapped_rem_addr),
+   htons(cm_node->mapped_rem_port), &mapped_rem_addr);
+
+   ret = iwpm_get_remote_info(&mapped_loc_addr, &mapped_rem_addr,
+   &remote_addr, RDMA_NL_NES);
+   if (ret)
+   nes_debug(NES_DBG_CM, "Unable to find remote peer address info\n");
+   else
+   record_sockaddr_info(&remote_addr, &cm_node->rem_addr,
+   &cm_node->rem_port);
+   return ret;
 }
 
 /**
@@ -1566,9 +1591,14 @@ static struct nes_cm_node *make_cm_node(struct nes_cm_core *cm_core,
return NULL;
 
/* set our node specific transport info */
-   cm_node->loc_addr = cm_info->loc_addr;
+   if (listener) {
+   cm_node->loc_addr = listener->loc_addr;
+   cm_node->loc_port = listener->loc_port;
+   } else {
+   cm_node->loc_addr = cm_info->loc_addr;
+   cm_node->loc_port = cm_info->loc_port;
+   }
cm_node->rem_addr = cm_info->rem_addr;
-   cm_node->loc_port = cm_info->loc_port;
cm_node->rem_port = cm_info->rem_port;
 
cm_node->mapped_loc_addr = cm_info->mapped_loc_addr;
@@ -2151,6 +2181,7 @@ static int handle_ack_pkt(struct nes_cm_node *cm_node, struct sk_buff *skb,
cm_node->state = NES_CM_STATE_ESTABLISHED;
if (datasize) {
cm_node->tcp_cntxt.rcv_nxt = inc_sequence + datasize;
+  

[PATCH 1/3] RDMA/core: Enable the iWarp Port Mapper to provide the actual address of the remote connecting peer

2014-10-13 Thread Tatyana Nikolova
Add functionality to allow the port mapper to provide to its client 
the actual (non-mapped) ip/tcp address information of the remote connecting peer

Signed-off-by: Tatyana Nikolova 
Reviewed-by: Steve Wise 
---
 drivers/infiniband/core/iwpm_msg.c  |   73 -
 drivers/infiniband/core/iwpm_util.c |  208 +--
 drivers/infiniband/core/iwpm_util.h |   15 +++
 include/rdma/iw_portmap.h   |   25 
 include/uapi/rdma/rdma_netlink.h|1 +
 5 files changed, 288 insertions(+), 34 deletions(-)

diff --git a/drivers/infiniband/core/iwpm_msg.c b/drivers/infiniband/core/iwpm_msg.c
index b85ddbc..ab08170 100644
--- a/drivers/infiniband/core/iwpm_msg.c
+++ b/drivers/infiniband/core/iwpm_msg.c
@@ -468,7 +468,8 @@ add_mapping_response_exit:
 }
 EXPORT_SYMBOL(iwpm_add_mapping_cb);
 
-/* netlink attribute policy for the response to add and query mapping request */
+/* netlink attribute policy for the response to add and query mapping request
+ * and response with remote address info */
static const struct nla_policy resp_query_policy[IWPM_NLA_RQUERY_MAPPING_MAX] = {
[IWPM_NLA_QUERY_MAPPING_SEQ]  = { .type = NLA_U32 },
	[IWPM_NLA_QUERY_LOCAL_ADDR]   = { .len = sizeof(struct sockaddr_storage) },
@@ -559,6 +560,76 @@ query_mapping_response_exit:
 }
 EXPORT_SYMBOL(iwpm_add_and_query_mapping_cb);
 
+/*
+ * iwpm_remote_info_cb - Process a port mapper message, containing
+ *   the remote connecting peer address info
+ */
+int iwpm_remote_info_cb(struct sk_buff *skb, struct netlink_callback *cb)
+{
+   struct nlattr *nltb[IWPM_NLA_RQUERY_MAPPING_MAX];
+   struct sockaddr_storage *local_sockaddr, *remote_sockaddr;
+   struct sockaddr_storage *mapped_loc_sockaddr, *mapped_rem_sockaddr;
+   struct iwpm_remote_info *rem_info;
+   const char *msg_type;
+   u8 nl_client;
+   int ret = -EINVAL;
+
+   msg_type = "Remote Mapping info";
+   if (iwpm_parse_nlmsg(cb, IWPM_NLA_RQUERY_MAPPING_MAX,
+   resp_query_policy, nltb, msg_type))
+   return ret;
+
+   nl_client = RDMA_NL_GET_CLIENT(cb->nlh->nlmsg_type);
+   if (!iwpm_valid_client(nl_client)) {
+   pr_info("%s: Invalid port mapper client = %d\n",
+   __func__, nl_client);
+   return ret;
+   }
+   atomic_set(&echo_nlmsg_seq, cb->nlh->nlmsg_seq);
+
+   local_sockaddr = (struct sockaddr_storage *)
+   nla_data(nltb[IWPM_NLA_QUERY_LOCAL_ADDR]);
+   remote_sockaddr = (struct sockaddr_storage *)
+   nla_data(nltb[IWPM_NLA_QUERY_REMOTE_ADDR]);
+   mapped_loc_sockaddr = (struct sockaddr_storage *)
+   nla_data(nltb[IWPM_NLA_RQUERY_MAPPED_LOC_ADDR]);
+   mapped_rem_sockaddr = (struct sockaddr_storage *)
+   nla_data(nltb[IWPM_NLA_RQUERY_MAPPED_REM_ADDR]);
+
+   if (mapped_loc_sockaddr->ss_family != local_sockaddr->ss_family ||
+   mapped_rem_sockaddr->ss_family != remote_sockaddr->ss_family) {
+   pr_info("%s: Sockaddr family doesn't match the requested one\n",
+   __func__);
+   return ret;
+   }
+   rem_info = kzalloc(sizeof(struct iwpm_remote_info), GFP_ATOMIC);
+   if (!rem_info) {
+   pr_err("%s: Unable to allocate a remote info\n", __func__);
+   ret = -ENOMEM;
+   return ret;
+   }
+   memcpy(&rem_info->mapped_loc_sockaddr, mapped_loc_sockaddr,
+  sizeof(struct sockaddr_storage));
+   memcpy(&rem_info->remote_sockaddr, remote_sockaddr,
+  sizeof(struct sockaddr_storage));
+   memcpy(&rem_info->mapped_rem_sockaddr, mapped_rem_sockaddr,
+  sizeof(struct sockaddr_storage));
+   rem_info->nl_client = nl_client;
+
+   iwpm_add_remote_info(rem_info);
+
+   iwpm_print_sockaddr(local_sockaddr,
+   "remote_info: Local sockaddr:");
+   iwpm_print_sockaddr(mapped_loc_sockaddr,
+   "remote_info: Mapped local sockaddr:");
+   iwpm_print_sockaddr(remote_sockaddr,
+   "remote_info: Remote sockaddr:");
+   iwpm_print_sockaddr(mapped_rem_sockaddr,
+   "remote_info: Mapped remote sockaddr:");
+   return ret;
+}
+EXPORT_SYMBOL(iwpm_remote_info_cb);
+
 /* netlink attribute policy for the received request for mapping info */
static const struct nla_policy resp_mapinfo_policy[IWPM_NLA_MAPINFO_REQ_MAX] = {
[IWPM_NLA_MAPINFO_ULIB_NAME] = { .type = NLA_STRING,
diff --git a/drivers/infiniband/core/iwpm_util.c b/drivers/infiniband/core/iwpm_util.c
index 69e9f84..d1b24d2 100644
--- a/drivers/infiniband/core/iwpm_util.c
+++ b/drivers/infiniband/core/iwpm_util.c
@@ -33,8 +33,10 @@
 
 #include "iwpm_util.h"
 
-#define IWPM_HASH_BUCKET_SIZE  512
-#define IWPM_HASH_BUCKET_MASK  (IWPM_HASH_BUCKET_SIZE - 

Re: [PATCH RFC 0/2] Indirect Fast Memory registration support

2014-10-13 Thread Steve Wise

On 10/13/2014 6:18 AM, Sagi Grimberg wrote:

On 10/13/2014 11:48 AM, Bart Van Assche wrote:

On 10/12/14 21:43, Or Gerlitz wrote:

Sean, Bart - any comment on the API before Sagi sits down to code the
iSER changes?


Hello Or,


Hey Bart,



Will this API ever be supported by other Mellanox RDMA drivers than
mlx5, e.g. mlx4 ?


Well, unfortunately mlx4 micro-architecture does not support
indirect memory registration. IMHO, simulating it in SW/FW is
not feasible.


Will this API ever be supported by a Mellanox HCA that
supports RoCE ?


Yes, mlx5 will support RoCE in the future. Moreover AFAIK future
Mellanox HCAs will keep supporting this technology as it is useful
for other use-cases as well.


Additionally, do you perhaps know whether any other RDMA
vendors have plans to implement this API for their RDMA HCA's ?


I can't comment on other vendors plans...
Maybe once this feature is included they will consider it... ;)



cxgb4 currently does not support this.  I'm not sure if it could be done 
given the hw architecture.


Steve.


RE: [PATCH RFC 1/2] IB/core: Introduce Fast Indirect Memory Registration verbs API

2014-10-13 Thread Steve Wise
> On 10/8/2014 4:54 PM, Steve Wise wrote:
>  @@ -1056,6 +1067,14 @@ struct ib_send_wr {
> 			int			access_flags;
> 			struct ib_sge		*prot;
> 		} sig_handover;
>  +		struct {
>  +			u64				iova_start;
>  +			struct ib_indir_reg_list	*indir_list;
>  +			unsigned int			indir_list_len;
>  +			u64				length;
>  +			unsigned int			access_flags;
>  +			u32				mkey;
>  +		} indir_reg;
> >>>
> >>> What is mkey?  Shouldn't this be an rkey?
> >>
> >> mkey means memory key. I can change it to rkey if that
> >> is clearer.
> >
> > Is it valid to use an lkey here?  Or is an rkey required?  If an rkey is 
> > required, then I'd say it is clearer to name it rkey
(and
> > that aligns with the fastreg struct).
> >
> 
> It is valid. The memory key depends on the use case.
> In case a client want to send an rkey to a peer, it will register using
> rkey. In case a server wants to transfer data from it's local buffer
> it will register using lkey.
> 
> So I didn't impose a specific key here - this is why I chose mkey.
> 
> I can modify it to rkey to mimic the well known fastreg, but its not
> a must.
> 

If both local-only and local/remote are valid, then I agree mkey is good.   I 
was thinking an application wouldn't need this API for
local-only registrations; it could just use the local dma lkey and a bus 
address in the sge.   But perhaps the indirect fastreg
allows a deeper sgl than is supported by providers via the SEND/READ/WRITE/RECV 
work requests...

Steve.




Re: [PATCH RFC 2/2] IB/mlx5: Implement Fast Indirect Memory Registration Feature

2014-10-13 Thread Or Gerlitz


On 10/13/2014 11:32 AM, Sagi Grimberg wrote:

On 10/12/2014 10:39 PM, Or Gerlitz wrote:


On 10/7/2014 4:48 PM, Sagi Grimberg wrote:


* Nit change in mr_align() static routine to handle void*
instead of __be64.


nit comment... any reason not to put in different and unrelated to this
series patch?


I can, But there is no use for this out of this patch context.

if you really want it to be this way, let it be


Re: [PATCH RFC 2/2] IB/mlx5: Implement Fast Indirect Memory Registration Feature

2014-10-13 Thread Or Gerlitz


On 10/13/2014 11:32 AM, Sagi Grimberg wrote:

mask = MLX5_MKEY_MASK_LEN|
+MLX5_MKEY_MASK_PAGE_SIZE|
+MLX5_MKEY_MASK_START_ADDR|
+MLX5_MKEY_MASK_EN_RINVAL|
+MLX5_MKEY_MASK_KEY|
+MLX5_MKEY_MASK_LR|
+MLX5_MKEY_MASK_LW|
+MLX5_MKEY_MASK_RR|
+MLX5_MKEY_MASK_RW|
+MLX5_MKEY_MASK_A|
+MLX5_MKEY_MASK_FREE; 
At least let's not repeat this mask = MLX5_MKEY_MASK_YYY | [...] exactly 
twice. BTW, I think Eli is OOO this week, and I'm not sure we want to stop 
this train just because of this small suggestion I made here






Re: [PATCH RFC 0/2] Indirect Fast Memory registration support

2014-10-13 Thread Sagi Grimberg

On 10/13/2014 11:48 AM, Bart Van Assche wrote:

On 10/12/14 21:43, Or Gerlitz wrote:

Sean, Bart - any comment on the API before Sagi sits down to code the
iSER changes?


Hello Or,


Hey Bart,



Will this API ever be supported by other Mellanox RDMA drivers than
mlx5, e.g. mlx4 ?


Well, unfortunately mlx4 micro-architecture does not support
indirect memory registration. IMHO, simulating it in SW/FW is
not feasible.


Will this API ever be supported by a Mellanox HCA that
supports RoCE ?


Yes, mlx5 will support RoCE in the future. Moreover AFAIK future
Mellanox HCAs will keep supporting this technology as it is useful
for other use-cases as well.


Additionally, do you perhaps know whether any other RDMA
vendors have plans to implement this API for their RDMA HCA's ?


I can't comment on other vendors plans...
Maybe once this feature is included they will consider it... ;)


Drivers like the iSER and SRP initiator and target drivers have been implemented
on top of the RDMA API and support all hardware that implements the RDMA
API. Since this patch series adds support for the indirect fast memory
registration API for mlx5 only, this means that adding support for this 
API in block storage drivers also means adding a fall-back path for RDMA
drivers that do not have support for this API. That will make these
block storage drivers harder to maintain, which is something I would
like to avoid.


I agree with you on that,

But state-of-the-art devices can introduce new functionality which
sometimes isn't supported in older/other devices. So while keeping
a fall-back logic is a pain, we should strive to adopt new features.

As you know, the problem this feature is solving is even more painful
in iSER than in SRP (only a single rkey per-cmd is allowed) and while
the BB memcopy work-around will not be removed from the code completely,
at least it can hide under some abstraction and be avoided when this
feature is supported.

Cheers,
Sagi.


Re: [PATCH V1 for-next 1/9] IB/core: Introduce peer client interface

2014-10-13 Thread Bart Van Assche

On 10/12/14 14:03, Shachar Raindel wrote:

The kernel-doc output for the above comment block is incomplete. Please
fix this, and please also fix the warnings reported by the kernel-doc


We are not sure what is the best way to document an operations struct
using nanodocs.
Looking at mmu_notifiers, they are documented inline, using a
comment style that nanodoc can't parse
(http://lxr.free-electrons.com/source/include/linux/mmu_notifier.h#L27 )
Looking at the file_operations struct, there is no documentation
whatsoever
(http://lxr.free-electrons.com/source/include/linux/fs.h#L1482 )
Looking at ib_device
(http://lxr.free-electrons.com/source/include/rdma/ib_verbs.h#L1436 ),
again there is no documentation.

I attach our current nanodoc parsable comment. No errors are now
generated when running kernel-doc on the attached file.
Is this what you are aiming at? Can you give an example for the proper
way to format such comments?


Hello Shachar,

Please note that I'm not an expert with regard to kernel documentation 
formats. In my opinion the documentation in the attached header file is 
not only easy for humans to read, but the scripts/kernel-doc tool also 
generates nicely formatted HTML documentation from it, so I'm happy 
with the documentation format that is used in that header file.


Bart.



Re: [PATCH v2 02/12] blk-mq: Add blk_mq_unique_tag()

2014-10-13 Thread Christoph Hellwig
On Mon, Oct 13, 2014 at 11:21:54AM +0200, Bart Van Assche wrote:
> With the approach for block layer tag management proposed in this patch
> series SCSI LLDs no longer need to call this function. This means that the
> blk_mq_build_unique_tag() function can be eliminated by inlining it into
> blk_mq_unique_tag(). Would you like me to rework this patch accordingly ?

Yes, please.



Re: [PATCH v2 01/12] blk-mq: Use all available hardware queues

2014-10-13 Thread Bart Van Assche

On 10/11/14 13:11, Christoph Hellwig wrote:

On Wed, Oct 08, 2014 at 03:21:56PM +0200, Bart Van Assche wrote:

On 10/07/14 16:37, Jens Axboe wrote:

Lets do this separate, as explained last time, it needs to be evaluated
on its own and doesn't really belong in this series of patches.


Hello Jens,

A few minutes ago I have resent this patch to you with the LKML in CC. I
hope that Christoph agrees with leaving out this patch from this series
without me having to resend this series. BTW, Sagi promised me off-list to
review the other patches in this series.


Ignoring patches is one of the easier tasks, np.  Do you want me to
merge all the srp updates?  They normally go through the IB tree, but
I can pick them up this time so that we don't need to synchronize the
two trees.


Hello Christoph,

Since patch 1/12 has already been sent separately to Jens, patches 
2/12..12/12 remain. The SRP initiator changes in this series depend on 
the blk-mq and scsi-mq features added in patches 2/12..4/12. I think we 
should avoid splitting these patches over multiple kernel trees to avoid 
having to deal with dependencies between kernel trees.


Bart.


Re: [PATCH v2 02/12] blk-mq: Add blk_mq_unique_tag()

2014-10-13 Thread Bart Van Assche

On 10/11/14 13:08, Christoph Hellwig wrote:

+static inline u32 blk_mq_build_unique_tag(int hwq, int tag)
+{
+   return (hwq << BLK_MQ_UNIQUE_TAG_BITS) | (tag & BLK_MQ_UNIQUE_TAG_MASK);
+}


Is there any value in having this as a separate helper?


Hello Christoph,

With the approach for block layer tag management proposed in this patch 
series SCSI LLDs no longer need to call this function. This means that 
the blk_mq_build_unique_tag() function can be eliminated by inlining it 
into blk_mq_unique_tag(). Would you like me to rework this patch 
accordingly ?


Bart.



Re: [PATCH 8/8] IB/srp: Add multichannel support

2014-10-13 Thread Bart Van Assche

On 10/13/14 10:17, Sagi Grimberg wrote:

On 10/7/2014 3:51 PM, Bart Van Assche wrote:

On 09/23/14 18:32, Sagi Grimberg wrote:

Since you don't seem to negotiate/declare multichannel with the target,
did you test this code with some target implementations other than SCST
that happen to be out there?


(replying to an e-mail of two weeks ago)

I have just verified that the multichannel code in this patch series
works fine in combination with the upstream SRP target driver.


Working as in single channel mode? or multichannel mode?


Hello Sagi,

In my e-mail I was referring to multichannel mode.

Bart.


Re: [PATCH RFC 0/2] Indirect Fast Memory registration support

2014-10-13 Thread Bart Van Assche

On 10/12/14 21:43, Or Gerlitz wrote:

Sean, Bart - any comment on the API before Sagi sits down to code the
iSER changes?


Hello Or,

Will this API ever be supported by other Mellanox RDMA drivers than 
mlx5, e.g. mlx4 ? Will this API ever be supported by a Mellanox HCA that 
supports RoCE ? Additionally, do you perhaps know whether any other RDMA 
vendors have plans to implement this API for their RDMA HCA's ? Drivers 
like the iSER and SRP initiator and target drivers have been implemented 
on top of the RDMA API and support all hardware that implements the RDMA 
API. Since this patch series adds support for the indirect fast memory 
registration API for mlx5 only, this means that adding support for this 
API in block storage drivers also means adding a fall-back path for RDMA 
drivers that do not have support for this API. That will make these 
block storage drivers harder to maintain, which is something I would 
like to avoid.


Bart.



Re: [PATCH 1/3] s390/kernel: add system calls for access PCI memory

2014-10-13 Thread Martin Schwidefsky
On Sun, 12 Oct 2014 11:52:55 +
Shachar Raindel  wrote:

> > +   switch (length) {
> > +   case 1:
> > +   ret = get_user(value.buf8, ((u8 *)user_buffer));
> 
> This cast (and similar casts across the code) kills the __user
> annotation of the user buffer pointer.
> First - fix this to help various static verification tools such
> as sparse work on your code.
> Second - are you sure this switch-case block achieves any
> performance gain compared to always using copy_from_user?
> If so, why not just push it into the S390 copy from user code?

The __user annotation is indeed missing. Whether the switch improves
performance remains to be seen: with the compile options set for z10
the get_user() is inlined, while copy_from_user() calls a function.
For compiles targeting machines older than z10, all five switch cases
will call the same __copy_from_user() function. So it depends; as long
as the switch is correct, I am OK with the code block for now.
 
> > +   switch (length) {
> > +   case 1:
> > +   value.buf8 = __raw_readb(io_addr);
> > +   ret = put_user(value.buf8, ((u8 *)user_buffer));
> 
> Add __user annotations in this code block as well.

Yes, please add.
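For illustration, a user-space sketch of how the cast can keep its annotation follows. The get_user() stand-in macro below is an assumption made only so the fragment compiles outside the kernel; the real kernel implementation differs.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Outside a sparse run __user expands to nothing; under sparse it marks
 * the pointer as living in the user address space. */
#ifdef __CHECKER__
#define __user __attribute__((noderef, address_space(1)))
#else
#define __user
#endif

/* Stand-in for the kernel's get_user(); hypothetical, for illustration only. */
#define get_user(x, ptr) (memcpy(&(x), (const void *)(ptr), sizeof(x)), 0)

/* The cast keeps the __user qualifier, so sparse can still verify that
 * the pointer is never dereferenced directly. */
int read_user_u8(const void __user *user_buffer, uint8_t *out)
{
	return get_user(*out, (const uint8_t __user *)user_buffer);
}
```

The same pattern applies to the put_user() direction: cast to `(u8 __user *)` rather than `(u8 *)`.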

> Generally speaking, looks OK once the __user annotation is added.
> 
> I suspect you might need ack/review from the S390 maintainer as
> well for this to be pushed, as the syscall is generic to the
> entire S390 subsystem.

With the missing __user annotations added:
Acked-by: Martin Schwidefsky 

-- 
blue skies,
   Martin.

"Reality continues to ruin my life." - Calvin.



Re: [PATCH RFC 2/2] IB/mlx5: Implement Fast Indirect Memory Registration Feature

2014-10-13 Thread Sagi Grimberg

On 10/12/2014 10:39 PM, Or Gerlitz wrote:


On 10/7/2014 4:48 PM, Sagi Grimberg wrote:


* Nit change in mr_align() static routine to handle void*
instead of __be64.


nit comment... any reason not to put in different and unrelated to this
series patch?


I can, But there is no use for this out of this patch context.



diff --git a/drivers/infiniband/hw/mlx5/mlx5_ib.h
b/drivers/infiniband/hw/mlx5/mlx5_ib.h
--- a/drivers/infiniband/hw/mlx5/qp.c
+++ b/drivers/infiniband/hw/mlx5/qp.c



+static void set_indir_umr_segment(struct mlx5_wqe_umr_ctrl_seg *umr,
+				  struct ib_send_wr *wr)
+{
+	u64 mask;
+	u32 list_len = wr->wr.indir_reg.indir_list_len;
+
+	memset(umr, 0, sizeof(*umr));
+
+	umr->klm_octowords = get_klm_octo(list_len * 2);
+	mask = MLX5_MKEY_MASK_LEN |
+	       MLX5_MKEY_MASK_PAGE_SIZE |
+	       MLX5_MKEY_MASK_START_ADDR |
+	       MLX5_MKEY_MASK_EN_RINVAL |
+	       MLX5_MKEY_MASK_KEY |
+	       MLX5_MKEY_MASK_LR |
+	       MLX5_MKEY_MASK_LW |
+	       MLX5_MKEY_MASK_RR |
+	       MLX5_MKEY_MASK_RW |
+	       MLX5_MKEY_MASK_A |
+	       MLX5_MKEY_MASK_FREE;
+
+	umr->mkey_mask = cpu_to_be64(mask);
+}


here you basically replicate the majority of the code from
set_reg_umr_segment - share the common part...


I've thought about it, but I know Eli prefers not to centralize the
umr control segment (and mkey context) setting routines at this point.
I am not against doing it at all, but let's let Eli comment here.

Eli?
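For reference, the factoring Or suggests might look like the sketch below. The mask constants here are stand-ins (the real MLX5_MKEY_MASK_* values live in the mlx5 driver headers), and whether to merge is Eli's call.

```c
#include <stdint.h>

/* Stand-in mask bits for illustration; the real MLX5_MKEY_MASK_* values
 * are defined in the mlx5 driver headers. */
#define MKEY_MASK_LEN		(1ULL << 0)
#define MKEY_MASK_START_ADDR	(1ULL << 2)
#define MKEY_MASK_KEY		(1ULL << 3)
#define MKEY_MASK_FREE		(1ULL << 4)

/* One helper returns the bits shared by the fastreg and indirect UMR
 * control segments; each caller then ORs in its specific bits. */
uint64_t umr_common_mkey_mask(void)
{
	return MKEY_MASK_LEN | MKEY_MASK_START_ADDR |
	       MKEY_MASK_KEY | MKEY_MASK_FREE;
}
```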

Sagi.


Re: [PATCH RFC 1/2] IB/core: Introduce Fast Indirect Memory Registration verbs API

2014-10-13 Thread Sagi Grimberg

On 10/9/2014 11:13 PM, Or Gerlitz wrote:

On Tue, Oct 7, 2014, Sagi Grimberg  wrote:
[...]

  enum ib_signature_prot_cap {
@@ -182,6 +183,7 @@ struct ib_device_attr {
	int			max_srq_wr;
	int			max_srq_sge;
	unsigned int		max_fast_reg_page_list_len;
+	unsigned int		max_indir_reg_mr_list_len;


The indirect registration list is basically made of struct ib_sge
objects which are posted on a send-like work request. Any reason to
have a dedicated device capability attribute for that rather than
using the already existing one (max_sge)?



Hi Or,

The max_send_sge capability is how many ib_sge's the device can handle in a
single send work request, which takes into consideration elements such as
the WQE control segment size and SQ reservations. This is different from how
many ib_sge's the device can register to a single indirect memory key,
which is free of such limitations.

So given these are different capabilities I prefer to expose them in
different attributes.
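A sketch of how a ULP might consume the two attributes. The field names follow the quoted patch, but the simplified struct and the decision logic itself are assumptions for illustration.

```c
/* Simplified stand-in for struct ib_device_attr: only the two
 * capability fields discussed above. */
struct dev_caps {
	unsigned int max_fast_reg_page_list_len;
	unsigned int max_indir_reg_mr_list_len;
};

/* Returns 1 when the device can register the whole scatterlist with a
 * single indirect memory key, 0 when the ULP must fall back to the
 * classic page-aligned fast registration path. */
int use_indir_reg(const struct dev_caps *c, unsigned int nents)
{
	return c->max_indir_reg_mr_list_len >= nents;
}
```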

Sagi.


Re: [PATCH 8/8] IB/srp: Add multichannel support

2014-10-13 Thread Sagi Grimberg

On 10/7/2014 3:51 PM, Bart Van Assche wrote:

On 09/23/14 18:32, Sagi Grimberg wrote:

Since you don't seem to negotiate/declare multichannel with the target,
did you test this code with some target implementations other than SCST
that happen to be out there?


(replying to an e-mail of two weeks ago)

Hello Sagi,

I have just verified that the multichannel code in this patch series
works fine in combination with the upstream SRP target driver.



Working as in single-channel mode, or in multichannel mode?

Sagi.


Re: [PATCH 1/1] IB/iser: Remove hard coded values for cqe and send_wr

2014-10-13 Thread Sagi Grimberg

On 10/9/2014 8:14 AM, Jayamohan.K wrote:


Hi Minh and Jayamohan,

So I agree that we would want to take device capabilities into account
here, but we need to be able to adjust scsi_cmds_max (can_queue) in case
the max wqe supported is lower than scsi_cmds_max * num_posts_per_cmd.



I feel we should be fine as long as we support the max_cmds of 128
supported by the open-iscsi layer.



The cmds_max can be modified by the user, so we should be fine in this
case as well.


I see the iSER layer uses a value of 512, though it only serves as a
limit check.


So if iser supports less than 512 (due to device capability) the
boundary check should be modified as well.




So generally I agree with this approach, but we need to take care of
stuff later when the session is created.

One more thing, this is not rebased on the latest iser patches please
send v1 on top of:
http://marc.info/?l=linux-rdma&m=141216135013146&w=2



Yes, I will recreate the patch and send on top of this


Great!





P.S.
What device did you test with (that supports less than iSER needs)?



This question still holds.

Sagi.




Re: [PATCH RFC 0/2] Indirect Fast Memory registration support

2014-10-13 Thread Sagi Grimberg

On 10/8/2014 2:06 PM, Devesh Sharma wrote:

-Original Message-
From: linux-rdma-ow...@vger.kernel.org [mailto:linux-rdma-
ow...@vger.kernel.org] On Behalf Of Sagi Grimberg
Sent: Tuesday, October 07, 2014 8:18 PM
To: linux-rdma@vger.kernel.org
Cc: bvanass...@acm.org; rol...@kernel.org; e...@mellanox.com;
ogerl...@mellanox.com; o...@mellanox.com; sean.he...@intel.com
Subject: [PATCH RFC 0/2] Indirect Fast Memory registration support

This patch set introduces support for registering a scattered memory area in
an indirect manner.

Current fast registration support has a known limitation: the memory
must be page aligned, meaning the memory scatters must be page-sized
chunks, except for the first, which may start at some offset into a
page, and the last, which may end before the page boundary.

This can make life hard for ULPs which may serve a scattered list without the
above limitations. Two immediate examples are iSER and SRP that have some
extra logic or work-arounds to handle an arbitrary scatter list of buffers
(which is supported in the entire stack above them).

The proposed API attempts to follow the well-known fast registration
scheme while allowing the ULP to pass an sg vector rather than a page list
(u64 vector).
I expect ULPs to make use of the global DMA key to populate the lkey of the
sg vector. For example, for a scatterlist sg of 3 elements, the indirect ib_sge
vector will look like:
ib_sge[0]: {dma_key, sg[0]->dma_addr, sg[0]->length}


Here each sge can be of max 4G like regular SGEs?


Correct. This is a normal ib_sge.


And can lkey in each SGE be different?


Yes of course. Generally each lkey can be a different pre-registered
memory. I specifically used this example to show how a ULP would utilize
the global DMA lkey to register an arbitrary scatterlist.




ib_sge[1]: {dma_key, sg[1]->dma_addr, sg[1]->length}
ib_sge[2]: {dma_key, sg[2]->dma_addr, sg[2]->length}
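The population described in this example can be sketched like this; the struct layouts are simplified stand-ins for the kernel's ib_sge and a DMA-mapped scatterlist entry, not the real types.

```c
#include <stdint.h>

/* Simplified stand-ins; the real kernel types carry more fields. */
struct sge    { uint32_t lkey; uint64_t addr; uint32_t length; };
struct sg_ent { uint64_t dma_addr; uint32_t length; };

/* Build the indirect ib_sge vector from an already DMA-mapped
 * scatterlist: every entry reuses the same pre-registered global DMA
 * lkey, with no alignment restriction on addresses or lengths. */
void build_indir_sges(struct sge *out, const struct sg_ent *sg,
		      int nents, uint32_t global_dma_lkey)
{
	for (int i = 0; i < nents; i++) {
		out[i].lkey   = global_dma_lkey;
		out[i].addr   = sg[i].dma_addr;
		out[i].length = sg[i].length;
	}
}
```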



Sagi.


Re: [PATCH RFC 1/2] IB/core: Introduce Fast Indirect Memory Registration verbs API

2014-10-13 Thread Sagi Grimberg

On 10/8/2014 4:54 PM, Steve Wise wrote:

@@ -1056,6 +1067,14 @@ struct ib_send_wr {
		int			access_flags;
		struct ib_sge		*prot;
	} sig_handover;
+	struct {
+		u64			iova_start;
+		struct ib_indir_reg_list *indir_list;
+		unsigned int		indir_list_len;
+		u64			length;
+		unsigned int		access_flags;
+		u32			mkey;
+	} indir_reg;


What is mkey?  Shouldn't this be an rkey?


mkey means memory key. I can change it to rkey if that
is clearer.


Is it valid to use an lkey here?  Or is an rkey required?  If an rkey is 
required, then I'd say it is clearer to name it rkey (and
that aligns with the fastreg struct).



It is valid. The memory key depends on the use case.
In case a client wants to send an rkey to a peer, it will register using
an rkey. In case a server wants to transfer data from its local buffer,
it will register using an lkey.

So I didn't impose a specific key here - this is why I chose mkey.

I can modify it to rkey to mimic the well known fastreg, but its not
a must.

Sagi.