Re: rdma_getaddrinfo and GUID

2015-07-22 Thread Vasiliy Tolstov
2015-07-23 7:29 GMT+03:00 Hefty, Sean :
>> Hello, is it possible to use rdma_getaddrinfo and specify a port GUID as
>> the node parameter?
>
> The current code does not support this.


Why does rdma_getaddrinfo not return an error in that case? Why does it
return success with a result that contains the port GUID?

-- 
Vasiliy Tolstov,
e-mail: v.tols...@selfip.ru
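
For reference, a minimal user-space sketch of the call under discussion,
passing a GUID string as the node argument of rdma_getaddrinfo(). The GUID
and service values below are made up, and, as the quoted reply says, the
current code does not actually support resolving by GUID; the point of the
question is that the call still returns success.

#include <stdio.h>
#include <rdma/rdma_cma.h>

int main(void)
{
	struct rdma_addrinfo hints = { 0 }, *res = NULL;

	hints.ai_port_space = RDMA_PS_TCP;

	/* hypothetical port GUID string instead of a hostname or IP */
	if (rdma_getaddrinfo("0002:c903:0010:8f51", "7471", &hints, &res)) {
		perror("rdma_getaddrinfo");
		return 1;
	}

	printf("success, ai_dst_len=%u\n", (unsigned)res->ai_dst_len);
	rdma_freeaddrinfo(res);
	return 0;
}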


RE: rdma_getaddrinfo and GUID

2015-07-22 Thread Hefty, Sean
> Hello, is it possible to use rdma_getaddrinfo and specify a port GUID as
> the node parameter?

The current code does not support this.


RE: [PATCH WIP 01/43] IB: Modify ib_create_mr API

2015-07-22 Thread Hefty, Sean
> +enum ib_mr_type {
> + IB_MR_TYPE_FAST_REG,
> + IB_MR_TYPE_SIGNATURE,

If we're going to go through the trouble of changing everything, I vote for 
dropping the word 'fast'. It's a marketing term.  It's goofy.  And the IB spec 
is goofy for using it.


RE: [PATCH V3 1/5] RDMA/core: Transport-independent access flags

2015-07-22 Thread Hefty, Sean
> > This just emphasizes why we need to converge to a single method.
> 
> In my opinion, we already have it.
> 
> For local registrations, ib_reg_phys_mr()/ib_get_dma_mr(). These are not
> quite equivalent, btw.

Personally, I would work to eliminate local registration as part of the API.

> For remote registrations, ib_post_send(FRMR).

I slightly agree here.  IMO, the rkey should be obtained by associating the 
memory buffer with the QP, but not explicitly as a 'send' operation.

If queuing is a concern, we could expose 
max_active_registrations/max_total_registrations QP attributes.

I would hide the protection domain concept entirely.  If needed, a provider can 
allocate one PD per QP, with memory registration going to the PD.

- Sean
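
A purely hypothetical sketch of the interface being described above; none of
these names exist in the kernel, they are invented here only to illustrate
"obtain the rkey by associating the buffer with the QP" and "hide the PD
behind the QP":

#include <linux/types.h>

struct ib_qp;
struct scatterlist;

/* Hypothetical QP attributes for the queuing concern mentioned above. */
struct hypothetical_qp_reg_caps {
	u32 max_active_registrations;	/* rkeys valid at any one time */
	u32 max_total_registrations;	/* lifetime limit, if the HW has one */
};

/*
 * Hypothetical: associate a DMA-mapped buffer with the QP and get an rkey
 * back, without posting an explicit 'send' work request.  A provider that
 * needs a PD could allocate one per QP internally, so the PD never appears
 * in the ULP-facing API.
 */
int hypothetical_qp_reg_mr(struct ib_qp *qp, struct scatterlist *sg,
			   int sg_nents, unsigned int access, u32 *rkey);

/* Hypothetical: invalidate the rkey once the remote peer is done with it. */
int hypothetical_qp_unreg_mr(struct ib_qp *qp, u32 rkey);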


[PATCH 10/10] net/9p: Remove ib_get_dma_mr calls

2015-07-22 Thread Jason Gunthorpe
The pd now has a local_dma_lkey member which completely replaces
ib_get_dma_mr, use it instead.

Signed-off-by: Jason Gunthorpe 
---
 net/9p/trans_rdma.c | 26 ++
 1 file changed, 2 insertions(+), 24 deletions(-)

diff --git a/net/9p/trans_rdma.c b/net/9p/trans_rdma.c
index 37a78d20c0f6..ba1210253f5e 100644
--- a/net/9p/trans_rdma.c
+++ b/net/9p/trans_rdma.c
@@ -94,8 +94,6 @@ struct p9_trans_rdma {
struct ib_pd *pd;
struct ib_qp *qp;
struct ib_cq *cq;
-   struct ib_mr *dma_mr;
-   u32 lkey;
long timeout;
int sq_depth;
struct semaphore sq_sem;
@@ -382,9 +380,6 @@ static void rdma_destroy_trans(struct p9_trans_rdma *rdma)
if (!rdma)
return;
 
-   if (rdma->dma_mr && !IS_ERR(rdma->dma_mr))
-   ib_dereg_mr(rdma->dma_mr);
-
if (rdma->qp && !IS_ERR(rdma->qp))
ib_destroy_qp(rdma->qp);
 
@@ -415,7 +410,7 @@ post_recv(struct p9_client *client, struct p9_rdma_context 
*c)
 
sge.addr = c->busa;
sge.length = client->msize;
-   sge.lkey = rdma->lkey;
+   sge.lkey = rdma->pd->local_dma_lkey;
 
wr.next = NULL;
c->wc_op = IB_WC_RECV;
@@ -506,7 +501,7 @@ dont_need_post_recv:
 
sge.addr = c->busa;
sge.length = c->req->tc->size;
-   sge.lkey = rdma->lkey;
+   sge.lkey = rdma->pd->local_dma_lkey;
 
wr.next = NULL;
c->wc_op = IB_WC_SEND;
@@ -647,7 +642,6 @@ rdma_create_trans(struct p9_client *client, const char 
*addr, char *args)
struct p9_trans_rdma *rdma;
struct rdma_conn_param conn_param;
struct ib_qp_init_attr qp_attr;
-   struct ib_device_attr devattr;
struct ib_cq_init_attr cq_attr = {};
 
/* Parse the transport specific mount options */
@@ -700,11 +694,6 @@ rdma_create_trans(struct p9_client *client, const char 
*addr, char *args)
if (err || (rdma->state != P9_RDMA_ROUTE_RESOLVED))
goto error;
 
-   /* Query the device attributes */
-   err = ib_query_device(rdma->cm_id->device, &devattr);
-   if (err)
-   goto error;
-
/* Create the Completion Queue */
cq_attr.cqe = opts.sq_depth + opts.rq_depth + 1;
rdma->cq = ib_create_cq(rdma->cm_id->device, cq_comp_handler,
@@ -719,17 +708,6 @@ rdma_create_trans(struct p9_client *client, const char 
*addr, char *args)
if (IS_ERR(rdma->pd))
goto error;
 
-   /* Cache the DMA lkey in the transport */
-   rdma->dma_mr = NULL;
-   if (devattr.device_cap_flags & IB_DEVICE_LOCAL_DMA_LKEY)
-   rdma->lkey = rdma->cm_id->device->local_dma_lkey;
-   else {
-   rdma->dma_mr = ib_get_dma_mr(rdma->pd, IB_ACCESS_LOCAL_WRITE);
-   if (IS_ERR(rdma->dma_mr))
-   goto error;
-   rdma->lkey = rdma->dma_mr->lkey;
-   }
-
/* Create the Queue Pair */
memset(&qp_attr, 0, sizeof qp_attr);
qp_attr.event_handler = qp_event_handler;
-- 
2.1.4



[PATCH 01/10] IB/core: Guarantee that a local_dma_lkey is available

2015-07-22 Thread Jason Gunthorpe
Every single ULP requires a local_dma_lkey to do anything with
a QP, so let us ensure one exists for every PD created.

If the driver can supply a global local_dma_lkey then use that; otherwise
ask the driver to create a local-use, all-physical-memory MR associated
with the new PD.

Signed-off-by: Jason Gunthorpe 
Reviewed-by: Sagi Grimberg 
Acked-by: Christoph Hellwig 
---
 drivers/infiniband/core/verbs.c | 40 +++-
 include/rdma/ib_verbs.h |  2 ++
 2 files changed, 37 insertions(+), 5 deletions(-)

diff --git a/drivers/infiniband/core/verbs.c b/drivers/infiniband/core/verbs.c
index bac3fb406a74..1ddf06314f36 100644
--- a/drivers/infiniband/core/verbs.c
+++ b/drivers/infiniband/core/verbs.c
@@ -213,24 +213,54 @@ EXPORT_SYMBOL(rdma_port_get_link_layer);
 
 /* Protection domains */
 
+/* Return a pd for in-kernel use that has a local_dma_lkey which provides
+   local access to all physical memory. */
 struct ib_pd *ib_alloc_pd(struct ib_device *device)
 {
struct ib_pd *pd;
+   struct ib_device_attr devattr;
+   int rc;
+
+   rc = ib_query_device(device, &devattr);
+   if (rc)
+   return ERR_PTR(rc);
 
pd = device->alloc_pd(device, NULL, NULL);
+   if (IS_ERR(pd))
+   return pd;
+
+   pd->device = device;
+   pd->uobject = NULL;
+   pd->local_mr = NULL;
+   atomic_set(&pd->usecnt, 0);
+
+   if (devattr.device_cap_flags & IB_DEVICE_LOCAL_DMA_LKEY)
+   pd->local_dma_lkey = device->local_dma_lkey;
+   else {
+   struct ib_mr *mr;
+
+   mr = ib_get_dma_mr(pd, IB_ACCESS_LOCAL_WRITE);
+   if (IS_ERR(mr)) {
+   ib_dealloc_pd(pd);
+   return (struct ib_pd *)mr;
+   }
 
-   if (!IS_ERR(pd)) {
-   pd->device  = device;
-   pd->uobject = NULL;
-   atomic_set(&pd->usecnt, 0);
+   pd->local_mr = mr;
+   pd->local_dma_lkey = pd->local_mr->lkey;
}
-
return pd;
 }
+
 EXPORT_SYMBOL(ib_alloc_pd);
 
 int ib_dealloc_pd(struct ib_pd *pd)
 {
+   if (pd->local_mr) {
+   if (ib_dereg_mr(pd->local_mr))
+   return -EBUSY;
+   pd->local_mr = NULL;
+   }
+
if (atomic_read(&pd->usecnt))
return -EBUSY;
 
diff --git a/include/rdma/ib_verbs.h b/include/rdma/ib_verbs.h
index 986fddb08579..cfda95d7b067 100644
--- a/include/rdma/ib_verbs.h
+++ b/include/rdma/ib_verbs.h
@@ -1255,6 +1255,8 @@ struct ib_pd {
struct ib_device   *device;
struct ib_uobject  *uobject;
atomic_tusecnt; /* count all resources */
+   struct ib_mr   *local_mr;
+   u32 local_dma_lkey;
 };
 
 struct ib_xrcd {
-- 
2.1.4



[PATCH 02/10] IB/mad: Remove ib_get_dma_mr calls

2015-07-22 Thread Jason Gunthorpe
The pd now has a local_dma_lkey member which completely replaces
ib_get_dma_mr, use it instead.

Signed-off-by: Jason Gunthorpe 
---
 drivers/infiniband/core/mad.c  | 26 +++---
 drivers/infiniband/core/mad_priv.h |  1 -
 include/rdma/ib_mad.h  |  1 -
 3 files changed, 3 insertions(+), 25 deletions(-)

diff --git a/drivers/infiniband/core/mad.c b/drivers/infiniband/core/mad.c
index a4b1466c1bf6..3c5fa8dc5ba7 100644
--- a/drivers/infiniband/core/mad.c
+++ b/drivers/infiniband/core/mad.c
@@ -338,13 +338,6 @@ struct ib_mad_agent *ib_register_mad_agent(struct 
ib_device *device,
goto error1;
}
 
-   mad_agent_priv->agent.mr = ib_get_dma_mr(port_priv->qp_info[qpn].qp->pd,
-IB_ACCESS_LOCAL_WRITE);
-   if (IS_ERR(mad_agent_priv->agent.mr)) {
-   ret = ERR_PTR(-ENOMEM);
-   goto error2;
-   }
-
if (mad_reg_req) {
reg_req = kmemdup(mad_reg_req, sizeof *reg_req, GFP_KERNEL);
if (!reg_req) {
@@ -429,8 +422,6 @@ error4:
spin_unlock_irqrestore(&port_priv->reg_lock, flags);
kfree(reg_req);
 error3:
-   ib_dereg_mr(mad_agent_priv->agent.mr);
-error2:
kfree(mad_agent_priv);
 error1:
return ret;
@@ -590,7 +581,6 @@ static void unregister_mad_agent(struct 
ib_mad_agent_private *mad_agent_priv)
wait_for_completion(&mad_agent_priv->comp);
 
kfree(mad_agent_priv->reg_req);
-   ib_dereg_mr(mad_agent_priv->agent.mr);
kfree(mad_agent_priv);
 }
 
@@ -1037,7 +1027,7 @@ struct ib_mad_send_buf * ib_create_send_mad(struct 
ib_mad_agent *mad_agent,
 
mad_send_wr->mad_agent_priv = mad_agent_priv;
mad_send_wr->sg_list[0].length = hdr_len;
-   mad_send_wr->sg_list[0].lkey = mad_agent->mr->lkey;
+   mad_send_wr->sg_list[0].lkey = mad_agent->qp->pd->local_dma_lkey;
 
/* OPA MADs don't have to be the full 2048 bytes */
if (opa && base_version == OPA_MGMT_BASE_VERSION &&
@@ -1046,7 +1036,7 @@ struct ib_mad_send_buf * ib_create_send_mad(struct 
ib_mad_agent *mad_agent,
else
mad_send_wr->sg_list[1].length = mad_size - hdr_len;
 
-   mad_send_wr->sg_list[1].lkey = mad_agent->mr->lkey;
+   mad_send_wr->sg_list[1].lkey = mad_agent->qp->pd->local_dma_lkey;
 
mad_send_wr->send_wr.wr_id = (unsigned long) mad_send_wr;
mad_send_wr->send_wr.sg_list = mad_send_wr->sg_list;
@@ -2884,7 +2874,7 @@ static int ib_mad_post_receive_mads(struct ib_mad_qp_info 
*qp_info,
struct ib_mad_queue *recv_queue = &qp_info->recv_queue;
 
/* Initialize common scatter list fields */
-   sg_list.lkey = (*qp_info->port_priv->mr).lkey;
+   sg_list.lkey = qp_info->port_priv->pd->local_dma_lkey;
 
/* Initialize common receive WR fields */
recv_wr.next = NULL;
@@ -3200,13 +3190,6 @@ static int ib_mad_port_open(struct ib_device *device,
goto error4;
}
 
-   port_priv->mr = ib_get_dma_mr(port_priv->pd, IB_ACCESS_LOCAL_WRITE);
-   if (IS_ERR(port_priv->mr)) {
-   dev_err(&device->dev, "Couldn't get ib_mad DMA MR\n");
-   ret = PTR_ERR(port_priv->mr);
-   goto error5;
-   }
-
if (has_smi) {
ret = create_mad_qp(&port_priv->qp_info[0], IB_QPT_SMI);
if (ret)
@@ -3247,8 +3230,6 @@ error8:
 error7:
destroy_mad_qp(&port_priv->qp_info[0]);
 error6:
-   ib_dereg_mr(port_priv->mr);
-error5:
ib_dealloc_pd(port_priv->pd);
 error4:
ib_destroy_cq(port_priv->cq);
@@ -3283,7 +3264,6 @@ static int ib_mad_port_close(struct ib_device *device, 
int port_num)
destroy_workqueue(port_priv->wq);
destroy_mad_qp(&port_priv->qp_info[1]);
destroy_mad_qp(&port_priv->qp_info[0]);
-   ib_dereg_mr(port_priv->mr);
ib_dealloc_pd(port_priv->pd);
ib_destroy_cq(port_priv->cq);
cleanup_recv_queue(&port_priv->qp_info[1]);
diff --git a/drivers/infiniband/core/mad_priv.h 
b/drivers/infiniband/core/mad_priv.h
index 5be89f98928f..4a4f7aad0978 100644
--- a/drivers/infiniband/core/mad_priv.h
+++ b/drivers/infiniband/core/mad_priv.h
@@ -199,7 +199,6 @@ struct ib_mad_port_private {
int port_num;
struct ib_cq *cq;
struct ib_pd *pd;
-   struct ib_mr *mr;
 
spinlock_t reg_lock;
struct ib_mad_mgmt_version_table version[MAX_MGMT_VERSION];
diff --git a/include/rdma/ib_mad.h b/include/rdma/ib_mad.h
index c8422d5a5a91..1f27023c919a 100644
--- a/include/rdma/ib_mad.h
+++ b/include/rdma/ib_mad.h
@@ -388,7 +388,6 @@ enum {
 struct ib_mad_agent {
struct ib_device*device;
struct ib_qp*qp;
-   struct ib_mr*mr;
ib_mad_recv_handler recv_handler;
ib_mad_send_handler send_handler;
ib_mad_snoop_handlersnoop_handler;
-- 
2.1.4


[PATCH 06/10] IB/iser: Use pd->local_dma_lkey

2015-07-22 Thread Jason Gunthorpe
Replace all lkeys with pd->local_dma_lkey. This driver does not support
iWarp, so this is safe.

The insecure use of ib_get_dma_mr is thus isolated to an rkey, and this
looks trivially fixed by forcing the use of registration in a future
patch.

Signed-off-by: Jason Gunthorpe 
---
 drivers/infiniband/ulp/iser/iscsi_iser.c | 2 +-
 drivers/infiniband/ulp/iser/iser_initiator.c | 8 
 drivers/infiniband/ulp/iser/iser_memory.c| 2 +-
 drivers/infiniband/ulp/iser/iser_verbs.c | 2 +-
 4 files changed, 7 insertions(+), 7 deletions(-)

diff --git a/drivers/infiniband/ulp/iser/iscsi_iser.c 
b/drivers/infiniband/ulp/iser/iscsi_iser.c
index 6a594aac2290..f44c6b879329 100644
--- a/drivers/infiniband/ulp/iser/iscsi_iser.c
+++ b/drivers/infiniband/ulp/iser/iscsi_iser.c
@@ -204,7 +204,7 @@ iser_initialize_task_headers(struct iscsi_task *task,
tx_desc->dma_addr = dma_addr;
tx_desc->tx_sg[0].addr   = tx_desc->dma_addr;
tx_desc->tx_sg[0].length = ISER_HEADERS_LEN;
-   tx_desc->tx_sg[0].lkey   = device->mr->lkey;
+   tx_desc->tx_sg[0].lkey   = device->pd->local_dma_lkey;
 
iser_task->iser_conn = iser_conn;
 out:
diff --git a/drivers/infiniband/ulp/iser/iser_initiator.c 
b/drivers/infiniband/ulp/iser/iser_initiator.c
index 3e2118e8ed87..2d02f042c69a 100644
--- a/drivers/infiniband/ulp/iser/iser_initiator.c
+++ b/drivers/infiniband/ulp/iser/iser_initiator.c
@@ -173,8 +173,8 @@ static void iser_create_send_desc(struct iser_conn  
*iser_conn,
 
tx_desc->num_sge = 1;
 
-   if (tx_desc->tx_sg[0].lkey != device->mr->lkey) {
-   tx_desc->tx_sg[0].lkey = device->mr->lkey;
+   if (tx_desc->tx_sg[0].lkey != device->pd->local_dma_lkey) {
+   tx_desc->tx_sg[0].lkey = device->pd->local_dma_lkey;
iser_dbg("sdesc %p lkey mismatch, fixing\n", tx_desc);
}
 }
@@ -291,7 +291,7 @@ int iser_alloc_rx_descriptors(struct iser_conn *iser_conn,
rx_sg = &rx_desc->rx_sg;
rx_sg->addr   = rx_desc->dma_addr;
rx_sg->length = ISER_RX_PAYLOAD_SIZE;
-   rx_sg->lkey   = device->mr->lkey;
+   rx_sg->lkey   = device->pd->local_dma_lkey;
}
 
iser_conn->rx_desc_head = 0;
@@ -543,7 +543,7 @@ int iser_send_control(struct iscsi_conn *conn,
 
tx_dsg->addr= iser_conn->login_req_dma;
tx_dsg->length  = task->data_count;
-   tx_dsg->lkey= device->mr->lkey;
+   tx_dsg->lkey= device->pd->local_dma_lkey;
mdesc->num_sge = 2;
}
 
diff --git a/drivers/infiniband/ulp/iser/iser_memory.c 
b/drivers/infiniband/ulp/iser/iser_memory.c
index f0cdc961eb11..3129a42150ff 100644
--- a/drivers/infiniband/ulp/iser/iser_memory.c
+++ b/drivers/infiniband/ulp/iser/iser_memory.c
@@ -393,7 +393,7 @@ iser_reg_dma(struct iser_device *device, struct 
iser_data_buf *mem,
 {
struct scatterlist *sg = mem->sg;
 
-   reg->sge.lkey = device->mr->lkey;
+   reg->sge.lkey = device->pd->local_dma_lkey;
reg->rkey = device->mr->rkey;
reg->sge.addr = ib_sg_dma_address(device->ib_device, &sg[0]);
reg->sge.length = ib_sg_dma_len(device->ib_device, &sg[0]);
diff --git a/drivers/infiniband/ulp/iser/iser_verbs.c 
b/drivers/infiniband/ulp/iser/iser_verbs.c
index 5c9f565ea0e8..52268356c79e 100644
--- a/drivers/infiniband/ulp/iser/iser_verbs.c
+++ b/drivers/infiniband/ulp/iser/iser_verbs.c
@@ -1017,7 +1017,7 @@ int iser_post_recvl(struct iser_conn *iser_conn)
 
sge.addr   = iser_conn->login_resp_dma;
sge.length = ISER_RX_LOGIN_SIZE;
-   sge.lkey   = ib_conn->device->mr->lkey;
+   sge.lkey   = ib_conn->device->pd->local_dma_lkey;
 
rx_wr.wr_id   = (uintptr_t)iser_conn->login_resp_buf;
rx_wr.sg_list = &sge;
-- 
2.1.4



[PATCH 04/10] IB/mlx4: Remove ib_get_dma_mr calls

2015-07-22 Thread Jason Gunthorpe
The pd now has a local_dma_lkey member which completely replaces
ib_get_dma_mr, use it instead.

Signed-off-by: Jason Gunthorpe 
---
 drivers/infiniband/hw/mlx4/mad.c | 23 ---
 drivers/infiniband/hw/mlx4/mlx4_ib.h |  1 -
 2 files changed, 4 insertions(+), 20 deletions(-)

diff --git a/drivers/infiniband/hw/mlx4/mad.c b/drivers/infiniband/hw/mlx4/mad.c
index 85a50df2f203..3a897ed3db30 100644
--- a/drivers/infiniband/hw/mlx4/mad.c
+++ b/drivers/infiniband/hw/mlx4/mad.c
@@ -580,7 +580,7 @@ int mlx4_ib_send_to_slave(struct mlx4_ib_dev *dev, int 
slave, u8 port,
 
list.addr = tun_qp->tx_ring[tun_tx_ix].buf.map;
list.length = sizeof (struct mlx4_rcv_tunnel_mad);
-   list.lkey = tun_ctx->mr->lkey;
+   list.lkey = tun_ctx->pd->local_dma_lkey;
 
wr.wr.ud.ah = ah;
wr.wr.ud.port_num = port;
@@ -1123,7 +1123,7 @@ static int mlx4_ib_post_pv_qp_buf(struct 
mlx4_ib_demux_pv_ctx *ctx,
 
sg_list.addr = tun_qp->ring[index].map;
sg_list.length = size;
-   sg_list.lkey = ctx->mr->lkey;
+   sg_list.lkey = ctx->pd->local_dma_lkey;
 
recv_wr.next = NULL;
recv_wr.sg_list = &sg_list;
@@ -1234,7 +1234,7 @@ int mlx4_ib_send_to_wire(struct mlx4_ib_dev *dev, int 
slave, u8 port,
 
list.addr = sqp->tx_ring[wire_tx_ix].buf.map;
list.length = sizeof (struct mlx4_mad_snd_buf);
-   list.lkey = sqp_ctx->mr->lkey;
+   list.lkey = sqp_ctx->pd->local_dma_lkey;
 
wr.wr.ud.ah = ah;
wr.wr.ud.port_num = port;
@@ -1817,19 +1817,12 @@ static int create_pv_resources(struct ib_device *ibdev, 
int slave, int port,
goto err_cq;
}
 
-   ctx->mr = ib_get_dma_mr(ctx->pd, IB_ACCESS_LOCAL_WRITE);
-   if (IS_ERR(ctx->mr)) {
-   ret = PTR_ERR(ctx->mr);
-   pr_err("Couldn't get tunnel DMA MR (%d)\n", ret);
-   goto err_pd;
-   }
-
if (ctx->has_smi) {
ret = create_pv_sqp(ctx, IB_QPT_SMI, create_tun);
if (ret) {
pr_err("Couldn't create %s QP0 (%d)\n",
   create_tun ? "tunnel for" : "",  ret);
-   goto err_mr;
+   goto err_pd;
}
}
 
@@ -1866,10 +1859,6 @@ err_qp0:
ib_destroy_qp(ctx->qp[0].qp);
ctx->qp[0].qp = NULL;
 
-err_mr:
-   ib_dereg_mr(ctx->mr);
-   ctx->mr = NULL;
-
 err_pd:
ib_dealloc_pd(ctx->pd);
ctx->pd = NULL;
@@ -1906,8 +1895,6 @@ static void destroy_pv_resources(struct mlx4_ib_dev *dev, 
int slave, int port,
ib_destroy_qp(ctx->qp[1].qp);
ctx->qp[1].qp = NULL;
mlx4_ib_free_pv_qp_bufs(ctx, IB_QPT_GSI, 1);
-   ib_dereg_mr(ctx->mr);
-   ctx->mr = NULL;
ib_dealloc_pd(ctx->pd);
ctx->pd = NULL;
ib_destroy_cq(ctx->cq);
@@ -2040,8 +2027,6 @@ static void mlx4_ib_free_sqp_ctx(struct 
mlx4_ib_demux_pv_ctx *sqp_ctx)
ib_destroy_qp(sqp_ctx->qp[1].qp);
sqp_ctx->qp[1].qp = NULL;
mlx4_ib_free_pv_qp_bufs(sqp_ctx, IB_QPT_GSI, 0);
-   ib_dereg_mr(sqp_ctx->mr);
-   sqp_ctx->mr = NULL;
ib_dealloc_pd(sqp_ctx->pd);
sqp_ctx->pd = NULL;
ib_destroy_cq(sqp_ctx->cq);
diff --git a/drivers/infiniband/hw/mlx4/mlx4_ib.h 
b/drivers/infiniband/hw/mlx4/mlx4_ib.h
index 334387f63358..9066fc2406cc 100644
--- a/drivers/infiniband/hw/mlx4/mlx4_ib.h
+++ b/drivers/infiniband/hw/mlx4/mlx4_ib.h
@@ -415,7 +415,6 @@ struct mlx4_ib_demux_pv_ctx {
struct ib_device *ib_dev;
struct ib_cq *cq;
struct ib_pd *pd;
-   struct ib_mr *mr;
struct work_struct work;
struct workqueue_struct *wq;
struct mlx4_ib_demux_pv_qp qp[2];
-- 
2.1.4



[PATCH 09/10] ib_srpt: Remove ib_get_dma_mr calls

2015-07-22 Thread Jason Gunthorpe
The pd now has a local_dma_lkey member which completely replaces
ib_get_dma_mr, use it instead.

Signed-off-by: Jason Gunthorpe 
---
 drivers/infiniband/ulp/srpt/ib_srpt.c | 15 ---
 drivers/infiniband/ulp/srpt/ib_srpt.h |  1 -
 2 files changed, 4 insertions(+), 12 deletions(-)

diff --git a/drivers/infiniband/ulp/srpt/ib_srpt.c 
b/drivers/infiniband/ulp/srpt/ib_srpt.c
index 82897ca17f32..07da3817d92d 100644
--- a/drivers/infiniband/ulp/srpt/ib_srpt.c
+++ b/drivers/infiniband/ulp/srpt/ib_srpt.c
@@ -783,7 +783,7 @@ static int srpt_post_recv(struct srpt_device *sdev,
 
list.addr = ioctx->ioctx.dma;
list.length = srp_max_req_size;
-   list.lkey = sdev->mr->lkey;
+   list.lkey = sdev->pd->local_dma_lkey;
 
wr.next = NULL;
wr.sg_list = &list;
@@ -818,7 +818,7 @@ static int srpt_post_send(struct srpt_rdma_ch *ch,
 
list.addr = ioctx->ioctx.dma;
list.length = len;
-   list.lkey = sdev->mr->lkey;
+   list.lkey = sdev->pd->local_dma_lkey;
 
wr.next = NULL;
wr.wr_id = encode_wr_id(SRPT_SEND, ioctx->ioctx.index);
@@ -1206,7 +1206,7 @@ static int srpt_map_sg_to_ib_sge(struct srpt_rdma_ch *ch,
 
while (rsize > 0 && tsize > 0) {
sge->addr = dma_addr;
-   sge->lkey = ch->sport->sdev->mr->lkey;
+   sge->lkey = ch->sport->sdev->pd->local_dma_lkey;
 
if (rsize >= dma_len) {
sge->length =
@@ -3212,10 +3212,6 @@ static void srpt_add_one(struct ib_device *device)
if (IS_ERR(sdev->pd))
goto free_dev;
 
-   sdev->mr = ib_get_dma_mr(sdev->pd, IB_ACCESS_LOCAL_WRITE);
-   if (IS_ERR(sdev->mr))
-   goto err_pd;
-
sdev->srq_size = min(srpt_srq_size, sdev->dev_attr.max_srq_wr);
 
srq_attr.event_handler = srpt_srq_event;
@@ -3227,7 +3223,7 @@ static void srpt_add_one(struct ib_device *device)
 
sdev->srq = ib_create_srq(sdev->pd, &srq_attr);
if (IS_ERR(sdev->srq))
-   goto err_mr;
+   goto err_pd;
 
pr_debug("%s: create SRQ #wr= %d max_allow=%d dev= %s\n",
 __func__, sdev->srq_size, sdev->dev_attr.max_srq_wr,
@@ -3312,8 +3308,6 @@ err_cm:
ib_destroy_cm_id(sdev->cm_id);
 err_srq:
ib_destroy_srq(sdev->srq);
-err_mr:
-   ib_dereg_mr(sdev->mr);
 err_pd:
ib_dealloc_pd(sdev->pd);
 free_dev:
@@ -3359,7 +3353,6 @@ static void srpt_remove_one(struct ib_device *device)
srpt_release_sdev(sdev);
 
ib_destroy_srq(sdev->srq);
-   ib_dereg_mr(sdev->mr);
ib_dealloc_pd(sdev->pd);
 
srpt_free_ioctx_ring((struct srpt_ioctx **)sdev->ioctx_ring, sdev,
diff --git a/drivers/infiniband/ulp/srpt/ib_srpt.h 
b/drivers/infiniband/ulp/srpt/ib_srpt.h
index 21f8df67522a..5faad8acd789 100644
--- a/drivers/infiniband/ulp/srpt/ib_srpt.h
+++ b/drivers/infiniband/ulp/srpt/ib_srpt.h
@@ -393,7 +393,6 @@ struct srpt_port {
 struct srpt_device {
struct ib_device*device;
struct ib_pd*pd;
-   struct ib_mr*mr;
struct ib_srq   *srq;
struct ib_cm_id *cm_id;
struct ib_device_attr   dev_attr;
-- 
2.1.4



[PATCH 00/10] IB: Replace safe uses for ib_get_dma_mr with pd->local_dma_lkey

2015-07-22 Thread Jason Gunthorpe
This series moves handling of the safe all-physical MR:

  ib_get_dma_mr(pd,IB_ACCESS_LOCAL_WRITE);

into ib_alloc_pd, and in the process makes the global local_dma_lkey
functionality broadly available to all ULPs.

The remaining users of ib_get_dma_mr are all unsafe:
 drivers/infiniband/ulp/iser/iser_verbs.c:
device->mr = ib_get_dma_mr(device->pd, IB_ACCESS_LOCAL_WRITE |
   IB_ACCESS_REMOTE_WRITE |
   IB_ACCESS_REMOTE_READ);

 drivers/infiniband/ulp/srp/ib_srp.c:
srp_dev->mr = ib_get_dma_mr(srp_dev->pd,
IB_ACCESS_LOCAL_WRITE |
IB_ACCESS_REMOTE_READ |
IB_ACCESS_REMOTE_WRITE);

 drivers/staging/lustre/lnet/klnds/o2iblnd/o2iblnd.c:
int acflags = IB_ACCESS_LOCAL_WRITE | IB_ACCESS_REMOTE_WRITE;
mr = ib_get_dma_mr(hdev->ibh_pd, acflags);

 net/rds/iw.c:
rds_iwdev->mr = ib_get_dma_mr(rds_iwdev->pd,
IB_ACCESS_REMOTE_READ |
IB_ACCESS_REMOTE_WRITE |
IB_ACCESS_LOCAL_WRITE);

 net/sunrpc/xprtrdma/svc_rdma_transport.c:
if (rdma_protocol_iwarp(newxprt->sc_cm_id->device,
newxprt->sc_cm_id->port_num) &&
!(newxprt->sc_dev_caps & SVCRDMA_DEVCAP_FAST_REG))
dma_mr_acc |= IB_ACCESS_REMOTE_WRITE;
newxprt->sc_phys_mr =
ib_get_dma_mr(newxprt->sc_pd, dma_mr_acc);

 net/sunrpc/xprtrdma/verbs.c:
case RPCRDMA_ALLPHYSICAL:
ia->ri_ops = &rpcrdma_physical_memreg_ops;
mem_priv = IB_ACCESS_LOCAL_WRITE |
IB_ACCESS_REMOTE_WRITE |
IB_ACCESS_REMOTE_READ;
ia->ri_bind_mem = ib_get_dma_mr(ia->ri_pd, mem_priv);

Calling ib_get_dma_mr with IB_ACCESS_REMOTE_* flags is considered to be a
serious security problem and should not be done without the user directly
opting in to an off-by-default scheme. The call allows the peer on the QP
unrestricted access to local physical memory if they can guess the rkey value.

A future series will cause the kernel to be tainted by the above call sites to
promote migrating away from this.

To Migrate:
 * If ib_get_dma_mr was being used to get an lkey then use
   local_dma_lkey instead (I believe this series gets all of those cases;
   a minimal before/after sketch follows below).

   If the lkey is being used for RDMA_READ, and iWarp support is required, then
   iWarp must be detected and FRMR must be used to create a limited temporary
   MR just for the RDMA_READ. (eg NFS, RDS)

 * If ib_get_dma_mr was being used to get an rkey then use FRMR to create
   limited temporary MRs (eg SRP, iSER, etc)
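
For the plain lkey case, the before/after in the patches below amounts to the
following sketch; "priv" stands in for whatever per-device structure a ULP
already keeps, and the types come from <rdma/ib_verbs.h>:

#include <rdma/ib_verbs.h>

/* Before this series: the ULP creates an all-physical MR just to get an
 * lkey and has to ib_dereg_mr() it on every teardown/error path. */
static void fill_sge_old(struct ib_sge *sge, struct ib_mr *dma_mr,
			 u64 addr, u32 len)
{
	sge->addr   = addr;
	sge->length = len;
	sge->lkey   = dma_mr->lkey;	/* dma_mr from ib_get_dma_mr() */
}

/* After this series: ib_alloc_pd() guarantees a usable lkey, so the MR
 * and all of its setup/teardown code simply disappear. */
static void fill_sge_new(struct ib_sge *sge, struct ib_pd *pd,
			 u64 addr, u32 len)
{
	sge->addr   = addr;
	sge->length = len;
	sge->lkey   = pd->local_dma_lkey;
}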

Doug, this needs to be sequenced after the mlx5 patch:
 https://patchwork.kernel.org/patch/6829351/

All patches are compile tested. I've done basic testing up to and including
the IPoIB patch, the rest required specialized setups I don't have access to,
but are fairly straightforward. Feel free to take whatever subset of this gets
tested/ack'd before the next cycle.

Sagi, IB/iser should have special attention paid, as it is less clear to me if
it got everything.

Jason Gunthorpe (10):
  IB/core: Guarantee that a local_dma_lkey is available
  IB/mad: Remove ib_get_dma_mr calls
  IB/ipoib: Remove ib_get_dma_mr calls
  IB/mlx4: Remove ib_get_dma_mr calls
  IB/mlx5: Remove ib_get_dma_mr calls
  IB/iser: Use pd->local_dma_lkey
  iser-target: Remove ib_get_dma_mr calls
  IB/srp: Use pd->local_dma_lkey
  ib_srpt: Remove ib_get_dma_mr calls
  net/9p: Remove ib_get_dma_mr calls

 drivers/infiniband/core/mad.c| 26 +++---
 drivers/infiniband/core/mad_priv.h   |  1 -
 drivers/infiniband/core/verbs.c  | 40 
 drivers/infiniband/hw/mlx4/mad.c | 23 +++-
 drivers/infiniband/hw/mlx4/mlx4_ib.h |  1 -
 drivers/infiniband/hw/mlx5/main.c| 13 -
 drivers/infiniband/hw/mlx5/mlx5_ib.h |  1 -
 drivers/infiniband/hw/mlx5/mr.c  |  5 ++--
 drivers/infiniband/ulp/ipoib/ipoib.h |  1 -
 drivers/infiniband/ulp/ipoib/ipoib_cm.c  |  2 +-
 drivers/infiniband/ulp/ipoib/ipoib_verbs.c   | 18 +++--
 drivers/infiniband/ulp/iser/iscsi_iser.c |  2 +-
 drivers/infiniband/ulp/iser/iser_initiator.c |  8 +++---
 drivers/infiniband/ulp/iser/iser_memory.c|  2 +-
 drivers/infiniband/ulp/iser/iser_verbs.c |  2 +-
 drivers/infiniband/ulp/isert/ib_isert.c  | 33 ---
 drivers/infiniband/ulp/isert/ib_isert.h  |  1 -
 drivers/infiniband/ulp/srp/ib_srp.c  |  2 +-
 drivers/infiniband/ulp/srpt/ib_srpt.c| 15 +++
 drivers/infiniband/ulp/srpt/ib_srpt.h|  1 -
 include/rdma/ib_mad.h  

[PATCH 07/10] iser-target: Remove ib_get_dma_mr calls

2015-07-22 Thread Jason Gunthorpe
The pd now has a local_dma_lkey member which completely replaces
ib_get_dma_mr, use it instead.

Signed-off-by: Jason Gunthorpe 
---
 drivers/infiniband/ulp/isert/ib_isert.c | 33 +++--
 drivers/infiniband/ulp/isert/ib_isert.h |  1 -
 2 files changed, 11 insertions(+), 23 deletions(-)

diff --git a/drivers/infiniband/ulp/isert/ib_isert.c 
b/drivers/infiniband/ulp/isert/ib_isert.c
index 771700963127..00d75d977131 100644
--- a/drivers/infiniband/ulp/isert/ib_isert.c
+++ b/drivers/infiniband/ulp/isert/ib_isert.c
@@ -235,7 +235,7 @@ isert_alloc_rx_descriptors(struct isert_conn *isert_conn)
rx_sg = &rx_desc->rx_sg;
rx_sg->addr = rx_desc->dma_addr;
rx_sg->length = ISER_RX_PAYLOAD_SIZE;
-   rx_sg->lkey = device->mr->lkey;
+   rx_sg->lkey = device->pd->local_dma_lkey;
}
 
isert_conn->rx_desc_head = 0;
@@ -385,22 +385,12 @@ isert_create_device_ib_res(struct isert_device *device)
goto out_cq;
}
 
-   device->mr = ib_get_dma_mr(device->pd, IB_ACCESS_LOCAL_WRITE);
-   if (IS_ERR(device->mr)) {
-   ret = PTR_ERR(device->mr);
-   isert_err("failed to create dma mr, device %p, ret=%d\n",
- device, ret);
-   goto out_mr;
-   }
-
/* Check signature cap */
device->pi_capable = dev_attr->device_cap_flags &
 IB_DEVICE_SIGNATURE_HANDOVER ? true : false;
 
return 0;
 
-out_mr:
-   ib_dealloc_pd(device->pd);
 out_cq:
isert_free_comps(device);
return ret;
@@ -411,7 +401,6 @@ isert_free_device_ib_res(struct isert_device *device)
 {
isert_info("device %p\n", device);
 
-   ib_dereg_mr(device->mr);
ib_dealloc_pd(device->pd);
isert_free_comps(device);
 }
@@ -1086,8 +1075,8 @@ isert_create_send_desc(struct isert_conn *isert_conn,
tx_desc->num_sge = 1;
tx_desc->isert_cmd = isert_cmd;
 
-   if (tx_desc->tx_sg[0].lkey != device->mr->lkey) {
-   tx_desc->tx_sg[0].lkey = device->mr->lkey;
+   if (tx_desc->tx_sg[0].lkey != device->pd->local_dma_lkey) {
+   tx_desc->tx_sg[0].lkey = device->pd->local_dma_lkey;
isert_dbg("tx_desc %p lkey mismatch, fixing\n", tx_desc);
}
 }
@@ -1110,7 +1099,7 @@ isert_init_tx_hdrs(struct isert_conn *isert_conn,
tx_desc->dma_addr = dma_addr;
tx_desc->tx_sg[0].addr  = tx_desc->dma_addr;
tx_desc->tx_sg[0].length = ISER_HEADERS_LEN;
-   tx_desc->tx_sg[0].lkey = device->mr->lkey;
+   tx_desc->tx_sg[0].lkey = device->pd->local_dma_lkey;
 
isert_dbg("Setup tx_sg[0].addr: 0x%llx length: %u lkey: 0x%x\n",
  tx_desc->tx_sg[0].addr, tx_desc->tx_sg[0].length,
@@ -1143,7 +1132,7 @@ isert_rdma_post_recvl(struct isert_conn *isert_conn)
memset(&sge, 0, sizeof(struct ib_sge));
sge.addr = isert_conn->login_req_dma;
sge.length = ISER_RX_LOGIN_SIZE;
-   sge.lkey = isert_conn->device->mr->lkey;
+   sge.lkey = isert_conn->device->pd->local_dma_lkey;
 
isert_dbg("Setup sge: addr: %llx length: %d 0x%08x\n",
sge.addr, sge.length, sge.lkey);
@@ -1193,7 +1182,7 @@ isert_put_login_tx(struct iscsi_conn *conn, struct 
iscsi_login *login,
 
tx_dsg->addr= isert_conn->login_rsp_dma;
tx_dsg->length  = length;
-   tx_dsg->lkey= isert_conn->device->mr->lkey;
+   tx_dsg->lkey= isert_conn->device->pd->local_dma_lkey;
tx_desc->num_sge = 2;
}
if (!login->login_failed) {
@@ -2210,7 +2199,7 @@ isert_put_response(struct iscsi_conn *conn, struct 
iscsi_cmd *cmd)
isert_cmd->pdu_buf_len = pdu_len;
tx_dsg->addr= isert_cmd->pdu_buf_dma;
tx_dsg->length  = pdu_len;
-   tx_dsg->lkey= device->mr->lkey;
+   tx_dsg->lkey= device->pd->local_dma_lkey;
isert_cmd->tx_desc.num_sge = 2;
}
 
@@ -2338,7 +2327,7 @@ isert_put_reject(struct iscsi_cmd *cmd, struct iscsi_conn 
*conn)
isert_cmd->pdu_buf_len = ISCSI_HDR_LEN;
tx_dsg->addr= isert_cmd->pdu_buf_dma;
tx_dsg->length  = ISCSI_HDR_LEN;
-   tx_dsg->lkey= device->mr->lkey;
+   tx_dsg->lkey= device->pd->local_dma_lkey;
isert_cmd->tx_desc.num_sge = 2;
 
isert_init_send_wr(isert_conn, isert_cmd, send_wr);
@@ -2379,7 +2368,7 @@ isert_put_text_rsp(struct iscsi_cmd *cmd, struct 
iscsi_conn *conn)
isert_cmd->pdu_buf_len = txt_rsp_len;
tx_dsg->addr= isert_cmd->pdu_buf_dma;
tx_dsg->length  = txt_rsp_len;
-   tx_dsg->lkey= device->mr->lkey;
+   tx_dsg->lkey= device->pd->local_dma_lkey;
isert_cmd->tx_desc.num_sge = 2;
}
isert_init_send_wr(isert_conn, isert_cmd, send_wr);

[PATCH 05/10] IB/mlx5: Remove ib_get_dma_mr calls

2015-07-22 Thread Jason Gunthorpe
The pd now has a local_dma_lkey member which completely replaces
ib_get_dma_mr, use it instead.

Signed-off-by: Jason Gunthorpe 
---
 drivers/infiniband/hw/mlx5/main.c| 13 -
 drivers/infiniband/hw/mlx5/mlx5_ib.h |  1 -
 drivers/infiniband/hw/mlx5/mr.c  |  5 ++---
 3 files changed, 2 insertions(+), 17 deletions(-)

diff --git a/drivers/infiniband/hw/mlx5/main.c 
b/drivers/infiniband/hw/mlx5/main.c
index 085c24b4b603..72ab0bdb415a 100644
--- a/drivers/infiniband/hw/mlx5/main.c
+++ b/drivers/infiniband/hw/mlx5/main.c
@@ -1121,7 +1121,6 @@ static void destroy_umrc_res(struct mlx5_ib_dev *dev)
 
mlx5_ib_destroy_qp(dev->umrc.qp);
ib_destroy_cq(dev->umrc.cq);
-   ib_dereg_mr(dev->umrc.mr);
ib_dealloc_pd(dev->umrc.pd);
 }
 
@@ -1136,7 +1135,6 @@ static int create_umr_res(struct mlx5_ib_dev *dev)
struct ib_pd *pd;
struct ib_cq *cq;
struct ib_qp *qp;
-   struct ib_mr *mr;
struct ib_cq_init_attr cq_attr = {};
int ret;
 
@@ -1154,13 +1152,6 @@ static int create_umr_res(struct mlx5_ib_dev *dev)
goto error_0;
}
 
-   mr = ib_get_dma_mr(pd,  IB_ACCESS_LOCAL_WRITE);
-   if (IS_ERR(mr)) {
-   mlx5_ib_dbg(dev, "Couldn't create DMA MR for sync UMR QP\n");
-   ret = PTR_ERR(mr);
-   goto error_1;
-   }
-
cq_attr.cqe = 128;
cq = ib_create_cq(&dev->ib_dev, mlx5_umr_cq_handler, NULL, NULL,
  &cq_attr);
@@ -1218,7 +1209,6 @@ static int create_umr_res(struct mlx5_ib_dev *dev)
 
dev->umrc.qp = qp;
dev->umrc.cq = cq;
-   dev->umrc.mr = mr;
dev->umrc.pd = pd;
 
sema_init(&dev->umrc.sem, MAX_UMR_WR);
@@ -1240,9 +1230,6 @@ error_3:
ib_destroy_cq(cq);
 
 error_2:
-   ib_dereg_mr(mr);
-
-error_1:
ib_dealloc_pd(pd);
 
 error_0:
diff --git a/drivers/infiniband/hw/mlx5/mlx5_ib.h 
b/drivers/infiniband/hw/mlx5/mlx5_ib.h
index 7cae09836481..446d80427773 100644
--- a/drivers/infiniband/hw/mlx5/mlx5_ib.h
+++ b/drivers/infiniband/hw/mlx5/mlx5_ib.h
@@ -349,7 +349,6 @@ struct umr_common {
struct ib_pd*pd;
struct ib_cq*cq;
struct ib_qp*qp;
-   struct ib_mr*mr;
/* control access to UMR QP
 */
struct semaphoresem;
diff --git a/drivers/infiniband/hw/mlx5/mr.c b/drivers/infiniband/hw/mlx5/mr.c
index bc9a0de897cb..4c92ca8a4181 100644
--- a/drivers/infiniband/hw/mlx5/mr.c
+++ b/drivers/infiniband/hw/mlx5/mr.c
@@ -690,12 +690,11 @@ static void prep_umr_reg_wqe(struct ib_pd *pd, struct 
ib_send_wr *wr,
 int access_flags)
 {
struct mlx5_ib_dev *dev = to_mdev(pd->device);
-   struct ib_mr *mr = dev->umrc.mr;
struct mlx5_umr_wr *umrwr = (struct mlx5_umr_wr *)&wr->wr.fast_reg;
 
sg->addr = dma;
sg->length = ALIGN(sizeof(u64) * n, 64);
-   sg->lkey = mr->lkey;
+   sg->lkey = dev->umrc.pd->local_dma_lkey;
 
wr->next = NULL;
wr->send_flags = 0;
@@ -926,7 +925,7 @@ int mlx5_ib_update_mtt(struct mlx5_ib_mr *mr, u64 
start_page_index, int npages,
sg.addr = dma;
sg.length = ALIGN(npages * sizeof(u64),
MLX5_UMR_MTT_ALIGNMENT);
-   sg.lkey = dev->umrc.mr->lkey;
+   sg.lkey = dev->umrc.pd->local_dma_lkey;
 
wr.send_flags = MLX5_IB_SEND_UMR_FAIL_IF_FREE |
MLX5_IB_SEND_UMR_UPDATE_MTT;
-- 
2.1.4



[PATCH 03/10] IB/ipoib: Remove ib_get_dma_mr calls

2015-07-22 Thread Jason Gunthorpe
The pd now has a local_dma_lkey member which completely replaces
ib_get_dma_mr, use it instead.

Signed-off-by: Jason Gunthorpe 
---
 drivers/infiniband/ulp/ipoib/ipoib.h   |  1 -
 drivers/infiniband/ulp/ipoib/ipoib_cm.c|  2 +-
 drivers/infiniband/ulp/ipoib/ipoib_verbs.c | 18 +++---
 3 files changed, 4 insertions(+), 17 deletions(-)

diff --git a/drivers/infiniband/ulp/ipoib/ipoib.h 
b/drivers/infiniband/ulp/ipoib/ipoib.h
index bd94b0a6e9e5..731e540a37df 100644
--- a/drivers/infiniband/ulp/ipoib/ipoib.h
+++ b/drivers/infiniband/ulp/ipoib/ipoib.h
@@ -342,7 +342,6 @@ struct ipoib_dev_priv {
u16   pkey;
u16   pkey_index;
struct ib_pd *pd;
-   struct ib_mr *mr;
struct ib_cq *recv_cq;
struct ib_cq *send_cq;
struct ib_qp *qp;
diff --git a/drivers/infiniband/ulp/ipoib/ipoib_cm.c 
b/drivers/infiniband/ulp/ipoib/ipoib_cm.c
index cf32a778e7d0..d1c217403c6d 100644
--- a/drivers/infiniband/ulp/ipoib/ipoib_cm.c
+++ b/drivers/infiniband/ulp/ipoib/ipoib_cm.c
@@ -332,7 +332,7 @@ static void ipoib_cm_init_rx_wr(struct net_device *dev,
int i;
 
for (i = 0; i < priv->cm.num_frags; ++i)
-   sge[i].lkey = priv->mr->lkey;
+   sge[i].lkey = priv->pd->local_dma_lkey;
 
sge[0].length = IPOIB_CM_HEAD_SIZE;
for (i = 1; i < priv->cm.num_frags; ++i)
diff --git a/drivers/infiniband/ulp/ipoib/ipoib_verbs.c 
b/drivers/infiniband/ulp/ipoib/ipoib_verbs.c
index 851c8219d501..8c451983d8a5 100644
--- a/drivers/infiniband/ulp/ipoib/ipoib_verbs.c
+++ b/drivers/infiniband/ulp/ipoib/ipoib_verbs.c
@@ -152,12 +152,6 @@ int ipoib_transport_dev_init(struct net_device *dev, 
struct ib_device *ca)
return -ENODEV;
}
 
-   priv->mr = ib_get_dma_mr(priv->pd, IB_ACCESS_LOCAL_WRITE);
-   if (IS_ERR(priv->mr)) {
-   printk(KERN_WARNING "%s: ib_get_dma_mr failed\n", ca->name);
-   goto out_free_pd;
-   }
-
/*
 * the various IPoIB tasks assume they will never race against
 * themselves, so always use a single thread workqueue
@@ -165,7 +159,7 @@ int ipoib_transport_dev_init(struct net_device *dev, struct 
ib_device *ca)
priv->wq = create_singlethread_workqueue("ipoib_wq");
if (!priv->wq) {
printk(KERN_WARNING "ipoib: failed to allocate device WQ\n");
-   goto out_free_mr;
+   goto out_free_pd;
}
 
size = ipoib_recvq_size + 1;
@@ -225,13 +219,13 @@ int ipoib_transport_dev_init(struct net_device *dev, 
struct ib_device *ca)
priv->dev->dev_addr[3] = (priv->qp->qp_num  ) & 0xff;
 
for (i = 0; i < MAX_SKB_FRAGS + 1; ++i)
-   priv->tx_sge[i].lkey = priv->mr->lkey;
+   priv->tx_sge[i].lkey = priv->pd->local_dma_lkey;
 
priv->tx_wr.opcode  = IB_WR_SEND;
priv->tx_wr.sg_list = priv->tx_sge;
priv->tx_wr.send_flags  = IB_SEND_SIGNALED;
 
-   priv->rx_sge[0].lkey = priv->mr->lkey;
+   priv->rx_sge[0].lkey = priv->pd->local_dma_lkey;
 
priv->rx_sge[0].length = IPOIB_UD_BUF_SIZE(priv->max_ib_mtu);
priv->rx_wr.num_sge = 1;
@@ -254,9 +248,6 @@ out_free_wq:
destroy_workqueue(priv->wq);
priv->wq = NULL;
 
-out_free_mr:
-   ib_dereg_mr(priv->mr);
-
 out_free_pd:
ib_dealloc_pd(priv->pd);
 
@@ -289,9 +280,6 @@ void ipoib_transport_dev_cleanup(struct net_device *dev)
priv->wq = NULL;
}
 
-   if (ib_dereg_mr(priv->mr))
-   ipoib_warn(priv, "ib_dereg_mr failed\n");
-
if (ib_dealloc_pd(priv->pd))
ipoib_warn(priv, "ib_dealloc_pd failed\n");
 
-- 
2.1.4



[PATCH 08/10] IB/srp: Use pd->local_dma_lkey

2015-07-22 Thread Jason Gunthorpe
Replace all lkeys with pd->local_dma_lkey. This driver does not support
iWarp, so this is safe.

The insecure use of ib_get_dma_mr is thus isolated to an rkey, and will
have to be fixed separately.

Signed-off-by: Jason Gunthorpe 
---
 drivers/infiniband/ulp/srp/ib_srp.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/infiniband/ulp/srp/ib_srp.c 
b/drivers/infiniband/ulp/srp/ib_srp.c
index 267dc4f75502..fb9fed0fac28 100644
--- a/drivers/infiniband/ulp/srp/ib_srp.c
+++ b/drivers/infiniband/ulp/srp/ib_srp.c
@@ -3149,7 +3149,7 @@ static ssize_t srp_create_target(struct device *dev,
target->io_class= SRP_REV16A_IB_IO_CLASS;
target->scsi_host   = target_host;
target->srp_host= host;
-   target->lkey= host->srp_dev->mr->lkey;
+   target->lkey= host->srp_dev->pd->local_dma_lkey;
target->rkey= host->srp_dev->mr->rkey;
target->cmd_sg_cnt  = cmd_sg_entries;
target->sg_tablesize= indirect_sg_entries ? : cmd_sg_entries;
-- 
2.1.4



Re: [PATCH v8 4/4] IB/sa: Route SA pathrecord query through netlink

2015-07-22 Thread Jason Gunthorpe
On Thu, Jul 09, 2015 at 01:34:28PM -0400, kaike@intel.com wrote:
> From: Kaike Wan 
> 
> This patch routes a SA pathrecord query to netlink first and processes the
> response appropriately. If a failure is returned, the request will be sent
> through IB. The decision whether to route the request to netlink first is
> determined by the presence of a listener for the local service netlink
> multicast group. If the user-space local service netlink multicast group
> listener is not present, the request will be sent through IB, just like
> what is currently being done.

So, I tested it again, and the UAPI looks pretty reasonable now.

The netlink validation still needs to be fixed.

> Here is where we might have a problem: nla_parse() allows only one
> entry for each attribute type in the returned array tb[]. If we have
> multiple (say 6) pathrecords returned, each having different flags
> (for different path use), a later one will replace an earlier one in
> tb[].

The parse is OK; continue to use the for loop. nla_parse does more
than just extract into tb: it validates that all the sizes are correct.
Ignore tb.
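
In other words, something along these lines (sketch only; the LS_NLA_* names
come from this patch series, and the body of the loop is elided):

static int ib_nl_walk_path_records(const struct nlmsghdr *nlh)
{
	struct nlattr *tb[LS_NLA_TYPE_MAX + 1];
	const struct nlattr *head = nlmsg_data(nlh);
	const struct nlattr *curr;
	int len = nlmsg_len(nlh);
	int rem;

	/* nla_parse() is kept purely for its size/type validation; tb[]
	 * is ignored because it only remembers the last attribute of a
	 * given type and a reply may carry several path records. */
	if (nla_parse(tb, LS_NLA_TYPE_MAX, head, len, NULL))
		return -EINVAL;

	nla_for_each_attr(curr, head, len, rem) {
		if (nla_type(curr) != LS_NLA_TYPE_PATH_RECORD)
			continue;
		/* process nla_data(curr) here */
	}
	return 0;
}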

My testing shows it seems to get into some kind of fast repeating
query loop:

[ 4904.969188] Return answer 0
[ 4904.969483] Return answer 0
[ 4904.969703] Return answer 0
[ 4904.969824] Return answer 0
[ 4904.969943] Return answer 0
[ 4904.970058] Return answer 0
[ 4904.970175] Return answer 0
[ 4904.970289] Return answer 0

diff --git a/drivers/infiniband/core/sa_query.c 
b/drivers/infiniband/core/sa_query.c
index 63ea36d05072..e6b0181e7076 100644
--- a/drivers/infiniband/core/sa_query.c
+++ b/drivers/infiniband/core/sa_query.c
@@ -640,6 +640,8 @@ static void ib_nl_process_good_resolve_rsp(struct 
ib_sa_query *query,
}
}
}
+
+   printk("Return answer %u\n",status);
query->callback(query, status, mad);
}

You'll have to figure that out. I'm just doing ping X while running
a responder on the netlink socket, v7 didn't do this, IIRC.

I think it has something to do with the timer as the first request
works fine, then a pause, then a storm.

Actually, looks like nothing removes nl queries from the timeout loop
when they successfully complete. (ie this series doesn't even have a
hope of working properly)

Please actually test this patch completely and thoroughly before
sending another version. I'm broadly happy with this, so get your
whole team to look it over again.

> + if (query->callback) {
> + head = (const struct nlattr *) nlmsg_data(nlh);
> + len = nlmsg_len(nlh);
> + switch (query->path_use) {
> + case LS_RESOLVE_PATH_USE_UNIDIRECTIONAL:
> + mask = IB_PATH_PRIMARY | IB_PATH_OUTBOUND;
> + break;
> +
> + case LS_RESOLVE_PATH_USE_ALL:
> + case LS_RESOLVE_PATH_USE_GMP:
> + default:
> + mask = IB_PATH_PRIMARY | IB_PATH_GMP |
> + IB_PATH_BIDIRECTIONAL;
> + break;
> + }
> + nla_for_each_attr(curr, head, len, rem) {
> + if (curr->nla_type == LS_NLA_TYPE_PATH_RECORD) {
> + rec = nla_data(curr);
> + /*
> +  * Get the first one. In the future, we may
> +  * need to get up to 6 pathrecords.
> +  */
> + if (rec->flags & mask) {

(rec->flags & mask) == mask

All requested bits must be set, not just any requested bit.

An IB_PATH_PRIMARY | IB_PATH_OUTBOUND result does not satisfy the
requirement for LS_RESOLVE_PATH_USE_GMP.
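
A tiny illustration of the difference (flags/mask as in the hunk above):

#include <linux/types.h>

/* Wrong: accepts a record that shares *any* bit with the mask, so a
 * PRIMARY|OUTBOUND record would pass a GMP query. */
static bool path_use_match_any(u32 flags, u32 mask)
{
	return (flags & mask) != 0;
}

/* Right: accepts a record only when *all* requested bits are present. */
static bool path_use_match_all(u32 flags, u32 mask)
{
	return (flags & mask) == mask;
}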

The goal is to locate the one path of many that satisfies the
requirement. Future kernels will direct 6 path answers

> +static int ib_nl_handle_set_timeout(struct sk_buff *skb,
> + struct netlink_callback *cb)
> +{
> + const struct nlmsghdr *nlh = (struct nlmsghdr *)cb->nlh;
> + int timeout, delta, abs_delta;
> + const struct nlattr *attr;
> + unsigned long flags;
> + struct ib_sa_query *query;
> + long delay = 0;
> + struct nlattr *tb[LS_NLA_TYPE_MAX + 1];
> + int ret;
> +
> + ret = nla_parse(tb, LS_NLA_TYPE_MAX, nlmsg_data(nlh), nlmsg_len(nlh),
> + NULL);
> + attr = (const struct nlattr *)tb[LS_NLA_TYPE_TIMEOUT];
> + if (ret || !attr || nla_len(attr) != sizeof(u32))
> + goto settimeout_out;

This probably doesn't need the nested attribute, but I'm indifferent.

Jason


Re: [PATCH v8 1/4] IB/netlink: Add defines for local service requests through netlink

2015-07-22 Thread Jason Gunthorpe
On Thu, Jul 09, 2015 at 01:34:25PM -0400, kaike@intel.com wrote:
> + LS_NLA_TYPE_STATUS = 0,

This is never used, drop it.

Jason


[PATCH] IB/ipoib: Fix CONFIG_INFINIBAND_IPOIB_CM

2015-07-22 Thread Jason Gunthorpe
If CONFIG_INFINIBAND_IPOIB_CM is turned off then ipoib_cm_dev_init
unconditionally returns ENOSYS, and the newly added error handling in
commit 0b3957 prevents ipoib from coming up at all:

kernel: mlx4_0: ipoib_transport_dev_init failed
kernel: mlx4_0: failed to initialize port 1 (ret = -12)

Fixes: 0b39578bcde4 (IB/ipoib: Use dedicated workqueues per interface)
Signed-off-by: Jason Gunthorpe 
---
 drivers/infiniband/ulp/ipoib/ipoib_verbs.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/infiniband/ulp/ipoib/ipoib_verbs.c 
b/drivers/infiniband/ulp/ipoib/ipoib_verbs.c
index 9e6ee82a8fd7..851c8219d501 100644
--- a/drivers/infiniband/ulp/ipoib/ipoib_verbs.c
+++ b/drivers/infiniband/ulp/ipoib/ipoib_verbs.c
@@ -177,7 +177,8 @@ int ipoib_transport_dev_init(struct net_device *dev, struct 
ib_device *ca)
else
size += ipoib_recvq_size * ipoib_max_conn_qp;
} else
-   goto out_free_wq;
+   if (ret != -ENOSYS)
+   goto out_free_wq;
 
cq_attr.cqe = size;
priv->recv_cq = ib_create_cq(priv->ca, ipoib_ib_completion, NULL,
-- 
2.1.4



Re: [PATCH WIP 37/43] xprtrdma: Port to new memory registration API

2015-07-22 Thread Steve Wise


On 7/22/2015 1:55 AM, Sagi Grimberg wrote:

Signed-off-by: Sagi Grimberg 
---
  net/sunrpc/xprtrdma/frwr_ops.c  | 80 ++---
  net/sunrpc/xprtrdma/xprt_rdma.h |  4 ++-
  2 files changed, 47 insertions(+), 37 deletions(-)


Did you intend to change svcrdma as well?


diff --git a/net/sunrpc/xprtrdma/frwr_ops.c b/net/sunrpc/xprtrdma/frwr_ops.c
index 517efed..e28246b 100644
--- a/net/sunrpc/xprtrdma/frwr_ops.c
+++ b/net/sunrpc/xprtrdma/frwr_ops.c
@@ -151,9 +151,13 @@ __frwr_init(struct rpcrdma_mw *r, struct ib_pd *pd, struct 
ib_device *device,
f->fr_mr = ib_alloc_mr(pd, IB_MR_TYPE_FAST_REG, depth, 0);
if (IS_ERR(f->fr_mr))
goto out_mr_err;
-   f->fr_pgl = ib_alloc_fast_reg_page_list(device, depth);
-   if (IS_ERR(f->fr_pgl))
+
+   f->sg = kcalloc(sizeof(*f->sg), depth, GFP_KERNEL);
+   if (IS_ERR(f->sg))
goto out_list_err;
+
+   sg_init_table(f->sg, depth);
+
return 0;
  
  out_mr_err:

@@ -163,7 +167,7 @@ out_mr_err:
return rc;
  
  out_list_err:

-   rc = PTR_ERR(f->fr_pgl);
+   rc = -ENOMEM;
dprintk("RPC:   %s: ib_alloc_fast_reg_page_list status %i\n",
__func__, rc);
ib_dereg_mr(f->fr_mr);
@@ -179,7 +183,7 @@ __frwr_release(struct rpcrdma_mw *r)
if (rc)
dprintk("RPC:   %s: ib_dereg_mr status %i\n",
__func__, rc);
-   ib_free_fast_reg_page_list(r->r.frmr.fr_pgl);
+   kfree(r->r.frmr.sg);
  }
  
  static int

@@ -320,10 +324,7 @@ frwr_op_map(struct rpcrdma_xprt *r_xprt, struct 
rpcrdma_mr_seg *seg,
struct ib_send_wr fastreg_wr, *bad_wr;
u8 key;
int len, pageoff;
-   int i, rc;
-   int seg_len;
-   u64 pa;
-   int page_no;
+   int i, rc, access;
  
  	mw = seg1->rl_mw;

seg1->rl_mw = NULL;
@@ -344,39 +345,46 @@ frwr_op_map(struct rpcrdma_xprt *r_xprt, struct 
rpcrdma_mr_seg *seg,
if (nsegs > ia->ri_max_frmr_depth)
nsegs = ia->ri_max_frmr_depth;
  
-	for (page_no = i = 0; i < nsegs;) {

-   rpcrdma_map_one(device, seg, direction);
-   pa = seg->mr_dma;
-   for (seg_len = seg->mr_len; seg_len > 0; seg_len -= PAGE_SIZE) {
-   frmr->fr_pgl->page_list[page_no++] = pa;
-   pa += PAGE_SIZE;
-   }
+   for (i = 0; i < nsegs;) {
+   sg_set_page(&frmr->sg[i], seg->mr_page,
+   seg->mr_len, offset_in_page(seg->mr_offset));
len += seg->mr_len;
-   ++seg;
++i;
-   /* Check for holes */
+   ++seg;
+
+   /* Check for holes - needed?? */
if ((i < nsegs && offset_in_page(seg->mr_offset)) ||
offset_in_page((seg-1)->mr_offset + (seg-1)->mr_len))
break;
}
+
+   frmr->sg_nents = i;
+   frmr->dma_nents = ib_dma_map_sg(device, frmr->sg,
+   frmr->sg_nents, direction);
+   if (!frmr->dma_nents) {
+   pr_err("RPC:   %s: failed to dma map sg %p sg_nents %d\n",
+   __func__, frmr->sg, frmr->sg_nents);
+   return -ENOMEM;
+   }
+
dprintk("RPC:   %s: Using frmr %p to map %d segments (%d bytes)\n",
__func__, mw, i, len);
  
-	memset(&fastreg_wr, 0, sizeof(fastreg_wr));

-   fastreg_wr.wr_id = (unsigned long)(void *)mw;
-   fastreg_wr.opcode = IB_WR_FAST_REG_MR;
-   fastreg_wr.wr.fast_reg.iova_start = seg1->mr_dma + pageoff;
-   fastreg_wr.wr.fast_reg.page_list = frmr->fr_pgl;
-   fastreg_wr.wr.fast_reg.page_shift = PAGE_SHIFT;
-   fastreg_wr.wr.fast_reg.page_list_len = page_no;
-   fastreg_wr.wr.fast_reg.length = len;
-   fastreg_wr.wr.fast_reg.access_flags = writing ?
-   IB_ACCESS_REMOTE_WRITE | IB_ACCESS_LOCAL_WRITE :
-   IB_ACCESS_REMOTE_READ;
mr = frmr->fr_mr;
+   access = writing ? IB_ACCESS_REMOTE_WRITE | IB_ACCESS_LOCAL_WRITE :
+  IB_ACCESS_REMOTE_READ;
+   rc = ib_map_mr_sg(mr, frmr->sg, frmr->sg_nents, access);
+   if (rc) {
+   pr_err("RPC:   %s: failed to map mr %p rc %d\n",
+   __func__, frmr->fr_mr, rc);
+   return rc;
+   }
+
key = (u8)(mr->rkey & 0x00FF);
ib_update_fast_reg_key(mr, ++key);
-   fastreg_wr.wr.fast_reg.rkey = mr->rkey;
+
+   memset(&fastreg_wr, 0, sizeof(fastreg_wr));
+   ib_set_fastreg_wr(mr, mr->rkey, (uintptr_t)mw, false, &fastreg_wr);
  
  	DECR_CQCOUNT(&r_xprt->rx_ep);

rc = ib_post_send(ia->ri_id->qp, &fastreg_wr, &bad_wr);
@@ -385,15 +393,14 @@ frwr_op_map(struct rpcrdma_xprt *r_xprt, struct 
rpcrdma_mr_seg *seg,
  
  	seg1->rl_mw = mw;

seg1->mr_rkey = mr->rkey;
-   seg1->mr_base

[PATCH] RDMA/cxgb3: fail get_dma_mr if the memory footprint can exceed 32b

2015-07-22 Thread Steve Wise
T3 HW only supports MRs of length < 4GB.  If the system can have more
memory than that, we need to fail dma mr allocation so we don't create an MR
that cannot span the entire possible memory space.

Signed-off-by: Steve Wise 
---

 drivers/infiniband/hw/cxgb3/iwch_provider.c |4 
 1 files changed, 4 insertions(+), 0 deletions(-)

diff --git a/drivers/infiniband/hw/cxgb3/iwch_provider.c 
b/drivers/infiniband/hw/cxgb3/iwch_provider.c
index b1b7323..bbbe018 100644
--- a/drivers/infiniband/hw/cxgb3/iwch_provider.c
+++ b/drivers/infiniband/hw/cxgb3/iwch_provider.c
@@ -736,6 +736,10 @@ static struct ib_mr *iwch_get_dma_mr(struct ib_pd *pd, int 
acc)
/*
 * T3 only supports 32 bits of size.
 */
+   if (sizeof(phys_addr_t) > 4) {
+   pr_warn_once(MOD "Cannot support dma_mrs on this platform.\n");
+   return ERR_PTR(-ENOTSUPP);
+   }
bl.size = 0xffffffff;
bl.addr = 0;
kva = 0;



Re: [PATCH WIP 01/43] IB: Modify ib_create_mr API

2015-07-22 Thread Jason Gunthorpe
On Wed, Jul 22, 2015 at 07:58:23PM +0300, Sagi Grimberg wrote:
> On 7/22/2015 7:44 PM, Christoph Hellwig wrote:
> >On Wed, Jul 22, 2015 at 10:34:05AM -0600, Jason Gunthorpe wrote:
> >>>+/**
> >>>+ * ib_alloc_mr() - Allocates a memory region
> >>>+ * @pd:protection domain associated with the region
> >>>+ * @mr_type:   memory region type
> >>>+ * @max_entries:   maximum registration entries available
> >>>+ * @flags: create flags
> >>>+ */
> >>
> >>Can you update this comment to elaborate some more on what the
> >>parameters are? 'max_entries' is the number of s/g elements or
> >>something?
> >>
> >>>+enum ib_mr_type {
> >>>+  IB_MR_TYPE_FAST_REG,
> >>>+  IB_MR_TYPE_SIGNATURE,
> >>>  };
> >>
> >>Sure would be nice to have some documentation for what these things
> >>do..
> >
> >Agreed on both counts.  Otherwise this looks pretty good to me.
> 
> I can add some more documentation here...

So, I was wrong, 'max_entries' is the number of page entries, not
really the s/g element limit?

In other words the ULP can submit at most max_entries*PAGE_SIZE bytes
for the non ARB_SG case.

For the ARB_SG case, is it some other, more difficult computation?

It is somewhat ugly to ask for this upfront as a hard limit..

Is there any reason we can't use a hint_prealloc_pages as the argument
here, and then realloc in the map routine if the hint turns out to be
too small for a particular s/g list?

It looks like all drivers can support this.

That would make it much easier to use correctly, and free ULPs from
dealing with any impedance mismatch with core kernel code that assumes
a sg list length limit, or overall size limit, not some oddball
computation based on pages...
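
Roughly the shape being suggested (purely hypothetical; nothing below exists,
and count_sg_pages(), driver_grow_page_list() and the hint_prealloc_pages
field are invented stand-ins for driver internals):

/* Hypothetical only: the MR keeps a page array sized from an
 * allocation-time hint and the map routine grows it when a particular
 * s/g list needs more pages than the hint. */
static int hypothetical_map_mr_sg(struct ib_mr *mr, struct scatterlist *sg,
				  unsigned short sg_nents)
{
	unsigned int npages = count_sg_pages(sg, sg_nents);	/* invented */

	if (npages > mr->hint_prealloc_pages) {			/* invented field */
		int ret = driver_grow_page_list(mr, npages);	/* invented */

		if (ret)
			return ret;	/* e.g. -ENOMEM under memory pressure */
		mr->hint_prealloc_pages = npages;
	}

	return mr->device->map_mr_sg(mr, sg, sg_nents);
}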

Jason


Re: [PATCH WIP 02/43] IB/mlx4: Support ib_alloc_mr verb

2015-07-22 Thread Jason Gunthorpe
On Wed, Jul 22, 2015 at 01:50:01PM -0500, Steve Wise wrote:
 
> 43 patches overflows my stack ;)  I agree with Jason's suggestion.

Sagi, you may as well just send the ib_alloc_mr rework as a series and
get it done with, I'd pass off on the core parts of v2.

Jason


Re: [PATCH WIP 02/43] IB/mlx4: Support ib_alloc_mr verb

2015-07-22 Thread Steve Wise

On 7/22/2015 12:22 PM, Sagi Grimberg wrote:

On 7/22/2015 7:58 PM, Jason Gunthorpe wrote:

On Wed, Jul 22, 2015 at 09:55:02AM +0300, Sagi Grimberg wrote:


+struct ib_mr *mlx4_ib_alloc_mr(struct ib_pd *pd,
+   enum ib_mr_type mr_type,
+   u32 max_entries,
+   u32 flags)
+{


This is just a copy of mlx4_ib_alloc_fast_reg_mr with
this added:


+if (mr_type != IB_MR_TYPE_FAST_REG || flags)
+return ERR_PTR(-EINVAL);


Are all the driver updates the same? It looks like it.

I'd suggest shortening this patch series, have the core provide the
wrapper immediately:

struct ib_mr *ib_alloc_mr(struct ib_pd *pd,
{
...

 if (pd->device->alloc_mr) {
mr = pd->device->alloc_mr(pd, mr_type, max_entries, flags);
 } else {
if (mr_type != IB_MR_TYPE_FAST_REG || flags ||
!ib_dev->alloc_fast_reg_mr)
return ERR_PTR(-ENOSYS);
mr = pd->device->alloc_fast_reg_mr(..);
 }
}

Then go through the series to remove ib_alloc_fast_reg_mr

Then go through one series to migrate the drivers from
alloc_fast_reg_mr to alloc_mr

Then entirely drop alloc_fast_reg_mr from the driver API.

That should be shorter and easier to read the driver diffs, which is
the major change here.


Yea, it would be better...


43 patches overflows my stack ;)  I agree with Jason's suggestion.

Steve.


Re: [PATCH WIP 28/43] IB/core: Introduce new fast registration API

2015-07-22 Thread Jason Gunthorpe
On Wed, Jul 22, 2015 at 09:55:28AM +0300, Sagi Grimberg wrote:
> +/**
> + * ib_map_mr_sg() - Populates MR with a dma mapped SG list
> + * @mr:memory region
> + * @sg:dma mapped scatterlist
> + * @sg_nents:  number of entries in sg
> + * @access:access permissions

Again, related to my prior comments, please have two of these:

ib_map_mr_sg_rkey()
ib_map_mr_sg_lkey()

So we force ULPs to think about what they are doing properly, and we
get a chance to actually force lkey to be local use only for IB.
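
i.e. something like this thin pair of wrappers over the proposed
ib_map_mr_sg() (hypothetical names, shown only to make the split concrete):

/* Local use only: the ULP gets the lkey path and no remote access bits
 * can sneak in. */
static inline int ib_map_mr_sg_lkey(struct ib_mr *mr, struct scatterlist *sg,
				    unsigned short sg_nents)
{
	return ib_map_mr_sg(mr, sg, sg_nents, IB_ACCESS_LOCAL_WRITE);
}

/* Remote use: the caller must say which remote rights it wants. */
static inline int ib_map_mr_sg_rkey(struct ib_mr *mr, struct scatterlist *sg,
				    unsigned short sg_nents,
				    unsigned int remote_access)
{
	return ib_map_mr_sg(mr, sg, sg_nents,
			    IB_ACCESS_LOCAL_WRITE | remote_access);
}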

> +static inline void
> +ib_set_fastreg_wr(struct ib_mr *mr,
> +   u32 key,

The key should come from MR. Once the above is split then it is
obvious which key to use.

Jason


Re: [PATCH WIP 38/43] iser-target: Port to new memory registration API

2015-07-22 Thread Jason Gunthorpe
On Wed, Jul 22, 2015 at 08:33:16PM +0300, Sagi Grimberg wrote:
> >>memset(&fr_wr, 0, sizeof(fr_wr));
> >>+   ib_set_fastreg_wr(mr, mr->lkey, ISER_FASTREG_LI_WRID,
> >>+ false, &fr_wr);
> >
> >Shouldn't ib_set_fastreg_wr take care of this memset?  Also it seems
> >instead of the singalled flag to it we might just set that or
> >other flags later if we really want to.

Seems reasonable.

If you want to micro optimize then just zero the few items that are
defined to be accessed for fastreg, no need to zero the whole
structure. Infact, you may have already done that, so just drop the
memset entirely.

> The reason I didn't put it in was that ib_send_wr is not a small struct
> (92 bytes IIRC). So I'm a bit reluctant to add an unconditional memset.
> Maybe it's better that the callers can carefully set it to save some
> cycles?

If you want to optimize this path, then Sean is right, move the post
into the driver and stop pretending that ib_post_send is a performance
API.

ib_post_fastreg_wr would be a function that needs 3 register passed
arguments and does a simple copy to the driver's actual sendq

No 96 byte structure memset, no stack traffic, no conditional
jumps.
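
A hypothetical sketch of what such an entry point could look like
(hypothetical_sqe, driver_next_sqe(), driver_ring_doorbell() and
DRIVER_OP_FASTREG are all invented names):

/* Hypothetical: a dedicated fast-reg post that takes only the values
 * that actually vary and writes the driver's send queue entry directly,
 * with no ib_send_wr built or memset on the stack. */
int hypothetical_post_fastreg_wr(struct ib_qp *qp, struct ib_mr *mr,
				 u64 wr_id)
{
	struct hypothetical_sqe *sqe = driver_next_sqe(qp);	/* invented */

	if (!sqe)
		return -ENOMEM;

	sqe->opcode = DRIVER_OP_FASTREG;	/* invented */
	sqe->wr_id  = wr_id;
	sqe->lkey   = mr->lkey;			/* key set up by the map step */
	driver_ring_doorbell(qp);		/* invented */
	return 0;
}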

Jason


Re: [PATCH WIP 28/43] IB/core: Introduce new fast registration API

2015-07-22 Thread Jason Gunthorpe
On Wed, Jul 22, 2015 at 09:50:12AM -0700, Christoph Hellwig wrote:
> > +/**
> > + * ib_map_mr_sg() - Populates MR with a dma mapped SG list
> > + * @mr:memory region
> > + * @sg:dma mapped scatterlist
> > + * @sg_nents:  number of entries in sg
> > + * @access:access permissions
> 
> I know moving the access flags here was my idea originally, but I seem
> convinced by your argument that it might fit in better with the posting
> helper.  Or did someone else come up with a better argument that mine
> for moving it here?

I was hoping we'd move the DMA flush and translate into here and make
it mandatory. Is there any reason not to do that?

> > +int ib_map_mr_sg(struct ib_mr *mr,
> > +struct scatterlist *sg,
> > +unsigned short sg_nents,
> > +unsigned int access)
> > +{
> > +   int rc;
> > +
> > +   if (!mr->device->map_mr_sg)
> > +   return -ENOSYS;
> > +
> > +   rc = mr->device->map_mr_sg(mr, sg, sg_nents);
> 
> Do we really need a driver callout here?  It seems like we should

The call out makes sense to me..

The driver will convert the scatter list directly into whatever HW
representation it needs and prepare everything for posting. Every
driver has a different HW format, so it must be a callout.

> Also it seems like this returns 0/-error.  How do callers like SRP
> see that it only did a partial mapping and it needs another MR?

I would think it is an error to pass in more sg_nents than the MR was
created with, so SRP should never get a partial mapping as it should
never ask for more than max_entries.

(? Sagi, did I get the intent of this right?)

Jason
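
To make that layering concrete, a sketch of a core-side ib_map_mr_sg that owns
the DMA mapping and leaves only the HW conversion to the driver (direction
handling is glossed over; this is one possible split, not the posted patch):

int ib_map_mr_sg(struct ib_mr *mr, struct scatterlist *sg,
                 unsigned short sg_nents, unsigned int access)
{
        int nents;

        if (!mr->device->map_mr_sg)
                return -ENOSYS;

        /* core owns the DMA mapping and flushing */
        nents = ib_dma_map_sg(mr->device, sg, sg_nents, DMA_BIDIRECTIONAL);
        if (!nents)
                return -ENOMEM;

        /* driver callout only converts to its HW page list / KLM format */
        return mr->device->map_mr_sg(mr, sg, nents);
}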


Re: [PATCH WIP 00/43] New fast registration API

2015-07-22 Thread Sagi Grimberg

On 7/22/2015 8:10 PM, Christoph Hellwig wrote:

Thanks Sagi,

this looks pretty good in general, various nitpicks notwithstanding.

The one thing I'm curious about is how we can support SRP with its
multiple MR support without too much boilerplate code.  One option
would be to pass an array of MRs to the map routines, and while
most callers would just pass in one, it would handle multiple for those
drivers that supply them.


We can do that, but I'd prefer not to pollute the API just for this
single use case. What we can do, is add a pool API that would take care
of that. But even then we might end up with different strategies as not
all ULPs can use it the same way (protocol constraints)...

Today SRP has this logic that registers multiple SG aligned partials.
We can just have it pass a partial SG list to what we have today instead
of building the page vectors...

Or if we can come up with something that will keep the API trivial, we
can take care of that too.


Re: [PATCH WIP 38/43] iser-target: Port to new memory registration API

2015-07-22 Thread Sagi Grimberg

On 7/22/2015 8:04 PM, Christoph Hellwig wrote:

@@ -2585,11 +2517,9 @@ isert_fast_reg_mr(struct isert_conn *isert_conn,
struct isert_device *device = isert_conn->device;
struct ib_device *ib_dev = device->ib_device;
struct ib_mr *mr;
struct ib_send_wr fr_wr, inv_wr;
struct ib_send_wr *bad_wr, *wr = NULL;
+   int ret;

if (mem->dma_nents == 1) {
sge->lkey = device->mr->lkey;
@@ -2600,40 +2530,32 @@ isert_fast_reg_mr(struct isert_conn *isert_conn,
return 0;
}

+   if (ind == ISERT_DATA_KEY_VALID)
/* Registering data buffer */
mr = fr_desc->data_mr;
+   else
/* Registering protection buffer */
mr = fr_desc->pi_ctx->prot_mr;

if (!(fr_desc->ind & ind)) {
isert_inv_rkey(&inv_wr, mr);
wr = &inv_wr;
}

+   ret = ib_map_mr_sg(mr, mem->sg, mem->nents, IB_ACCESS_LOCAL_WRITE);
+   if (ret) {
+   isert_err("failed to map sg %p with %d entries\n",
+mem->sg, mem->dma_nents);
+   return ret;
+   }
+
+   isert_dbg("Use fr_desc %p sg_nents %d offset %u\n",
+ fr_desc, mem->nents, mem->offset);
+
/* Prepare FASTREG WR */
memset(&fr_wr, 0, sizeof(fr_wr));
+   ib_set_fastreg_wr(mr, mr->lkey, ISER_FASTREG_LI_WRID,
+ false, &fr_wr);


Shouldn't ib_set_fastreg_wr take care of this memset?  Also it seems
that instead of passing the signalled flag to it we might just set that or
other flags later if we really want to.


The reason I didn't put it in was that ib_send_wr is not a small struct
(92 bytes IIRC). So I'm a bit reluctant to add an unconditional memset.
Maybe it's better that the callers can carefully set it to save some
cycles?




  struct pi_context {
struct ib_mr   *prot_mr;
-   struct ib_fast_reg_page_list   *prot_frpl;
struct ib_mr   *sig_mr;
  };

  struct fast_reg_descriptor {
struct list_headlist;
struct ib_mr   *data_mr;
-   struct ib_fast_reg_page_list   *data_frpl;
u8  ind;
struct pi_context  *pi_ctx;


As a follow on it might be worth to just kill off the separate
pi_context structure here.


Yea we can do that..


Re: [PATCH WIP 40/43] mlx5: Allocate private context for arbitrary scatterlist registration

2015-07-22 Thread Jason Gunthorpe
On Wed, Jul 22, 2015 at 09:55:40AM +0300, Sagi Grimberg wrote:
> + size += max_t(int, MLX5_UMR_ALIGN - ARCH_KMALLOC_MINALIGN, 0);
> + mr->klms = kzalloc(size, GFP_KERNEL);
> + if (!mr->klms)
> + return -ENOMEM;
> +
> + mr->pl_map = dma_map_single(device->dma_device, mr->klms,
> + size, DMA_TO_DEVICE);

This is a misuse of the DMA API, you must call dma_map_single after
the memory is set by the CPU, not before.

The fast reg varient is using coherent allocations, which is OK..

Personally, I'd switch them both to map_single, then when copying the
scatter list
 - Make sure the buffer is DMA unmapped
 - Copy
 - dma_map_single

Unless there is some additional reason for the coherent allocation..

Jason
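
The suggested ordering, sketched against the fields this patch adds
(mr->klms, mr->pl_map); the helper name and the size handling are made up:

static int mlx5_refresh_pl_map(struct ib_device *ibdev, struct mlx5_ib_mr *mr,
                               size_t size)
{
        /* 1) take the buffer back from the device before the CPU writes it */
        if (mr->pl_map)
                dma_unmap_single(ibdev->dma_device, mr->pl_map, size,
                                 DMA_TO_DEVICE);

        /* 2) CPU copy: fill mr->klms from the scatterlist here */

        /* 3) map for the device only after the CPU stores are done */
        mr->pl_map = dma_map_single(ibdev->dma_device, mr->klms, size,
                                    DMA_TO_DEVICE);
        return dma_mapping_error(ibdev->dma_device, mr->pl_map) ? -ENOMEM : 0;
}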


Re: [PATCH WIP 39/43] IB/core: Add arbitrary sg_list support

2015-07-22 Thread Sagi Grimberg

On 7/22/2015 8:22 PM, Jason Gunthorpe wrote:

On Wed, Jul 22, 2015 at 09:55:39AM +0300, Sagi Grimberg wrote:

+enum ib_mr_flags {
+   IB_MR_MAP_ARB_SG = 1,
+};


Something about this just seems ugly. We are back to what we were
trying to avoid: Adding more types of MRs..

Is this really necessary? Do you really need to know the MR type when
the MR is created, or can the adaptor change types on the fly during
registration?

iSER for example has a rarely used corner case where it needs this,


I can tell you that it's anything but a corner case. Direct-io, bio
merges, FS operations and PI are examples where most of the sg lists
*will* be "gappy".

Trust me, it's fairly common to see those...


but it just turns on the feature unconditionally right away. This
incurs 2x the overhead in the MR allocations and who knows what
performance impact on the adaptor side.


I ran various workloads with this, and performance seems to sustain.



It would be so much better if it could switch to this mode on a SG by
SG list basis.


It would, but unfortunately it can't.


Re: [PATCH WIP 00/43] New fast registration API

2015-07-22 Thread Jason Gunthorpe
On Wed, Jul 22, 2015 at 10:10:23AM -0700, Christoph Hellwig wrote:
> The one thing I'm curious about is how we can support SRP with its
> multiple MR support without too much boilerplate code.  One option
> would be to pass an array of MRs to the map routines, and while
> most callers would just pass in one, it would handle multiple for those
> drivers that supply them.

What is SRP trying to accomplish with that?

The only reason that springs to mind is to emulate IB_MR_MAP_ARB_SG ?

Jason


Re: [PATCH WIP 39/43] IB/core: Add arbitrary sg_list support

2015-07-22 Thread Jason Gunthorpe
On Wed, Jul 22, 2015 at 09:55:39AM +0300, Sagi Grimberg wrote:
> +enum ib_mr_flags {
> + IB_MR_MAP_ARB_SG = 1,
> +};

Something about this just seems ugly. We are back to what we were
trying to avoid: Adding more types of MRs..

Is this really necessary? Do you really need to know the MR type when
the MR is created, or can the adaptor change types on the fly during
registration?

iSER for example has a rarely used corner case where it needs this,
but it just turns on the feature unconditionally right away. This
incurs 2x the overhead in the MR allocations and who knows what
performance impact on the adaptor side.

It would be so much better if it could switch to this mode on a SG by
SG list basis.

Same for signature.

In other words: It would be so much cleaner if ib_map_mr_sg set the
MR type based on the need.

Jason


Re: [PATCH WIP 02/43] IB/mlx4: Support ib_alloc_mr verb

2015-07-22 Thread Sagi Grimberg

On 7/22/2015 7:58 PM, Jason Gunthorpe wrote:

On Wed, Jul 22, 2015 at 09:55:02AM +0300, Sagi Grimberg wrote:


+struct ib_mr *mlx4_ib_alloc_mr(struct ib_pd *pd,
+  enum ib_mr_type mr_type,
+  u32 max_entries,
+  u32 flags)
+{


This is just a copy of mlx4_ib_alloc_fast_reg_mr with
this added:


+   if (mr_type != IB_MR_TYPE_FAST_REG || flags)
+   return ERR_PTR(-EINVAL);


Are all the driver updates the same? It looks like it.

I'd suggest shortening this patch series, have the core provide the
wrapper immediately:

struct ib_mr *ib_alloc_mr(struct ib_pd *pd, enum ib_mr_type mr_type,
                          u32 max_entries, u32 flags)
{
...

 if (pd->device->alloc_mr) {
mr = pd->device->alloc_mr(pd, mr_type, max_entries, flags);
 } else {
if (mr_type != IB_MR_TYPE_FAST_REG || flags ||
!ib_dev->alloc_fast_reg_mr)
return ERR_PTR(-ENOSYS);
mr = pd->device->alloc_fast_reg_mr(..);
 }
}

Then go through the series to remove ib_alloc_fast_reg_mr

Then go through one series to migrate the drivers from
alloc_fast_reg_mr to alloc_mr

Then entirely drop alloc_fast_reg_mr from the driver API.

That should be shorter and easier to read the driver diffs, which is
the major change here.


Yea, it would be better...

Thanks.


Re: [PATCH WIP 00/43] New fast registration API

2015-07-22 Thread Christoph Hellwig
Thanks Sagi,

this looks pretty good in general, various nitpicks notwithstanding.

The one thing I'm curious about is how we can support SRP with its
multiple MR support without too much boilerplate code.  One option
would be to pass an array of MRs to the map routines, and while
most callers would just pass in one, it would handle multiple for those
drivers that supply them.


Re: [PATCH WIP 39/43] IB/core: Add arbitrary sg_list support

2015-07-22 Thread Christoph Hellwig
> + IB_DEVICE_MAP_ARB_SG= (1ULL<<32),

> +enum ib_mr_flags {
> + IB_MR_MAP_ARB_SG = 1,
> +};
> +

s/ARB_SG/SG_GAPS/?

Also please try to document new flags.  I know the IB code currently
doesn't do it, but starting a trend there would be very useful.



Re: [PATCH WIP 38/43] iser-target: Port to new memory registration API

2015-07-22 Thread Christoph Hellwig
> @@ -2585,11 +2517,9 @@ isert_fast_reg_mr(struct isert_conn *isert_conn,
>   struct isert_device *device = isert_conn->device;
>   struct ib_device *ib_dev = device->ib_device;
>   struct ib_mr *mr;
>   struct ib_send_wr fr_wr, inv_wr;
>   struct ib_send_wr *bad_wr, *wr = NULL;
> + int ret;
>  
>   if (mem->dma_nents == 1) {
>   sge->lkey = device->mr->lkey;
> @@ -2600,40 +2530,32 @@ isert_fast_reg_mr(struct isert_conn *isert_conn,
>   return 0;
>   }
>  
> + if (ind == ISERT_DATA_KEY_VALID)
>   /* Registering data buffer */
>   mr = fr_desc->data_mr;
> + else
>   /* Registering protection buffer */
>   mr = fr_desc->pi_ctx->prot_mr;
>  
>   if (!(fr_desc->ind & ind)) {
>   isert_inv_rkey(&inv_wr, mr);
>   wr = &inv_wr;
>   }
>  
> + ret = ib_map_mr_sg(mr, mem->sg, mem->nents, IB_ACCESS_LOCAL_WRITE);
> + if (ret) {
> + isert_err("failed to map sg %p with %d entries\n",
> +  mem->sg, mem->dma_nents);
> + return ret;
> + }
> +
> + isert_dbg("Use fr_desc %p sg_nents %d offset %u\n",
> +   fr_desc, mem->nents, mem->offset);
> +
>   /* Prepare FASTREG WR */
>   memset(&fr_wr, 0, sizeof(fr_wr));
> + ib_set_fastreg_wr(mr, mr->lkey, ISER_FASTREG_LI_WRID,
> +   false, &fr_wr);

Shouldn't ib_set_fastreg_wr take care of this memset?  Also it seems
that instead of passing the signalled flag to it we might just set that or
other flags later if we really want to.

>  struct pi_context {
>   struct ib_mr   *prot_mr;
> - struct ib_fast_reg_page_list   *prot_frpl;
>   struct ib_mr   *sig_mr;
>  };
>  
>  struct fast_reg_descriptor {
>   struct list_headlist;
>   struct ib_mr   *data_mr;
> - struct ib_fast_reg_page_list   *data_frpl;
>   u8  ind;
>   struct pi_context  *pi_ctx;

As a follow on it might be worth to just kill off the separate
pi_context structure here.


Re: [PATCH WIP 01/43] IB: Modify ib_create_mr API

2015-07-22 Thread Sagi Grimberg

On 7/22/2015 8:01 PM, Jason Gunthorpe wrote:

On Wed, Jul 22, 2015 at 07:59:16PM +0300, Sagi Grimberg wrote:

Do we want to pull ib_get_dma_mr() here with type IB_MR_TYPE_DMA?


I want to get rid of ib_get_dma_mr...


That's why I asked :)

So I'll take it as a no...


Re: [PATCH WIP 01/43] IB: Modify ib_create_mr API

2015-07-22 Thread Jason Gunthorpe
On Wed, Jul 22, 2015 at 07:59:16PM +0300, Sagi Grimberg wrote:
> Do we want to pull ib_get_dma_mr() here with type IB_MR_TYPE_DMA?

I want to get rid of ib_get_dma_mr...

My plan was to get rid of it as my last series shows for all lkey
usages and then rename it to:

ib_get_insecure_all_physical_rkey

For the remaining usages, and a future kernel version will taint the
kernel if anyone calls it.

Jason


Re: [PATCH WIP 01/43] IB: Modify ib_create_mr API

2015-07-22 Thread Sagi Grimberg

On 7/22/2015 7:34 PM, Jason Gunthorpe wrote:

+/**
+ * ib_alloc_mr() - Allocates a memory region
+ * @pd:protection domain associated with the region
+ * @mr_type:   memory region type
+ * @max_entries:   maximum registration entries available
+ * @flags: create flags
+ */


Can you update this comment to elaborate some more on what the
parameters are? 'max_entries' is the number of s/g elements or
something?


+enum ib_mr_type {
+   IB_MR_TYPE_FAST_REG,
+   IB_MR_TYPE_SIGNATURE,
  };


Sure would be nice to have some documentation for what these things
do..


Do we want to pull ib_get_dma_mr() here with type IB_MR_TYPE_DMA?


Re: [PATCH WIP 37/43] xprtrdma: Port to new memory registration API

2015-07-22 Thread Christoph Hellwig
On Wed, Jul 22, 2015 at 11:03:49AM -0400, Chuck Lever wrote:
> I like this (and the matching ib_dma_unmap_sg). But why wouldn?t
> this function be called ib_dma_map_sg() ? The name ib_map_mr_sg()
> had me thinking for a moment that this API actually posted the
> FASTREG WR, but I see that it doesn?t.

We already have an ib_dma_map_sg, which is a wrapper around dma_map_sg
that allows ehca, ipath and qib to do naughty things instead of the
regular dma mapping.

But it seems maybe the dma_map_sg calls or the magic for those other
drivers should be folded into Sagi's new API as those HCA apparently
don't need physical addresses and thus the S/G list.

God knows what they're doing with a list of virtual addresses, but
removing the struct scatterlist abuse there would be highly welcome.
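
For reference, the existing wrapper in include/rdma/ib_verbs.h is roughly the
following; the dma_ops hook is what lets those drivers substitute their own
mapping:

static inline int ib_dma_map_sg(struct ib_device *dev,
                                struct scatterlist *sg, int nents,
                                enum dma_data_direction direction)
{
        if (dev->dma_ops)
                return dev->dma_ops->map_sg(dev, sg, nents, direction);
        return dma_map_sg(dev->dma_device, sg, nents, direction);
}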


Re: [PATCH WIP 02/43] IB/mlx4: Support ib_alloc_mr verb

2015-07-22 Thread Jason Gunthorpe
On Wed, Jul 22, 2015 at 09:55:02AM +0300, Sagi Grimberg wrote:
>  
> +struct ib_mr *mlx4_ib_alloc_mr(struct ib_pd *pd,
> +enum ib_mr_type mr_type,
> +u32 max_entries,
> +u32 flags)
> +{

This is just a copy of mlx4_ib_alloc_fast_reg_mr with
this added:

> + if (mr_type != IB_MR_TYPE_FAST_REG || flags)
> + return ERR_PTR(-EINVAL);

Are all the driver updates the same? It looks like it.

I'd suggest shortening this patch series, have the core provide the
wrapper immediately:

struct ib_mr *ib_alloc_mr(struct ib_pd *pd, enum ib_mr_type mr_type,
                          u32 max_entries, u32 flags)
{
...

if (pd->device->alloc_mr) {
mr = pd->device->alloc_mr(pd, mr_type, max_entries, flags);
} else {
if (mr_type != IB_MR_TYPE_FAST_REG || flags ||
!ib_dev->alloc_fast_reg_mr)
return ERR_PTR(-ENOSYS);
mr = pd->device->alloc_fast_reg_mr(..);
}
}

Then go through the series to remove ib_alloc_fast_reg_mr

Then go through one series to migrate the drivers from
alloc_fast_reg_mr to alloc_mr

Then entirely drop alloc_fast_reg_mr from the driver API.

That should be shorter and easier to read the driver diffs, which is
the major change here.

This whole section (up to 20) looks reasonable to me..

Jason


Re: [PATCH WIP 01/43] IB: Modify ib_create_mr API

2015-07-22 Thread Sagi Grimberg

On 7/22/2015 7:44 PM, Christoph Hellwig wrote:

On Wed, Jul 22, 2015 at 10:34:05AM -0600, Jason Gunthorpe wrote:

+/**
+ * ib_alloc_mr() - Allocates a memory region
+ * @pd:protection domain associated with the region
+ * @mr_type:   memory region type
+ * @max_entries:   maximum registration entries available
+ * @flags: create flags
+ */


Can you update this comment to elaborate some more on what the
parameters are? 'max_entries' is the number of s/g elements or
something?


+enum ib_mr_type {
+   IB_MR_TYPE_FAST_REG,
+   IB_MR_TYPE_SIGNATURE,
  };


Sure would be nice to have some documentation for what these things
do..


Agreed on both counts.  Otherwise this looks pretty good to me.


I can add some more documentation here...


Re: [PATCH WIP 28/43] IB/core: Introduce new fast registration API

2015-07-22 Thread Sagi Grimberg

On 7/22/2015 7:50 PM, Christoph Hellwig wrote:

+/**
+ * ib_map_mr_sg() - Populates MR with a dma mapped SG list
+ * @mr:memory region
+ * @sg:dma mapped scatterlist
+ * @sg_nents:  number of entries in sg
+ * @access:access permissions


I know moving the access flags here was my idea originally, but I seem
convinced by your argument that it might fit in better with the posting
helper.  Or did someone else come up with a better argument that mine
for moving it here?


Not really. I was and still am pretty indifferent about it...




+int ib_map_mr_sg(struct ib_mr *mr,
+struct scatterlist *sg,
+unsigned short sg_nents,
+unsigned int access)
+{
+   int rc;
+
+   if (!mr->device->map_mr_sg)
+   return -ENOSYS;
+
+   rc = mr->device->map_mr_sg(mr, sg, sg_nents);


Do we really need a driver callout here?  It seems like we should
just do the map here, and then either have a flag for the mlx5 indirect
mapping, or if you want to keep the abstraction add the method at that
point but make it optional, so that all the other drivers don't need the
boilerplate code.


I commented on this bit in another reply. I think that several drivers
will want to use their own mappings. But I can change that if it's not
the case...



Also it seems like this returns 0/-error.  How do callers like SRP
see that it only did a partial mapping and it needs another MR?


Umm, I think SRP would need to iterate over the sg list and pass partial
SGs to the mapping (I can add a break; statement once we hit sg_nents)

It's not perfect, but the idea was not to do backflips here.
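
A sketch of what that iteration could look like in the ULP: registering
max_entries-sized windows of the SG list, one MR per window (the helper and
the mrs/nr_mrs pool are hypothetical):

static int map_sg_in_chunks(struct ib_mr **mrs, int nr_mrs,
                            struct scatterlist *sg, unsigned short sg_nents,
                            unsigned short max_entries, unsigned int access)
{
        int used = 0;

        while (sg_nents) {
                unsigned short n = min_t(unsigned short, sg_nents, max_entries);
                int ret, i;

                if (used == nr_mrs)
                        return -ENOMEM; /* pool too small for this request */

                ret = ib_map_mr_sg(mrs[used], sg, n, access);
                if (ret)
                        return ret;

                for (i = 0; i < n; i++) /* advance to the next window */
                        sg = sg_next(sg);
                sg_nents -= n;
                used++;
        }
        return used;                    /* number of MRs consumed */
}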


Re: [PATCH WIP 21/43] mlx5: Allocate a private page list in ib_alloc_mr

2015-07-22 Thread Sagi Grimberg

On 7/22/2015 7:46 PM, Christoph Hellwig wrote:

Just curious: what's the tradeoff between allocating the page list
in the core vs duplicating it in all the drivers?  Does the driver
variant give us any benefits?


It's not necessarily a page list... (i.e. a real scatterlist).
I think it will make more sense in patch 41/43.

Moreover, as I wrote in the cover-letter. I noticed that several
drivers keep shadows anyway for various reasons. For example mlx4
sets the page list with a preset-bit (related to ODP...) so at
registration time we see the loop:

for (i = 0; i < mr->npages; ++i)
mr->mpl[i] = cpu_to_be64(mr->pl[i] | MLX4_MTT_FLAG_PRESENT);

Given that this is not a single example, I'd expect drivers to skip this
duplication (hopefully).

Sagi.


Re: [PATCH WIP 28/43] IB/core: Introduce new fast registration API

2015-07-22 Thread Christoph Hellwig
> +/**
> + * ib_map_mr_sg() - Populates MR with a dma mapped SG list
> + * @mr:memory region
> + * @sg:dma mapped scatterlist
> + * @sg_nents:  number of entries in sg
> + * @access:access permissions

I know moving the access flags here was my idea originally, but I seem
convinced by your argument that it might fit in better with the posting
helper.  Or did someone else come up with a better argument that mine
for moving it here?

> +int ib_map_mr_sg(struct ib_mr *mr,
> +  struct scatterlist *sg,
> +  unsigned short sg_nents,
> +  unsigned int access)
> +{
> + int rc;
> +
> + if (!mr->device->map_mr_sg)
> + return -ENOSYS;
> +
> + rc = mr->device->map_mr_sg(mr, sg, sg_nents);

Do we really need a driver callout here?  It seems like we should
just do the map here, and then either have a flag for the mlx5 indirect
mapping, or if you want to keep the abstraction add the method at that
point but make it optional, so that all the other drivers don't need the
boilerplate code.

Also it seems like this returns 0/-error.  How do callers like SRP
see that it only did a partial mapping and it needs another MR?


Re: [PATCH WIP 21/43] mlx5: Allocate a private page list in ib_alloc_mr

2015-07-22 Thread Christoph Hellwig
Just curious: what's the tradeoff between allocating the page list
in the core vs duplicating it in all the drivers?  Does the driver
variant give us any benefits?


Re: [PATCH WIP 01/43] IB: Modify ib_create_mr API

2015-07-22 Thread Christoph Hellwig
On Wed, Jul 22, 2015 at 10:34:05AM -0600, Jason Gunthorpe wrote:
> > +/**
> > + * ib_alloc_mr() - Allocates a memory region
> > + * @pd:protection domain associated with the region
> > + * @mr_type:   memory region type
> > + * @max_entries:   maximum registration entries available
> > + * @flags: create flags
> > + */
> 
> Can you update this comment to elaborate some more on what the
> parameters are? 'max_entries' is the number of s/g elements or
> something?
> 
> > +enum ib_mr_type {
> > +   IB_MR_TYPE_FAST_REG,
> > +   IB_MR_TYPE_SIGNATURE,
> >  };
> 
> Sure would be nice to have some documentation for what these things
> do..

Agreed on both counts.  Otherwise this looks pretty good to me.


Re: [PATCH WIP 01/43] IB: Modify ib_create_mr API

2015-07-22 Thread Jason Gunthorpe
> +/**
> + * ib_alloc_mr() - Allocates a memory region
> + * @pd:protection domain associated with the region
> + * @mr_type:   memory region type
> + * @max_entries:   maximum registration entries available
> + * @flags: create flags
> + */

Can you update this comment to elaborate some more on what the
parameters are? 'max_entries' is the number of s/g elements or
something?

> +enum ib_mr_type {
> + IB_MR_TYPE_FAST_REG,
> + IB_MR_TYPE_SIGNATURE,
>  };

Sure would be nice to have some documentation for what these things
do..

Jason


Re: [PATCH WIP 37/43] xprtrdma: Port to new memory registration API

2015-07-22 Thread Chuck Lever

On Jul 22, 2015, at 11:41 AM, Sagi Grimberg  wrote:

> 
>>> +   for (i = 0; i < nsegs;) {
>>> +   sg_set_page(&frmr->sg[i], seg->mr_page,
>>> +   seg->mr_len, offset_in_page(seg->mr_offset));
>> 
>> Cautionary note: here we’re dealing with both the “contiguous
>> set of pages” case and the “small region of bytes in a single page”
>> case. See rpcrdma_convert_iovs(): sometimes RPC send or receive
>> buffers can be registered (RDMA_NOMSG).
> 
> I noticed that (I think). I think this is handled correctly.
> What exactly is the caution note here?

Well the sg is turned into a page list below your API. Just
want to make sure that we have tested your xprtrdma alterations
with all the ULP possibilities. When you are further along I
can pull this and run my functional tests.
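
For the record, the two cases being flagged look like this at the sg level
(page, buf and len are placeholder values, not from the patch):

/* whole-page payload: offset 0, a full PAGE_SIZE */
sg_set_page(&frmr->sg[i], page, PAGE_SIZE, 0);

/* RDMA_NOMSG-style case: a short byte range inside a single page */
sg_set_page(&frmr->sg[i], virt_to_page(buf), len, offset_in_page(buf));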


>>> mr = frmr->fr_mr;
>>> +   access = writing ? IB_ACCESS_REMOTE_WRITE | IB_ACCESS_LOCAL_WRITE :
>>> +  IB_ACCESS_REMOTE_READ;
>>> +   rc = ib_map_mr_sg(mr, frmr->sg, frmr->sg_nents, access);
>> 
>> I like this (and the matching ib_dma_unmap_sg). But why wouldn’t
>> this function be called ib_dma_map_sg() ? The name ib_map_mr_sg()
>> had me thinking for a moment that this API actually posted the
>> FASTREG WR, but I see that it doesn’t.
> 
> Umm, ib_dma_map_sg is already taken :)
> 
> This is what I came up with, it maps the SG elements to the MR
> private context.
> 
> I'd like to keep the post API for now. It will be possible to
> add a wrapper function that would do:
> - dma_map_sg
> - ib_map_mr_sg
> - init fastreg send_wr
> - post_send (maybe)

Where xprtrdma might improve is by setting up all the FASTREG
WRs for one RPC with a single chain and post_send. We could do
that with your INDIR_MR concept, for example.


>>> -   while (seg1->mr_nsegs--)
>>> -   rpcrdma_unmap_one(ia->ri_device, seg++);
>>> +   ib_dma_unmap_sg(ia->ri_device, frmr->sg, frmr->sg_nents, seg1->mr_dir);
>> 
>> ->mr_dir was previously set by rpcrdma_map_one(), which you’ve replaced
>> with ib_map_mr_sg(). So maybe frwr_op_map() needs to save “direction”
>> in the rpcrdma_frmr.
> 
> Yep, that's correct, if I had turned on dma mapping debug it would shout
> at me here...
> 
> Note, I added in the git repo a patch to allow arbitrary sg lists in
> frwr_op_map() which would allow you to skip the holes check... seems to
> work with mlx5...
> 
> I did notice that mlx4 gives a protection error after the conversion... 
> I'll look into that...

Should also get Steve and Devesh to try this with their adapters.


--
Chuck Lever





Re: [PATCH WIP 37/43] xprtrdma: Port to new memory registration API

2015-07-22 Thread Sagi Grimberg



+   for (i = 0; i < nsegs;) {
+   sg_set_page(&frmr->sg[i], seg->mr_page,
+   seg->mr_len, offset_in_page(seg->mr_offset));


Cautionary note: here we’re dealing with both the “contiguous
set of pages” case and the “small region of bytes in a single page”
case. See rpcrdma_convert_iovs(): sometimes RPC send or receive
buffers can be registered (RDMA_NOMSG).


I noticed that (I think). I think this is handled correctly.
What exactly is the caution note here?


mr = frmr->fr_mr;
+   access = writing ? IB_ACCESS_REMOTE_WRITE | IB_ACCESS_LOCAL_WRITE :
+  IB_ACCESS_REMOTE_READ;
+   rc = ib_map_mr_sg(mr, frmr->sg, frmr->sg_nents, access);


I like this (and the matching ib_dma_unmap_sg). But why wouldn’t
this function be called ib_dma_map_sg() ? The name ib_map_mr_sg()
had me thinking for a moment that this API actually posted the
FASTREG WR, but I see that it doesn’t.


Umm, ib_dma_map_sg is already taken :)

This is what I came up with, it maps the SG elements to the MR
private context.

I'd like to keep the post API for now. It will be possible to
add a wrapper function that would do:
- dma_map_sg
- ib_map_mr_sg
- init fastreg send_wr
- post_send (maybe)
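
A sketch of that wrapper against the API as posted in this series (the name is
invented and error unwinding is kept minimal):

static int ib_map_mr_sg_and_post(struct ib_qp *qp, struct ib_mr *mr,
                                 struct scatterlist *sg, unsigned short sg_nents,
                                 enum dma_data_direction dir,
                                 unsigned int access, u64 wr_id)
{
        struct ib_send_wr wr, *bad_wr;
        int nents, ret;

        nents = ib_dma_map_sg(qp->device, sg, sg_nents, dir);
        if (!nents)
                return -ENOMEM;

        ret = ib_map_mr_sg(mr, sg, nents, access);
        if (ret)
                goto out_unmap;

        memset(&wr, 0, sizeof(wr));
        ib_set_fastreg_wr(mr, mr->lkey, wr_id, true, &wr);
        ret = ib_post_send(qp, &wr, &bad_wr);
out_unmap:
        if (ret)
                ib_dma_unmap_sg(qp->device, sg, sg_nents, dir);
        return ret;
}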



-   while (seg1->mr_nsegs--)
-   rpcrdma_unmap_one(ia->ri_device, seg++);
+   ib_dma_unmap_sg(ia->ri_device, frmr->sg, frmr->sg_nents, seg1->mr_dir);


->mr_dir was previously set by rpcrdma_map_one(), which you’ve replaced
with ib_map_mr_sg(). So maybe frwr_op_map() needs to save “direction”
in the rpcrdma_frmr.


Yep, that's correct, if I had turned on dma mapping debug it would shout
at me here...

Note, I added in the git repo a patch to allow arbitrary sg lists in
frwr_op_map() which would allow you to skip the holes check... seems to
work with mlx5...

I did notice that mlx4 gives a protection error after the
conversion... I'll look into that...



Re: [PATCH WIP 37/43] xprtrdma: Port to new memory registration API

2015-07-22 Thread Chuck Lever

On Jul 22, 2015, at 2:55 AM, Sagi Grimberg  wrote:

> Signed-off-by: Sagi Grimberg 
> ---
> net/sunrpc/xprtrdma/frwr_ops.c  | 80 ++---
> net/sunrpc/xprtrdma/xprt_rdma.h |  4 ++-
> 2 files changed, 47 insertions(+), 37 deletions(-)
> 
> diff --git a/net/sunrpc/xprtrdma/frwr_ops.c b/net/sunrpc/xprtrdma/frwr_ops.c
> index 517efed..e28246b 100644
> --- a/net/sunrpc/xprtrdma/frwr_ops.c
> +++ b/net/sunrpc/xprtrdma/frwr_ops.c
> @@ -151,9 +151,13 @@ __frwr_init(struct rpcrdma_mw *r, struct ib_pd *pd, 
> struct ib_device *device,
>   f->fr_mr = ib_alloc_mr(pd, IB_MR_TYPE_FAST_REG, depth, 0);
>   if (IS_ERR(f->fr_mr))
>   goto out_mr_err;
> - f->fr_pgl = ib_alloc_fast_reg_page_list(device, depth);
> - if (IS_ERR(f->fr_pgl))
> +
> + f->sg = kcalloc(sizeof(*f->sg), depth, GFP_KERNEL);
> + if (IS_ERR(f->sg))
>   goto out_list_err;
> +
> + sg_init_table(f->sg, depth);
> +
>   return 0;
> 
> out_mr_err:
> @@ -163,7 +167,7 @@ out_mr_err:
>   return rc;
> 
> out_list_err:
> - rc = PTR_ERR(f->fr_pgl);
> + rc = -ENOMEM;
>   dprintk("RPC:   %s: ib_alloc_fast_reg_page_list status %i\n",
>   __func__, rc);
>   ib_dereg_mr(f->fr_mr);
> @@ -179,7 +183,7 @@ __frwr_release(struct rpcrdma_mw *r)
>   if (rc)
>   dprintk("RPC:   %s: ib_dereg_mr status %i\n",
>   __func__, rc);
> - ib_free_fast_reg_page_list(r->r.frmr.fr_pgl);
> + kfree(r->r.frmr.sg);
> }
> 
> static int
> @@ -320,10 +324,7 @@ frwr_op_map(struct rpcrdma_xprt *r_xprt, struct 
> rpcrdma_mr_seg *seg,
>   struct ib_send_wr fastreg_wr, *bad_wr;
>   u8 key;
>   int len, pageoff;
> - int i, rc;
> - int seg_len;
> - u64 pa;
> - int page_no;
> + int i, rc, access;
> 
>   mw = seg1->rl_mw;
>   seg1->rl_mw = NULL;
> @@ -344,39 +345,46 @@ frwr_op_map(struct rpcrdma_xprt *r_xprt, struct 
> rpcrdma_mr_seg *seg,
>   if (nsegs > ia->ri_max_frmr_depth)
>   nsegs = ia->ri_max_frmr_depth;
> 
> - for (page_no = i = 0; i < nsegs;) {
> - rpcrdma_map_one(device, seg, direction);
> - pa = seg->mr_dma;
> - for (seg_len = seg->mr_len; seg_len > 0; seg_len -= PAGE_SIZE) {
> - frmr->fr_pgl->page_list[page_no++] = pa;
> - pa += PAGE_SIZE;
> - }
> + for (i = 0; i < nsegs;) {
> + sg_set_page(&frmr->sg[i], seg->mr_page,
> + seg->mr_len, offset_in_page(seg->mr_offset));

Cautionary note: here we’re dealing with both the “contiguous
set of pages” case and the “small region of bytes in a single page”
case. See rpcrdma_convert_iovs(): sometimes RPC send or receive
buffers can be registered (RDMA_NOMSG).


>   len += seg->mr_len;
> - ++seg;
>   ++i;
> - /* Check for holes */
> + ++seg;
> +
> + /* Check for holes - needed?? */
>   if ((i < nsegs && offset_in_page(seg->mr_offset)) ||
>   offset_in_page((seg-1)->mr_offset + (seg-1)->mr_len))
>   break;
>   }
> +
> + frmr->sg_nents = i;
> + frmr->dma_nents = ib_dma_map_sg(device, frmr->sg,
> + frmr->sg_nents, direction);
> + if (!frmr->dma_nents) {
> + pr_err("RPC:   %s: failed to dma map sg %p sg_nents %d\n",
> + __func__, frmr->sg, frmr->sg_nents);
> + return -ENOMEM;
> + }
> +
>   dprintk("RPC:   %s: Using frmr %p to map %d segments (%d bytes)\n",
>   __func__, mw, i, len);
> 
> - memset(&fastreg_wr, 0, sizeof(fastreg_wr));
> - fastreg_wr.wr_id = (unsigned long)(void *)mw;
> - fastreg_wr.opcode = IB_WR_FAST_REG_MR;
> - fastreg_wr.wr.fast_reg.iova_start = seg1->mr_dma + pageoff;
> - fastreg_wr.wr.fast_reg.page_list = frmr->fr_pgl;
> - fastreg_wr.wr.fast_reg.page_shift = PAGE_SHIFT;
> - fastreg_wr.wr.fast_reg.page_list_len = page_no;
> - fastreg_wr.wr.fast_reg.length = len;
> - fastreg_wr.wr.fast_reg.access_flags = writing ?
> - IB_ACCESS_REMOTE_WRITE | IB_ACCESS_LOCAL_WRITE :
> - IB_ACCESS_REMOTE_READ;
>   mr = frmr->fr_mr;
> + access = writing ? IB_ACCESS_REMOTE_WRITE | IB_ACCESS_LOCAL_WRITE :
> +IB_ACCESS_REMOTE_READ;
> + rc = ib_map_mr_sg(mr, frmr->sg, frmr->sg_nents, access);

I like this (and the matching ib_dma_unmap_sg). But why wouldn’t
this function be called ib_dma_map_sg() ? The name ib_map_mr_sg()
had me thinking for a moment that this API actually posted the
FASTREG WR, but I see that it doesn’t.


> + if (rc) {
> + pr_err("RPC:   %s: failed to map mr %p rc %d\n",
> + __func__, frmr->fr_mr, rc);
> + return rc;
> + }
> +
>   key = (u8)(mr->rkey & 0x00FF);
>   

[infiniband-diags PATCH] rdma-ndd: fix compiler warnings.

2015-07-22 Thread Ana Guerrero López
This patch fixes -Wformat-security warnings, which are enabled by default
in Debian (and Ubuntu):

src/rdma-ndd.c: In function 'update_node_desc':
src/rdma-ndd.c:149:3: error: format not a string literal and no format 
arguments [-Werror=format-security]
   fprintf(f, new_nd);
   ^
src/rdma-ndd.c: In function 'udev_log_fn':
src/rdma-ndd.c:247:2: error: format not a string literal and no format 
arguments [-Werror=format-security]
  syslog(LOG_ERR, msg);

Signed-off-by: Ana Guerrero López 
---
 src/rdma-ndd.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/src/rdma-ndd.c b/src/rdma-ndd.c
index d40e756..94eb3d7 100644
--- a/src/rdma-ndd.c
+++ b/src/rdma-ndd.c
@@ -145,7 +145,7 @@ static int update_node_desc(const char *device, const char 
*hostname, int force)
syslog(LOG_INFO, "%s: change (%s) -> (%s)\n",
device, nd, new_nd);
rewind(f);
-   fprintf(f, new_nd);
+   fprintf(f, "%s", new_nd);
}
 
rc = 0;
@@ -242,7 +242,7 @@ static void udev_log_fn(struct udev *ud, int priority, 
const char *file, int lin
file, line, fn);
if (off < MSG_MAX-1)
vsnprintf(msg+off, MSG_MAX-off, format, args);
-   syslog(LOG_ERR, msg);
+   syslog(LOG_ERR, "%s", msg);
 }
 
 static void setup_udev(void)
-- 
2.1.4
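
To see why the compiler flags this, consider a node description that happens
to contain a '%' (hypothetical value; not from the patch):

#include <stdio.h>

int main(void)
{
	const char *new_nd = "compute node, 100% provisioned";

	/*
	 * fprintf(f, new_nd) would treat the '%' as the start of a
	 * conversion and read arguments that were never passed: UB.
	 */
	printf("%s\n", new_nd);	/* safe: the string is only data */
	return 0;
}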



RE: [PATCH v3 05/15] xprtrdma: Remove last ib_reg_phys_mr() call site

2015-07-22 Thread Steve Wise


> -Original Message-
> From: linux-nfs-ow...@vger.kernel.org 
> [mailto:linux-nfs-ow...@vger.kernel.org] On Behalf Of Jason Gunthorpe
> Sent: Tuesday, July 21, 2015 5:55 PM
> To: Steve Wise
> Cc: 'Tom Talpey'; 'Chuck Lever'; linux-rdma@vger.kernel.org; 
> linux-...@vger.kernel.org
> Subject: Re: [PATCH v3 05/15] xprtrdma: Remove last ib_reg_phys_mr() call site
> 
> On Tue, Jul 21, 2015 at 05:41:22PM -0500, Steve Wise wrote:
> > On 7/20/2015 5:42 PM, Jason Gunthorpe wrote:
> > >On Mon, Jul 20, 2015 at 05:41:27PM -0500, Steve Wise wrote:
> > >>>B) why bother to check? Are machines with <4GB interesting, and worth
> > >>>supporting a special optimization?
> > >>No, but cxgb3 is still interesting to user applications, and perhaps 
> > >>NFSRDMA using FRMRs.
> > >Doesn't look like the NFS client will work. It requires an all
> > >physical memory lkey for SEND and RECV buffers..
> > >
> > >Jason
> >
> > Looks like cxgb3 supports LOCAL_DMA_LKEY and MEM_MGT_EXTENSIONS so dma mrs
> > aren't required for NFSRDMA:
> >
> > t4:~/linux-2.6/drivers/infiniband/hw/cxgb3 # grep IB_DEVICE_ iwch_provider.c
> > strlcpy(dev->ibdev.name, "cxgb3_%d", IB_DEVICE_NAME_MAX);
> > dev->device_cap_flags = IB_DEVICE_LOCAL_DMA_LKEY |
> > IB_DEVICE_MEM_WINDOW |
> > IB_DEVICE_MEM_MGT_EXTENSIONS;
> 
> Neat. Is the dma_mask set properly (I don't see any set at all)?
> 

iw_cxgb3 isn't a PCI driver.  It sits on top of cxgb3, which is the PCI device
driver and calls pci_set_dma_mask().

> > So cxgb3 can still support NFSRDMA and user verbs w/o get_dma_mr(). I'll
> > submit a patch soon to only support get_dma_mr() if unsigned long is 4
> > bytes...
> 
> So, NFS and RDS seem to be the only iWarp compatible ULPs?
> 
> NFS has worked, and will continue to work with the global lkey.
> 
> RDS looks like it relies on an insecure all physical rkey, so it won't
> work until that is fixed.
> 
> So, I'd just use sizeof(physaddr_t) > 4 as the test. The only people
> that could be impacted are RDS users using distro kernels on machines
> with less than 4G of ram. I somehow doubt there are any of those...
> 
> Jason
