Re: [PATCH] Expire sendonly joins (was Re: [PATCH rdma-rc 0/2] Add mechanism for ipoib neigh state change notifications)

2015-09-27 Thread Or Gerlitz
On Sun, Sep 27, 2015 at 7:39 PM, Christoph Lameter  wrote:
> Ok but if Erez does not have the time to participate in code development
> and follow up on the patch as issues arise then I would rather rework the
> code so that it is easily understandable and I will continue to follow up
> on the issues with the code as they develop.

As I mentioned to you earlier on this thread, currently it's not a
matter of having time to participate, but rather of being in the middle
of the Jewish new year holidays, this time Sukkot, with many people
being off for the whole of it until Oct 6.

Personally, up until a few weeks ago, I was under the misimpression
that IPoIB not only joins as a full member on the sendonly flow, but
also that such a group can actually be opened under that flow, and it
turns out it doesn't. Later you said that your production environment
was running a very old non-upstream stack that had a knob to somehow
make it work, and because of that you didn't realize that something had
been going wrong for years w.r.t gateway functionality with
upstream/inbox code. So we all screwed up here, over a time period
which is a few orders of magnitude longer than a holiday.

Or.


[PATCH] IPoIB: serialize changing on tx_outstanding

2015-09-27 Thread Wengang Wang
Changes to tx_outstanding should be protected by a spinlock or made
atomic operations.

The following log is found in dmesg:

Sep 16 14:20:53 naep11x06 kernel: ib0: queue stopped 1, tx_head 1034733, tx_tail 1034733, tx_outstanding 359 ipoib_sendq_size: 512
Sep 16 14:21:33 naep11x06 kernel: ib0: transmit timeout: latency 9560 msecs
Sep 16 14:21:33 naep11x06 kernel: ib0: queue stopped 1, tx_head 1034854, tx_tail 1034854, tx_outstanding 511 ipoib_sendq_size: 512
Sep 16 14:21:38 naep11x06 kernel: ib0: transmit timeout: latency 14568 msecs
Sep 16 14:21:38 naep11x06 kernel: ib0: queue stopped 1, tx_head 1034854, tx_tail 1034854, tx_outstanding 511 ipoib_sendq_size: 512

And the send queue of ib0 stayed full. When the transmit timeout is
reported, the queue is reported as "stopped", yet tx_head and tx_tail
point to the same value. I am not able to see the corresponding numbers
in ipoib_cm_tx (for CM) because I have no vmcore. Though I am not quite
sure it's caused by parallel access to tx_outstanding (send path vs.
interrupt path), we really need to serialize the changes to
tx_outstanding.

This patch also makes sure that tx_outstanding is increased prior to
the call to post_send, to avoid a possible decrease happening before
the corresponding increase in case the increase is scheduled to run
later than the completion interrupt handler.

Signed-off-by: Wengang Wang 
---
 drivers/infiniband/ulp/ipoib/ipoib_cm.c | 40 +++--
 drivers/infiniband/ulp/ipoib/ipoib_ib.c | 24 ++--
 2 files changed, 50 insertions(+), 14 deletions(-)

diff --git a/drivers/infiniband/ulp/ipoib/ipoib_cm.c b/drivers/infiniband/ulp/ipoib/ipoib_cm.c
index c78dc16..044da94 100644
--- a/drivers/infiniband/ulp/ipoib/ipoib_cm.c
+++ b/drivers/infiniband/ulp/ipoib/ipoib_cm.c
@@ -710,6 +710,7 @@ void ipoib_cm_send(struct net_device *dev, struct sk_buff *skb, struct ipoib_cm_
struct ipoib_dev_priv *priv = netdev_priv(dev);
struct ipoib_tx_buf *tx_req;
int rc;
+   unsigned long flags;
 
if (unlikely(skb->len > tx->mtu)) {
ipoib_warn(priv, "packet len %d (> %d) too long to send, dropping\n",
@@ -742,27 +743,36 @@ void ipoib_cm_send(struct net_device *dev, struct sk_buff *skb, struct ipoib_cm_
skb_orphan(skb);
skb_dst_drop(skb);
 
+   spin_lock_irqsave(&priv->lock, flags);
+   if (++priv->tx_outstanding == ipoib_sendq_size) {
+   ipoib_dbg(priv, "TX ring 0x%x full, stopping kernel net queue\n",
+ tx->qp->qp_num);
+   netif_stop_queue(dev);
+   }
+   spin_unlock_irqrestore(&priv->lock, flags);
+   if (netif_queue_stopped(dev)) {
+   rc = ib_req_notify_cq(priv->send_cq,
+   IB_CQ_NEXT_COMP | IB_CQ_REPORT_MISSED_EVENTS);
+   if (rc < 0)
+   ipoib_warn(priv, "request notify on send CQ failed\n");
+   else if (rc)
+   ipoib_send_comp_handler(priv->send_cq, dev);
+   }
+
rc = post_send(priv, tx, tx->tx_head & (ipoib_sendq_size - 1), tx_req);
if (unlikely(rc)) {
ipoib_warn(priv, "post_send failed, error %d\n", rc);
++dev->stats.tx_errors;
+   spin_lock_irqsave(&priv->lock, flags);
+   --priv->tx_outstanding;
+   if (netif_queue_stopped(dev))
+   netif_wake_queue(dev);
+   spin_unlock_irqrestore(&priv->lock, flags);
ipoib_dma_unmap_tx(priv, tx_req);
dev_kfree_skb_any(skb);
} else {
dev->trans_start = jiffies;
++tx->tx_head;
-
-   if (++priv->tx_outstanding == ipoib_sendq_size) {
-   ipoib_dbg(priv, "TX ring 0x%x full, stopping kernel net queue\n",
- tx->qp->qp_num);
-   netif_stop_queue(dev);
-   rc = ib_req_notify_cq(priv->send_cq,
-   IB_CQ_NEXT_COMP | IB_CQ_REPORT_MISSED_EVENTS);
-   if (rc < 0)
-   ipoib_warn(priv, "request notify on send CQ failed\n");
-   else if (rc)
-   ipoib_send_comp_handler(priv->send_cq, dev);
-   }
}
 }
 
@@ -796,10 +806,13 @@ void ipoib_cm_handle_tx_wc(struct net_device *dev, struct ib_wc *wc)
netif_tx_lock(dev);
 
++tx->tx_tail;
+
+   spin_lock_irqsave(&priv->lock, flags);
if (unlikely(--priv->tx_outstanding == ipoib_sendq_size >> 1) &&
netif_queue_stopped(dev) &&
test_bit(IPOIB_FLAG_ADMIN_UP, &priv->flags))
netif_wake_queue(dev);
+   spin_unlock_irqrestore(&priv->lock, flags);
 
if (wc->status != IB_WC_SUCCESS &&
wc->status != IB_WC_WR_FLUSH_ERR) {
@@ -1169,6 +1182,7 @@ static void ipoib_cm_tx_destroy(struct ipoib_cm_tx *p)
struct ipoib_dev_priv *priv = netdev_priv

Re: [PATCH] Expire sendonly joins (was Re: [PATCH rdma-rc 0/2] Add mechanism for ipoib neigh state change notifications)

2015-09-27 Thread Christoph Lameter
On Sun, 27 Sep 2015, Doug Ledford wrote:

> Currently I'm testing your patch with a couple other patches.  I dropped
> the patch of mine that added a module option, and added two different
> patches.  However, I'm still waffling on this patch somewhat.  In the
> discussions that Jason and I had, I pretty much decided that I would
> like to see all send-only multicast sends be sent immediately with no
> backlog queue.  That means that if we had to start a send-only join, or
> if we started one and it hasn't completed yet, we would send the packet
> immediately via the broadcast group versus queueing.  Doing so might
> trip this new code up.

If we send immediately, then we would need to check on each packet
whether the multicast creation has been completed?

Also, broadcast could cause an unnecessary reception event on the NICs
of machines that have no interest in this traffic. We would like to
keep irrelevant traffic off the fabric as much as possible. And a
reception event that requires traffic to be thrown away will cause
jitter in the processing of inbound traffic, which we would also like
to avoid.
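
For illustration, such a per-packet check might look roughly like the
sketch below (hypothetical, not from any posted patch; it assumes
mcast->ah is only set once the join has completed, as in the current
ipoib_mcast_send path):

	/* send via the broadcast group while the sendonly join is still
	 * in flight, instead of queueing the skb on the mcast struct */
	if (!mcast->ah) {
		ipoib_send(dev, skb, priv->broadcast->ah, IB_MULTICAST_QPN);
		return;
	}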


[PATCH] IB/mlx4: correct order of variables in log

2015-09-27 Thread Wengang Wang
There is a mis-ordering of variables in an mlx4 log message. Fix it.

Signed-off-by: Wengang Wang 
---
 drivers/net/ethernet/mellanox/mlx4/cmd.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/mellanox/mlx4/cmd.c b/drivers/net/ethernet/mellanox/mlx4/cmd.c
index 0a32020..150fbb3 100644
--- a/drivers/net/ethernet/mellanox/mlx4/cmd.c
+++ b/drivers/net/ethernet/mellanox/mlx4/cmd.c
@@ -1010,7 +1010,7 @@ static int mlx4_MAD_IFC_wrapper(struct mlx4_dev *dev, int slave,
if (!(smp->mgmt_class == IB_MGMT_CLASS_SUBN_LID_ROUTED &&
  smp->method == IB_MGMT_METHOD_GET) || network_view) {
mlx4_err(dev, "Unprivileged slave %d is trying to execute a Subnet MGMT MAD, class 0x%x, method 0x%x, view=%s for attr 0x%x. Rejecting\n",
-slave, smp->method, smp->mgmt_class,
+slave, smp->mgmt_class, smp->method,
 network_view ? "Network" : "Host",
 be16_to_cpu(smp->attr_id));
return -EPERM;
-- 
2.1.0



Re: [PATCH] Expire sendonly joins (was Re: [PATCH rdma-rc 0/2] Add mechanism for ipoib neigh state change notifications)

2015-09-27 Thread Doug Ledford
On 09/27/2015 12:39 PM, Christoph Lameter wrote:
> 
> On Sat, 26 Sep 2015, Or Gerlitz wrote:
> 
>> It's possible that this was done for a reason, so
> 
>> sounds good, so taking into account that Erez is away till Oct 6th, we
>> can probably pick your patch and later, if Erez proves to us that
>> there's a deep problem there, revert it and take his.
> 
> Ok but if Erez does not have the time to participate in code development
> and follow up on the patch as issues arise then I would rather rework the
> code so that it is easily understandable and I will continue to follow up
> on the issues with the code as they develop. This seems to be much more
> important to my company than to Mellanox.
> 

Currently I'm testing your patch with a couple other patches.  I dropped
the patch of mine that added a module option, and added two different
patches.  However, I'm still waffling on this patch somewhat.  In the
discussions that Jason and I had, I pretty much decided that I would
like to see all send-only multicast sends be sent immediately with no
backlog queue.  That means that if we had to start a send-only join, or
if we started one and it hasn't completed yet, we would send the packet
immediately via the broadcast group versus queueing.  Doing so might
trip this new code up.

Right now, we start a join, we queue the packet on the mcast struct,
and in join_finish we create an ah, but that's it.  We then restart the
send by calling dev_queue_xmit on the skb we put in the backlog queue,
which takes us back around to mcast_send, where we now have both a
mcast and a mcast->ah, so *then* we alloc a new neigh entry, attach
this mcast to it, and send using it.

If I change mcast_send so that we start a join, but immediately send
the packet in the broadcast group, then I would have to change the
join_finish routine to alloc a neigh that has the right daddr so it can
be found in the future, without the benefit of the daddr passed into
mcast_send, so missing the ipoib header and instead only having the raw
mgid in the mcmember struct.  But we would have to have that neigh
struct so that the timeout would work in the case where we had a packet
or two that triggered a join but were all sent prior to the join
completing, and so we never got a neigh allocated via mcast_send for
this mcast group.
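
A sketch of what that join_finish side might look like (hypothetical
code, not from any posted patch: ipoib_neigh_alloc() and the
mcast->neigh_list attach exist upstream, while the helper name and the
rebuild of the 20-byte daddr from the raw mgid are illustrative):

	static void ipoib_mcast_neigh_prealloc(struct ipoib_dev_priv *priv,
					       struct ipoib_mcast *mcast)
	{
		u8 daddr[INFINIBAND_ALEN];
		struct ipoib_neigh *neigh;

		/* hw address: multicast pseudo-QPN, then the raw mgid */
		*(__be32 *)daddr = cpu_to_be32(IB_MULTICAST_QPN);
		memcpy(daddr + 4, mcast->mcmember.mgid.raw,
		       sizeof(union ib_gid));

		neigh = ipoib_neigh_alloc(daddr, priv->dev);
		if (!neigh)
			return;

		kref_get(&mcast->ah->ref);
		neigh->ah = mcast->ah;
		list_add_tail(&neigh->list, &mcast->neigh_list);
	}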

-- 
Doug Ledford 
  GPG KeyID: 0E572FDD






Re: [PATCH] Expire sendonly joins (was Re: [PATCH rdma-rc 0/2] Add mechanism for ipoib neigh state change notifications)

2015-09-27 Thread Christoph Lameter

On Sat, 26 Sep 2015, Or Gerlitz wrote:

> It's possible that this was done for a reason, so

> sounds good, so taking into account that Erez is away till Oct 6th, we
> can probably pick your patch and later, if Erez proves to us that
> there's a deep problem there, revert it and take his.

Ok but if Erez does not have the time to participate in code development
and follow up on the patch as issues arise then I would rather rework the
code so that it is easily understandable and I will continue to follow up
on the issues with the code as they develop. This seems to be much more
important to my company than to Mellanox.



RoCE RDMA CM handling after the demux patches

2015-09-27 Thread Haggai Eran
Hi,

I noticed that the "Demux IB CM requests in the rdma_cm module" patches,
while intended to apply only to InfiniBand connections, accidentally
also affect RoCE connections. The patches query IPoIB for the netdev
associated with a request; RoCE devices fail to answer that query and
therefore reject all requests.

The long term solution would probably be to use the new RoCE GID table
cache to find the relevant netdev, but since the patches to expose the
netdev through the GID table API haven't been accepted yet, I'm working
on a patch that will just waive the requirement for a valid netdev on
RoCE devices.
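
A minimal sketch of such a waiver, assuming the rdma_protocol_roce()
port helper (the helper below is hypothetical; the actual patch may
differ):

	/* only insist on a matching netdev on IB ports, where IPoIB can
	 * answer the query; waive the requirement on RoCE ports */
	static bool cma_require_netdev(struct ib_device *device, u8 port_num)
	{
		return !rdma_protocol_roce(device, port_num);
	}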

Regards,
Haggai


Re: [PATCH v2 12/26] xprtrdma: Port to new memory registration API

2015-09-27 Thread Sagi Grimberg

On 9/25/2015 7:51 PM, Chuck Lever wrote:



On Sep 24, 2015, at 10:34 AM, Sagi Grimberg  wrote:

Instead of maintaining a fastreg page list, keep an sg table
and convert an array of pages to a sg list. Then call ib_map_mr_sg
and construct ib_reg_wr.

Note that the next step would be to have NFS work with sg lists
as it maps well to sk_frags (see comment from hch
http://marc.info/?l=linux-rdma&m=143677002622296&w=2).


Fwiw, you would need to change tcp_sendpages() first.

One more comment below.



Signed-off-by: Sagi Grimberg 
Acked-by: Christoph Hellwig 
---
net/sunrpc/xprtrdma/frwr_ops.c  | 113 +++-
net/sunrpc/xprtrdma/xprt_rdma.h |   3 +-
2 files changed, 68 insertions(+), 48 deletions(-)

diff --git a/net/sunrpc/xprtrdma/frwr_ops.c b/net/sunrpc/xprtrdma/frwr_ops.c
index 0d2f46f600b6..b80a82149977 100644
--- a/net/sunrpc/xprtrdma/frwr_ops.c
+++ b/net/sunrpc/xprtrdma/frwr_ops.c
@@ -151,9 +151,13 @@ __frwr_init(struct rpcrdma_mw *r, struct ib_pd *pd, struct ib_device *device,
f->fr_mr = ib_alloc_mr(pd, IB_MR_TYPE_MEM_REG, depth);
if (IS_ERR(f->fr_mr))
goto out_mr_err;
-   f->fr_pgl = ib_alloc_fast_reg_page_list(device, depth);
-   if (IS_ERR(f->fr_pgl))
+
+   f->sg = kcalloc(depth, sizeof(*f->sg), GFP_KERNEL);
+   if (!f->sg)
goto out_list_err;
+
+   sg_init_table(f->sg, depth);
+
return 0;

out_mr_err:
@@ -163,7 +167,7 @@ out_mr_err:
return rc;

out_list_err:
-   rc = PTR_ERR(f->fr_pgl);
+   rc = -ENOMEM;
dprintk("RPC:   %s: ib_alloc_fast_reg_page_list status %i\n",
__func__, rc);
ib_dereg_mr(f->fr_mr);
@@ -179,7 +183,7 @@ __frwr_release(struct rpcrdma_mw *r)
if (rc)
dprintk("RPC:   %s: ib_dereg_mr status %i\n",
__func__, rc);
-   ib_free_fast_reg_page_list(r->r.frmr.fr_pgl);
+   kfree(r->r.frmr.sg);
}

static int
@@ -312,14 +316,11 @@ frwr_op_map(struct rpcrdma_xprt *r_xprt, struct rpcrdma_mr_seg *seg,
struct rpcrdma_mw *mw;
struct rpcrdma_frmr *frmr;
struct ib_mr *mr;
-   struct ib_fast_reg_wr fastreg_wr;
+   struct ib_reg_wr reg_wr;
struct ib_send_wr *bad_wr;
+   unsigned int dma_nents;
u8 key;
-   int len, pageoff;
-   int i, rc;
-   int seg_len;
-   u64 pa;
-   int page_no;
+   int i, rc, len, n;

mw = seg1->rl_mw;
seg1->rl_mw = NULL;
@@ -332,64 +333,81 @@ frwr_op_map(struct rpcrdma_xprt *r_xprt, struct rpcrdma_mr_seg *seg,
} while (mw->r.frmr.fr_state != FRMR_IS_INVALID);
frmr = &mw->r.frmr;
frmr->fr_state = FRMR_IS_VALID;
+   mr = frmr->fr_mr;

-   pageoff = offset_in_page(seg1->mr_offset);
-   seg1->mr_offset -= pageoff;  /* start of page */
-   seg1->mr_len += pageoff;
-   len = -pageoff;
if (nsegs > ia->ri_max_frmr_depth)
nsegs = ia->ri_max_frmr_depth;

-   for (page_no = i = 0; i < nsegs;) {
-   rpcrdma_map_one(device, seg, direction);
-   pa = seg->mr_dma;
-   for (seg_len = seg->mr_len; seg_len > 0; seg_len -= PAGE_SIZE) {
-   frmr->fr_pgl->page_list[page_no++] = pa;
-   pa += PAGE_SIZE;
-   }
+   for (len = 0, i = 0; i < nsegs;) {
+   if (seg->mr_page)
+   sg_set_page(&frmr->sg[i],
+   seg->mr_page,
+   seg->mr_len,
+   offset_in_page(seg->mr_offset));
+   else
+   sg_set_buf(&frmr->sg[i], seg->mr_offset,
+  seg->mr_len);
+
len += seg->mr_len;
++seg;
++i;
+
/* Check for holes */
if ((i < nsegs && offset_in_page(seg->mr_offset)) ||
offset_in_page((seg-1)->mr_offset + (seg-1)->mr_len))
break;
}
+   frmr->sg_nents = i;
+
+   dma_nents = ib_dma_map_sg(device, frmr->sg, frmr->sg_nents, direction);
+   if (!dma_nents) {
+   pr_err("RPC:   %s: failed to dma map sg %p sg_nents %d\n",
+  __func__, frmr->sg, frmr->sg_nents);
+   return -ENOMEM;
+   }
+
+   n = ib_map_mr_sg(mr, frmr->sg, frmr->sg_nents, PAGE_SIZE);
+   if (unlikely(n != frmr->sg_nents)) {
+   pr_err("RPC:   %s: failed to map mr %p (%d/%d)\n",
+  __func__, frmr->fr_mr, n, frmr->sg_nents);
+   rc = n < 0 ? n : -EINVAL;
+   goto out_senderr;
+   }
+
dprintk("RPC:   %s: Using frmr %p to map %d segments (%d bytes)\n",
-   __func__, mw, i, len);
-
-   memset(&fastreg_wr, 0, sizeof(fastreg_wr));
-   fastreg_wr.wr.wr_id = (unsigned long)(void *)mw;
-   fastreg_wr.wr.opcode = IB_WR
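
[The archived message is cut off above. For reference, the general
shape of building a registration WR with the ib_reg_wr API this series
introduces -- a sketch, not the missing patch text:]

	reg_wr.wr.next = NULL;
	reg_wr.wr.opcode = IB_WR_REG_MR;
	reg_wr.wr.wr_id = (unsigned long)(void *)mw;
	reg_wr.wr.num_sge = 0;
	reg_wr.wr.send_flags = 0;
	reg_wr.mr = mr;
	reg_wr.key = mr->rkey;
	reg_wr.access = IB_ACCESS_LOCAL_WRITE |
			IB_ACCESS_REMOTE_READ |
			IB_ACCESS_REMOTE_WRITE;

	rc = ib_post_send(ia->ri_id->qp, &reg_wr.wr, &bad_wr);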

Re: [PATCH v2 16/26] IB/srp: Convert to new registration API

2015-09-27 Thread Sagi Grimberg

On 9/25/2015 7:34 PM, Bart Van Assche wrote:

On 09/24/2015 10:35 AM, Sagi Grimberg wrote:

Remove srp_finish_mapping since no one is calling it.


Please move the srp_finish_mapping() removal into a separate patch.


Moved, thanks.


Re: [PATCH v2 15/26] IB/srp: Split srp_map_sg

2015-09-27 Thread Sagi Grimberg

On 9/25/2015 7:15 PM, Bart Van Assche wrote:

On 09/24/2015 10:35 AM, Sagi Grimberg wrote:

This is a perparation patch for the new registration API

 ^ preparation ?


Fixed, thanks.


+WARN_ON_ONCE(!dev->use_fmr);
+
+if (state->npages == 0)
+return 0;
+
+if (state->npages == 1 && target->global_mr) {
+srp_map_desc(state, state->base_dma_addr, state->dma_len,
+ target->global_mr->rkey);
+return 0;
+}
+


The above is not correct. The npages and dma_len variables have to be
reset before returning. How about changing the "return 0" statement into
"goto reset_state" and adding a "reset_state" label?


Done, thanks.
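
For reference, a sketch of the suggested restructuring (illustrative;
names follow the quoted snippet, and the FMR mapping step in the middle
stands in for whatever the final patch does there):

	if (state->npages == 1 && target->global_mr) {
		srp_map_desc(state, state->base_dma_addr, state->dma_len,
			     target->global_mr->rkey);
		goto reset_state;
	}

	ret = srp_map_finish_fmr(state, ch);
	if (ret)
		return ret;

reset_state:
	state->npages = 0;
	state->dma_len = 0;
	return 0;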