Re: [PATCH] Expire sendonly joins (was Re: [PATCH rdma-rc 0/2] Add mechanism for ipoib neigh state change notifications)
On Sun, Sep 27, 2015 at 7:39 PM, Christoph Lameter wrote:
> Ok but if Erez does not have the time to participate in code development
> and follow up on the patch as issues arise then I would rather rework the
> code so that it is easily understandable and I will continue to follow up
> on the issues with the code as they develop.

As I mentioned to you earlier on this thread, it's currently not a matter of having time to participate: Erez is happily going through the Jewish new year holidays, this time Sukkot, with many people being off for the whole of it until Oct 6.

Personally, up until a few weeks ago I was under the misimpression not only that IPoIB joins as a full member on the sendonly flow as well, but also that such a group can actually be opened under that flow; it turns out neither is the case. Later you said that your production environment was running a very old non-upstream stack that had a knob to somehow make it work, and as a result you didn't realize that something had been going wrong for years w.r.t. gateway functionality with the upstream/inbox code. So we all screwed up here over a time period which is a few orders of magnitude longer than a holiday.

Or.

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH] IPoIB: serialize changing on tx_outstanding
The changing of tx_outstanding should be protected by a spinlock or done with atomic operations. The following log was found in dmesg:

Sep 16 14:20:53 naep11x06 kernel: ib0: queue stopped 1, tx_head 1034733, tx_tail 1034733, tx_outstanding 359 ipoib_sendq_size: 512
Sep 16 14:21:33 naep11x06 kernel: ib0: transmit timeout: latency 9560 msecs
Sep 16 14:21:33 naep11x06 kernel: ib0: queue stopped 1, tx_head 1034854, tx_tail 1034854, tx_outstanding 511 ipoib_sendq_size: 512
Sep 16 14:21:38 naep11x06 kernel: ib0: transmit timeout: latency 14568 msecs
Sep 16 14:21:38 naep11x06 kernel: ib0: queue stopped 1, tx_head 1034854, tx_tail 1034854, tx_outstanding 511 ipoib_sendq_size: 512

And the send queue of ib0 stayed full. When the transmit timeout is reported, the queue is reported as "stopped", yet tx_head and tx_tail point to the same value. I am not able to see the corresponding numbers in ipoib_cm_tx (for CM) because I have no vmcore. Though I am not quite sure this is caused by parallel access to tx_outstanding (send path vs. interrupt path), we really need to serialize the changing of tx_outstanding.

This patch also makes sure tx_outstanding is increased prior to the call to post_send, to avoid a possible decrease happening before the increase in case the increase is scheduled after the interrupt handler runs.
Signed-off-by: Wengang Wang
---
 drivers/infiniband/ulp/ipoib/ipoib_cm.c | 40 +++--
 drivers/infiniband/ulp/ipoib/ipoib_ib.c | 24 ++--
 2 files changed, 50 insertions(+), 14 deletions(-)

diff --git a/drivers/infiniband/ulp/ipoib/ipoib_cm.c b/drivers/infiniband/ulp/ipoib/ipoib_cm.c
index c78dc16..044da94 100644
--- a/drivers/infiniband/ulp/ipoib/ipoib_cm.c
+++ b/drivers/infiniband/ulp/ipoib/ipoib_cm.c
@@ -710,6 +710,7 @@ void ipoib_cm_send(struct net_device *dev, struct sk_buff *skb, struct ipoib_cm_
 	struct ipoib_dev_priv *priv = netdev_priv(dev);
 	struct ipoib_tx_buf *tx_req;
 	int rc;
+	unsigned long flags;

 	if (unlikely(skb->len > tx->mtu)) {
 		ipoib_warn(priv, "packet len %d (> %d) too long to send, dropping\n",
@@ -742,27 +743,36 @@ void ipoib_cm_send(struct net_device *dev, struct sk_buff *skb, struct ipoib_cm_
 	skb_orphan(skb);
 	skb_dst_drop(skb);

+	spin_lock_irqsave(&priv->lock, flags);
+	if (++priv->tx_outstanding == ipoib_sendq_size) {
+		ipoib_dbg(priv, "TX ring 0x%x full, stopping kernel net queue\n",
+			  tx->qp->qp_num);
+		netif_stop_queue(dev);
+	}
+	spin_unlock_irqrestore(&priv->lock, flags);
+
+	if (netif_queue_stopped(dev)) {
+		rc = ib_req_notify_cq(priv->send_cq,
+				      IB_CQ_NEXT_COMP | IB_CQ_REPORT_MISSED_EVENTS);
+		if (rc < 0)
+			ipoib_warn(priv, "request notify on send CQ failed\n");
+		else if (rc)
+			ipoib_send_comp_handler(priv->send_cq, dev);
+	}
+
 	rc = post_send(priv, tx, tx->tx_head & (ipoib_sendq_size - 1), tx_req);
 	if (unlikely(rc)) {
 		ipoib_warn(priv, "post_send failed, error %d\n", rc);
 		++dev->stats.tx_errors;
+		spin_lock_irqsave(&priv->lock, flags);
+		--priv->tx_outstanding;
+		if (netif_queue_stopped(dev))
+			netif_wake_queue(dev);
+		spin_unlock_irqrestore(&priv->lock, flags);
 		ipoib_dma_unmap_tx(priv, tx_req);
 		dev_kfree_skb_any(skb);
 	} else {
 		dev->trans_start = jiffies;
 		++tx->tx_head;
-
-		if (++priv->tx_outstanding == ipoib_sendq_size) {
-			ipoib_dbg(priv, "TX ring 0x%x full, stopping kernel net queue\n",
-				  tx->qp->qp_num);
-			netif_stop_queue(dev);
-			rc = ib_req_notify_cq(priv->send_cq,
-					      IB_CQ_NEXT_COMP | IB_CQ_REPORT_MISSED_EVENTS);
-			if (rc < 0)
-				ipoib_warn(priv, "request notify on send CQ failed\n");
-			else if (rc)
-				ipoib_send_comp_handler(priv->send_cq, dev);
-		}
 	}
 }

@@ -796,10 +806,13 @@ void ipoib_cm_handle_tx_wc(struct net_device *dev, struct ib_wc *wc)
 	netif_tx_lock(dev);

 	++tx->tx_tail;
+
+	spin_lock_irqsave(&priv->lock, flags);
 	if (unlikely(--priv->tx_outstanding == ipoib_sendq_size >> 1) &&
 	    netif_queue_stopped(dev) &&
 	    test_bit(IPOIB_FLAG_ADMIN_UP, &priv->flags))
 		netif_wake_queue(dev);
+	spin_unlock_irqrestore(&priv->lock, flags);

 	if (wc->status != IB_WC_SUCCESS &&
 	    wc->status != IB_WC_WR_FLUSH_ERR) {
@@ -1169,6 +1182,7 @@ static void ipoib_cm_tx_destroy(struct ipoib_cm_tx *p)
 	struct ipoib_dev_priv *priv = netdev_priv
Re: [PATCH] Expire sendonly joins (was Re: [PATCH rdma-rc 0/2] Add mechanism for ipoib neigh state change notifications)
On Sun, 27 Sep 2015, Doug Ledford wrote:
> Currently I'm testing your patch with a couple other patches. I dropped
> the patch of mine that added a module option, and added two different
> patches. However, I'm still waffling on this patch somewhat. In the
> discussions that Jason and I had, I pretty much decided that I would
> like to see all send-only multicast sends be sent immediately with no
> backlog queue. That means that if we had to start a send-only join, or
> if we started one and it hasn't completed yet, we would send the packet
> immediately via the broadcast group versus queueing. Doing so might
> trip this new code up.

If we send immediately, then wouldn't we need to check on each packet whether the multicast group creation has completed? Also, broadcasting could cause an unnecessary reception event on the NICs of machines that have no interest in this traffic. We would like to keep irrelevant traffic off the fabric as much as possible. And a reception event whose traffic has to be thrown out causes jitter in the processing of inbound traffic, which we would also like to avoid.
[PATCH] IB/mlx4: correct order of variables in log
There is a mis-ordering of arguments in an mlx4 log message. Fix it.

Signed-off-by: Wengang Wang
---
 drivers/net/ethernet/mellanox/mlx4/cmd.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/mellanox/mlx4/cmd.c b/drivers/net/ethernet/mellanox/mlx4/cmd.c
index 0a32020..150fbb3 100644
--- a/drivers/net/ethernet/mellanox/mlx4/cmd.c
+++ b/drivers/net/ethernet/mellanox/mlx4/cmd.c
@@ -1010,7 +1010,7 @@ static int mlx4_MAD_IFC_wrapper(struct mlx4_dev *dev, int slave,
 	if (!(smp->mgmt_class == IB_MGMT_CLASS_SUBN_LID_ROUTED &&
 	      smp->method == IB_MGMT_METHOD_GET) || network_view) {
 		mlx4_err(dev, "Unprivileged slave %d is trying to execute a Subnet MGMT MAD, class 0x%x, method 0x%x, view=%s for attr 0x%x. Rejecting\n",
-			 slave, smp->method, smp->mgmt_class,
+			 slave, smp->mgmt_class, smp->method,
 			 network_view ? "Network" : "Host", be16_to_cpu(smp->attr_id));
 		return -EPERM;
--
2.1.0
Re: [PATCH] Expire sendonly joins (was Re: [PATCH rdma-rc 0/2] Add mechanism for ipoib neigh state change notifications)
On 09/27/2015 12:39 PM, Christoph Lameter wrote:
> On Sat, 26 Sep 2015, Or Gerlitz wrote:
>> It's possible that this was done for a reason, so
>> sounds good, so taking into account that Erez is away till Oct 6th, we
>> can probably pick your patch and later, if Erez proves us that there's
>> deep problem there, revert it and take his.
>
> Ok but if Erez does not have the time to participate in code development
> and follow up on the patch as issues arise then I would rather rework the
> code so that it is easily understandable and I will continue to follow up
> on the issues with the code as they develop. This seems to be much more
> important to my company than Mellanox.

Currently I'm testing your patch with a couple of other patches. I dropped the patch of mine that added a module option and added two different patches. However, I'm still waffling on this patch somewhat. In the discussions that Jason and I had, I pretty much decided that I would like to see all send-only multicast sends go out immediately, with no backlog queue. That means that if we had to start a send-only join, or if we started one and it hasn't completed yet, we would send the packet immediately via the broadcast group instead of queueing it. Doing so might trip this new code up.

Right now, we start a join, we queue the packet on the mcast struct, and in join_finish we create an ah, but that's it. We then restart the send by calling dev_queue_xmit on the skb we put in the backlog queue, which takes us back around to mcast_send, where we now have both a mcast and a mcast->ah, so *then* we alloc a new neigh entry, attach this mcast to it, and send using it.
If I change mcast_send so that we start a join but immediately send the packet via the broadcast group, then I would have to change the join_finish routine to alloc a neigh with the right daddr so it can be found in the future. That would be without the benefit of the daddr passed into mcast_send, so missing the ipoib header and having only the raw mgid in the mcmember struct. But we would have to have that neigh struct so that the timeout would work in the case where we had a packet or two that triggered a join but were all sent prior to the join completing, and so we never got a neigh alloc via mcast_send for this mcast group.

--
Doug Ledford
GPG KeyID: 0E572FDD
Re: [PATCH] Expire sendonly joins (was Re: [PATCH rdma-rc 0/2] Add mechanism for ipoib neigh state change notifications)
On Sat, 26 Sep 2015, Or Gerlitz wrote:
> It's possible that this was done for a reason, so
> sounds good, so taking into account that Erez is away till Oct 6th, we
> can probably pick your patch and later, if Erez proves us that there's
> deep problem there, revert it and take his.

Ok, but if Erez does not have the time to participate in code development and follow up on the patch as issues arise, then I would rather rework the code so that it is easily understandable, and I will continue to follow up on the issues with the code as they develop. This seems to be much more important to my company than to Mellanox.
RoCE RDMA CM handling after the demux patches
Hi,

I noticed that the "Demux IB CM requests in the rdma_cm module" patches, while intended to apply only to InfiniBand connections, accidentally affect RoCE connections as well. The patches use a query to IPoIB for the netdev associated with a request; RoCE devices fail to answer that query and therefore reject all requests.

The long term solution would probably be to use the new RoCE GID table cache to find the relevant netdev, but since the patches to expose the netdev through the GID table API haven't been accepted yet, I'm working on a patch that will simply waive the requirement for a valid netdev on RoCE devices.

Regards,
Haggai
Re: [PATCH v2 12/26] xprtrdma: Port to new memory registration API
On 9/25/2015 7:51 PM, Chuck Lever wrote:

On Sep 24, 2015, at 10:34 AM, Sagi Grimberg wrote:

Instead of maintaining a fastreg page list, keep an sg table and convert an array of pages to a sg list. Then call ib_map_mr_sg and construct ib_reg_wr.

Note that the next step would be to have NFS work with sg lists as it maps well to sk_frags (see comment from hch http://marc.info/?l=linux-rdma&m=143677002622296&w=2).

Fwiw, you would need to change tcp_sendpages() first.

One more comment below.

Signed-off-by: Sagi Grimberg
Acked-by: Christoph Hellwig
---
 net/sunrpc/xprtrdma/frwr_ops.c  | 113 +++-
 net/sunrpc/xprtrdma/xprt_rdma.h |   3 +-
 2 files changed, 68 insertions(+), 48 deletions(-)

diff --git a/net/sunrpc/xprtrdma/frwr_ops.c b/net/sunrpc/xprtrdma/frwr_ops.c
index 0d2f46f600b6..b80a82149977 100644
--- a/net/sunrpc/xprtrdma/frwr_ops.c
+++ b/net/sunrpc/xprtrdma/frwr_ops.c
@@ -151,9 +151,13 @@ __frwr_init(struct rpcrdma_mw *r, struct ib_pd *pd, struct ib_device *device,
 	f->fr_mr = ib_alloc_mr(pd, IB_MR_TYPE_MEM_REG, depth);
 	if (IS_ERR(f->fr_mr))
 		goto out_mr_err;
-	f->fr_pgl = ib_alloc_fast_reg_page_list(device, depth);
-	if (IS_ERR(f->fr_pgl))
+
+	f->sg = kcalloc(depth, sizeof(*f->sg), GFP_KERNEL);
+	if (!f->sg)
 		goto out_list_err;
+
+	sg_init_table(f->sg, depth);
+
 	return 0;

 out_mr_err:
@@ -163,7 +167,7 @@ out_mr_err:
 	return rc;

 out_list_err:
-	rc = PTR_ERR(f->fr_pgl);
+	rc = -ENOMEM;
 	dprintk("RPC: %s: ib_alloc_fast_reg_page_list status %i\n",
 		__func__, rc);
 	ib_dereg_mr(f->fr_mr);
@@ -179,7 +183,7 @@ __frwr_release(struct rpcrdma_mw *r)
 	if (rc)
 		dprintk("RPC: %s: ib_dereg_mr status %i\n", __func__, rc);
-	ib_free_fast_reg_page_list(r->r.frmr.fr_pgl);
+	kfree(r->r.frmr.sg);
 }

 static int
@@ -312,14 +316,11 @@ frwr_op_map(struct rpcrdma_xprt *r_xprt, struct rpcrdma_mr_seg *seg,
 	struct rpcrdma_mw *mw;
 	struct rpcrdma_frmr *frmr;
 	struct ib_mr *mr;
-	struct ib_fast_reg_wr fastreg_wr;
+	struct ib_reg_wr reg_wr;
 	struct ib_send_wr *bad_wr;
+	unsigned int dma_nents;
 	u8 key;
-	int len, pageoff;
-	int i, rc;
-	int seg_len;
-	u64 pa;
-	int page_no;
+	int i, rc, len, n;

 	mw = seg1->rl_mw;
 	seg1->rl_mw = NULL;
@@ -332,64 +333,81 @@ frwr_op_map(struct rpcrdma_xprt *r_xprt, struct rpcrdma_mr_seg *seg,
 	} while (mw->r.frmr.fr_state != FRMR_IS_INVALID);
 	frmr = &mw->r.frmr;
 	frmr->fr_state = FRMR_IS_VALID;
+	mr = frmr->fr_mr;

-	pageoff = offset_in_page(seg1->mr_offset);
-	seg1->mr_offset -= pageoff;	/* start of page */
-	seg1->mr_len += pageoff;
-	len = -pageoff;
 	if (nsegs > ia->ri_max_frmr_depth)
 		nsegs = ia->ri_max_frmr_depth;
-	for (page_no = i = 0; i < nsegs;) {
-		rpcrdma_map_one(device, seg, direction);
-		pa = seg->mr_dma;
-		for (seg_len = seg->mr_len; seg_len > 0; seg_len -= PAGE_SIZE) {
-			frmr->fr_pgl->page_list[page_no++] = pa;
-			pa += PAGE_SIZE;
-		}
+
+	for (len = 0, i = 0; i < nsegs;) {
+		if (seg->mr_page)
+			sg_set_page(&frmr->sg[i],
+				    seg->mr_page,
+				    seg->mr_len,
+				    offset_in_page(seg->mr_offset));
+		else
+			sg_set_buf(&frmr->sg[i], seg->mr_offset,
+				   seg->mr_len);
 		len += seg->mr_len;
 		++seg;
 		++i;
+
 		/* Check for holes */
 		if ((i < nsegs && offset_in_page(seg->mr_offset)) ||
 		    offset_in_page((seg-1)->mr_offset + (seg-1)->mr_len))
 			break;
 	}
+	frmr->sg_nents = i;
+
+	dma_nents = ib_dma_map_sg(device, frmr->sg, frmr->sg_nents, direction);
+	if (!dma_nents) {
+		pr_err("RPC: %s: failed to dma map sg %p sg_nents %d\n",
+		       __func__, frmr->sg, frmr->sg_nents);
+		return -ENOMEM;
+	}
+
+	n = ib_map_mr_sg(mr, frmr->sg, frmr->sg_nents, PAGE_SIZE);
+	if (unlikely(n != frmr->sg_nents)) {
+		pr_err("RPC: %s: failed to map mr %p (%d/%d)\n",
+		       __func__, frmr->fr_mr, n, frmr->sg_nents);
+		rc = n < 0 ? n : -EINVAL;
+		goto out_senderr;
+	}
+
 	dprintk("RPC: %s: Using frmr %p to map %d segments (%d bytes)\n",
-		__func__, mw, i, len);
-
-	memset(&fastreg_wr, 0, sizeof(fastreg_wr));
-	fastreg_wr.wr.wr_id = (unsigned long)(void *)mw;
-	fastreg_wr.wr.opcode = IB_WR
Re: [PATCH v2 16/26] IB/srp: Convert to new registration API
On 9/25/2015 7:34 PM, Bart Van Assche wrote:

On 09/24/2015 10:35 AM, Sagi Grimberg wrote:

Remove srp_finish_mapping since no one is calling it.

Please move the srp_finish_mapping() removal into a separate patch.

Moved, thanks.
Re: [PATCH v2 15/26] IB/srp: Split srp_map_sg
On 9/25/2015 7:15 PM, Bart Van Assche wrote:

On 09/24/2015 10:35 AM, Sagi Grimberg wrote:

This is a perparation patch for the new registration API

^ preparation?

Fixed, thanks.

+	WARN_ON_ONCE(!dev->use_fmr);
+
+	if (state->npages == 0)
+		return 0;
+
+	if (state->npages == 1 && target->global_mr) {
+		srp_map_desc(state, state->base_dma_addr, state->dma_len,
+			     target->global_mr->rkey);
+		return 0;
+	}
+

The above is not correct. The npages and dma_len variables have to be reset before returning. How about changing the "return 0" statement into "goto reset_state" and adding a "reset_state" label?

Done, thanks.