[PATCHv2 net-next 23/31] iw_cxgb4: drop RX_DATA packets if the endpoint is gone.

2014-03-02 Thread Hariprasad Shenai
From: Steve Wise 

Signed-off-by: Steve Wise 
---
 drivers/infiniband/hw/cxgb4/cm.c |2 ++
 1 files changed, 2 insertions(+), 0 deletions(-)

diff --git a/drivers/infiniband/hw/cxgb4/cm.c b/drivers/infiniband/hw/cxgb4/cm.c
index cbeaa58..2930e91 100644
--- a/drivers/infiniband/hw/cxgb4/cm.c
+++ b/drivers/infiniband/hw/cxgb4/cm.c
@@ -1544,6 +1544,8 @@ static int rx_data(struct c4iw_dev *dev, struct sk_buff 
*skb)
__u8 status = hdr->status;
 
ep = lookup_tid(t, tid);
+   if (!ep)
+   return 0;
PDBG("%s ep %p tid %u dlen %u\n", __func__, ep, ep->hwtid, dlen);
skb_pull(skb, sizeof(*hdr));
skb_trim(skb, dlen);
-- 
1.7.1



[PATCHv2 net-next 26/31] iw_cxgb4: rmb() after reading valid gen bit.

2014-03-02 Thread Hariprasad Shenai
From: Steve Wise 

Some HW platforms can reorder read operations, so we must rmb() after
we see a valid gen bit in a CQE but before we read any other fields
from the CQE.
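
As a hedged sketch of the pattern being enforced (kernel-style C, not a
standalone unit; t4_valid_cqe() and the CQE array come from the driver):

    if (t4_valid_cqe(cq, &cq->queue[cq->cidx])) {
        /*
         * The gen bit says HW has published this slot, but a weakly
         * ordered CPU may already have speculatively read stale CQE
         * payload.  rmb() orders the gen-bit read before any
         * subsequent reads of the other CQE fields.
         */
        rmb();
        *cqe = &cq->queue[cq->cidx];    /* safe to parse now */
        ret = 0;
    }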

Signed-off-by: Steve Wise 
---
 drivers/infiniband/hw/cxgb4/t4.h |1 +
 1 files changed, 1 insertions(+), 0 deletions(-)

diff --git a/drivers/infiniband/hw/cxgb4/t4.h b/drivers/infiniband/hw/cxgb4/t4.h
index edab0e9..67cd09e 100644
--- a/drivers/infiniband/hw/cxgb4/t4.h
+++ b/drivers/infiniband/hw/cxgb4/t4.h
@@ -622,6 +622,7 @@ static inline int t4_next_hw_cqe(struct t4_cq *cq, struct 
t4_cqe **cqe)
printk(KERN_ERR MOD "cq overflow cqid %u\n", cq->cqid);
BUG_ON(1);
} else if (t4_valid_cqe(cq, &cq->queue[cq->cidx])) {
+   rmb();
*cqe = &cq->queue[cq->cidx];
ret = 0;
} else
-- 
1.7.1



[PATCHv2 net-next 25/31] iw_cxgb4: endpoint timeout fixes.

2014-03-02 Thread Hariprasad Shenai
From: Steve Wise 

1) Timed-out endpoint processing can be starved. If there is a continual
flow of CPL messages into the driver, the endpoint timeout processing
can be starved.  This condition exposed the other bugs below.

Solution: In process_work(), call process_timedout_eps() after each CPL
is processed.

2) Connection events can be processed even though the endpoint is on
the timeout list.  If the endpoint is scheduled for timeout processing,
then we must ignore MPA Start Requests and Replies.

Solution: Change stop_ep_timer() to return 1 if the ep has already been
queued for timeout processing.  All the callers of stop_ep_timer() need
to check this and act accordingly.  There are just a few cases where
the caller needs to do something different if stop_ep_timer() returns 1:

1) in process_mpa_reply(), ignore the reply; process_timeout()
will abort the connection.

2) in process_mpa_request(), ignore the request; process_timeout()
will abort the connection.

It is ok for callers of stop_ep_timer() to abort the connection since
that will leave the state in ABORTING or DEAD, and process_timeout()
now ignores timeouts when the ep is in these states.

3) Double insertion on the timeout list.  Since the endpoint timers are
used for connection setup and teardown, we need to guard against the
possibility that an endpoint is already on the timeout list.  This is
a rare condition, only seen under heavy load and in the presence of
the above 2 bugs.

Solution: In ep_timeout(), don't queue the endpoint if it is already on
the queue.
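
A minimal sketch of the resulting stop_ep_timer() contract and the caller
pattern (condensed from the diff below):

    /*
     * Returns 0 if a pending timer was stopped (and its ep reference
     * dropped), 1 if the timer had already fired and the ep is queued
     * for timeout processing.  Callers seeing 1 must bail out and let
     * process_timeout() abort the connection.
     */
    static int stop_ep_timer(struct c4iw_ep *ep)
    {
        del_timer_sync(&ep->timer);
        if (!test_and_set_bit(TIMEOUT, &ep->com.flags)) {
            c4iw_put_ep(&ep->com);
            return 0;
        }
        return 1;
    }

    /* e.g. in process_mpa_reply()/process_mpa_request(): */
    if (stop_ep_timer(ep))
        return;    /* already timed out; process_timeout() aborts */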

Signed-off-by: Steve Wise 
---
 drivers/infiniband/hw/cxgb4/cm.c |   89 --
 1 files changed, 56 insertions(+), 33 deletions(-)

diff --git a/drivers/infiniband/hw/cxgb4/cm.c b/drivers/infiniband/hw/cxgb4/cm.c
index f30ed32..bb48510 100644
--- a/drivers/infiniband/hw/cxgb4/cm.c
+++ b/drivers/infiniband/hw/cxgb4/cm.c
@@ -178,12 +178,15 @@ static void start_ep_timer(struct c4iw_ep *ep)
add_timer(&ep->timer);
 }
 
-static void stop_ep_timer(struct c4iw_ep *ep)
+static int stop_ep_timer(struct c4iw_ep *ep)
 {
PDBG("%s ep %p stopping\n", __func__, ep);
del_timer_sync(&ep->timer);
-   if (!test_and_set_bit(TIMEOUT, &ep->com.flags))
+   if (!test_and_set_bit(TIMEOUT, &ep->com.flags)) {
c4iw_put_ep(&ep->com);
+   return 0;
+   }
+   return 1;
 }
 
 static int c4iw_l2t_send(struct c4iw_rdev *rdev, struct sk_buff *skb,
@@ -1188,12 +1191,11 @@ static void process_mpa_reply(struct c4iw_ep *ep, 
struct sk_buff *skb)
PDBG("%s ep %p tid %u\n", __func__, ep, ep->hwtid);
 
/*
-* Stop mpa timer.  If it expired, then the state has
-* changed and we bail since ep_timeout already aborted
-* the connection.
+* Stop mpa timer.  If it expired, then
+* we ignore the MPA reply.  process_timeout()
+* will abort the connection.
 */
-   stop_ep_timer(ep);
-   if (ep->com.state != MPA_REQ_SENT)
+   if (stop_ep_timer(ep))
return;
 
/*
@@ -1398,15 +1400,12 @@ static void process_mpa_request(struct c4iw_ep *ep, 
struct sk_buff *skb)
 
PDBG("%s ep %p tid %u\n", __func__, ep, ep->hwtid);
 
-   if (ep->com.state != MPA_REQ_WAIT)
-   return;
-
/*
 * If we get more than the supported amount of private data
 * then we must fail this connection.
 */
if (ep->mpa_pkt_len + skb->len > sizeof(ep->mpa_pkt)) {
-   stop_ep_timer(ep);
+   (void)stop_ep_timer(ep);
abort_connection(ep, skb, GFP_KERNEL);
return;
}
@@ -1436,13 +1435,13 @@ static void process_mpa_request(struct c4iw_ep *ep, 
struct sk_buff *skb)
if (mpa->revision > mpa_rev) {
printk(KERN_ERR MOD "%s MPA version mismatch. Local = %d,"
   " Received = %d\n", __func__, mpa_rev, mpa->revision);
-   stop_ep_timer(ep);
+   (void)stop_ep_timer(ep);
abort_connection(ep, skb, GFP_KERNEL);
return;
}
 
if (memcmp(mpa->key, MPA_KEY_REQ, sizeof(mpa->key))) {
-   stop_ep_timer(ep);
+   (void)stop_ep_timer(ep);
abort_connection(ep, skb, GFP_KERNEL);
return;
}
@@ -1453,7 +1452,7 @@ static void process_mpa_request(struct c4iw_ep *ep, 
struct sk_buff *skb)
 * Fail if there's too much private data.
 */
if (plen > MPA_MAX_PRIVATE_DATA) {
-   stop_ep_timer(ep);
+   (void)stop_ep_timer(ep);
abort_connection(ep, skb, GFP_KERNEL);
return;
}
@@ -1462,7 +1461,7 @@ static void process_mpa_request(struct c4iw_ep *ep, 
struct sk_buff *skb)
 * If plen does not account for pkt size
 */
if (ep->mpa_pkt_len > (sizeof(*mpa) + plen)) {
-   stop_ep_timer(ep);
+   

[PATCHv2 net-next 31/31] Revert "cxgb4: Don't assume LSO only uses SGL path in t4_eth_xmit()"

2014-03-02 Thread Hariprasad Shenai
Commit 0034b29 (cxgb4: Don't assume LSO only uses SGL path in t4_eth_xmit())
introduced a regression that causes a chip hang. This change needs more
debugging and more work, so revert it for now.

Signed-off-by: Hariprasad Shenai 
---
 drivers/net/ethernet/chelsio/cxgb4/sge.c |   32 +
 1 files changed, 10 insertions(+), 22 deletions(-)

diff --git a/drivers/net/ethernet/chelsio/cxgb4/sge.c 
b/drivers/net/ethernet/chelsio/cxgb4/sge.c
index c2e142d..25f8981 100644
--- a/drivers/net/ethernet/chelsio/cxgb4/sge.c
+++ b/drivers/net/ethernet/chelsio/cxgb4/sge.c
@@ -716,17 +716,11 @@ static inline unsigned int flits_to_desc(unsigned int n)
  * @skb: the packet
  *
  * Returns whether an Ethernet packet is small enough to fit as
- * immediate data. Return value corresponds to headroom required.
+ * immediate data.
  */
 static inline int is_eth_imm(const struct sk_buff *skb)
 {
-   int hdrlen = skb_shinfo(skb)->gso_size ?
-   sizeof(struct cpl_tx_pkt_lso_core) : 0;
-
-   hdrlen += sizeof(struct cpl_tx_pkt);
-   if (skb->len <= MAX_IMM_TX_PKT_LEN - hdrlen)
-   return hdrlen;
-   return 0;
+   return skb->len <= MAX_IMM_TX_PKT_LEN - sizeof(struct cpl_tx_pkt);
 }
 
 /**
@@ -739,10 +733,9 @@ static inline int is_eth_imm(const struct sk_buff *skb)
 static inline unsigned int calc_tx_flits(const struct sk_buff *skb)
 {
unsigned int flits;
-   int hdrlen = is_eth_imm(skb);
 
-   if (hdrlen)
-   return DIV_ROUND_UP(skb->len + hdrlen, sizeof(__be64));
+   if (is_eth_imm(skb))
+   return DIV_ROUND_UP(skb->len + sizeof(struct cpl_tx_pkt), 8);
 
flits = sgl_len(skb_shinfo(skb)->nr_frags + 1) + 4;
if (skb_shinfo(skb)->gso_size)
@@ -990,7 +983,6 @@ static inline void txq_advance(struct sge_txq *q, unsigned 
int n)
  */
 netdev_tx_t t4_eth_xmit(struct sk_buff *skb, struct net_device *dev)
 {
-   int len;
u32 wr_mid;
u64 cntrl, *end;
int qidx, credits;
@@ -1002,7 +994,6 @@ netdev_tx_t t4_eth_xmit(struct sk_buff *skb, struct 
net_device *dev)
struct cpl_tx_pkt_core *cpl;
const struct skb_shared_info *ssi;
dma_addr_t addr[MAX_SKB_FRAGS + 1];
-   bool immediate = false;
 
/*
 * The chip min packet length is 10 octets but play safe and reject
@@ -1032,10 +1023,7 @@ out_free:dev_kfree_skb(skb);
return NETDEV_TX_BUSY;
}
 
-   if (is_eth_imm(skb))
-   immediate = true;
-
-   if (!immediate &&
+   if (!is_eth_imm(skb) &&
unlikely(map_skb(adap->pdev_dev, skb, addr) < 0)) {
q->mapping_err++;
goto out_free;
@@ -1052,8 +1040,6 @@ out_free: dev_kfree_skb(skb);
wr->r3 = cpu_to_be64(0);
end = (u64 *)wr + flits;
 
-   len = immediate ? skb->len : 0;
-   len += sizeof(*cpl);
ssi = skb_shinfo(skb);
if (ssi->gso_size) {
struct cpl_tx_pkt_lso *lso = (void *)wr;
@@ -1061,9 +1047,8 @@ out_free: dev_kfree_skb(skb);
int l3hdr_len = skb_network_header_len(skb);
int eth_xtra_len = skb_network_offset(skb) - ETH_HLEN;
 
-   len += sizeof(*lso);
wr->op_immdlen = htonl(FW_WR_OP(FW_ETH_TX_PKT_WR) |
-  FW_WR_IMMDLEN(len));
+  FW_WR_IMMDLEN(sizeof(*lso)));
lso->c.lso_ctrl = htonl(LSO_OPCODE(CPL_TX_PKT_LSO) |
LSO_FIRST_SLICE | LSO_LAST_SLICE |
LSO_IPV6(v6) |
@@ -1081,6 +1066,9 @@ out_free: dev_kfree_skb(skb);
q->tso++;
q->tx_cso += ssi->gso_segs;
} else {
+   int len;
+
+   len = is_eth_imm(skb) ? skb->len + sizeof(*cpl) : sizeof(*cpl);
wr->op_immdlen = htonl(FW_WR_OP(FW_ETH_TX_PKT_WR) |
   FW_WR_IMMDLEN(len));
cpl = (void *)(wr + 1);
@@ -1102,7 +1090,7 @@ out_free: dev_kfree_skb(skb);
cpl->len = htons(skb->len);
cpl->ctrl1 = cpu_to_be64(cntrl);
 
-   if (immediate) {
+   if (is_eth_imm(skb)) {
inline_tx_skb(skb, &q->q, cpl + 1);
dev_kfree_skb(skb);
} else {
-- 
1.7.1



[PATCHv2 net-next 27/31] iw_cxgb4: wc_wmb() needed after DB writes.

2014-03-02 Thread Hariprasad Shenai
From: Steve Wise 

Need to do an sfence after both the WC and regular PIDX DB write.
Otherwise the host might reorder things and cause work request corruption
(seen with NFSRDMA).
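
Schematically (illustrative sketch; the WC-path condition is abbreviated,
wc_wmb() is the driver's write-combining fence):

    if (t5) {
        if (use_wc_path)                        /* condition abbreviated */
            pio_copy(wq->sq.udb + 7, (void *)wqe);  /* WC write of the WQE */
        else
            writel(PIDX_T5(inc), wq->sq.udb);       /* regular PIDX doorbell */
        /*
         * Fence *both* paths: without it the host can reorder the
         * doorbell/WQE stores and corrupt work requests.
         */
        wc_wmb();
        return;
    }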

Signed-off-by: Steve Wise 
---
 drivers/infiniband/hw/cxgb4/t4.h |4 ++--
 1 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/infiniband/hw/cxgb4/t4.h b/drivers/infiniband/hw/cxgb4/t4.h
index 67cd09e..ace3154 100644
--- a/drivers/infiniband/hw/cxgb4/t4.h
+++ b/drivers/infiniband/hw/cxgb4/t4.h
@@ -470,12 +470,12 @@ static inline void t4_ring_sq_db(struct t4_wq *wq, u16 
inc, u8 t5,
PDBG("%s: WC wq->sq.pidx = %d\n",
 __func__, wq->sq.pidx);
pio_copy(wq->sq.udb + 7, (void *)wqe);
-   wc_wmb();
} else {
PDBG("%s: DB wq->sq.pidx = %d\n",
 __func__, wq->sq.pidx);
writel(PIDX_T5(inc), wq->sq.udb);
}
+   wc_wmb();
return;
}
writel(QID(wq->sq.qid) | PIDX(inc), wq->db);
@@ -490,12 +490,12 @@ static inline void t4_ring_rq_db(struct t4_wq *wq, u16 
inc, u8 t5,
PDBG("%s: WC wq->rq.pidx = %d\n",
 __func__, wq->rq.pidx);
pio_copy(wq->rq.udb + 7, (void *)wqe);
-   wc_wmb();
} else {
PDBG("%s: DB wq->rq.pidx = %d\n",
 __func__, wq->rq.pidx);
writel(PIDX_T5(inc), wq->rq.udb);
}
+   wc_wmb();
return;
}
writel(QID(wq->rq.qid) | PIDX(inc), wq->db);
-- 
1.7.1



[PATCHv2 net-next 17/31] iw_cxgb4: fix possible memory leak in RX_PKT processing.

2014-03-02 Thread Hariprasad Shenai
From: Steve Wise 

If cxgb4_ofld_send() returns < 0, then send_fw_pass_open_req() must
free the request skb and the saved skb with the tcp header.
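
The fixed error path, with the two skbs labelled (mirrors the diff below):

    ret = cxgb4_ofld_send(dev->rdev.lldi.ports[0], req_skb);
    if (ret < 0) {
        pr_err("%s - cxgb4_ofld_send error %d - dropping\n",
               __func__, ret);
        kfree_skb(skb);        /* the saved skb holding the TCP header */
        kfree_skb(req_skb);    /* the fw_ofld_connection_wr we just built */
    }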

Signed-off-by: Steve Wise 
---
 drivers/infiniband/hw/cxgb4/cm.c |9 -
 1 files changed, 8 insertions(+), 1 deletions(-)

diff --git a/drivers/infiniband/hw/cxgb4/cm.c b/drivers/infiniband/hw/cxgb4/cm.c
index d21ac15..c5c42d0 100644
--- a/drivers/infiniband/hw/cxgb4/cm.c
+++ b/drivers/infiniband/hw/cxgb4/cm.c
@@ -3205,6 +3205,7 @@ static void send_fw_pass_open_req(struct c4iw_dev *dev, 
struct sk_buff *skb,
struct sk_buff *req_skb;
struct fw_ofld_connection_wr *req;
struct cpl_pass_accept_req *cpl = cplhdr(skb);
+   int ret;
 
req_skb = alloc_skb(sizeof(struct fw_ofld_connection_wr), GFP_KERNEL);
req = (struct fw_ofld_connection_wr *)__skb_put(req_skb, sizeof(*req));
@@ -3241,7 +3242,13 @@ static void send_fw_pass_open_req(struct c4iw_dev *dev, 
struct sk_buff *skb,
req->cookie = (unsigned long)skb;
 
set_wr_txq(req_skb, CPL_PRIORITY_CONTROL, port_id);
-   cxgb4_ofld_send(dev->rdev.lldi.ports[0], req_skb);
+   ret = cxgb4_ofld_send(dev->rdev.lldi.ports[0], req_skb);
+   if (ret < 0) {
+   pr_err("%s - cxgb4_ofld_send error %d - dropping\n", __func__,
+  ret);
+   kfree_skb(skb);
+   kfree_skb(req_skb);
+   }
 }
 
 /*
-- 
1.7.1



[PATCHv2 net-next 20/31] iw_cxgb4: adjust tcp snd/rcv window based on link speed.

2014-03-02 Thread Hariprasad Shenai
From: Steve Wise 

40G devices need bigger windows, so default 40G devices to a 512KB send
window and a 1024KB receive window.

Fixed a bug that shows up with recv window sizes that exceed the size of
the RCV_BUFSIZ field in opt0 (>= 1024K).  If the recv window exceeds
this, then we specify the max possible in opt0 and add the rest in via
RX_DATA_ACK credits.

Added a module option named adjust_win, defaulting to 1, that allows
disabling the 40G window bump.  This lets a user specify the exact
default window sizes via the snd_win and rcv_win module options.
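
The opt0/RX_DATA_ACK split boils down to the following arithmetic (hedged
sketch; split_rcv_win() is a made-up helper, RCV_BUFSIZ_MASK bounds the
opt0 field, which is in 1KB units):

    /* Returns the leftover bytes to grant later via RX_DATA_ACK credits. */
    static u32 split_rcv_win(u32 rcv_win_bytes, u32 *opt0_win_1k)
    {
        u32 win = rcv_win_bytes >> 10;          /* bytes -> 1KB units */

        if (win > RCV_BUFSIZ_MASK)
            win = RCV_BUFSIZ_MASK;              /* cap at what opt0 can encode */
        *opt0_win_1k = win;

        if (rcv_win_bytes > RCV_BUFSIZ_MASK * 1024)
            return rcv_win_bytes - RCV_BUFSIZ_MASK * 1024;
        return 0;
    }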

Signed-off-by: Steve Wise 
---
 drivers/infiniband/hw/cxgb4/cm.c|   63 +--
 drivers/infiniband/hw/cxgb4/iw_cxgb4.h  |2 +
 drivers/net/ethernet/chelsio/cxgb4/t4_msg.h |1 +
 3 files changed, 62 insertions(+), 4 deletions(-)

diff --git a/drivers/infiniband/hw/cxgb4/cm.c b/drivers/infiniband/hw/cxgb4/cm.c
index 452ae3a..81fbc6e 100644
--- a/drivers/infiniband/hw/cxgb4/cm.c
+++ b/drivers/infiniband/hw/cxgb4/cm.c
@@ -134,6 +134,11 @@ static int snd_win = 128 * 1024;
 module_param(snd_win, int, 0644);
 MODULE_PARM_DESC(snd_win, "TCP send window in bytes (default=128KB)");
 
+static int adjust_win = 1;
+module_param(adjust_win, int, 0644);
+MODULE_PARM_DESC(adjust_win,
+"Adjust TCP window based on link speed (default=1)");
+
 static struct workqueue_struct *workq;
 
 static struct sk_buff_head rxq;
@@ -465,7 +470,7 @@ static void send_flowc(struct c4iw_ep *ep, struct sk_buff 
*skb)
flowc->mnemval[5].mnemonic = FW_FLOWC_MNEM_RCVNXT;
flowc->mnemval[5].val = cpu_to_be32(ep->rcv_seq);
flowc->mnemval[6].mnemonic = FW_FLOWC_MNEM_SNDBUF;
-   flowc->mnemval[6].val = cpu_to_be32(snd_win);
+   flowc->mnemval[6].val = cpu_to_be32(ep->snd_win);
flowc->mnemval[7].mnemonic = FW_FLOWC_MNEM_MSS;
flowc->mnemval[7].val = cpu_to_be32(ep->emss);
/* Pad WR to 16 byte boundary */
@@ -547,6 +552,7 @@ static int send_connect(struct c4iw_ep *ep)
struct sockaddr_in *ra = (struct sockaddr_in *)&ep->com.remote_addr;
struct sockaddr_in6 *la6 = (struct sockaddr_in6 *)&ep->com.local_addr;
struct sockaddr_in6 *ra6 = (struct sockaddr_in6 *)&ep->com.remote_addr;
+   int win;
 
wrlen = (ep->com.remote_addr.ss_family == AF_INET) ?
roundup(sizev4, 16) :
@@ -564,6 +570,15 @@ static int send_connect(struct c4iw_ep *ep)
 
cxgb4_best_mtu(ep->com.dev->rdev.lldi.mtus, ep->mtu, &mtu_idx);
wscale = compute_wscale(rcv_win);
+
+   /*
+* Specify the largest window that will fit in opt0. The
+* remainder will be specified in the rx_data_ack.
+*/
+   win = ep->rcv_win >> 10;
+   if (win > RCV_BUFSIZ_MASK)
+   win = RCV_BUFSIZ_MASK;
+
opt0 = (nocong ? NO_CONG(1) : 0) |
   KEEP_ALIVE(1) |
   DELACK(1) |
@@ -574,7 +589,7 @@ static int send_connect(struct c4iw_ep *ep)
   SMAC_SEL(ep->smac_idx) |
   DSCP(ep->tos) |
   ULP_MODE(ULP_MODE_TCPDDP) |
-  RCV_BUFSIZ(rcv_win>>10);
+  RCV_BUFSIZ(win);
opt2 = RX_CHANNEL(0) |
   CCTRL_ECN(enable_ecn) |
   RSS_QUEUE_VALID | RSS_QUEUE(ep->rss_qid);
@@ -1134,6 +1149,14 @@ static int update_rx_credits(struct c4iw_ep *ep, u32 
credits)
return 0;
}
 
+   /*
+* If we couldn't specify the entire rcv window at connection setup
+* due to the limit in the number of bits in the RCV_BUFSIZ field,
+* then add the overage in to the credits returned.
+*/
+   if (ep->rcv_win > RCV_BUFSIZ_MASK * 1024)
+   credits += ep->rcv_win - RCV_BUFSIZ_MASK * 1024;
+
req = (struct cpl_rx_data_ack *) skb_put(skb, wrlen);
memset(req, 0, wrlen);
INIT_TP_WR(req, ep->hwtid);
@@ -1592,6 +1615,7 @@ static void send_fw_act_open_req(struct c4iw_ep *ep, 
unsigned int atid)
unsigned int mtu_idx;
int wscale;
struct sockaddr_in *sin;
+   int win;
 
skb = get_skb(NULL, sizeof(*req), GFP_KERNEL);
req = (struct fw_ofld_connection_wr *)__skb_put(skb, sizeof(*req));
@@ -1616,6 +1640,15 @@ static void send_fw_act_open_req(struct c4iw_ep *ep, 
unsigned int atid)
req->tcb.rcv_adv = htons(1);
cxgb4_best_mtu(ep->com.dev->rdev.lldi.mtus, ep->mtu, &mtu_idx);
wscale = compute_wscale(rcv_win);
+
+   /*
+* Specify the largest window that will fit in opt0. The
+* remainder will be specified in the rx_data_ack.
+*/
+   win = ep->rcv_win >> 10;
+   if (win > RCV_BUFSIZ_MASK)
+   win = RCV_BUFSIZ_MASK;
+
req->tcb.opt0 = (__force __be64) (TCAM_BYPASS(1) |
(nocong ? NO_CONG(1) : 0) |
KEEP_ALIVE(1) |
@@ -1627,7 +1660,7 @@ static void send_fw_act_open_req(struct c4iw_ep *ep, 
unsigned int atid)
SMAC_SEL(ep->smac_idx) |
   

[PATCHv2 net-next 18/31] iw_cxgb4: ignore read response type 1 CQEs.

2014-03-02 Thread Hariprasad Shenai
From: Steve Wise 

These are generated by HW in some error cases and need to be
silently discarded.

Signed-off-by: Steve Wise 
---
 drivers/infiniband/hw/cxgb4/cq.c |   20 
 1 files changed, 20 insertions(+), 0 deletions(-)

diff --git a/drivers/infiniband/hw/cxgb4/cq.c b/drivers/infiniband/hw/cxgb4/cq.c
index 59f7601..e310762 100644
--- a/drivers/infiniband/hw/cxgb4/cq.c
+++ b/drivers/infiniband/hw/cxgb4/cq.c
@@ -366,6 +366,14 @@ void c4iw_flush_hw_cq(struct c4iw_cq *chp)
if (CQE_OPCODE(hw_cqe) == FW_RI_READ_RESP) {
 
/*
+* If we have reached here because of async
+* event or other error, and have egress error
+* then drop
+*/
+   if (CQE_TYPE(hw_cqe) == 1)
+   goto next_cqe;
+
+   /*
 * drop peer2peer RTR reads.
 */
if (CQE_WRID_STAG(hw_cqe) == 1)
@@ -512,6 +520,18 @@ static int poll_cq(struct t4_wq *wq, struct t4_cq *cq, 
struct t4_cqe *cqe,
if (RQ_TYPE(hw_cqe) && (CQE_OPCODE(hw_cqe) == FW_RI_READ_RESP)) {
 
/*
+* If we have reached here because of async
+* event or other error, and have egress error
+* then drop
+*/
+   if (CQE_TYPE(hw_cqe) == 1) {
+   if (CQE_STATUS(hw_cqe))
+   t4_set_wq_in_error(wq);
+   ret = -EAGAIN;
+   goto skip_cqe;
+   }
+
+   /*
 * If this is an unsolicited read response, then the read
 * was generated by the kernel driver as part of peer-2-peer
 * connection setup.  So ignore the completion.
-- 
1.7.1



[PATCHv2 net-next 21/31] iw_cxgb4: update snd_seq when sending MPA messages.

2014-03-02 Thread Hariprasad Shenai
From: Steve Wise 

Signed-off-by: Steve Wise 
---
 drivers/infiniband/hw/cxgb4/cm.c |3 +++
 1 files changed, 3 insertions(+), 0 deletions(-)

diff --git a/drivers/infiniband/hw/cxgb4/cm.c b/drivers/infiniband/hw/cxgb4/cm.c
index 81fbc6e..11c99a6 100644
--- a/drivers/infiniband/hw/cxgb4/cm.c
+++ b/drivers/infiniband/hw/cxgb4/cm.c
@@ -777,6 +777,7 @@ static void send_mpa_req(struct c4iw_ep *ep, struct sk_buff 
*skb,
start_ep_timer(ep);
state_set(&ep->com, MPA_REQ_SENT);
ep->mpa_attr.initiator = 1;
+   ep->snd_seq += mpalen;
return;
 }
 
@@ -856,6 +857,7 @@ static int send_mpa_reject(struct c4iw_ep *ep, const void 
*pdata, u8 plen)
t4_set_arp_err_handler(skb, NULL, arp_failure_discard);
BUG_ON(ep->mpa_skb);
ep->mpa_skb = skb;
+   ep->snd_seq += mpalen;
return c4iw_l2t_send(&ep->com.dev->rdev, skb, ep->l2t);
 }
 
@@ -940,6 +942,7 @@ static int send_mpa_reply(struct c4iw_ep *ep, const void 
*pdata, u8 plen)
t4_set_arp_err_handler(skb, NULL, arp_failure_discard);
ep->mpa_skb = skb;
state_set(&ep->com, MPA_REP_SENT);
+   ep->snd_seq += mpalen;
return c4iw_l2t_send(&ep->com.dev->rdev, skb, ep->l2t);
 }
 
-- 
1.7.1



[PATCHv2 net-next 05/31] cxgb4: use spinlock_irqsave/spinlock_irqrestore for db lock.

2014-03-02 Thread Hariprasad Shenai
From: Kumar Sanghvi 

Currently ring_tx_db() can deadlock if a db_full interrupt fires and is
handled on the same CPU while ring_tx_db() holds the db lock.  It needs to
disable interrupts since it serializes with an interrupt handler.
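
A minimal sketch of the fix (mirrors the diff below):

    unsigned long flags;

    /*
     * ring_tx_db() serializes with the DB_FULL/DB_DROP interrupt
     * handler on db_lock, so it must disable local interrupts while
     * holding it; otherwise the handler can spin on the lock on the
     * same CPU and deadlock.
     */
    spin_lock_irqsave(&q->db_lock, flags);
    /* ring the doorbell / record q->db_pidx here */
    spin_unlock_irqrestore(&q->db_lock, flags);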

Based on original work by Steve Wise 

Signed-off-by: Kumar Sanghvi 
---
 drivers/net/ethernet/chelsio/cxgb4/cxgb4_main.c |   11 +++
 drivers/net/ethernet/chelsio/cxgb4/sge.c|5 +++--
 2 files changed, 10 insertions(+), 6 deletions(-)

diff --git a/drivers/net/ethernet/chelsio/cxgb4/cxgb4_main.c 
b/drivers/net/ethernet/chelsio/cxgb4/cxgb4_main.c
index da4edc1..73dbf81 100644
--- a/drivers/net/ethernet/chelsio/cxgb4/cxgb4_main.c
+++ b/drivers/net/ethernet/chelsio/cxgb4/cxgb4_main.c
@@ -3585,9 +3585,11 @@ static void disable_txq_db(struct sge_txq *q)
 
 static void enable_txq_db(struct sge_txq *q)
 {
-   spin_lock_irq(&q->db_lock);
+   unsigned long flags;
+
+   spin_lock_irqsave(&q->db_lock, flags);
q->db_disabled = 0;
-   spin_unlock_irq(&q->db_lock);
+   spin_unlock_irqrestore(&q->db_lock, flags);
 }
 
 static void disable_dbs(struct adapter *adap)
@@ -3617,9 +3619,10 @@ static void enable_dbs(struct adapter *adap)
 static void sync_txq_pidx(struct adapter *adap, struct sge_txq *q)
 {
u16 hw_pidx, hw_cidx;
+   unsigned long flags;
int ret;
 
-   spin_lock_bh(&q->db_lock);
+   spin_lock_irqsave(&q->db_lock, flags);
ret = read_eq_indices(adap, (u16)q->cntxt_id, &hw_pidx, &hw_cidx);
if (ret)
goto out;
@@ -3636,7 +3639,7 @@ static void sync_txq_pidx(struct adapter *adap, struct 
sge_txq *q)
}
 out:
q->db_disabled = 0;
-   spin_unlock_bh(&q->db_lock);
+   spin_unlock_irqrestore(&q->db_lock, flags);
if (ret)
CH_WARN(adap, "DB drop recovery failed.\n");
 }
diff --git a/drivers/net/ethernet/chelsio/cxgb4/sge.c 
b/drivers/net/ethernet/chelsio/cxgb4/sge.c
index e0376cd..392baee 100644
--- a/drivers/net/ethernet/chelsio/cxgb4/sge.c
+++ b/drivers/net/ethernet/chelsio/cxgb4/sge.c
@@ -860,9 +860,10 @@ static void cxgb_pio_copy(u64 __iomem *dst, u64 *src)
 static inline void ring_tx_db(struct adapter *adap, struct sge_txq *q, int n)
 {
unsigned int *wr, index;
+   unsigned long flags;
 
wmb();/* write descriptors before telling HW */
-   spin_lock(&q->db_lock);
+   spin_lock_irqsave(&q->db_lock, flags);
if (!q->db_disabled) {
if (is_t4(adap->params.chip)) {
t4_write_reg(adap, MYPF_REG(SGE_PF_KDOORBELL),
@@ -880,7 +881,7 @@ static inline void ring_tx_db(struct adapter *adap, struct 
sge_txq *q, int n)
}
}
q->db_pidx = q->pidx;
-   spin_unlock(&q->db_lock);
+   spin_unlock_irqrestore(&q->db_lock, flags);
 }
 
 /**
-- 
1.7.1



[PATCHv2 net-next 11/31] iw_cxgb4: use the BAR2/WC path for kernel QPs and T5 devices.

2014-03-02 Thread Hariprasad Shenai
From: Steve Wise 

Signed-off-by: Steve Wise 
---
 drivers/infiniband/hw/cxgb4/device.c   |   41 +
 drivers/infiniband/hw/cxgb4/iw_cxgb4.h |2 +
 drivers/infiniband/hw/cxgb4/qp.c   |   59 --
 drivers/infiniband/hw/cxgb4/t4.h   |   62 +--
 4 files changed, 132 insertions(+), 32 deletions(-)

diff --git a/drivers/infiniband/hw/cxgb4/device.c 
b/drivers/infiniband/hw/cxgb4/device.c
index 84a78f2..9542ccc 100644
--- a/drivers/infiniband/hw/cxgb4/device.c
+++ b/drivers/infiniband/hw/cxgb4/device.c
@@ -695,7 +695,10 @@ static void c4iw_dealloc(struct uld_ctx *ctx)
idr_destroy(&ctx->dev->hwtid_idr);
idr_destroy(&ctx->dev->stid_idr);
idr_destroy(&ctx->dev->atid_idr);
-   iounmap(ctx->dev->rdev.oc_mw_kva);
+   if (ctx->dev->rdev.bar2_kva)
+   iounmap(ctx->dev->rdev.bar2_kva);
+   if (ctx->dev->rdev.oc_mw_kva)
+   iounmap(ctx->dev->rdev.oc_mw_kva);
ib_dealloc_device(&ctx->dev->ibdev);
ctx->dev = NULL;
 }
@@ -735,11 +738,31 @@ static struct c4iw_dev *c4iw_alloc(const struct 
cxgb4_lld_info *infop)
}
devp->rdev.lldi = *infop;
 
-   devp->rdev.oc_mw_pa = pci_resource_start(devp->rdev.lldi.pdev, 2) +
-   (pci_resource_len(devp->rdev.lldi.pdev, 2) -
-roundup_pow_of_two(devp->rdev.lldi.vr->ocq.size));
-   devp->rdev.oc_mw_kva = ioremap_wc(devp->rdev.oc_mw_pa,
-  devp->rdev.lldi.vr->ocq.size);
+   /*
+* For T5 devices, we map all of BAR2 with WC.
+* For T4 devices with onchip qp mem, we map only that part
+* of BAR2 with WC.
+*/
+   devp->rdev.bar2_pa = pci_resource_start(devp->rdev.lldi.pdev, 2);
+   if (is_t5(devp->rdev.lldi.adapter_type)) {
+   devp->rdev.bar2_kva = ioremap_wc(devp->rdev.bar2_pa,
+   pci_resource_len(devp->rdev.lldi.pdev, 2));
+   if (!devp->rdev.bar2_kva) {
+   printk(KERN_ERR MOD "Unable to ioremap BAR2\n");
+   return ERR_PTR(-EINVAL);
+   }
+   } else if (ocqp_supported(infop)) {
+   devp->rdev.oc_mw_pa =
+   pci_resource_start(devp->rdev.lldi.pdev, 2) +
+   pci_resource_len(devp->rdev.lldi.pdev, 2) -
+   roundup_pow_of_two(devp->rdev.lldi.vr->ocq.size);
+   devp->rdev.oc_mw_kva = ioremap_wc(devp->rdev.oc_mw_pa,
+   devp->rdev.lldi.vr->ocq.size);
+   if (!devp->rdev.oc_mw_kva) {
+   pr_err(MOD "Unable to ioremap onchip mem\n");
+   return ERR_PTR(-EINVAL);
+   }
+   }
 
PDBG(KERN_INFO MOD "ocq memory: "
   "hw_start 0x%x size %u mw_pa 0x%lx mw_kva %p\n",
@@ -1014,9 +1037,11 @@ static int enable_qp_db(int id, void *p, void *data)
 static void resume_rc_qp(struct c4iw_qp *qp)
 {
spin_lock(&qp->lock);
-   t4_ring_sq_db(&qp->wq, qp->wq.sq.wq_pidx_inc);
+   t4_ring_sq_db(&qp->wq, qp->wq.sq.wq_pidx_inc,
+ is_t5(qp->rhp->rdev.lldi.adapter_type), NULL);
qp->wq.sq.wq_pidx_inc = 0;
-   t4_ring_rq_db(&qp->wq, qp->wq.rq.wq_pidx_inc);
+   t4_ring_rq_db(&qp->wq, qp->wq.rq.wq_pidx_inc,
+ is_t5(qp->rhp->rdev.lldi.adapter_type), NULL);
qp->wq.rq.wq_pidx_inc = 0;
spin_unlock(&qp->lock);
 }
diff --git a/drivers/infiniband/hw/cxgb4/iw_cxgb4.h 
b/drivers/infiniband/hw/cxgb4/iw_cxgb4.h
index e9ecbfa..c05c875 100644
--- a/drivers/infiniband/hw/cxgb4/iw_cxgb4.h
+++ b/drivers/infiniband/hw/cxgb4/iw_cxgb4.h
@@ -149,6 +149,8 @@ struct c4iw_rdev {
struct gen_pool *ocqp_pool;
u32 flags;
struct cxgb4_lld_info lldi;
+   unsigned long bar2_pa;
+   void __iomem *bar2_kva;
unsigned long oc_mw_pa;
void __iomem *oc_mw_kva;
struct c4iw_stats stats;
diff --git a/drivers/infiniband/hw/cxgb4/qp.c b/drivers/infiniband/hw/cxgb4/qp.c
index 3b62eb5..c16866c 100644
--- a/drivers/infiniband/hw/cxgb4/qp.c
+++ b/drivers/infiniband/hw/cxgb4/qp.c
@@ -58,6 +58,10 @@ static int max_fr_immd = T4_MAX_FR_IMMD;
 module_param(max_fr_immd, int, 0644);
 MODULE_PARM_DESC(max_fr_immd, "fastreg threshold for using DSGL instead of 
immedate");
 
+int t5_en_wc = 1;
+module_param(t5_en_wc, int, 0644);
+MODULE_PARM_DESC(t5_en_wc, "Use BAR2/WC path for kernel users (default 1)");
+
 static void set_state(struct c4iw_qp *qhp, enum c4iw_qp_state state)
 {
unsigned long flag;
@@ -212,13 +216,23 @@ static int create_qp(struct c4iw_rdev *rdev, struct t4_wq 
*wq,
 
wq->db = rdev->lldi.db_reg;
wq->gts = rdev->lldi.gts_reg;
-   if (user) {
-   wq->sq.udb = (u64)pci_resource_start(rdev->lldi.pdev, 2) +
-   (wq->sq.qid << rdev->qpshift);
-   wq->sq.udb &= PAGE_MAS

[PATCHv2 net-next 14/31] iw_cxgb4: default peer2peer mode to 1.

2014-03-02 Thread Hariprasad Shenai
From: Steve Wise 

Signed-off-by: Steve Wise 
---
 drivers/infiniband/hw/cxgb4/cm.c |4 ++--
 1 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/infiniband/hw/cxgb4/cm.c b/drivers/infiniband/hw/cxgb4/cm.c
index 9387f74..d21ac15 100644
--- a/drivers/infiniband/hw/cxgb4/cm.c
+++ b/drivers/infiniband/hw/cxgb4/cm.c
@@ -98,9 +98,9 @@ int c4iw_debug;
 module_param(c4iw_debug, int, 0644);
 MODULE_PARM_DESC(c4iw_debug, "Enable debug logging (default=0)");
 
-static int peer2peer;
+static int peer2peer = 1;
 module_param(peer2peer, int, 0644);
-MODULE_PARM_DESC(peer2peer, "Support peer2peer ULPs (default=0)");
+MODULE_PARM_DESC(peer2peer, "Support peer2peer ULPs (default=1)");
 
 static int p2p_type = FW_RI_INIT_P2PTYPE_READ_REQ;
 module_param(p2p_type, int, 0644);
-- 
1.7.1



[PATCHv2 net-next 10/31] cxgb4/iw_cxgb4: Doorbell Drop Avoidance Bug Fixes.

2014-03-02 Thread Hariprasad Shenai
From: Steve Wise 

The current logic suffers from a slow response time to disable user DB
usage, and also fails to avoid DB FIFO drops under heavy load. This commit
fixes these deficiencies and makes the avoidance logic more optimal.
This is done by more efficiently notifying the ULDs of potential DB
problems, and implements a smoother flow control algorithm in iw_cxgb4,
which is the ULD that puts the most load on the DB fifo.

Design:

cxgb4:

Direct ULD notification when a DB FULL/DROP interrupt fires.  This allows
the ULD to stop doing user DB writes as quickly as possible.

While user DB usage is disabled, the LLD will accumulate DB write events
for its queues.  Then once DB usage is reenabled, a single DB write is
done for each queue with its accumulated write count.  This reduces the
load put on the DB fifo when reenabling.

iw_cxgb4:

Instead of marking each qp to indicate DB writes are disabled, we create
a device-global status page that each user process maps.  This allows
iw_cxgb4 to set this single bit to disable all DB writes for all
user QPs instead of traversing all the active QPs.  If libcxgb4 doesn't
support this, then we fall back to the old approach of marking each QP.
Thus we allow the new driver to work with an older libcxgb4.

When the LLD upcalls indicating DB FULL, we disable all DB writes
via the status page and transition the DB state to STOPPED.  As user
processes see that DB writes are disabled, they call into iw_cxgb4 to
submit their DB write events.  Since the DB state is STOPPED,
the QP trying to write gets enqueued on a new DB "flow control" list.
As subsequent DB writes are submitted for this flow controlled QP, the
amount of writes are accumulated for each QP on the flow control list.
So all the user QPs that are actively ringing the DB get put on this
list and the number of writes they request are accumulated.

When the LLD upcalls indicating DB EMPTY, which is in a workq context, we
change the DB state to FLOW_CONTROL, and begin resuming all the QPs that
are on the flow control list.  This logic runs until the flow control
list is empty or we exit FLOW_CONTROL mode (due to a DB DROP upcall,
for example).  QPs are removed from this list, and their accumulated
DB write counts written to the DB FIFO.  Sets of QPs, called chunks in
the code, are removed at one time. This chunk size is a module option,
db_fc_resume_size, and defaults to 64.  So 64 QPs are resumed at a time,
and before the next chunk is resumed, the logic waits (blocks) for the
DB FIFO to drain.  This prevents resuming too quickly and overflowing
the FIFO.  Once the flow control list is empty, the db state transitions
back to NORMAL and user QPs are again allowed to write directly to the
user DB register.

The algorithm is designed such that if the DB write load is high enough,
then all the DB writes get submitted by the kernel using this flow
controlled approach to avoid DB drops.  As the load lightens though, we
resume to normal DB writes directly by user applications.
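
A rough sketch of the chunked resume loop described above (hypothetical
helper and field names; db_fc_resume_size, the flow-control list and the
FLOW_CONTROL/NORMAL states are from this patch):

    static void resume_flow_controlled_qps(struct c4iw_dev *dev)
    {
        while (!list_empty(&dev->db_fc_list) &&
               dev->db_state == FLOW_CONTROL) {
            int chunk = db_fc_resume_size;

            while (chunk-- && !list_empty(&dev->db_fc_list)) {
                struct c4iw_qp *qp = next_fc_qp(dev);   /* hypothetical */

                /* write the QP's accumulated DB increments */
                resume_rc_qp(qp);
            }
            /* block until the DB FIFO drains before the next chunk */
            wait_for_db_fifo_drain(dev);                /* hypothetical */
        }
        if (list_empty(&dev->db_fc_list))
            dev->db_state = NORMAL; /* user QPs ring DBs directly again */
    }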

Signed-off-by: Steve Wise 
---
 drivers/infiniband/hw/cxgb4/device.c|  188 ++-
 drivers/infiniband/hw/cxgb4/iw_cxgb4.h  |   11 +-
 drivers/infiniband/hw/cxgb4/provider.c  |   44 +-
 drivers/infiniband/hw/cxgb4/qp.c|  140 --
 drivers/infiniband/hw/cxgb4/t4.h|6 +
 drivers/infiniband/hw/cxgb4/user.h  |5 +
 drivers/net/ethernet/chelsio/cxgb4/cxgb4.h  |1 +
 drivers/net/ethernet/chelsio/cxgb4/cxgb4_main.c |   73 +
 drivers/net/ethernet/chelsio/cxgb4/sge.c|3 +-
 9 files changed, 286 insertions(+), 185 deletions(-)

diff --git a/drivers/infiniband/hw/cxgb4/device.c 
b/drivers/infiniband/hw/cxgb4/device.c
index 4a03385..84a78f2 100644
--- a/drivers/infiniband/hw/cxgb4/device.c
+++ b/drivers/infiniband/hw/cxgb4/device.c
@@ -55,6 +55,23 @@ module_param(allow_db_coalescing_on_t5, int, 0644);
 MODULE_PARM_DESC(allow_db_coalescing_on_t5,
 "Allow DB Coalescing on T5 (default = 0)");
 
+static int db_fc_resume_size = 64;
+module_param(db_fc_resume_size, int, 0644);
+MODULE_PARM_DESC(db_fc_resume_size, "qps are resumed from db flow control in "
+"this size chunks (default = 64)");
+
+static int db_fc_resume_delay = 1;
+module_param(db_fc_resume_delay, int, 0644);
+MODULE_PARM_DESC(db_fc_resume_delay, "how long to delay between removing qps "
+"from the fc list (default is 1 jiffy)");
+
+static int db_fc_drain_thresh;
+module_param(db_fc_drain_thresh, int, 0644);
+MODULE_PARM_DESC(db_fc_drain_thresh,
+"relative threshold at which a chunk will be resumed"
+"from the fc list (default is 0 (int_thresh << "
+"db_fc_drain_thresh))");
+
 struct uld_ctx {
struct list_head entry;
struct cxgb4_lld_info lldi;
@@ -311,9 +328,10 @@ static int stats_show(struct seq_file *seq, void *v)
seq_printf(seq, "  D

[PATCHv2 net-next 13/31] iw_cxgb4: Mind the sq_sig_all/sq_sig_type QP attributes.

2014-03-02 Thread Hariprasad Shenai
From: Steve Wise 

Signed-off-by: Steve Wise 
---
 drivers/infiniband/hw/cxgb4/iw_cxgb4.h |1 +
 drivers/infiniband/hw/cxgb4/qp.c   |6 --
 2 files changed, 5 insertions(+), 2 deletions(-)

diff --git a/drivers/infiniband/hw/cxgb4/iw_cxgb4.h 
b/drivers/infiniband/hw/cxgb4/iw_cxgb4.h
index c05c875..8c32088 100644
--- a/drivers/infiniband/hw/cxgb4/iw_cxgb4.h
+++ b/drivers/infiniband/hw/cxgb4/iw_cxgb4.h
@@ -448,6 +448,7 @@ struct c4iw_qp {
atomic_t refcnt;
wait_queue_head_t wait;
struct timer_list timer;
+   int sq_sig_all;
 };
 
 static inline struct c4iw_qp *to_c4iw_qp(struct ib_qp *ibqp)
diff --git a/drivers/infiniband/hw/cxgb4/qp.c b/drivers/infiniband/hw/cxgb4/qp.c
index c16866c..3f2065c 100644
--- a/drivers/infiniband/hw/cxgb4/qp.c
+++ b/drivers/infiniband/hw/cxgb4/qp.c
@@ -732,7 +732,7 @@ int c4iw_post_send(struct ib_qp *ibqp, struct ib_send_wr 
*wr,
fw_flags = 0;
if (wr->send_flags & IB_SEND_SOLICITED)
fw_flags |= FW_RI_SOLICITED_EVENT_FLAG;
-   if (wr->send_flags & IB_SEND_SIGNALED)
+   if (wr->send_flags & IB_SEND_SIGNALED || qhp->sq_sig_all)
fw_flags |= FW_RI_COMPLETION_FLAG;
swsqe = &qhp->wq.sq.sw_sq[qhp->wq.sq.pidx];
switch (wr->opcode) {
@@ -793,7 +793,8 @@ int c4iw_post_send(struct ib_qp *ibqp, struct ib_send_wr 
*wr,
}
swsqe->idx = qhp->wq.sq.pidx;
swsqe->complete = 0;
-   swsqe->signaled = (wr->send_flags & IB_SEND_SIGNALED);
+   swsqe->signaled = (wr->send_flags & IB_SEND_SIGNALED) ||
+ qhp->sq_sig_all;
swsqe->flushed = 0;
swsqe->wr_id = wr->wr_id;
 
@@ -1620,6 +1621,7 @@ struct ib_qp *c4iw_create_qp(struct ib_pd *pd, struct 
ib_qp_init_attr *attrs,
qhp->attr.enable_bind = 1;
qhp->attr.max_ord = 1;
qhp->attr.max_ird = 1;
+   qhp->sq_sig_all = attrs->sq_sig_type == IB_SIGNAL_ALL_WR;
spin_lock_init(&qhp->lock);
mutex_init(&qhp->mutex);
init_waitqueue_head(&qhp->wait);
-- 
1.7.1



[PATCHv2 net-next 30/31] iw_cxgb4: Max fastreg depth depends on DSGL support.

2014-03-02 Thread Hariprasad Shenai
From: Steve Wise 

The max depth of a fastreg MR depends on whether the device supports DSGL
or not.  So compute it dynamically based on the device support and the
module use_dsgl option.

Signed-off-by: Steve Wise 
---
 drivers/infiniband/hw/cxgb4/provider.c |2 +-
 drivers/infiniband/hw/cxgb4/qp.c   |3 ++-
 drivers/infiniband/hw/cxgb4/t4.h   |9 -
 3 files changed, 11 insertions(+), 3 deletions(-)

diff --git a/drivers/infiniband/hw/cxgb4/provider.c 
b/drivers/infiniband/hw/cxgb4/provider.c
index d1565a4..9e1a409 100644
--- a/drivers/infiniband/hw/cxgb4/provider.c
+++ b/drivers/infiniband/hw/cxgb4/provider.c
@@ -327,7 +327,7 @@ static int c4iw_query_device(struct ib_device *ibdev,
props->max_mr = c4iw_num_stags(&dev->rdev);
props->max_pd = T4_MAX_NUM_PD;
props->local_ca_ack_delay = 0;
-   props->max_fast_reg_page_list_len = T4_MAX_FR_DEPTH;
+   props->max_fast_reg_page_list_len = t4_max_fr_depth(use_dsgl);
 
return 0;
 }
diff --git a/drivers/infiniband/hw/cxgb4/qp.c b/drivers/infiniband/hw/cxgb4/qp.c
index ca2f753..f6ffb04 100644
--- a/drivers/infiniband/hw/cxgb4/qp.c
+++ b/drivers/infiniband/hw/cxgb4/qp.c
@@ -572,7 +572,8 @@ static int build_fastreg(struct t4_sq *sq, union t4_wr *wqe,
int pbllen = roundup(wr->wr.fast_reg.page_list_len * sizeof(u64), 32);
int rem;
 
-   if (wr->wr.fast_reg.page_list_len > T4_MAX_FR_DEPTH)
+   if (wr->wr.fast_reg.page_list_len >
+   t4_max_fr_depth(use_dsgl))
return -EINVAL;
 
wqe->fr.qpbinde_to_dcacpu = 0;
diff --git a/drivers/infiniband/hw/cxgb4/t4.h b/drivers/infiniband/hw/cxgb4/t4.h
index ace3154..1543d6b 100644
--- a/drivers/infiniband/hw/cxgb4/t4.h
+++ b/drivers/infiniband/hw/cxgb4/t4.h
@@ -84,7 +84,14 @@ struct t4_status_page {
sizeof(struct fw_ri_isgl)) / sizeof(struct fw_ri_sge))
 #define T4_MAX_FR_IMMD ((T4_SQ_NUM_BYTES - sizeof(struct fw_ri_fr_nsmr_wr) - \
sizeof(struct fw_ri_immd)) & ~31UL)
-#define T4_MAX_FR_DEPTH (1024 / sizeof(u64))
+#define T4_MAX_FR_IMMD_DEPTH (T4_MAX_FR_IMMD / sizeof(u64))
+#define T4_MAX_FR_DSGL 1024
+#define T4_MAX_FR_DSGL_DEPTH (T4_MAX_FR_DSGL / sizeof(u64))
+
+static inline int t4_max_fr_depth(int use_dsgl)
+{
+   return use_dsgl ? T4_MAX_FR_DSGL_DEPTH : T4_MAX_FR_IMMD_DEPTH;
+}
 
 #define T4_RQ_NUM_SLOTS 2
 #define T4_RQ_NUM_BYTES (T4_EQ_ENTRY_SIZE * T4_RQ_NUM_SLOTS)
-- 
1.7.1



[PATCHv2 net-next 22/31] iw_cxgb4: lock around accept/reject downcalls.

2014-03-02 Thread Hariprasad Shenai
From: Steve Wise 

There is a race between ULP threads doing an accept/reject, and the
ingress processing thread handling close/abort for the same connection.
The accept/reject path needs to hold the lock to serialize these paths.
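
The resulting serialization pattern, sketched (the diff below has the full
accept and reject paths):

    mutex_lock(&ep->com.mutex);
    if (ep->com.state == DEAD) {        /* already torn down by ingress path */
        mutex_unlock(&ep->com.mutex);
        return -ECONNRESET;
    }
    /* ep state is stable here: build/send the MPA reply or reject */
    mutex_unlock(&ep->com.mutex);
    /* heavier follow-up work (e.g. c4iw_ep_disconnect()) runs after unlock */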

Signed-off-by: Steve Wise 
---
 drivers/infiniband/hw/cxgb4/cm.c |   31 +--
 1 files changed, 21 insertions(+), 10 deletions(-)

diff --git a/drivers/infiniband/hw/cxgb4/cm.c b/drivers/infiniband/hw/cxgb4/cm.c
index 11c99a6..cbeaa58 100644
--- a/drivers/infiniband/hw/cxgb4/cm.c
+++ b/drivers/infiniband/hw/cxgb4/cm.c
@@ -775,7 +775,7 @@ static void send_mpa_req(struct c4iw_ep *ep, struct sk_buff 
*skb,
ep->mpa_skb = skb;
c4iw_l2t_send(&ep->com.dev->rdev, skb, ep->l2t);
start_ep_timer(ep);
-   state_set(&ep->com, MPA_REQ_SENT);
+   __state_set(&ep->com, MPA_REQ_SENT);
ep->mpa_attr.initiator = 1;
ep->snd_seq += mpalen;
return;
@@ -941,7 +941,7 @@ static int send_mpa_reply(struct c4iw_ep *ep, const void 
*pdata, u8 plen)
skb_get(skb);
t4_set_arp_err_handler(skb, NULL, arp_failure_discard);
ep->mpa_skb = skb;
-   state_set(&ep->com, MPA_REP_SENT);
+   __state_set(&ep->com, MPA_REP_SENT);
ep->snd_seq += mpalen;
return c4iw_l2t_send(&ep->com.dev->rdev, skb, ep->l2t);
 }
@@ -959,6 +959,7 @@ static int act_establish(struct c4iw_dev *dev, struct 
sk_buff *skb)
PDBG("%s ep %p tid %u snd_isn %u rcv_isn %u\n", __func__, ep, tid,
 be32_to_cpu(req->snd_isn), be32_to_cpu(req->rcv_isn));
 
+   mutex_lock(&ep->com.mutex);
dst_confirm(ep->dst);
 
/* setup the hwtid for this connection */
@@ -982,7 +983,7 @@ static int act_establish(struct c4iw_dev *dev, struct 
sk_buff *skb)
send_mpa_req(ep, skb, 1);
else
send_mpa_req(ep, skb, mpa_rev);
-
+   mutex_unlock(&ep->com.mutex);
return 0;
 }
 
@@ -2567,22 +2568,28 @@ static int fw4_ack(struct c4iw_dev *dev, struct sk_buff 
*skb)
 
 int c4iw_reject_cr(struct iw_cm_id *cm_id, const void *pdata, u8 pdata_len)
 {
-   int err;
+   int err = 0;
+   int disconnect = 0;
struct c4iw_ep *ep = to_ep(cm_id);
PDBG("%s ep %p tid %u\n", __func__, ep, ep->hwtid);
 
-   if (state_read(&ep->com) == DEAD) {
+
+   mutex_lock(&ep->com.mutex);
+   if (ep->com.state == DEAD) {
c4iw_put_ep(&ep->com);
return -ECONNRESET;
}
set_bit(ULP_REJECT, &ep->com.history);
-   BUG_ON(state_read(&ep->com) != MPA_REQ_RCVD);
+   BUG_ON(ep->com.state != MPA_REQ_RCVD);
if (mpa_rev == 0)
abort_connection(ep, NULL, GFP_KERNEL);
else {
err = send_mpa_reject(ep, pdata, pdata_len);
-   err = c4iw_ep_disconnect(ep, 0, GFP_KERNEL);
+   disconnect = 1;
}
+   mutex_unlock(&ep->com.mutex);
+   if (disconnect)
+   err = c4iw_ep_disconnect(ep, 0, GFP_KERNEL);
c4iw_put_ep(&ep->com);
return 0;
 }
@@ -2597,12 +2604,14 @@ int c4iw_accept_cr(struct iw_cm_id *cm_id, struct 
iw_cm_conn_param *conn_param)
struct c4iw_qp *qp = get_qhp(h, conn_param->qpn);
 
PDBG("%s ep %p tid %u\n", __func__, ep, ep->hwtid);
-   if (state_read(&ep->com) == DEAD) {
+
+   mutex_lock(&ep->com.mutex);
+   if (ep->com.state == DEAD) {
err = -ECONNRESET;
goto err;
}
 
-   BUG_ON(state_read(&ep->com) != MPA_REQ_RCVD);
+   BUG_ON(ep->com.state != MPA_REQ_RCVD);
BUG_ON(!qp);
 
set_bit(ULP_ACCEPT, &ep->com.history);
@@ -2671,14 +2680,16 @@ int c4iw_accept_cr(struct iw_cm_id *cm_id, struct 
iw_cm_conn_param *conn_param)
if (err)
goto err1;
 
-   state_set(&ep->com, FPDU_MODE);
+   __state_set(&ep->com, FPDU_MODE);
established_upcall(ep);
+   mutex_unlock(&ep->com.mutex);
c4iw_put_ep(&ep->com);
return 0;
 err1:
ep->com.cm_id = NULL;
cm_id->rem_ref(cm_id);
 err:
+   mutex_unlock(&ep->com.mutex);
c4iw_put_ep(&ep->com);
return err;
 }
-- 
1.7.1



[PATCHv2 net-next 29/31] iw_cxgb4: minor fixes/cleanup.

2014-03-02 Thread Hariprasad Shenai
From: Steve Wise 

Added some missing debug stats.

Use uninitialized_var().

Rate limit warning printks.

Initialize reserved fields in a FW work request.

Signed-off-by: Steve Wise 
---
 drivers/infiniband/hw/cxgb4/cq.c   |2 +-
 drivers/infiniband/hw/cxgb4/mem.c  |6 +-
 drivers/infiniband/hw/cxgb4/qp.c   |2 ++
 drivers/infiniband/hw/cxgb4/resource.c |   10 +++---
 4 files changed, 15 insertions(+), 5 deletions(-)

diff --git a/drivers/infiniband/hw/cxgb4/cq.c b/drivers/infiniband/hw/cxgb4/cq.c
index 351c8e0..6cbe4c5 100644
--- a/drivers/infiniband/hw/cxgb4/cq.c
+++ b/drivers/infiniband/hw/cxgb4/cq.c
@@ -676,7 +676,7 @@ skip_cqe:
 static int c4iw_poll_cq_one(struct c4iw_cq *chp, struct ib_wc *wc)
 {
struct c4iw_qp *qhp = NULL;
-   struct t4_cqe cqe = {0, 0}, *rd_cqe;
+   struct t4_cqe uninitialized_var(cqe), *rd_cqe;
struct t4_wq *wq;
u32 credit = 0;
u8 cqe_flushed;
diff --git a/drivers/infiniband/hw/cxgb4/mem.c 
b/drivers/infiniband/hw/cxgb4/mem.c
index cdaf257..ecdee18 100644
--- a/drivers/infiniband/hw/cxgb4/mem.c
+++ b/drivers/infiniband/hw/cxgb4/mem.c
@@ -259,8 +259,12 @@ static int write_tpt_entry(struct c4iw_rdev *rdev, u32 
reset_tpt_entry,
 
if ((!reset_tpt_entry) && (*stag == T4_STAG_UNSET)) {
stag_idx = c4iw_get_resource(&rdev->resource.tpt_table);
-   if (!stag_idx)
+   if (!stag_idx) {
+   mutex_lock(&rdev->stats.lock);
+   rdev->stats.stag.fail++;
+   mutex_unlock(&rdev->stats.lock);
return -ENOMEM;
+   }
mutex_lock(&rdev->stats.lock);
rdev->stats.stag.cur += 32;
if (rdev->stats.stag.cur > rdev->stats.stag.max)
diff --git a/drivers/infiniband/hw/cxgb4/qp.c b/drivers/infiniband/hw/cxgb4/qp.c
index 34a0bd2..ca2f753 100644
--- a/drivers/infiniband/hw/cxgb4/qp.c
+++ b/drivers/infiniband/hw/cxgb4/qp.c
@@ -440,6 +440,8 @@ static int build_rdma_send(struct t4_sq *sq, union t4_wr 
*wqe,
default:
return -EINVAL;
}
+   wqe->send.r3 = 0;
+   wqe->send.r4 = 0;
 
plen = 0;
if (wr->num_sge) {
diff --git a/drivers/infiniband/hw/cxgb4/resource.c 
b/drivers/infiniband/hw/cxgb4/resource.c
index cdef4d7..69e57d0 100644
--- a/drivers/infiniband/hw/cxgb4/resource.c
+++ b/drivers/infiniband/hw/cxgb4/resource.c
@@ -179,8 +179,12 @@ u32 c4iw_get_qpid(struct c4iw_rdev *rdev, struct 
c4iw_dev_ucontext *uctx)
kfree(entry);
} else {
qid = c4iw_get_resource(&rdev->resource.qid_table);
-   if (!qid)
+   if (!qid) {
+   mutex_lock(&rdev->stats.lock);
+   rdev->stats.qid.fail++;
+   mutex_unlock(&rdev->stats.lock);
goto out;
+   }
mutex_lock(&rdev->stats.lock);
rdev->stats.qid.cur += rdev->qpmask + 1;
mutex_unlock(&rdev->stats.lock);
@@ -322,8 +326,8 @@ u32 c4iw_rqtpool_alloc(struct c4iw_rdev *rdev, int size)
unsigned long addr = gen_pool_alloc(rdev->rqt_pool, size << 6);
PDBG("%s addr 0x%x size %d\n", __func__, (u32)addr, size << 6);
if (!addr)
-   printk_ratelimited(KERN_WARNING MOD "%s: Out of RQT memory\n",
-  pci_name(rdev->lldi.pdev));
+   pr_warn_ratelimited(KERN_WARNING MOD "%s: Out of RQT memory\n",
+   pci_name(rdev->lldi.pdev));
mutex_lock(&rdev->stats.lock);
if (addr) {
rdev->stats.rqt.cur += roundup(size << 6, 1 << MIN_RQT_SHIFT);
-- 
1.7.1



[PATCHv2 net-next 15/31] iw_cxgb4: save the correct map length for fast_reg_page_lists.

2014-03-02 Thread Hariprasad Shenai
From: Steve Wise 

We cannot save the mapped length using the rdma max_page_list_len field
of the ib_fast_reg_page_list struct because the core code uses it.  This
results in an incorrect unmap of the page list in c4iw_free_fastreg_pbl().

I found this with dma map debugging enabled in the kernel.  The fix is
to save the length in the c4iw_fr_page_list struct.

Signed-off-by: Steve Wise 
---
 drivers/infiniband/hw/cxgb4/iw_cxgb4.h |1 +
 drivers/infiniband/hw/cxgb4/mem.c  |   12 ++--
 2 files changed, 11 insertions(+), 2 deletions(-)

diff --git a/drivers/infiniband/hw/cxgb4/iw_cxgb4.h 
b/drivers/infiniband/hw/cxgb4/iw_cxgb4.h
index 8c32088..b75f8f5 100644
--- a/drivers/infiniband/hw/cxgb4/iw_cxgb4.h
+++ b/drivers/infiniband/hw/cxgb4/iw_cxgb4.h
@@ -375,6 +375,7 @@ struct c4iw_fr_page_list {
DEFINE_DMA_UNMAP_ADDR(mapping);
dma_addr_t dma_addr;
struct c4iw_dev *dev;
+   int pll_len;
 };
 
 static inline struct c4iw_fr_page_list *to_c4iw_fr_page_list(
diff --git a/drivers/infiniband/hw/cxgb4/mem.c 
b/drivers/infiniband/hw/cxgb4/mem.c
index 41b1195..cdaf257 100644
--- a/drivers/infiniband/hw/cxgb4/mem.c
+++ b/drivers/infiniband/hw/cxgb4/mem.c
@@ -903,7 +903,11 @@ struct ib_fast_reg_page_list 
*c4iw_alloc_fastreg_pbl(struct ib_device *device,
dma_unmap_addr_set(c4pl, mapping, dma_addr);
c4pl->dma_addr = dma_addr;
c4pl->dev = dev;
-   c4pl->ibpl.max_page_list_len = pll_len;
+   c4pl->pll_len = pll_len;
+
+   PDBG("%s c4pl %p pll_len %u page_list %p dma_addr %p\n",
+__func__, c4pl, c4pl->pll_len, c4pl->ibpl.page_list,
+(void *)c4pl->dma_addr);
 
return &c4pl->ibpl;
 }
@@ -912,8 +916,12 @@ void c4iw_free_fastreg_pbl(struct ib_fast_reg_page_list 
*ibpl)
 {
struct c4iw_fr_page_list *c4pl = to_c4iw_fr_page_list(ibpl);
 
+   PDBG("%s c4pl %p pll_len %u page_list %p dma_addr %p\n",
+__func__, c4pl, c4pl->pll_len, c4pl->ibpl.page_list,
+(void *)c4pl->dma_addr);
+
dma_free_coherent(&c4pl->dev->rdev.lldi.pdev->dev,
- c4pl->ibpl.max_page_list_len,
+ c4pl->pll_len,
  c4pl->ibpl.page_list, dma_unmap_addr(c4pl, mapping));
kfree(c4pl);
 }
-- 
1.7.1



[PATCHv2 net-next 19/31] iw_cxgb4: connect_request_upcall fixes.

2014-03-02 Thread Hariprasad Shenai
From: Steve Wise 

When processing an MPA Start Request, if the listening
endpoint is DEAD, then abort the connection.

If the IWCM returns an error, then we must abort the connection and
release resources.  Also abort_connection() should not post a CLOSE
event, so clean that up too.
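
The resulting upcall flow, condensed from the diff below:

    /* Deliver the MPA Start Request under the listener's mutex so a
     * racing listener teardown cannot slip in after the DEAD check. */
    mutex_lock(&ep->parent_ep->com.mutex);
    if (ep->parent_ep->com.state != DEAD) {
        if (connect_request_upcall(ep))     /* IWCM returned an error */
            abort_connection(ep, skb, GFP_KERNEL);
    } else {
        abort_connection(ep, skb, GFP_KERNEL);
    }
    mutex_unlock(&ep->parent_ep->com.mutex);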

Signed-off-by: Steve Wise 
---
 drivers/infiniband/hw/cxgb4/cm.c |   40 ++---
 1 files changed, 24 insertions(+), 16 deletions(-)

diff --git a/drivers/infiniband/hw/cxgb4/cm.c b/drivers/infiniband/hw/cxgb4/cm.c
index c5c42d0..452ae3a 100644
--- a/drivers/infiniband/hw/cxgb4/cm.c
+++ b/drivers/infiniband/hw/cxgb4/cm.c
@@ -968,13 +968,14 @@ static int act_establish(struct c4iw_dev *dev, struct 
sk_buff *skb)
return 0;
 }
 
-static void close_complete_upcall(struct c4iw_ep *ep)
+static void close_complete_upcall(struct c4iw_ep *ep, int status)
 {
struct iw_cm_event event;
 
PDBG("%s ep %p tid %u\n", __func__, ep, ep->hwtid);
memset(&event, 0, sizeof(event));
event.event = IW_CM_EVENT_CLOSE;
+   event.status = status;
if (ep->com.cm_id) {
PDBG("close complete delivered ep %p cm_id %p tid %u\n",
 ep, ep->com.cm_id, ep->hwtid);
@@ -988,7 +989,6 @@ static void close_complete_upcall(struct c4iw_ep *ep)
 static int abort_connection(struct c4iw_ep *ep, struct sk_buff *skb, gfp_t gfp)
 {
PDBG("%s ep %p tid %u\n", __func__, ep, ep->hwtid);
-   close_complete_upcall(ep);
state_set(&ep->com, ABORTING);
set_bit(ABORT_CONN, &ep->com.history);
return send_abort(ep, skb, gfp);
@@ -1067,9 +1067,10 @@ static void connect_reply_upcall(struct c4iw_ep *ep, int 
status)
}
 }
 
-static void connect_request_upcall(struct c4iw_ep *ep)
+static int connect_request_upcall(struct c4iw_ep *ep)
 {
struct iw_cm_event event;
+   int ret;
 
PDBG("%s ep %p tid %u\n", __func__, ep, ep->hwtid);
memset(&event, 0, sizeof(event));
@@ -1094,15 +1095,14 @@ static void connect_request_upcall(struct c4iw_ep *ep)
event.private_data_len = ep->plen;
event.private_data = ep->mpa_pkt + sizeof(struct mpa_message);
}
-   if (state_read(&ep->parent_ep->com) != DEAD) {
-   c4iw_get_ep(&ep->com);
-   ep->parent_ep->com.cm_id->event_handler(
-   ep->parent_ep->com.cm_id,
-   &event);
-   }
+   c4iw_get_ep(&ep->com);
+   ret = ep->parent_ep->com.cm_id->event_handler(ep->parent_ep->com.cm_id,
+ &event);
+   if (ret)
+   c4iw_put_ep(&ep->com);
set_bit(CONNREQ_UPCALL, &ep->com.history);
c4iw_put_ep(&ep->parent_ep->com);
-   ep->parent_ep = NULL;
+   return ret;
 }
 
 static void established_upcall(struct c4iw_ep *ep)
@@ -1401,7 +1401,6 @@ static void process_mpa_request(struct c4iw_ep *ep, 
struct sk_buff *skb)
return;
 
PDBG("%s enter (%s line %u)\n", __func__, __FILE__, __LINE__);
-   stop_ep_timer(ep);
mpa = (struct mpa_message *) ep->mpa_pkt;
 
/*
@@ -1494,9 +1493,17 @@ static void process_mpa_request(struct c4iw_ep *ep, 
struct sk_buff *skb)
 ep->mpa_attr.p2p_type);
 
state_set(&ep->com, MPA_REQ_RCVD);
+   stop_ep_timer(ep);
 
/* drive upcall */
-   connect_request_upcall(ep);
+   mutex_lock(&ep->parent_ep->com.mutex);
+   if (ep->parent_ep->com.state != DEAD) {
+   if (connect_request_upcall(ep))
+   abort_connection(ep, skb, GFP_KERNEL);
+   } else {
+   abort_connection(ep, skb, GFP_KERNEL);
+   }
+   mutex_unlock(&ep->parent_ep->com.mutex);
return;
 }
 
@@ -2257,7 +2264,7 @@ static int peer_close(struct c4iw_dev *dev, struct 
sk_buff *skb)
c4iw_modify_qp(ep->com.qp->rhp, ep->com.qp,
   C4IW_QP_ATTR_NEXT_STATE, &attrs, 1);
}
-   close_complete_upcall(ep);
+   close_complete_upcall(ep, 0);
__state_set(&ep->com, DEAD);
release = 1;
disconnect = 0;
@@ -2427,7 +2434,7 @@ static int close_con_rpl(struct c4iw_dev *dev, struct 
sk_buff *skb)
 C4IW_QP_ATTR_NEXT_STATE,
 &attrs, 1);
}
-   close_complete_upcall(ep);
+   close_complete_upcall(ep, 0);
__state_set(&ep->com, DEAD);
release = 1;
break;
@@ -2982,7 +2989,7 @@ int c4iw_ep_disconnect(struct c4iw_ep *ep, int abrupt, 
gfp_t gfp)
rdev = &ep->com.dev->rdev;
if (c4iw_fatal_error(rdev)) {
fatal = 1;
-   close_complete_upcall(ep);
+   close_complete_upcall(ep, -EIO);

[PATCHv2 net-next 16/31] iw_cxgb4: don't leak skb in c4iw_uld_rx_handler().

2014-03-02 Thread Hariprasad Shenai
From: Steve Wise 

Signed-off-by: Steve Wise 
---
 drivers/infiniband/hw/cxgb4/device.c |4 +++-
 1 files changed, 3 insertions(+), 1 deletions(-)

diff --git a/drivers/infiniband/hw/cxgb4/device.c 
b/drivers/infiniband/hw/cxgb4/device.c
index 9542ccc..f73fea4 100644
--- a/drivers/infiniband/hw/cxgb4/device.c
+++ b/drivers/infiniband/hw/cxgb4/device.c
@@ -936,9 +936,11 @@ static int c4iw_uld_rx_handler(void *handle, const __be64 
*rsp,
opcode = *(u8 *)rsp;
if (c4iw_handlers[opcode])
c4iw_handlers[opcode](dev, skb);
-   else
+   else {
pr_info("%s no handler opcode 0x%x...\n", __func__,
   opcode);
+   kfree_skb(skb);
+   }
 
return 0;
 nomem:
-- 
1.7.1



[PATCHv2 net-next 24/31] iw_cxgb4: rx_data() needs to hold the ep mutex.

2014-03-02 Thread Hariprasad Shenai
From: Steve Wise 

To avoid racing with other threads doing close/flush/whatever, rx_data()
should hold the endpoint mutex.
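
Condensed from the diff below, rx_data() now inspects the state under the
mutex rather than via the lock-free state_read():

    mutex_lock(&ep->com.mutex);
    switch (ep->com.state) {    /* stable: close/abort also hold the mutex */
    case MPA_REQ_SENT:
        process_mpa_reply(ep, skb);
        break;
    case MPA_REQ_WAIT:
        process_mpa_request(ep, skb);
        break;
    default:
        break;
    }
    mutex_unlock(&ep->com.mutex);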

Signed-off-by: Steve Wise 
---
 drivers/infiniband/hw/cxgb4/cm.c |   16 +---
 1 files changed, 9 insertions(+), 7 deletions(-)

diff --git a/drivers/infiniband/hw/cxgb4/cm.c b/drivers/infiniband/hw/cxgb4/cm.c
index 2930e91..f30ed32 100644
--- a/drivers/infiniband/hw/cxgb4/cm.c
+++ b/drivers/infiniband/hw/cxgb4/cm.c
@@ -1193,7 +1193,7 @@ static void process_mpa_reply(struct c4iw_ep *ep, struct 
sk_buff *skb)
 * the connection.
 */
stop_ep_timer(ep);
-   if (state_read(&ep->com) != MPA_REQ_SENT)
+   if (ep->com.state != MPA_REQ_SENT)
return;
 
/*
@@ -1268,7 +1268,7 @@ static void process_mpa_reply(struct c4iw_ep *ep, struct 
sk_buff *skb)
 * start reply message including private data. And
 * the MPA header is valid.
 */
-   state_set(&ep->com, FPDU_MODE);
+   __state_set(&ep->com, FPDU_MODE);
ep->mpa_attr.crc_enabled = (mpa->flags & MPA_CRC) | crc_enabled ? 1 : 0;
ep->mpa_attr.recv_marker_enabled = markers_enabled;
ep->mpa_attr.xmit_marker_enabled = mpa->flags & MPA_MARKERS ? 1 : 0;
@@ -1383,7 +1383,7 @@ static void process_mpa_reply(struct c4iw_ep *ep, struct 
sk_buff *skb)
}
goto out;
 err:
-   state_set(&ep->com, ABORTING);
+   __state_set(&ep->com, ABORTING);
send_abort(ep, skb, GFP_KERNEL);
 out:
connect_reply_upcall(ep, err);
@@ -1398,7 +1398,7 @@ static void process_mpa_request(struct c4iw_ep *ep, 
struct sk_buff *skb)
 
PDBG("%s ep %p tid %u\n", __func__, ep, ep->hwtid);
 
-   if (state_read(&ep->com) != MPA_REQ_WAIT)
+   if (ep->com.state != MPA_REQ_WAIT)
return;
 
/*
@@ -1519,7 +1519,7 @@ static void process_mpa_request(struct c4iw_ep *ep, 
struct sk_buff *skb)
 ep->mpa_attr.xmit_marker_enabled, ep->mpa_attr.version,
 ep->mpa_attr.p2p_type);
 
-   state_set(&ep->com, MPA_REQ_RCVD);
+   __state_set(&ep->com, MPA_REQ_RCVD);
stop_ep_timer(ep);
 
/* drive upcall */
@@ -1549,11 +1549,12 @@ static int rx_data(struct c4iw_dev *dev, struct sk_buff 
*skb)
PDBG("%s ep %p tid %u dlen %u\n", __func__, ep, ep->hwtid, dlen);
skb_pull(skb, sizeof(*hdr));
skb_trim(skb, dlen);
+   mutex_lock(&ep->com.mutex);
 
/* update RX credits */
update_rx_credits(ep, dlen);
 
-   switch (state_read(&ep->com)) {
+   switch (ep->com.state) {
case MPA_REQ_SENT:
ep->rcv_seq += dlen;
process_mpa_reply(ep, skb);
@@ -1569,7 +1570,7 @@ static int rx_data(struct c4iw_dev *dev, struct sk_buff 
*skb)
pr_err("%s Unexpected streaming data." \
   " qpid %u ep %p state %d tid %u status %d\n",
   __func__, ep->com.qp->wq.sq.qid, ep,
-  state_read(&ep->com), ep->hwtid, status);
+  ep->com.state, ep->hwtid, status);
attrs.next_state = C4IW_QP_STATE_TERMINATE;
c4iw_modify_qp(ep->com.qp->rhp, ep->com.qp,
   C4IW_QP_ATTR_NEXT_STATE, &attrs, 0);
@@ -1578,6 +1579,7 @@ static int rx_data(struct c4iw_dev *dev, struct sk_buff 
*skb)
default:
break;
}
+   mutex_unlock(&ep->com.mutex);
return 0;
 }
 
-- 
1.7.1



[PATCHv2 net-next 12/31] iw_cxgb4: Fix incorrect BUG_ON conditions.

2014-03-02 Thread Hariprasad Shenai
From: Steve Wise 

Based on original work from Jay Hernandez 

Signed-off-by: Steve Wise 
---
 drivers/infiniband/hw/cxgb4/cq.c |4 ++--
 1 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/infiniband/hw/cxgb4/cq.c b/drivers/infiniband/hw/cxgb4/cq.c
index c0673ac..59f7601 100644
--- a/drivers/infiniband/hw/cxgb4/cq.c
+++ b/drivers/infiniband/hw/cxgb4/cq.c
@@ -603,7 +603,7 @@ proc_cqe:
 */
if (SQ_TYPE(hw_cqe)) {
int idx = CQE_WRID_SQ_IDX(hw_cqe);
-   BUG_ON(idx > wq->sq.size);
+   BUG_ON(idx >= wq->sq.size);
 
/*
* Account for any unsignaled completions completed by
@@ -617,7 +617,7 @@ proc_cqe:
wq->sq.in_use -= wq->sq.size + idx - wq->sq.cidx;
else
wq->sq.in_use -= idx - wq->sq.cidx;
-   BUG_ON(wq->sq.in_use < 0 && wq->sq.in_use < wq->sq.size);
+   BUG_ON(wq->sq.in_use <= 0 && wq->sq.in_use >= wq->sq.size);
 
wq->sq.cidx = (uint16_t)idx;
PDBG("%s completing sq idx %u\n", __func__, wq->sq.cidx);
-- 
1.7.1



[PATCHv2 net-next 28/31] iw_cxgb4: SQ flush fix.

2014-03-02 Thread Hariprasad Shenai
From: Steve Wise 

There is a race when moving a QP from RTS->CLOSING where a SQ work
request could be posted after the FW receives the RDMA_RI/FINI WR.
The SQ work request will never get processed, and should be completed
with FLUSHED status.  Function c4iw_flush_sq(), however, was dropping
the oldest SQ work request when in CLOSING or IDLE states, instead of
completing the pending work request. If that oldest pending work request
was actually complete and has a CQE in the CQ, then when that CQE is
processed in poll_cq, we'll BUG_ON() due to the inconsistent SQ/CQ state.

This is a very small timing hole and has only been hit once so far.

The fix is two-fold:

1) c4iw_flush_sq() MUST always flush all non-completed WRs with FLUSHED
status regardless of the QP state.

2) In c4iw_modify_rc_qp(), always set the "in error" bit on the queue
before moving the state out of RTS.  This ensures that the state
transition will not happen while another thread is in post_rc_send(),
because set_state() and post_rc_send() both acquire the qp spinlock.
Also, once we transition the state out of RTS, subsequent calls to
post_rc_send() will fail because the "in error" bit is set.  I don't
think this fully closes the race where the FW can get a FINI followed by
a SQ work request being posted (because they are posted to different EQs),
but the #1 fix will handle the issue by flushing the SQ work request.
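
Condensed, the two pieces of the fix below look like this (trimmed from
the diffs; the oldest-read bookkeeping and the BUG_ON checks are omitted):

        /* qp.c: mark the WQ in error before leaving RTS, under the qp lock
         * that the post-send path also takes.
         */
        case C4IW_QP_STATE_CLOSING:
                t4_set_wq_in_error(&qhp->wq);
                set_state(qhp, C4IW_QP_STATE_CLOSING);
                /* ... */

        /* cq.c: c4iw_flush_sq() now completes every pending swsqe with a
         * flush CQE regardless of the QP state.
         */
        while (idx != wq->sq.pidx) {
                swsqe = &wq->sq.sw_sq[idx];
                swsqe->flushed = 1;
                insert_sq_cqe(wq, cq, swsqe);
                flushed++;
                if (++idx == wq->sq.size)
                        idx = 0;
        }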

Signed-off-by: Steve Wise 
---
 drivers/infiniband/hw/cxgb4/cq.c |   22 --
 drivers/infiniband/hw/cxgb4/qp.c |6 +++---
 2 files changed, 11 insertions(+), 17 deletions(-)

diff --git a/drivers/infiniband/hw/cxgb4/cq.c b/drivers/infiniband/hw/cxgb4/cq.c
index e310762..351c8e0 100644
--- a/drivers/infiniband/hw/cxgb4/cq.c
+++ b/drivers/infiniband/hw/cxgb4/cq.c
@@ -235,27 +235,21 @@ int c4iw_flush_sq(struct c4iw_qp *qhp)
struct t4_cq *cq = &chp->cq;
int idx;
struct t4_swsqe *swsqe;
-   int error = (qhp->attr.state != C4IW_QP_STATE_CLOSING &&
-   qhp->attr.state != C4IW_QP_STATE_IDLE);
 
if (wq->sq.flush_cidx == -1)
wq->sq.flush_cidx = wq->sq.cidx;
idx = wq->sq.flush_cidx;
BUG_ON(idx >= wq->sq.size);
while (idx != wq->sq.pidx) {
-   if (error) {
-   swsqe = &wq->sq.sw_sq[idx];
-   BUG_ON(swsqe->flushed);
-   swsqe->flushed = 1;
-   insert_sq_cqe(wq, cq, swsqe);
-   if (wq->sq.oldest_read == swsqe) {
-   BUG_ON(swsqe->opcode != FW_RI_READ_REQ);
-   advance_oldest_read(wq);
-   }
-   flushed++;
-   } else {
-   t4_sq_consume(wq);
+   swsqe = &wq->sq.sw_sq[idx];
+   BUG_ON(swsqe->flushed);
+   swsqe->flushed = 1;
+   insert_sq_cqe(wq, cq, swsqe);
+   if (wq->sq.oldest_read == swsqe) {
+   BUG_ON(swsqe->opcode != FW_RI_READ_REQ);
+   advance_oldest_read(wq);
}
+   flushed++;
if (++idx == wq->sq.size)
idx = 0;
}
diff --git a/drivers/infiniband/hw/cxgb4/qp.c b/drivers/infiniband/hw/cxgb4/qp.c
index 3f2065c..34a0bd2 100644
--- a/drivers/infiniband/hw/cxgb4/qp.c
+++ b/drivers/infiniband/hw/cxgb4/qp.c
@@ -1371,6 +1371,7 @@ int c4iw_modify_qp(struct c4iw_dev *rhp, struct c4iw_qp 
*qhp,
switch (attrs->next_state) {
case C4IW_QP_STATE_CLOSING:
BUG_ON(atomic_read(&qhp->ep->com.kref.refcount) < 2);
+   t4_set_wq_in_error(&qhp->wq);
set_state(qhp, C4IW_QP_STATE_CLOSING);
ep = qhp->ep;
if (!internal) {
@@ -1378,16 +1379,15 @@ int c4iw_modify_qp(struct c4iw_dev *rhp, struct c4iw_qp 
*qhp,
disconnect = 1;
c4iw_get_ep(&qhp->ep->com);
}
-   t4_set_wq_in_error(&qhp->wq);
ret = rdma_fini(rhp, qhp, ep);
if (ret)
goto err;
break;
case C4IW_QP_STATE_TERMINATE:
+   t4_set_wq_in_error(&qhp->wq);
set_state(qhp, C4IW_QP_STATE_TERMINATE);
qhp->attr.layer_etype = attrs->layer_etype;
qhp->attr.ecode = attrs->ecode;
-   t4_set_wq_in_error(&qhp->wq);
ep = qhp->ep;
disconnect = 1;
if (!internal)
@@ -1400,8 +1400,8 @@ int c4iw_modify_qp(struct c4iw_dev *rhp, struct c4iw_qp 
*qhp,
c4iw_get_ep(&qhp->ep->com);
break;
case C4IW_QP_STA

[PATCHv2 net-next 01/31] cxgb4: Fix some small bugs in t4_sge_init_soft() when our Page Size is 64KB

2014-03-02 Thread Hariprasad Shenai
From: Kumar Sanghvi 

We'd come in with SGE_FL_BUFFER_SIZE[0] and [1] both equal to 64KB and the
extant logic would flag that as an error.

Based on original work by Casey Leedom 

Signed-off-by: Kumar Sanghvi 
---
 drivers/net/ethernet/chelsio/cxgb4/sge.c |   12 ++--
 1 files changed, 10 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/chelsio/cxgb4/sge.c 
b/drivers/net/ethernet/chelsio/cxgb4/sge.c
index af76b25..3a2ecd8 100644
--- a/drivers/net/ethernet/chelsio/cxgb4/sge.c
+++ b/drivers/net/ethernet/chelsio/cxgb4/sge.c
@@ -2596,11 +2596,19 @@ static int t4_sge_init_soft(struct adapter *adap)
fl_small_mtu = READ_FL_BUF(RX_SMALL_MTU_BUF);
fl_large_mtu = READ_FL_BUF(RX_LARGE_MTU_BUF);
 
+   /* We only bother using the Large Page logic if the Large Page Buffer
+* is larger than our Page Size Buffer.
+*/
+   if (fl_large_pg <= fl_small_pg)
+   fl_large_pg = 0;
+
#undef READ_FL_BUF
 
+   /* The Page Size Buffer must be exactly equal to our Page Size and the
+* Large Page Size Buffer should be 0 (per above) or a power of 2.
+*/
if (fl_small_pg != PAGE_SIZE ||
-   (fl_large_pg != 0 && (fl_large_pg < fl_small_pg ||
- (fl_large_pg & (fl_large_pg-1)) != 0))) {
+   (fl_large_pg & (fl_large_pg-1)) != 0) {
dev_err(adap->pdev_dev, "bad SGE FL page buffer sizes [%d, 
%d]\n",
fl_small_pg, fl_large_pg);
return -EINVAL;
-- 
1.7.1



[PATCHv2 net-next 04/31] cxgb4: Updates for T5 SGE's Egress Congestion Threshold

2014-03-02 Thread Hariprasad Shenai
From: Kumar Sanghvi 

Based on original work by Casey Leedom 

Signed-off-by: Kumar Sanghvi 
---
 drivers/net/ethernet/chelsio/cxgb4/sge.c |   18 +-
 drivers/net/ethernet/chelsio/cxgb4/t4_regs.h |6 ++
 2 files changed, 19 insertions(+), 5 deletions(-)

diff --git a/drivers/net/ethernet/chelsio/cxgb4/sge.c 
b/drivers/net/ethernet/chelsio/cxgb4/sge.c
index 809ab60..e0376cd 100644
--- a/drivers/net/ethernet/chelsio/cxgb4/sge.c
+++ b/drivers/net/ethernet/chelsio/cxgb4/sge.c
@@ -2777,8 +2777,8 @@ static int t4_sge_init_hard(struct adapter *adap)
 int t4_sge_init(struct adapter *adap)
 {
struct sge *s = &adap->sge;
-   u32 sge_control;
-   int ret;
+   u32 sge_control, sge_conm_ctrl;
+   int ret, egress_threshold;
 
/*
 * Ingress Padding Boundary and Egress Status Page Size are set up by
@@ -2803,10 +2803,18 @@ int t4_sge_init(struct adapter *adap)
 * SGE's Egress Congestion Threshold.  If it isn't, then we can get
 * stuck waiting for new packets while the SGE is waiting for us to
 * give it more Free List entries.  (Note that the SGE's Egress
-* Congestion Threshold is in units of 2 Free List pointers.)
+* Congestion Threshold is in units of 2 Free List pointers.) For T4,
+* there was only a single field to control this.  For T5 there's the
+* original field which now only applies to Unpacked Mode Free List
+* buffers and a new field which only applies to Packed Mode Free List
+* buffers.
 */
-   s->fl_starve_thres
-   = EGRTHRESHOLD_GET(t4_read_reg(adap, SGE_CONM_CTRL))*2 + 1;
+   sge_conm_ctrl = t4_read_reg(adap, SGE_CONM_CTRL);
+   if (is_t4(adap->params.chip))
+   egress_threshold = EGRTHRESHOLD_GET(sge_conm_ctrl);
+   else
+   egress_threshold = EGRTHRESHOLDPACKING_GET(sge_conm_ctrl);
+   s->fl_starve_thres = 2*egress_threshold + 1;
 
setup_timer(&s->rx_timer, sge_rx_timer_cb, (unsigned long)adap);
setup_timer(&s->tx_timer, sge_tx_timer_cb, (unsigned long)adap);
diff --git a/drivers/net/ethernet/chelsio/cxgb4/t4_regs.h 
b/drivers/net/ethernet/chelsio/cxgb4/t4_regs.h
index 33cf9ef..225ad8a 100644
--- a/drivers/net/ethernet/chelsio/cxgb4/t4_regs.h
+++ b/drivers/net/ethernet/chelsio/cxgb4/t4_regs.h
@@ -230,6 +230,12 @@
 #define  EGRTHRESHOLD(x) ((x) << EGRTHRESHOLDshift)
 #define  EGRTHRESHOLD_GET(x) (((x) & EGRTHRESHOLD_MASK) >> EGRTHRESHOLDshift)
 
+#define EGRTHRESHOLDPACKING_MASK   0x3fU
+#define EGRTHRESHOLDPACKING_SHIFT  14
+#define EGRTHRESHOLDPACKING(x) ((x) << EGRTHRESHOLDPACKING_SHIFT)
+#define EGRTHRESHOLDPACKING_GET(x) (((x) >> EGRTHRESHOLDPACKING_SHIFT) & \
+ EGRTHRESHOLDPACKING_MASK)
+
 #define SGE_DBFIFO_STATUS 0x10a4
 #define  HP_INT_THRESH_SHIFT 28
 #define  HP_INT_THRESH_MASK  0xfU
-- 
1.7.1



[PATCHv2 net-next 07/31] iw_cxgb4: Allow loopback connections.

2014-03-02 Thread Hariprasad Shenai
From: Steve Wise 

find_route() must treat loopback as a valid
egress interface.

Signed-off-by: Steve Wise 
---
 drivers/infiniband/hw/cxgb4/cm.c |3 ++-
 1 files changed, 2 insertions(+), 1 deletions(-)

diff --git a/drivers/infiniband/hw/cxgb4/cm.c b/drivers/infiniband/hw/cxgb4/cm.c
index d286bde..360807e 100644
--- a/drivers/infiniband/hw/cxgb4/cm.c
+++ b/drivers/infiniband/hw/cxgb4/cm.c
@@ -400,7 +400,8 @@ static struct dst_entry *find_route(struct c4iw_dev *dev, 
__be32 local_ip,
n = dst_neigh_lookup(&rt->dst, &peer_ip);
if (!n)
return NULL;
-   if (!our_interface(dev, n->dev)) {
+   if (!our_interface(dev, n->dev) &&
+   !(n->dev->flags & IFF_LOOPBACK)) {
dst_release(&rt->dst);
return NULL;
}
-- 
1.7.1



[PATCHv2 net-next 08/31] iw_cxgb4: release neigh entry in error paths.

2014-03-02 Thread Hariprasad Shenai
From: Steve Wise 

Always release the neigh entry in rx_pkt().

Based on original work by Santosh Rastapur .

Signed-off-by: Steve Wise 
---
 drivers/infiniband/hw/cxgb4/cm.c |5 +++--
 1 files changed, 3 insertions(+), 2 deletions(-)

diff --git a/drivers/infiniband/hw/cxgb4/cm.c b/drivers/infiniband/hw/cxgb4/cm.c
index 360807e..74a2250 100644
--- a/drivers/infiniband/hw/cxgb4/cm.c
+++ b/drivers/infiniband/hw/cxgb4/cm.c
@@ -3350,10 +3350,9 @@ static int rx_pkt(struct c4iw_dev *dev, struct sk_buff 
*skb)
if (!e) {
pr_err("%s - failed to allocate l2t entry!\n",
   __func__);
-   goto free_dst;
+   goto free_neigh;
}
 
-   neigh_release(neigh);
step = dev->rdev.lldi.nrxq / dev->rdev.lldi.nchan;
rss_qid = dev->rdev.lldi.rxq_ids[pi->port_id * step];
window = (__force u16) htons((__force u16)tcph->window);
@@ -3373,6 +3372,8 @@ static int rx_pkt(struct c4iw_dev *dev, struct sk_buff 
*skb)
  tcph->source, ntohl(tcph->seq), filter, window,
  rss_qid, pi->port_id);
cxgb4_l2t_release(e);
+free_neigh:
+   neigh_release(neigh);
 free_dst:
dst_release(dst);
 reject:
-- 
1.7.1



[PATCHv2 net-next 06/31] iw_cxgb4: cap CQ size at T4_MAX_IQ_SIZE.

2014-03-02 Thread Hariprasad Shenai
From: Steve Wise 

Signed-off-by: Steve Wise 
---
 drivers/infiniband/hw/cxgb4/cq.c |2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/drivers/infiniband/hw/cxgb4/cq.c b/drivers/infiniband/hw/cxgb4/cq.c
index 88de3aa..c0673ac 100644
--- a/drivers/infiniband/hw/cxgb4/cq.c
+++ b/drivers/infiniband/hw/cxgb4/cq.c
@@ -881,7 +881,7 @@ struct ib_cq *c4iw_create_cq(struct ib_device *ibdev, int 
entries,
/*
 * Make actual HW queue 2x to avoid cdix_inc overflows.
 */
-   hwentries = entries * 2;
+   hwentries = min(entries * 2, T4_MAX_IQ_SIZE);
 
/*
 * Make HW queue at least 64 entries so GTS updates aren't too
-- 
1.7.1



[PATCHv2 net-next 02/31] cxgb4: Add code to dump SGE registers when hitting idma hangs

2014-03-02 Thread Hariprasad Shenai
From: Kumar Sanghvi 

Based on original work by Casey Leedom 

Signed-off-by: Kumar Sanghvi 
---
 drivers/net/ethernet/chelsio/cxgb4/cxgb4.h   |1 +
 drivers/net/ethernet/chelsio/cxgb4/t4_hw.c   |  106 ++
 drivers/net/ethernet/chelsio/cxgb4/t4_regs.h |3 +
 3 files changed, 110 insertions(+), 0 deletions(-)

diff --git a/drivers/net/ethernet/chelsio/cxgb4/cxgb4.h 
b/drivers/net/ethernet/chelsio/cxgb4/cxgb4.h
index 944f2cb..509c976 100644
--- a/drivers/net/ethernet/chelsio/cxgb4/cxgb4.h
+++ b/drivers/net/ethernet/chelsio/cxgb4/cxgb4.h
@@ -1032,4 +1032,5 @@ void t4_db_dropped(struct adapter *adapter);
 int t4_mem_win_read_len(struct adapter *adap, u32 addr, __be32 *data, int len);
 int t4_fwaddrspace_write(struct adapter *adap, unsigned int mbox,
 u32 addr, u32 val);
+void t4_sge_decode_idma_state(struct adapter *adapter, int state);
 #endif /* __CXGB4_H__ */
diff --git a/drivers/net/ethernet/chelsio/cxgb4/t4_hw.c 
b/drivers/net/ethernet/chelsio/cxgb4/t4_hw.c
index d3c2a51..fb2fe65 100644
--- a/drivers/net/ethernet/chelsio/cxgb4/t4_hw.c
+++ b/drivers/net/ethernet/chelsio/cxgb4/t4_hw.c
@@ -2597,6 +2597,112 @@ int t4_mdio_wr(struct adapter *adap, unsigned int mbox, 
unsigned int phy_addr,
 }
 
 /**
+ * t4_sge_decode_idma_state - decode the idma state
+ * @adap: the adapter
+ * @state: the state idma is stuck in
+ */
+void t4_sge_decode_idma_state(struct adapter *adapter, int state)
+{
+   static const char * const t4_decode[] = {
+   "IDMA_IDLE",
+   "IDMA_PUSH_MORE_CPL_FIFO",
+   "IDMA_PUSH_CPL_MSG_HEADER_TO_FIFO",
+   "Not used",
+   "IDMA_PHYSADDR_SEND_PCIEHDR",
+   "IDMA_PHYSADDR_SEND_PAYLOAD_FIRST",
+   "IDMA_PHYSADDR_SEND_PAYLOAD",
+   "IDMA_SEND_FIFO_TO_IMSG",
+   "IDMA_FL_REQ_DATA_FL_PREP",
+   "IDMA_FL_REQ_DATA_FL",
+   "IDMA_FL_DROP",
+   "IDMA_FL_H_REQ_HEADER_FL",
+   "IDMA_FL_H_SEND_PCIEHDR",
+   "IDMA_FL_H_PUSH_CPL_FIFO",
+   "IDMA_FL_H_SEND_CPL",
+   "IDMA_FL_H_SEND_IP_HDR_FIRST",
+   "IDMA_FL_H_SEND_IP_HDR",
+   "IDMA_FL_H_REQ_NEXT_HEADER_FL",
+   "IDMA_FL_H_SEND_NEXT_PCIEHDR",
+   "IDMA_FL_H_SEND_IP_HDR_PADDING",
+   "IDMA_FL_D_SEND_PCIEHDR",
+   "IDMA_FL_D_SEND_CPL_AND_IP_HDR",
+   "IDMA_FL_D_REQ_NEXT_DATA_FL",
+   "IDMA_FL_SEND_PCIEHDR",
+   "IDMA_FL_PUSH_CPL_FIFO",
+   "IDMA_FL_SEND_CPL",
+   "IDMA_FL_SEND_PAYLOAD_FIRST",
+   "IDMA_FL_SEND_PAYLOAD",
+   "IDMA_FL_REQ_NEXT_DATA_FL",
+   "IDMA_FL_SEND_NEXT_PCIEHDR",
+   "IDMA_FL_SEND_PADDING",
+   "IDMA_FL_SEND_COMPLETION_TO_IMSG",
+   "IDMA_FL_SEND_FIFO_TO_IMSG",
+   "IDMA_FL_REQ_DATAFL_DONE",
+   "IDMA_FL_REQ_HEADERFL_DONE",
+   };
+   static const char * const t5_decode[] = {
+   "IDMA_IDLE",
+   "IDMA_ALMOST_IDLE",
+   "IDMA_PUSH_MORE_CPL_FIFO",
+   "IDMA_PUSH_CPL_MSG_HEADER_TO_FIFO",
+   "IDMA_SGEFLRFLUSH_SEND_PCIEHDR",
+   "IDMA_PHYSADDR_SEND_PCIEHDR",
+   "IDMA_PHYSADDR_SEND_PAYLOAD_FIRST",
+   "IDMA_PHYSADDR_SEND_PAYLOAD",
+   "IDMA_SEND_FIFO_TO_IMSG",
+   "IDMA_FL_REQ_DATA_FL",
+   "IDMA_FL_DROP",
+   "IDMA_FL_DROP_SEND_INC",
+   "IDMA_FL_H_REQ_HEADER_FL",
+   "IDMA_FL_H_SEND_PCIEHDR",
+   "IDMA_FL_H_PUSH_CPL_FIFO",
+   "IDMA_FL_H_SEND_CPL",
+   "IDMA_FL_H_SEND_IP_HDR_FIRST",
+   "IDMA_FL_H_SEND_IP_HDR",
+   "IDMA_FL_H_REQ_NEXT_HEADER_FL",
+   "IDMA_FL_H_SEND_NEXT_PCIEHDR",
+   "IDMA_FL_H_SEND_IP_HDR_PADDING",
+   "IDMA_FL_D_SEND_PCIEHDR",
+   "IDMA_FL_D_SEND_CPL_AND_IP_HDR",
+   "IDMA_FL_D_REQ_NEXT_DATA_FL",
+   "IDMA_FL_SEND_PCIEHDR",
+   "IDMA_FL_PUSH_CPL_FIFO",
+   "IDMA_FL_SEND_CPL",
+   "IDMA_FL_SEND_PAYLOAD_FIRST",
+   "IDMA_FL_SEND_PAYLOAD",
+   "IDMA_FL_REQ_NEXT_DATA_FL",
+   "IDMA_FL_SEND_NEXT_PCIEHDR",
+   "IDMA_FL_SEND_PADDING",
+   "IDMA_FL_SEND_COMPLETION_TO_IMSG",
+   };
+   static const u32 sge_regs[] = {
+   SGE_DEBUG_DATA_LOW_INDEX_2,
+   SGE_DEBUG_DATA_LOW_INDEX_3,
+   SGE_DEBUG_DATA_HIGH_INDEX_10,
+   };
+   const char **sge_idma_decode;
+   int sge_idma_decode_nstates;
+   int i;
+
+   if (is_t4(adapter->params.chip)) {
+   sge_idma_decode = (const char **)t4_decode;
+   sge_idma_decode_nstates = ARRAY_SIZE(t4_decode);
+   } else {
+   

[PATCHv2 net-next 03/31] cxgb4: Rectify emitting messages about SGE Ingress DMA channels being potentially stuck

2014-03-02 Thread Hariprasad Shenai
From: Kumar Sanghvi 

Based on original work by Casey Leedom 

Signed-off-by: Kumar Sanghvi 
---
 drivers/net/ethernet/chelsio/cxgb4/cxgb4.h |9 ++-
 drivers/net/ethernet/chelsio/cxgb4/sge.c   |   91 ++--
 2 files changed, 80 insertions(+), 20 deletions(-)

diff --git a/drivers/net/ethernet/chelsio/cxgb4/cxgb4.h 
b/drivers/net/ethernet/chelsio/cxgb4/cxgb4.h
index 509c976..50abe1d 100644
--- a/drivers/net/ethernet/chelsio/cxgb4/cxgb4.h
+++ b/drivers/net/ethernet/chelsio/cxgb4/cxgb4.h
@@ -556,8 +556,13 @@ struct sge {
u32 pktshift;   /* padding between CPL & packet data */
u32 fl_align;   /* response queue message alignment */
u32 fl_starve_thres;/* Free List starvation threshold */
-   unsigned int starve_thres;
-   u8 idma_state[2];
+
+   /* State variables for detecting an SGE Ingress DMA hang */
+   unsigned int idma_1s_thresh;/* SGE same State Counter 1s threshold */
+   unsigned int idma_stalled[2];/* SGE synthesized stalled timers in HZ */
+   unsigned int idma_state[2]; /* SGE IDMA Hang detect state */
+   unsigned int idma_qid[2];   /* SGE IDMA Hung Ingress Queue ID */
+
unsigned int egr_start;
unsigned int ingr_start;
void *egr_map[MAX_EGRQ];/* qid->queue egress queue map */
diff --git a/drivers/net/ethernet/chelsio/cxgb4/sge.c 
b/drivers/net/ethernet/chelsio/cxgb4/sge.c
index 3a2ecd8..809ab60 100644
--- a/drivers/net/ethernet/chelsio/cxgb4/sge.c
+++ b/drivers/net/ethernet/chelsio/cxgb4/sge.c
@@ -93,6 +93,16 @@
  */
 #define TX_QCHECK_PERIOD (HZ / 2)
 
+/* SGE Hung Ingress DMA Threshold Warning time (in Hz) and Warning Repeat Rate
+ * (in RX_QCHECK_PERIOD multiples).  If we find one of the SGE Ingress DMA
+ * State Machines in the same state for this amount of time (in HZ) then we'll
+ * issue a warning about a potential hang.  We'll repeat the warning as the
+ * SGE Ingress DMA Channel appears to be hung every N RX_QCHECK_PERIODs till
+ * the situation clears.  If the situation clears, we'll note that as well.
+ */
+#define SGE_IDMA_WARN_THRESH (1 * HZ)
+#define SGE_IDMA_WARN_REPEAT (20 * RX_QCHECK_PERIOD)
+
 /*
  * Max number of Tx descriptors to be reclaimed by the Tx timer.
  */
@@ -2008,7 +2018,7 @@ irq_handler_t t4_intr_handler(struct adapter *adap)
 static void sge_rx_timer_cb(unsigned long data)
 {
unsigned long m;
-   unsigned int i, cnt[2];
+   unsigned int i, idma_same_state_cnt[2];
struct adapter *adap = (struct adapter *)data;
struct sge *s = &adap->sge;
 
@@ -2031,21 +2041,65 @@ static void sge_rx_timer_cb(unsigned long data)
}
 
t4_write_reg(adap, SGE_DEBUG_INDEX, 13);
-   cnt[0] = t4_read_reg(adap, SGE_DEBUG_DATA_HIGH);
-   cnt[1] = t4_read_reg(adap, SGE_DEBUG_DATA_LOW);
-
-   for (i = 0; i < 2; i++)
-   if (cnt[i] >= s->starve_thres) {
-   if (s->idma_state[i] || cnt[i] == 0x)
-   continue;
-   s->idma_state[i] = 1;
-   t4_write_reg(adap, SGE_DEBUG_INDEX, 11);
-   m = t4_read_reg(adap, SGE_DEBUG_DATA_LOW) >> (i * 16);
-   dev_warn(adap->pdev_dev,
-"SGE idma%u starvation detected for "
-"queue %lu\n", i, m & 0x);
-   } else if (s->idma_state[i])
-   s->idma_state[i] = 0;
+   idma_same_state_cnt[0] = t4_read_reg(adap, SGE_DEBUG_DATA_HIGH);
+   idma_same_state_cnt[1] = t4_read_reg(adap, SGE_DEBUG_DATA_LOW);
+
+   for (i = 0; i < 2; i++) {
+   u32 debug0, debug11;
+
+   /* If the Ingress DMA Same State Counter ("timer") is less
+* than 1s, then we can reset our synthesized Stall Timer and
+* continue.  If we have previously emitted warnings about a
+* potential stalled Ingress Queue, issue a note indicating
+* that the Ingress Queue has resumed forward progress.
+*/
+   if (idma_same_state_cnt[i] < s->idma_1s_thresh) {
+   if (s->idma_stalled[i] >= SGE_IDMA_WARN_THRESH)
+   CH_WARN(adap, "SGE idma%d, queue%u,resumed 
after %d sec",
+   i, s->idma_qid[i],
+   s->idma_stalled[i]/HZ);
+   s->idma_stalled[i] = 0;
+   continue;
+   }
+
+   /* Synthesize an SGE Ingress DMA Same State Timer in the Hz
+* domain.  The first time we get here it'll be because we
+* passed the 1s Threshold; each additional time it'll be
+* because the RX Timer Callback is being fired on its regular
+* schedule.
+*
+* If the stall is below our Potential Hung Ingress Qu

[PATCHv2 net-next 09/31] iw_cxgb4: Treat CPL_ERR_KEEPALV_NEG_ADVICE as negative advice.

2014-03-02 Thread Hariprasad Shenai
From: Steve Wise 

Based on original work by Anand Priyadarshee .

Signed-off-by: Steve Wise 
---
 drivers/infiniband/hw/cxgb4/cm.c|   25 +
 drivers/net/ethernet/chelsio/cxgb4/t4_msg.h |1 +
 2 files changed, 14 insertions(+), 12 deletions(-)

diff --git a/drivers/infiniband/hw/cxgb4/cm.c b/drivers/infiniband/hw/cxgb4/cm.c
index 74a2250..9387f74 100644
--- a/drivers/infiniband/hw/cxgb4/cm.c
+++ b/drivers/infiniband/hw/cxgb4/cm.c
@@ -1648,6 +1648,16 @@ static inline int act_open_has_tid(int status)
   status != CPL_ERR_ARP_MISS;
 }
 
+/*
+ * Returns whether a CPL status conveys negative advice.
+ */
+static int is_neg_adv(unsigned int status)
+{
+   return status == CPL_ERR_RTX_NEG_ADVICE ||
+  status == CPL_ERR_PERSIST_NEG_ADVICE ||
+  status == CPL_ERR_KEEPALV_NEG_ADVICE;
+}
+
 #define ACT_OPEN_RETRY_COUNT 2
 
 static int import_ep(struct c4iw_ep *ep, int iptype, __u8 *peer_ip,
@@ -1836,7 +1846,7 @@ static int act_open_rpl(struct c4iw_dev *dev, struct 
sk_buff *skb)
PDBG("%s ep %p atid %u status %u errno %d\n", __func__, ep, atid,
 status, status2errno(status));
 
-   if (status == CPL_ERR_RTX_NEG_ADVICE) {
+   if (is_neg_adv(status)) {
printk(KERN_WARNING MOD "Connection problems for atid %u\n",
atid);
return 0;
@@ -2266,15 +2276,6 @@ static int peer_close(struct c4iw_dev *dev, struct 
sk_buff *skb)
return 0;
 }
 
-/*
- * Returns whether an ABORT_REQ_RSS message is a negative advice.
- */
-static int is_neg_adv_abort(unsigned int status)
-{
-   return status == CPL_ERR_RTX_NEG_ADVICE ||
-  status == CPL_ERR_PERSIST_NEG_ADVICE;
-}
-
 static int peer_abort(struct c4iw_dev *dev, struct sk_buff *skb)
 {
struct cpl_abort_req_rss *req = cplhdr(skb);
@@ -2288,7 +2289,7 @@ static int peer_abort(struct c4iw_dev *dev, struct 
sk_buff *skb)
unsigned int tid = GET_TID(req);
 
ep = lookup_tid(t, tid);
-   if (is_neg_adv_abort(req->status)) {
+   if (is_neg_adv(req->status)) {
PDBG("%s neg_adv_abort ep %p tid %u\n", __func__, ep,
 ep->hwtid);
return 0;
@@ -3572,7 +3573,7 @@ static int peer_abort_intr(struct c4iw_dev *dev, struct 
sk_buff *skb)
kfree_skb(skb);
return 0;
}
-   if (is_neg_adv_abort(req->status)) {
+   if (is_neg_adv(req->status)) {
PDBG("%s neg_adv_abort ep %p tid %u\n", __func__, ep,
 ep->hwtid);
kfree_skb(skb);
diff --git a/drivers/net/ethernet/chelsio/cxgb4/t4_msg.h 
b/drivers/net/ethernet/chelsio/cxgb4/t4_msg.h
index cd6874b..f2738c7 100644
--- a/drivers/net/ethernet/chelsio/cxgb4/t4_msg.h
+++ b/drivers/net/ethernet/chelsio/cxgb4/t4_msg.h
@@ -116,6 +116,7 @@ enum CPL_error {
CPL_ERR_KEEPALIVE_TIMEDOUT = 34,
CPL_ERR_RTX_NEG_ADVICE = 35,
CPL_ERR_PERSIST_NEG_ADVICE = 36,
+   CPL_ERR_KEEPALV_NEG_ADVICE = 37,
CPL_ERR_ABORT_FAILED   = 42,
CPL_ERR_IWARP_FLM  = 50,
 };
-- 
1.7.1



[PATCHv2 net-next 00/31] Misc. fixes for cxgb4 and iw_cxgb4

2014-03-02 Thread Hariprasad Shenai
Hi All,

This patch series provides miscellaneous fixes for Chelsio T4/T5 adapters:
cxgb4 fixes related to the SGE and MTU handling, plus DB Drop avoidance
and other fixes in iw_cxgb4.

The patch series is created against David Miller's 'net-next' tree and
includes patches for the cxgb4 and iw_cxgb4 drivers.

We would like to request that this patch series be merged via David Miller's
'net-next' tree.

We have included all the maintainers of the respective drivers. Kindly review
the changes and let us know if you have any comments.

Thanks

V2:
   Don't drop existing module parameters.
   (cxgb4/iw_cxgb4: Doorbell Drop Avoidance Bug Fixes.)

Hariprasad Shenai (1):
  Revert "cxgb4: Don't assume LSO only uses SGL path in t4_eth_xmit()"

Kumar Sanghvi (5):
  cxgb4: Fix some small bugs in t4_sge_init_soft() when our Page Size
is 64KB
  cxgb4: Add code to dump SGE registers when hitting idma hangs
  cxgb4: Rectify emitting messages about SGE Ingress DMA channels being
potentially stuck
  cxgb4: Updates for T5 SGE's Egress Congestion Threshold
  cxgb4: use spinlock_irqsave/spinlock_irqrestore for db lock.

Steve Wise (25):
  iw_cxgb4: cap CQ size at T4_MAX_IQ_SIZE.
  iw_cxgb4: Allow loopback connections.
  iw_cxgb4: release neigh entry in error paths.
  iw_cxgb4: Treat CPL_ERR_KEEPALV_NEG_ADVICE as negative advice.
  cxgb4/iw_cxgb4: Doorbell Drop Avoidance Bug Fixes.
  iw_cxgb4: use the BAR2/WC path for kernel QPs and T5 devices.
  iw_cxgb4: Fix incorrect BUG_ON conditions.
  iw_cxgb4: Mind the sq_sig_all/sq_sig_type QP attributes.
  iw_cxgb4: default peer2peer mode to 1.
  iw_cxgb4: save the correct map length for fast_reg_page_lists.
  iw_cxgb4: don't leak skb in c4iw_uld_rx_handler().
  iw_cxgb4: fix possible memory leak in RX_PKT processing.
  iw_cxgb4: ignore read response type 1 CQEs.
  iw_cxgb4: connect_request_upcall fixes.
  iw_cxgb4: adjust tcp snd/rcv window based on link speed.
  iw_cxgb4: update snd_seq when sending MPA messages.
  iw_cxgb4: lock around accept/reject downcalls.
  iw_cxgb4: drop RX_DATA packets if the endpoint is gone.
  iw_cxgb4: rx_data() needs to hold the ep mutex.
  iw_cxgb4: endpoint timeout fixes.
  iw_cxgb4: rmb() after reading valid gen bit.
  iw_cxgb4: wc_wmb() needed after DB writes.
  iw_cxgb4: SQ flush fix.
  iw_cxgb4: minor fixes/cleanup.
  iw_cxgb4: Max fastreg depth depends on DSGL support.

 drivers/infiniband/hw/cxgb4/cm.c|  270 ---
 drivers/infiniband/hw/cxgb4/cq.c|   50 +++--
 drivers/infiniband/hw/cxgb4/device.c|  229 +---
 drivers/infiniband/hw/cxgb4/iw_cxgb4.h  |   17 +-
 drivers/infiniband/hw/cxgb4/mem.c   |   18 ++-
 drivers/infiniband/hw/cxgb4/provider.c  |   46 -
 drivers/infiniband/hw/cxgb4/qp.c|  204 +
 drivers/infiniband/hw/cxgb4/resource.c  |   10 +-
 drivers/infiniband/hw/cxgb4/t4.h|   78 ++-
 drivers/infiniband/hw/cxgb4/user.h  |5 +
 drivers/net/ethernet/chelsio/cxgb4/cxgb4.h  |   11 +-
 drivers/net/ethernet/chelsio/cxgb4/cxgb4_main.c |   84 ---
 drivers/net/ethernet/chelsio/cxgb4/sge.c|  161 +-
 drivers/net/ethernet/chelsio/cxgb4/t4_hw.c  |  106 +
 drivers/net/ethernet/chelsio/cxgb4/t4_msg.h |2 +
 drivers/net/ethernet/chelsio/cxgb4/t4_regs.h|9 +
 16 files changed, 925 insertions(+), 375 deletions(-)



Re: [PATCH v1 10/13] IB/iser: Support T10-PI operations

2014-03-02 Thread Mike Christie
On 02/27/2014 05:13 AM, Sagi Grimberg wrote:
> diff --git a/drivers/infiniband/ulp/iser/iser_initiator.c 
> b/drivers/infiniband/ulp/iser/iser_initiator.c
> index 58e14c7..7fd95fe 100644
> --- a/drivers/infiniband/ulp/iser/iser_initiator.c
> +++ b/drivers/infiniband/ulp/iser/iser_initiator.c
> @@ -62,6 +62,17 @@ static int iser_prepare_read_cmd(struct iscsi_task *task,
>   if (err)
>   return err;
>  
> + if (scsi_prot_sg_count(iser_task->sc)) {
> + struct iser_data_buf *pbuf_in = &iser_task->prot[ISER_DIR_IN];
> +
> + err = iser_dma_map_task_data(iser_task,
> +  pbuf_in,
> +  ISER_DIR_IN,
> +  DMA_FROM_DEVICE);
> + if (err)
> + return err;
> + }
> +
>   if (edtl > iser_task->data[ISER_DIR_IN].data_len) {
>   iser_err("Total data length: %ld, less than EDTL: "
>"%d, in READ cmd BHS itt: %d, conn: 0x%p\n",
> @@ -113,6 +124,17 @@ iser_prepare_write_cmd(struct iscsi_task *task,
>   if (err)
>   return err;
>  
> + if (scsi_prot_sg_count(iser_task->sc)) {
> + struct iser_data_buf *pbuf_out = &iser_task->prot[ISER_DIR_OUT];
> +
> + err = iser_dma_map_task_data(iser_task,
> +  pbuf_out,
> +  ISER_DIR_OUT,
> +  DMA_TO_DEVICE);
> + if (err)
> + return err;
> + }
> +


The xmit_task callout path does not handle failures like EINVAL. If the above
map calls fail then you would get infinite retries. You would currently want
to do the mapping in the init_task callout instead.

If it makes it easier on the driver implementation then it is ok to
modify the xmit_task callers so that they handle multiple error codes
for drivers like iser that have the xmit_task callout called from
iscsi_queuecommand.


Re: [PATCH v1 11/13] SCSI/libiscsi: Add check_protection callback for transports

2014-03-02 Thread Mike Christie
On 02/27/2014 05:13 AM, Sagi Grimberg wrote:
> diff --git a/drivers/scsi/libiscsi.c b/drivers/scsi/libiscsi.c
> index 4046241..a58a6bb 100644
> --- a/drivers/scsi/libiscsi.c
> +++ b/drivers/scsi/libiscsi.c
> @@ -395,6 +395,10 @@ static int iscsi_prep_scsi_cmd_pdu(struct iscsi_task 
> *task)
>   if (rc)
>   return rc;
>   }
> +
> + if (scsi_get_prot_op(sc) != SCSI_PROT_NORMAL)
> + task->protected = true;
> +
>   if (sc->sc_data_direction == DMA_TO_DEVICE) {
>   unsigned out_len = scsi_out(sc)->length;
>   struct iscsi_r2t_info *r2t = &task->unsol_r2t;
> @@ -823,6 +827,33 @@ static void iscsi_scsi_cmd_rsp(struct iscsi_conn *conn, 
> struct iscsi_hdr *hdr,
>  
>   sc->result = (DID_OK << 16) | rhdr->cmd_status;
>  
> + if (task->protected) {
> + sector_t sector;
> + u8 ascq;
> +
> + /**
> +  * Transports that didn't implement check_protection
> +  * callback but still published T10-PI support to scsi-mid
> +  * deserve this BUG_ON.
> +  **/
> +  BUG_ON(!session->tt->check_protection);

Extra space before BUG_ON.

> +
> + ascq = session->tt->check_protection(task, §or);
> + if (ascq) {
> + sc->result = DRIVER_SENSE << 24 | DID_ABORT << 16 |
> +  SAM_STAT_CHECK_CONDITION;

I am not sure what the reason for the DID_ABORT is here. I do not think
we want that, because we just want scsi-ml to evaluate the sense error
part of the failure. It works ok today, but the DID_ABORT error can
possibly be evaluated before the sense so you might miss passing that
info to upper layers.


> + scsi_build_sense_buffer(1, sc->sense_buffer,
> + ILLEGAL_REQUEST, 0x10, ascq);
> + sc->sense_buffer[7] = 0xc; /* Additional sense length */
> + sc->sense_buffer[8] = 0;   /* Information desc type */
> + sc->sense_buffer[9] = 0xa; /* Additional desc length */
> + sc->sense_buffer[10] = 0x80; /* Validity bit */
> +
> + put_unaligned_be64(sector, &sc->sense_buffer[12]);
> + goto out;
> + }
> + }
> +


Re: Proposal for simplifying NFS/RDMA client memory registration

2014-03-02 Thread Chuck Lever

On Mar 1, 2014, at 4:29 PM, Tom Tucker  wrote:

> Hi Chuck,
> 
> I have a patch for the server side that simplifies the memory registration 
> and fixes a bug where the server ignores the FRMR hardware limits. This bug 
> is actually upstream now.
> 
> I have been sitting on it because it's a big patch and will require a lot of 
> testing/review to get it upstream. This is Just an FYI in case there is 
> someone on your team who has the bandwidth to take this work and finish it up.

Why not post what you have, and then we can see what can be done.


> 
> Thanks,
> Tom
> 
> On 2/28/14 8:59 PM, Chuck Lever wrote:
>> Hi Wendy-
>> 
>> On Feb 28, 2014, at 5:26 PM, Wendy Cheng  wrote:
>> 
>>> On Fri, Feb 28, 2014 at 2:20 PM, Wendy Cheng  
>>> wrote:
On Fri, Feb 28, 2014 at 1:41 PM, Tom Talpey  wrote:
> On 2/26/2014 8:44 AM, Chuck Lever wrote:
>> Hi-
>> 
>> Shirley Ma and I are reviving work on the NFS/RDMA client code base in
>> the Linux kernel.  So far we've built and run functional tests to 
>> determine
>> what is working and what is broken.
>> 
>> [snip]
 
>> ALLPHYSICAL - Usually fast, but not safe as it exposes client memory.
>> All HCAs support this mode.
> 
> Not safe is an understatement. It exposes all of client physical
> memory to the peer, for both read and write. A simple pointer error
> on the server will silently corrupt the client. This mode was
> intended only for testing, and in experimental deployments.
>>> (sorry, resend .. previous reply bounced back due to gmail html format)
>>> 
>>> Please keep "ALLPHYSICAL" for now  - as our embedded system needs it.
>> This is just the client side.  Confirming that you still need support for 
>> the ALLPHYSICAL memory registration mode in the NFS/RDMA client.
>> 
>> Do you have plans to move to a mode that is less risky?  If not, can we 
>> depend on you to perform regular testing with ALLPHYSICAL as we update the 
>> client code?  Do you have any bug fixes you’d like to merge upstream?
>> 
>> --
>> Chuck Lever
>> chuck[dot]lever[at]oracle[dot]com
>> 
>> 
>> 
> 

-- 
Chuck Lever
chuck[dot]lever[at]oracle[dot]com






[RFC 10/20] IB/mlx5: Enhance UMR support to allow partial page table update

2014-03-02 Thread Haggai Eran
The current UMR interface doesn't allow partial updates to a memory region's
page tables. This patch changes the interface to allow that.

It also changes the way the UMR operation validates the memory region's state.
When set, MLX5_IB_SEND_UMR_CHECK_FREE will cause the UMR operation to fail if
the MKEY is in the free state. When it is not set, the operation will check
that the MKEY isn't in the free state.
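
As a rough illustration only (not taken from this patch), a caller asking for
a partial page-table update would build its work request along these lines;
mkey, start_offset and npages_to_update are placeholders, and the exact unit
of target.offset is whatever the interface below defines:

        struct ib_send_wr wr;
        struct mlx5_umr_wr *umrwr = (struct mlx5_umr_wr *)&wr.wr.fast_reg;

        memset(&wr, 0, sizeof(wr));
        wr.opcode     = MLX5_IB_WR_UMR;
        wr.send_flags = MLX5_IB_SEND_UMR_UPDATE_MTT;    /* partial update */

        umrwr->mkey          = mkey;                /* placeholder: MR being updated */
        umrwr->target.offset = start_offset;        /* placeholder: where the update begins */
        umrwr->npages        = npages_to_update;    /* placeholder: entries to rewrite */
        umrwr->page_shift    = PAGE_SHIFT;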

Signed-off-by: Haggai Eran 
Signed-off-by: Shachar Raindel 
---
 drivers/infiniband/hw/mlx5/mlx5_ib.h | 15 ++
 drivers/infiniband/hw/mlx5/mr.c  | 22 
 drivers/infiniband/hw/mlx5/qp.c  | 97 +++-
 include/linux/mlx5/device.h  |  9 
 4 files changed, 100 insertions(+), 43 deletions(-)

diff --git a/drivers/infiniband/hw/mlx5/mlx5_ib.h 
b/drivers/infiniband/hw/mlx5/mlx5_ib.h
index 5054158..afe39e7 100644
--- a/drivers/infiniband/hw/mlx5/mlx5_ib.h
+++ b/drivers/infiniband/hw/mlx5/mlx5_ib.h
@@ -111,6 +111,8 @@ struct mlx5_ib_pd {
  */
 
 #define MLX5_IB_SEND_UMR_UNREG IB_SEND_RESERVED_START
+#define MLX5_IB_SEND_UMR_CHECK_FREE (IB_SEND_RESERVED_START << 1)
+#define MLX5_IB_SEND_UMR_UPDATE_MTT (IB_SEND_RESERVED_START << 2)
 #define MLX5_IB_QPT_REG_UMRIB_QPT_RESERVED1
 #define MLX5_IB_WR_UMR IB_WR_RESERVED1
 
@@ -206,6 +208,19 @@ enum mlx5_ib_qp_flags {
MLX5_IB_QP_SIGNATURE_HANDLING   = 1 << 1,
 };
 
+struct mlx5_umr_wr {
+   union {
+   u64 virt_addr;
+   u64 offset;
+   } target;
+   struct ib_pd   *pd;
+   unsigned intpage_shift;
+   unsigned intnpages;
+   u32 length;
+   int access_flags;
+   u32 mkey;
+};
+
 struct mlx5_shared_mr_info {
int mr_id;
struct ib_umem  *umem;
diff --git a/drivers/infiniband/hw/mlx5/mr.c b/drivers/infiniband/hw/mlx5/mr.c
index 66b7290..dd6a4bb 100644
--- a/drivers/infiniband/hw/mlx5/mr.c
+++ b/drivers/infiniband/hw/mlx5/mr.c
@@ -37,6 +37,7 @@
 #include 
 #include 
 #include 
+#include 
 #include "mlx5_ib.h"
 
 enum {
@@ -675,6 +676,7 @@ static void prep_umr_reg_wqe(struct ib_pd *pd, struct 
ib_send_wr *wr,
 {
struct mlx5_ib_dev *dev = to_mdev(pd->device);
struct ib_mr *mr = dev->umrc.mr;
+   struct mlx5_umr_wr *umrwr = (struct mlx5_umr_wr *)&wr->wr.fast_reg;
 
sg->addr = dma;
sg->length = ALIGN(sizeof(u64) * n, 64);
@@ -689,21 +691,23 @@ static void prep_umr_reg_wqe(struct ib_pd *pd, struct 
ib_send_wr *wr,
wr->num_sge = 0;
 
wr->opcode = MLX5_IB_WR_UMR;
-   wr->wr.fast_reg.page_list_len = n;
-   wr->wr.fast_reg.page_shift = page_shift;
-   wr->wr.fast_reg.rkey = key;
-   wr->wr.fast_reg.iova_start = virt_addr;
-   wr->wr.fast_reg.length = len;
-   wr->wr.fast_reg.access_flags = access_flags;
-   wr->wr.fast_reg.page_list = (struct ib_fast_reg_page_list *)pd;
+
+   umrwr->npages = n;
+   umrwr->page_shift = page_shift;
+   umrwr->mkey = key;
+   umrwr->target.virt_addr = virt_addr;
+   umrwr->length = len;
+   umrwr->access_flags = access_flags;
+   umrwr->pd = pd;
 }
 
 static void prep_umr_unreg_wqe(struct mlx5_ib_dev *dev,
   struct ib_send_wr *wr, u32 key)
 {
-   wr->send_flags = MLX5_IB_SEND_UMR_UNREG;
+   struct mlx5_umr_wr *umrwr = (struct mlx5_umr_wr *)&wr->wr.fast_reg;
+   wr->send_flags = MLX5_IB_SEND_UMR_UNREG | MLX5_IB_SEND_UMR_CHECK_FREE;
wr->opcode = MLX5_IB_WR_UMR;
-   wr->wr.fast_reg.rkey = key;
+   umrwr->mkey = key;
 }
 
 void mlx5_umr_cq_handler(struct ib_cq *cq, void *cq_context)
diff --git a/drivers/infiniband/hw/mlx5/qp.c b/drivers/infiniband/hw/mlx5/qp.c
index 335bcbe..e48b699 100644
--- a/drivers/infiniband/hw/mlx5/qp.c
+++ b/drivers/infiniband/hw/mlx5/qp.c
@@ -70,15 +70,6 @@ static const u32 mlx5_ib_opcode[] = {
[MLX5_IB_WR_UMR]= MLX5_OPCODE_UMR,
 };
 
-struct umr_wr {
-   u64 virt_addr;
-   struct ib_pd   *pd;
-   unsigned intpage_shift;
-   unsigned intnpages;
-   u32 length;
-   int access_flags;
-   u32 mkey;
-};
 
 static int is_qp0(enum ib_qp_type qp_type)
 {
@@ -1818,37 +1809,71 @@ static void set_frwr_umr_segment(struct 
mlx5_wqe_umr_ctrl_seg *umr,
umr->mkey_mask = frwr_mkey_mask();
 }
 
+
+static __be64 get_umr_reg_mr_mask(void)
+{
+   u64 result;
+
+   result = MLX5_MKEY_MASK_LEN |
+MLX5_MKEY_MASK_PAGE_SIZE   |
+MLX5_MKEY_MASK_START_ADDR  |
+MLX5_MKEY_MASK_PD  |
+MLX5_MKEY_MASK_LR  |
+ 

[RFC 04/20] IB/core: Add support for on demand paging regions

2014-03-02 Thread Haggai Eran
From: Shachar Raindel 

* Extend the umem struct to keep the ODP related data.
* Allocate and initialize the ODP related information in the umem
  (page_list, dma_list) and freeing as needed in the end of the run.
* Store a reference to the process PID struct in the ucontext. Used to
  safely obtain the task_struct and the mm during fault handling, without
  preventing the task destruction if needed.
* Add 2 helper functions: ib_umem_odp_map_dma_pages and
  ib_umem_odp_unmap_dma_pages. These functions get the DMA addresses of
  specific pages of the umem (and, currently, pin them).
* Support for page faults only - IB core will keep the reference on the pages
  used and call put_page when freeing an ODP umem area. Invalidations support
  will be added in a later patch.

Signed-off-by: Sagi Grimberg 
Signed-off-by: Shachar Raindel 
Signed-off-by: Haggai Eran 
---
 drivers/infiniband/core/Makefile  |   1 +
 drivers/infiniband/core/umem.c|  24 +++
 drivers/infiniband/core/umem_odp.c| 310 ++
 drivers/infiniband/core/uverbs_cmd.c  |   5 +
 drivers/infiniband/core/uverbs_main.c |   2 +
 include/rdma/ib_umem.h|   2 +
 include/rdma/ib_umem_odp.h| 100 +++
 include/rdma/ib_verbs.h   |   3 +
 8 files changed, 447 insertions(+)
 create mode 100644 drivers/infiniband/core/umem_odp.c
 create mode 100644 include/rdma/ib_umem_odp.h

diff --git a/drivers/infiniband/core/Makefile b/drivers/infiniband/core/Makefile
index 3ab3865..5af6f4a 100644
--- a/drivers/infiniband/core/Makefile
+++ b/drivers/infiniband/core/Makefile
@@ -11,6 +11,7 @@ obj-$(CONFIG_INFINIBAND_USER_ACCESS) +=   ib_uverbs.o 
ib_ucm.o \
 ib_core-y :=   packer.o ud_header.o verbs.o sysfs.o \
device.o fmr_pool.o cache.o netlink.o
 ib_core-$(CONFIG_INFINIBAND_USER_MEM) += umem.o
+ib_core-$(CONFIG_INFINIBAND_ON_DEMAND_PAGING) += umem_odp.o
 
 ib_mad-y :=mad.o smi.o agent.o mad_rmpp.o
 
diff --git a/drivers/infiniband/core/umem.c b/drivers/infiniband/core/umem.c
index 138442a..e9798e0 100644
--- a/drivers/infiniband/core/umem.c
+++ b/drivers/infiniband/core/umem.c
@@ -39,6 +39,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include "uverbs.h"
 
@@ -69,6 +70,10 @@ static void __ib_umem_release(struct ib_device *dev, struct 
ib_umem *umem, int d
 
 /**
  * ib_umem_get - Pin and DMA map userspace memory.
+ *
+ * If access flags indicate ODP memory, avoid pinning. Instead, stores
+ * the mm for future page fault handling.
+ *
  * @context: userspace context to pin memory for
  * @addr: userspace virtual address to start at
  * @size: length of region to pin
@@ -116,6 +121,17 @@ struct ib_umem *ib_umem_get(struct ib_ucontext *context, 
unsigned long addr,
(IB_ACCESS_LOCAL_WRITE   | IB_ACCESS_REMOTE_WRITE |
 IB_ACCESS_REMOTE_ATOMIC | IB_ACCESS_MW_BIND));
 
+   if (access & IB_ACCESS_ON_DEMAND) {
+   ret = ib_umem_odp_get(context, umem);
+   if (ret) {
+   kfree(umem);
+   return ERR_PTR(ret);
+   }
+   return umem;
+   }
+
+   umem->odp_data = NULL;
+
/* We assume the memory is from hugetlb until proved otherwise */
umem->hugetlb   = 1;
 
@@ -234,6 +250,11 @@ void ib_umem_release(struct ib_umem *umem)
struct mm_struct *mm;
unsigned long diff;
 
+   if (umem->odp_data) {
+   ib_umem_odp_release(umem);
+   return;
+   }
+
__ib_umem_release(umem->context->device, umem, 1);
 
mm = get_task_mm(current);
@@ -278,6 +299,9 @@ int ib_umem_page_count(struct ib_umem *umem)
int n;
struct scatterlist *sg;
 
+   if (umem->odp_data)
+   return ib_umem_num_pages(umem);
+
shift = ilog2(umem->page_size);
 
n = 0;
diff --git a/drivers/infiniband/core/umem_odp.c 
b/drivers/infiniband/core/umem_odp.c
new file mode 100644
index 000..67b7816
--- /dev/null
+++ b/drivers/infiniband/core/umem_odp.c
@@ -0,0 +1,310 @@
+/*
+ * Copyright (c) 2014 Mellanox Technologies. All rights reserved.
+ *
+ * This software is available to you under a choice of one of two
+ * licenses.  You may choose to be licensed under the terms of the GNU
+ * General Public License (GPL) Version 2, available from the file
+ * COPYING in the main directory of this source tree, or the
+ * OpenIB.org BSD license below:
+ *
+ * Redistribution and use in source and binary forms, with or
+ * without modification, are permitted provided that the following
+ * conditions are met:
+ *
+ *  - Redistributions of source code must retain the above
+ *copyright notice, this list of conditions and the following
+ *disclaimer.
+ *
+ *  - Redistributions in binary form must reproduce the above
+ *copyright notice, this list of conditions and the following
+ *dis

[RFC 05/20] IB/core: Implement support for MMU notifiers regarding on demand paging regions

2014-03-02 Thread Haggai Eran
From: Shachar Raindel 

* Add an interval tree implementation for ODP umems. Create an interval tree
  for each ucontext (including a count of the number of ODP MRs in this
  context, mutex, etc.), and register ODP umems in the interval tree.
* Add MMU notifiers handling functions, using the interval tree to notify only
  the relevant umems and underlying MRs.
* Register to receive MMU notifier events from the MM subsystem upon ODP MR
  registration (and unregister accordingly).
* Add a completion object to synchronize the destruction of ODP umems.
* Add mechanism to abort page faults when there's a concurrent invalidation,
  and modify mlx5_ib to match the new interface to
  ib_umem_odp_map_dma_single_page.

The way we synchronize between concurrent invalidations and page faults is by
keeping a counter of currently running invalidations, and a sequence number
that is incremented whenever an invalidation is caught. The page fault code
checks the counter and also verifies that the sequence number hasn't
progressed before it updates the umem's page tables. This is similar to what
the kvm module does.
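
A schematic sketch of that scheme (generic code, not the umem/mlx5
implementation; the names here are made up for illustration):

        struct odp_sync {
                atomic_t      notifiers_count; /* invalidations currently running */
                unsigned long notifiers_seq;   /* bumped on every invalidation */
        };

        /* invalidation side */
        static void invalidate_start(struct odp_sync *s)
        {
                s->notifiers_seq++;
                atomic_inc(&s->notifiers_count);
                /* ... zap the affected mappings ... */
        }

        static void invalidate_end(struct odp_sync *s)
        {
                atomic_dec(&s->notifiers_count);
        }

        /* fault side: only install the new mapping if nothing raced with us */
        static bool fault_can_commit(struct odp_sync *s, unsigned long seq_at_start)
        {
                return !atomic_read(&s->notifiers_count) &&
                       s->notifiers_seq == seq_at_start;
        }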

There's currently a rare race in the code when registering a umem in the
middle of an ongoing notifier. The proper fix is to either serialize the
insertion to our umem tree with mm_lock_all or use a ucontext wide running
notifiers count for retries decision. Either is ugly and can lead to some sort
of starvation. The current workaround is ugly as well - now the user can end
up with mapped addresses that are not in the user's address space (although it
is highly unlikely).

Signed-off-by: Sagi Grimberg 
Signed-off-by: Shachar Raindel 
Signed-off-by: Haggai Eran 
Signed-off-by: Yuval Dagan 
---
 drivers/infiniband/Kconfig|   1 +
 drivers/infiniband/core/Makefile  |   2 +-
 drivers/infiniband/core/umem.c|   2 +-
 drivers/infiniband/core/umem_odp.c| 317 +-
 drivers/infiniband/core/umem_rbtree.c |  94 ++
 drivers/infiniband/core/uverbs_cmd.c  |  16 ++
 include/rdma/ib_umem_odp.h|  56 ++
 include/rdma/ib_verbs.h   |  16 ++
 8 files changed, 494 insertions(+), 10 deletions(-)
 create mode 100644 drivers/infiniband/core/umem_rbtree.c

diff --git a/drivers/infiniband/Kconfig b/drivers/infiniband/Kconfig
index 089a2c2..b899531 100644
--- a/drivers/infiniband/Kconfig
+++ b/drivers/infiniband/Kconfig
@@ -41,6 +41,7 @@ config INFINIBAND_USER_MEM
 config INFINIBAND_ON_DEMAND_PAGING
bool "InfiniBand on-demand paging support"
depends on INFINIBAND_USER_MEM
+   select MMU_NOTIFIER
default y
---help---
  On demand paging support for the InfiniBand subsystem.
diff --git a/drivers/infiniband/core/Makefile b/drivers/infiniband/core/Makefile
index 5af6f4a..021c563 100644
--- a/drivers/infiniband/core/Makefile
+++ b/drivers/infiniband/core/Makefile
@@ -11,7 +11,7 @@ obj-$(CONFIG_INFINIBAND_USER_ACCESS) +=   ib_uverbs.o 
ib_ucm.o \
 ib_core-y :=   packer.o ud_header.o verbs.o sysfs.o \
device.o fmr_pool.o cache.o netlink.o
 ib_core-$(CONFIG_INFINIBAND_USER_MEM) += umem.o
-ib_core-$(CONFIG_INFINIBAND_ON_DEMAND_PAGING) += umem_odp.o
+ib_core-$(CONFIG_INFINIBAND_ON_DEMAND_PAGING) += umem_odp.o umem_rbtree.o
 
 ib_mad-y :=mad.o smi.o agent.o mad_rmpp.o
 
diff --git a/drivers/infiniband/core/umem.c b/drivers/infiniband/core/umem.c
index e9798e0..014977f 100644
--- a/drivers/infiniband/core/umem.c
+++ b/drivers/infiniband/core/umem.c
@@ -72,7 +72,7 @@ static void __ib_umem_release(struct ib_device *dev, struct 
ib_umem *umem, int d
  * ib_umem_get - Pin and DMA map userspace memory.
  *
  * If access flags indicate ODP memory, avoid pinning. Instead, stores
- * the mm for future page fault handling.
+ * the mm for future page fault handling in conjuction with MMU notifiers.
  *
  * @context: userspace context to pin memory for
  * @addr: userspace virtual address to start at
diff --git a/drivers/infiniband/core/umem_odp.c 
b/drivers/infiniband/core/umem_odp.c
index 67b7816..f16559c 100644
--- a/drivers/infiniband/core/umem_odp.c
+++ b/drivers/infiniband/core/umem_odp.c
@@ -41,26 +41,204 @@
 #include 
 #include 
 
+void ib_umem_notifier_start_account(struct ib_umem *item)
+{
+   int notifiers_count;
+   mutex_lock(&item->odp_data->umem_mutex);
+   /*
+* Avoid performing another locked operation, as we are
+* already protected by the wrapping mutex.
+*/
+   notifiers_count = atomic_read(&item->odp_data->notifiers_count) + 1;
+   if (notifiers_count == 1)
+   reinit_completion(&item->odp_data->notifier_completion);
+   atomic_set(&item->odp_data->notifiers_count,
+  notifiers_count);
+   mutex_unlock(&item->odp_data->umem_mutex);
+}
+EXPORT_SYMBOL(ib_umem_notifier_start_account);
+
+void ib_umem_notifier_end_account(struct ib_umem *item)
+{
+   int noti

[RFC 06/20] IB/mlx5: Fix error handling in reg_umr

2014-03-02 Thread Haggai Eran
If ib_post_send fails when posting the UMR work request in reg_umr, the code
doesn't release the temporary pas buffer it allocated, and doesn't dma_unmap it.
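
The reworked flow (a fragment condensed from the diff below) funnels every
exit path through one unwind sequence, so nothing is leaked whether
ib_post_send fails or the UMR completes with an error:

        err = ib_post_send(umrc->qp, &wr, &bad);
        if (err)
                goto unmap_dma;
        wait_for_completion(&mr->done);
        if (mr->status != IB_WC_SUCCESS)
                err = -EFAULT;

unmap_dma:
        up(&umrc->sem);
        dma_unmap_single(ddev, mr->dma, size, DMA_TO_DEVICE);

free_pas:
        kfree(mr->pas);

free_mr:
        if (err) {
                free_cached_mr(dev, mr);
                return ERR_PTR(err);
        }
        return mr;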

Signed-off-by: Haggai Eran 
---
 drivers/infiniband/hw/mlx5/mr.c | 31 ---
 1 file changed, 16 insertions(+), 15 deletions(-)

diff --git a/drivers/infiniband/hw/mlx5/mr.c b/drivers/infiniband/hw/mlx5/mr.c
index 43fb96d..24a68aa 100644
--- a/drivers/infiniband/hw/mlx5/mr.c
+++ b/drivers/infiniband/hw/mlx5/mr.c
@@ -730,7 +730,7 @@ static struct mlx5_ib_mr *reg_umr(struct ib_pd *pd, struct 
ib_umem *umem,
struct mlx5_ib_mr *mr;
struct ib_sge sg;
int size = sizeof(u64) * npages;
-   int err;
+   int err = 0;
int i;
 
for (i = 0; i < 1; i++) {
@@ -751,7 +751,7 @@ static struct mlx5_ib_mr *reg_umr(struct ib_pd *pd, struct 
ib_umem *umem,
mr->pas = kmalloc(size + MLX5_UMR_ALIGN - 1, GFP_KERNEL);
if (!mr->pas) {
err = -ENOMEM;
-   goto error;
+   goto free_mr;
}
 
mlx5_ib_populate_pas(dev, umem, page_shift,
@@ -760,9 +760,8 @@ static struct mlx5_ib_mr *reg_umr(struct ib_pd *pd, struct 
ib_umem *umem,
mr->dma = dma_map_single(ddev, mr_align(mr->pas, MLX5_UMR_ALIGN), size,
 DMA_TO_DEVICE);
if (dma_mapping_error(ddev, mr->dma)) {
-   kfree(mr->pas);
err = -ENOMEM;
-   goto error;
+   goto free_pas;
}
 
memset(&wr, 0, sizeof(wr));
@@ -778,26 +777,28 @@ static struct mlx5_ib_mr *reg_umr(struct ib_pd *pd, 
struct ib_umem *umem,
err = ib_post_send(umrc->qp, &wr, &bad);
if (err) {
mlx5_ib_warn(dev, "post send failed, err %d\n", err);
-   up(&umrc->sem);
-   goto error;
+   goto unmap_dma;
}
wait_for_completion(&mr->done);
-   up(&umrc->sem);
+   if (mr->status != IB_WC_SUCCESS) {
+   mlx5_ib_warn(dev, "reg umr failed\n");
+   err = -EFAULT;
+   }
 
+unmap_dma:
+   up(&umrc->sem);
dma_unmap_single(ddev, mr->dma, size, DMA_TO_DEVICE);
+
+free_pas:
kfree(mr->pas);
 
-   if (mr->status != IB_WC_SUCCESS) {
-   mlx5_ib_warn(dev, "reg umr failed\n");
-   err = -EFAULT;
-   goto error;
+free_mr:
+   if (err) {
+   free_cached_mr(dev, mr);
+   return ERR_PTR(err);
}
 
return mr;
-
-error:
-   free_cached_mr(dev, mr);
-   return ERR_PTR(err);
 }
 
 static struct mlx5_ib_mr *reg_create(struct ib_pd *pd, u64 virt_addr,
-- 
1.7.11.2



[RFC 16/20] IB/mlx5: Add function to read WQE from user-space

2014-03-02 Thread Haggai Eran
Add a helper function mlx5_ib_read_user_wqe to read information from
user-space owned work queues. The function will be used in a later patch by
the page-fault handling code in mlx5_ib.
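
As an illustration (a hypothetical caller, not part of this patch), the
fault-handling code could pull the faulting WQE into a local buffer like
this, where qp and wqe_index come from the fault event:

        char wqe_buf[128];
        int copied;

        copied = mlx5_ib_read_user_wqe(qp, 1 /* send queue */, wqe_index,
                                       wqe_buf, sizeof(wqe_buf));
        if (copied < 0)
                return copied;                  /* e.g. -EINVAL */
        /* wqe_buf now begins with the WQE's mlx5_wqe_ctrl_seg */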

Signed-off-by: Haggai Eran 
---
 drivers/infiniband/hw/mlx5/mlx5_ib.h |  2 +
 drivers/infiniband/hw/mlx5/qp.c  | 72 
 include/linux/mlx5/qp.h  |  3 ++
 3 files changed, 77 insertions(+)

diff --git a/drivers/infiniband/hw/mlx5/mlx5_ib.h 
b/drivers/infiniband/hw/mlx5/mlx5_ib.h
index f48a511..8d20408 100644
--- a/drivers/infiniband/hw/mlx5/mlx5_ib.h
+++ b/drivers/infiniband/hw/mlx5/mlx5_ib.h
@@ -525,6 +525,8 @@ int mlx5_ib_post_send(struct ib_qp *ibqp, struct ib_send_wr 
*wr,
 int mlx5_ib_post_recv(struct ib_qp *ibqp, struct ib_recv_wr *wr,
  struct ib_recv_wr **bad_wr);
 void *mlx5_get_send_wqe(struct mlx5_ib_qp *qp, int n);
+int mlx5_ib_read_user_wqe(struct mlx5_ib_qp *qp, int send, int wqe_index,
+  void *buffer, u32 length);
 struct ib_cq *mlx5_ib_create_cq(struct ib_device *ibdev, int entries,
int vector, struct ib_ucontext *context,
struct ib_udata *udata);
diff --git a/drivers/infiniband/hw/mlx5/qp.c b/drivers/infiniband/hw/mlx5/qp.c
index 385501b..535fea3 100644
--- a/drivers/infiniband/hw/mlx5/qp.c
+++ b/drivers/infiniband/hw/mlx5/qp.c
@@ -101,6 +101,78 @@ void *mlx5_get_send_wqe(struct mlx5_ib_qp *qp, int n)
return get_wqe(qp, qp->sq.offset + (n << MLX5_IB_SQ_STRIDE));
 }
 
+/*
+ * Copy a user-space WQE to kernel space.
+ *
+ * Copies at least a single WQE, but may copy more data.
+ *
+ * qp - QP to copy from.
+ * send - copy from the send queue when non-zero, use the receive queue
+ *   otherwise.
+ * wqe_index - index to start copying from. For send work queues, the
+ *   wqe_index is in units of MLX5_SEND_WQE_BB. For receive work queue, it is
+ *   the number of work queue element in the queue.
+ * buffer - destination buffer.
+ * length - maximum number of bytes to copy.
+ *
+ * Return the number of bytes copied, or an error code.
+ */
+int mlx5_ib_read_user_wqe(struct mlx5_ib_qp *qp, int send, int wqe_index,
+  void *buffer, u32 length)
+{
+   struct ib_device *ibdev = qp->ibqp.device;
+   struct mlx5_ib_dev *dev = to_mdev(ibdev);
+   struct mlx5_ib_wq *wq = send ? &qp->sq : &qp->rq;
+   size_t offset;
+   size_t wq_end;
+   struct ib_umem *umem = qp->umem;
+   u32 first_copy_length;
+   int wqe_length;
+   int copied;
+   int ret;
+
+   if (wq->wqe_cnt == 0) {
+   mlx5_ib_dbg(dev, "mlx5_ib_read_user_wqe for a QP with wqe_cnt 
== 0. qp_type: 0x%x\n",
+   qp->ibqp.qp_type);
+   return -EINVAL;
+   }
+
+   offset = wq->offset + ((wqe_index % wq->wqe_cnt) << wq->wqe_shift);
+   wq_end = wq->offset + (wq->wqe_cnt << wq->wqe_shift);
+
+   if (send && length < sizeof(struct mlx5_wqe_ctrl_seg))
+   return -EINVAL;
+
+   if (offset > umem->length ||
+   (send && offset + sizeof(struct mlx5_wqe_ctrl_seg) > umem->length))
+   return -EINVAL;
+
+   first_copy_length = min_t(u32, offset + length, wq_end) - offset;
+   copied = ib_umem_copy_from(umem, offset, buffer, first_copy_length);
+   if (copied < first_copy_length)
+   return copied;
+
+   if (send) {
+   struct mlx5_wqe_ctrl_seg *ctrl = buffer;
+   int ds = be32_to_cpu(ctrl->qpn_ds) & MLX5_WQE_CTRL_DS_MASK;
+   wqe_length = ds * MLX5_WQE_DS_UNITS;
+   } else {
+   wqe_length = 1 << wq->wqe_shift;
+   }
+
+   if (wqe_length <= first_copy_length)
+   return first_copy_length;
+
+   ret = ib_umem_copy_from(umem, wq->offset,
+   buffer + first_copy_length, wqe_length -
+   first_copy_length);
+   if (ret < 0)
+   return ret;
+   copied += ret;
+
+   return copied;
+}
+
 static void mlx5_ib_qp_event(struct mlx5_core_qp *qp, int type)
 {
struct ib_qp *ibqp = &to_mibqp(qp)->ibqp;
diff --git a/include/linux/mlx5/qp.h b/include/linux/mlx5/qp.h
index 8db515c..52a2ea8 100644
--- a/include/linux/mlx5/qp.h
+++ b/include/linux/mlx5/qp.h
@@ -182,6 +182,9 @@ struct mlx5_wqe_ctrl_seg {
__be32  imm;
 };
 
+#define MLX5_WQE_CTRL_DS_MASK 0x3f
+#define MLX5_WQE_DS_UNITS 16
+
 struct mlx5_wqe_xrc_seg {
__be32  xrc_srqn;
u8  rsvd[12];
-- 
1.7.11.2



[RFC 11/20] IB/mlx5: Refactor UMR to have its own context struct

2014-03-02 Thread Haggai Eran
From: Shachar Raindel 

Instead of having the UMR context be part of each memory region,
allocate a struct on the stack. This allows queuing multiple UMRs that access
the same memory region.
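
The resulting usage pattern (condensed from the diff below; umrc is the
device's UMR QP context and the rest of the work request setup is elided)
puts the completion context on the caller's stack and points wr_id at it:

        struct mlx5_ib_umr_context umr_context;
        struct ib_send_wr wr, *bad;
        int err;

        memset(&wr, 0, sizeof(wr));
        wr.wr_id = (u64)(unsigned long)&umr_context;
        /* ... fill in the rest of the UMR work request ... */

        mlx5_ib_init_umr_context(&umr_context);
        down(&umrc->sem);
        err = ib_post_send(umrc->qp, &wr, &bad);
        if (!err) {
                wait_for_completion(&umr_context.done);
                if (umr_context.status != IB_WC_SUCCESS)
                        err = -EFAULT;
        }
        up(&umrc->sem);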

Signed-off-by: Shachar Raindel 
Signed-off-by: Haggai Eran 
---
 drivers/infiniband/hw/mlx5/mlx5_ib.h | 13 ++--
 drivers/infiniband/hw/mlx5/mr.c  | 40 ++--
 2 files changed, 31 insertions(+), 22 deletions(-)

diff --git a/drivers/infiniband/hw/mlx5/mlx5_ib.h 
b/drivers/infiniband/hw/mlx5/mlx5_ib.h
index afe39e7..29f58c1 100644
--- a/drivers/infiniband/hw/mlx5/mlx5_ib.h
+++ b/drivers/infiniband/hw/mlx5/mlx5_ib.h
@@ -279,8 +279,6 @@ struct mlx5_ib_mr {
__be64  *pas;
dma_addr_t  dma;
int npages;
-   struct completion   done;
-   enum ib_wc_status   status;
struct mlx5_ib_dev *dev;
struct mlx5_create_mkey_mbox_out out;
struct mlx5_core_sig_ctx*sig;
@@ -292,6 +290,17 @@ struct mlx5_ib_fast_reg_page_list {
dma_addr_t  map;
 };
 
+struct mlx5_ib_umr_context {
+   enum ib_wc_status   status;
+   struct completion   done;
+};
+
+static inline void mlx5_ib_init_umr_context(struct mlx5_ib_umr_context 
*context)
+{
+   context->status = -1;
+   init_completion(&context->done);
+}
+
 struct umr_common {
struct ib_pd*pd;
struct ib_cq*cq;
diff --git a/drivers/infiniband/hw/mlx5/mr.c b/drivers/infiniband/hw/mlx5/mr.c
index dd6a4bb..fa0bcd6 100644
--- a/drivers/infiniband/hw/mlx5/mr.c
+++ b/drivers/infiniband/hw/mlx5/mr.c
@@ -712,7 +712,7 @@ static void prep_umr_unreg_wqe(struct mlx5_ib_dev *dev,
 
 void mlx5_umr_cq_handler(struct ib_cq *cq, void *cq_context)
 {
-   struct mlx5_ib_mr *mr;
+   struct mlx5_ib_umr_context *context;
struct ib_wc wc;
int err;
 
@@ -725,9 +725,9 @@ void mlx5_umr_cq_handler(struct ib_cq *cq, void *cq_context)
if (err == 0)
break;
 
-   mr = (struct mlx5_ib_mr *)(unsigned long)wc.wr_id;
-   mr->status = wc.status;
-   complete(&mr->done);
+   context = (struct mlx5_ib_umr_context *)wc.wr_id;
+   context->status = wc.status;
+   complete(&context->done);
}
ib_req_notify_cq(cq, IB_CQ_NEXT_COMP);
 }
@@ -739,6 +739,7 @@ static struct mlx5_ib_mr *reg_umr(struct ib_pd *pd, struct 
ib_umem *umem,
struct mlx5_ib_dev *dev = to_mdev(pd->device);
struct device *ddev = dev->ib_dev.dma_device;
struct umr_common *umrc = &dev->umrc;
+   struct mlx5_ib_umr_context umr_context;
struct ib_send_wr wr, *bad;
struct mlx5_ib_mr *mr;
struct ib_sge sg;
@@ -778,24 +779,21 @@ static struct mlx5_ib_mr *reg_umr(struct ib_pd *pd, 
struct ib_umem *umem,
}
 
memset(&wr, 0, sizeof(wr));
-   wr.wr_id = (u64)(unsigned long)mr;
+   wr.wr_id = (u64)(unsigned long)&umr_context;
prep_umr_reg_wqe(pd, &wr, &sg, mr->dma, npages, mr->mmr.key, page_shift, virt_addr, len, access_flags);
 
-   /* We serialize polls so one process does not kidnap another's
-* completion. This is not a problem since wr is completed in
-* around 1 usec
-*/
+   mlx5_ib_init_umr_context(&umr_context);
down(&umrc->sem);
-   init_completion(&mr->done);
err = ib_post_send(umrc->qp, &wr, &bad);
if (err) {
mlx5_ib_warn(dev, "post send failed, err %d\n", err);
goto unmap_dma;
-   }
-   wait_for_completion(&mr->done);
-   if (mr->status != IB_WC_SUCCESS) {
-   mlx5_ib_warn(dev, "reg umr failed\n");
-   err = -EFAULT;
+   } else {
+   wait_for_completion(&umr_context.done);
+   if (umr_context.status != IB_WC_SUCCESS) {
+   mlx5_ib_warn(dev, "reg umr failed\n");
+   err = -EFAULT;
+   }
}
 
mr->mmr.iova = virt_addr;
@@ -944,24 +942,26 @@ error:
 static int unreg_umr(struct mlx5_ib_dev *dev, struct mlx5_ib_mr *mr)
 {
struct umr_common *umrc = &dev->umrc;
+   struct mlx5_ib_umr_context umr_context;
struct ib_send_wr wr, *bad;
int err;
 
memset(&wr, 0, sizeof(wr));
-   wr.wr_id = (u64)(unsigned long)mr;
+   wr.wr_id = (u64)(unsigned long)&umr_context;
prep_umr_unreg_wqe(dev, &wr, mr->mmr.key);
 
+   mlx5_ib_init_umr_context(&umr_context);
down(&umrc->sem);
-   init_completion(&mr->done);
err = ib_post_send(umrc->qp, &wr, &bad);
if (err) {
up(&umrc->sem);
mlx5_ib_dbg(dev, "err %d\n", err);
goto error;
+   } else {
+   wait_for_completion(&umr_context.done);
+   up(&umrc->sem);
}
-   wait_for_completion(&mr->done);
-   up(&umrc->sem);

[RFC 20/20] IB/mlx5: Implement on demand paging by adding support for MMU notifiers

2014-03-02 Thread Haggai Eran
From: Shachar Raindel 

* Implement the relevant invalidation functions (zap MTTs as needed)
* Implement interlocking (and rollback in the page fault handlers) for cases of 
a racing notifier and fault.
* With this patch we can now enable the capability bits for supporting RC
  send/receive and UD send.

Signed-off-by: Sagi Grimberg 
Signed-off-by: Shachar Raindel 
Signed-off-by: Haggai Eran 
---
 drivers/infiniband/hw/mlx5/main.c|   4 ++
 drivers/infiniband/hw/mlx5/mlx5_ib.h |   3 +
 drivers/infiniband/hw/mlx5/mr.c  |  79 +--
 drivers/infiniband/hw/mlx5/odp.c | 118 +++
 4 files changed, 188 insertions(+), 16 deletions(-)

diff --git a/drivers/infiniband/hw/mlx5/main.c 
b/drivers/infiniband/hw/mlx5/main.c
index 535ccac..f7941c8 100644
--- a/drivers/infiniband/hw/mlx5/main.c
+++ b/drivers/infiniband/hw/mlx5/main.c
@@ -644,6 +644,10 @@ static struct ib_ucontext *mlx5_ib_alloc_ucontext(struct 
ib_device *ibdev,
goto out_count;
}
 
+#ifdef CONFIG_INFINIBAND_ON_DEMAND_PAGING
+   context->ibucontext.invalidate_range = &mlx5_ib_invalidate_range;
+#endif
+
INIT_LIST_HEAD(&context->db_page_list);
mutex_init(&context->db_page_mutex);
 
diff --git a/drivers/infiniband/hw/mlx5/mlx5_ib.h 
b/drivers/infiniband/hw/mlx5/mlx5_ib.h
index f4240cd..a6ef427 100644
--- a/drivers/infiniband/hw/mlx5/mlx5_ib.h
+++ b/drivers/infiniband/hw/mlx5/mlx5_ib.h
@@ -327,6 +327,7 @@ struct mlx5_ib_mr {
struct mlx5_ib_dev *dev;
struct mlx5_create_mkey_mbox_out out;
struct mlx5_core_sig_ctx*sig;
+   int live;
 };
 
 struct mlx5_ib_fast_reg_page_list {
@@ -642,6 +643,8 @@ int __init mlx5_ib_odp_init(void);
 void mlx5_ib_odp_cleanup(void);
 void mlx5_ib_qp_disable_pagefaults(struct mlx5_ib_qp *qp);
 void mlx5_ib_qp_enable_pagefaults(struct mlx5_ib_qp *qp);
+void mlx5_ib_invalidate_range(struct ib_umem *umem, unsigned long start,
+ unsigned long end);
 
 #else /* CONFIG_INFINIBAND_ON_DEMAND_PAGING */
 
diff --git a/drivers/infiniband/hw/mlx5/mr.c b/drivers/infiniband/hw/mlx5/mr.c
index 4462ca8..c596507 100644
--- a/drivers/infiniband/hw/mlx5/mr.c
+++ b/drivers/infiniband/hw/mlx5/mr.c
@@ -37,6 +37,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include "mlx5_ib.h"
 
@@ -53,6 +54,18 @@ static DEFINE_MUTEX(mlx5_ib_update_mtt_emergency_buffer_mutex);
 
 static int clean_mr(struct mlx5_ib_mr *mr);
 
+static int destroy_mkey(struct mlx5_ib_dev *dev, struct mlx5_ib_mr *mr)
+{
+   int err = mlx5_core_destroy_mkey(&dev->mdev, &mr->mmr);
+
+#ifdef CONFIG_INFINIBAND_ON_DEMAND_PAGING
+   /* Wait until all page fault handlers using the mr complete. */
+   synchronize_srcu(&dev->mr_srcu);
+#endif
+
+   return err;
+}
+
 static int order2idx(struct mlx5_ib_dev *dev, int order)
 {
struct mlx5_mr_cache *cache = &dev->cache;
@@ -187,7 +200,7 @@ static void remove_keys(struct mlx5_ib_dev *dev, int c, int 
num)
ent->cur--;
ent->size--;
spin_unlock_irq(&ent->lock);
-   err = mlx5_core_destroy_mkey(&dev->mdev, &mr->mmr);
+   err = destroy_mkey(dev, mr);
if (err)
mlx5_ib_warn(dev, "failed destroy mkey\n");
else
@@ -478,7 +491,7 @@ static void clean_keys(struct mlx5_ib_dev *dev, int c)
ent->cur--;
ent->size--;
spin_unlock_irq(&ent->lock);
-   err = mlx5_core_destroy_mkey(&dev->mdev, &mr->mmr);
+   err = destroy_mkey(dev, mr);
if (err)
mlx5_ib_warn(dev, "failed destroy mkey\n");
else
@@ -804,6 +817,8 @@ static struct mlx5_ib_mr *reg_umr(struct ib_pd *pd, struct 
ib_umem *umem,
mr->mmr.size = len;
mr->mmr.pd = to_mpd(pd)->pdn;
 
+   mr->live = 1;
+
 unmap_dma:
up(&umrc->sem);
dma_unmap_single(ddev, mr->dma, size, DMA_TO_DEVICE);
@@ -987,6 +1002,7 @@ static struct mlx5_ib_mr *reg_create(struct ib_pd *pd, u64 
virt_addr,
goto err_2;
}
mr->umem = umem;
+   mr->live = 1;
mlx5_vfree(in);
 
mlx5_ib_dbg(dev, "mkey = 0x%x\n", mr->mmr.key);
@@ -1064,10 +1080,47 @@ struct ib_mr *mlx5_ib_reg_user_mr(struct ib_pd *pd, u64 
start, u64 length,
mr->ibmr.lkey = mr->mmr.key;
mr->ibmr.rkey = mr->mmr.key;
 
+#ifdef CONFIG_INFINIBAND_ON_DEMAND_PAGING
+   if (umem->odp_data) {
+   /*
+* This barrier prevents the compiler from moving the
+* setting of umem->odp_data->private to point to our
+* MR, before reg_umr finished, to ensure that the MR
+* initialization have finished before starting to
+* handle invalidations.
+*/
+   smp_wmb();
+   mr->umem->odp_data->private = mr

[RFC 03/20] IB/core: Add umem function to read data from user-space

2014-03-02 Thread Haggai Eran
In some drivers there's a need to read data from a user space area that
was pinned using ib_umem, when running from a different process context.

The ib_umem_copy_from function allows reading data from the physical pages
pinned in the ib_umem struct.
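
A hypothetical caller would use it roughly as sketched below; only
ib_umem_copy_from and its argument order come from the patch, the wrapper and
its names are illustrative:

#include <rdma/ib_umem.h>

/* Sketch: read a small, fixed-size structure out of a user-space region that
 * was pinned with ib_umem_get(), from a context that is not the owning
 * process.  'offset' and 'hdr' are illustrative. */
static int read_pinned_header(struct ib_umem *umem, size_t offset,
			      void *hdr, size_t len)
{
	int copied = ib_umem_copy_from(umem, offset, hdr, len);

	if (copied < 0)
		return copied;		/* -EINVAL: range outside the umem */
	return copied == len ? 0 : -EFAULT;
}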

Signed-off-by: Haggai Eran 
---
 drivers/infiniband/core/umem.c | 25 +
 include/rdma/ib_umem.h |  2 ++
 2 files changed, 27 insertions(+)

diff --git a/drivers/infiniband/core/umem.c b/drivers/infiniband/core/umem.c
index ab14b33..138442a 100644
--- a/drivers/infiniband/core/umem.c
+++ b/drivers/infiniband/core/umem.c
@@ -287,3 +287,28 @@ int ib_umem_page_count(struct ib_umem *umem)
return n;
 }
 EXPORT_SYMBOL(ib_umem_page_count);
+
+/*
+ * Copy from the given ib_umem's pages to the given buffer.
+ *
+ * umem - the umem to copy from
+ * start - offset to start copying from
+ * dst - destination buffer
+ * length - buffer length
+ *
+ * Returns the number of copied bytes, or an error code.
+ */
+int ib_umem_copy_from(struct ib_umem *umem, size_t start, void *dst,
+ size_t length)
+{
+   size_t end = start + length;
+
+   if (start > umem->length || end > umem->length || end < start) {
+   pr_err("ib_umem_copy_from not in range.");
+   return -EINVAL;
+   }
+
+   return sg_pcopy_to_buffer(umem->sg_head.sgl, umem->nmap, dst, length,
+   start + ib_umem_offset(umem));
+}
+EXPORT_SYMBOL(ib_umem_copy_from);
diff --git a/include/rdma/ib_umem.h b/include/rdma/ib_umem.h
index 0e120f4..6af91b3 100644
--- a/include/rdma/ib_umem.h
+++ b/include/rdma/ib_umem.h
@@ -83,6 +83,8 @@ struct ib_umem *ib_umem_get(struct ib_ucontext *context, 
unsigned long addr,
size_t size, int access, int dmasync);
 void ib_umem_release(struct ib_umem *umem);
 int ib_umem_page_count(struct ib_umem *umem);
+int ib_umem_copy_from(struct ib_umem *umem, size_t start, void *dst,
+ size_t length);
 
 #else /* CONFIG_INFINIBAND_USER_MEM */
 
-- 
1.7.11.2

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[RFC 14/20] IB/mlx5: Changes in memory region creation to support on-demand paging

2014-03-02 Thread Haggai Eran
This patch wraps together several changes needed for on-demand paging support
in the mlx5_ib_populate_pas function, and when registering memory regions.

* Instead of accepting a UMR bit telling the function to enable all access
  flags, the function now accepts the access flags themselves.
* For on-demand paging memory regions, fill the memory tables from the
  correct list, and enable/disable the access flags per-page according to
  whether the page is present.
* A new bit is set to enable writing of access flags when using the firmware
  create_mkey command.
* Disable contig pages when on-demand paging is enabled.

In addition the patch changes the UMR code to use PTR_ALIGN instead of our own
macro.
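
For reference, PTR_ALIGN simply rounds a pointer up to a power-of-two
boundary; a trivial illustration (the helper is made up, the 2048-byte value
matches MLX5_UMR_ALIGN used elsewhere in the driver):

#include <linux/kernel.h>	/* PTR_ALIGN() */

/* Illustrative: align a scratch buffer before handing it to the UMR code. */
static void *umr_align_buf(void *raw)
{
	return PTR_ALIGN(raw, 2048);
}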

Signed-off-by: Haggai Eran 
---
 drivers/infiniband/hw/mlx5/mem.c | 54 ++--
 drivers/infiniband/hw/mlx5/mlx5_ib.h | 12 +++-
 drivers/infiniband/hw/mlx5/mr.c  | 32 +++--
 include/linux/mlx5/device.h  |  3 ++
 4 files changed, 83 insertions(+), 18 deletions(-)

diff --git a/drivers/infiniband/hw/mlx5/mem.c b/drivers/infiniband/hw/mlx5/mem.c
index 8499aec..d760bfb 100644
--- a/drivers/infiniband/hw/mlx5/mem.c
+++ b/drivers/infiniband/hw/mlx5/mem.c
@@ -32,6 +32,7 @@
 
 #include 
 #include 
+#include 
 #include "mlx5_ib.h"
 
 /* @umem: umem object to scan
@@ -56,6 +57,17 @@ void mlx5_ib_cont_pages(struct ib_umem *umem, u64 addr, int 
*count, int *shift,
struct scatterlist *sg;
int entry;
 
+   /* With ODP we must always match OS page size. */
+   if (umem->odp_data) {
+   *count = ib_umem_page_count(umem);
+   *shift = PAGE_SHIFT;
+   *ncont = *count;
+   if (order)
+   *order = ilog2(roundup_pow_of_two(*count));
+
+   return;
+   }
+
addr = addr >> PAGE_SHIFT;
tmp = (unsigned long)addr;
m = find_first_bit(&tmp, sizeof(tmp));
@@ -107,8 +119,31 @@ void mlx5_ib_cont_pages(struct ib_umem *umem, u64 addr, 
int *count, int *shift,
*count = i;
 }
 
+#ifdef CONFIG_INFINIBAND_ON_DEMAND_PAGING
+static u64 umem_dma_to_mtt(dma_addr_t umem_dma)
+{
+   u64 mtt_entry = umem_dma & ODP_DMA_ADDR_MASK;
+
+   if (umem_dma & ODP_READ_ALLOWED_BIT)
+   mtt_entry |= MLX5_IB_MTT_READ;
+   if (umem_dma & ODP_WRITE_ALLOWED_BIT)
+   mtt_entry |= MLX5_IB_MTT_WRITE;
+
+   return mtt_entry;
+}
+#endif
+
+/*
+ * Populate the given array with bus addresses from the umem.
+ *
+ * dev - mlx5_ib device
+ * umem - umem to use to fill the pages
+ * page_shift - determines the page size used in the resulting array
+ * pas - bus addresses array to fill
+ * access_flags - access flags to set on all present pages
+ */
 void mlx5_ib_populate_pas(struct mlx5_ib_dev *dev, struct ib_umem *umem,
- int page_shift, __be64 *pas, int umr)
+ int page_shift, __be64 *pas, int access_flags)
 {
int shift = page_shift - PAGE_SHIFT;
int mask = (1 << shift) - 1;
@@ -118,6 +153,20 @@ void mlx5_ib_populate_pas(struct mlx5_ib_dev *dev, struct 
ib_umem *umem,
int len;
struct scatterlist *sg;
int entry;
+#ifdef CONFIG_INFINIBAND_ON_DEMAND_PAGING
+   const bool odp = umem->odp_data != NULL;
+
+   if (odp) {
+   int num_pages = ib_umem_num_pages(umem);
+   WARN_ON(shift != 0);
+   WARN_ON(access_flags != (MLX5_IB_MTT_READ | MLX5_IB_MTT_WRITE));
+   for (i = 0; i < num_pages; ++i) {
+   dma_addr_t pa = umem->odp_data->dma_list[i];
+   pas[i] = cpu_to_be64(umem_dma_to_mtt(pa));
+   }
+   return;
+   }
+#endif
 
i = 0;
for_each_sg(umem->sg_head.sgl, sg, umem->nmap, entry) {
@@ -126,8 +175,7 @@ void mlx5_ib_populate_pas(struct mlx5_ib_dev *dev, struct 
ib_umem *umem,
for (k = 0; k < len; k++) {
if (!(i & mask)) {
cur = base + (k << PAGE_SHIFT);
-   if (umr)
-   cur |= 3;
+   cur |= access_flags;
 
pas[i >> shift] = cpu_to_be64(cur);
mlx5_ib_dbg(dev, "pas[%d] 0x%llx\n",
diff --git a/drivers/infiniband/hw/mlx5/mlx5_ib.h 
b/drivers/infiniband/hw/mlx5/mlx5_ib.h
index 767e791..dd93790 100644
--- a/drivers/infiniband/hw/mlx5/mlx5_ib.h
+++ b/drivers/infiniband/hw/mlx5/mlx5_ib.h
@@ -268,6 +268,13 @@ struct mlx5_ib_xrcd {
u32 xrcdn;
 };
 
+enum mlx5_ib_mtt_access_flags {
+   MLX5_IB_MTT_READ  = (1 << 0),
+   MLX5_IB_MTT_WRITE = (1 << 1),
+};
+
+#define MLX5_IB_MTT_PRESENT (MLX5_IB_MTT_READ | MLX5_IB_MTT_WRITE)
+
 struct mlx5_ib_mr {
struct ib_mribmr;
struct mlx5_core_mr mmr;
@@ -562,7 +569,7 @@ void mlx5_ib_cleanup_fmr(struct mlx5_

[RFC 19/20] IB/mlx5: Add support for RDMA write responder page faults

2014-03-02 Thread Haggai Eran
From: Shachar Raindel 

Signed-off-by: Shachar Raindel 
Signed-off-by: Haggai Eran 
---
 drivers/infiniband/hw/mlx5/odp.c | 71 
 1 file changed, 71 insertions(+)

diff --git a/drivers/infiniband/hw/mlx5/odp.c b/drivers/infiniband/hw/mlx5/odp.c
index c6da238..56f2e3b 100644
--- a/drivers/infiniband/hw/mlx5/odp.c
+++ b/drivers/infiniband/hw/mlx5/odp.c
@@ -35,6 +35,8 @@
 
 #include "mlx5_ib.h"
 
+#define MAX_PREFETCH_LEN (4*1024*1024U)
+
 struct workqueue_struct *mlx5_ib_page_fault_wq;
 
 #define COPY_ODP_BIT_MLX_TO_IB(reg, ib_caps, field_name, bit_name) do {	\
@@ -487,6 +489,72 @@ resolve_page_fault:
free_page((unsigned long)buffer);
 }
 
+static int pages_in_range(u64 address, u32 length)
+{
+   return (ALIGN(address + length, PAGE_SIZE) -
+   (address & PAGE_MASK)) >> PAGE_SHIFT;
+}
+
+static void mlx5_ib_mr_rdma_pfault_handler(struct mlx5_ib_qp *qp,
+  struct mlx5_ib_pfault *pfault)
+{
+   struct mlx5_pagefault *mpfault = &pfault->mpfault;
+   u64 address;
+   u32 length;
+   u32 prefetch_len = mpfault->bytes_committed;
+   int prefetch_activated = 0;
+   u32 rkey = mpfault->rdma.r_key;
+   int ret;
+   struct mlx5_ib_pfault dummy_pfault = {};
+   dummy_pfault.mpfault.bytes_committed = 0;
+
+   mpfault->rdma.rdma_va += mpfault->bytes_committed;
+   mpfault->rdma.rdma_op_len -= min(mpfault->bytes_committed,
+mpfault->rdma.rdma_op_len);
+   mpfault->bytes_committed = 0;
+
+   address = mpfault->rdma.rdma_va;
+   length  = mpfault->rdma.rdma_op_len;
+
+   /* For some operations, the hardware cannot tell the exact message
+* length, and in those cases it reports zero. Use prefetch
+* logic. */
+   if (length == 0) {
+   prefetch_activated = 1;
+   length = mpfault->rdma.packet_size;
+   prefetch_len = min(MAX_PREFETCH_LEN, prefetch_len);
+   }
+
+   ret = pagefault_single_data_segment(qp, pfault, rkey, address, length,
+   NULL);
+   if (ret == -EAGAIN) {
+   /* We're racing with an invalidation, don't prefetch */
+   prefetch_activated = 0;
+   } else if (ret < 0 || pages_in_range(address, length) > ret) {
+   mlx5_ib_page_fault_resume(qp, pfault, 1);
+   return;
+   }
+
+   mlx5_ib_page_fault_resume(qp, pfault, 0);
+
+   /* At this point, there might be a new pagefault already arriving in
+* the eq, switch to the dummy pagefault for the rest of the
+* processing. We're still OK with the objects being alive as the
+* work-queue is being fenced. */
+
+   if (prefetch_activated) {
+   ret = pagefault_single_data_segment(qp, &dummy_pfault, rkey,
+   address,
+   prefetch_len,
+   NULL);
+   if (ret < 0) {
+   pr_warn("Prefetch failed (ret = %d, prefetch_activated 
= %d) for QPN %d, address: 0x%.16llx, length = 0x%.16x\n",
+   ret, prefetch_activated,
+   qp->ibqp.qp_num, address, prefetch_len);
+   }
+   }
+}
+
 void mlx5_ib_mr_pfault_handler(struct mlx5_ib_qp *qp,
   struct mlx5_ib_pfault *pfault)
 {
@@ -496,6 +564,9 @@ void mlx5_ib_mr_pfault_handler(struct mlx5_ib_qp *qp,
case MLX5_PFAULT_SUBTYPE_WQE:
mlx5_ib_mr_wqe_pfault_handler(qp, pfault);
break;
+   case MLX5_PFAULT_SUBTYPE_RDMA:
+   mlx5_ib_mr_rdma_pfault_handler(qp, pfault);
+   break;
default:
pr_warn("Invalid page fault event subtype: 0x%x\n",
event_subtype);
-- 
1.7.11.2

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[RFC 17/20] IB/mlx5: Page faults handling infrastructure

2014-03-02 Thread Haggai Eran
From: Sagi Grimberg 

* Refactor MR registration and cleanup, and fix reg_pages accounting.
* Create a work queue to handle page fault events in a kthread context.
* Register a fault handler to get events from the core for each QP.

The registered fault handler is empty in this patch, and only a later patch
implements it.
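
The general shape of that dispatch, sketched with illustrative names (the
actual patch preallocates a fixed set of per-QP pagefault contexts rather than
allocating one per event):

#include <linux/printk.h>
#include <linux/slab.h>
#include <linux/workqueue.h>

/* Sketch only: copy the minimum from the (atomic) event path and defer the
 * real handling, which may sleep, to a dedicated workqueue. */
struct pfault_work {
	struct work_struct	work;
	u32			qpn;	/* copied from the hardware event */
};

static struct workqueue_struct *pfault_wq;

static void pfault_work_fn(struct work_struct *work)
{
	struct pfault_work *pw = container_of(work, struct pfault_work, work);

	/* resolve the pages, update the MTTs, resume the QP ... */
	pr_debug("handling page fault for qpn 0x%x\n", pw->qpn);
	kfree(pw);
}

static int pfault_queue(u32 qpn)
{
	struct pfault_work *pw = kzalloc(sizeof(*pw), GFP_ATOMIC);

	if (!pw)
		return -ENOMEM;
	pw->qpn = qpn;
	INIT_WORK(&pw->work, pfault_work_fn);
	queue_work(pfault_wq, &pw->work);
	return 0;
}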

Signed-off-by: Sagi Grimberg 
Signed-off-by: Shachar Raindel 
Signed-off-by: Haggai Eran 
---
 drivers/infiniband/hw/mlx5/main.c|  25 +-
 drivers/infiniband/hw/mlx5/mlx5_ib.h |  66 +++-
 drivers/infiniband/hw/mlx5/mr.c  |  45 +++
 drivers/infiniband/hw/mlx5/odp.c | 145 +++
 drivers/infiniband/hw/mlx5/qp.c  |  22 ++
 include/linux/mlx5/driver.h  |   2 +-
 6 files changed, 286 insertions(+), 19 deletions(-)

diff --git a/drivers/infiniband/hw/mlx5/main.c 
b/drivers/infiniband/hw/mlx5/main.c
index f8015ac..535ccac 100644
--- a/drivers/infiniband/hw/mlx5/main.c
+++ b/drivers/infiniband/hw/mlx5/main.c
@@ -934,7 +934,7 @@ static ssize_t show_reg_pages(struct device *device,
struct mlx5_ib_dev *dev =
container_of(device, struct mlx5_ib_dev, ib_dev.dev);
 
-   return sprintf(buf, "%d\n", dev->mdev.priv.reg_pages);
+   return sprintf(buf, "%d\n", atomic_read(&dev->mdev.priv.reg_pages));
 }
 
 static ssize_t show_hca(struct device *device, struct device_attribute *attr,
@@ -1468,7 +1468,6 @@ static int init_one(struct pci_dev *pdev,
goto err_eqs;
 
mutex_init(&dev->cap_mask_mutex);
-   spin_lock_init(&dev->mr_lock);
 
err = create_dev_resources(&dev->devr);
if (err)
@@ -1489,6 +1488,10 @@ static int init_one(struct pci_dev *pdev,
goto err_umrc;
}
 
+   err = mlx5_ib_odp_init_one(dev);
+   if (err)
+   goto err_umrc;
+
dev->ib_active = true;
 
return 0;
@@ -1518,6 +1521,7 @@ static void remove_one(struct pci_dev *pdev)
 {
struct mlx5_ib_dev *dev = mlx5_pci2ibdev(pdev);
 
+   mlx5_ib_odp_remove_one(dev);
destroy_umrc_res(dev);
ib_unregister_device(&dev->ib_dev);
destroy_dev_resources(&dev->devr);
@@ -1542,12 +1546,27 @@ static struct pci_driver mlx5_ib_driver = {
 
 static int __init mlx5_ib_init(void)
 {
-   return pci_register_driver(&mlx5_ib_driver);
+   int err;
+
+   err = mlx5_ib_odp_init();
+   if (err)
+   return err;
+
+   err = pci_register_driver(&mlx5_ib_driver);
+   if (err)
+   goto clean_odp;
+
+   return err;
+
+clean_odp:
+   mlx5_ib_odp_cleanup();
+   return err;
 }
 
 static void __exit mlx5_ib_cleanup(void)
 {
pci_unregister_driver(&mlx5_ib_driver);
+   mlx5_ib_odp_cleanup();
 }
 
 module_init(mlx5_ib_init);
diff --git a/drivers/infiniband/hw/mlx5/mlx5_ib.h 
b/drivers/infiniband/hw/mlx5/mlx5_ib.h
index 8d20408..f4240cd 100644
--- a/drivers/infiniband/hw/mlx5/mlx5_ib.h
+++ b/drivers/infiniband/hw/mlx5/mlx5_ib.h
@@ -149,6 +149,29 @@ enum {
MLX5_QP_EMPTY
 };
 
+/*
+ * Connect-IB can trigger up to four concurrent pagefaults
+ * per-QP.
+ */
+enum mlx5_ib_pagefault_context {
+   MLX5_IB_PAGEFAULT_RESPONDER_READ,
+   MLX5_IB_PAGEFAULT_REQUESTOR_READ,
+   MLX5_IB_PAGEFAULT_RESPONDER_WRITE,
+   MLX5_IB_PAGEFAULT_REQUESTOR_WRITE,
+   MLX5_IB_PAGEFAULT_CONTEXTS
+};
+
+static inline enum mlx5_ib_pagefault_context
+   mlx5_ib_get_pagefault_context(struct mlx5_pagefault *pagefault)
+{
+   return pagefault->flags & (MLX5_PFAULT_REQUESTOR | MLX5_PFAULT_WRITE);
+}
+
+struct mlx5_ib_pfault {
+   struct work_struct  work;
+   struct mlx5_pagefault   mpfault;
+};
+
 struct mlx5_ib_qp {
struct ib_qpibqp;
struct mlx5_core_qp mqp;
@@ -194,6 +217,21 @@ struct mlx5_ib_qp {
 
/* Store signature errors */
boolsignature_en;
+
+#ifdef CONFIG_INFINIBAND_ON_DEMAND_PAGING
+   /*
+* A flag that is true for QP's that are in a state that doesn't
+* allow page faults, and shouldn't schedule any more faults.
+*/
+   int disable_page_faults;
+   /*
+* The disable_page_faults_lock protects a QP's disable_page_faults
+* field, allowing for a thread to atomically check whether the QP
+* allows page faults, and if so schedule a page fault.
+*/
+   spinlock_t  disable_page_faults_lock;
+   struct mlx5_ib_pfault   pagefaults[MLX5_IB_PAGEFAULT_CONTEXTS];
+#endif
 };
 
 struct mlx5_ib_cq_buf {
@@ -394,13 +432,17 @@ struct mlx5_ib_dev {
struct umr_common   umrc;
/* sync used page count stats
 */
-   spinlock_t  mr_lock;
struct mlx5_ib_resourcesdevr;
struct mlx5_mr_cachecache;
struct timer_list   delay_timer;
int fill_delay;
 #

[RFC 13/20] IB/mlx5: Implement the ODP capability query verb

2014-03-02 Thread Haggai Eran
The patch adds infrastructure to support the ODP capability query verb in the
mlx5 driver. The verb will read the capabilities from the device, and enable
only those capabilities that both the driver and the device supports.
At this point ODP is not supported, so no capability is copied from the
device, but the patch exposes the global ODP device capability bit to
designate that the new verb can be called.
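
The intersection itself is straightforward; a sketch of the idea with
illustrative field names (the real structure is ib_odp_caps, with
per-transport capability words):

#include <linux/types.h>

/* Sketch: start from what the firmware reports and clear anything the driver
 * does not handle yet, so userspace only ever sees the common subset. */
struct odp_caps_sketch {
	u64	general_caps;
	u32	rc_odp_caps;
	u32	ud_odp_caps;
};

static void filter_odp_caps(struct odp_caps_sketch *caps,
			    u32 rc_supported, u32 ud_supported)
{
	caps->rc_odp_caps &= rc_supported;
	caps->ud_odp_caps &= ud_supported;
}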

Signed-off-by: Shachar Raindel 
Signed-off-by: Haggai Eran 
---
 drivers/infiniband/hw/mlx5/Makefile  |  1 +
 drivers/infiniband/hw/mlx5/main.c| 13 +++
 drivers/infiniband/hw/mlx5/mlx5_ib.h |  8 +
 drivers/infiniband/hw/mlx5/odp.c | 69 
 4 files changed, 91 insertions(+)
 create mode 100644 drivers/infiniband/hw/mlx5/odp.c

diff --git a/drivers/infiniband/hw/mlx5/Makefile 
b/drivers/infiniband/hw/mlx5/Makefile
index 4ea0135..27a7015 100644
--- a/drivers/infiniband/hw/mlx5/Makefile
+++ b/drivers/infiniband/hw/mlx5/Makefile
@@ -1,3 +1,4 @@
 obj-$(CONFIG_MLX5_INFINIBAND)  += mlx5_ib.o
 
 mlx5_ib-y :=   main.o cq.o doorbell.o qp.o mem.o srq.o mr.o ah.o mad.o
+mlx5_ib-$(CONFIG_INFINIBAND_ON_DEMAND_PAGING) += odp.o
diff --git a/drivers/infiniband/hw/mlx5/main.c 
b/drivers/infiniband/hw/mlx5/main.c
index 7b9c078..f8015ac 100644
--- a/drivers/infiniband/hw/mlx5/main.c
+++ b/drivers/infiniband/hw/mlx5/main.c
@@ -318,6 +318,12 @@ static int mlx5_ib_query_device(struct ib_device *ibdev,
   props->max_mcast_grp;
props->max_map_per_fmr = INT_MAX; /* no limit in ConnectIB */
 
+#ifdef CONFIG_INFINIBAND_ON_DEMAND_PAGING
+   if (dev->mdev.caps.flags & MLX5_DEV_CAP_FLAG_ON_DMND_PG)
+   props->device_cap_flags |= IB_DEVICE_ON_DEMAND_PAGING;
+   props->odp_caps = dev->odp_caps;
+#endif
+
 out:
kfree(in_mad);
kfree(out_mad);
@@ -1397,6 +1403,10 @@ static int init_one(struct pci_dev *pdev,
(1ull << IB_USER_VERBS_CMD_DESTROY_SRQ) |
(1ull << IB_USER_VERBS_CMD_CREATE_XSRQ) |
(1ull << IB_USER_VERBS_CMD_OPEN_QP);
+#ifdef CONFIG_INFINIBAND_ON_DEMAND_PAGING
+   dev->ib_dev.uverbs_ex_cmd_mask =
+   (1ull << IB_USER_VERBS_EX_CMD_QUERY_ODP_CAPS);
+#endif
 
dev->ib_dev.query_device= mlx5_ib_query_device;
dev->ib_dev.query_port  = mlx5_ib_query_port;
@@ -1441,6 +1451,9 @@ static int init_one(struct pci_dev *pdev,
dev->ib_dev.alloc_fast_reg_page_list = mlx5_ib_alloc_fast_reg_page_list;
dev->ib_dev.free_fast_reg_page_list  = mlx5_ib_free_fast_reg_page_list;
dev->ib_dev.check_mr_status = mlx5_ib_check_mr_status;
+#ifdef CONFIG_INFINIBAND_ON_DEMAND_PAGING
+   mlx5_ib_internal_query_odp_caps(dev);
+#endif
 
if (mdev->caps.flags & MLX5_DEV_CAP_FLAG_XRC) {
dev->ib_dev.alloc_xrcd = mlx5_ib_alloc_xrcd;
diff --git a/drivers/infiniband/hw/mlx5/mlx5_ib.h 
b/drivers/infiniband/hw/mlx5/mlx5_ib.h
index 29f58c1..767e791 100644
--- a/drivers/infiniband/hw/mlx5/mlx5_ib.h
+++ b/drivers/infiniband/hw/mlx5/mlx5_ib.h
@@ -392,6 +392,9 @@ struct mlx5_ib_dev {
struct mlx5_mr_cachecache;
struct timer_list   delay_timer;
int fill_delay;
+#ifdef CONFIG_INFINIBAND_ON_DEMAND_PAGING
+   struct ib_odp_caps  odp_caps;
+#endif
 };
 
 static inline struct mlx5_ib_cq *to_mibcq(struct mlx5_core_cq *mcq)
@@ -569,6 +572,11 @@ void mlx5_umr_cq_handler(struct ib_cq *cq, void *cq_context);
 int mlx5_ib_check_mr_status(struct ib_mr *ibmr, u32 check_mask,
struct ib_mr_status *mr_status);
 
+#ifdef CONFIG_INFINIBAND_ON_DEMAND_PAGING
+int mlx5_ib_query_odp_caps(struct ib_device *ibdev, struct ib_odp_caps *caps);
+int mlx5_ib_internal_query_odp_caps(struct mlx5_ib_dev *dev);
+#endif /* CONFIG_INFINIBAND_ON_DEMAND_PAGING */
+
 static inline void init_query_mad(struct ib_smp *mad)
 {
mad->base_version  = 1;
diff --git a/drivers/infiniband/hw/mlx5/odp.c b/drivers/infiniband/hw/mlx5/odp.c
new file mode 100644
index 000..71ef604
--- /dev/null
+++ b/drivers/infiniband/hw/mlx5/odp.c
@@ -0,0 +1,69 @@
+/*
+ * Copyright (c) 2014 Mellanox Technologies. All rights reserved.
+ *
+ * This software is available to you under a choice of one of two
+ * licenses.  You may choose to be licensed under the terms of the GNU
+ * General Public License (GPL) Version 2, available from the file
+ * COPYING in the main directory of this source tree, or the
+ * OpenIB.org BSD license below:
+ *
+ * Redistribution and use in source and binary forms, with or
+ * without modification, are permitted provided that the following
+ * conditions are met:
+ *
+ *  - Redistributions of source code must retain the above
+ *copyright notice, this list of conditions and the following
+ *disclaimer.
+ *
+ *  - Redistributions in binary form must reproduce the above
+

[RFC 08/20] mlx5: Store MR attributes in mlx5_mr_core during creation and after UMR

2014-03-02 Thread Haggai Eran
The patch stores iova, pd and size during mr creation and after UMRs that
modify them. It removes the unused access flags field.

Signed-off-by: Haggai Eran 
---
 drivers/infiniband/hw/mlx5/mr.c  | 4 
 drivers/net/ethernet/mellanox/mlx5/core/mr.c | 4 
 include/linux/mlx5/driver.h  | 1 -
 3 files changed, 8 insertions(+), 1 deletion(-)

diff --git a/drivers/infiniband/hw/mlx5/mr.c b/drivers/infiniband/hw/mlx5/mr.c
index f447257..66b7290 100644
--- a/drivers/infiniband/hw/mlx5/mr.c
+++ b/drivers/infiniband/hw/mlx5/mr.c
@@ -794,6 +794,10 @@ static struct mlx5_ib_mr *reg_umr(struct ib_pd *pd, struct 
ib_umem *umem,
err = -EFAULT;
}
 
+   mr->mmr.iova = virt_addr;
+   mr->mmr.size = len;
+   mr->mmr.pd = to_mpd(pd)->pdn;
+
 unmap_dma:
up(&umrc->sem);
dma_unmap_single(ddev, mr->dma, size, DMA_TO_DEVICE);
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/mr.c 
b/drivers/net/ethernet/mellanox/mlx5/core/mr.c
index 4cc9276..ac52a0f 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/mr.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/mr.c
@@ -82,7 +82,11 @@ int mlx5_core_create_mkey(struct mlx5_core_dev *dev, struct 
mlx5_core_mr *mr,
return mlx5_cmd_status_to_err(&lout.hdr);
}
 
+   mr->iova = be64_to_cpu(in->seg.start_addr);
+   mr->size = be64_to_cpu(in->seg.len);
mr->key = mlx5_idx_to_mkey(be32_to_cpu(lout.mkey) & 0xff) | key;
+   mr->pd = be32_to_cpu(in->seg.flags_pd) & 0xff;
+
mlx5_core_dbg(dev, "out 0x%x, key 0x%x, mkey 0x%x\n",
  be32_to_cpu(lout.mkey), key, mr->key);
 
diff --git a/include/linux/mlx5/driver.h b/include/linux/mlx5/driver.h
index 93cef63..2bce4aa 100644
--- a/include/linux/mlx5/driver.h
+++ b/include/linux/mlx5/driver.h
@@ -427,7 +427,6 @@ struct mlx5_core_mr {
u64 size;
u32 key;
u32 pd;
-   u32 access;
 };
 
 struct mlx5_core_srq {
-- 
1.7.11.2

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[RFC 02/20] IB/core: Replace ib_umem's offset field with a full address

2014-03-02 Thread Haggai Eran
In order to allow umems that do not pin memory we need the umem to keep track
of its region's address.

This makes the offset field redundant, and so this patch removes it.
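
The helpers used throughout the diff below (ib_umem_offset, ib_umem_num_pages)
presumably reduce to something like the following once the full address is
stored; the hunk adding them is cut off in this archive, so treat these bodies
as a reconstruction, not the patch text:

#include <linux/mm.h>		/* PAGE_MASK, PAGE_ALIGN */
#include <rdma/ib_umem.h>

/* Likely shape of the new helpers (reconstruction). */
static inline int ib_umem_offset_sketch(struct ib_umem *umem)
{
	/* offset of the region start within its first page */
	return umem->address & ~PAGE_MASK;
}

static inline int ib_umem_num_pages_sketch(struct ib_umem *umem)
{
	return (PAGE_ALIGN(umem->address + umem->length) -
		(umem->address & PAGE_MASK)) >> PAGE_SHIFT;
}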

Signed-off-by: Haggai Eran 
---
 drivers/infiniband/core/umem.c   |  6 +++---
 drivers/infiniband/hw/amso1100/c2_provider.c |  2 +-
 drivers/infiniband/hw/ehca/ehca_mrmw.c   |  2 +-
 drivers/infiniband/hw/ipath/ipath_mr.c   |  2 +-
 drivers/infiniband/hw/nes/nes_verbs.c|  4 ++--
 drivers/infiniband/hw/ocrdma/ocrdma_verbs.c  |  2 +-
 drivers/infiniband/hw/qib/qib_mr.c   |  2 +-
 include/rdma/ib_umem.h   | 25 -
 8 files changed, 34 insertions(+), 11 deletions(-)

diff --git a/drivers/infiniband/core/umem.c b/drivers/infiniband/core/umem.c
index 1fba9d3..ab14b33 100644
--- a/drivers/infiniband/core/umem.c
+++ b/drivers/infiniband/core/umem.c
@@ -103,7 +103,7 @@ struct ib_umem *ib_umem_get(struct ib_ucontext *context, 
unsigned long addr,
 
umem->context   = context;
umem->length= size;
-   umem->offset= addr & ~PAGE_MASK;
+   umem->address   = addr;
umem->page_size = PAGE_SIZE;
/*
 * We ask for writable memory if any of the following
@@ -133,7 +133,7 @@ struct ib_umem *ib_umem_get(struct ib_ucontext *context, 
unsigned long addr,
if (!vma_list)
umem->hugetlb = 0;
 
-   npages = PAGE_ALIGN(size + umem->offset) >> PAGE_SHIFT;
+   npages = ib_umem_num_pages(umem);
 
down_write(¤t->mm->mmap_sem);
 
@@ -242,7 +242,7 @@ void ib_umem_release(struct ib_umem *umem)
return;
}
 
-   diff = PAGE_ALIGN(umem->length + umem->offset) >> PAGE_SHIFT;
+   diff = ib_umem_num_pages(umem);
 
/*
 * We may be called with the mm's mmap_sem already held.  This
diff --git a/drivers/infiniband/hw/amso1100/c2_provider.c 
b/drivers/infiniband/hw/amso1100/c2_provider.c
index 8af33cf..056c405 100644
--- a/drivers/infiniband/hw/amso1100/c2_provider.c
+++ b/drivers/infiniband/hw/amso1100/c2_provider.c
@@ -476,7 +476,7 @@ static struct ib_mr *c2_reg_user_mr(struct ib_pd *pd, u64 
start, u64 length,
 c2mr->umem->page_size,
 i,
 length,
-c2mr->umem->offset,
+ib_umem_offset(c2mr->umem),
 &kva,
 c2_convert_access(acc),
 c2mr);
diff --git a/drivers/infiniband/hw/ehca/ehca_mrmw.c 
b/drivers/infiniband/hw/ehca/ehca_mrmw.c
index 7168f59..d64c9cc 100644
--- a/drivers/infiniband/hw/ehca/ehca_mrmw.c
+++ b/drivers/infiniband/hw/ehca/ehca_mrmw.c
@@ -399,7 +399,7 @@ reg_user_mr_fallback:
pginfo.num_kpages = num_kpages;
pginfo.num_hwpages = num_hwpages;
pginfo.u.usr.region = e_mr->umem;
-   pginfo.next_hwpage = e_mr->umem->offset / hwpage_size;
+   pginfo.next_hwpage = ib_umem_offset(e_mr->umem) / hwpage_size;
pginfo.u.usr.next_sg = pginfo.u.usr.region->sg_head.sgl;
ret = ehca_reg_mr(shca, e_mr, (u64 *)virt, length, mr_access_flags,
  e_pd, &pginfo, &e_mr->ib.ib_mr.lkey,
diff --git a/drivers/infiniband/hw/ipath/ipath_mr.c 
b/drivers/infiniband/hw/ipath/ipath_mr.c
index 5e61e9b..c7278f6 100644
--- a/drivers/infiniband/hw/ipath/ipath_mr.c
+++ b/drivers/infiniband/hw/ipath/ipath_mr.c
@@ -214,7 +214,7 @@ struct ib_mr *ipath_reg_user_mr(struct ib_pd *pd, u64 
start, u64 length,
mr->mr.user_base = start;
mr->mr.iova = virt_addr;
mr->mr.length = length;
-   mr->mr.offset = umem->offset;
+   mr->mr.offset = ib_umem_offset(umem);
mr->mr.access_flags = mr_access_flags;
mr->mr.max_segs = n;
mr->umem = umem;
diff --git a/drivers/infiniband/hw/nes/nes_verbs.c 
b/drivers/infiniband/hw/nes/nes_verbs.c
index 32d3682..e049750 100644
--- a/drivers/infiniband/hw/nes/nes_verbs.c
+++ b/drivers/infiniband/hw/nes/nes_verbs.c
@@ -2342,7 +2342,7 @@ static struct ib_mr *nes_reg_user_mr(struct ib_pd *pd, 
u64 start, u64 length,
(unsigned long int)start, (unsigned long int)virt, (u32)length,
region->offset, region->page_size);
 
-   skip_pages = ((u32)region->offset) >> 12;
+   skip_pages = ((u32)ib_umem_offset(region)) >> 12;
 
if (ib_copy_from_udata(&req, udata, sizeof(req))) {
ib_umem_release(region);
@@ -2407,7 +2407,7 @@ static struct ib_mr *nes_reg_user_mr(struct ib_pd *pd, 
u64 start, u64 length,
region_length -= skip_pages << 12;
for (page_index = skip_pages; page_index < 
chunk_pages; page_index++) {
skip_pages = 0;
-   if ((page_

[RFC 15/20] IB/mlx5: Add mlx5_ib_update_mtt to update page tables after creation

2014-03-02 Thread Haggai Eran
The new function allows updating the page tables of a memory region after it
was created. This can be used to handle page faults and page invalidations.

Since mlx5_ib_update_mtt will need to work from within page invalidation, it
must not block on memory allocation. It employs an atomic memory allocation
mechanism that is used as a fallback when kmalloc(GFP_ATOMIC) fails.

In order to reuse code from mlx5_ib_populate_pas, the patch splits this
function and adds the needed parameters.
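
A sketch of the emergency-buffer fallback mentioned above, with illustrative
names and sizes (the real buffer is the MLX5_UMR_MTT_MIN_CHUNK_SIZE-sized
array added in the diff below):

#include <linux/mutex.h>
#include <linux/slab.h>

/* Sketch: never block on the page allocator; if the atomic allocation fails,
 * fall back to one statically allocated buffer serialized by a mutex. */
#define EMERGENCY_BUF_SIZE	4096	/* illustrative size */

static u8 emergency_buf[EMERGENCY_BUF_SIZE];
static DEFINE_MUTEX(emergency_buf_mutex);

static void *get_mtt_buffer(size_t size, bool *used_emergency)
{
	void *buf = kmalloc(size, GFP_ATOMIC);

	*used_emergency = false;
	if (buf)
		return buf;
	if (size > EMERGENCY_BUF_SIZE)
		return NULL;
	mutex_lock(&emergency_buf_mutex);
	*used_emergency = true;
	return emergency_buf;
}

static void put_mtt_buffer(void *buf, bool used_emergency)
{
	if (used_emergency)
		mutex_unlock(&emergency_buf_mutex);
	else
		kfree(buf);
}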

Signed-off-by: Haggai Eran 
Signed-off-by: Shachar Raindel 
---
 drivers/infiniband/hw/mlx5/mem.c |  19 +++--
 drivers/infiniband/hw/mlx5/mlx5_ib.h |   5 ++
 drivers/infiniband/hw/mlx5/mr.c  | 130 ++-
 include/linux/mlx5/device.h  |   1 +
 4 files changed, 148 insertions(+), 7 deletions(-)

diff --git a/drivers/infiniband/hw/mlx5/mem.c b/drivers/infiniband/hw/mlx5/mem.c
index d760bfb..a1a748e 100644
--- a/drivers/infiniband/hw/mlx5/mem.c
+++ b/drivers/infiniband/hw/mlx5/mem.c
@@ -139,11 +139,14 @@ static u64 umem_dma_to_mtt(dma_addr_t umem_dma)
  * dev - mlx5_ib device
  * umem - umem to use to fill the pages
  * page_shift - determines the page size used in the resulting array
+ * offset - offset into the umem to start from
+ * num_pages - total number of pages to fill
  * pas - bus addresses array to fill
  * access_flags - access flags to set on all present pages
  */
-void mlx5_ib_populate_pas(struct mlx5_ib_dev *dev, struct ib_umem *umem,
- int page_shift, __be64 *pas, int access_flags)
+void __mlx5_ib_populate_pas(struct mlx5_ib_dev *dev, struct ib_umem *umem,
+ int page_shift, size_t offset, size_t num_pages,
+ __be64 *pas, int access_flags)
 {
int shift = page_shift - PAGE_SHIFT;
int mask = (1 << shift) - 1;
@@ -157,15 +160,16 @@ void mlx5_ib_populate_pas(struct mlx5_ib_dev *dev, struct 
ib_umem *umem,
const bool odp = umem->odp_data != NULL;
 
if (odp) {
-   int num_pages = ib_umem_num_pages(umem);
WARN_ON(shift != 0);
WARN_ON(access_flags != (MLX5_IB_MTT_READ | MLX5_IB_MTT_WRITE));
for (i = 0; i < num_pages; ++i) {
-   dma_addr_t pa = umem->odp_data->dma_list[i];
+   dma_addr_t pa = umem->odp_data->dma_list[offset + i];
pas[i] = cpu_to_be64(umem_dma_to_mtt(pa));
}
return;
}
+
+   BUG_ON(!odp && offset);
 #endif
 
i = 0;
@@ -188,6 +192,13 @@ void mlx5_ib_populate_pas(struct mlx5_ib_dev *dev, struct 
ib_umem *umem,
}
 }
 
+void mlx5_ib_populate_pas(struct mlx5_ib_dev *dev, struct ib_umem *umem,
+ int page_shift, __be64 *pas, int access_flags)
+{
+   return __mlx5_ib_populate_pas(dev, umem, page_shift, 0,
+ ib_umem_num_pages(umem), pas,
+ access_flags);
+}
 int mlx5_ib_get_buf_offset(u64 addr, int page_shift, u32 *offset)
 {
u64 page_size;
diff --git a/drivers/infiniband/hw/mlx5/mlx5_ib.h 
b/drivers/infiniband/hw/mlx5/mlx5_ib.h
index dd93790..f48a511 100644
--- a/drivers/infiniband/hw/mlx5/mlx5_ib.h
+++ b/drivers/infiniband/hw/mlx5/mlx5_ib.h
@@ -537,6 +537,8 @@ struct ib_mr *mlx5_ib_get_dma_mr(struct ib_pd *pd, int acc);
 struct ib_mr *mlx5_ib_reg_user_mr(struct ib_pd *pd, u64 start, u64 length,
  u64 virt_addr, int access_flags,
  struct ib_udata *udata);
+int mlx5_ib_update_mtt(struct mlx5_ib_mr *mr, u64 start_page_index,
+  int npages, int zap);
 int mlx5_ib_dereg_mr(struct ib_mr *ibmr);
 int mlx5_ib_destroy_mr(struct ib_mr *ibmr);
 struct ib_mr *mlx5_ib_create_mr(struct ib_pd *pd,
@@ -568,6 +570,9 @@ int mlx5_ib_init_fmr(struct mlx5_ib_dev *dev);
 void mlx5_ib_cleanup_fmr(struct mlx5_ib_dev *dev);
 void mlx5_ib_cont_pages(struct ib_umem *umem, u64 addr, int *count, int *shift,
int *ncont, int *order);
+void __mlx5_ib_populate_pas(struct mlx5_ib_dev *dev, struct ib_umem *umem,
+ int page_shift, size_t offset, size_t num_pages,
+ __be64 *pas, int access_flags);
 void mlx5_ib_populate_pas(struct mlx5_ib_dev *dev, struct ib_umem *umem,
  int page_shift, __be64 *pas, int access_flags);
 void mlx5_ib_copy_pas(u64 *old, u64 *new, int step, int num);
diff --git a/drivers/infiniband/hw/mlx5/mr.c b/drivers/infiniband/hw/mlx5/mr.c
index 5ea099e..1071cb5 100644
--- a/drivers/infiniband/hw/mlx5/mr.c
+++ b/drivers/infiniband/hw/mlx5/mr.c
@@ -44,9 +44,12 @@ enum {
MAX_PENDING_REG_MR = 8,
 };
 
-enum {
-   MLX5_UMR_ALIGN  = 2048
-};
+#define MLX5_UMR_ALIGN 2048
+
+static __be64 mlx5_ib_update_mtt_emergency_buffer[
+   MLX5_UMR_MTT_MIN_CHUNK_SIZE/sizeof(__be64)]
+   __aligned(MLX5_UMR_ALIGN);

[RFC 01/20] IB/core: Add flags for on demand paging support

2014-03-02 Thread Haggai Eran
From: Sagi Grimberg 

* Add a configuration option to enable on-demand paging support in the
  infiniband subsystem (CONFIG_INFINIBAND_ON_DEMAND_PAGING). In a later patch,
  this configuration option will select the MMU_NOTIFIER configuration option
  to enable mmu notifiers.
* Add a flag for on demand paging (ODP) support in the IB device capabilities.
* Add a flag to request ODP MR in the access flags to reg_mr.
* Fail registrations done with the ODP flag when the low-level driver doesn't
  support this.
* Change the conditions in which an MR will be writable to explicitly
  specify the access flags. This is to avoid making an MR writable just
  because it is an ODP MR.
* Add a query_odp_caps verb to query ODP capabilities from user space.

Signed-off-by: Sagi Grimberg 
Signed-off-by: Shachar Raindel 
Signed-off-by: Haggai Eran 
---
 drivers/infiniband/Kconfig| 10 ++
 drivers/infiniband/core/umem.c|  8 +++--
 drivers/infiniband/core/uverbs.h  |  1 +
 drivers/infiniband/core/uverbs_cmd.c  | 63 +++
 drivers/infiniband/core/uverbs_main.c |  5 ++-
 include/rdma/ib_verbs.h   | 28 ++--
 include/uapi/rdma/ib_user_verbs.h | 18 +-
 7 files changed, 126 insertions(+), 7 deletions(-)

diff --git a/drivers/infiniband/Kconfig b/drivers/infiniband/Kconfig
index 7708939..089a2c2 100644
--- a/drivers/infiniband/Kconfig
+++ b/drivers/infiniband/Kconfig
@@ -38,6 +38,16 @@ config INFINIBAND_USER_MEM
depends on INFINIBAND_USER_ACCESS != n
default y
 
+config INFINIBAND_ON_DEMAND_PAGING
+   bool "InfiniBand on-demand paging support"
+   depends on INFINIBAND_USER_MEM
+   default y
+   ---help---
+ On demand paging support for the InfiniBand subsystem.
+ Together with driver support this allows registration of
+ memory regions without pinning their pages, fetching the
+ pages on demand instead.
+
 config INFINIBAND_ADDR_TRANS
bool
depends on INFINIBAND
diff --git a/drivers/infiniband/core/umem.c b/drivers/infiniband/core/umem.c
index a3a2e9c..1fba9d3 100644
--- a/drivers/infiniband/core/umem.c
+++ b/drivers/infiniband/core/umem.c
@@ -106,13 +106,15 @@ struct ib_umem *ib_umem_get(struct ib_ucontext *context, 
unsigned long addr,
umem->offset= addr & ~PAGE_MASK;
umem->page_size = PAGE_SIZE;
/*
-* We ask for writable memory if any access flags other than
-* "remote read" are set.  "Local write" and "remote write"
+* We ask for writable memory if any of the following
+* access flags are set.  "Local write" and "remote write"
 * obviously require write access.  "Remote atomic" can do
 * things like fetch and add, which will modify memory, and
 * "MW bind" can change permissions by binding a window.
 */
-   umem->writable  = !!(access & ~IB_ACCESS_REMOTE_READ);
+   umem->writable  = !!(access &
+   (IB_ACCESS_LOCAL_WRITE   | IB_ACCESS_REMOTE_WRITE |
+IB_ACCESS_REMOTE_ATOMIC | IB_ACCESS_MW_BIND));
 
/* We assume the memory is from hugetlb until proved otherwise */
umem->hugetlb   = 1;
diff --git a/drivers/infiniband/core/uverbs.h b/drivers/infiniband/core/uverbs.h
index a283274..d1cefc8 100644
--- a/drivers/infiniband/core/uverbs.h
+++ b/drivers/infiniband/core/uverbs.h
@@ -257,5 +257,6 @@ IB_UVERBS_DECLARE_CMD(close_xrcd);
 
 IB_UVERBS_DECLARE_EX_CMD(create_flow);
 IB_UVERBS_DECLARE_EX_CMD(destroy_flow);
+IB_UVERBS_DECLARE_EX_CMD(query_odp_caps);
 
 #endif /* UVERBS_H */
diff --git a/drivers/infiniband/core/uverbs_cmd.c 
b/drivers/infiniband/core/uverbs_cmd.c
index ea6203e..2795d86 100644
--- a/drivers/infiniband/core/uverbs_cmd.c
+++ b/drivers/infiniband/core/uverbs_cmd.c
@@ -947,6 +947,22 @@ ssize_t ib_uverbs_reg_mr(struct ib_uverbs_file *file,
goto err_free;
}
 
+
+   if (cmd.access_flags & IB_ACCESS_ON_DEMAND) {
+#ifdef CONFIG_INFINIBAND_ON_DEMAND_PAGING
+   struct ib_device_attr attr;
+   ret = ib_query_device(pd->device, &attr);
+   if (ret || !(attr.device_cap_flags &
+   IB_DEVICE_ON_DEMAND_PAGING)) {
+   ret = -EINVAL;
+   goto err_put;
+   }
+#else
+   ret = -EINVAL;
+   goto err_put;
+#endif
+   }
+
mr = pd->device->reg_user_mr(pd, cmd.start, cmd.length, cmd.hca_va,
 cmd.access_flags, &udata);
if (IS_ERR(mr)) {
@@ -1160,6 +1176,53 @@ ssize_t ib_uverbs_dealloc_mw(struct ib_uverbs_file *file,
return in_len;
 }
 
+#ifdef CONFIG_INFINIBAND_ON_DEMAND_PAGING
+int ib_uverbs_ex_query_odp_caps(struct ib_uverbs_file *file,
+   struct ib_udata *ucore,
+   struct ib_udata *uhw)
+{
+   struct ib_uverbs_query_odp_caps cmd

[RFC 00/20] On demand paging

2014-03-02 Thread Haggai Eran
The following set of patches implements on-demand paging (ODP) support
in the RDMA stack and in the mlx5_ib Infiniband driver.

What is on-demand paging?

Applications register memory with an RDMA adapter using system calls,
and subsequently post IO operations that refer to the corresponding
virtual addresses directly to HW. Until now, this was achieved by
pinning the memory during the registration calls. The goal of on demand
paging is to avoid pinning the pages of registered memory regions (MRs).
This will allow users the same flexibility they get when swapping any
other part of their processes address spaces. Instead of requiring the
entire MR to fit in physical memory, we can allow the MR to be larger,
and only fit the current working set in physical memory.

This can make programming with RDMA much simpler. Today, developers that
are working with more data than their RAM can hold need either to
deregister and reregister memory regions throughout their process's
life, or keep a single memory region and copy the data to it. On demand
paging will allow these developers to register a single MR at the
beginning of their process's life, and let the operating system manage
which pages needs to be fetched at a given time. In the future, we might
be able to provide a single memory access key for each process that
would provide the entire process's address space as one large memory region,
and the developers wouldn't need to register memory regions at all.

How does page faults generally work?

With pinned memory regions, the driver would map the virtual addresses
to bus addresses, and pass these addresses to the HCA to associate them
with the new MR. With ODP, the driver is now allowed to mark some of the
pages in the MR as not-present. When the HCA attempts to perform memory
access for a communication operation, it notices the page is not
present, and raises a page fault event to the driver. In addition, the
HCA performs whatever operation is required by the transport protocol to
suspend communication until the page fault is resolved.

Upon receiving the page fault interrupt, the driver first needs to know
on which virtual address the page fault occurred, and on what memory
key. When handling send/receive operations, this information is inside
the work queue. The driver reads the needed work queue elements, and
parses them to gather the address and memory key. For other RDMA
operations, the event generated by the HCA only contains the virtual
address and rkey, as there are no work queue elements involved.

Having the rkey, the driver can find the relevant memory region in its
data structures, and calculate the actual pages needed to complete the
operation. It then uses get_user_pages to retrieve the needed pages back
to the memory, obtains dma mapping, and passes the addresses to the HCA.
Finally, the driver notifies the HCA it can continue operation on the
queue pair that encountered the page fault. The pages that
get_user_pages returned are unpinned immediately by releasing their
reference.
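
In rough pseudocode, the flow described above looks like this (every name here
is illustrative; the real code is in the mlx5_ib patches later in this
series):

/* Rough sketch of the fault-handling flow, not real driver code. */
static int handle_page_fault(struct fault_event *ev)
{
	struct mem_region *mr;
	int npages;

	/* 1. Recover the faulting address and memory key.  For send/receive
	 *    faults this means reading and parsing the WQE; for RDMA faults
	 *    the event itself already carries the virtual address and rkey. */
	resolve_fault_location(ev);

	/* 2. Find the memory region that owns the key. */
	mr = lookup_mr(ev->key);
	if (!mr)
		return -EFAULT;

	/* 3. Bring the pages in with get_user_pages(), DMA-map them and push
	 *    the new mappings to the HCA. */
	npages = fault_in_and_map(mr, ev->va, ev->len);
	if (npages < 0)
		return npages;

	/* 4. Tell the HCA it may resume the QP; the page references taken by
	 *    get_user_pages() are dropped right away, nothing stays pinned. */
	resume_qp(ev->qpn);
	return 0;
}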

How are invalidations handled?

The patches add infrastructure to subscribe the RDMA stack as an mmu
notifier client [1]. Each process that uses ODP registers a notifier client.
When receiving page invalidation notifications, they are passed to the
mlx5_ib driver, which updates the HCA with new, not-present mappings.
Only after the HCA's page table caches are flushed does the notifier return,
allowing the kernel to release the pages.

What operations are supported?

Currently only send, receive and RDMA write operations are supported on the
RC protocol, and also send operations on the UD protocol. We hope to
implement support for other transports and operations in the future.

The structure of the patchset

First, the patches apply against the for-next branch in the
roland/infiniband.git tree, with the signature patches [2] applied, and also
the patch that refactors umem to use a linear SG table [3].

Patches 1-5:
The first set of patches adds page fault support to the IB core layer,
allowing MRs to be registered without their pages to be pinned. The first
patch adds capability bits, configuration options, and a method for
querying the paging capabilities from user space. The next two
patches (2-3) make some necessary changes to the ib_umem type. Patches
4 and 5 add paging support and invalidation support respectively.

Patches 6-9:
The next set of patches contain some minor fixes to the mlx5 driver that
were needed. Patches 6-7 fix two bugs that may affect the paging code,
and patches 8-9 add code to store missing information in mlx5 structures
that is needed for the paging code to work correctly.

Patches 10-16:
This set of patches add small size new functionality to the mlx5 driver and
builds toward paging support. Patches 10-11 make changes to UMR mechanism
(an internal mechanism used by mlx5 to update device page mappings).
Patch 12 adds infrastructure support for page fault handling to the
mlx5_core module. Patch 13 q

[RFC 07/20] IB/mlx5: Add MR to radix tree in reg_mr_callback

2014-03-02 Thread Haggai Eran
For memory regions that are allocated using reg_umr, the tail end of
mlx5_core_create_mkey isn't being run. Instead the creation is completed in
a callback function (reg_mr_callback). This means that these MRs aren't being
added to the MR radix tree. The patch adds them in the callback.

A later patch that adds page fault handling uses the MR radix tree to find the
MR pointer given a key that is read from the page fault event from hardware,
or from the WQE that caused the page fault. Therefore we need all MRs,
including those created with the reg_mr_callback to be in that tree.
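
For the later lookup this enables, a sketch (the locking here is illustrative;
mlx5_base_mkey and the table fields are the ones used in the hunk below):

#include <linux/mlx5/driver.h>
#include <linux/radix-tree.h>
#include <linux/spinlock.h>

/* Sketch: translate a key taken from a page-fault event or WQE back to the
 * mlx5_core_mr that owns it. */
static struct mlx5_core_mr *find_mr_by_key(struct mlx5_mr_table *table, u32 key)
{
	struct mlx5_core_mr *mmr;

	read_lock(&table->lock);
	mmr = radix_tree_lookup(&table->tree, mlx5_base_mkey(key));
	read_unlock(&table->lock);

	return mmr;
}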

Signed-off-by: Haggai Eran 
---
 drivers/infiniband/hw/mlx5/mr.c | 9 +
 1 file changed, 9 insertions(+)

diff --git a/drivers/infiniband/hw/mlx5/mr.c b/drivers/infiniband/hw/mlx5/mr.c
index 24a68aa..f447257 100644
--- a/drivers/infiniband/hw/mlx5/mr.c
+++ b/drivers/infiniband/hw/mlx5/mr.c
@@ -73,6 +73,8 @@ static void reg_mr_callback(int status, void *context)
struct mlx5_cache_ent *ent = &cache->ent[c];
u8 key;
unsigned long flags;
+   struct mlx5_mr_table *table = &dev->mdev.priv.mr_table;
+   int err;
 
spin_lock_irqsave(&ent->lock, flags);
ent->pending--;
@@ -107,6 +109,13 @@ static void reg_mr_callback(int status, void *context)
ent->cur++;
ent->size++;
spin_unlock_irqrestore(&ent->lock, flags);
+
+   write_lock_irq(&table->lock);
+   err = radix_tree_insert(&table->tree, mlx5_base_mkey(mr->mmr.key),
+   &mr->mmr);
+   if (err)
+   pr_err("Error inserting to mr tree. 0x%x\n", -err);
+   write_unlock_irq(&table->lock);
 }
 
 static int add_keys(struct mlx5_ib_dev *dev, int c, int num)
-- 
1.7.11.2

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[RFC 09/20] IB/mlx5: Set QP offsets and parameters for user QPs and not just for kernel QPs

2014-03-02 Thread Haggai Eran
For user QPs, the creation process does not currently initialize the fields:
* qp->rq.offset
* qp->sq.offset
* qp->sq.wqe_shift

These fields are used for handling page faults to calculate where to read the
faulting WQE from.
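
As an illustration of what these values encode (the numbers below are made up,
and assume the 64-byte send WQE basic block, MLX5_SEND_WQE_BB):

/* Illustrative numbers only: 128 RQ WQEs of 64 bytes each. */
static unsigned int sq_wqe_offset_example(unsigned int wqe_index)
{
	unsigned int rq_wqe_cnt   = 128;
	unsigned int rq_wqe_shift = 6;			/* 64-byte RQ WQEs */
	unsigned int sq_wqe_shift = 6;			/* ilog2(MLX5_SEND_WQE_BB) */
	unsigned int sq_offset    = rq_wqe_cnt << rq_wqe_shift;	/* 8192 bytes */

	return sq_offset + (wqe_index << sq_wqe_shift);
}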

Signed-off-by: Haggai Eran 
---
 drivers/infiniband/hw/mlx5/qp.c | 4 
 1 file changed, 4 insertions(+)

diff --git a/drivers/infiniband/hw/mlx5/qp.c b/drivers/infiniband/hw/mlx5/qp.c
index b5cf2c4..335bcbe 100644
--- a/drivers/infiniband/hw/mlx5/qp.c
+++ b/drivers/infiniband/hw/mlx5/qp.c
@@ -574,6 +574,10 @@ static int create_user_qp(struct mlx5_ib_dev *dev, struct 
ib_pd *pd,
uar_index = uuarn_to_uar_index(&context->uuari, uuarn);
mlx5_ib_dbg(dev, "uuarn 0x%x, uar_index 0x%x\n", uuarn, uar_index);
 
+   qp->rq.offset = 0;
+   qp->sq.wqe_shift = ilog2(MLX5_SEND_WQE_BB);
+   qp->sq.offset = qp->rq.wqe_cnt << qp->rq.wqe_shift;
+
err = set_user_buf_size(dev, qp, &ucmd);
if (err)
goto err_uuar;
-- 
1.7.11.2

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[RFC 18/20] IB/mlx5: Handle page faults

2014-03-02 Thread Haggai Eran
This patch implements a page fault handler (leaving the pages pinned for
the time being). The page fault handler handles initiator and responder
page faults for UD/RC transports, and for send/receive operations.

Signed-off-by: Sagi Grimberg 
Signed-off-by: Shachar Raindel 
Signed-off-by: Haggai Eran 
---
 drivers/infiniband/hw/mlx5/odp.c | 396 +++
 include/linux/mlx5/qp.h  |   7 +
 2 files changed, 403 insertions(+)

diff --git a/drivers/infiniband/hw/mlx5/odp.c b/drivers/infiniband/hw/mlx5/odp.c
index f297f14..c6da238 100644
--- a/drivers/infiniband/hw/mlx5/odp.c
+++ b/drivers/infiniband/hw/mlx5/odp.c
@@ -30,6 +30,9 @@
  * SOFTWARE.
  */
 
+#include 
+#include 
+
 #include "mlx5_ib.h"
 
 struct workqueue_struct *mlx5_ib_page_fault_wq;
@@ -94,12 +97,405 @@ static void mlx5_ib_page_fault_resume(struct mlx5_ib_qp 
*qp,
   qp->mqp.qpn);
 }
 
+/*
+ * Handle a single data segment in a page-fault WQE.
+ *
+ * Returns number of pages retrieved on success. The caller will continue to
+ * the next data segment.
+ * Can return the following error codes:
+ * -EAGAIN to designate a temporary error. The caller will abort handling the
+ *  page fault and resolve it.
+ * -EFAULT when there's an error mapping the requested pages. The caller will
+ *  abort the page fault handling and possibly move the QP to an error state.
+ * On other errors the QP should also be closed with an error.
+ */
+static int pagefault_single_data_segment(struct mlx5_ib_qp *qp,
+   struct mlx5_ib_pfault *pfault, u32 key, u64 io_virt,
+   size_t bcnt, u32 *bytes_mapped)
+{
+   struct mlx5_ib_dev *mib_dev = to_mdev(qp->ibqp.pd->device);
+   int srcu_key;
+   u64 start_idx;
+   int npages = 0, ret = 0;
+   struct mlx5_ib_mr *mr;
+   srcu_key = srcu_read_lock(&mib_dev->mr_srcu);
+   mr = mlx5_ib_odp_find_mr_lkey(mib_dev, key);
+   /*
+* If we didn't find the MR, it means the MR was closed while we were
+* handling the ODP event. In this case we return -EFAULT so that the
+* QP will be closed.
+*/
+   if (!mr || !mr->ibmr.pd) {
+   pr_err("Failed to find relevant mr for lkey=0x%06x, probably 
the MR was destroyed\n",
+  key);
+   ret = -EFAULT;
+   goto srcu_unlock;
+   }
+   if (!mr->umem->odp_data) {
+   pr_debug("skipping non ODP MR (lkey=0x%06x) in page fault 
handler.\n",
+key);
+   if (bytes_mapped)
+   *bytes_mapped +=
+   (bcnt - pfault->mpfault.bytes_committed);
+   goto srcu_unlock;
+   }
+   if (mr->ibmr.pd != qp->ibqp.pd) {
+   pr_err("Page-fault with different PDs for QP and MR.\n");
+   ret = -EFAULT;
+   goto srcu_unlock;
+   }
+
+   /*
+* Avoid branches - this code will perform correctly
+* in all iterations (in iteration 2 and above,
+* gather_commit == 0).
+*/
+   io_virt += pfault->mpfault.bytes_committed;
+   bcnt -= pfault->mpfault.bytes_committed;
+
+   start_idx = (io_virt - (mr->mmr.iova & PAGE_MASK)) >> PAGE_SHIFT;
+
+   npages = ib_umem_odp_map_dma_pages(mr->umem, io_virt, bcnt,
+   mr->umem->writable ?
+   (ODP_READ_ALLOWED_BIT | ODP_WRITE_ALLOWED_BIT) :
+   ODP_READ_ALLOWED_BIT,
+   atomic_read(&mr->umem->odp_data->notifiers_seq));
+   if (npages < 0) {
+   ret = npages;
+   goto srcu_unlock;
+   }
+
+   if (npages > 0) {
+   mutex_lock(&mr->umem->odp_data->umem_mutex);
+   /*
+* No need to check whether the MTTs really belong to
+* this MR, since ib_umem_odp_map_dma_pages already
+* checks this.
+*/
+   ret = mlx5_ib_update_mtt(mr, start_idx, npages, 0);
+   mutex_unlock(&mr->umem->odp_data->umem_mutex);
+
+   if (bytes_mapped) {
+   u32 new_mappings = npages * PAGE_SIZE -
+   (io_virt - round_down(io_virt, PAGE_SIZE));
+   *bytes_mapped += min_t(u32, new_mappings, bcnt);
+   }
+   }
+   if (ret) {
+   pr_err("Failed to update mkey page tables\n");
+   ret = -EAGAIN;
+   goto srcu_unlock;
+   }
+
+srcu_unlock:
+   srcu_read_unlock(&mib_dev->mr_srcu, srcu_key);
+   pfault->mpfault.bytes_committed = 0;
+   return ret ? ret : npages;
+}
+
+/**
+ * Parse a series of data segments for page fault handling.
+ *
+ * @qp the QP on which the fault occurred.
+ * @pfault contains page fault information.
+ * @wqe points at the first data segment in the WQE.
+ * @wqe_end points after the end of the WQE.
+ * @bytes_mapped receives

[RFC 12/20] net/mlx5_core: Add support for page faults events and low level handling

2014-03-02 Thread Haggai Eran
* Add a handler function pointer in the mlx5_core_qp struct for page fault
  events. Handle page fault events by calling the handler function, if not
  NULL.
* Add on-demand paging capability query command.
* Export command for resuming QPs after page faults.
* Add various constants related to paging support.

Signed-off-by: Sagi Grimberg 
Signed-off-by: Shachar Raindel 
Signed-off-by: Haggai Eran 
---
 drivers/infiniband/hw/mlx5/mr.c|   6 +-
 drivers/infiniband/hw/mlx5/qp.c|   4 +-
 drivers/net/ethernet/mellanox/mlx5/core/eq.c   |  11 +-
 drivers/net/ethernet/mellanox/mlx5/core/fw.c   |  35 ++-
 drivers/net/ethernet/mellanox/mlx5/core/main.c |   8 +-
 drivers/net/ethernet/mellanox/mlx5/core/qp.c   | 134 -
 include/linux/mlx5/device.h|  60 ++-
 include/linux/mlx5/driver.h|  18 
 include/linux/mlx5/qp.h|  53 ++
 9 files changed, 308 insertions(+), 21 deletions(-)

diff --git a/drivers/infiniband/hw/mlx5/mr.c b/drivers/infiniband/hw/mlx5/mr.c
index fa0bcd6..d1e8426 100644
--- a/drivers/infiniband/hw/mlx5/mr.c
+++ b/drivers/infiniband/hw/mlx5/mr.c
@@ -147,7 +147,7 @@ static int add_keys(struct mlx5_ib_dev *dev, int c, int num)
mr->order = ent->order;
mr->umred = 1;
mr->dev = dev;
-   in->seg.status = 1 << 6;
+   in->seg.status = MLX5_MKEY_STATUS_FREE;
in->seg.xlt_oct_size = cpu_to_be32((npages + 1) / 2);
in->seg.qpn_mkey7_0 = cpu_to_be32(0xff << 8);
in->seg.flags = MLX5_ACCESS_MODE_MTT | MLX5_PERM_UMR_EN;
@@ -1029,7 +1029,7 @@ struct ib_mr *mlx5_ib_create_mr(struct ib_pd *pd,
goto err_free;
}
 
-   in->seg.status = 1 << 6; /* free */
+   in->seg.status = MLX5_MKEY_STATUS_FREE;
in->seg.xlt_oct_size = cpu_to_be32(ndescs);
in->seg.qpn_mkey7_0 = cpu_to_be32(0xff << 8);
in->seg.flags_pd = cpu_to_be32(to_mpd(pd)->pdn);
@@ -1144,7 +1144,7 @@ struct ib_mr *mlx5_ib_alloc_fast_reg_mr(struct ib_pd *pd,
goto err_free;
}
 
-   in->seg.status = 1 << 6; /* free */
+   in->seg.status = MLX5_MKEY_STATUS_FREE;
in->seg.xlt_oct_size = cpu_to_be32((max_page_list_len + 1) / 2);
in->seg.qpn_mkey7_0 = cpu_to_be32(0xff << 8);
in->seg.flags = MLX5_PERM_UMR_EN | MLX5_ACCESS_MODE_MTT;
diff --git a/drivers/infiniband/hw/mlx5/qp.c b/drivers/infiniband/hw/mlx5/qp.c
index e48b699..385501b 100644
--- a/drivers/infiniband/hw/mlx5/qp.c
+++ b/drivers/infiniband/hw/mlx5/qp.c
@@ -1890,7 +1890,7 @@ static void set_mkey_segment(struct mlx5_mkey_seg *seg, 
struct ib_send_wr *wr,
 {
memset(seg, 0, sizeof(*seg));
if (li) {
-   seg->status = 1 << 6;
+   seg->status = MLX5_MKEY_STATUS_FREE;
return;
}
 
@@ -1911,7 +1911,7 @@ static void set_reg_mkey_segment(struct mlx5_mkey_seg 
*seg, struct ib_send_wr *w
 
memset(seg, 0, sizeof(*seg));
if (wr->send_flags & MLX5_IB_SEND_UMR_UNREG) {
-   seg->status = 1 << 6;
+   seg->status = MLX5_MKEY_STATUS_FREE;
return;
}
 
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/eq.c 
b/drivers/net/ethernet/mellanox/mlx5/core/eq.c
index 64a61b2..23bccbe 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/eq.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/eq.c
@@ -157,6 +157,8 @@ static const char *eqe_type_str(u8 type)
return "MLX5_EVENT_TYPE_CMD";
case MLX5_EVENT_TYPE_PAGE_REQUEST:
return "MLX5_EVENT_TYPE_PAGE_REQUEST";
+   case MLX5_EVENT_TYPE_PAGE_FAULT:
+   return "MLX5_EVENT_TYPE_PAGE_FAULT";
default:
return "Unrecognized event";
}
@@ -275,6 +277,9 @@ static int mlx5_eq_int(struct mlx5_core_dev *dev, struct 
mlx5_eq *eq)
}
break;
 
+   case MLX5_EVENT_TYPE_PAGE_FAULT:
+   mlx5_eq_pagefault(dev, eqe);
+   break;
 
default:
mlx5_core_warn(dev, "Unhandled event 0x%x on EQ 0x%x\n", eqe->type, eq->eqn);
@@ -441,8 +446,12 @@ void mlx5_eq_cleanup(struct mlx5_core_dev *dev)
 int mlx5_start_eqs(struct mlx5_core_dev *dev)
 {
struct mlx5_eq_table *table = &dev->priv.eq_table;
+   u32 async_event_mask = MLX5_ASYNC_EVENT_MASK;
int err;
 
+   if (dev->caps.flags & MLX5_DEV_CAP_FLAG_ON_DMND_PG)
+   async_event_mask |= (1ull << MLX5_EVENT_TYPE_PAGE_FAULT);
+
err = mlx5_create_map_eq(dev, &table->cmd_eq, MLX5_EQ_VEC_CMD,
 MLX5_NUM_CMD_EQE, 1ull << MLX5_EVENT_TYPE_CMD,
 "mlx5_cmd_eq", &dev->priv.uuari.uars[0]);
@@ -454,7 +463,7 @@ int mlx5_start_eqs(struct mlx5_core_dev *dev)
mlx5_cmd_use

Re: issues with the rdma-cm server side mapping of IP to GID

2014-03-02 Thread Or Gerlitz

On 02/03/2014 01:50, Hefty, Sean wrote:

Such situation can happen in the following cases:

1. net.ipv4.conf.default.arp_ignore equals 0 (the default)
2. server side bonding/teaming fail-over when the gratuitous ARP sent was
lost
3. re-order of ibM net-devices mapping to HCA PCI devices after server
boot/crash
4. etc more

Basically, when the rdma-cm observes a difference between the destination GID 
present in the IB path within the CM REQ and the one resolved locally, we 
should at least print a warning. Perhaps we should even reject the connection 
request? (In that case, I wasn't sure what the appropriate reject reason would 
be.) Any more ideas?

I'm not sure that this results in a single error case.


Sorry... I'm not sure to follow, can you elaborate a bit more?


Can the kernel rdma_cm check for net.ipv4.default.arp_ignore on startup and at 
least print a warning if that is wrong?


I am not sure, and anyway, please note that I brought at least two more 
use cases where the problem happens


- following server side bonding fail-over

- following server side reboot after which the PCI ordering changes 
between two HCAs and hence ibM devices change their PCI association


Or.


--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html