Re: [PATCH] IB/core: Suppress a sparse warning

2014-03-12 Thread Bart Van Assche
On 03/10/14 17:08, Paul E. McKenney wrote:
 On Mon, Mar 10, 2014 at 04:02:13PM +0100, Yann Droneaud wrote:
 Hi,

 Le lundi 10 mars 2014 à 15:26 +0100, Bart Van Assche a écrit :
 On 03/10/14 14:33, Yann Droneaud wrote:
 Le lundi 10 mars 2014 à 13:22 +0100, Bart Van Assche a écrit :
 Suppress the following sparse warning:
 include/rdma/ib_addr.h:187:24: warning: cast removes address space of 
 expression

 You should explain why there's a warning here, and why it's safe to
 cast. (I believe it's related to the RCU domain ?)

 Hello Yann,

 Now that I've had a closer look at the code in include/rdma/ib_addr.h,
 that code probably isn't safe. How about the (untested) patch below ?


 Thanks for investigating.

 I'm not an expert in RCU, but I believe it is then missing the RCU
 annotations around the RCU reader section (these ensure correct
 ordering, if I recall correctly).

 Cc: Paul E. McKenney paul...@linux.vnet.ibm.com
 
 If the rcu_read_lock() isn't supplied by all callers to this function,
 then yes, it needs to be supplied as Yann shows below.
 
 The CONFIG_PROVE_RCU=y Kconfig option can help determine that they are
 needed, but of course cannot prove that they are not needed, at least
 not unless you have a workload that exercises all possible calls to
 this function.
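
For readers following along, the reader-side pattern being described looks
roughly like this (a minimal sketch; the function and the field it reads
are illustrative only, not the actual ib_addr.h code):

#include <linux/rcupdate.h>
#include <linux/netdevice.h>
#include <linux/inetdevice.h>

/* Sketch of an RCU read-side section: the rcu_read_lock()/rcu_read_unlock()
 * pair marks the reader, and rcu_dereference() loads the protected pointer
 * with the ordering guarantees mentioned above.
 */
static bool example_dev_has_addr(struct net_device *dev)
{
	struct in_device *in_dev;
	bool ret = false;

	rcu_read_lock();			/* begin reader section */
	in_dev = rcu_dereference(dev->ip_ptr);	/* ordered pointer load */
	if (in_dev)
		ret = in_dev->ifa_list != NULL;	/* only valid inside the section */
	rcu_read_unlock();			/* end reader section */
	return ret;
}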

Hello Moni,

I think this warning got introduced via commit IB/cma: IBoE (RoCE)
IP-based GID addressing (7b85627b9f02f9b0fb2ef5f021807f4251135857;
December 12, 2013). Can you follow this up further ?

Thanks,

Bart.


Re: IB/mlx4: Build the port IBoE GID table properly under bonding

2014-03-12 Thread Bart Van Assche
On 02/18/14 15:32, Moni Shoua wrote:
 Ha ha. Take another look. That's what I was just explaining about! :) On
 line 1743 when curr_master is non-NULL then Smatch doesn't complain
 because it understands about the relationship between curr_master and
 curr_netdev. But here it is complaining about line 1749 where
 curr_master is NULL so the implication doesn't apply. Nice, huh?
 regards, dan carpenter
 
 You're right :)
 I'll write the fix.
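
For anyone unfamiliar with the Smatch behaviour Dan describes, the pattern
is roughly the following (a contrived sketch; the helper is hypothetical and
only the variable names are borrowed from the discussion):

struct net_device;

static void use_netdev(struct net_device *dev);	/* hypothetical helper */

static void example(struct net_device *curr_master,
		    struct net_device *curr_netdev)
{
	/* Smatch has learned, from earlier code not shown here, that a
	 * non-NULL curr_master implies a non-NULL curr_netdev.
	 */
	if (curr_master)
		use_netdev(curr_netdev);	/* implication holds: no warning */
	else
		use_netdev(curr_netdev);	/* no implication: possible NULL deref */
}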

Hello Moni,

Have you already had a chance to look further into this issue ?

Thanks,

Bart.



Re: [PATCH v1 1/3] IB/srp: Fix crash when unmapping data loop

2014-03-12 Thread Bart Van Assche
On 03/11/14 16:30, Sagi Grimberg wrote:
 State FAIL_FAST must come *after* state BLOCKED. Do you think that
 taking the lock once the rport transitions to state BLOCKED suffices?
 I'm aiming to avoid this lock in the sunny-day flow. Taking this lock
 always, just to protect against some error flow that might occur, feels
 somewhat wrong to me.

Hello Sagi,

I agree that today the SRP initiator only invokes srp_terminate_io()
after having quiesced I/O first so from that point of view it is not
necessary to add more locking in srp_queuecommand(). However, since
this is nontrivial I'd like to trigger a kernel warning if
srp_terminate_io() is ever invoked concurrently with
srp_queuecommand(). Additionally, I think the code in
srp_reset_device() can trigger a race with the I/O completion path. How
about addressing all this with the patch below ?

Thanks,

Bart.

[PATCH] IB/srp: Fix a race condition between failing I/O and I/O completion

Avoid that srp_terminate_io() can access req->scmnd after it has been
cleared by the I/O completion code. Do this by protecting req->scmnd
accesses from srp_terminate_io() via locking.

Signed-off-by: Bart Van Assche bvanass...@acm.org
---
 drivers/infiniband/ulp/srp/ib_srp.c | 33 ++---
 1 file changed, 22 insertions(+), 11 deletions(-)

diff --git a/drivers/infiniband/ulp/srp/ib_srp.c 
b/drivers/infiniband/ulp/srp/ib_srp.c
index a64e469..66a908b 100644
--- a/drivers/infiniband/ulp/srp/ib_srp.c
+++ b/drivers/infiniband/ulp/srp/ib_srp.c
@@ -783,6 +783,7 @@ static void srp_unmap_data(struct scsi_cmnd *scmnd,
  * srp_claim_req - Take ownership of the scmnd associated with a request.
  * @target: SRP target port.
  * @req: SRP request.
+ * @sdev: If not NULL, only take ownership for this SCSI device.
  * @scmnd: If NULL, take ownership of @req->scmnd. If not NULL, only take
  * ownership of @req->scmnd if it equals @scmnd.
  *
@@ -791,16 +792,17 @@ static void srp_unmap_data(struct scsi_cmnd *scmnd,
  */
 static struct scsi_cmnd *srp_claim_req(struct srp_target_port *target,
   struct srp_request *req,
+  struct scsi_device *sdev,
   struct scsi_cmnd *scmnd)
 {
unsigned long flags;
 
	spin_lock_irqsave(&target->lock, flags);
-	if (!scmnd) {
+	if (req->scmnd &&
+	    (!sdev || req->scmnd->device == sdev) &&
+	    (!scmnd || req->scmnd == scmnd)) {
		scmnd = req->scmnd;
		req->scmnd = NULL;
-	} else if (req->scmnd == scmnd) {
-		req->scmnd = NULL;
} else {
scmnd = NULL;
}
@@ -827,9 +829,10 @@ static void srp_free_req(struct srp_target_port *target,
 }
 
 static void srp_finish_req(struct srp_target_port *target,
-  struct srp_request *req, int result)
+  struct srp_request *req, struct scsi_device *sdev,
+  int result)
 {
-   struct scsi_cmnd *scmnd = srp_claim_req(target, req, NULL);
+   struct scsi_cmnd *scmnd = srp_claim_req(target, req, sdev, NULL);
 
if (scmnd) {
srp_free_req(target, req, scmnd, 0);
@@ -841,11 +844,20 @@ static void srp_finish_req(struct srp_target_port *target,
 static void srp_terminate_io(struct srp_rport *rport)
 {
	struct srp_target_port *target = rport->lld_data;
+	struct Scsi_Host *shost = target->scsi_host;
+   struct scsi_device *sdev;
int i;
 
+	/*
+	 * Invoking srp_terminate_io() while srp_queuecommand() is running
+	 * is not safe. Hence the warning statement below.
+	 */
+	shost_for_each_device(sdev, shost)
+		WARN_ON_ONCE(sdev->request_queue->request_fn_active);
+
	for (i = 0; i < target->req_ring_size; ++i) {
		struct srp_request *req = &target->req_ring[i];
-		srp_finish_req(target, req, DID_TRANSPORT_FAILFAST << 16);
+		srp_finish_req(target, req, NULL, DID_TRANSPORT_FAILFAST << 16);
}
 }
 
@@ -882,7 +894,7 @@ static int srp_rport_reconnect(struct srp_rport *rport)
 
	for (i = 0; i < target->req_ring_size; ++i) {
		struct srp_request *req = &target->req_ring[i];
-		srp_finish_req(target, req, DID_RESET << 16);
+		srp_finish_req(target, req, NULL, DID_RESET << 16);
}
 
INIT_LIST_HEAD(target-free_tx);
@@ -1290,7 +1302,7 @@ static void srp_process_rsp(struct srp_target_port 
*target, struct srp_rsp *rsp)
		complete(&target->tsk_mgmt_done);
	} else {
		req = &target->req_ring[rsp->tag];
-		scmnd = srp_claim_req(target, req, NULL);
+		scmnd = srp_claim_req(target, req, NULL, NULL);
		if (!scmnd) {
			shost_printk(KERN_ERR, target->scsi_host,
				     "Null scmnd for RSP w/tag %016llx\n",
@@ -2008,7 +2020,7 @@ static int srp_abort(struct 

Re: NFS over RDMA crashing

2014-03-12 Thread Jeff Layton
On Sat, 08 Mar 2014 14:13:44 -0600
Steve Wise sw...@opengridcomputing.com wrote:

 On 3/8/2014 1:20 PM, Steve Wise wrote:
 
  I removed your change and started debugging the original crash that 
  happens on top-o-tree.   Seems like rq_next_page is screwed up.  It 
  should always be >= rq_respages, yes?  I added a BUG_ON() to assert 
  this in rdma_read_xdr() and we hit the BUG_ON(). Look:
 
  crash> svc_rqst.rq_next_page 0xffff8800b84e6000
    rq_next_page = 0xffff8800b84e6228
  crash> svc_rqst.rq_respages 0xffff8800b84e6000
    rq_respages = 0xffff8800b84e62a8
 
  Any ideas Bruce/Tom?
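
  The BUG_ON() itself is not quoted above; presumably it was something
  along these lines in rdma_read_xdr(), asserting the invariant just
  described (a sketch, not the actual debug change):

	BUG_ON(rqstp->rq_next_page < rqstp->rq_respages);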
 
 
  Guys, the patch below seems to fix the problem.  Dunno if it is 
  correct though.  What do you think?
 
  diff --git a/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c 
  b/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c
  index 0ce7552..6d62411 100644
  --- a/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c
  +++ b/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c
  @@ -90,6 +90,7 @@ static void rdma_build_arg_xdr(struct svc_rqst *rqstp,
  sge_no++;
  }
   rqstp->rq_respages = &rqstp->rq_pages[sge_no];
   +   rqstp->rq_next_page = rqstp->rq_respages;
 
  /* We should never run out of SGE because the limit is defined to
   * support the max allowed RPC data length
  @@ -276,6 +277,7 @@ static int fast_reg_read_chunks(struct 
  svcxprt_rdma *xprt,
 
  /* rq_respages points one past arg pages */
   rqstp->rq_respages = &rqstp->rq_arg.pages[page_no];
   +   rqstp->rq_next_page = rqstp->rq_respages;
 
  /* Create the reply and chunk maps */
  offset = 0;
 
 
 
 While this patch avoids the crashing, it apparently isn't correct...I'm 
 getting IO errors reading files over the mount. :)
 

I hit the same oops and tested your patch and it seems to have fixed
that particular panic, but I still see a bunch of other mem corruption
oopses even with it. I'll look more closely at that when I get some
time.

FWIW, I can easily reproduce that by simply doing something like:

$ dd if=/dev/urandom of=/file/on/nfsordma/mount bs=4k count=1

I'm not sure why you're not seeing any panics with your patch in place.
Perhaps it's due to hw differences between our test rigs.

The EIO problem that you're seeing is likely the same client bug that
Chuck recently fixed in this patch:

[PATCH 2/8] SUNRPC: Fix large reads on NFS/RDMA

AIUI, Trond is merging that set for 3.15, so I'd make sure your client
has those patches when testing.

Finally, I also have a forthcoming patch to fix non-page aligned NFS
READs as well. I'm hesitant to send that out though until I can at
least run the connectathon testsuite against this server. The WRITE
oopses sort of prevent that for now...

-- 
Jeff Layton jlay...@redhat.com


Re: NFS over RDMA crashing

2014-03-12 Thread Trond Myklebust

On Mar 12, 2014, at 9:33, Jeff Layton jlay...@redhat.com wrote:

 On Sat, 08 Mar 2014 14:13:44 -0600
 Steve Wise sw...@opengridcomputing.com wrote:
 
 On 3/8/2014 1:20 PM, Steve Wise wrote:
 
 I removed your change and started debugging the original crash that 
 happens on top-o-tree.   Seems like rq_next_page is screwed up.  It 
 should always be >= rq_respages, yes?  I added a BUG_ON() to assert 
 this in rdma_read_xdr() and we hit the BUG_ON(). Look:
 
 crash> svc_rqst.rq_next_page 0xffff8800b84e6000
   rq_next_page = 0xffff8800b84e6228
 crash> svc_rqst.rq_respages 0xffff8800b84e6000
   rq_respages = 0xffff8800b84e62a8
 
 Any ideas Bruce/Tom?
 
 
 Guys, the patch below seems to fix the problem.  Dunno if it is 
 correct though.  What do you think?
 
 diff --git a/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c 
 b/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c
 index 0ce7552..6d62411 100644
 --- a/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c
 +++ b/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c
 @@ -90,6 +90,7 @@ static void rdma_build_arg_xdr(struct svc_rqst *rqstp,
   sge_no++;
   }
   rqstp->rq_respages = &rqstp->rq_pages[sge_no];
  +   rqstp->rq_next_page = rqstp->rq_respages;
 
   /* We should never run out of SGE because the limit is defined to
* support the max allowed RPC data length
 @@ -276,6 +277,7 @@ static int fast_reg_read_chunks(struct 
 svcxprt_rdma *xprt,
 
   /* rq_respages points one past arg pages */
   rqstp->rq_respages = &rqstp->rq_arg.pages[page_no];
  +   rqstp->rq_next_page = rqstp->rq_respages;
 
   /* Create the reply and chunk maps */
   offset = 0;
 
 
 
 While this patch avoids the crashing, it apparently isn't correct...I'm 
 getting IO errors reading files over the mount. :)
 
 
 I hit the same oops and tested your patch and it seems to have fixed
 that particular panic, but I still see a bunch of other mem corruption
 oopses even with it. I'll look more closely at that when I get some
 time.
 
 FWIW, I can easily reproduce that by simply doing something like:
 
   $ dd if=/dev/urandom of=/file/on/nfsordma/mount bs=4k count=1
 
 I'm not sure why you're not seeing any panics with your patch in place.
 Perhaps it's due to hw differences between our test rigs.
 
 The EIO problem that you're seeing is likely the same client bug that
 Chuck recently fixed in this patch:
 
   [PATCH 2/8] SUNRPC: Fix large reads on NFS/RDMA
 
 AIUI, Trond is merging that set for 3.15, so I'd make sure your client
 has those patches when testing.
 

Nothing is in my queue yet.

_
Trond Myklebust
Linux NFS client maintainer, PrimaryData
trond.mykleb...@primarydata.com



Re: NFS over RDMA crashing

2014-03-12 Thread Tom Tucker

Hi Trond,

I think this patch is still 'off-by-one'. We'll take a look at this today.

Thanks,
Tom

On 3/12/14 9:05 AM, Trond Myklebust wrote:

On Mar 12, 2014, at 9:33, Jeff Layton jlay...@redhat.com wrote:


On Sat, 08 Mar 2014 14:13:44 -0600
Steve Wise sw...@opengridcomputing.com wrote:


On 3/8/2014 1:20 PM, Steve Wise wrote:

I removed your change and started debugging the original crash that
happens on top-o-tree.   Seems like rq_next_page is screwed up.  It
should always be >= rq_respages, yes?  I added a BUG_ON() to assert
this in rdma_read_xdr() and we hit the BUG_ON(). Look:

crash> svc_rqst.rq_next_page 0xffff8800b84e6000
rq_next_page = 0xffff8800b84e6228
crash> svc_rqst.rq_respages 0xffff8800b84e6000
rq_respages = 0xffff8800b84e62a8

Any ideas Bruce/Tom?


Guys, the patch below seems to fix the problem.  Dunno if it is
correct though.  What do you think?

diff --git a/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c
b/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c
index 0ce7552..6d62411 100644
--- a/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c
+++ b/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c
@@ -90,6 +90,7 @@ static void rdma_build_arg_xdr(struct svc_rqst *rqstp,
   sge_no++;
   }
   rqstp->rq_respages = &rqstp->rq_pages[sge_no];
+   rqstp->rq_next_page = rqstp->rq_respages;

   /* We should never run out of SGE because the limit is defined to
* support the max allowed RPC data length
@@ -276,6 +277,7 @@ static int fast_reg_read_chunks(struct
svcxprt_rdma *xprt,

   /* rq_respages points one past arg pages */
   rqstp->rq_respages = &rqstp->rq_arg.pages[page_no];
+   rqstp->rq_next_page = rqstp->rq_respages;

   /* Create the reply and chunk maps */
   offset = 0;



While this patch avoids the crashing, it apparently isn't correct...I'm
getting IO errors reading files over the mount. :)


I hit the same oops and tested your patch and it seems to have fixed
that particular panic, but I still see a bunch of other mem corruption
oopses even with it. I'll look more closely at that when I get some
time.

FWIW, I can easily reproduce that by simply doing something like:

   $ dd if=/dev/urandom of=/file/on/nfsordma/mount bs=4k count=1

I'm not sure why you're not seeing any panics with your patch in place.
Perhaps it's due to hw differences between our test rigs.

The EIO problem that you're seeing is likely the same client bug that
Chuck recently fixed in this patch:

   [PATCH 2/8] SUNRPC: Fix large reads on NFS/RDMA

AIUI, Trond is merging that set for 3.15, so I'd make sure your client
has those patches when testing.


Nothing is in my queue yet.

_
Trond Myklebust
Linux NFS client maintainer, PrimaryData
trond.mykleb...@primarydata.com



Re: NFS over RDMA crashing

2014-03-12 Thread Jeffrey Layton
On Wed, 12 Mar 2014 10:05:24 -0400
Trond Myklebust trond.mykleb...@primarydata.com wrote:

 
 On Mar 12, 2014, at 9:33, Jeff Layton jlay...@redhat.com wrote:
 
  On Sat, 08 Mar 2014 14:13:44 -0600
  Steve Wise sw...@opengridcomputing.com wrote:
  
  On 3/8/2014 1:20 PM, Steve Wise wrote:
  
  I removed your change and started debugging the original crash that 
  happens on top-o-tree.   Seems like rq_next_page is screwed
  up.  It should always be >= rq_respages, yes?  I added a
  BUG_ON() to assert this in rdma_read_xdr() and we hit the BUG_ON().
  Look:
  
  crash> svc_rqst.rq_next_page 0xffff8800b84e6000
  rq_next_page = 0xffff8800b84e6228
  crash> svc_rqst.rq_respages 0xffff8800b84e6000
  rq_respages = 0xffff8800b84e62a8
  
  Any ideas Bruce/Tom?
  
  
  Guys, the patch below seems to fix the problem.  Dunno if it is 
  correct though.  What do you think?
  
  diff --git a/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c 
  b/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c
  index 0ce7552..6d62411 100644
  --- a/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c
  +++ b/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c
  @@ -90,6 +90,7 @@ static void rdma_build_arg_xdr(struct svc_rqst
  *rqstp, sge_no++;
}
	rqstp->rq_respages = &rqstp->rq_pages[sge_no];
  +   rqstp->rq_next_page = rqstp->rq_respages;
  
/* We should never run out of SGE because the limit is
  defined to
 * support the max allowed RPC data length
  @@ -276,6 +277,7 @@ static int fast_reg_read_chunks(struct 
  svcxprt_rdma *xprt,
  
/* rq_respages points one past arg pages */
	rqstp->rq_respages = &rqstp->rq_arg.pages[page_no];
  +   rqstp->rq_next_page = rqstp->rq_respages;
  
/* Create the reply and chunk maps */
offset = 0;
  
  
  
  While this patch avoids the crashing, it apparently isn't
  correct...I'm getting IO errors reading files over the mount. :)
  
  
  I hit the same oops and tested your patch and it seems to have fixed
  that particular panic, but I still see a bunch of other mem
  corruption oopses even with it. I'll look more closely at that when
  I get some time.
  
  FWIW, I can easily reproduce that by simply doing something like:
  
$ dd if=/dev/urandom of=/file/on/nfsordma/mount bs=4k count=1
  
  I'm not sure why you're not seeing any panics with your patch in
  place. Perhaps it's due to hw differences between our test rigs.
  
  The EIO problem that you're seeing is likely the same client bug
  that Chuck recently fixed in this patch:
  
[PATCH 2/8] SUNRPC: Fix large reads on NFS/RDMA
  
  AIUI, Trond is merging that set for 3.15, so I'd make sure your
  client has those patches when testing.
  
 
 Nothing is in my queue yet.
 

Doh! Any reason not to merge that set from Chuck? They do fix a couple
of nasty client bugs...

-- 
Jeff Layton jlay...@redhat.com


Re: NFS over RDMA crashing

2014-03-12 Thread Trond Myklebust

On Mar 12, 2014, at 10:28, Jeffrey Layton jlay...@redhat.com wrote:

 On Wed, 12 Mar 2014 10:05:24 -0400
 Trond Myklebust trond.mykleb...@primarydata.com wrote:
 
 
 On Mar 12, 2014, at 9:33, Jeff Layton jlay...@redhat.com wrote:
 
 On Sat, 08 Mar 2014 14:13:44 -0600
 Steve Wise sw...@opengridcomputing.com wrote:
 
 On 3/8/2014 1:20 PM, Steve Wise wrote:
 
  I removed your change and started debugging the original crash that 
  happens on top-o-tree.   Seems like rq_next_page is screwed
  up.  It should always be >= rq_respages, yes?  I added a
  BUG_ON() to assert this in rdma_read_xdr() and we hit the BUG_ON().
  Look:
  
  crash> svc_rqst.rq_next_page 0xffff8800b84e6000
  rq_next_page = 0xffff8800b84e6228
  crash> svc_rqst.rq_respages 0xffff8800b84e6000
  rq_respages = 0xffff8800b84e62a8
 
 Any ideas Bruce/Tom?
 
 
 Guys, the patch below seems to fix the problem.  Dunno if it is 
 correct though.  What do you think?
 
 diff --git a/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c 
 b/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c
 index 0ce7552..6d62411 100644
 --- a/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c
 +++ b/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c
 @@ -90,6 +90,7 @@ static void rdma_build_arg_xdr(struct svc_rqst
 *rqstp, sge_no++;
  }
   rqstp->rq_respages = &rqstp->rq_pages[sge_no];
  +   rqstp->rq_next_page = rqstp->rq_respages;
 
  /* We should never run out of SGE because the limit is
 defined to
   * support the max allowed RPC data length
 @@ -276,6 +277,7 @@ static int fast_reg_read_chunks(struct 
 svcxprt_rdma *xprt,
 
  /* rq_respages points one past arg pages */
   rqstp->rq_respages = &rqstp->rq_arg.pages[page_no];
  +   rqstp->rq_next_page = rqstp->rq_respages;
 
  /* Create the reply and chunk maps */
  offset = 0;
 
 
 
 While this patch avoids the crashing, it apparently isn't
 correct...I'm getting IO errors reading files over the mount. :)
 
 
 I hit the same oops and tested your patch and it seems to have fixed
 that particular panic, but I still see a bunch of other mem
 corruption oopses even with it. I'll look more closely at that when
 I get some time.
 
 FWIW, I can easily reproduce that by simply doing something like:
 
  $ dd if=/dev/urandom of=/file/on/nfsordma/mount bs=4k count=1
 
 I'm not sure why you're not seeing any panics with your patch in
 place. Perhaps it's due to hw differences between our test rigs.
 
 The EIO problem that you're seeing is likely the same client bug
 that Chuck recently fixed in this patch:
 
  [PATCH 2/8] SUNRPC: Fix large reads on NFS/RDMA
 
 AIUI, Trond is merging that set for 3.15, so I'd make sure your
 client has those patches when testing.
 
 
 Nothing is in my queue yet.
 
 
 Doh! Any reason not to merge that set from Chuck? They do fix a couple
 of nasty client bugs…
 

Most of them are one-line debugging dprintks which I do not intend to apply.

One of them confuses a readdir optimisation with a bugfix; at the very least 
the patch comments need changing.
That leaves 2 that can go in; however, as they are clearly insufficient to make 
RDMA safe for general use, they certainly do not warrant a stable@ label. The 
workaround for the Oopses is simple: use TCP.

_
Trond Myklebust
Linux NFS client maintainer, PrimaryData
trond.mykleb...@primarydata.com



Re: NFS over RDMA crashing

2014-03-12 Thread Jeffrey Layton
On Wed, 12 Mar 2014 11:03:52 -0400
Trond Myklebust trond.mykleb...@primarydata.com wrote:

 
 On Mar 12, 2014, at 10:28, Jeffrey Layton jlay...@redhat.com wrote:
 
  On Wed, 12 Mar 2014 10:05:24 -0400
  Trond Myklebust trond.mykleb...@primarydata.com wrote:
  
  
  On Mar 12, 2014, at 9:33, Jeff Layton jlay...@redhat.com wrote:
  
  On Sat, 08 Mar 2014 14:13:44 -0600
  Steve Wise sw...@opengridcomputing.com wrote:
  
  On 3/8/2014 1:20 PM, Steve Wise wrote:
  
   I removed your change and started debugging the original crash
   that happens on top-o-tree.   Seems like rq_next_page is
   screwed up.  It should always be >= rq_respages, yes?  I added
   a BUG_ON() to assert this in rdma_read_xdr() and we hit the
   BUG_ON(). Look:
   
   crash> svc_rqst.rq_next_page 0xffff8800b84e6000
   rq_next_page = 0xffff8800b84e6228
   crash> svc_rqst.rq_respages 0xffff8800b84e6000
   rq_respages = 0xffff8800b84e62a8
  
  Any ideas Bruce/Tom?
  
  
  Guys, the patch below seems to fix the problem.  Dunno if it is 
  correct though.  What do you think?
  
  diff --git a/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c 
  b/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c
  index 0ce7552..6d62411 100644
  --- a/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c
  +++ b/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c
  @@ -90,6 +90,7 @@ static void rdma_build_arg_xdr(struct svc_rqst
  *rqstp, sge_no++;
   }
    rqstp->rq_respages = &rqstp->rq_pages[sge_no];
   +   rqstp->rq_next_page = rqstp->rq_respages;
  
   /* We should never run out of SGE because the limit is
  defined to
* support the max allowed RPC data length
  @@ -276,6 +277,7 @@ static int fast_reg_read_chunks(struct 
  svcxprt_rdma *xprt,
  
   /* rq_respages points one past arg pages */
    rqstp->rq_respages = &rqstp->rq_arg.pages[page_no];
   +   rqstp->rq_next_page = rqstp->rq_respages;
  
   /* Create the reply and chunk maps */
   offset = 0;
  
  
  
  While this patch avoids the crashing, it apparently isn't
  correct...I'm getting IO errors reading files over the mount. :)
  
  
  I hit the same oops and tested your patch and it seems to have
  fixed that particular panic, but I still see a bunch of other mem
  corruption oopses even with it. I'll look more closely at that
  when I get some time.
  
  FWIW, I can easily reproduce that by simply doing something like:
  
   $ dd if=/dev/urandom of=/file/on/nfsordma/mount bs=4k count=1
  
  I'm not sure why you're not seeing any panics with your patch in
  place. Perhaps it's due to hw differences between our test rigs.
  
  The EIO problem that you're seeing is likely the same client bug
  that Chuck recently fixed in this patch:
  
   [PATCH 2/8] SUNRPC: Fix large reads on NFS/RDMA
  
  AIUI, Trond is merging that set for 3.15, so I'd make sure your
  client has those patches when testing.
  
  
  Nothing is in my queue yet.
  
  
  Doh! Any reason not to merge that set from Chuck? They do fix a
  couple of nasty client bugs…
  
 
 Most of them are one-line debugging dprintks which I do not intend to
 apply.
 

Fair enough. Those are certainly not necessary, but some of them clean
up existing printks and probably do need to go in. That said, debugging
this stuff is *really* difficult so having extra debug printks in place
seems like a good thing (unless you're arguing for moving wholesale to
tracepoints instead).

 One of them confuses a readdir optimisation with a bugfix; at the
 very least the patch comments need changing.

I'll leave that to Chuck to comment on. I had the impression that it
was a bugfix, but maybe there's some better way to handle that bug.

  That leaves 2 that can
 go in; however, as they are clearly insufficient to make RDMA safe for
 general use, they certainly do not warrant a stable@ label. The
 workaround for the Oopses is simple: use TCP.
 

Yeah, it's definitely rickety, but it's in and we do need to get fixes
merged to this code. I'm ok with dropping the stable labels on those
patches, but if we're going to declare this stuff not stable enough
for general use then I think that we should take an aggressive approach
on merging fixes to it.

FWIW, I also notice that Kconfig doesn't show the option to actually
enable/disable RDMA transports. I'll post a patch to fix that soon.
Since this stuff is not very safe to use, then we should make it
reasonably simple to disable it.

-- 
Jeff Layton jlay...@redhat.com


[PATCHv6 net-next 00/31] Misc. fixes for cxgb4 and iw_cxgb4

2014-03-12 Thread Hariprasad Shenai
Hi All,

This patch series provides miscellaneous fixes for Chelsio T4/T5 adapters:
fixes in cxgb4 related to the SGE and MTU, plus DB Drop avoidance
and other misc. fixes in iw_cxgb4.

The patch series is created against David Miller's 'net-next' tree
and includes patches for the cxgb4 and iw_cxgb4 drivers.

We would like to request this patch series to get merged via David Miller's
'net-next' tree.

We have included all the maintainers of the respective drivers. Kindly review
the changes and let us know in case of any review comments.

Thanks

V6:
   In patch 8/31, move the existing neigh_release() call right before the
   if (!e) test; that way you don't need a completely new label and code block
   to fix this bug - thanks to review by David Miller
   In patch 15/31, use %pad to print dma_addr - thanks to review by Joe Perches
   In patch 10/31, add the STOPPED state string to db_state_str - thanks
   to review by Steve Wise
   In patch 10/31, t4_db_dropped() needs to disable dbs and send DB_FULL
   event to iw_cxgb4 - thanks to review by Steve Wise
V5:
   Dropped patch cxgb4: use spinlock_irqsave/spinlock_irqrestore for db lock.
   The remaining changes from the removed patch are moved into patch 10/31
   (Doorbell Drop Avoidance Bug Fixes).  10/31 has the driver call
   disable_txq_db() from an interrupt handler, and I thought it would be
   better to put all the changes to fix how the db lock is acquired into
   this one patch.   
   save/restore spinlock variants are not required - thanks to review by
   David Miller.
V4:
   Fixed review comments given by Sergei Shtylyov, Joe Perches, Or Gerlitz.
   And, dropped un-used module_params based on comment from Ben Hutchings.
   Also adds a new patch (cxgb4: Calculate len properly for LSO
   path) which fixes a regression.
V3:
   Fixed warnings reported by checkpatch.pl --strict and used networking-style
   multi-line comments. Also includes fixes based on review comments given by
   Sergei Shtylyov.
V2:
   Don't drop existing module parameters.
   (cxgb4/iw_cxgb4: Doorbell Drop Avoidance Bug Fixes.)


Hariprasad Shenai (1):
  iw_cxgb4: Use pr_warn_ratelimited

Kumar Sanghvi (5):
  cxgb4: Fix some small bugs in t4_sge_init_soft() when our Page Size is
64KB
  cxgb4: Add code to dump SGE registers when hitting idma hangs
  cxgb4: Rectify emitting messages about SGE Ingress DMA channels being
potentially stuck
  cxgb4: Updates for T5 SGE's Egress Congestion Threshold
  cxgb4: Calculate len properly for LSO path

Steve Wise (25):
  iw_cxgb4: cap CQ size at T4_MAX_IQ_SIZE
  iw_cxgb4: Allow loopback connections
  iw_cxgb4: release neigh entry
  iw_cxgb4: Treat CPL_ERR_KEEPALV_NEG_ADVICE as negative advice
  cxgb4/iw_cxgb4: Doorbell Drop Avoidance Bug Fixes
  iw_cxgb4: use the BAR2/WC path for kernel QPs and T5 devices
  iw_cxgb4: Fix incorrect BUG_ON conditions
  iw_cxgb4: Mind the sq_sig_all/sq_sig_type QP attributes
  iw_cxgb4: default peer2peer mode to 1
  iw_cxgb4: save the correct map length for fast_reg_page_lists
  iw_cxgb4: don't leak skb in c4iw_uld_rx_handler()
  iw_cxgb4: fix possible memory leak in RX_PKT processing
  iw_cxgb4: ignore read response type 1 CQEs
  iw_cxgb4: connect_request_upcall fixes
  iw_cxgb4: adjust tcp snd/rcv window based on link speed
  iw_cxgb4: update snd_seq when sending MPA messages
  iw_cxgb4: lock around accept/reject downcalls
  iw_cxgb4: drop RX_DATA packets if the endpoint is gone
  iw_cxgb4: rx_data() needs to hold the ep mutex
  iw_cxgb4: endpoint timeout fixes
  iw_cxgb4: rmb() after reading valid gen bit
  iw_cxgb4: wc_wmb() needed after DB writes
  iw_cxgb4: SQ flush fix
  iw_cxgb4: minor fixes
  iw_cxgb4: Max fastreg depth depends on DSGL support

 drivers/infiniband/hw/cxgb4/cm.c| 266 +---
 drivers/infiniband/hw/cxgb4/cq.c|  54 +++--
 drivers/infiniband/hw/cxgb4/device.c| 241 +
 drivers/infiniband/hw/cxgb4/iw_cxgb4.h  |  17 +-
 drivers/infiniband/hw/cxgb4/mem.c   |  18 +-
 drivers/infiniband/hw/cxgb4/provider.c  |  46 +++-
 drivers/infiniband/hw/cxgb4/qp.c| 216 ++-
 drivers/infiniband/hw/cxgb4/resource.c  |  10 +-
 drivers/infiniband/hw/cxgb4/t4.h|  78 ++-
 drivers/infiniband/hw/cxgb4/user.h  |   5 +
 drivers/net/ethernet/chelsio/cxgb4/cxgb4.h  |  11 +-
 drivers/net/ethernet/chelsio/cxgb4/cxgb4_main.c |  84 
 drivers/net/ethernet/chelsio/cxgb4/sge.c| 130 +---
 drivers/net/ethernet/chelsio/cxgb4/t4_hw.c  | 106 ++
 drivers/net/ethernet/chelsio/cxgb4/t4_msg.h |   2 +
 drivers/net/ethernet/chelsio/cxgb4/t4_regs.h|   9 +
 16 files changed, 913 insertions(+), 380 deletions(-)

-- 
1.8.4



[PATCHv6 net-next 06/31] iw_cxgb4: cap CQ size at T4_MAX_IQ_SIZE

2014-03-12 Thread Hariprasad Shenai
From: Steve Wise sw...@opengridcomputing.com

Signed-off-by: Steve Wise sw...@opengridcomputing.com
---
 drivers/infiniband/hw/cxgb4/cq.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/infiniband/hw/cxgb4/cq.c b/drivers/infiniband/hw/cxgb4/cq.c
index 88de3aa..c0673ac 100644
--- a/drivers/infiniband/hw/cxgb4/cq.c
+++ b/drivers/infiniband/hw/cxgb4/cq.c
@@ -881,7 +881,7 @@ struct ib_cq *c4iw_create_cq(struct ib_device *ibdev, int 
entries,
/*
 * Make actual HW queue 2x to avoid cdix_inc overflows.
 */
-   hwentries = entries * 2;
+   hwentries = min(entries * 2, T4_MAX_IQ_SIZE);
 
/*
 * Make HW queue at least 64 entries so GTS updates aren't too
-- 
1.8.4



[PATCHv6 net-next 08/31] iw_cxgb4: release neigh entry

2014-03-12 Thread Hariprasad Shenai
From: Steve Wise sw...@opengridcomputing.com

Always release the neigh entry in rx_pkt().

Based on original work by Santosh Rastapur sant...@chelsio.com.

Signed-off-by: Steve Wise sw...@opengridcomputing.com
---
 drivers/infiniband/hw/cxgb4/cm.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/infiniband/hw/cxgb4/cm.c b/drivers/infiniband/hw/cxgb4/cm.c
index 360807e..2b2af96 100644
--- a/drivers/infiniband/hw/cxgb4/cm.c
+++ b/drivers/infiniband/hw/cxgb4/cm.c
@@ -3347,13 +3347,13 @@ static int rx_pkt(struct c4iw_dev *dev, struct sk_buff 
*skb)
pi = (struct port_info *)netdev_priv(pdev);
tx_chan = cxgb4_port_chan(pdev);
}
+   neigh_release(neigh);
	if (!e) {
		pr_err("%s - failed to allocate l2t entry!\n",
		       __func__);
goto free_dst;
}
 
-	neigh_release(neigh);
	step = dev->rdev.lldi.nrxq / dev->rdev.lldi.nchan;
	rss_qid = dev->rdev.lldi.rxq_ids[pi->port_id * step];
	window = (__force u16) htons((__force u16)tcph->window);
-- 
1.8.4



[PATCHv6 net-next 03/31] cxgb4: Rectify emitting messages about SGE Ingress DMA channels being potentially stuck

2014-03-12 Thread Hariprasad Shenai
From: Kumar Sanghvi kuma...@chelsio.com

Based on original work by Casey Leedom lee...@chelsio.com

Signed-off-by: Kumar Sanghvi kuma...@chelsio.com
---
 drivers/net/ethernet/chelsio/cxgb4/cxgb4.h |  9 ++-
 drivers/net/ethernet/chelsio/cxgb4/sge.c   | 90 --
 2 files changed, 79 insertions(+), 20 deletions(-)

diff --git a/drivers/net/ethernet/chelsio/cxgb4/cxgb4.h 
b/drivers/net/ethernet/chelsio/cxgb4/cxgb4.h
index 509c976..50abe1d 100644
--- a/drivers/net/ethernet/chelsio/cxgb4/cxgb4.h
+++ b/drivers/net/ethernet/chelsio/cxgb4/cxgb4.h
@@ -556,8 +556,13 @@ struct sge {
	u32 pktshift;   /* padding between CPL & packet data */
u32 fl_align;   /* response queue message alignment */
u32 fl_starve_thres;/* Free List starvation threshold */
-   unsigned int starve_thres;
-   u8 idma_state[2];
+
+   /* State variables for detecting an SGE Ingress DMA hang */
+   unsigned int idma_1s_thresh;/* SGE same State Counter 1s threshold */
+   unsigned int idma_stalled[2];/* SGE synthesized stalled timers in HZ */
+   unsigned int idma_state[2]; /* SGE IDMA Hang detect state */
+   unsigned int idma_qid[2];   /* SGE IDMA Hung Ingress Queue ID */
+
unsigned int egr_start;
unsigned int ingr_start;
void *egr_map[MAX_EGRQ];/* qid-queue egress queue map */
diff --git a/drivers/net/ethernet/chelsio/cxgb4/sge.c 
b/drivers/net/ethernet/chelsio/cxgb4/sge.c
index 3a2ecd8..054bb03 100644
--- a/drivers/net/ethernet/chelsio/cxgb4/sge.c
+++ b/drivers/net/ethernet/chelsio/cxgb4/sge.c
@@ -93,6 +93,16 @@
  */
 #define TX_QCHECK_PERIOD (HZ / 2)
 
+/* SGE Hung Ingress DMA Threshold Warning time (in Hz) and Warning Repeat Rate
+ * (in RX_QCHECK_PERIOD multiples).  If we find one of the SGE Ingress DMA
+ * State Machines in the same state for this amount of time (in HZ) then we'll
+ * issue a warning about a potential hang.  We'll repeat the warning as the
+ * SGE Ingress DMA Channel appears to be hung every N RX_QCHECK_PERIODs till
+ * the situation clears.  If the situation clears, we'll note that as well.
+ */
+#define SGE_IDMA_WARN_THRESH (1 * HZ)
+#define SGE_IDMA_WARN_REPEAT (20 * RX_QCHECK_PERIOD)
+
 /*
  * Max number of Tx descriptors to be reclaimed by the Tx timer.
  */
@@ -2008,7 +2018,7 @@ irq_handler_t t4_intr_handler(struct adapter *adap)
 static void sge_rx_timer_cb(unsigned long data)
 {
unsigned long m;
-   unsigned int i, cnt[2];
+   unsigned int i, idma_same_state_cnt[2];
struct adapter *adap = (struct adapter *)data;
	struct sge *s = &adap->sge;
 
@@ -2031,21 +2041,64 @@ static void sge_rx_timer_cb(unsigned long data)
}
 
t4_write_reg(adap, SGE_DEBUG_INDEX, 13);
-	cnt[0] = t4_read_reg(adap, SGE_DEBUG_DATA_HIGH);
-	cnt[1] = t4_read_reg(adap, SGE_DEBUG_DATA_LOW);
-
-	for (i = 0; i < 2; i++)
-		if (cnt[i] >= s->starve_thres) {
-			if (s->idma_state[i] || cnt[i] == 0xffffffff)
-				continue;
-			s->idma_state[i] = 1;
-			t4_write_reg(adap, SGE_DEBUG_INDEX, 11);
-			m = t4_read_reg(adap, SGE_DEBUG_DATA_LOW) >> (i * 16);
-			dev_warn(adap->pdev_dev,
-				 "SGE idma%u starvation detected for "
-				 "queue %lu\n", i, m & 0xffff);
-		} else if (s->idma_state[i])
-			s->idma_state[i] = 0;
+   idma_same_state_cnt[0] = t4_read_reg(adap, SGE_DEBUG_DATA_HIGH);
+   idma_same_state_cnt[1] = t4_read_reg(adap, SGE_DEBUG_DATA_LOW);
+
+	for (i = 0; i < 2; i++) {
+   u32 debug0, debug11;
+
+   /* If the Ingress DMA Same State Counter (timer) is less
+* than 1s, then we can reset our synthesized Stall Timer and
+* continue.  If we have previously emitted warnings about a
+* potential stalled Ingress Queue, issue a note indicating
+* that the Ingress Queue has resumed forward progress.
+*/
+		if (idma_same_state_cnt[i] < s->idma_1s_thresh) {
+			if (s->idma_stalled[i] >= SGE_IDMA_WARN_THRESH)
+				CH_WARN(adap, "SGE idma%d, queue%u, resumed after %d sec\n",
+					i, s->idma_qid[i],
+					s->idma_stalled[i]/HZ);
+			s->idma_stalled[i] = 0;
+			continue;
+   }
+
+   /* Synthesize an SGE Ingress DMA Same State Timer in the Hz
+* domain.  The first time we get here it'll be because we
+* passed the 1s Threshold; each additional time it'll be
+* because the RX Timer Callback is being fired on its regular
+* schedule.
+*
+* If the stall is below our 

[PATCHv6 net-next 09/31] iw_cxgb4: Treat CPL_ERR_KEEPALV_NEG_ADVICE as negative advice

2014-03-12 Thread Hariprasad Shenai
From: Steve Wise sw...@opengridcomputing.com

Based on original work by Anand Priyadarshee ana...@chelsio.com.

Signed-off-by: Steve Wise sw...@opengridcomputing.com
---
 drivers/infiniband/hw/cxgb4/cm.c| 24 
 drivers/net/ethernet/chelsio/cxgb4/t4_msg.h |  1 +
 2 files changed, 13 insertions(+), 12 deletions(-)

diff --git a/drivers/infiniband/hw/cxgb4/cm.c b/drivers/infiniband/hw/cxgb4/cm.c
index 2b2af96..3629b52 100644
--- a/drivers/infiniband/hw/cxgb4/cm.c
+++ b/drivers/infiniband/hw/cxgb4/cm.c
@@ -1648,6 +1648,15 @@ static inline int act_open_has_tid(int status)
   status != CPL_ERR_ARP_MISS;
 }
 
+/* Returns whether a CPL status conveys negative advice.
+ */
+static int is_neg_adv(unsigned int status)
+{
+   return status == CPL_ERR_RTX_NEG_ADVICE ||
+  status == CPL_ERR_PERSIST_NEG_ADVICE ||
+  status == CPL_ERR_KEEPALV_NEG_ADVICE;
+}
+
 #define ACT_OPEN_RETRY_COUNT 2
 
 static int import_ep(struct c4iw_ep *ep, int iptype, __u8 *peer_ip,
@@ -1836,7 +1845,7 @@ static int act_open_rpl(struct c4iw_dev *dev, struct 
sk_buff *skb)
	PDBG("%s ep %p atid %u status %u errno %d\n", __func__, ep, atid,
	     status, status2errno(status));
 
-   if (status == CPL_ERR_RTX_NEG_ADVICE) {
+   if (is_neg_adv(status)) {
		printk(KERN_WARNING MOD "Connection problems for atid %u\n",
		       atid);
return 0;
@@ -2266,15 +2275,6 @@ static int peer_close(struct c4iw_dev *dev, struct 
sk_buff *skb)
return 0;
 }
 
-/*
- * Returns whether an ABORT_REQ_RSS message is a negative advice.
- */
-static int is_neg_adv_abort(unsigned int status)
-{
-   return status == CPL_ERR_RTX_NEG_ADVICE ||
-  status == CPL_ERR_PERSIST_NEG_ADVICE;
-}
-
 static int peer_abort(struct c4iw_dev *dev, struct sk_buff *skb)
 {
struct cpl_abort_req_rss *req = cplhdr(skb);
@@ -2288,7 +2288,7 @@ static int peer_abort(struct c4iw_dev *dev, struct 
sk_buff *skb)
unsigned int tid = GET_TID(req);
 
ep = lookup_tid(t, tid);
-	if (is_neg_adv_abort(req->status)) {
+	if (is_neg_adv(req->status)) {
		PDBG("%s neg_adv_abort ep %p tid %u\n", __func__, ep,
		     ep->hwtid);
return 0;
@@ -3571,7 +3571,7 @@ static int peer_abort_intr(struct c4iw_dev *dev, struct 
sk_buff *skb)
kfree_skb(skb);
return 0;
}
-	if (is_neg_adv_abort(req->status)) {
+	if (is_neg_adv(req->status)) {
		PDBG("%s neg_adv_abort ep %p tid %u\n", __func__, ep,
		     ep->hwtid);
kfree_skb(skb);
diff --git a/drivers/net/ethernet/chelsio/cxgb4/t4_msg.h 
b/drivers/net/ethernet/chelsio/cxgb4/t4_msg.h
index cd6874b..f2738c7 100644
--- a/drivers/net/ethernet/chelsio/cxgb4/t4_msg.h
+++ b/drivers/net/ethernet/chelsio/cxgb4/t4_msg.h
@@ -116,6 +116,7 @@ enum CPL_error {
CPL_ERR_KEEPALIVE_TIMEDOUT = 34,
CPL_ERR_RTX_NEG_ADVICE = 35,
CPL_ERR_PERSIST_NEG_ADVICE = 36,
+   CPL_ERR_KEEPALV_NEG_ADVICE = 37,
CPL_ERR_ABORT_FAILED   = 42,
CPL_ERR_IWARP_FLM  = 50,
 };
-- 
1.8.4



[PATCHv6 net-next 27/31] iw_cxgb4: wc_wmb() needed after DB writes

2014-03-12 Thread Hariprasad Shenai
From: Steve Wise sw...@opengridcomputing.com

Need to do an sfence after both the WC and regular PIDX DB write.
Otherwise the host might reorder things and cause work request corruption
(seen with NFSRDMA).

Signed-off-by: Steve Wise sw...@opengridcomputing.com
---
 drivers/infiniband/hw/cxgb4/t4.h | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/infiniband/hw/cxgb4/t4.h b/drivers/infiniband/hw/cxgb4/t4.h
index 67cd09ee..ace3154 100644
--- a/drivers/infiniband/hw/cxgb4/t4.h
+++ b/drivers/infiniband/hw/cxgb4/t4.h
@@ -470,12 +470,12 @@ static inline void t4_ring_sq_db(struct t4_wq *wq, u16 
inc, u8 t5,
		PDBG("%s: WC wq->sq.pidx = %d\n",
		     __func__, wq->sq.pidx);
		pio_copy(wq->sq.udb + 7, (void *)wqe);
-   wc_wmb();
} else {
		PDBG("%s: DB wq->sq.pidx = %d\n",
		     __func__, wq->sq.pidx);
		writel(PIDX_T5(inc), wq->sq.udb);
}
+   wc_wmb();
return;
}
	writel(QID(wq->sq.qid) | PIDX(inc), wq->db);
@@ -490,12 +490,12 @@ static inline void t4_ring_rq_db(struct t4_wq *wq, u16 
inc, u8 t5,
		PDBG("%s: WC wq->rq.pidx = %d\n",
		     __func__, wq->rq.pidx);
		pio_copy(wq->rq.udb + 7, (void *)wqe);
-   wc_wmb();
} else {
		PDBG("%s: DB wq->rq.pidx = %d\n",
		     __func__, wq->rq.pidx);
		writel(PIDX_T5(inc), wq->rq.udb);
}
+   wc_wmb();
return;
}
	writel(QID(wq->rq.qid) | PIDX(inc), wq->db);
-- 
1.8.4



[PATCHv6 net-next 18/31] iw_cxgb4: ignore read response type 1 CQEs

2014-03-12 Thread Hariprasad Shenai
From: Steve Wise sw...@opengridcomputing.com

These are generated by HW in some error cases and need to be
silently discarded.

Signed-off-by: Steve Wise sw...@opengridcomputing.com
---
 drivers/infiniband/hw/cxgb4/cq.c | 24 
 1 file changed, 20 insertions(+), 4 deletions(-)

diff --git a/drivers/infiniband/hw/cxgb4/cq.c b/drivers/infiniband/hw/cxgb4/cq.c
index 59f7601..55f7157 100644
--- a/drivers/infiniband/hw/cxgb4/cq.c
+++ b/drivers/infiniband/hw/cxgb4/cq.c
@@ -365,8 +365,14 @@ void c4iw_flush_hw_cq(struct c4iw_cq *chp)
 
if (CQE_OPCODE(hw_cqe) == FW_RI_READ_RESP) {
 
-   /*
-* drop peer2peer RTR reads.
+   /* If we have reached here because of async
+* event or other error, and have egress error
+* then drop
+*/
+   if (CQE_TYPE(hw_cqe) == 1)
+   goto next_cqe;
+
+   /* drop peer2peer RTR reads.
 */
if (CQE_WRID_STAG(hw_cqe) == 1)
goto next_cqe;
@@ -511,8 +517,18 @@ static int poll_cq(struct t4_wq *wq, struct t4_cq *cq, 
struct t4_cqe *cqe,
 */
	if (RQ_TYPE(hw_cqe) && (CQE_OPCODE(hw_cqe) == FW_RI_READ_RESP)) {
 
-   /*
-* If this is an unsolicited read response, then the read
+   /* If we have reached here because of async
+* event or other error, and have egress error
+* then drop
+*/
+   if (CQE_TYPE(hw_cqe) == 1) {
+   if (CQE_STATUS(hw_cqe))
+   t4_set_wq_in_error(wq);
+   ret = -EAGAIN;
+   goto skip_cqe;
+   }
+
+   /* If this is an unsolicited read response, then the read
 * was generated by the kernel driver as part of peer-2-peer
 * connection setup.  So ignore the completion.
 */
-- 
1.8.4



[PATCHv6 net-next 12/31] iw_cxgb4: Fix incorrect BUG_ON conditions

2014-03-12 Thread Hariprasad Shenai
From: Steve Wise sw...@opengridcomputing.com

Based on original work from Jay Hernandez j...@chelsio.com

Signed-off-by: Steve Wise sw...@opengridcomputing.com
---
 drivers/infiniband/hw/cxgb4/cq.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/infiniband/hw/cxgb4/cq.c b/drivers/infiniband/hw/cxgb4/cq.c
index c0673ac..59f7601 100644
--- a/drivers/infiniband/hw/cxgb4/cq.c
+++ b/drivers/infiniband/hw/cxgb4/cq.c
@@ -603,7 +603,7 @@ proc_cqe:
 */
if (SQ_TYPE(hw_cqe)) {
int idx = CQE_WRID_SQ_IDX(hw_cqe);
-		BUG_ON(idx < wq->sq.size);
+		BUG_ON(idx >= wq->sq.size);
 
/*
* Account for any unsignaled completions completed by
@@ -617,7 +617,7 @@ proc_cqe:
			wq->sq.in_use -= wq->sq.size + idx - wq->sq.cidx;
		else
			wq->sq.in_use -= idx - wq->sq.cidx;
-		BUG_ON(wq->sq.in_use < 0 && wq->sq.in_use > wq->sq.size);
+		BUG_ON(wq->sq.in_use <= 0 && wq->sq.in_use >= wq->sq.size);
 
		wq->sq.cidx = (uint16_t)idx;
		PDBG("%s completing sq idx %u\n", __func__, wq->sq.cidx);
-- 
1.8.4



[PATCHv6 net-next 21/31] iw_cxgb4: update snd_seq when sending MPA messages

2014-03-12 Thread Hariprasad Shenai
From: Steve Wise sw...@opengridcomputing.com

Signed-off-by: Steve Wise sw...@opengridcomputing.com
---
 drivers/infiniband/hw/cxgb4/cm.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/drivers/infiniband/hw/cxgb4/cm.c b/drivers/infiniband/hw/cxgb4/cm.c
index 0663fc4..87bd3c8 100644
--- a/drivers/infiniband/hw/cxgb4/cm.c
+++ b/drivers/infiniband/hw/cxgb4/cm.c
@@ -777,6 +777,7 @@ static void send_mpa_req(struct c4iw_ep *ep, struct sk_buff 
*skb,
start_ep_timer(ep);
	state_set(&ep->com, MPA_REQ_SENT);
	ep->mpa_attr.initiator = 1;
+	ep->snd_seq += mpalen;
return;
 }
 
@@ -856,6 +857,7 @@ static int send_mpa_reject(struct c4iw_ep *ep, const void 
*pdata, u8 plen)
	t4_set_arp_err_handler(skb, NULL, arp_failure_discard);
	BUG_ON(ep->mpa_skb);
	ep->mpa_skb = skb;
+	ep->snd_seq += mpalen;
	return c4iw_l2t_send(&ep->com.dev->rdev, skb, ep->l2t);
 }
 
@@ -940,6 +942,7 @@ static int send_mpa_reply(struct c4iw_ep *ep, const void 
*pdata, u8 plen)
	t4_set_arp_err_handler(skb, NULL, arp_failure_discard);
	ep->mpa_skb = skb;
	state_set(&ep->com, MPA_REP_SENT);
+	ep->snd_seq += mpalen;
	return c4iw_l2t_send(&ep->com.dev->rdev, skb, ep->l2t);
 }
 
-- 
1.8.4



[PATCHv6 net-next 20/31] iw_cxgb4: adjust tcp snd/rcv window based on link speed

2014-03-12 Thread Hariprasad Shenai
From: Steve Wise sw...@opengridcomputing.com

40G devices need bigger windows, so default 40G devices to snd 512K
rcv 1024K.

Fixed a bug that shows up with recv window sizes that exceed the size of
the RCV_BUFSIZ field in opt0 (>= 1024K :).  If the recv window exceeds
this, then we specify the max possible in opt0 and add the rest in via
RX_DATA_ACK credits.

Added module option named adjust_win, defaulted to 1, that allows
disabling the 40G window bump.  This allows a user to specify the exact
default window sizes via module options snd_win and rcv_win.

Signed-off-by: Steve Wise sw...@opengridcomputing.com
---
 drivers/infiniband/hw/cxgb4/cm.c| 63 +++--
 drivers/infiniband/hw/cxgb4/iw_cxgb4.h  |  2 +
 drivers/net/ethernet/chelsio/cxgb4/t4_msg.h |  1 +
 3 files changed, 62 insertions(+), 4 deletions(-)

diff --git a/drivers/infiniband/hw/cxgb4/cm.c b/drivers/infiniband/hw/cxgb4/cm.c
index a9fb73a..0663fc4 100644
--- a/drivers/infiniband/hw/cxgb4/cm.c
+++ b/drivers/infiniband/hw/cxgb4/cm.c
@@ -134,6 +134,11 @@ static int snd_win = 128 * 1024;
 module_param(snd_win, int, 0644);
 MODULE_PARM_DESC(snd_win, "TCP send window in bytes (default=128KB)");
 
+static int adjust_win = 1;
+module_param(adjust_win, int, 0644);
+MODULE_PARM_DESC(adjust_win,
+		 "Adjust TCP window based on link speed (default=1)");
+
 static struct workqueue_struct *workq;
 
 static struct sk_buff_head rxq;
@@ -465,7 +470,7 @@ static void send_flowc(struct c4iw_ep *ep, struct sk_buff 
*skb)
	flowc->mnemval[5].mnemonic = FW_FLOWC_MNEM_RCVNXT;
	flowc->mnemval[5].val = cpu_to_be32(ep->rcv_seq);
	flowc->mnemval[6].mnemonic = FW_FLOWC_MNEM_SNDBUF;
-	flowc->mnemval[6].val = cpu_to_be32(snd_win);
+	flowc->mnemval[6].val = cpu_to_be32(ep->snd_win);
	flowc->mnemval[7].mnemonic = FW_FLOWC_MNEM_MSS;
	flowc->mnemval[7].val = cpu_to_be32(ep->emss);
/* Pad WR to 16 byte boundary */
@@ -547,6 +552,7 @@ static int send_connect(struct c4iw_ep *ep)
	struct sockaddr_in *ra = (struct sockaddr_in *)&ep->com.remote_addr;
	struct sockaddr_in6 *la6 = (struct sockaddr_in6 *)&ep->com.local_addr;
	struct sockaddr_in6 *ra6 = (struct sockaddr_in6 *)&ep->com.remote_addr;
+   int win;
 
	wrlen = (ep->com.remote_addr.ss_family == AF_INET) ?
roundup(sizev4, 16) :
@@ -564,6 +570,15 @@ static int send_connect(struct c4iw_ep *ep)
 
	cxgb4_best_mtu(ep->com.dev->rdev.lldi.mtus, ep->mtu, &mtu_idx);
wscale = compute_wscale(rcv_win);
+
+   /*
+* Specify the largest window that will fit in opt0. The
+* remainder will be specified in the rx_data_ack.
+*/
+	win = ep->rcv_win >> 10;
+	if (win > RCV_BUFSIZ_MASK)
+		win = RCV_BUFSIZ_MASK;
+
opt0 = (nocong ? NO_CONG(1) : 0) |
   KEEP_ALIVE(1) |
   DELACK(1) |
@@ -574,7 +589,7 @@ static int send_connect(struct c4iw_ep *ep)
	       SMAC_SEL(ep->smac_idx) |
	       DSCP(ep->tos) |
	       ULP_MODE(ULP_MODE_TCPDDP) |
-	       RCV_BUFSIZ(rcv_win >> 10);
+	       RCV_BUFSIZ(win);
	opt2 = RX_CHANNEL(0) |
	       CCTRL_ECN(enable_ecn) |
	       RSS_QUEUE_VALID | RSS_QUEUE(ep->rss_qid);
@@ -1134,6 +1149,14 @@ static int update_rx_credits(struct c4iw_ep *ep, u32 
credits)
return 0;
}
 
+   /*
+* If we couldn't specify the entire rcv window at connection setup
+* due to the limit in the number of bits in the RCV_BUFSIZ field,
+* then add the overage in to the credits returned.
+*/
+	if (ep->rcv_win > RCV_BUFSIZ_MASK * 1024)
+		credits += ep->rcv_win - RCV_BUFSIZ_MASK * 1024;
+
req = (struct cpl_rx_data_ack *) skb_put(skb, wrlen);
memset(req, 0, wrlen);
	INIT_TP_WR(req, ep->hwtid);
@@ -1592,6 +1615,7 @@ static void send_fw_act_open_req(struct c4iw_ep *ep, 
unsigned int atid)
unsigned int mtu_idx;
int wscale;
struct sockaddr_in *sin;
+   int win;
 
skb = get_skb(NULL, sizeof(*req), GFP_KERNEL);
req = (struct fw_ofld_connection_wr *)__skb_put(skb, sizeof(*req));
@@ -1616,6 +1640,15 @@ static void send_fw_act_open_req(struct c4iw_ep *ep, 
unsigned int atid)
	req->tcb.rcv_adv = htons(1);
	cxgb4_best_mtu(ep->com.dev->rdev.lldi.mtus, ep->mtu, &mtu_idx);
wscale = compute_wscale(rcv_win);
+
+   /*
+* Specify the largest window that will fit in opt0. The
+* remainder will be specified in the rx_data_ack.
+*/
+	win = ep->rcv_win >> 10;
+	if (win > RCV_BUFSIZ_MASK)
+		win = RCV_BUFSIZ_MASK;
+
	req->tcb.opt0 = (__force __be64) (TCAM_BYPASS(1) |
(nocong ? NO_CONG(1) : 0) |
KEEP_ALIVE(1) |
@@ -1627,7 +1660,7 @@ static void send_fw_act_open_req(struct c4iw_ep *ep, 
unsigned int atid)
		SMAC_SEL(ep->smac_idx) |
   

[PATCHv6 net-next 31/31] iw_cxgb4: Use pr_warn_ratelimited

2014-03-12 Thread Hariprasad Shenai
Signed-off-by: Hariprasad Shenai haripra...@chelsio.com
---
 drivers/infiniband/hw/cxgb4/resource.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/infiniband/hw/cxgb4/resource.c 
b/drivers/infiniband/hw/cxgb4/resource.c
index d9bc9ba..67df71a 100644
--- a/drivers/infiniband/hw/cxgb4/resource.c
+++ b/drivers/infiniband/hw/cxgb4/resource.c
@@ -326,8 +326,8 @@ u32 c4iw_rqtpool_alloc(struct c4iw_rdev *rdev, int size)
	unsigned long addr = gen_pool_alloc(rdev->rqt_pool, size << 6);
	PDBG("%s addr 0x%x size %d\n", __func__, (u32)addr, size << 6);
if (!addr)
-		printk_ratelimited(KERN_WARNING MOD "%s: Out of RQT memory\n",
-				   pci_name(rdev->lldi.pdev));
+		pr_warn_ratelimited(MOD "%s: Out of RQT memory\n",
+				    pci_name(rdev->lldi.pdev));
	mutex_lock(&rdev->stats.lock);
	if (addr) {
		rdev->stats.rqt.cur += roundup(size << 6, 1 << MIN_RQT_SHIFT);
-- 
1.8.4



[PATCHv6 net-next 15/31] iw_cxgb4: save the correct map length for fast_reg_page_lists

2014-03-12 Thread Hariprasad Shenai
From: Steve Wise sw...@opengridcomputing.com

We cannot save the mapped length using the rdma max_page_list_len field
of the ib_fast_reg_page_list struct because the core code uses it.  This
results in an incorrect unmap of the page list in c4iw_free_fastreg_pbl().

I found this with dma map debugging enabled in the kernel.  The fix is
to save the length in the c4iw_fr_page_list struct.

Signed-off-by: Steve Wise sw...@opengridcomputing.com
---
 drivers/infiniband/hw/cxgb4/iw_cxgb4.h |  1 +
 drivers/infiniband/hw/cxgb4/mem.c  | 12 ++--
 2 files changed, 11 insertions(+), 2 deletions(-)

diff --git a/drivers/infiniband/hw/cxgb4/iw_cxgb4.h 
b/drivers/infiniband/hw/cxgb4/iw_cxgb4.h
index 8c32088..b75f8f5 100644
--- a/drivers/infiniband/hw/cxgb4/iw_cxgb4.h
+++ b/drivers/infiniband/hw/cxgb4/iw_cxgb4.h
@@ -375,6 +375,7 @@ struct c4iw_fr_page_list {
DEFINE_DMA_UNMAP_ADDR(mapping);
dma_addr_t dma_addr;
struct c4iw_dev *dev;
+   int pll_len;
 };
 
 static inline struct c4iw_fr_page_list *to_c4iw_fr_page_list(
diff --git a/drivers/infiniband/hw/cxgb4/mem.c 
b/drivers/infiniband/hw/cxgb4/mem.c
index 41b1195..22a2e3e 100644
--- a/drivers/infiniband/hw/cxgb4/mem.c
+++ b/drivers/infiniband/hw/cxgb4/mem.c
@@ -903,7 +903,11 @@ struct ib_fast_reg_page_list 
*c4iw_alloc_fastreg_pbl(struct ib_device *device,
	dma_unmap_addr_set(c4pl, mapping, dma_addr);
	c4pl->dma_addr = dma_addr;
	c4pl->dev = dev;
-	c4pl->ibpl.max_page_list_len = pll_len;
+	c4pl->pll_len = pll_len;
+
+	PDBG("%s c4pl %p pll_len %u page_list %p dma_addr %pad\n",
+	     __func__, c4pl, c4pl->pll_len, c4pl->ibpl.page_list,
+	     &c4pl->dma_addr);
 
	return &c4pl->ibpl;
 }
@@ -912,8 +916,12 @@ void c4iw_free_fastreg_pbl(struct ib_fast_reg_page_list 
*ibpl)
 {
struct c4iw_fr_page_list *c4pl = to_c4iw_fr_page_list(ibpl);
 
+	PDBG("%s c4pl %p pll_len %u page_list %p dma_addr %pad\n",
+	     __func__, c4pl, c4pl->pll_len, c4pl->ibpl.page_list,
+	     &c4pl->dma_addr);
+
	dma_free_coherent(&c4pl->dev->rdev.lldi.pdev->dev,
-			  c4pl->ibpl.max_page_list_len,
+			  c4pl->pll_len,
			  c4pl->ibpl.page_list, dma_unmap_addr(c4pl, mapping));
kfree(c4pl);
 }
-- 
1.8.4



[PATCHv6 net-next 07/31] iw_cxgb4: Allow loopback connections

2014-03-12 Thread Hariprasad Shenai
From: Steve Wise sw...@opengridcomputing.com

find_route() must treat loopback as a valid
egress interface.

Signed-off-by: Steve Wise sw...@opengridcomputing.com
---
 drivers/infiniband/hw/cxgb4/cm.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/infiniband/hw/cxgb4/cm.c b/drivers/infiniband/hw/cxgb4/cm.c
index d286bde..360807e 100644
--- a/drivers/infiniband/hw/cxgb4/cm.c
+++ b/drivers/infiniband/hw/cxgb4/cm.c
@@ -400,7 +400,8 @@ static struct dst_entry *find_route(struct c4iw_dev *dev, 
__be32 local_ip,
	n = dst_neigh_lookup(&rt->dst, &peer_ip);
	if (!n)
		return NULL;
-	if (!our_interface(dev, n->dev)) {
+	if (!our_interface(dev, n->dev) &&
+	    !(n->dev->flags & IFF_LOOPBACK)) {
		dst_release(&rt->dst);
return NULL;
}
-- 
1.8.4



[PATCHv6 net-next 04/31] cxgb4: Updates for T5 SGE's Egress Congestion Threshold

2014-03-12 Thread Hariprasad Shenai
From: Kumar Sanghvi kuma...@chelsio.com

Based on original work by Casey Leedom lee...@chelsio.com

Signed-off-by: Kumar Sanghvi kuma...@chelsio.com
---
 drivers/net/ethernet/chelsio/cxgb4/sge.c | 18 +-
 drivers/net/ethernet/chelsio/cxgb4/t4_regs.h |  6 ++
 2 files changed, 19 insertions(+), 5 deletions(-)

diff --git a/drivers/net/ethernet/chelsio/cxgb4/sge.c b/drivers/net/ethernet/chelsio/cxgb4/sge.c
index 054bb03..a7c56b3 100644
--- a/drivers/net/ethernet/chelsio/cxgb4/sge.c
+++ b/drivers/net/ethernet/chelsio/cxgb4/sge.c
@@ -2776,8 +2776,8 @@ static int t4_sge_init_hard(struct adapter *adap)
 int t4_sge_init(struct adapter *adap)
 {
	struct sge *s = &adap->sge;
-   u32 sge_control;
-   int ret;
+   u32 sge_control, sge_conm_ctrl;
+   int ret, egress_threshold;
 
/*
 * Ingress Padding Boundary and Egress Status Page Size are set up by
@@ -2802,10 +2802,18 @@ int t4_sge_init(struct adapter *adap)
 * SGE's Egress Congestion Threshold.  If it isn't, then we can get
 * stuck waiting for new packets while the SGE is waiting for us to
 * give it more Free List entries.  (Note that the SGE's Egress
-* Congestion Threshold is in units of 2 Free List pointers.)
+* Congestion Threshold is in units of 2 Free List pointers.) For T4,
+* there was only a single field to control this.  For T5 there's the
+* original field which now only applies to Unpacked Mode Free List
+* buffers and a new field which only applies to Packed Mode Free List
+* buffers.
 */
-	s->fl_starve_thres
-		= EGRTHRESHOLD_GET(t4_read_reg(adap, SGE_CONM_CTRL))*2 + 1;
+	sge_conm_ctrl = t4_read_reg(adap, SGE_CONM_CTRL);
+	if (is_t4(adap->params.chip))
+		egress_threshold = EGRTHRESHOLD_GET(sge_conm_ctrl);
+	else
+		egress_threshold = EGRTHRESHOLDPACKING_GET(sge_conm_ctrl);
+	s->fl_starve_thres = 2*egress_threshold + 1;

	setup_timer(&s->rx_timer, sge_rx_timer_cb, (unsigned long)adap);
	setup_timer(&s->tx_timer, sge_tx_timer_cb, (unsigned long)adap);
diff --git a/drivers/net/ethernet/chelsio/cxgb4/t4_regs.h b/drivers/net/ethernet/chelsio/cxgb4/t4_regs.h
index 33cf9ef..225ad8a 100644
--- a/drivers/net/ethernet/chelsio/cxgb4/t4_regs.h
+++ b/drivers/net/ethernet/chelsio/cxgb4/t4_regs.h
@@ -230,6 +230,12 @@
 #define  EGRTHRESHOLD(x) ((x) << EGRTHRESHOLDshift)
 #define  EGRTHRESHOLD_GET(x) (((x) & EGRTHRESHOLD_MASK) >> EGRTHRESHOLDshift)

+#define EGRTHRESHOLDPACKING_MASK   0x3fU
+#define EGRTHRESHOLDPACKING_SHIFT  14
+#define EGRTHRESHOLDPACKING(x) ((x) << EGRTHRESHOLDPACKING_SHIFT)
+#define EGRTHRESHOLDPACKING_GET(x) (((x) >> EGRTHRESHOLDPACKING_SHIFT) & \
+				    EGRTHRESHOLDPACKING_MASK)
+
 #define SGE_DBFIFO_STATUS 0x10a4
 #define  HP_INT_THRESH_SHIFT 28
 #define  HP_INT_THRESH_MASK  0xfU
-- 
1.8.4



[PATCHv6 net-next 10/31] cxgb4/iw_cxgb4: Doorbell Drop Avoidance Bug Fixes

2014-03-12 Thread Hariprasad Shenai
From: Steve Wise sw...@opengridcomputing.com

The current logic suffers from a slow response time to disable user DB
usage, and also fails to avoid DB FIFO drops under heavy load. This commit
fixes these deficiencies and makes the avoidance logic more optimal.
This is done by more efficiently notifying the ULDs of potential DB
problems, and implements a smoother flow control algorithm in iw_cxgb4,
which is the ULD that puts the most load on the DB fifo.

Design:

cxgb4:

Direct ULD notification when a DB FULL/DROP interrupt fires.  This allows
the ULD to stop doing user DB writes as quickly as possible.

While user DB usage is disabled, the LLD will accumulate DB write events
for its queues.  Then once DB usage is reenabled, a single DB write is
done for each queue with its accumulated write count.  This reduces the
load put on the DB fifo when reenabling.

iw_cxgb4:

Instead of marking each qp to indicate DB writes are disabled, we create
a device-global status page that each user process maps.  This allows
iw_cxgb4 to only set this single bit to disable all DB write for all
user QPs vs traversing all the active QPs.  If the libcxgb4 doesn't
support this, then we fall back to the old approach of marking each QP.
Thus we allow the new driver to work with an older libcxgb4.

When the LLD upcalls indicating DB FULL, we disable all DB writes
via the status page and transition the DB state to STOPPED.  As user
processes see that DB writes are disabled, they call into iw_cxgb4 to
submit their DB write events.  Since the DB state is STOPPED,
the QP trying to write gets enqueued on a new DB flow control list.
As subsequent DB writes are submitted for this flow controlled QP, the
amount of writes are accumulated for each QP on the flow control list.
So all the user QPs that are actively ringing the DB get put on this
list and the number of writes they request are accumulated.

When the LLD upcalls indicating DB EMPTY, which is in a workq context, we
change the DB state to FLOW_CONTROL, and begin resuming all the QPs that
are on the flow control list.  This logic runs on until the flow control
list is empty or we exit FLOW_CONTROL mode (due to a DB DROP upcall,
for example).  QPs are removed from this list, and their accumulated
DB write counts written to the DB FIFO.  Sets of QPs, called chunks in
the code, are removed at one time. This chunk size is a module option,
db_fc_resume_size, and defaults to 64.  So 64 QPs are resumed at a time,
and before the next chunk is resumed, the logic waits (blocks) for the
DB FIFO to drain.  This prevents resuming too quickly and overflowing
the FIFO.  Once the flow control list is empty, the db state transitions
back to NORMAL and user QPs are again allowed to write directly to the
user DB register.

The algorithm is designed such that if the DB write load is high enough,
then all the DB writes get submitted by the kernel using this flow
controlled approach to avoid DB drops.  As the load lightens though, we
resume to normal DB writes directly by user applications.
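
As an illustration of the resume logic described above (this is only a
sketch, not the actual driver code; the helpers resume_one_qp() and
db_fifo_drained() are invented names):

	/* Drain the flow-control list in chunks of db_fc_resume_size,
	 * letting the DB FIFO empty out between chunks so we don't
	 * recreate the overflow we are trying to avoid.
	 */
	while (!list_empty(&dev->db_fc_list) &&
	       dev->db_state == FLOW_CONTROL) {
		int chunk = db_fc_resume_size;

		while (chunk-- && !list_empty(&dev->db_fc_list)) {
			struct c4iw_qp *qp;

			qp = list_first_entry(&dev->db_fc_list,
					      struct c4iw_qp, db_fc_entry);
			list_del_init(&qp->db_fc_entry);
			resume_one_qp(qp); /* write accumulated DB count */
		}
		wait_event_timeout(dev->db_fc_wait, db_fifo_drained(dev),
				   db_fc_resume_delay);
	}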

Signed-off-by: Steve Wise sw...@opengridcomputing.com
---
 drivers/infiniband/hw/cxgb4/device.c| 198 ++--
 drivers/infiniband/hw/cxgb4/iw_cxgb4.h  |  11 +-
 drivers/infiniband/hw/cxgb4/provider.c  |  44 +-
 drivers/infiniband/hw/cxgb4/qp.c| 152 --
 drivers/infiniband/hw/cxgb4/t4.h|   6 +
 drivers/infiniband/hw/cxgb4/user.h  |   5 +
 drivers/net/ethernet/chelsio/cxgb4/cxgb4.h  |   1 +
 drivers/net/ethernet/chelsio/cxgb4/cxgb4_main.c |  84 +-
 drivers/net/ethernet/chelsio/cxgb4/sge.c|   8 +-
 9 files changed, 296 insertions(+), 213 deletions(-)

diff --git a/drivers/infiniband/hw/cxgb4/device.c 
b/drivers/infiniband/hw/cxgb4/device.c
index 4a03385..f485849 100644
--- a/drivers/infiniband/hw/cxgb4/device.c
+++ b/drivers/infiniband/hw/cxgb4/device.c
@@ -45,15 +45,22 @@ MODULE_DESCRIPTION("Chelsio T4/T5 RDMA Driver");
 MODULE_LICENSE("Dual BSD/GPL");
 MODULE_VERSION(DRV_VERSION);

-static int allow_db_fc_on_t5;
-module_param(allow_db_fc_on_t5, int, 0644);
-MODULE_PARM_DESC(allow_db_fc_on_t5,
-		 "Allow DB Flow Control on T5 (default = 0)");
-
-static int allow_db_coalescing_on_t5;
-module_param(allow_db_coalescing_on_t5, int, 0644);
-MODULE_PARM_DESC(allow_db_coalescing_on_t5,
-		 "Allow DB Coalescing on T5 (default = 0)");
+static int db_fc_resume_size = 64;
+module_param(db_fc_resume_size, int, 0644);
+MODULE_PARM_DESC(db_fc_resume_size, "qps are resumed from db flow control in "
+		 "this size chunks (default = 64)");
+
+static int db_fc_resume_delay = 1;
+module_param(db_fc_resume_delay, int, 0644);
+MODULE_PARM_DESC(db_fc_resume_delay, "how long to delay between removing qps "
+		 "from the fc list (default is 1 jiffy)");
+
+static int db_fc_drain_thresh;
+module_param(db_fc_drain_thresh, int, 0644);
+MODULE_PARM_DESC(db_fc_drain_thresh,
+

[PATCHv6 net-next 16/31] iw_cxgb4: don't leak skb in c4iw_uld_rx_handler()

2014-03-12 Thread Hariprasad Shenai
From: Steve Wise sw...@opengridcomputing.com

Signed-off-by: Steve Wise sw...@opengridcomputing.com
---
 drivers/infiniband/hw/cxgb4/device.c | 6 --
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/drivers/infiniband/hw/cxgb4/device.c b/drivers/infiniband/hw/cxgb4/device.c
index 2cf2a06..3958a52 100644
--- a/drivers/infiniband/hw/cxgb4/device.c
+++ b/drivers/infiniband/hw/cxgb4/device.c
@@ -924,11 +924,13 @@ static int c4iw_uld_rx_handler(void *handle, const __be64 *rsp,
}
 
opcode = *(u8 *)rsp;
-   if (c4iw_handlers[opcode])
+   if (c4iw_handlers[opcode]) {
c4iw_handlers[opcode](dev, skb);
-   else
+   } else {
		pr_info("%s no handler opcode 0x%x...\n", __func__,
   opcode);
+   kfree_skb(skb);
+   }
 
return 0;
 nomem:
-- 
1.8.4



[PATCHv6 net-next 13/31] iw_cxgb4: Mind the sq_sig_all/sq_sig_type QP attributes

2014-03-12 Thread Hariprasad Shenai
From: Steve Wise sw...@opengridcomputing.com

Signed-off-by: Steve Wise sw...@opengridcomputing.com
---
 drivers/infiniband/hw/cxgb4/iw_cxgb4.h | 1 +
 drivers/infiniband/hw/cxgb4/qp.c   | 6 --
 2 files changed, 5 insertions(+), 2 deletions(-)

diff --git a/drivers/infiniband/hw/cxgb4/iw_cxgb4.h b/drivers/infiniband/hw/cxgb4/iw_cxgb4.h
index c05c875..8c32088 100644
--- a/drivers/infiniband/hw/cxgb4/iw_cxgb4.h
+++ b/drivers/infiniband/hw/cxgb4/iw_cxgb4.h
@@ -448,6 +448,7 @@ struct c4iw_qp {
atomic_t refcnt;
wait_queue_head_t wait;
struct timer_list timer;
+   int sq_sig_all;
 };
 
 static inline struct c4iw_qp *to_c4iw_qp(struct ib_qp *ibqp)
diff --git a/drivers/infiniband/hw/cxgb4/qp.c b/drivers/infiniband/hw/cxgb4/qp.c
index 7aee163..4c70df9 100644
--- a/drivers/infiniband/hw/cxgb4/qp.c
+++ b/drivers/infiniband/hw/cxgb4/qp.c
@@ -720,7 +720,7 @@ int c4iw_post_send(struct ib_qp *ibqp, struct ib_send_wr *wr,
	fw_flags = 0;
	if (wr->send_flags & IB_SEND_SOLICITED)
		fw_flags |= FW_RI_SOLICITED_EVENT_FLAG;
-	if (wr->send_flags & IB_SEND_SIGNALED)
+	if (wr->send_flags & IB_SEND_SIGNALED || qhp->sq_sig_all)
		fw_flags |= FW_RI_COMPLETION_FLAG;
	swsqe = &qhp->wq.sq.sw_sq[qhp->wq.sq.pidx];
	switch (wr->opcode) {
@@ -781,7 +781,8 @@ int c4iw_post_send(struct ib_qp *ibqp, struct ib_send_wr *wr,
	}
	swsqe->idx = qhp->wq.sq.pidx;
	swsqe->complete = 0;
-	swsqe->signaled = (wr->send_flags & IB_SEND_SIGNALED);
+	swsqe->signaled = (wr->send_flags & IB_SEND_SIGNALED) ||
+			  qhp->sq_sig_all;
	swsqe->flushed = 0;
	swsqe->wr_id = wr->wr_id;
 
@@ -1608,6 +1609,7 @@ struct ib_qp *c4iw_create_qp(struct ib_pd *pd, struct ib_qp_init_attr *attrs,
	qhp->attr.enable_bind = 1;
	qhp->attr.max_ord = 1;
	qhp->attr.max_ird = 1;
+	qhp->sq_sig_all = attrs->sq_sig_type == IB_SIGNAL_ALL_WR;
	spin_lock_init(&qhp->lock);
	mutex_init(&qhp->mutex);
	init_waitqueue_head(&qhp->wait);
-- 
1.8.4



[PATCHv6 net-next 05/31] cxgb4: Calculate len properly for LSO path

2014-03-12 Thread Hariprasad Shenai
From: Kumar Sanghvi kuma...@chelsio.com

Commit 0034b29 (cxgb4: Don't assume LSO only uses SGL path in t4_eth_xmit())
introduced a regression wherein the length was calculated wrongly for the LSO path,
causing chip hangs.
So, correct the calculation of len.

Fixes: 0034b29 (cxgb4: Don't assume LSO only uses SGL path in t4_eth_xmit())
Signed-off-by: Kumar Sanghvi kuma...@chelsio.com
---
 drivers/net/ethernet/chelsio/cxgb4/sge.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/chelsio/cxgb4/sge.c b/drivers/net/ethernet/chelsio/cxgb4/sge.c
index a7c56b3..46429f9 100644
--- a/drivers/net/ethernet/chelsio/cxgb4/sge.c
+++ b/drivers/net/ethernet/chelsio/cxgb4/sge.c
@@ -1051,7 +1051,6 @@ out_free: dev_kfree_skb(skb);
end = (u64 *)wr + flits;
 
	len = immediate ? skb->len : 0;
-	len += sizeof(*cpl);
	ssi = skb_shinfo(skb);
	if (ssi->gso_size) {
struct cpl_tx_pkt_lso *lso = (void *)wr;
@@ -1079,6 +1078,7 @@ out_free: dev_kfree_skb(skb);
		q->tso++;
		q->tx_cso += ssi->gso_segs;
} else {
+   len += sizeof(*cpl);
		wr->op_immdlen = htonl(FW_WR_OP(FW_ETH_TX_PKT_WR) |
   FW_WR_IMMDLEN(len));
cpl = (void *)(wr + 1);
-- 
1.8.4



[PATCHv6 net-next 14/31] iw_cxgb4: default peer2peer mode to 1

2014-03-12 Thread Hariprasad Shenai
From: Steve Wise sw...@opengridcomputing.com

Signed-off-by: Steve Wise sw...@opengridcomputing.com
---
 drivers/infiniband/hw/cxgb4/cm.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/infiniband/hw/cxgb4/cm.c b/drivers/infiniband/hw/cxgb4/cm.c
index 3629b52..52fe177 100644
--- a/drivers/infiniband/hw/cxgb4/cm.c
+++ b/drivers/infiniband/hw/cxgb4/cm.c
@@ -98,9 +98,9 @@ int c4iw_debug;
 module_param(c4iw_debug, int, 0644);
 MODULE_PARM_DESC(c4iw_debug, "Enable debug logging (default=0)");
 
-static int peer2peer;
+static int peer2peer = 1;
 module_param(peer2peer, int, 0644);
-MODULE_PARM_DESC(peer2peer, "Support peer2peer ULPs (default=0)");
+MODULE_PARM_DESC(peer2peer, "Support peer2peer ULPs (default=1)");
 
 static int p2p_type = FW_RI_INIT_P2PTYPE_READ_REQ;
 module_param(p2p_type, int, 0644);
-- 
1.8.4



[PATCHv6 net-next 30/31] iw_cxgb4: Max fastreg depth depends on DSGL support

2014-03-12 Thread Hariprasad Shenai
From: Steve Wise sw...@opengridcomputing.com

The max depth of a fastreg MR depends on whether the device supports DSGL
or not, so compute it dynamically based on the device support and the
module use_dsgl option.
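
For context, a ULP that builds fast-reg work requests is expected to
clamp its page-list depth against whatever the device advertises.  A
rough sketch of that consumer side (illustration only, against the
current ib_verbs API; the variable names here are invented):

	struct ib_device_attr attr;
	int depth, ret;

	ret = ib_query_device(device, &attr);
	if (ret)
		return ret;

	/* Never build a fast-reg WR deeper than the device allows. */
	depth = min_t(int, wanted_pages, attr.max_fast_reg_page_list_len);
	page_list = ib_alloc_fast_reg_page_list(device, depth);
	if (IS_ERR(page_list))
		return PTR_ERR(page_list);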

Signed-off-by: Steve Wise sw...@opengridcomputing.com
---
 drivers/infiniband/hw/cxgb4/provider.c | 2 +-
 drivers/infiniband/hw/cxgb4/qp.c   | 3 ++-
 drivers/infiniband/hw/cxgb4/t4.h   | 9 -
 3 files changed, 11 insertions(+), 3 deletions(-)

diff --git a/drivers/infiniband/hw/cxgb4/provider.c b/drivers/infiniband/hw/cxgb4/provider.c
index d1565a4..9e1a409 100644
--- a/drivers/infiniband/hw/cxgb4/provider.c
+++ b/drivers/infiniband/hw/cxgb4/provider.c
@@ -327,7 +327,7 @@ static int c4iw_query_device(struct ib_device *ibdev,
	props->max_mr = c4iw_num_stags(&dev->rdev);
	props->max_pd = T4_MAX_NUM_PD;
	props->local_ca_ack_delay = 0;
-	props->max_fast_reg_page_list_len = T4_MAX_FR_DEPTH;
+	props->max_fast_reg_page_list_len = t4_max_fr_depth(use_dsgl);
 
return 0;
 }
diff --git a/drivers/infiniband/hw/cxgb4/qp.c b/drivers/infiniband/hw/cxgb4/qp.c
index dfe37fc..9f9cdc5 100644
--- a/drivers/infiniband/hw/cxgb4/qp.c
+++ b/drivers/infiniband/hw/cxgb4/qp.c
@@ -560,7 +560,8 @@ static int build_fastreg(struct t4_sq *sq, union t4_wr *wqe,
	int pbllen = roundup(wr->wr.fast_reg.page_list_len * sizeof(u64), 32);
	int rem;

-	if (wr->wr.fast_reg.page_list_len > T4_MAX_FR_DEPTH)
+	if (wr->wr.fast_reg.page_list_len >
+	    t4_max_fr_depth(use_dsgl))
return -EINVAL;
 
	wqe->fr.qpbinde_to_dcacpu = 0;
diff --git a/drivers/infiniband/hw/cxgb4/t4.h b/drivers/infiniband/hw/cxgb4/t4.h
index ace3154..1543d6b 100644
--- a/drivers/infiniband/hw/cxgb4/t4.h
+++ b/drivers/infiniband/hw/cxgb4/t4.h
@@ -84,7 +84,14 @@ struct t4_status_page {
sizeof(struct fw_ri_isgl)) / sizeof(struct fw_ri_sge))
 #define T4_MAX_FR_IMMD ((T4_SQ_NUM_BYTES - sizeof(struct fw_ri_fr_nsmr_wr) - \
			sizeof(struct fw_ri_immd)) & ~31UL)
-#define T4_MAX_FR_DEPTH (1024 / sizeof(u64))
+#define T4_MAX_FR_IMMD_DEPTH (T4_MAX_FR_IMMD / sizeof(u64))
+#define T4_MAX_FR_DSGL 1024
+#define T4_MAX_FR_DSGL_DEPTH (T4_MAX_FR_DSGL / sizeof(u64))
+
+static inline int t4_max_fr_depth(int use_dsgl)
+{
+   return use_dsgl ? T4_MAX_FR_DSGL_DEPTH : T4_MAX_FR_IMMD_DEPTH;
+}
 
 #define T4_RQ_NUM_SLOTS 2
 #define T4_RQ_NUM_BYTES (T4_EQ_ENTRY_SIZE * T4_RQ_NUM_SLOTS)
-- 
1.8.4



[PATCHv6 net-next 11/31] iw_cxgb4: use the BAR2/WC path for kernel QPs and T5 devices

2014-03-12 Thread Hariprasad Shenai
From: Steve Wise sw...@opengridcomputing.com

Signed-off-by: Steve Wise sw...@opengridcomputing.com
---
 drivers/infiniband/hw/cxgb4/device.c   | 41 +-
 drivers/infiniband/hw/cxgb4/iw_cxgb4.h |  2 ++
 drivers/infiniband/hw/cxgb4/qp.c   | 59 +---
 drivers/infiniband/hw/cxgb4/t4.h   | 62 +++---
 4 files changed, 132 insertions(+), 32 deletions(-)

diff --git a/drivers/infiniband/hw/cxgb4/device.c b/drivers/infiniband/hw/cxgb4/device.c
index f485849..2cf2a06 100644
--- a/drivers/infiniband/hw/cxgb4/device.c
+++ b/drivers/infiniband/hw/cxgb4/device.c
@@ -685,7 +685,10 @@ static void c4iw_dealloc(struct uld_ctx *ctx)
	idr_destroy(&ctx->dev->hwtid_idr);
	idr_destroy(&ctx->dev->stid_idr);
	idr_destroy(&ctx->dev->atid_idr);
-	iounmap(ctx->dev->rdev.oc_mw_kva);
+	if (ctx->dev->rdev.bar2_kva)
+		iounmap(ctx->dev->rdev.bar2_kva);
+	if (ctx->dev->rdev.oc_mw_kva)
+		iounmap(ctx->dev->rdev.oc_mw_kva);
	ib_dealloc_device(&ctx->dev->ibdev);
	ctx->dev = NULL;
 }
@@ -725,11 +728,31 @@ static struct c4iw_dev *c4iw_alloc(const struct cxgb4_lld_info *infop)
	}
	devp->rdev.lldi = *infop;

-	devp->rdev.oc_mw_pa = pci_resource_start(devp->rdev.lldi.pdev, 2) +
-		(pci_resource_len(devp->rdev.lldi.pdev, 2) -
-		 roundup_pow_of_two(devp->rdev.lldi.vr->ocq.size));
-	devp->rdev.oc_mw_kva = ioremap_wc(devp->rdev.oc_mw_pa,
-					  devp->rdev.lldi.vr->ocq.size);
+	/*
+	 * For T5 devices, we map all of BAR2 with WC.
+	 * For T4 devices with onchip qp mem, we map only that part
+	 * of BAR2 with WC.
+	 */
+	devp->rdev.bar2_pa = pci_resource_start(devp->rdev.lldi.pdev, 2);
+	if (is_t5(devp->rdev.lldi.adapter_type)) {
+		devp->rdev.bar2_kva = ioremap_wc(devp->rdev.bar2_pa,
+			pci_resource_len(devp->rdev.lldi.pdev, 2));
+		if (!devp->rdev.bar2_kva) {
+			pr_err(MOD "Unable to ioremap BAR2\n");
+			return ERR_PTR(-EINVAL);
+		}
+	} else if (ocqp_supported(infop)) {
+		devp->rdev.oc_mw_pa =
+			pci_resource_start(devp->rdev.lldi.pdev, 2) +
+			pci_resource_len(devp->rdev.lldi.pdev, 2) -
+			roundup_pow_of_two(devp->rdev.lldi.vr->ocq.size);
+		devp->rdev.oc_mw_kva = ioremap_wc(devp->rdev.oc_mw_pa,
+			devp->rdev.lldi.vr->ocq.size);
+		if (!devp->rdev.oc_mw_kva) {
+			pr_err(MOD "Unable to ioremap onchip mem\n");
+			return ERR_PTR(-EINVAL);
+		}
+	}

	PDBG(KERN_INFO MOD "ocq memory: "
	     "hw_start 0x%x size %u mw_pa 0x%lx mw_kva %p\n",
@@ -1004,9 +1027,11 @@ static int enable_qp_db(int id, void *p, void *data)
 static void resume_rc_qp(struct c4iw_qp *qp)
 {
	spin_lock(&qp->lock);
-	t4_ring_sq_db(&qp->wq, qp->wq.sq.wq_pidx_inc);
+	t4_ring_sq_db(&qp->wq, qp->wq.sq.wq_pidx_inc,
+		      is_t5(qp->rhp->rdev.lldi.adapter_type), NULL);
	qp->wq.sq.wq_pidx_inc = 0;
-	t4_ring_rq_db(&qp->wq, qp->wq.rq.wq_pidx_inc);
+	t4_ring_rq_db(&qp->wq, qp->wq.rq.wq_pidx_inc,
+		      is_t5(qp->rhp->rdev.lldi.adapter_type), NULL);
	qp->wq.rq.wq_pidx_inc = 0;
	spin_unlock(&qp->lock);
 }
diff --git a/drivers/infiniband/hw/cxgb4/iw_cxgb4.h b/drivers/infiniband/hw/cxgb4/iw_cxgb4.h
index e9ecbfa..c05c875 100644
--- a/drivers/infiniband/hw/cxgb4/iw_cxgb4.h
+++ b/drivers/infiniband/hw/cxgb4/iw_cxgb4.h
@@ -149,6 +149,8 @@ struct c4iw_rdev {
struct gen_pool *ocqp_pool;
u32 flags;
struct cxgb4_lld_info lldi;
+   unsigned long bar2_pa;
+   void __iomem *bar2_kva;
unsigned long oc_mw_pa;
void __iomem *oc_mw_kva;
struct c4iw_stats stats;
diff --git a/drivers/infiniband/hw/cxgb4/qp.c b/drivers/infiniband/hw/cxgb4/qp.c
index 2b0d994..7aee163 100644
--- a/drivers/infiniband/hw/cxgb4/qp.c
+++ b/drivers/infiniband/hw/cxgb4/qp.c
@@ -46,6 +46,10 @@ static int max_fr_immd = T4_MAX_FR_IMMD;
 module_param(max_fr_immd, int, 0644);
 MODULE_PARM_DESC(max_fr_immd, "fastreg threshold for using DSGL instead of immedate");
 
+int t5_en_wc = 1;
+module_param(t5_en_wc, int, 0644);
+MODULE_PARM_DESC(t5_en_wc, "Use BAR2/WC path for kernel users (default 1)");
+
 static void set_state(struct c4iw_qp *qhp, enum c4iw_qp_state state)
 {
unsigned long flag;
@@ -200,13 +204,23 @@ static int create_qp(struct c4iw_rdev *rdev, struct t4_wq *wq,

	wq->db = rdev->lldi.db_reg;
	wq->gts = rdev->lldi.gts_reg;
-	if (user) {
-		wq->sq.udb = (u64)pci_resource_start(rdev->lldi.pdev, 2) +
-			(wq->sq.qid << rdev->qpshift);
-		wq->sq.udb &= PAGE_MASK;
-		wq->rq.udb = 

[PATCHv6 net-next 24/31] iw_cxgb4: rx_data() needs to hold the ep mutex

2014-03-12 Thread Hariprasad Shenai
From: Steve Wise sw...@opengridcomputing.com

To avoid racing with other threads doing close/flush/whatever, rx_data()
should hold the endpoint mutex.

Signed-off-by: Steve Wise sw...@opengridcomputing.com
---
 drivers/infiniband/hw/cxgb4/cm.c | 16 +---
 1 file changed, 9 insertions(+), 7 deletions(-)

diff --git a/drivers/infiniband/hw/cxgb4/cm.c b/drivers/infiniband/hw/cxgb4/cm.c
index f62801a..c3267b5 100644
--- a/drivers/infiniband/hw/cxgb4/cm.c
+++ b/drivers/infiniband/hw/cxgb4/cm.c
@@ -1193,7 +1193,7 @@ static void process_mpa_reply(struct c4iw_ep *ep, struct sk_buff *skb)
	 * the connection.
	 */
	stop_ep_timer(ep);
-	if (state_read(&ep->com) != MPA_REQ_SENT)
+	if (ep->com.state != MPA_REQ_SENT)
return;
 
/*
@@ -1268,7 +1268,7 @@ static void process_mpa_reply(struct c4iw_ep *ep, struct sk_buff *skb)
	 * start reply message including private data. And
	 * the MPA header is valid.
	 */
-	state_set(&ep->com, FPDU_MODE);
+	__state_set(&ep->com, FPDU_MODE);
	ep->mpa_attr.crc_enabled = (mpa->flags & MPA_CRC) | crc_enabled ? 1 : 0;
	ep->mpa_attr.recv_marker_enabled = markers_enabled;
	ep->mpa_attr.xmit_marker_enabled = mpa->flags & MPA_MARKERS ? 1 : 0;
@@ -1383,7 +1383,7 @@ static void process_mpa_reply(struct c4iw_ep *ep, struct sk_buff *skb)
	}
	goto out;
 err:
-	state_set(&ep->com, ABORTING);
+	__state_set(&ep->com, ABORTING);
send_abort(ep, skb, GFP_KERNEL);
 out:
connect_reply_upcall(ep, err);
@@ -1398,7 +1398,7 @@ static void process_mpa_request(struct c4iw_ep *ep, struct sk_buff *skb)

	PDBG("%s ep %p tid %u\n", __func__, ep, ep->hwtid);

-	if (state_read(&ep->com) != MPA_REQ_WAIT)
+	if (ep->com.state != MPA_REQ_WAIT)
return;
 
/*
@@ -1519,7 +1519,7 @@ static void process_mpa_request(struct c4iw_ep *ep, struct sk_buff *skb)
	     ep->mpa_attr.xmit_marker_enabled, ep->mpa_attr.version,
	     ep->mpa_attr.p2p_type);

-	state_set(&ep->com, MPA_REQ_RCVD);
+	__state_set(&ep->com, MPA_REQ_RCVD);
stop_ep_timer(ep);
 
/* drive upcall */
@@ -1549,11 +1549,12 @@ static int rx_data(struct c4iw_dev *dev, struct sk_buff *skb)
	PDBG("%s ep %p tid %u dlen %u\n", __func__, ep, ep->hwtid, dlen);
	skb_pull(skb, sizeof(*hdr));
	skb_trim(skb, dlen);
+	mutex_lock(&ep->com.mutex);

	/* update RX credits */
	update_rx_credits(ep, dlen);

-	switch (state_read(&ep->com)) {
+	switch (ep->com.state) {
case MPA_REQ_SENT:
		ep->rcv_seq += dlen;
process_mpa_reply(ep, skb);
@@ -1569,7 +1570,7 @@ static int rx_data(struct c4iw_dev *dev, struct sk_buff *skb)
		pr_err("%s Unexpected streaming data." \
		       " qpid %u ep %p state %d tid %u status %d\n",
		       __func__, ep->com.qp->wq.sq.qid, ep,
-		       state_read(&ep->com), ep->hwtid, status);
+		       ep->com.state, ep->hwtid, status);
		attrs.next_state = C4IW_QP_STATE_TERMINATE;
		c4iw_modify_qp(ep->com.qp->rhp, ep->com.qp,
			       C4IW_QP_ATTR_NEXT_STATE, &attrs, 0);
@@ -1578,6 +1579,7 @@ static int rx_data(struct c4iw_dev *dev, struct sk_buff *skb)
	default:
		break;
	}
+	mutex_unlock(&ep->com.mutex);
return 0;
 }
 
-- 
1.8.4



[PATCHv6 net-next 17/31] iw_cxgb4: fix possible memory leak in RX_PKT processing

2014-03-12 Thread Hariprasad Shenai
From: Steve Wise sw...@opengridcomputing.com

If cxgb4_ofld_send() returns  0, then send_fw_pass_open_req() must
free the request skb and the saved skb with the tcp header.

Signed-off-by: Steve Wise sw...@opengridcomputing.com
---
 drivers/infiniband/hw/cxgb4/cm.c | 9 -
 1 file changed, 8 insertions(+), 1 deletion(-)

diff --git a/drivers/infiniband/hw/cxgb4/cm.c b/drivers/infiniband/hw/cxgb4/cm.c
index 52fe177..db8dfdf 100644
--- a/drivers/infiniband/hw/cxgb4/cm.c
+++ b/drivers/infiniband/hw/cxgb4/cm.c
@@ -3204,6 +3204,7 @@ static void send_fw_pass_open_req(struct c4iw_dev *dev, struct sk_buff *skb,
struct sk_buff *req_skb;
struct fw_ofld_connection_wr *req;
struct cpl_pass_accept_req *cpl = cplhdr(skb);
+   int ret;
 
req_skb = alloc_skb(sizeof(struct fw_ofld_connection_wr), GFP_KERNEL);
req = (struct fw_ofld_connection_wr *)__skb_put(req_skb, sizeof(*req));
@@ -3240,7 +3241,13 @@ static void send_fw_pass_open_req(struct c4iw_dev *dev, struct sk_buff *skb,
	req->cookie = (unsigned long)skb;

	set_wr_txq(req_skb, CPL_PRIORITY_CONTROL, port_id);
-	cxgb4_ofld_send(dev->rdev.lldi.ports[0], req_skb);
+	ret = cxgb4_ofld_send(dev->rdev.lldi.ports[0], req_skb);
+	if (ret < 0) {
+		pr_err("%s - cxgb4_ofld_send error %d - dropping\n", __func__,
+		       ret);
+		kfree_skb(skb);
+		kfree_skb(req_skb);
+	}
 }
 
 /*
-- 
1.8.4



[PATCHv6 net-next 28/31] iw_cxgb4: SQ flush fix

2014-03-12 Thread Hariprasad Shenai
From: Steve Wise sw...@opengridcomputing.com

There is a race when moving a QP from RTS->CLOSING where an SQ work
request could be posted after the FW receives the RDMA_RI/FINI WR.
The SQ work request will never get processed, and should be completed
with FLUSHED status.  Function c4iw_flush_sq(), however was dropping
the oldest SQ work request when in CLOSING or IDLE states, instead of
completing the pending work request. If that oldest pending work request
was actually complete and has a CQE in the CQ, then when that CQE is
processed in poll_cq, we'll BUG_ON() due to the inconsistent SQ/CQ state.

This is a very small timing hole and has only been hit once so far.

The fix is two-fold:

1) c4iw_flush_sq() MUST always flush all non-completed WRs with FLUSHED
status regardless of the QP state.

2) In c4iw_modify_rc_qp(), always set the in error bit on the queue
before moving the state out of RTS.  This ensures that the state
transition will not happen while another thread is in post_rc_send(),
because set_state() and post_rc_send() both acquire the qp spinlock.
Also, once we transition the state out of RTS, subsequent calls to
post_rc_send() will fail because the in error bit is set.  I don't
think this fully closes the race where the FW can get a FINI followed by a
SQ work request being posted (because they are posted to different EQs),
but the #1 fix will handle the issue by flushing the SQ work request.

Signed-off-by: Steve Wise sw...@opengridcomputing.com
---
 drivers/infiniband/hw/cxgb4/cq.c | 22 --
 drivers/infiniband/hw/cxgb4/qp.c |  6 +++---
 2 files changed, 11 insertions(+), 17 deletions(-)

diff --git a/drivers/infiniband/hw/cxgb4/cq.c b/drivers/infiniband/hw/cxgb4/cq.c
index 55f7157..121ad07 100644
--- a/drivers/infiniband/hw/cxgb4/cq.c
+++ b/drivers/infiniband/hw/cxgb4/cq.c
@@ -235,27 +235,21 @@ int c4iw_flush_sq(struct c4iw_qp *qhp)
	struct t4_cq *cq = &chp->cq;
	int idx;
	struct t4_swsqe *swsqe;
-	int error = (qhp->attr.state != C4IW_QP_STATE_CLOSING &&
-			qhp->attr.state != C4IW_QP_STATE_IDLE);

	if (wq->sq.flush_cidx == -1)
		wq->sq.flush_cidx = wq->sq.cidx;
	idx = wq->sq.flush_cidx;
	BUG_ON(idx >= wq->sq.size);
	while (idx != wq->sq.pidx) {
-		if (error) {
-			swsqe = &wq->sq.sw_sq[idx];
-			BUG_ON(swsqe->flushed);
-			swsqe->flushed = 1;
-			insert_sq_cqe(wq, cq, swsqe);
-			if (wq->sq.oldest_read == swsqe) {
-				BUG_ON(swsqe->opcode != FW_RI_READ_REQ);
-				advance_oldest_read(wq);
-			}
-			flushed++;
-		} else {
-			t4_sq_consume(wq);
+		swsqe = &wq->sq.sw_sq[idx];
+		BUG_ON(swsqe->flushed);
+		swsqe->flushed = 1;
+		insert_sq_cqe(wq, cq, swsqe);
+		if (wq->sq.oldest_read == swsqe) {
+			BUG_ON(swsqe->opcode != FW_RI_READ_REQ);
+			advance_oldest_read(wq);
		}
+		flushed++;
		if (++idx == wq->sq.size)
idx = 0;
}
diff --git a/drivers/infiniband/hw/cxgb4/qp.c b/drivers/infiniband/hw/cxgb4/qp.c
index 4c70df9..141999d 100644
--- a/drivers/infiniband/hw/cxgb4/qp.c
+++ b/drivers/infiniband/hw/cxgb4/qp.c
@@ -1359,6 +1359,7 @@ int c4iw_modify_qp(struct c4iw_dev *rhp, struct c4iw_qp *qhp,
	switch (attrs->next_state) {
	case C4IW_QP_STATE_CLOSING:
		BUG_ON(atomic_read(&qhp->ep->com.kref.refcount) < 2);
+		t4_set_wq_in_error(&qhp->wq);
		set_state(qhp, C4IW_QP_STATE_CLOSING);
		ep = qhp->ep;
if (!internal) {
@@ -1366,16 +1367,15 @@ int c4iw_modify_qp(struct c4iw_dev *rhp, struct c4iw_qp *qhp,
			disconnect = 1;
			c4iw_get_ep(&qhp->ep->com);
		}
-		t4_set_wq_in_error(&qhp->wq);
ret = rdma_fini(rhp, qhp, ep);
if (ret)
goto err;
break;
case C4IW_QP_STATE_TERMINATE:
+		t4_set_wq_in_error(&qhp->wq);
		set_state(qhp, C4IW_QP_STATE_TERMINATE);
		qhp->attr.layer_etype = attrs->layer_etype;
		qhp->attr.ecode = attrs->ecode;
-		t4_set_wq_in_error(&qhp->wq);
		ep = qhp->ep;
disconnect = 1;
if (!internal)
@@ -1388,8 +1388,8 @@ int c4iw_modify_qp(struct c4iw_dev *rhp, struct c4iw_qp *qhp,
			c4iw_get_ep(&qhp->ep->com);
break;
case 

[PATCHv6 net-next 25/31] iw_cxgb4: endpoint timeout fixes

2014-03-12 Thread Hariprasad Shenai
From: Steve Wise sw...@opengridcomputing.com

1) Timed-out endpoint processing can be starved. If there is a continual
flow of CPL messages into the driver, the endpoint timeout processing
can be starved.  This condition exposed the other bugs below.

Solution: In process_work(), call process_timedout_eps() after each CPL
is processed.

2) Connection events can be processed even though the endpoint is on
the timeout list.  If the endpoint is scheduled for timeout processing,
then we must ignore MPA Start Requests and Replies.

Solution: Change stop_ep_timer() to return 1 if the ep has already been
queued for timeout processing.  All the callers of stop_ep_timer() need
to check this and act accordingly.  There are just a few cases where
the caller needs to do something different if stop_ep_timer() returns 1:

1) in process_mpa_reply(), ignore the reply and  process_timeout()
will abort the connection.

2) in process_mpa_request, ignore the request and process_timeout()
will abort the connection.

It is ok for callers of stop_ep_timer() to abort the connection since
that will leave the state in ABORTING or DEAD, and process_timeout()
now ignores timeouts when the ep is in these states.

3) Double insertion on the timeout list.  Since the endpoint timers are
used for connection setup and teardown, we need to guard against the
possibility that an endpoint is already on the timeout list.  This is
a rare condition and only seen under heavy load and in the presence of
the above 2 bugs.

Solution: In ep_timeout(), don't queue the endpoint if it is already on
the queue.
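
The guard for fix #3 is the usual "queue only if not already queued"
pattern keyed off the TIMEOUT bit.  Roughly (a sketch from memory, not
the exact patch hunk):

	static void ep_timeout(unsigned long arg)
	{
		struct c4iw_ep *ep = (struct c4iw_ep *)arg;
		int kickit = 0;

		spin_lock(&timeout_lock);
		if (!test_and_set_bit(TIMEOUT, &ep->com.flags)) {
			/* first expiry only: queue for processing */
			list_add_tail(&ep->entry, &timeout_list);
			kickit = 1;
		}
		spin_unlock(&timeout_lock);
		if (kickit)
			queue_work(workq, &skb_work);
	}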

Signed-off-by: Steve Wise sw...@opengridcomputing.com
---
 drivers/infiniband/hw/cxgb4/cm.c | 89 +---
 1 file changed, 56 insertions(+), 33 deletions(-)

diff --git a/drivers/infiniband/hw/cxgb4/cm.c b/drivers/infiniband/hw/cxgb4/cm.c
index c3267b5..e2fe4a2 100644
--- a/drivers/infiniband/hw/cxgb4/cm.c
+++ b/drivers/infiniband/hw/cxgb4/cm.c
@@ -178,12 +178,15 @@ static void start_ep_timer(struct c4iw_ep *ep)
	add_timer(&ep->timer);
 }
 
-static void stop_ep_timer(struct c4iw_ep *ep)
+static int stop_ep_timer(struct c4iw_ep *ep)
 {
	PDBG("%s ep %p stopping\n", __func__, ep);
	del_timer_sync(&ep->timer);
-	if (!test_and_set_bit(TIMEOUT, &ep->com.flags))
+	if (!test_and_set_bit(TIMEOUT, &ep->com.flags)) {
		c4iw_put_ep(&ep->com);
+   return 0;
+   }
+   return 1;
 }
 
 static int c4iw_l2t_send(struct c4iw_rdev *rdev, struct sk_buff *skb,
@@ -1188,12 +1191,11 @@ static void process_mpa_reply(struct c4iw_ep *ep, struct sk_buff *skb)
	PDBG("%s ep %p tid %u\n", __func__, ep, ep->hwtid);
 
/*
-* Stop mpa timer.  If it expired, then the state has
-* changed and we bail since ep_timeout already aborted
-* the connection.
+* Stop mpa timer.  If it expired, then
+* we ignore the MPA reply.  process_timeout()
+* will abort the connection.
 */
-	stop_ep_timer(ep);
-	if (ep->com.state != MPA_REQ_SENT)
+	if (stop_ep_timer(ep))
return;
 
/*
@@ -1398,15 +1400,12 @@ static void process_mpa_request(struct c4iw_ep *ep, struct sk_buff *skb)

	PDBG("%s ep %p tid %u\n", __func__, ep, ep->hwtid);

-	if (ep->com.state != MPA_REQ_WAIT)
-   return;
-
/*
 * If we get more than the supported amount of private data
 * then we must fail this connection.
 */
	if (ep->mpa_pkt_len + skb->len > sizeof(ep->mpa_pkt)) {
-   stop_ep_timer(ep);
+   (void)stop_ep_timer(ep);
abort_connection(ep, skb, GFP_KERNEL);
return;
}
@@ -1436,13 +1435,13 @@ static void process_mpa_request(struct c4iw_ep *ep, struct sk_buff *skb)
	if (mpa->revision > mpa_rev) {
		printk(KERN_ERR MOD "%s MPA version mismatch. Local = %d,"
		       " Received = %d\n", __func__, mpa_rev, mpa->revision);
-   stop_ep_timer(ep);
+   (void)stop_ep_timer(ep);
abort_connection(ep, skb, GFP_KERNEL);
return;
}
 
	if (memcmp(mpa->key, MPA_KEY_REQ, sizeof(mpa->key))) {
-   stop_ep_timer(ep);
+   (void)stop_ep_timer(ep);
abort_connection(ep, skb, GFP_KERNEL);
return;
}
@@ -1453,7 +1452,7 @@ static void process_mpa_request(struct c4iw_ep *ep, struct sk_buff *skb)
 * Fail if there's too much private data.
 */
	if (plen > MPA_MAX_PRIVATE_DATA) {
-   stop_ep_timer(ep);
+   (void)stop_ep_timer(ep);
abort_connection(ep, skb, GFP_KERNEL);
return;
}
@@ -1462,7 +1461,7 @@ static void process_mpa_request(struct c4iw_ep *ep, struct sk_buff *skb)
 * If plen does not account for pkt size
 */
	if (ep->mpa_pkt_len > (sizeof(*mpa) + plen)) {
-   

[PATCHv6 net-next 26/31] iw_cxgb4: rmb() after reading valid gen bit

2014-03-12 Thread Hariprasad Shenai
From: Steve Wise sw...@opengridcomputing.com

Some HW platforms can reorder read operations, so we must rmb() after
we see a valid gen bit in a CQE but before we read any other fields
from the CQE.
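
The pairing here is: hardware writes the CQE fields and then sets the
gen (valid) bit; software must test the gen bit and then issue rmb()
before reading any other CQE field.  In outline (illustration only):

	if (t4_valid_cqe(cq, &cq->queue[cq->cidx])) {
		rmb();	/* gen-bit read must complete before CQE fields */
		*cqe = &cq->queue[cq->cidx];
		ret = 0;
	} else
		ret = -ENODATA;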

Signed-off-by: Steve Wise sw...@opengridcomputing.com
---
 drivers/infiniband/hw/cxgb4/t4.h | 1 +
 1 file changed, 1 insertion(+)

diff --git a/drivers/infiniband/hw/cxgb4/t4.h b/drivers/infiniband/hw/cxgb4/t4.h
index edab0e9..67cd09ee 100644
--- a/drivers/infiniband/hw/cxgb4/t4.h
+++ b/drivers/infiniband/hw/cxgb4/t4.h
@@ -622,6 +622,7 @@ static inline int t4_next_hw_cqe(struct t4_cq *cq, struct t4_cqe **cqe)
		printk(KERN_ERR MOD "cq overflow cqid %u\n", cq->cqid);
		BUG_ON(1);
	} else if (t4_valid_cqe(cq, &cq->queue[cq->cidx])) {
+		rmb();
		*cqe = &cq->queue[cq->cidx];
ret = 0;
} else
-- 
1.8.4



[PATCHv6 net-next 23/31] iw_cxgb4: drop RX_DATA packets if the endpoint is gone

2014-03-12 Thread Hariprasad Shenai
From: Steve Wise sw...@opengridcomputing.com

Signed-off-by: Steve Wise sw...@opengridcomputing.com
---
 drivers/infiniband/hw/cxgb4/cm.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/drivers/infiniband/hw/cxgb4/cm.c b/drivers/infiniband/hw/cxgb4/cm.c
index 9708987..f62801a 100644
--- a/drivers/infiniband/hw/cxgb4/cm.c
+++ b/drivers/infiniband/hw/cxgb4/cm.c
@@ -1544,6 +1544,8 @@ static int rx_data(struct c4iw_dev *dev, struct sk_buff *skb)
	__u8 status = hdr->status;
 
ep = lookup_tid(t, tid);
+   if (!ep)
+   return 0;
	PDBG("%s ep %p tid %u dlen %u\n", __func__, ep, ep->hwtid, dlen);
skb_pull(skb, sizeof(*hdr));
skb_trim(skb, dlen);
-- 
1.8.4



[PATCHv6 net-next 29/31] iw_cxgb4: minor fixes

2014-03-12 Thread Hariprasad Shenai
From: Steve Wise sw...@opengridcomputing.com

Added some missing debug stats.

Use uninitialized_var().

Initialize reserved fields in a FW work request.

Signed-off-by: Steve Wise sw...@opengridcomputing.com
---
 drivers/infiniband/hw/cxgb4/cq.c   | 2 +-
 drivers/infiniband/hw/cxgb4/mem.c  | 6 +-
 drivers/infiniband/hw/cxgb4/qp.c   | 2 ++
 drivers/infiniband/hw/cxgb4/resource.c | 6 +-
 4 files changed, 13 insertions(+), 3 deletions(-)

diff --git a/drivers/infiniband/hw/cxgb4/cq.c b/drivers/infiniband/hw/cxgb4/cq.c
index 121ad07..a47b845 100644
--- a/drivers/infiniband/hw/cxgb4/cq.c
+++ b/drivers/infiniband/hw/cxgb4/cq.c
@@ -672,7 +672,7 @@ skip_cqe:
 static int c4iw_poll_cq_one(struct c4iw_cq *chp, struct ib_wc *wc)
 {
struct c4iw_qp *qhp = NULL;
-   struct t4_cqe cqe = {0, 0}, *rd_cqe;
+   struct t4_cqe uninitialized_var(cqe), *rd_cqe;
struct t4_wq *wq;
u32 credit = 0;
u8 cqe_flushed;
diff --git a/drivers/infiniband/hw/cxgb4/mem.c b/drivers/infiniband/hw/cxgb4/mem.c
index 22a2e3e..0989871a 100644
--- a/drivers/infiniband/hw/cxgb4/mem.c
+++ b/drivers/infiniband/hw/cxgb4/mem.c
@@ -259,8 +259,12 @@ static int write_tpt_entry(struct c4iw_rdev *rdev, u32 reset_tpt_entry,

	if ((!reset_tpt_entry) && (*stag == T4_STAG_UNSET)) {
		stag_idx = c4iw_get_resource(&rdev->resource.tpt_table);
-		if (!stag_idx)
+		if (!stag_idx) {
+			mutex_lock(&rdev->stats.lock);
+			rdev->stats.stag.fail++;
+			mutex_unlock(&rdev->stats.lock);
			return -ENOMEM;
+		}
		mutex_lock(&rdev->stats.lock);
		rdev->stats.stag.cur += 32;
		if (rdev->stats.stag.cur > rdev->stats.stag.max)
diff --git a/drivers/infiniband/hw/cxgb4/qp.c b/drivers/infiniband/hw/cxgb4/qp.c
index 141999d..dfe37fc 100644
--- a/drivers/infiniband/hw/cxgb4/qp.c
+++ b/drivers/infiniband/hw/cxgb4/qp.c
@@ -428,6 +428,8 @@ static int build_rdma_send(struct t4_sq *sq, union t4_wr *wqe,
default:
return -EINVAL;
}
+	wqe->send.r3 = 0;
+	wqe->send.r4 = 0;
 
plen = 0;
	if (wr->num_sge) {
diff --git a/drivers/infiniband/hw/cxgb4/resource.c b/drivers/infiniband/hw/cxgb4/resource.c
index cdef4d7..d9bc9ba 100644
--- a/drivers/infiniband/hw/cxgb4/resource.c
+++ b/drivers/infiniband/hw/cxgb4/resource.c
@@ -179,8 +179,12 @@ u32 c4iw_get_qpid(struct c4iw_rdev *rdev, struct c4iw_dev_ucontext *uctx)
		kfree(entry);
	} else {
		qid = c4iw_get_resource(&rdev->resource.qid_table);
-		if (!qid)
+		if (!qid) {
+			mutex_lock(&rdev->stats.lock);
+			rdev->stats.qid.fail++;
+			mutex_unlock(&rdev->stats.lock);
			goto out;
+		}
		mutex_lock(&rdev->stats.lock);
		rdev->stats.qid.cur += rdev->qpmask + 1;
		mutex_unlock(&rdev->stats.lock);
-- 
1.8.4



[PATCHv6 net-next 02/31] cxgb4: Add code to dump SGE registers when hitting idma hangs

2014-03-12 Thread Hariprasad Shenai
From: Kumar Sanghvi kuma...@chelsio.com

Based on original work by Casey Leedom lee...@chelsio.com

Signed-off-by: Kumar Sanghvi kuma...@chelsio.com
---
 drivers/net/ethernet/chelsio/cxgb4/cxgb4.h   |   1 +
 drivers/net/ethernet/chelsio/cxgb4/t4_hw.c   | 106 +++
 drivers/net/ethernet/chelsio/cxgb4/t4_regs.h |   3 +
 3 files changed, 110 insertions(+)

diff --git a/drivers/net/ethernet/chelsio/cxgb4/cxgb4.h b/drivers/net/ethernet/chelsio/cxgb4/cxgb4.h
index 944f2cb..509c976 100644
--- a/drivers/net/ethernet/chelsio/cxgb4/cxgb4.h
+++ b/drivers/net/ethernet/chelsio/cxgb4/cxgb4.h
@@ -1032,4 +1032,5 @@ void t4_db_dropped(struct adapter *adapter);
 int t4_mem_win_read_len(struct adapter *adap, u32 addr, __be32 *data, int len);
 int t4_fwaddrspace_write(struct adapter *adap, unsigned int mbox,
 u32 addr, u32 val);
+void t4_sge_decode_idma_state(struct adapter *adapter, int state);
 #endif /* __CXGB4_H__ */
diff --git a/drivers/net/ethernet/chelsio/cxgb4/t4_hw.c b/drivers/net/ethernet/chelsio/cxgb4/t4_hw.c
index d3c2a51..fb2fe65 100644
--- a/drivers/net/ethernet/chelsio/cxgb4/t4_hw.c
+++ b/drivers/net/ethernet/chelsio/cxgb4/t4_hw.c
@@ -2597,6 +2597,112 @@ int t4_mdio_wr(struct adapter *adap, unsigned int mbox, unsigned int phy_addr,
 }
 
 /**
+ * t4_sge_decode_idma_state - decode the idma state
+ * @adap: the adapter
+ * @state: the state idma is stuck in
+ */
+void t4_sge_decode_idma_state(struct adapter *adapter, int state)
+{
+	static const char * const t4_decode[] = {
+		"IDMA_IDLE",
+		"IDMA_PUSH_MORE_CPL_FIFO",
+		"IDMA_PUSH_CPL_MSG_HEADER_TO_FIFO",
+		"Not used",
+		"IDMA_PHYSADDR_SEND_PCIEHDR",
+		"IDMA_PHYSADDR_SEND_PAYLOAD_FIRST",
+		"IDMA_PHYSADDR_SEND_PAYLOAD",
+		"IDMA_SEND_FIFO_TO_IMSG",
+		"IDMA_FL_REQ_DATA_FL_PREP",
+		"IDMA_FL_REQ_DATA_FL",
+		"IDMA_FL_DROP",
+		"IDMA_FL_H_REQ_HEADER_FL",
+		"IDMA_FL_H_SEND_PCIEHDR",
+		"IDMA_FL_H_PUSH_CPL_FIFO",
+		"IDMA_FL_H_SEND_CPL",
+		"IDMA_FL_H_SEND_IP_HDR_FIRST",
+		"IDMA_FL_H_SEND_IP_HDR",
+		"IDMA_FL_H_REQ_NEXT_HEADER_FL",
+		"IDMA_FL_H_SEND_NEXT_PCIEHDR",
+		"IDMA_FL_H_SEND_IP_HDR_PADDING",
+		"IDMA_FL_D_SEND_PCIEHDR",
+		"IDMA_FL_D_SEND_CPL_AND_IP_HDR",
+		"IDMA_FL_D_REQ_NEXT_DATA_FL",
+		"IDMA_FL_SEND_PCIEHDR",
+		"IDMA_FL_PUSH_CPL_FIFO",
+		"IDMA_FL_SEND_CPL",
+		"IDMA_FL_SEND_PAYLOAD_FIRST",
+		"IDMA_FL_SEND_PAYLOAD",
+		"IDMA_FL_REQ_NEXT_DATA_FL",
+		"IDMA_FL_SEND_NEXT_PCIEHDR",
+		"IDMA_FL_SEND_PADDING",
+		"IDMA_FL_SEND_COMPLETION_TO_IMSG",
+		"IDMA_FL_SEND_FIFO_TO_IMSG",
+		"IDMA_FL_REQ_DATAFL_DONE",
+		"IDMA_FL_REQ_HEADERFL_DONE",
+	};
+	static const char * const t5_decode[] = {
+		"IDMA_IDLE",
+		"IDMA_ALMOST_IDLE",
+		"IDMA_PUSH_MORE_CPL_FIFO",
+		"IDMA_PUSH_CPL_MSG_HEADER_TO_FIFO",
+		"IDMA_SGEFLRFLUSH_SEND_PCIEHDR",
+		"IDMA_PHYSADDR_SEND_PCIEHDR",
+		"IDMA_PHYSADDR_SEND_PAYLOAD_FIRST",
+		"IDMA_PHYSADDR_SEND_PAYLOAD",
+		"IDMA_SEND_FIFO_TO_IMSG",
+		"IDMA_FL_REQ_DATA_FL",
+		"IDMA_FL_DROP",
+		"IDMA_FL_DROP_SEND_INC",
+		"IDMA_FL_H_REQ_HEADER_FL",
+		"IDMA_FL_H_SEND_PCIEHDR",
+		"IDMA_FL_H_PUSH_CPL_FIFO",
+		"IDMA_FL_H_SEND_CPL",
+		"IDMA_FL_H_SEND_IP_HDR_FIRST",
+		"IDMA_FL_H_SEND_IP_HDR",
+		"IDMA_FL_H_REQ_NEXT_HEADER_FL",
+		"IDMA_FL_H_SEND_NEXT_PCIEHDR",
+		"IDMA_FL_H_SEND_IP_HDR_PADDING",
+		"IDMA_FL_D_SEND_PCIEHDR",
+		"IDMA_FL_D_SEND_CPL_AND_IP_HDR",
+		"IDMA_FL_D_REQ_NEXT_DATA_FL",
+		"IDMA_FL_SEND_PCIEHDR",
+		"IDMA_FL_PUSH_CPL_FIFO",
+		"IDMA_FL_SEND_CPL",
+		"IDMA_FL_SEND_PAYLOAD_FIRST",
+		"IDMA_FL_SEND_PAYLOAD",
+		"IDMA_FL_REQ_NEXT_DATA_FL",
+		"IDMA_FL_SEND_NEXT_PCIEHDR",
+		"IDMA_FL_SEND_PADDING",
+		"IDMA_FL_SEND_COMPLETION_TO_IMSG",
+	};
+   static const u32 sge_regs[] = {
+   SGE_DEBUG_DATA_LOW_INDEX_2,
+   SGE_DEBUG_DATA_LOW_INDEX_3,
+   SGE_DEBUG_DATA_HIGH_INDEX_10,
+   };
+   const char **sge_idma_decode;
+   int sge_idma_decode_nstates;
+   int i;
+
+	if (is_t4(adapter->params.chip)) {
+   sge_idma_decode = (const char **)t4_decode;
+   sge_idma_decode_nstates = ARRAY_SIZE(t4_decode);
+   } else {
+   sge_idma_decode = (const char **)t5_decode;
+   sge_idma_decode_nstates = 

[PATCHv6 net-next 22/31] iw_cxgb4: lock around accept/reject downcalls

2014-03-12 Thread Hariprasad Shenai
From: Steve Wise sw...@opengridcomputing.com

There is a race between ULP threads doing an accept/reject, and the
ingress processing thread handling close/abort for the same connection.
The accept/reject path needs to hold the lock to serialize these paths.

Signed-off-by: Steve Wise sw...@opengridcomputing.com
---
 drivers/infiniband/hw/cxgb4/cm.c | 31 +--
 1 file changed, 21 insertions(+), 10 deletions(-)

diff --git a/drivers/infiniband/hw/cxgb4/cm.c b/drivers/infiniband/hw/cxgb4/cm.c
index 87bd3c8..9708987 100644
--- a/drivers/infiniband/hw/cxgb4/cm.c
+++ b/drivers/infiniband/hw/cxgb4/cm.c
@@ -775,7 +775,7 @@ static void send_mpa_req(struct c4iw_ep *ep, struct sk_buff *skb,
	ep->mpa_skb = skb;
	c4iw_l2t_send(&ep->com.dev->rdev, skb, ep->l2t);
	start_ep_timer(ep);
-	state_set(&ep->com, MPA_REQ_SENT);
+	__state_set(&ep->com, MPA_REQ_SENT);
	ep->mpa_attr.initiator = 1;
	ep->snd_seq += mpalen;
return;
@@ -941,7 +941,7 @@ static int send_mpa_reply(struct c4iw_ep *ep, const void *pdata, u8 plen)
	skb_get(skb);
	t4_set_arp_err_handler(skb, NULL, arp_failure_discard);
	ep->mpa_skb = skb;
-	state_set(&ep->com, MPA_REP_SENT);
+	__state_set(&ep->com, MPA_REP_SENT);
	ep->snd_seq += mpalen;
	return c4iw_l2t_send(&ep->com.dev->rdev, skb, ep->l2t);
 }
@@ -959,6 +959,7 @@ static int act_establish(struct c4iw_dev *dev, struct sk_buff *skb)
	PDBG("%s ep %p tid %u snd_isn %u rcv_isn %u\n", __func__, ep, tid,
	     be32_to_cpu(req->snd_isn), be32_to_cpu(req->rcv_isn));

+	mutex_lock(&ep->com.mutex);
	dst_confirm(ep->dst);
 
/* setup the hwtid for this connection */
@@ -982,7 +983,7 @@ static int act_establish(struct c4iw_dev *dev, struct sk_buff *skb)
		send_mpa_req(ep, skb, 1);
	else
		send_mpa_req(ep, skb, mpa_rev);
-
+	mutex_unlock(&ep->com.mutex);
return 0;
 }
 
@@ -2566,22 +2567,28 @@ static int fw4_ack(struct c4iw_dev *dev, struct sk_buff *skb)
 
 int c4iw_reject_cr(struct iw_cm_id *cm_id, const void *pdata, u8 pdata_len)
 {
-   int err;
+   int err = 0;
+   int disconnect = 0;
struct c4iw_ep *ep = to_ep(cm_id);
	PDBG("%s ep %p tid %u\n", __func__, ep, ep->hwtid);

-	if (state_read(&ep->com) == DEAD) {
+
+	mutex_lock(&ep->com.mutex);
+	if (ep->com.state == DEAD) {
		c4iw_put_ep(&ep->com);
		return -ECONNRESET;
	}
	set_bit(ULP_REJECT, &ep->com.history);
-	BUG_ON(state_read(&ep->com) != MPA_REQ_RCVD);
+	BUG_ON(ep->com.state != MPA_REQ_RCVD);
	if (mpa_rev == 0)
		abort_connection(ep, NULL, GFP_KERNEL);
	else {
		err = send_mpa_reject(ep, pdata, pdata_len);
-		err = c4iw_ep_disconnect(ep, 0, GFP_KERNEL);
+		disconnect = 1;
	}
+	mutex_unlock(&ep->com.mutex);
+	if (disconnect)
+		err = c4iw_ep_disconnect(ep, 0, GFP_KERNEL);
	c4iw_put_ep(&ep->com);
return 0;
 }
@@ -2596,12 +2603,14 @@ int c4iw_accept_cr(struct iw_cm_id *cm_id, struct iw_cm_conn_param *conn_param)
	struct c4iw_qp *qp = get_qhp(h, conn_param->qpn);

	PDBG("%s ep %p tid %u\n", __func__, ep, ep->hwtid);
-	if (state_read(&ep->com) == DEAD) {
+
+	mutex_lock(&ep->com.mutex);
+	if (ep->com.state == DEAD) {
		err = -ECONNRESET;
		goto err;
	}

-	BUG_ON(state_read(&ep->com) != MPA_REQ_RCVD);
+	BUG_ON(ep->com.state != MPA_REQ_RCVD);
BUG_ON(!qp);
 
	set_bit(ULP_ACCEPT, &ep->com.history);
@@ -2670,14 +2679,16 @@ int c4iw_accept_cr(struct iw_cm_id *cm_id, struct iw_cm_conn_param *conn_param)
	if (err)
		goto err1;

-	state_set(&ep->com, FPDU_MODE);
+	__state_set(&ep->com, FPDU_MODE);
	established_upcall(ep);
+	mutex_unlock(&ep->com.mutex);
	c4iw_put_ep(&ep->com);
	return 0;
 err1:
	ep->com.cm_id = NULL;
	cm_id->rem_ref(cm_id);
 err:
+	mutex_unlock(&ep->com.mutex);
	c4iw_put_ep(&ep->com);
return err;
 }
-- 
1.8.4



[PATCHv6 net-next 01/31] cxgb4: Fix some small bugs in t4_sge_init_soft() when our Page Size is 64KB

2014-03-12 Thread Hariprasad Shenai
From: Kumar Sanghvi kuma...@chelsio.com

We'd come in with SGE_FL_BUFFER_SIZE[0] and [1] both equal to 64KB and the
extant logic would flag that as an error.
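
(To make that concrete: with a 64KB PAGE_SIZE the firmware programs
fl_small_pg = fl_large_pg = 65536.  The old sanity test treated a Large
Page Buffer that is not strictly bigger than the Page Size Buffer as an
error, so t4_sge_init_soft() returned -EINVAL even though both values
are valid.  The fix instead zeroes fl_large_pg in that case and simply
skips the Large Page logic.)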

Based on original work by Casey Leedom lee...@chelsio.com

Signed-off-by: Kumar Sanghvi kuma...@chelsio.com
---
 drivers/net/ethernet/chelsio/cxgb4/sge.c | 12 ++--
 1 file changed, 10 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/chelsio/cxgb4/sge.c b/drivers/net/ethernet/chelsio/cxgb4/sge.c
index af76b25..3a2ecd8 100644
--- a/drivers/net/ethernet/chelsio/cxgb4/sge.c
+++ b/drivers/net/ethernet/chelsio/cxgb4/sge.c
@@ -2596,11 +2596,19 @@ static int t4_sge_init_soft(struct adapter *adap)
fl_small_mtu = READ_FL_BUF(RX_SMALL_MTU_BUF);
fl_large_mtu = READ_FL_BUF(RX_LARGE_MTU_BUF);
 
+   /* We only bother using the Large Page logic if the Large Page Buffer
+* is larger than our Page Size Buffer.
+*/
+	if (fl_large_pg <= fl_small_pg)
+   fl_large_pg = 0;
+
#undef READ_FL_BUF
 
+   /* The Page Size Buffer must be exactly equal to our Page Size and the
+* Large Page Size Buffer should be 0 (per above) or a power of 2.
+*/
	if (fl_small_pg != PAGE_SIZE ||
-	    (fl_large_pg != 0 && (fl_large_pg <= fl_small_pg ||
-				  (fl_large_pg & (fl_large_pg-1)) != 0))) {
+	    (fl_large_pg & (fl_large_pg-1)) != 0) {
		dev_err(adap->pdev_dev, "bad SGE FL page buffer sizes [%d, %d]\n",
			fl_small_pg, fl_large_pg);
return -EINVAL;
-- 
1.8.4



[PATCHv6 net-next 19/31] iw_cxgb4: connect_request_upcall fixes

2014-03-12 Thread Hariprasad Shenai
From: Steve Wise sw...@opengridcomputing.com

When processing an MPA Start Request, if the listening
endpoint is DEAD, then abort the connection.

If the IWCM returns an error, then we must abort the connection and
release resources.  Also abort_connection() should not post a CLOSE
event, so clean that up too.

Signed-off-by: Steve Wise sw...@opengridcomputing.com
---
 drivers/infiniband/hw/cxgb4/cm.c | 40 
 1 file changed, 24 insertions(+), 16 deletions(-)

diff --git a/drivers/infiniband/hw/cxgb4/cm.c b/drivers/infiniband/hw/cxgb4/cm.c
index db8dfdf..a9fb73a 100644
--- a/drivers/infiniband/hw/cxgb4/cm.c
+++ b/drivers/infiniband/hw/cxgb4/cm.c
@@ -968,13 +968,14 @@ static int act_establish(struct c4iw_dev *dev, struct sk_buff *skb)
return 0;
 }
 
-static void close_complete_upcall(struct c4iw_ep *ep)
+static void close_complete_upcall(struct c4iw_ep *ep, int status)
 {
struct iw_cm_event event;
 
	PDBG("%s ep %p tid %u\n", __func__, ep, ep->hwtid);
	memset(&event, 0, sizeof(event));
	event.event = IW_CM_EVENT_CLOSE;
+	event.status = status;
	if (ep->com.cm_id) {
		PDBG("close complete delivered ep %p cm_id %p tid %u\n",
		     ep, ep->com.cm_id, ep->hwtid);
@@ -988,7 +989,6 @@ static void close_complete_upcall(struct c4iw_ep *ep)
 static int abort_connection(struct c4iw_ep *ep, struct sk_buff *skb, gfp_t gfp)
 {
	PDBG("%s ep %p tid %u\n", __func__, ep, ep->hwtid);
-	close_complete_upcall(ep);
	state_set(&ep->com, ABORTING);
	set_bit(ABORT_CONN, &ep->com.history);
return send_abort(ep, skb, gfp);
@@ -1067,9 +1067,10 @@ static void connect_reply_upcall(struct c4iw_ep *ep, int status)
}
 }
 
-static void connect_request_upcall(struct c4iw_ep *ep)
+static int connect_request_upcall(struct c4iw_ep *ep)
 {
struct iw_cm_event event;
+   int ret;
 
	PDBG("%s ep %p tid %u\n", __func__, ep, ep->hwtid);
	memset(&event, 0, sizeof(event));
@@ -1094,15 +1095,14 @@ static void connect_request_upcall(struct c4iw_ep *ep)
		event.private_data_len = ep->plen;
		event.private_data = ep->mpa_pkt + sizeof(struct mpa_message);
}
-	if (state_read(&ep->parent_ep->com) != DEAD) {
-		c4iw_get_ep(&ep->com);
-		ep->parent_ep->com.cm_id->event_handler(
-						ep->parent_ep->com.cm_id,
-						&event);
-	}
+	c4iw_get_ep(&ep->com);
+	ret = ep->parent_ep->com.cm_id->event_handler(ep->parent_ep->com.cm_id,
+						      &event);
+	if (ret)
+		c4iw_put_ep(&ep->com);
	set_bit(CONNREQ_UPCALL, &ep->com.history);
	c4iw_put_ep(&ep->parent_ep->com);
-	ep->parent_ep = NULL;
+   return ret;
 }
 
 static void established_upcall(struct c4iw_ep *ep)
@@ -1401,7 +1401,6 @@ static void process_mpa_request(struct c4iw_ep *ep, struct sk_buff *skb)
		return;

	PDBG("%s enter (%s line %u)\n", __func__, __FILE__, __LINE__);
-	stop_ep_timer(ep);
	mpa = (struct mpa_message *) ep->mpa_pkt;
 
/*
@@ -1494,9 +1493,17 @@ static void process_mpa_request(struct c4iw_ep *ep, struct sk_buff *skb)
	     ep->mpa_attr.p2p_type);

	state_set(&ep->com, MPA_REQ_RCVD);
+   stop_ep_timer(ep);
 
/* drive upcall */
-	connect_request_upcall(ep);
+	mutex_lock(&ep->parent_ep->com.mutex);
+	if (ep->parent_ep->com.state != DEAD) {
+		if (connect_request_upcall(ep))
+			abort_connection(ep, skb, GFP_KERNEL);
+	} else {
+		abort_connection(ep, skb, GFP_KERNEL);
+	}
+	mutex_unlock(&ep->parent_ep->com.mutex);
return;
 }
 
@@ -2256,7 +2263,7 @@ static int peer_close(struct c4iw_dev *dev, struct sk_buff *skb)
		c4iw_modify_qp(ep->com.qp->rhp, ep->com.qp,
			       C4IW_QP_ATTR_NEXT_STATE, &attrs, 1);
	}
}
-   close_complete_upcall(ep);
+   close_complete_upcall(ep, 0);
	__state_set(&ep->com, DEAD);
release = 1;
disconnect = 0;
@@ -2426,7 +2433,7 @@ static int close_con_rpl(struct c4iw_dev *dev, struct sk_buff *skb)
					 C4IW_QP_ATTR_NEXT_STATE,
					 &attrs, 1);
}
-   close_complete_upcall(ep);
+   close_complete_upcall(ep, 0);
	__state_set(&ep->com, DEAD);
release = 1;
break;
@@ -2981,7 +2988,7 @@ int c4iw_ep_disconnect(struct c4iw_ep *ep, int abrupt, gfp_t gfp)
	rdev = &ep->com.dev->rdev;
if (c4iw_fatal_error(rdev)) {
fatal = 1;
-   close_complete_upcall(ep);
+   close_complete_upcall(ep, -EIO);

Re: IB/mlx4: Build the port IBoE GID table properly under bonding

2014-03-12 Thread Moni Shoua

On 3/12/2014 12:17 PM, Bart Van Assche wrote:

On 02/18/14 15:32, Moni Shoua wrote:

Ha ha. Take another look. That's what I was just explaining about! :) On
line 1743 when curr_master is non-NULL then Smatch doesn't complain
because it understands about the relationship between curr_master and
curr_netdev. But here it is complaining about line 1749 where
curr_master is NULL so the implication doesn't apply. Nice, huh?
regards, dan carpenter

You're right :)
I'll write the fix.

Hello Moni,

Have you already had a chance to look further into this issue ?

Thanks,

Bart.


Hi
Yes, a patch is waiting for internal review before submission.
I hope it will be out soon.



Re: [PATCH] IB/core: Suppress a sparse warning

2014-03-12 Thread Moni Shoua

On 3/12/2014 12:15 PM, Bart Van Assche wrote:

On 03/10/14 17:08, Paul E. McKenney wrote:

On Mon, Mar 10, 2014 at 04:02:13PM +0100, Yann Droneaud wrote:

Hi,

Le lundi 10 mars 2014 à 15:26 +0100, Bart Van Assche a écrit :

On 03/10/14 14:33, Yann Droneaud wrote:

Le lundi 10 mars 2014 à 13:22 +0100, Bart Van Assche a écrit :

Suppress the following sparse warning:
include/rdma/ib_addr.h:187:24: warning: cast removes address space of expression

You should explain why there's a warning here, and why is it safe to
cast. (I believe it's related to RCU domain ?)

Hello Yann,

Now that I've had a closer look at the code in include/rdma/ib_addr.h,
that code probably isn't safe. How about the (untested) patch below ?


Thanks for investigating.

I'm not an expert in RCU, but I believe it then miss the RCU annotations
around the RCU reader section (ensure correct ordering if I recall
correctly).

Cc: Paul E. McKenney paul...@linux.vnet.ibm.com

If the rcu_read_lock() isn't supplied by all callers to this function,
then yes, it needs to be supplied as Yann shows below.

The CONFIG_PROVE_RCU=y Kconfig option can help determine that they are
needed, but of course cannot prove that they are not needed, at least
not unless you have a workload that exercises all possible calls to
this function.

Hello Moni,

I think this warning got introduced via commit IB/cma: IBoE (RoCE)
IP-based GID addressing (7b85627b9f02f9b0fb2ef5f021807f4251135857;
December 12, 2013). Can you follow this up further ?

Thanks,

Bart.

Sure. I'll look into it.


Re: [Patch 1/2] IB/mlx5: Implementation of PCI error handler

2014-03-12 Thread Ben Hutchings
On Tue, 2014-03-11 at 22:42 -0500, cls...@linux.vnet.ibm.com wrote:
[...]
 Index: b/include/linux/mlx5/driver.h
 ===
 --- a/include/linux/mlx5/driver.h
 +++ b/include/linux/mlx5/driver.h
 @@ -51,10 +51,10 @@ enum {
  };
  
  enum {
 -   /* one minute for the sake of bringup. Generally, commands must always
 +   /* 10 msecs for the sake of bringup. Generally, commands must always
  * complete and we may need to increase this timeout value
  */
 -   MLX5_CMD_TIMEOUT_MSEC   = 7200 * 1000,
 +   MLX5_CMD_TIMEOUT_MSEC   = 10 * 1000,

You seem to be changing the timeout from 2 hours (not one minute) to 10
seconds (not milliseconds).
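
(Checking the arithmetic: the field is in milliseconds, so
7200 * 1000 msec = 7200 sec = 2 hours, while 10 * 1000 msec = 10 sec.)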

Ben.

 MLX5_CMD_WQ_MAX_NAME= 32,
  };
  
 

-- 
Ben Hutchings
Any sufficiently advanced bug is indistinguishable from a feature.




Re: [PATCHv6 net-next 20/31] iw_cxgb4: adjust tcp snd/rcv window based on link speed

2014-03-12 Thread David Miller
From: Hariprasad Shenai haripra...@chelsio.com
Date: Wed, 12 Mar 2014 21:20:35 +0530

 Added module option named adjust_win, defaulted to 1, that allows
 disabling the 40G window bump.  This allows a user to specify the exact
 default window sizes via module options snd_win and rcv_win.

This is terrible.  As is the existing other TCP tweaking module
parameters.

You can just use the TCP settings the kernel already provides for
the real TCP stack.


Re: [PATCHv6 net-next 00/31] Misc. fixes for cxgb4 and iw_cxgb4

2014-03-12 Thread David Miller
From: Hariprasad Shenai haripra...@chelsio.com
Date: Wed, 12 Mar 2014 21:20:15 +0530

 V6:
In patch 8/31, move the existing neigh_release() call right before the
if(!e) test, that way you don't need a completely new label and code block
to fix this bug - thanks to review by David Miller
In patch 15/31, use %pad to print dma_addr - thanks to review by Joe 
 Perches
In patch 10/31, add the STOPPED state string to db_state_str - thanks
to review by Steve Wise
In patch 10/31, t4_db_dropped() needs to disable dbs and send DB_FULL
event to iw_cxgb4 - thanks to review by Steve Wise
 V5:
Dropped patch cxgb4: use spinlock_irqsave/spinlock_irqrestore for db 
 lock.

I do not see the spinlock patch reinstated; part of it was correct and
fixed a real bug.  For the second time, I only stated that parts of it
were superfluous, not all of it.

This is becoming beyond tiring.


RE: [PATCHv6 net-next 00/31] Misc. fixes for cxgb4 and iw_cxgb4

2014-03-12 Thread Steve Wise


 -Original Message-
 From: David Miller [mailto:da...@davemloft.net]
 Sent: Wednesday, March 12, 2014 2:51 PM
 To: haripra...@chelsio.com
 Cc: net...@vger.kernel.org; linux-rdma@vger.kernel.org; 
 rol...@purestorage.com;
 d...@chelsio.com; sw...@opengridcomputing.com; lee...@chelsio.com;
 sant...@chelsio.com; kuma...@chelsio.com; nirran...@chelsio.com
 Subject: Re: [PATCHv6 net-next 00/31] Misc. fixes for cxgb4 and iw_cxgb4
 
 From: Hariprasad Shenai haripra...@chelsio.com
 Date: Wed, 12 Mar 2014 21:20:15 +0530
 
  V6:
 In patch 8/31, move the existing neigh_release() call right before the
 if(!e) test, that way you don't need a completely new label and code 
  block
 to fix this bug - thanks to review by David Miller
 In patch 15/31, use %pad to print dma_addr - thanks to review by Joe 
  Perches
 In patch 10/31, add the STOPPED state string to db_state_str - thanks
 to review by Steve Wise
 In patch 10/31, t4_db_dropped() needs to disable dbs and send DB_FULL
 event to iw_cxgb4 - thanks to review by Steve Wise
  V5:
 Dropped patch cxgb4: use spinlock_irqsave/spinlock_irqrestore for db 
  lock.
 
 I do not see the spinlock patch reinstated, part of it was correct and
 fixed a real bug.  For the second time, I only stated that parts of it
 were superfluous, not all of it.
 

OK, we can reinstate this patch and remove the bits from 10/31 if that's
what you prefer.

Steve.

BTW: From my earlier reply explaining that we didn't drop the needed fixes:

The remaining changes from the removed patch are moved into patch 10/31
(Doorbell Drop Avoidance Bug Fixes).  10/31 has the driver call
disable_txq_db() from an interrupt handler, and I thought it would be
better to put all the changes to fix how the db lock is acquired into
this one patch.   



RE: [PATCHv6 net-next 20/31] iw_cxgb4: adjust tcp snd/rcv window based on link speed

2014-03-12 Thread Steve Wise
  Added module option named adjust_win, defaulted to 1, that allows
  disabling the 40G window bump.  This allows a user to specify the exact
  default window sizes via module options snd_win and rcv_win.
 
 This is terrible.  As is the existing other TCP tweaking module
 parameters.
 
 You can just use the TCP settings the kernel already provides for
 the real TCP stack.

Do you mean use sysctl_tcp_*mem, sysctl_tcp_timestamps, 
sysctl_tcp_window_scaling, etc?
I'll look into this.  



RE: [PATCHv6 net-next 20/31] iw_cxgb4: adjust tcp snd/rcv window based on link speed

2014-03-12 Thread Steve Wise
  You can just use the TCP settings the kernel already provides for
  the real TCP stack.
 
  Do you mean use sysctl_tcp_*mem, sysctl_tcp_timestamps, 
  sysctl_tcp_window_scaling,
etc?
  I'll look into this.
 
 And the socket memory limits, which we use to compute default window
 sizes.
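 (Concretely: sysctl_tcp_rmem[]/sysctl_tcp_wmem[] plus
 sysctl_rmem_max/sysctl_wmem_max; those are the limits the patch below reads.)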

How's this look (compile-tested only)?  Note I had to export some of the tcp 
limits.

diff --git a/drivers/infiniband/hw/cxgb4/cm.c b/drivers/infiniband/hw/cxgb4/cm.c
index e2fe4a2..ff95fa3 100644
--- a/drivers/infiniband/hw/cxgb4/cm.c
+++ b/drivers/infiniband/hw/cxgb4/cm.c
@@ -81,19 +81,6 @@ int c4iw_max_read_depth = 8;
 module_param(c4iw_max_read_depth, int, 0644);
 MODULE_PARM_DESC(c4iw_max_read_depth, "Per-connection max ORD/IRD (default=8)");
 
-static int enable_tcp_timestamps;
-module_param(enable_tcp_timestamps, int, 0644);
-MODULE_PARM_DESC(enable_tcp_timestamps, "Enable tcp timestamps (default=0)");
-
-static int enable_tcp_sack;
-module_param(enable_tcp_sack, int, 0644);
-MODULE_PARM_DESC(enable_tcp_sack, "Enable tcp SACK (default=0)");
-
-static int enable_tcp_window_scaling = 1;
-module_param(enable_tcp_window_scaling, int, 0644);
-MODULE_PARM_DESC(enable_tcp_window_scaling,
-		 "Enable tcp window scaling (default=1)");
-
 int c4iw_debug;
 module_param(c4iw_debug, int, 0644);
 MODULE_PARM_DESC(c4iw_debug, "Enable debug logging (default=0)");
@@ -126,19 +113,6 @@ static int crc_enabled = 1;
 module_param(crc_enabled, int, 0644);
 MODULE_PARM_DESC(crc_enabled, "Enable MPA CRC (default(1)=enabled)");
 
-static int rcv_win = 256 * 1024;
-module_param(rcv_win, int, 0644);
-MODULE_PARM_DESC(rcv_win, "TCP receive window in bytes (default=256KB)");
-
-static int snd_win = 128 * 1024;
-module_param(snd_win, int, 0644);
-MODULE_PARM_DESC(snd_win, "TCP send window in bytes (default=128KB)");
-
-static int adjust_win = 1;
-module_param(adjust_win, int, 0644);
-MODULE_PARM_DESC(adjust_win,
-		 "Adjust TCP window based on link speed (default=1)");
-
 static struct workqueue_struct *workq;
 
 static struct sk_buff_head rxq;
@@ -572,7 +546,7 @@ static int send_connect(struct c4iw_ep *ep)
 	set_wr_txq(skb, CPL_PRIORITY_SETUP, ep->ctrlq_idx);
 
 	cxgb4_best_mtu(ep->com.dev->rdev.lldi.mtus, ep->mtu, &mtu_idx);
-	wscale = compute_wscale(rcv_win);
+	wscale = compute_wscale(ep->rcv_win);
 
/*
 * Specify the largest window that will fit in opt0. The
@@ -596,11 +570,11 @@ static int send_connect(struct c4iw_ep *ep)
opt2 = RX_CHANNEL(0) |
   CCTRL_ECN(enable_ecn) |
 	       RSS_QUEUE_VALID | RSS_QUEUE(ep->rss_qid);
-   if (enable_tcp_timestamps)
+   if (sysctl_tcp_timestamps)
opt2 |= TSTAMPS_EN(1);
-   if (enable_tcp_sack)
+   if (sysctl_tcp_sack)
opt2 |= SACK_EN(1);
-	if (wscale && enable_tcp_window_scaling)
+	if (wscale && sysctl_tcp_window_scaling)
opt2 |= WND_SCALE_EN(1);
t4_set_arp_err_handler(skb, NULL, act_open_req_arp_failure);
 
@@ -1652,7 +1626,7 @@ static void send_fw_act_open_req(struct c4iw_ep *ep, unsigned int atid)
 	req->tcb.tx_max = (__force __be32) jiffies;
 	req->tcb.rcv_adv = htons(1);
 	cxgb4_best_mtu(ep->com.dev->rdev.lldi.mtus, ep->mtu, &mtu_idx);
-	wscale = compute_wscale(rcv_win);
+	wscale = compute_wscale(ep->rcv_win);
 
/*
 * Specify the largest window that will fit in opt0. The
@@ -1679,11 +1653,11 @@ static void send_fw_act_open_req(struct c4iw_ep *ep, unsigned int atid)
 		RX_CHANNEL(0) |
 		CCTRL_ECN(enable_ecn) |
 		RSS_QUEUE_VALID | RSS_QUEUE(ep->rss_qid));
-	if (enable_tcp_timestamps)
+	if (sysctl_tcp_timestamps)
 		req->tcb.opt2 |= (__force __be32) TSTAMPS_EN(1);
-	if (enable_tcp_sack)
+	if (sysctl_tcp_sack)
 		req->tcb.opt2 |= (__force __be32) SACK_EN(1);
-	if (wscale && enable_tcp_window_scaling)
+	if (wscale && sysctl_tcp_window_scaling)
 		req->tcb.opt2 |= (__force __be32) WND_SCALE_EN(1);
 	req->tcb.opt0 = cpu_to_be64((__force u64) req->tcb.opt0);
 	req->tcb.opt2 = cpu_to_be32((__force u32) req->tcb.opt2);
@@ -1712,11 +1686,14 @@ static int is_neg_adv(unsigned int status)
 
 static void set_tcp_window(struct c4iw_ep *ep, struct port_info *pi)
 {
+	u32 snd_win = max_t(u32, sysctl_tcp_wmem[2], sysctl_wmem_max);
+	u32 rcv_win = max_t(u32, sysctl_tcp_rmem[2], sysctl_rmem_max);
+
 	ep->snd_win = snd_win;
 	ep->rcv_win = rcv_win;
-	if (adjust_win && pi->link_cfg.speed == 40000) {
-		ep->snd_win *= 4;
-		ep->rcv_win *= 4;
+	if (pi->link_cfg.speed == 40000) {
+		ep->snd_win = min_t(u32, ep->snd_win * 4, snd_win);
+		ep->rcv_win = min_t(u32, ep->rcv_win * 4, rcv_win);
 	}
 	PDBG("%s snd_win %d rcv_win %d\n", __func__, ep->snd_win, ep->rcv_win);
 }
@@ -2026,7 +2003,7 @@ static void accept_cr(struct c4iw_ep *ep, struct 
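
For reference, a sketch of the compute_wscale() helper both hunks keep
calling, consistent with how it is used above (check the driver headers
for the authoritative definition):

static inline int compute_wscale(int win)
{
	int wscale = 0;

	/* Grow the shift until the 16-bit TCP window field, scaled by
	 * 2^wscale, can cover the requested window; RFC 1323 caps the
	 * shift at 14. */
	while (wscale < 14 && (65535 << wscale) < win)
		wscale++;
	return wscale;
}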

Re: [Patch 1/2] IB/mlx5: Implementation of PCI error handler

2014-03-12 Thread Carol Soto


On 3/12/2014 1:34 PM, Ben Hutchings wrote:

On Tue, 2014-03-11 at 22:42 -0500, cls...@linux.vnet.ibm.com wrote:
[...]

Index: b/include/linux/mlx5/driver.h
===================================================================
--- a/include/linux/mlx5/driver.h
+++ b/include/linux/mlx5/driver.h
@@ -51,10 +51,10 @@ enum {
  };
  
  enum {

-   /* one minute for the sake of bringup. Generally, commands must always
+   /* 10 msecs for the sake of bringup. Generally, commands must always
  * complete and we may need to increase this timeout value
  */
-   MLX5_CMD_TIMEOUT_MSEC   = 7200 * 1000,
+   MLX5_CMD_TIMEOUT_MSEC   = 10 * 1000,

You seem to be changing the timeout from 2 hours (not one minute) to 10
seconds (not milliseconds).

Ben.

Yes, you are right: the comment should say 10 seconds instead of 10 msecs.

Carol



 MLX5_CMD_WQ_MAX_NAME= 32,
  };
  






Re: [PATCHv6 net-next 20/31] iw_cxgb4: adjust tcp snd/rcv window based on link speed

2014-03-12 Thread David Miller
From: Steve Wise sw...@opengridcomputing.com
Date: Wed, 12 Mar 2014 16:29:27 -0500

  You can just use the TCP settings the kernel already provides for
  the real TCP stack.
 
  Do you mean use sysctl_tcp_*mem, sysctl_tcp_timestamps, 
  sysctl_tcp_window_scaling,
 etc?
  I'll look into this.
 
 And the socket memory limits, which we use to compute default window
 sizes.
 
 How's this look (compile-tested only)?  Note I had to export some of the tcp 
 limits.

Well, the problem is that you've dug your own hole already.

You can't just remove these existing module parameters that users can
set.  They are user-visible APIs; you can't just remove them.

The best you can do is stop adding new ones.


RE: [PATCHv6 net-next 20/31] iw_cxgb4: adjust tcp snd/rcv window based on link speed

2014-03-12 Thread Steve Wise
 
  How's this look (compile-tested only)?  Note I had to export some of the 
  tcp limits.
 
 Well, the problem is that you've dug your own hole already.
 
 You can't just remove these existing module parameters that users can
 set.  They are user visible APIs, you can't just remove them.
 
 The best you can do is stop adding new ones.


Ok thanks.




Re: [PATCHv6 net-next 00/31] Misc. fixes for cxgb4 and iw_cxgb4

2014-03-12 Thread Casey Leedom
[[Sorry, the first effort at this reply fell afoul of netdev’s HTML email 
filter and my Mail Agent’s default modes. — Casey]]

On Mar 12, 2014, at 12:51 PM, David Miller da...@davemloft.net wrote:

 This is becoming beyond tiring.

  I'm really sorry for how much work this has turned into, David.  Hari is
trying to do the right thing, but it's an insanely large patch set.  A while
back someone recommended backing off and restarting with a series of much
smaller patch sets.  I think the decision at that time was that too much time
had gone into this patch set, so Hari should proceed with the effort.  Should
we revisit that decision and ask Hari to submit a series of much smaller patch
sets (one at a time, obviously)?

Casey


Re: [PATCHv6 net-next 00/31] Misc. fixes for cxgb4 and iw_cxgb4

2014-03-12 Thread David Miller
From: Casey Leedom lee...@chelsio.com
Date: Wed, 12 Mar 2014 16:43:33 -0700

 Should we revisit that decision and ask Hari to submit a series of
 much smaller patch sets (one at a time obviously)?

That might be a good idea, honestly.


[Question] unexpected block during RDMA migration

2014-03-12 Thread Wangyufei (James)
Hello,
Recently I did a test like this:
1. I have host A and host B. I set ib0 on host A to 192.168.0.1 and ib0 on
host B to 192.168.0.2.
2. I start a guest OS C on host A and perform an RDMA migration from host A
to host B.
3. During the RDMA migration, I power off host B (this simulates a bad
network connection or a bad power supply).

Then I found that:
1. The libvirt migration API blocks there forever.
2. I debugged the qemu process, and it blocks in the RDMA API seemingly
forever. The stack looks like this:

#0  0x7fe987f2061c in pthread_cond_wait@@GLIBC_2.3.2 ()
   from /lib64/libpthread.so.0
#1  0x7fe9887bf1d3 in rdma_destroy_id ()
   from /usr/lib64/librdmacm.so.1
#2  0x7fe98a484eb3 in qemu_rdma_resolve_host (rdma=0x7fe98b585670,
errp=0x7fffdc452bb8) at migration-rdma.c:981
#3  0x7fe98a487d05 in qemu_rdma_source_init (rdma=0x7fe98b585670,
errp=0x7fffdc452c10, pin_all=true) at migration-rdma.c:2298
#4  0x7fe98a48a762 in rdma_start_outgoing_migration (opaque=
0x7fe98ab67240 <current_migration.25929>, host_port=
0x7fe98b4a5a97 "192.168.0.1:49152", errp=0x7fffdc452c88)
at migration-rdma.c:3426
#5  0x7fe98a48bd28 in qmp_migrate (uri=
0x7fe98b4a5a90 "x-rdma:192.168.0.1:49152", has_blk=true, blk=false,
has_inc=true, inc=false, has_detach=true, detach=true, errp=0x7fffaf760008)
at migration.c:459
#6  0x7fd00969f3b4 in qmp_marshal_input_migrate (mon=0x7fd00a61adc0, qdict=
0x7fd00a7585f0, ret=0x7fffaf760058) at qmp-marshal.c:2793
#7  0x7fd00978033f in qmp_call_cmd (mon=0x7fd00a61adc0, cmd=
0x7fd009ced100 <qmp_cmds+1344>, params=0x7fd00a7585f0)
at /home/sdb/qemu-kvm-1.5.1/monitor.c:4520
#8  0x7fd009780606 in handle_qmp_command (parser=0x7fd00a619738, tokens=
0x7fd00a618cd0) at /home/sdb/qemu-kvm-1.5.1/monitor.c:4598
#9  0x7fd009813036 in json_message_process_token (lexer=0x7fd00a619740,
token=0x7fd00a67c690, type=JSON_OPERATOR, x=124, y=10)
at qobject/json-streamer.c:87
#10 0x7fd00982d0ef in json_lexer_feed_char (lexer=0x7fd00a619740, ch=
125 '}', flush=false) at qobject/json-lexer.c:303
#11 0x7fd00982d3a0 in json_lexer_feed (lexer=0x7fd00a619740, buffer=
0x7fffaf760330 "}\005v\257\377\177", size=1) at qobject/json-lexer.c:356
#12 0x7fd009813270 in json_message_parser_feed (parser=0x7fd00a619738,
buffer=0x7fffaf760330 "}\005v\257\377\177", size=1)
at qobject/json-streamer.c:110
#13 0x7fd00978070d in monitor_control_read (opaque=0x7fd00a61adc0, buf=
0x7fffaf760330 "}\005v\257\377\177", size=1)
at /home/sdb/qemu-kvm-1.5.1/monitor.c:4619
#14 0x7fd00968d9b2 in qemu_chr_be_write (s=0x7fd00a61a990, buf=
0x7fffaf760330 "}\005v\257\377\177", len=1) at qemu-char.c:254
#15 0x7fd009691de8 in tcp_chr_read (chan=0x7fd00a67ee40, cond=G_IO_IN,
opaque=0x7fd00a61a990) at qemu-char.c:2607
#16 0x7fd008b8569a in g_main_context_dispatch ()
   from /usr/lib64/libglib-2.0.so.0
#17 0x7fd009659e59 in glib_pollfds_poll () at main-loop.c:188
#18 0x7fd009659f4a in os_host_main_loop_wait (timeout=5643498)
at main-loop.c:233
#19 0x7fd00965a004 in main_loop_wait (nonblocking=0) at main-loop.c:478
#20 0x7fd0096e8b27 in main_loop () at vl.c:2186
#21 0x7fd0096efe33 in main (argc=49, argv=0x7fffaf761968, envp=
0x7fffaf761af8) at vl.c:4639

3. Just as the stack above shows, qemu's main_loop stays blocked until I
destroy the qemu process.

That is unreasonable; I expect it to return an error.
What do you think about this? It confused me.
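
One way I could imagine avoiding the indefinite block (a sketch under the
assumption that the caller drives the CM event channel itself; the names
here are mine, not qemu's actual code):

#include <poll.h>
#include <rdma/rdma_cma.h>

static int resolve_with_timeout(struct sockaddr *dst, int timeout_ms)
{
	struct rdma_event_channel *ec = rdma_create_event_channel();
	struct rdma_cm_id *id;
	struct rdma_cm_event *ev;
	struct pollfd pfd;
	int ret = -1;

	if (!ec)
		return -1;
	if (rdma_create_id(ec, &id, NULL, RDMA_PS_TCP)) {
		rdma_destroy_event_channel(ec);
		return -1;
	}
	if (rdma_resolve_addr(id, NULL, dst, timeout_ms))
		goto out;

	/* Poll the event channel fd instead of blocking inside the CM,
	 * so a dead peer makes us give up after timeout_ms. */
	pfd.fd = ec->fd;
	pfd.events = POLLIN;
	if (poll(&pfd, 1, timeout_ms) <= 0)
		goto out;
	if (rdma_get_cm_event(ec, &ev))
		goto out;
	if (ev->event == RDMA_CM_EVENT_ADDR_RESOLVED)
		ret = 0;
	rdma_ack_cm_event(ev);
out:
	rdma_destroy_id(id);
	rdma_destroy_event_channel(ec);
	return ret;
}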

Best Regards,
-WangYufei
