Re: [ewg] rping/cxgb3 regression

2011-02-16 Thread Vladimir Sokolovsky

On 02/16/2011 04:00 AM, Hefty, Sean wrote:

> Not a big deal.
>
> Vlad, can you pull librdmacm 1.0.14.1 into the next OFED 1.5.3 RC?  The only
> change versus 1.0.14 is reverting a patch to the rping sample.
>
> Thanks,
> Sean

Done,

Regards,
Vladimir
___
ewg mailing list
ewg@lists.openfabrics.org
http://lists.openfabrics.org/cgi-bin/mailman/listinfo/ewg


[ewg] rping/cxgb3 regression

2011-02-15 Thread Steve Wise

Hey Sean,

Can you peruse:

http://bugs.openfabrics.org/bugzilla/show_bug.cgi?id=2230

The changes I added to rping to fix problems I was seeing when running rping over the upstream iw_cxgb3 have been
included in OFED-1.5.3.  That change, however, breaks rping over iw_cxgb3 in 1.5.3: it causes a hang at the end of the
rping run.  The problem is not with rping, but with the down-level iw_cxgb3 code.  The upstream change that fixes iw_cxgb3,
however, isn't trivial, and it bumps the cxgb3 uverbs ABI.  That's why I didn't pull it into 1.5.3.  I'm hesitant to pull
in the iw_cxgb3 fix (and the required libcxgb3-1.3.0) since we're at RC4 and are going to try to GA next week.


I'm wondering if pulling the rping changes out of OFED-1.5.3 would be ok?  I guess to do this you would have to push a
one-off librdmacm without those changes?  Or maybe back up what is in OFED-1.5.3 to the previous release without this
rping change?


Thoughts?

Sorry about this regression.

Steve.




Re: [ewg] rping/cxgb3 regression

2011-02-15 Thread Hefty, Sean
> I'm wondering if pulling the rping changes for ofed-1.5.3 would be ok?  I
> guess to do this you would have to push a 1-off librdmacm without those
> changes?  Or maybe back up what is in OFED-1.5.3 to the previous release
> without this rping change?
>
> Thoughts?

Is the commit (93635fa33b41d356fa096242fec4ce788194b42f) below the issue?  
(Btw, the author listed in my git tree is wrong.)

I don't think I want to drop back to 1.0.13 for 1.5.3, so maybe reverting this 
change and pushing out 1.0.14.1 would work.  There's just one other change 
after 1.0.14 at the moment, and it's to the build, so I'd skip a full release 
for now.

Let me know if you think this would work.

- Sean

---

librdmacm/rping: Make sure CQ event thread exits before destroying the CQ

It is possible for the CQ event thread to poll the CQ after it has been
destroyed which can result in a seg fault on T3 interfaces.  This patch
waits for the thread to exit before destroying the CQ.

Signed-off-by: Steve Wise <sw...@opengridcomputing.com>
Signed-off-by: Sean Hefty <sean.he...@intel.com>

diff --git a/examples/rping.c b/examples/rping.c
index 2d4c2de..ee292ec 100644
--- a/examples/rping.c
+++ b/examples/rping.c
@@ -280,12 +280,11 @@ static int rping_cq_event_handler(struct rping_cb *cb)
 		ret = 0;

 		if (wc.status) {
-			if (wc.status != IBV_WC_WR_FLUSH_ERR) {
+			if (wc.status != IBV_WC_WR_FLUSH_ERR)
 				fprintf(stderr,
 					"cq completion failed status %d\n",
 					wc.status);
-				ret = -1;
-			}
+			ret = -1;
 			goto error;
 		}

@@ -802,10 +801,9 @@ static void *rping_persistent_server_thread(void *arg)

 	rping_test_server(cb);
 	rdma_disconnect(cb->child_cm_id);
+	pthread_join(cb->cqthread, NULL);
 	rping_free_buffers(cb);
 	rping_free_qp(cb);
-	pthread_cancel(cb->cqthread);
-	pthread_join(cb->cqthread, NULL);
 	rdma_destroy_id(cb->child_cm_id);
 	free_cb(cb);
 	return NULL;
@@ -890,6 +888,7 @@ static int rping_run_server(struct rping_cb *cb)

 	rping_test_server(cb);
 	rdma_disconnect(cb->child_cm_id);
+	pthread_join(cb->cqthread, NULL);
 	rdma_destroy_id(cb->child_cm_id);
 err2:
 	rping_free_buffers(cb);
@@ -1057,6 +1056,7 @@ static int rping_run_client(struct rping_cb *cb)

 	rping_test_client(cb);
 	rdma_disconnect(cb->cm_id);
+	pthread_join(cb->cqthread, NULL);
 err2:
 	rping_free_buffers(cb);
 err1:
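
For readers who want to see the ordering in the patch above without RDMA hardware, here is a minimal sketch using only pthreads. It is not rping code: `sim_cq`, `sim_cq_push`, `sim_cq_thread`, `sim_run` and the `SIM_*` status values are hypothetical names invented for this illustration, and a mutex-guarded array stands in for the real CQ. It assumes the worker always receives a flush-style completion after the "disconnect". If that completion never arrived, the `pthread_join()` call would block forever.

```c
#include <assert.h>
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>

#define SIM_CQ_DEPTH 16

/* Hypothetical stand-ins for completion statuses; not the libibverbs values. */
enum sim_status { SIM_SUCCESS = 0, SIM_FLUSH_ERR = 1, SIM_OTHER_ERR = 2 };

/* A toy "completion queue": a bounded array guarded by a mutex. */
struct sim_cq {
	pthread_mutex_t lock;
	pthread_cond_t cond;
	int entries[SIM_CQ_DEPTH];
	int head;
	int tail;
	int processed;
};

static void sim_cq_push(struct sim_cq *cq, int status)
{
	pthread_mutex_lock(&cq->lock);
	cq->entries[cq->tail++] = status;
	pthread_cond_signal(&cq->cond);
	pthread_mutex_unlock(&cq->lock);
}

/* Worker: consume completions until one carries a non-zero status. */
static void *sim_cq_thread(void *arg)
{
	struct sim_cq *cq = arg;

	for (;;) {
		int status;

		pthread_mutex_lock(&cq->lock);
		while (cq->head == cq->tail)
			pthread_cond_wait(&cq->cond, &cq->lock);
		status = cq->entries[cq->head++];
		pthread_mutex_unlock(&cq->lock);

		if (status) {
			/* A flush is the expected shutdown signal, so stay quiet. */
			if (status != SIM_FLUSH_ERR)
				fprintf(stderr, "cq completion failed status %d\n", status);
			return NULL;
		}
		cq->processed++;
	}
}

/* Post `completions` successes, "disconnect", join the worker, then free. */
int sim_run(int completions)
{
	struct sim_cq *cq;
	pthread_t thread;
	int i, processed;

	if (completions < 0 || completions >= SIM_CQ_DEPTH)
		return -1;
	cq = calloc(1, sizeof(*cq));
	if (!cq)
		return -1;
	pthread_mutex_init(&cq->lock, NULL);
	pthread_cond_init(&cq->cond, NULL);
	if (pthread_create(&thread, NULL, sim_cq_thread, cq)) {
		free(cq);
		return -1;
	}

	for (i = 0; i < completions; i++)
		sim_cq_push(cq, SIM_SUCCESS);
	sim_cq_push(cq, SIM_FLUSH_ERR);	/* plays the role of the disconnect */

	/* Wait for the worker to exit before tearing down what it reads. */
	pthread_join(thread, NULL);
	assert(cq->head == cq->tail);	/* every entry was consumed */

	processed = cq->processed;
	pthread_cond_destroy(&cq->cond);
	pthread_mutex_destroy(&cq->lock);
	free(cq);
	return processed;
}
```

Built with `cc -pthread`, a call such as `sim_run(3)` returns 3. The queue is freed only after `pthread_join()` returns, so the worker cannot touch it afterwards.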


Re: [ewg] rping/cxgb3 regression

2011-02-15 Thread Steve Wise


On 02/15/2011 12:18 PM, Hefty, Sean wrote:

>> I'm wondering if pulling the rping changes for ofed-1.5.3 would be ok?  I
>> guess to do this you would have to push a 1-off librdmacm without those
>> changes?  Or maybe back up what is in OFED-1.5.3 to the previous release
>> without this rping change?
>>
>> Thoughts?
>
> Is the commit (93635fa33b41d356fa096242fec4ce788194b42f) below the issue?
> (Btw, the author listed in my git tree is wrong.)

Yes.

> I don't think I want to drop back to 1.0.13 for 1.5.3, so maybe reverting this
> change and pushing out 1.0.14.1 would work.  There's just one other change
> after 1.0.14 at the moment, and it's to the build, so I'd skip a full release
> for now.
>
> Let me know if you think this would work.



I just tested this: removing the change from 1.0.14 resolves the issue for 1.5.3.





Re: [ewg] rping/cxgb3 regression

2011-02-15 Thread Hefty, Sean
I placed a 1.0.14.1 package on the OFA server in the downloads/rdmacm section.
Can you verify that it works?  If so, I'll ask to have it pulled into 1.5.3.




Re: [ewg] rping/cxgb3 regression

2011-02-15 Thread Hefty, Sean
Not a big deal.

Vlad, can you pull librdmacm 1.0.14.1 into the next OFED 1.5.3 RC?  The only 
change versus 1.0.14 is reverting a patch to the rping sample.

Thanks,
Sean


> -----Original Message-----
> From: Steve Wise [mailto:sw...@opengridcomputing.com]
> Sent: Tuesday, February 15, 2011 5:57 PM
> To: Hefty, Sean
> Cc: OpenFabrics EWG; Tziporet Koren
> Subject: Re: rping/cxgb3 regression
>
> I pulled it down, built/installed it on 2 nodes, then ran a bunch of
> rpings.  No hangs.  Looks good!
>
> Thanks Sean.  Sorry about this.
>
> Steve.
 
