Re: [ewg] rping/cxgb3 regression
On 02/16/2011 04:00 AM, Hefty, Sean wrote:
> Not a big deal. Vlad, can you pull librdmacm 1.0.14.1 into the next
> OFED 1.5.3 RC? The only change versus 1.0.14 is reverting a patch to
> the rping sample.
>
> Thanks,
> Sean

Done,

Regards,
Vladimir
___
ewg mailing list
ewg@lists.openfabrics.org
http://lists.openfabrics.org/cgi-bin/mailman/listinfo/ewg
Re: [ewg] rping/cxgb3 regression
Not a big deal. Vlad, can you pull librdmacm 1.0.14.1 into the next
OFED 1.5.3 RC? The only change versus 1.0.14 is reverting a patch to
the rping sample.

Thanks,
Sean

> -----Original Message-----
> From: Steve Wise [mailto:sw...@opengridcomputing.com]
> Sent: Tuesday, February 15, 2011 5:57 PM
> To: Hefty, Sean
> Cc: OpenFabrics EWG; Tziporet Koren
> Subject: Re: rping/cxgb3 regression
>
> I pulled it down, built/installed it on 2 nodes, then ran a bunch of
> rpings. No hangs. Looks good!
>
> Thanks Sean. Sorry about this.
>
> Steve.
>
> On 2/15/2011 7:46 PM, Hefty, Sean wrote:
> > I placed a 1.0.14.1 package on the ofa server in the downloads/rdmacm
> > section. Can you verify that it works? If so, I'll ask to pull it
> > into 1.5.3
> >
> >> -----Original Message-----
> >> From: Steve Wise [mailto:sw...@opengridcomputing.com]
> >> Sent: Tuesday, February 15, 2011 10:37 AM
> >> To: Hefty, Sean
> >> Cc: OpenFabrics EWG; Tziporet Koren
> >> Subject: Re: rping/cxgb3 regression
> >>
> >> On 02/15/2011 12:18 PM, Hefty, Sean wrote:
> >>>> I'm wondering if pulling the rping changes for ofed-1.5.3 would be
> >>>> ok? I guess to do this you would have to push a 1-off librdmacm
> >>>> without those changes? Or maybe back up what is in OFED-1.5.3 to
> >>>> the previous release without this rping change?
> >>>>
> >>>> Thoughts?
> >>> Is the commit (93635fa33b41d356fa096242fec4ce788194b42f) below the
> >>> issue? (Btw, the author listed in my git tree is wrong.)
> >>
> >> Yes.
> >>
> >>> I don't think I want to drop back to 1.0.13 for 1.5.3, so maybe
> >>> reverting this change and pushing out 1.0.14.1 would work. There's
> >>> just one other change after 1.0.14 at the moment, and it's to the
> >>> build, so I'd skip a full release for now.
> >>>
> >>> Let me know if you think this would work.
> >>
> >> I just tested that removing this from 1.0.14 will resolve the issue
> >> for 1.5.3.
> >>
> >>> - Sean
> >>>
> >>> ---
> >>>
> >>> librdmacm/rping: Make sure CQ event thread exits before destroying the CQ
> >>>
> >>> It is possible for the CQ event thread to poll the CQ after it has been
> >>> destroyed which can result in a seg fault on T3 interfaces. This patch
> >>> waits for the thread to exit before destroying the CQ.
> >>>
> >>> Signed-off-by: Steve Wise
> >>> Signed-off-by: Sean Hefty
> >>>
> >>> diff --git a/examples/rping.c b/examples/rping.c
> >>> index 2d4c2de..ee292ec 100644
> >>> --- a/examples/rping.c
> >>> +++ b/examples/rping.c
> >>> @@ -280,12 +280,11 @@ static int rping_cq_event_handler(struct rping_cb *cb)
> >>>  		ret = 0;
> >>>
> >>>  		if (wc.status) {
> >>> -			if (wc.status != IBV_WC_WR_FLUSH_ERR) {
> >>> +			if (wc.status != IBV_WC_WR_FLUSH_ERR)
> >>>  				fprintf(stderr,
> >>>  					"cq completion failed status %d\n",
> >>>  					wc.status);
> >>> -				ret = -1;
> >>> -			}
> >>> +			ret = -1;
> >>>  			goto error;
> >>>  		}
> >>>
> >>> @@ -802,10 +801,9 @@ static void *rping_persistent_server_thread(void *arg)
> >>>
> >>>  	rping_test_server(cb);
> >>>  	rdma_disconnect(cb->child_cm_id);
> >>> +	pthread_join(cb->cqthread, NULL);
> >>>  	rping_free_buffers(cb);
> >>>  	rping_free_qp(cb);
> >>> -	pthread_cancel(cb->cqthread);
> >>> -	pthread_join(cb->cqthread, NULL);
> >>>  	rdma_destroy_id(cb->child_cm_id);
> >>>  	free_cb(cb);
> >>>  	return NULL;
> >>> @@ -890,6 +888,7 @@ static int rping_run_server(struct rping_cb *cb)
> >>>
> >>>  	rping_test_server(cb);
> >>>  	rdma_disconnect(cb->child_cm_id);
> >>> +	pthread_join(cb->cqthread, NULL);
> >>>  	rdma_destroy_id(cb->child_cm_id);
> >>>  err2:
> >>>  	rping_free_buffers(cb);
> >>> @@ -1057,6 +1056,7 @@ static int rping_run_client(struct rping_cb *cb)
> >>>
> >>>  	rping_test_client(cb);
> >>>  	rdma_disconnect(cb->cm_id);
> >>> +	pthread_join(cb->cqthread, NULL);
> >>>  err2:
> >>>  	rping_free_buffers(cb);
> >>>  err1:
Re: [ewg] rping/cxgb3 regression
I pulled it down, built/installed it on 2 nodes, then ran a bunch of
rpings. No hangs. Looks good!

Thanks Sean. Sorry about this.

Steve.

On 2/15/2011 7:46 PM, Hefty, Sean wrote:
> I placed a 1.0.14.1 package on the ofa server in the downloads/rdmacm
> section. Can you verify that it works? If so, I'll ask to pull it
> into 1.5.3
>
>> -----Original Message-----
>> From: Steve Wise [mailto:sw...@opengridcomputing.com]
>> Sent: Tuesday, February 15, 2011 10:37 AM
>> To: Hefty, Sean
>> Cc: OpenFabrics EWG; Tziporet Koren
>> Subject: Re: rping/cxgb3 regression
>>
>> On 02/15/2011 12:18 PM, Hefty, Sean wrote:
>>>> I'm wondering if pulling the rping changes for ofed-1.5.3 would be
>>>> ok? I guess to do this you would have to push a 1-off librdmacm
>>>> without those changes? Or maybe back up what is in OFED-1.5.3 to
>>>> the previous release without this rping change?
>>>>
>>>> Thoughts?
>>> Is the commit (93635fa33b41d356fa096242fec4ce788194b42f) below the
>>> issue? (Btw, the author listed in my git tree is wrong.)
>>
>> Yes.
>>
>>> I don't think I want to drop back to 1.0.13 for 1.5.3, so maybe
>>> reverting this change and pushing out 1.0.14.1 would work. There's
>>> just one other change after 1.0.14 at the moment, and it's to the
>>> build, so I'd skip a full release for now.
>>>
>>> Let me know if you think this would work.
>>
>> I just tested that removing this from 1.0.14 will resolve the issue
>> for 1.5.3.
>>
>>> - Sean
>>>
>>> ---
>>>
>>> librdmacm/rping: Make sure CQ event thread exits before destroying the CQ
>>>
>>> It is possible for the CQ event thread to poll the CQ after it has been
>>> destroyed which can result in a seg fault on T3 interfaces. This patch
>>> waits for the thread to exit before destroying the CQ.
>>>
>>> Signed-off-by: Steve Wise
>>> Signed-off-by: Sean Hefty
>>>
>>> diff --git a/examples/rping.c b/examples/rping.c
>>> index 2d4c2de..ee292ec 100644
>>> --- a/examples/rping.c
>>> +++ b/examples/rping.c
>>> @@ -280,12 +280,11 @@ static int rping_cq_event_handler(struct rping_cb *cb)
>>>  		ret = 0;
>>>
>>>  		if (wc.status) {
>>> -			if (wc.status != IBV_WC_WR_FLUSH_ERR) {
>>> +			if (wc.status != IBV_WC_WR_FLUSH_ERR)
>>>  				fprintf(stderr,
>>>  					"cq completion failed status %d\n",
>>>  					wc.status);
>>> -				ret = -1;
>>> -			}
>>> +			ret = -1;
>>>  			goto error;
>>>  		}
>>>
>>> @@ -802,10 +801,9 @@ static void *rping_persistent_server_thread(void *arg)
>>>
>>>  	rping_test_server(cb);
>>>  	rdma_disconnect(cb->child_cm_id);
>>> +	pthread_join(cb->cqthread, NULL);
>>>  	rping_free_buffers(cb);
>>>  	rping_free_qp(cb);
>>> -	pthread_cancel(cb->cqthread);
>>> -	pthread_join(cb->cqthread, NULL);
>>>  	rdma_destroy_id(cb->child_cm_id);
>>>  	free_cb(cb);
>>>  	return NULL;
>>> @@ -890,6 +888,7 @@ static int rping_run_server(struct rping_cb *cb)
>>>
>>>  	rping_test_server(cb);
>>>  	rdma_disconnect(cb->child_cm_id);
>>> +	pthread_join(cb->cqthread, NULL);
>>>  	rdma_destroy_id(cb->child_cm_id);
>>>  err2:
>>>  	rping_free_buffers(cb);
>>> @@ -1057,6 +1056,7 @@ static int rping_run_client(struct rping_cb *cb)
>>>
>>>  	rping_test_client(cb);
>>>  	rdma_disconnect(cb->cm_id);
>>> +	pthread_join(cb->cqthread, NULL);
>>>  err2:
>>>  	rping_free_buffers(cb);
>>>  err1:
Re: [ewg] rping/cxgb3 regression
I placed a 1.0.14.1 package on the ofa server in the downloads/rdmacm
section. Can you verify that it works? If so, I'll ask to pull it
into 1.5.3

> -----Original Message-----
> From: Steve Wise [mailto:sw...@opengridcomputing.com]
> Sent: Tuesday, February 15, 2011 10:37 AM
> To: Hefty, Sean
> Cc: OpenFabrics EWG; Tziporet Koren
> Subject: Re: rping/cxgb3 regression
>
> On 02/15/2011 12:18 PM, Hefty, Sean wrote:
> >> I'm wondering if pulling the rping changes for ofed-1.5.3 would be
> >> ok? I guess to do this you would have to push a 1-off librdmacm
> >> without those changes? Or maybe back up what is in OFED-1.5.3 to
> >> the previous release without this rping change?
> >>
> >> Thoughts?
> > Is the commit (93635fa33b41d356fa096242fec4ce788194b42f) below the
> > issue? (Btw, the author listed in my git tree is wrong.)
>
> Yes.
>
> > I don't think I want to drop back to 1.0.13 for 1.5.3, so maybe
> > reverting this change and pushing out 1.0.14.1 would work. There's
> > just one other change after 1.0.14 at the moment, and it's to the
> > build, so I'd skip a full release for now.
> >
> > Let me know if you think this would work.
>
> I just tested that removing this from 1.0.14 will resolve the issue
> for 1.5.3.
>
> > - Sean
> >
> > ---
> >
> > librdmacm/rping: Make sure CQ event thread exits before destroying the CQ
> >
> > It is possible for the CQ event thread to poll the CQ after it has been
> > destroyed which can result in a seg fault on T3 interfaces. This patch
> > waits for the thread to exit before destroying the CQ.
> >
> > Signed-off-by: Steve Wise
> > Signed-off-by: Sean Hefty
> >
> > diff --git a/examples/rping.c b/examples/rping.c
> > index 2d4c2de..ee292ec 100644
> > --- a/examples/rping.c
> > +++ b/examples/rping.c
> > @@ -280,12 +280,11 @@ static int rping_cq_event_handler(struct rping_cb *cb)
> >  		ret = 0;
> >
> >  		if (wc.status) {
> > -			if (wc.status != IBV_WC_WR_FLUSH_ERR) {
> > +			if (wc.status != IBV_WC_WR_FLUSH_ERR)
> >  				fprintf(stderr,
> >  					"cq completion failed status %d\n",
> >  					wc.status);
> > -				ret = -1;
> > -			}
> > +			ret = -1;
> >  			goto error;
> >  		}
> >
> > @@ -802,10 +801,9 @@ static void *rping_persistent_server_thread(void *arg)
> >
> >  	rping_test_server(cb);
> >  	rdma_disconnect(cb->child_cm_id);
> > +	pthread_join(cb->cqthread, NULL);
> >  	rping_free_buffers(cb);
> >  	rping_free_qp(cb);
> > -	pthread_cancel(cb->cqthread);
> > -	pthread_join(cb->cqthread, NULL);
> >  	rdma_destroy_id(cb->child_cm_id);
> >  	free_cb(cb);
> >  	return NULL;
> > @@ -890,6 +888,7 @@ static int rping_run_server(struct rping_cb *cb)
> >
> >  	rping_test_server(cb);
> >  	rdma_disconnect(cb->child_cm_id);
> > +	pthread_join(cb->cqthread, NULL);
> >  	rdma_destroy_id(cb->child_cm_id);
> >  err2:
> >  	rping_free_buffers(cb);
> > @@ -1057,6 +1056,7 @@ static int rping_run_client(struct rping_cb *cb)
> >
> >  	rping_test_client(cb);
> >  	rdma_disconnect(cb->cm_id);
> > +	pthread_join(cb->cqthread, NULL);
> >  err2:
> >  	rping_free_buffers(cb);
> >  err1:
Re: [ewg] rping/cxgb3 regression
On 02/15/2011 12:18 PM, Hefty, Sean wrote:
>> I'm wondering if pulling the rping changes for ofed-1.5.3 would be
>> ok? I guess to do this you would have to push a 1-off librdmacm
>> without those changes? Or maybe back up what is in OFED-1.5.3 to
>> the previous release without this rping change?
>>
>> Thoughts?
> Is the commit (93635fa33b41d356fa096242fec4ce788194b42f) below the
> issue? (Btw, the author listed in my git tree is wrong.)

Yes.

> I don't think I want to drop back to 1.0.13 for 1.5.3, so maybe
> reverting this change and pushing out 1.0.14.1 would work. There's
> just one other change after 1.0.14 at the moment, and it's to the
> build, so I'd skip a full release for now.
>
> Let me know if you think this would work.

I just tested that removing this from 1.0.14 will resolve the issue
for 1.5.3.

> - Sean
>
> ---
>
> librdmacm/rping: Make sure CQ event thread exits before destroying the CQ
>
> It is possible for the CQ event thread to poll the CQ after it has been
> destroyed which can result in a seg fault on T3 interfaces. This patch
> waits for the thread to exit before destroying the CQ.
>
> Signed-off-by: Steve Wise
> Signed-off-by: Sean Hefty
>
> diff --git a/examples/rping.c b/examples/rping.c
> index 2d4c2de..ee292ec 100644
> --- a/examples/rping.c
> +++ b/examples/rping.c
> @@ -280,12 +280,11 @@ static int rping_cq_event_handler(struct rping_cb *cb)
>  		ret = 0;
>
>  		if (wc.status) {
> -			if (wc.status != IBV_WC_WR_FLUSH_ERR) {
> +			if (wc.status != IBV_WC_WR_FLUSH_ERR)
>  				fprintf(stderr,
>  					"cq completion failed status %d\n",
>  					wc.status);
> -				ret = -1;
> -			}
> +			ret = -1;
>  			goto error;
>  		}
>
> @@ -802,10 +801,9 @@ static void *rping_persistent_server_thread(void *arg)
>
>  	rping_test_server(cb);
>  	rdma_disconnect(cb->child_cm_id);
> +	pthread_join(cb->cqthread, NULL);
>  	rping_free_buffers(cb);
>  	rping_free_qp(cb);
> -	pthread_cancel(cb->cqthread);
> -	pthread_join(cb->cqthread, NULL);
>  	rdma_destroy_id(cb->child_cm_id);
>  	free_cb(cb);
>  	return NULL;
> @@ -890,6 +888,7 @@ static int rping_run_server(struct rping_cb *cb)
>
>  	rping_test_server(cb);
>  	rdma_disconnect(cb->child_cm_id);
> +	pthread_join(cb->cqthread, NULL);
>  	rdma_destroy_id(cb->child_cm_id);
>  err2:
>  	rping_free_buffers(cb);
> @@ -1057,6 +1056,7 @@ static int rping_run_client(struct rping_cb *cb)
>
>  	rping_test_client(cb);
>  	rdma_disconnect(cb->cm_id);
> +	pthread_join(cb->cqthread, NULL);
>  err2:
>  	rping_free_buffers(cb);
>  err1:
Re: [ewg] rping/cxgb3 regression
> I'm wondering if pulling the rping changes for ofed-1.5.3 would be
> ok? I guess to do this you would have to push a 1-off librdmacm
> without those changes? Or maybe back up what is in OFED-1.5.3 to
> the previous release without this rping change?
>
> Thoughts?

Is the commit (93635fa33b41d356fa096242fec4ce788194b42f) below the
issue? (Btw, the author listed in my git tree is wrong.)

I don't think I want to drop back to 1.0.13 for 1.5.3, so maybe
reverting this change and pushing out 1.0.14.1 would work. There's
just one other change after 1.0.14 at the moment, and it's to the
build, so I'd skip a full release for now.

Let me know if you think this would work.

- Sean

---

librdmacm/rping: Make sure CQ event thread exits before destroying the CQ

It is possible for the CQ event thread to poll the CQ after it has been
destroyed which can result in a seg fault on T3 interfaces. This patch
waits for the thread to exit before destroying the CQ.

Signed-off-by: Steve Wise
Signed-off-by: Sean Hefty

diff --git a/examples/rping.c b/examples/rping.c
index 2d4c2de..ee292ec 100644
--- a/examples/rping.c
+++ b/examples/rping.c
@@ -280,12 +280,11 @@ static int rping_cq_event_handler(struct rping_cb *cb)
 		ret = 0;
 
 		if (wc.status) {
-			if (wc.status != IBV_WC_WR_FLUSH_ERR) {
+			if (wc.status != IBV_WC_WR_FLUSH_ERR)
 				fprintf(stderr,
 					"cq completion failed status %d\n",
 					wc.status);
-				ret = -1;
-			}
+			ret = -1;
 			goto error;
 		}
 
@@ -802,10 +801,9 @@ static void *rping_persistent_server_thread(void *arg)
 
 	rping_test_server(cb);
 	rdma_disconnect(cb->child_cm_id);
+	pthread_join(cb->cqthread, NULL);
 	rping_free_buffers(cb);
 	rping_free_qp(cb);
-	pthread_cancel(cb->cqthread);
-	pthread_join(cb->cqthread, NULL);
 	rdma_destroy_id(cb->child_cm_id);
 	free_cb(cb);
 	return NULL;
@@ -890,6 +888,7 @@ static int rping_run_server(struct rping_cb *cb)
 
 	rping_test_server(cb);
 	rdma_disconnect(cb->child_cm_id);
+	pthread_join(cb->cqthread, NULL);
 	rdma_destroy_id(cb->child_cm_id);
 err2:
 	rping_free_buffers(cb);
@@ -1057,6 +1056,7 @@ static int rping_run_client(struct rping_cb *cb)
 
 	rping_test_client(cb);
 	rdma_disconnect(cb->cm_id);
+	pthread_join(cb->cqthread, NULL);
 err2:
 	rping_free_buffers(cb);
 err1:
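
For readers following the patch in this thread: the ordering it aims for (let the CQ event thread exit, then free what that thread uses) can be sketched outside of rping. The code below is only an illustration under stated assumptions, not the rping implementation. The names struct fake_cq, cq_thread and run_demo are hypothetical, and a mutex plus condition variable stand in for the real completion channel, so no libibverbs or librdmacm call appears in it.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical stand-in for a completion queue; not a libibverbs type. */
struct fake_cq {
	pthread_mutex_t lock;
	pthread_cond_t cond;
	bool flushed;      /* set when the "connection" is torn down */
	bool worker_done;  /* set by the worker just before it returns */
};

/* Worker: sleep until the flush is signalled, then exit on its own. */
static void *cq_thread(void *arg)
{
	struct fake_cq *cq = arg;

	pthread_mutex_lock(&cq->lock);
	while (!cq->flushed)
		pthread_cond_wait(&cq->cond, &cq->lock);
	cq->worker_done = true;
	pthread_mutex_unlock(&cq->lock);
	return NULL;
}

/* Returns 0 on success, -1 if allocation or thread creation fails. */
static int run_demo(void)
{
	struct fake_cq *cq = calloc(1, sizeof(*cq));
	pthread_t tid;
	int ret = 0;

	if (!cq)
		return -1;
	pthread_mutex_init(&cq->lock, NULL);
	pthread_cond_init(&cq->cond, NULL);

	if (pthread_create(&tid, NULL, cq_thread, cq)) {
		ret = -1;
		goto out;
	}

	/* Analogue of the disconnect step: tell the worker to stop. */
	pthread_mutex_lock(&cq->lock);
	cq->flushed = true;
	pthread_cond_signal(&cq->cond);
	pthread_mutex_unlock(&cq->lock);

	/* Join first: after this returns the worker cannot touch cq. */
	pthread_join(tid, NULL);
	assert(cq->worker_done);

out:
	/* Only now is it safe to destroy the shared object. */
	pthread_cond_destroy(&cq->cond);
	pthread_mutex_destroy(&cq->lock);
	free(cq);
	return ret;
}
```

Calling run_demo() from a main() and linking with -pthread is enough to try it. Note that pthread_join() returns only once the worker exits by itself, so the stop flag has to be set before joining. The discussion above does not spell out the cause of the cxgb3 regression, and this sketch does not try to reproduce it.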