Re: [OMPI devel] Still problems with del_procs in trunk

2014-05-27 Thread Nathan Hjelm
On Mon, May 26, 2014 at 12:09:38PM +0900, Gilles Gouaillardet wrote:
>Rolf,
> 
>the assert fails because the endpoint reference count is greater than one.
>the root cause is the endpoint has been added to the list of
>eager_rdma_buffers of the openib btl device (and hence OBJ_RETAIN'ed at
>ompi/mca/btl/openib/btl_openib_endpoint.c:1009)
> 
>a simple workaround is not to use eager rdma with the openib btl
>(e.g. export OMPI_MCA_btl_openib_use_eager_rdma=0)
> 
>here is attached a patch that solves the issue.
> 
>because of my poor understanding of the openib btl, i did not commit it.
>i am wondering whether it is safe to simply OBJ_RELEASE the endpoint
>(e.g. what happens if there are in-flight messages?)
>i also added some comments that indicate the patch might be suboptimal.

It should be safe, as there should be no in-flight messages at del_procs. If
there are, an error would probably be raised on the sending process.

>Nathan, could you please review the attached patch ?

Sure. I will take a look. It doesn't surprise me there are these sorts
of issues in del_procs. The functionality has been broken for some time.

-Nathan




Re: [OMPI devel] Still problems with del_procs in trunk

2014-05-25 Thread Gilles Gouaillardet
Rolf,

the assert fails because the endpoint reference count is greater than one.
the root cause is the endpoint has been added to the list of
eager_rdma_buffers of the openib btl device (and hence OBJ_RETAIN'ed at
ompi/mca/btl/openib/btl_openib_endpoint.c:1009)

a simple workaround is not to use eager rdma with the openib btl
(e.g. export OMPI_MCA_btl_openib_use_eager_rdma=0)

here is attached a patch that solves the issue.

because of my poor understanding of the openib btl, i did not commit it.
i am wondering whether it is safe to simply OBJ_RELEASE the endpoint
(e.g. what happens if there are in-flight messages?)
i also added some comments that indicate the patch might be suboptimal.

Nathan, could you please review the attached patch ?

please note that if the faulty assertion is removed, the endpoint will be
OBJ_RELEASE'd, but only in the btl finalize.

Gilles
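
The reference-count problem described in the message above can be modeled in a few lines of plain C. This is a toy sketch, not Open MPI code: `toy_endpoint_t`, `toy_device_t`, `toy_retain`, `toy_add_eager` and `toy_del_procs_check` are hypothetical stand-ins for the OPAL object macros and the openib device structure, assumed here only to show how the extra reference taken by the eager RDMA list makes the `== 1` check fail.

```c
#include <assert.h>
#include <stddef.h>

/* Toy stand-ins (hypothetical names, not the OPAL or openib API). */
typedef struct {
    int refcount;                 /* plays the role of obj_reference_count */
} toy_endpoint_t;

#define TOY_MAX_EAGER 4

typedef struct {
    toy_endpoint_t *eager_rdma_buffers[TOY_MAX_EAGER];
    int eager_rdma_buffers_count;
} toy_device_t;

/* Stands in for OBJ_RETAIN: take one more reference. */
void toy_retain(toy_endpoint_t *ep)
{
    ep->refcount++;
}

/* Adding the endpoint to the eager RDMA list takes an extra reference,
 * as the message says happens in btl_openib_endpoint.c.
 * Returns 0 on success, -1 if the toy list is full. */
int toy_add_eager(toy_device_t *dev, toy_endpoint_t *ep)
{
    if (dev->eager_rdma_buffers_count >= TOY_MAX_EAGER) {
        return -1;
    }
    dev->eager_rdma_buffers[dev->eager_rdma_buffers_count++] = ep;
    toy_retain(ep);
    return 0;
}

/* Returns 1 if a del_procs-style check "refcount == 1" would hold,
 * 0 if the assertion would fail. */
int toy_del_procs_check(const toy_endpoint_t *ep)
{
    return ep->refcount == 1;
}
```

In this model, an endpoint created with a count of 1 passes `toy_del_procs_check` until `toy_add_eager` is called on it; afterwards the count is 2 and the check fails, which corresponds to the failed assertion in the report below.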



On Sat, May 24, 2014 at 12:31 AM, Rolf vandeVaart wrote:

> I am still seeing problems with del_procs with openib.  Do we believe
> everything should be working?  This is with the latest trunk (updated 1
> hour ago).
>
> [rvandevaart@drossetti-ivy0 examples]$ mpirun --mca btl_openib_if_include
> mlx5_0:1 -np 2 -host drossetti-ivy0,drossetti-ivy1 connectivity_c
> Connectivity test on 2 processes PASSED.
> connectivity_c: ../../../../../ompi/mca/btl/openib/btl_openib.c:1151:
> mca_btl_openib_del_procs: Assertion
> `((opal_object_t*)endpoint)->obj_reference_count == 1' failed.
> connectivity_c: ../../../../../ompi/mca/btl/openib/btl_openib.c:1151:
> mca_btl_openib_del_procs: Assertion
> `((opal_object_t*)endpoint)->obj_reference_count == 1' failed.
> --
> mpirun noticed that process rank 1 with PID 28443 on node drossetti-ivy1
> exited on signal 11 (Segmentation fault).
> --
> [rvandevaart@drossetti-ivy0 examples]$
>
> ___
> devel mailing list
> de...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> Link to this post:
> http://www.open-mpi.org/community/lists/devel/2014/05/14836.php
>
Index: ompi/mca/btl/openib/btl_openib.c
===================================================================
--- ompi/mca/btl/openib/btl_openib.c	(revision 31888)
+++ ompi/mca/btl/openib/btl_openib.c	(working copy)
@@ -1128,7 +1128,7 @@
         struct ompi_proc_t **procs,
         struct mca_btl_base_endpoint_t ** peers)
 {
-    int i,ep_index;
+    int i, ep_index;
     mca_btl_openib_module_t* openib_btl = (mca_btl_openib_module_t*) btl;
     mca_btl_openib_endpoint_t* endpoint;
 
@@ -1144,8 +1144,19 @@
                 continue;
             }
             if (endpoint == del_endpoint) {
+                int j;
                 BTL_VERBOSE(("in del_procs %d, setting another endpoint to null",
                              ep_index));
+                /* remove the endpoint from eager_rdma_buffers */
+                for (j=0; j<openib_btl->device->eager_rdma_buffers_count; j++) {
+                    if (openib_btl->device->eager_rdma_buffers[j] == endpoint) {
+                        /* should it be obj_reference_count == 2 ? */
+                        assert(((opal_object_t*)endpoint)->obj_reference_count > 1);
+                        OBJ_RELEASE(endpoint);
+                        openib_btl->device->eager_rdma_buffers[j] = NULL;
+                        /* can we simply break and leave the for loop ? */
+                    }
+                }
                 opal_pointer_array_set_item(openib_btl->device->endpoints,
                                             ep_index, NULL);
                 assert(((opal_object_t*)endpoint)->obj_reference_count == 1);
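
The loop added by the patch can be exercised on a toy model. The following is a minimal sketch in plain C, assuming hypothetical stand-in types (`toy_endpoint_t`, `toy_device_t`) and a hypothetical helper `toy_remove_from_eager` rather than the real openib structures. It only illustrates that dropping the list's reference and clearing the slot brings the count back to 1 before the final check; it does not settle whether the real list can hold the same endpoint more than once, which is the `break` question raised in the patch comment.

```c
#include <assert.h>
#include <stddef.h>

/* Toy stand-ins (hypothetical names, not the OPAL or openib API). */
typedef struct {
    int refcount;                 /* plays the role of obj_reference_count */
} toy_endpoint_t;

#define TOY_MAX_EAGER 4

typedef struct {
    toy_endpoint_t *eager_rdma_buffers[TOY_MAX_EAGER];
    int eager_rdma_buffers_count;
} toy_device_t;

/* Mirrors the patched loop: for every slot that points at ep, drop the
 * reference held by the list and set the slot to NULL.
 * Like the patch, it does not shrink eager_rdma_buffers_count.
 * Returns the number of slots that were cleared. */
int toy_remove_from_eager(toy_device_t *dev, toy_endpoint_t *ep)
{
    int j;
    int cleared = 0;

    for (j = 0; j < dev->eager_rdma_buffers_count; j++) {
        if (dev->eager_rdma_buffers[j] == ep) {
            assert(ep->refcount > 1);
            ep->refcount--;       /* stands in for OBJ_RELEASE */
            dev->eager_rdma_buffers[j] = NULL;
            cleared++;
        }
    }
    return cleared;
}
```

With two endpoints in the list, each holding a count of 2, calling `toy_remove_from_eager` on the first one clears exactly one slot, leaves its count at 1, and does not touch the second endpoint. A second call on the same endpoint finds nothing to clear.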


[OMPI devel] Still problems with del_procs in trunk

2014-05-23 Thread Rolf vandeVaart
I am still seeing problems with del_procs with openib.  Do we believe 
everything should be working?  This is with the latest trunk (updated 1 hour 
ago).

[rvandevaart@drossetti-ivy0 examples]$ mpirun --mca btl_openib_if_include mlx5_0:1 -np 2 -host drossetti-ivy0,drossetti-ivy1 connectivity_c
Connectivity test on 2 processes PASSED.
connectivity_c: ../../../../../ompi/mca/btl/openib/btl_openib.c:1151: mca_btl_openib_del_procs: Assertion `((opal_object_t*)endpoint)->obj_reference_count == 1' failed.
connectivity_c: ../../../../../ompi/mca/btl/openib/btl_openib.c:1151: mca_btl_openib_del_procs: Assertion `((opal_object_t*)endpoint)->obj_reference_count == 1' failed.
--
mpirun noticed that process rank 1 with PID 28443 on node drossetti-ivy1 exited on signal 11 (Segmentation fault).
--
[rvandevaart@drossetti-ivy0 examples]$