Rolf,
the assert fails because the endpoint reference count is greater than one.
the root cause is that the endpoint has been added to the
eager_rdma_buffers list of the openib btl device (and hence OBJ_RETAIN'ed at
ompi/mca/btl/openib/btl_openib_endpoint.c:1009).
a simple workaround is not to use eager rdma with the openib btl
(e.g. export OMPI_MCA_btl_openib_use_eager_rdma=0)
attached is a patch that solves the issue.
because of my limited understanding of the openib btl, i did not commit it.
i am wondering whether it is safe to simply OBJ_RELEASE the endpoint
(e.g. what happens if there are in-flight messages?)
i also added some comments that indicate where the patch might be suboptimal.
Nathan, could you please review the attached patch ?
please note that if the failing assertion is removed, the endpoint will still be
OBJ_RELEASE'd, but only in the btl finalize.
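to make the reference counting issue concrete, here is a minimal standalone sketch. it is not Open MPI code: toy_endpoint_t, toy_device_t and the toy_* functions are made-up stand-ins for the endpoint, the openib device and OBJ_RETAIN/OBJ_RELEASE, and only the counter is modelled (nothing is ever freed). it shows why the count is 2 when del_procs runs, and how dropping the list's reference brings it back to 1.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_EAGER_MAX 4

/* hypothetical stand-in for an endpoint: only the reference count is modelled */
typedef struct {
    int obj_reference_count;
} toy_endpoint_t;

/* hypothetical stand-in for the device and its eager RDMA list */
typedef struct {
    toy_endpoint_t *eager_rdma_buffers[TOY_EAGER_MAX];
    int eager_rdma_buffers_count;
} toy_device_t;

/* toy equivalents of OBJ_RETAIN / OBJ_RELEASE (no destructor, no free) */
void toy_retain(toy_endpoint_t *ep)  { ep->obj_reference_count++; }
void toy_release(toy_endpoint_t *ep) { ep->obj_reference_count--; }

/* adding an endpoint to the eager RDMA list takes an extra reference */
void toy_eager_add(toy_device_t *dev, toy_endpoint_t *ep)
{
    assert(dev->eager_rdma_buffers_count < TOY_EAGER_MAX);
    toy_retain(ep);
    dev->eager_rdma_buffers[dev->eager_rdma_buffers_count++] = ep;
}

/* same idea as the patch: drop the list's reference and clear the slot.
 * returns the number of slots that were cleared. */
int toy_eager_remove(toy_device_t *dev, toy_endpoint_t *ep)
{
    int j, removed = 0;
    for (j = 0; j < dev->eager_rdma_buffers_count; j++) {
        if (dev->eager_rdma_buffers[j] == ep) {
            toy_release(ep);
            dev->eager_rdma_buffers[j] = NULL;
            removed++;
        }
    }
    return removed;
}
```

starting from a count of 1, toy_eager_add() raises it to 2, which is the state in which the `obj_reference_count == 1` assertion fails; after toy_eager_remove() the count is 1 again. the sketch says nothing about in-flight messages, which is the part i am unsure about.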
Gilles
On Sat, May 24, 2014 at 12:31 AM, Rolf vandeVaart wrote:
> I am still seeing problems with del_procs with openib. Do we believe
> everything should be working? This is with the latest trunk (updated 1
> hour ago).
>
> [rvandevaart@drossetti-ivy0 examples]$ mpirun --mca btl_openib_if_include
> mlx5_0:1 -np 2 -host drossetti-ivy0,drossetti-ivy1 connectivity_c
> Connectivity test on 2 processes PASSED.
> connectivity_c: ../../../../../ompi/mca/btl/openib/btl_openib.c:1151:
> mca_btl_openib_del_procs: Assertion
> `((opal_object_t*)endpoint)->obj_reference_count == 1' failed.
> connectivity_c: ../../../../../ompi/mca/btl/openib/btl_openib.c:1151:
> mca_btl_openib_del_procs: Assertion
> `((opal_object_t*)endpoint)->obj_reference_count == 1' failed.
> --
> mpirun noticed that process rank 1 with PID 28443 on node drossetti-ivy1
> exited on signal 11 (Segmentation fault).
> --
> [rvandevaart@drossetti-ivy0 examples]$
>
> Link to this post:
> http://www.open-mpi.org/community/lists/devel/2014/05/14836.php
>
Index: ompi/mca/btl/openib/btl_openib.c
===
--- ompi/mca/btl/openib/btl_openib.c (revision 31888)
+++ ompi/mca/btl/openib/btl_openib.c (working copy)
@@ -1128,7 +1128,7 @@
struct ompi_proc_t **procs,
struct mca_btl_base_endpoint_t ** peers)
{
-int i,ep_index;
+int i, ep_index;
mca_btl_openib_module_t* openib_btl = (mca_btl_openib_module_t*) btl;
mca_btl_openib_endpoint_t* endpoint;
@@ -1144,8 +1144,19 @@
continue;
}
if (endpoint == del_endpoint) {
+int j;
BTL_VERBOSE(("in del_procs %d, setting another endpoint to null",
ep_index));
+/* remove the endpoint from eager_rdma_buffers */
+for (j=0; j<openib_btl->device->eager_rdma_buffers_count; j++) {
+if (openib_btl->device->eager_rdma_buffers[j] == endpoint) {
+/* should it be obj_reference_count == 2 ? */
+assert(((opal_object_t*)endpoint)->obj_reference_count > 1);
+OBJ_RELEASE(endpoint);
+openib_btl->device->eager_rdma_buffers[j] = NULL;
+/* can we simply break and leave the for loop ? */
+}
+}
opal_pointer_array_set_item(openib_btl->device->endpoints,
ep_index, NULL);
assert(((opal_object_t*)endpoint)->obj_reference_count == 1);