Yes, it fails on the current trunk (r29112).  That is what started me on the 
journey to figure out when things went wrong.  It was working up until r29058.

From: devel [mailto:devel-boun...@open-mpi.org] On Behalf Of Ralph Castain
Sent: Tuesday, September 03, 2013 2:49 PM
To: Open MPI Developers
Subject: Re: [OMPI devel] openib BTL problems with ORTE async changes

Are you all the way up to the current trunk? There have been a few typo fixes 
since the original commit.

I'm not familiar with the OOB connect code in openib. The OOB itself isn't
using free lists, so I suspect it is something up in the OOB connect code
itself. I'll take a look and see if something leaps out at me - it seems to be
working fine on IU's odin cluster, which is the only IB-based system I can
access.


On Sep 3, 2013, at 11:34 AM, Rolf vandeVaart 
<rvandeva...@nvidia.com<mailto:rvandeva...@nvidia.com>> wrote:


As mentioned in the weekly conference call, I am seeing some strange errors 
when using the openib BTL.  I have narrowed down the changeset that broke 
things to the ORTE async code.

https://svn.open-mpi.org/trac/ompi/changeset/29058  (and 
https://svn.open-mpi.org/trac/ompi/changeset/29061 which was needed to fix 
compile errors)

Changeset 29057 does not have these issues.  I do not have a very good
characterization of the failures, as they are not consistent.  Sometimes the
tests pass, and sometimes the stack trace differs between runs.  The failures
seem to happen more often with larger np, like np=4 and above.

The first failure mode is a segmentation violation, and it always seems to
occur while we are trying to pop something off a free list.  But the upper
parts of the stack trace can vary.  This is with the trunk version 29061.
Ralph, any thoughts on where we go from here?

[rolf@Fermi-Cluster src]$ mpirun -np 4 -host c0-0,c0-1,c0-3,c0-4 
MPI_Irecv_comm_c
MPITEST info  (0): Starting:  MPI_Irecv_comm:
[compute-0-4:04752] *** Process received signal ***
[compute-0-4:04752] Signal: Segmentation fault (11)
[compute-0-4:04752] Signal code: Address not mapped (1)
[compute-0-4:04752] Failing at address: 0x28
--------------------------------------------------------------------------
mpirun noticed that process rank 3 with PID 4752 on node c0-4 exited on signal 
11 (Segmentation fault).
--------------------------------------------------------------------------
[rolf@Fermi-Cluster src]$ gdb MPI_Irecv_comm_c core.4752
GNU gdb Fedora (6.8-27.el5)
Copyright (C) 2008 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
Core was generated by `MPI_Irecv_comm_c'.
Program terminated with signal 11, Segmentation fault.
[New process 4753]
[New process 4756]
[New process 4755]
[New process 4754]
[New process 4752]
#0  0x00002aaaad6ecad6 in opal_atomic_lifo_pop (lifo=0x5996940) at 
../../../../../opal/class/opal_atomic_lifo.h:111
111             lifo->opal_lifo_head = (opal_list_item_t*)item->opal_list_next;
(gdb) where
#0  0x00002aaaad6ecad6 in opal_atomic_lifo_pop (lifo=0x5996940) at 
../../../../../opal/class/opal_atomic_lifo.h:111
#1  0x00002aaaad6ec5b4 in __ompi_free_list_wait_mt (fl=0x5996940, 
item=0x40ea8d50) at ../../../../../ompi/class/ompi_free_list.h:228
#2  0x00002aaaad6ec3f8 in post_recvs (ep=0x59f3120, qp=0, num_post=256) at 
../../../../../ompi/mca/btl/openib/btl_openib_endpoint.h:361
#3  0x00002aaaad6ec1ae in mca_btl_openib_endpoint_post_rr_nolock (ep=0x59f3120, 
qp=0)
    at ../../../../../ompi/mca/btl/openib/btl_openib_endpoint.h:405
#4  0x00002aaaad6ebfad in mca_btl_openib_endpoint_post_recvs 
(endpoint=0x59f3120)
    at ../../../../../ompi/mca/btl/openib/btl_openib_endpoint.c:494
#5  0x00002aaaad6fe71c in qp_create_all (endpoint=0x59f3120) at 
../../../../../ompi/mca/btl/openib/connect/btl_openib_connect_oob.c:432
#6  0x00002aaaad6fde2b in reply_start_connect (endpoint=0x59f3120, 
rem_info=0x40ea8ed0)
    at ../../../../../ompi/mca/btl/openib/connect/btl_openib_connect_oob.c:245
#7  0x00002aaaad7006ae in rml_recv_cb (status=0, process_name=0x5b0bb90, 
buffer=0x40ea8f80, tag=102, cbdata=0x0)
    at ../../../../../ompi/mca/btl/openib/connect/btl_openib_connect_oob.c:858
#8  0x00002ae802454601 in orte_rml_base_process_msg (fd=-1, flags=4, 
cbdata=0x5b0bac0)
    at ../../../../orte/mca/rml/base/rml_base_msg_handlers.c:172
#9  0x00002ae8027164a1 in event_process_active_single_queue (base=0x58ac620, 
activeq=0x58aa5b0)
    at ../../../../../../opal/mca/event/libevent2021/libevent/event.c:1367
#10 0x00002ae802716b24 in event_process_active (base=0x58ac620) at 
../../../../../../opal/mca/event/libevent2021/libevent/event.c:1437
#11 0x00002ae80271715c in opal_libevent2021_event_base_loop (base=0x58ac620, 
flags=1)
    at ../../../../../../opal/mca/event/libevent2021/libevent/event.c:1645
#12 0x00002ae8023e7465 in orte_progress_thread_engine (obj=0x2ae8026902c0) at 
../../orte/runtime/orte_init.c:180
#13 0x0000003ab1e06367 in start_thread () from /lib64/libpthread.so.0
#14 0x0000003ab16d2f7d in clone () from /lib64/libc.so.6
(gdb)
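For what it's worth, the crash site and the faulting address are consistent with a NULL (or stale) item coming out of the LIFO head. Here is a minimal sketch of the unguarded head-advance pattern - hypothetical types, not the actual OMPI `opal_atomic_lifo` code - showing why a drained or raced list faults at a small offset like 0x28 (the offset of the next pointer within the item):

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical minimal types sketching the pattern; the real code lives in
 * opal/class/opal_atomic_lifo.h and uses opal_list_item_t. */
typedef struct item { struct item *next; } item_t;
typedef struct { item_t *head; } lifo_t;

static void lifo_push(lifo_t *l, item_t *i) {
    i->next = l->head;
    l->head = i;
}

/* Unguarded pop mirroring the crashing line
 *   lifo->opal_lifo_head = (opal_list_item_t*)item->opal_list_next;
 * If head is NULL -- list drained, or another thread (the new ORTE
 * progress thread?) raced the update -- then item->next reads from a
 * small offset off address 0, matching "Failing at address: 0x28". */
static item_t *lifo_pop(lifo_t *l) {
    item_t *item = l->head;
    l->head = item->next;   /* crash site when item is NULL or stale */
    return item;
}
```

If the free-list growth and this pop are now reachable from two threads without the list lock, that would explain both the intermittency and the np sensitivity.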

_______________________________________________
devel mailing list
de...@open-mpi.org<mailto:de...@open-mpi.org>
http://www.open-mpi.org/mailman/listinfo.cgi/devel
