Dang - I just finished running it on odin without a problem. Are you seeing this with a debug or optimized build?
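FWIW, the failing line in your trace is the classic non-atomic head update in a LIFO pop. Here is a stripped-down sketch of that pattern - simplified types, not the actual code in opal/class/opal_atomic_lifo.h - showing why two threads hitting the same free list can leave the popped item stale or NULL and fault at a small offset like 0x28 (roughly where opal_list_next would sit inside the item):

/* race_sketch.c - illustrative only; simplified types, NOT the actual
 * opal_atomic_lifo.h code.  Two threads pop/push a shared LIFO whose
 * pop does a plain (non-atomic) read-then-dereference of the head.
 * Build: gcc -std=c99 -O2 -pthread race_sketch.c */
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>

typedef struct item {
    struct item *next;
} item_t;

typedef struct {
    item_t *head;
} lifo_t;

/* Same shape as the failing line (opal_atomic_lifo.h:111): read the
 * head, then dereference it to advance.  With two concurrent callers,
 * both can read the same head, and the loser dereferences an item
 * that was already popped - or a NULL/stale head - i.e. a fault at
 * the small offset of 'next' inside the item. */
static item_t *lifo_pop(lifo_t *l)
{
    item_t *it = l->head;
    if (NULL == it) return NULL;
    l->head = it->next;            /* the crash site in your trace */
    return it;
}

static void lifo_push(lifo_t *l, item_t *it)
{
    it->next = l->head;
    l->head = it;
}

static lifo_t freelist;

static void *worker(void *arg)
{
    (void) arg;
    for (int i = 0; i < 1000000; i++) {
        item_t *it = lifo_pop(&freelist);
        if (NULL != it) lifo_push(&freelist, it);
    }
    return NULL;
}

int main(void)
{
    /* seed the list, much as endpoint setup seeds the recv free list */
    for (int i = 0; i < 64; i++)
        lifo_push(&freelist, malloc(sizeof(item_t)));

    pthread_t t1, t2;
    pthread_create(&t1, NULL, worker, NULL);
    pthread_create(&t2, NULL, worker, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    printf("done - the race is timing-dependent; a clean run proves nothing\n");
    return 0;
}

Running that under helgrind (valgrind --tool=helgrind ./a.out) flags the race even on runs where nothing visibly breaks, which would also fit the intermittent behavior you are describing.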
On Sep 3, 2013, at 12:16 PM, Rolf vandeVaart <rvandeva...@nvidia.com> wrote:

> Yes, it fails on the current trunk (r29112). That is what started me on the
> journey to figure out when things went wrong. It was working up until r29058.
>
> From: devel [mailto:devel-boun...@open-mpi.org] On Behalf Of Ralph Castain
> Sent: Tuesday, September 03, 2013 2:49 PM
> To: Open MPI Developers
> Subject: Re: [OMPI devel] openib BTL problems with ORTE async changes
>
> Are you all the way up to the current trunk? There have been a few typo
> fixes since the original commit.
>
> I'm not familiar with the OOB connect code in openib. The OOB itself isn't
> using free lists, so I suspect it is something up in the OOB connect code
> itself. I'll take a look and see if something leaps out at me - it seems to
> be working fine on IU's odin cluster, which is the only IB-based system I
> can access.
>
> On Sep 3, 2013, at 11:34 AM, Rolf vandeVaart <rvandeva...@nvidia.com> wrote:
>
> As mentioned in the weekly conference call, I am seeing some strange errors
> when using the openib BTL. I have narrowed down the changeset that broke
> things to the ORTE async code:
>
> https://svn.open-mpi.org/trac/ompi/changeset/29058 (and
> https://svn.open-mpi.org/trac/ompi/changeset/29061, which was needed to fix
> compile errors)
>
> Changeset 29057 does not have these issues. I do not have a good
> characterization of the failures: they are intermittent, sometimes the
> tests pass, and the stack traces vary between runs. They seem to happen
> more often at larger np (4 and up).
>
> The first failure mode is a segmentation violation, and it always seems to
> be that we are trying to pop something off a free list, although the upper
> frames of the stack trace vary. This is with trunk version r29061.
> Ralph, any thoughts on where we go from here?
>
> [rolf@Fermi-Cluster src]$ mpirun -np 4 -host c0-0,c0-1,c0-3,c0-4 MPI_Irecv_comm_c
> MPITEST info (0): Starting: MPI_Irecv_comm:
> [compute-0-4:04752] *** Process received signal ***
> [compute-0-4:04752] Signal: Segmentation fault (11)
> [compute-0-4:04752] Signal code: Address not mapped (1)
> [compute-0-4:04752] Failing at address: 0x28
> --------------------------------------------------------------------------
> mpirun noticed that process rank 3 with PID 4752 on node c0-4 exited on
> signal 11 (Segmentation fault).
> --------------------------------------------------------------------------
> [rolf@Fermi-Cluster src]$ gdb MPI_Irecv_comm_c core.4752
> GNU gdb Fedora (6.8-27.el5)
> Copyright (C) 2008 Free Software Foundation, Inc.
> License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
> Core was generated by `MPI_Irecv_comm_c'.
> Program terminated with signal 11, Segmentation fault.
> [New process 4753]
> [New process 4756]
> [New process 4755]
> [New process 4754]
> [New process 4752]
> #0  0x00002aaaad6ecad6 in opal_atomic_lifo_pop (lifo=0x5996940)
>     at ../../../../../opal/class/opal_atomic_lifo.h:111
> 111         lifo->opal_lifo_head = (opal_list_item_t*)item->opal_list_next;
> (gdb) where
> #0  0x00002aaaad6ecad6 in opal_atomic_lifo_pop (lifo=0x5996940)
>     at ../../../../../opal/class/opal_atomic_lifo.h:111
> #1  0x00002aaaad6ec5b4 in __ompi_free_list_wait_mt (fl=0x5996940, item=0x40ea8d50)
>     at ../../../../../ompi/class/ompi_free_list.h:228
> #2  0x00002aaaad6ec3f8 in post_recvs (ep=0x59f3120, qp=0, num_post=256)
>     at ../../../../../ompi/mca/btl/openib/btl_openib_endpoint.h:361
> #3  0x00002aaaad6ec1ae in mca_btl_openib_endpoint_post_rr_nolock (ep=0x59f3120, qp=0)
>     at ../../../../../ompi/mca/btl/openib/btl_openib_endpoint.h:405
> #4  0x00002aaaad6ebfad in mca_btl_openib_endpoint_post_recvs (endpoint=0x59f3120)
>     at ../../../../../ompi/mca/btl/openib/btl_openib_endpoint.c:494
> #5  0x00002aaaad6fe71c in qp_create_all (endpoint=0x59f3120)
>     at ../../../../../ompi/mca/btl/openib/connect/btl_openib_connect_oob.c:432
> #6  0x00002aaaad6fde2b in reply_start_connect (endpoint=0x59f3120, rem_info=0x40ea8ed0)
>     at ../../../../../ompi/mca/btl/openib/connect/btl_openib_connect_oob.c:245
> #7  0x00002aaaad7006ae in rml_recv_cb (status=0, process_name=0x5b0bb90,
>     buffer=0x40ea8f80, tag=102, cbdata=0x0)
>     at ../../../../../ompi/mca/btl/openib/connect/btl_openib_connect_oob.c:858
> #8  0x00002ae802454601 in orte_rml_base_process_msg (fd=-1, flags=4, cbdata=0x5b0bac0)
>     at ../../../../orte/mca/rml/base/rml_base_msg_handlers.c:172
> #9  0x00002ae8027164a1 in event_process_active_single_queue (base=0x58ac620, activeq=0x58aa5b0)
>     at ../../../../../../opal/mca/event/libevent2021/libevent/event.c:1367
> #10 0x00002ae802716b24 in event_process_active (base=0x58ac620)
>     at ../../../../../../opal/mca/event/libevent2021/libevent/event.c:1437
> #11 0x00002ae80271715c in opal_libevent2021_event_base_loop (base=0x58ac620, flags=1)
>     at ../../../../../../opal/mca/event/libevent2021/libevent/event.c:1645
> #12 0x00002ae8023e7465 in orte_progress_thread_engine (obj=0x2ae8026902c0)
>     at ../../orte/runtime/orte_init.c:180
> #13 0x0000003ab1e06367 in start_thread () from /lib64/libpthread.so.0
> #14 0x0000003ab16d2f7d in clone () from /lib64/libc.so.6
> (gdb)
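One other thing jumps out of the trace: the openib OOB callback (rml_recv_cb, frame #7) is being delivered on the ORTE progress thread (orte_progress_thread_engine, frame #12), so after the async changes the endpoint's free list can presumably be touched from two threads at once. Could you grab the other threads from the core, plus the state of the free list at the crash? Something like the following - the member names are from memory, so check them with gdb's ptype if they don't resolve:

(gdb) thread apply all bt
(gdb) frame 1
(gdb) print *fl
(gdb) print fl->super.opal_lifo_head

If another thread is sitting in the same free list while the head looks corrupt, that would point at concurrent access rather than a plain use-after-free.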