Sigh - I cannot get it to fail. I've tried up to np=16 without getting a single 
hiccup.

Try a fresh checkout - let's make sure you don't have some old cruft lying 
around.

On Sep 3, 2013, at 12:26 PM, Rolf vandeVaart <rvandeva...@nvidia.com> wrote:

> I am running a debug build.  Here is my configure line:
>  
> ../configure --enable-debug --enable-shared --disable-static 
> --prefix=/home/rolf/ompi-trunk-29061/64 
> --with-wrapper-ldflags='-Wl,-rpath,${prefix}/lib' --disable-vt 
> --enable-orterun-prefix-by-default -disable-io-romio  --enable-picky
>  
> The test program is from the Intel tests in our ompi-tests suite:
> http://svn.open-mpi.org/svn/ompi-tests/trunk/intel_tests/src/MPI_Irecv_comm_c.c
>  
> Run with at least np=4.  The more np, the better.
>  
>  
> From: devel [mailto:devel-boun...@open-mpi.org] On Behalf Of Ralph Castain
> Sent: Tuesday, September 03, 2013 3:22 PM
> To: Open MPI Developers
> Subject: Re: [OMPI devel] openib BTL problems with ORTE async changes
>  
> Also, send me your test code - maybe that is required to trigger it
>  
> On Sep 3, 2013, at 12:19 PM, Ralph Castain <r...@open-mpi.org> wrote:
> 
> 
> Dang - I just finished running it on odin without a problem. Are you seeing 
> this with a debug or optimized build?
>  
>  
> On Sep 3, 2013, at 12:16 PM, Rolf vandeVaart <rvandeva...@nvidia.com> wrote:
> 
> 
> Yes, it fails on the current trunk (r29112).  That is what started me on the 
> journey to figure out when things went wrong.  It was working up until r29058.
>  
> From: devel [mailto:devel-boun...@open-mpi.org] On Behalf Of Ralph Castain
> Sent: Tuesday, September 03, 2013 2:49 PM
> To: Open MPI Developers
> Subject: Re: [OMPI devel] openib BTL problems with ORTE async changes
>  
> Are you all the way up to the current trunk? There have been a few typo fixes 
> since the original commit.
>  
> I'm not familiar with the OOB connect code in openib. The OOB itself isn't 
> using a free list, so I suspect it is something up in the OOB connect code 
> itself. I'll take a look and see if something leaps out at me - it seems to 
> be working fine on IU's odin cluster, which is the only IB-based system I can 
> access.
>  
>  
> On Sep 3, 2013, at 11:34 AM, Rolf vandeVaart <rvandeva...@nvidia.com> wrote:
> 
> 
> 
> As mentioned in the weekly conference call, I am seeing some strange errors 
> when using the openib BTL.  I have narrowed down the changeset that broke 
> things to the ORTE async code.
>  
> https://svn.open-mpi.org/trac/ompi/changeset/29058  (and 
> https://svn.open-mpi.org/trac/ompi/changeset/29061 which was needed to fix 
> compile errors)
>  
> Changeset 29057 does not have these issues.  I do not have a very good 
> characterization of the failures, and they are not consistent: sometimes the 
> tests pass, and sometimes the stack trace differs.  They seem to happen more 
> often with larger np, like np=4 and up.
>  
> The first failure mode is a segmentation violation, and it always seems to 
> happen while we are trying to pop something off a free list.  But the upper 
> parts of the stack trace can vary.  This is with trunk version 29061.
> Ralph, any thoughts on where we go from here?
>  
> [rolf@Fermi-Cluster src]$ mpirun -np 4 -host c0-0,c0-1,c0-3,c0-4 
> MPI_Irecv_comm_c
> MPITEST info  (0): Starting:  MPI_Irecv_comm:   
> [compute-0-4:04752] *** Process received signal ***
> [compute-0-4:04752] Signal: Segmentation fault (11)
> [compute-0-4:04752] Signal code: Address not mapped (1)
> [compute-0-4:04752] Failing at address: 0x28
> --------------------------------------------------------------------------
> mpirun noticed that process rank 3 with PID 4752 on node c0-4 exited on 
> signal 11 (Segmentation fault).
> --------------------------------------------------------------------------
> [rolf@Fermi-Cluster src]$ gdb MPI_Irecv_comm_c core.4752
> GNU gdb Fedora (6.8-27.el5)
> Copyright (C) 2008 Free Software Foundation, Inc.
> License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
> Core was generated by `MPI_Irecv_comm_c'.
> Program terminated with signal 11, Segmentation fault.
> [New process 4753]
> [New process 4756]
> [New process 4755]
> [New process 4754]
> [New process 4752]
> #0  0x00002aaaad6ecad6 in opal_atomic_lifo_pop (lifo=0x5996940) at 
> ../../../../../opal/class/opal_atomic_lifo.h:111
> 111             lifo->opal_lifo_head = 
> (opal_list_item_t*)item->opal_list_next;
> (gdb) where
> #0  0x00002aaaad6ecad6 in opal_atomic_lifo_pop (lifo=0x5996940) at 
> ../../../../../opal/class/opal_atomic_lifo.h:111
> #1  0x00002aaaad6ec5b4 in __ompi_free_list_wait_mt (fl=0x5996940, 
> item=0x40ea8d50) at ../../../../../ompi/class/ompi_free_list.h:228
> #2  0x00002aaaad6ec3f8 in post_recvs (ep=0x59f3120, qp=0, num_post=256) at 
> ../../../../../ompi/mca/btl/openib/btl_openib_endpoint.h:361
> #3  0x00002aaaad6ec1ae in mca_btl_openib_endpoint_post_rr_nolock 
> (ep=0x59f3120, qp=0)
>     at ../../../../../ompi/mca/btl/openib/btl_openib_endpoint.h:405
> #4  0x00002aaaad6ebfad in mca_btl_openib_endpoint_post_recvs 
> (endpoint=0x59f3120)
>     at ../../../../../ompi/mca/btl/openib/btl_openib_endpoint.c:494
> #5  0x00002aaaad6fe71c in qp_create_all (endpoint=0x59f3120) at 
> ../../../../../ompi/mca/btl/openib/connect/btl_openib_connect_oob.c:432
> #6  0x00002aaaad6fde2b in reply_start_connect (endpoint=0x59f3120, 
> rem_info=0x40ea8ed0)
>     at ../../../../../ompi/mca/btl/openib/connect/btl_openib_connect_oob.c:245
> #7  0x00002aaaad7006ae in rml_recv_cb (status=0, process_name=0x5b0bb90, 
> buffer=0x40ea8f80, tag=102, cbdata=0x0)
>     at ../../../../../ompi/mca/btl/openib/connect/btl_openib_connect_oob.c:858
> #8  0x00002ae802454601 in orte_rml_base_process_msg (fd=-1, flags=4, 
> cbdata=0x5b0bac0)
>     at ../../../../orte/mca/rml/base/rml_base_msg_handlers.c:172
> #9  0x00002ae8027164a1 in event_process_active_single_queue (base=0x58ac620, 
> activeq=0x58aa5b0)
>     at ../../../../../../opal/mca/event/libevent2021/libevent/event.c:1367
> #10 0x00002ae802716b24 in event_process_active (base=0x58ac620) at 
> ../../../../../../opal/mca/event/libevent2021/libevent/event.c:1437
> #11 0x00002ae80271715c in opal_libevent2021_event_base_loop (base=0x58ac620, 
> flags=1)
>     at ../../../../../../opal/mca/event/libevent2021/libevent/event.c:1645
> #12 0x00002ae8023e7465 in orte_progress_thread_engine (obj=0x2ae8026902c0) at 
> ../../orte/runtime/orte_init.c:180
> #13 0x0000003ab1e06367 in start_thread () from /lib64/libpthread.so.0
> #14 0x0000003ab16d2f7d in clone () from /lib64/libc.so.6
> (gdb)
>  
> _______________________________________________
> devel mailing list
> de...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/devel
>  
