Also, send me your test code - maybe that is required to trigger it

On Sep 3, 2013, at 12:19 PM, Ralph Castain <r...@open-mpi.org> wrote:

> Dang - I just finished running it on odin without a problem. Are you seeing 
> this with a debug or optimized build?
> 
> 
> On Sep 3, 2013, at 12:16 PM, Rolf vandeVaart <rvandeva...@nvidia.com> wrote:
> 
>> Yes, it fails on the current trunk (r29112).  That is what started me on the 
>> journey to figure out when things went wrong.  It was working up until 
>> r29058.
>>  
>> From: devel [mailto:devel-boun...@open-mpi.org] On Behalf Of Ralph Castain
>> Sent: Tuesday, September 03, 2013 2:49 PM
>> To: Open MPI Developers
>> Subject: Re: [OMPI devel] openib BTL problems with ORTE async changes
>>  
>> Are you all the way up to the current trunk? There have been a few typo 
>> fixes since the original commit.
>>  
>> I'm not familiar with the OOB connect code in openib. The OOB itself isn't 
>> using free lists, so I suspect it is something up in the OOB connect code 
>> itself. I'll take a look and see if something leaps out at me - it seems to 
>> be working fine on IU's odin cluster, which is the only IB-based system I 
>> can access.
>>  
>>  
>> On Sep 3, 2013, at 11:34 AM, Rolf vandeVaart <rvandeva...@nvidia.com> wrote:
>> 
>> 
>> As mentioned in the weekly conference call, I am seeing some strange errors 
>> when using the openib BTL.  I have narrowed down the changeset that broke 
>> things to the ORTE async code.
>>  
>> https://svn.open-mpi.org/trac/ompi/changeset/29058  (and 
>> https://svn.open-mpi.org/trac/ompi/changeset/29061 which was needed to fix 
>> compile errors)
>>  
>> Changeset 29057 does not have these issues.  I do not have a very good 
>> characterization of the failures, and they are not consistent: sometimes 
>> the tests pass, and sometimes the stack trace is different.  They seem to 
>> happen more often with larger np, like np=4 and up.
>>  
>> The first failure mode is a segmentation violation, and it always seems to 
>> be that we are trying to pop something off a free list, though the upper 
>> parts of the stack trace can vary.  This is with the trunk at r29061.
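>>  
>> One possibility, given that r29058 moved ORTE message processing into its 
>> own progress thread: the free-list pop in post_recvs() may be racing with 
>> the main thread.  As a rough sketch of what a non-atomic pop does (stand-in 
>> types here, not the actual code in opal/class/opal_atomic_lifo.h):
>>  
>>     #include <stddef.h>
>>  
>>     /* stand-in types for illustration only; the real ones are
>>      * opal_list_item_t and opal_atomic_lifo_t in opal/class/ */
>>     typedef struct item { struct item *opal_list_next; } item_t;
>>     typedef struct { item_t *opal_lifo_head; } lifo_t;
>>  
>>     /* sketch of a non-atomic LIFO pop: load the head, then advance it
>>      * by dereferencing the popped item */
>>     static item_t *lifo_pop_sketch(lifo_t *lifo)
>>     {
>>         item_t *item = lifo->opal_lifo_head;
>>         if (NULL == item) {
>>             return NULL;                /* list is empty */
>>         }
>>         /* if another thread pops or pushes between the load above and
>>          * this read, "item" can be stale and the dereference faults */
>>         lifo->opal_lifo_head = item->opal_list_next;
>>         return item;
>>     }
>>  
>> If two threads execute that pop concurrently, one can dereference an item 
>> the other has already taken, which would match the intermittent segfaults 
>> in the traces below.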
>> Ralph, any thoughts on where we go from here?
>>  
>> [rolf@Fermi-Cluster src]$ mpirun -np 4 -host c0-0,c0-1,c0-3,c0-4 MPI_Irecv_comm_c
>> MPITEST info  (0): Starting:  MPI_Irecv_comm:   
>> [compute-0-4:04752] *** Process received signal ***
>> [compute-0-4:04752] Signal: Segmentation fault (11)
>> [compute-0-4:04752] Signal code: Address not mapped (1)
>> [compute-0-4:04752] Failing at address: 0x28
>> --------------------------------------------------------------------------
>> mpirun noticed that process rank 3 with PID 4752 on node c0-4 exited on 
>> signal 11 (Segmentation fault).
>> --------------------------------------------------------------------------
>> [rolf@Fermi-Cluster src]$ gdb MPI_Irecv_comm_c core.4752
>> GNU gdb Fedora (6.8-27.el5)
>> Copyright (C) 2008 Free Software Foundation, Inc.
>> License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
>> Core was generated by `MPI_Irecv_comm_c'.
>> Program terminated with signal 11, Segmentation fault.
>> [New process 4753]
>> [New process 4756]
>> [New process 4755]
>> [New process 4754]
>> [New process 4752]
>> #0  0x00002aaaad6ecad6 in opal_atomic_lifo_pop (lifo=0x5996940) at ../../../../../opal/class/opal_atomic_lifo.h:111
>> 111             lifo->opal_lifo_head = (opal_list_item_t*)item->opal_list_next;
>> (gdb) where
>> #0  0x00002aaaad6ecad6 in opal_atomic_lifo_pop (lifo=0x5996940) at ../../../../../opal/class/opal_atomic_lifo.h:111
>> #1  0x00002aaaad6ec5b4 in __ompi_free_list_wait_mt (fl=0x5996940, item=0x40ea8d50) at ../../../../../ompi/class/ompi_free_list.h:228
>> #2  0x00002aaaad6ec3f8 in post_recvs (ep=0x59f3120, qp=0, num_post=256) at ../../../../../ompi/mca/btl/openib/btl_openib_endpoint.h:361
>> #3  0x00002aaaad6ec1ae in mca_btl_openib_endpoint_post_rr_nolock (ep=0x59f3120, qp=0)
>>     at ../../../../../ompi/mca/btl/openib/btl_openib_endpoint.h:405
>> #4  0x00002aaaad6ebfad in mca_btl_openib_endpoint_post_recvs (endpoint=0x59f3120)
>>     at ../../../../../ompi/mca/btl/openib/btl_openib_endpoint.c:494
>> #5  0x00002aaaad6fe71c in qp_create_all (endpoint=0x59f3120) at ../../../../../ompi/mca/btl/openib/connect/btl_openib_connect_oob.c:432
>> #6  0x00002aaaad6fde2b in reply_start_connect (endpoint=0x59f3120, rem_info=0x40ea8ed0)
>>     at ../../../../../ompi/mca/btl/openib/connect/btl_openib_connect_oob.c:245
>> #7  0x00002aaaad7006ae in rml_recv_cb (status=0, process_name=0x5b0bb90, buffer=0x40ea8f80, tag=102, cbdata=0x0)
>>     at ../../../../../ompi/mca/btl/openib/connect/btl_openib_connect_oob.c:858
>> #8  0x00002ae802454601 in orte_rml_base_process_msg (fd=-1, flags=4, cbdata=0x5b0bac0)
>>     at ../../../../orte/mca/rml/base/rml_base_msg_handlers.c:172
>> #9  0x00002ae8027164a1 in event_process_active_single_queue (base=0x58ac620, activeq=0x58aa5b0)
>>     at ../../../../../../opal/mca/event/libevent2021/libevent/event.c:1367
>> #10 0x00002ae802716b24 in event_process_active (base=0x58ac620) at ../../../../../../opal/mca/event/libevent2021/libevent/event.c:1437
>> #11 0x00002ae80271715c in opal_libevent2021_event_base_loop (base=0x58ac620, flags=1)
>>     at ../../../../../../opal/mca/event/libevent2021/libevent/event.c:1645
>> #12 0x00002ae8023e7465 in orte_progress_thread_engine (obj=0x2ae8026902c0) at ../../orte/runtime/orte_init.c:180
>> #13 0x0000003ab1e06367 in start_thread () from /lib64/libpthread.so.0
>> #14 0x0000003ab16d2f7d in clone () from /lib64/libc.so.6
>> (gdb)
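>>  
>> (If I am reading the trace right, two things stand out.  Frame #12 shows 
>> the RML callback running in orte_progress_thread_engine - the ORTE progress 
>> thread added by the async changes - rather than in the main thread.  And 
>> "Failing at address: 0x28" is a small offset from NULL, which is what you 
>> would expect if the pop dereferenced a junk or NULL item pointer to read 
>> its opal_list_next field.)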
>>  