I’m seeing similar failures on master from several of the collective tests.
Here is the stack I see on all of them:

(gdb) where
#0  0x00007fe49931a5d7 in raise () from /usr/lib64/libc.so.6
#1  0x00007fe49931be08 in abort () from /usr/lib64/libc.so.6
#2  0x00007fe49935ae07 in __libc_message () from /usr/lib64/libc.so.6
#3  0x00007fe4993621fd in _int_free () from /usr/lib64/libc.so.6
#4  0x00007fe498cfec95 in opal_list_destruct (list=0x25b06d0) at class/opal_list.c:108
#5  0x00007fe48f0d0fb0 in opal_obj_run_destructors (object=0x25b06d0) at ../../../../opal/class/opal_object.h:460
#6  0x00007fe48f0d132a in mca_pml_ob1_comm_proc_destruct (proc=0x25b05a0) at pml_ob1_comm.c:42
#7  0x00007fe48f0d0fb0 in opal_obj_run_destructors (object=0x25b05a0) at ../../../../opal/class/opal_object.h:460
#8  0x00007fe48f0d17c7 in mca_pml_ob1_comm_destruct (comm=0x25a0b40) at pml_ob1_comm.c:71
#9  0x00007fe48f0cdcd5 in opal_obj_run_destructors (object=0x25a0b40) at ../../../../opal/class/opal_object.h:460
#10 0x00007fe48f0cfb05 in mca_pml_ob1_del_comm (comm=0x259db90) at pml_ob1.c:277
#11 0x00007fe4998ef19f in ompi_comm_destruct (comm=0x259db90) at communicator/comm_init.c:418
#12 0x00007fe4998efa02 in opal_obj_run_destructors (object=0x259db90) at ../opal/class/opal_object.h:460
#13 0x00007fe4998f2bed in ompi_comm_free (comm=0x7ffdb43a6940) at communicator/comm.c:1532
#14 0x00007fe49993c858 in PMPI_Comm_disconnect (comm=0x7ffdb43a6940) at pcomm_disconnect.c:75
#15 0x00000000004014a6 in main (argc=1, argv=0x7ffdb43a6a58) at ibarrier_inter.c:68


This is with 16 procs on 2 nodes. Any ideas?
Ralph


> On Oct 27, 2015, at 12:32 PM, Ralph Castain <r...@open-mpi.org> wrote:
> 
> Anyone have an idea of what this is all about?
> 
> >> Command: mpirun     --hostfile /home/common/hosts -np 16 --prefix /home/common/openmpi/build/foobar/ collective/alltoall_in_place
>    Elapsed:       00:00:00 0.00u 0.00s
>    Test: alltoall_in_place, np=16, variant=1: Passed
> *** Error in `collective/alltoallv_somezeros': free(): invalid pointer: 0x000000000127a180 ***
> ======= Backtrace: =========
> /usr/lib64/libc.so.6(+0x7d1fd)[0x7f46e2fda1fd]
> /home/common/openmpi/build/foobar/lib/libopen-pal.so.0(+0x2cd05)[0x7f46e2976d05]
> /home/common/openmpi/build/foobar/lib/openmpi/mca_pml_ob1.so(+0x6f74)[0x7f46dcefaf74]
> /home/common/openmpi/build/foobar/lib/openmpi/mca_pml_ob1.so(+0x72ee)[0x7f46dcefb2ee]
> /home/common/openmpi/build/foobar/lib/openmpi/mca_pml_ob1.so(+0x6f74)[0x7f46dcefaf74]
> /home/common/openmpi/build/foobar/lib/openmpi/mca_pml_ob1.so(+0x76e8)[0x7f46dcefb6e8]
> /home/common/openmpi/build/foobar/lib/openmpi/mca_pml_ob1.so(+0x3c73)[0x7f46dcef7c73]
> /home/common/openmpi/build/foobar/lib/openmpi/mca_pml_ob1.so(mca_pml_ob1_del_comm+0xcf)[0x7f46dcef9acc]
> /home/common/openmpi/build/foobar/lib/libmpi.so.0(+0x2d1df)[0x7f46e35671df]
> /home/common/openmpi/build/foobar/lib/libmpi.so.0(+0x2b473)[0x7f46e3565473]
> /home/common/openmpi/build/foobar/lib/libmpi.so.0(ompi_comm_finalize+0x23f)[0x7f46e3566bbd]
> /home/common/openmpi/build/foobar/lib/libmpi.so.0(ompi_mpi_finalize+0x5fd)[0x7f46e3593df7]
> /home/common/openmpi/build/foobar/lib/libmpi.so.0(PMPI_Finalize+0x59)[0x7f46e35bd6e5]
> 
> Then I see a bunch of dump info, followed by:
> 
> >> Command: mpirun     --hostfile /home/common/hosts -np 16 --prefix /home/common/openmpi/build/foobar/ collective/alltoallv_somezeros
>    Elapsed:       00:00:01 0.00u 0.00s
>    Test: alltoallv_somezeros, np=16, variant=1: Passed
> 
> 
> 
> Ralph
> 
> 
