I’m seeing similar failures on master from several of the collective tests. The stack looks the same in all of them:

(gdb) where
#0  0x00007fe49931a5d7 in raise () from /usr/lib64/libc.so.6
#1  0x00007fe49931be08 in abort () from /usr/lib64/libc.so.6
#2  0x00007fe49935ae07 in __libc_message () from /usr/lib64/libc.so.6
#3  0x00007fe4993621fd in _int_free () from /usr/lib64/libc.so.6
#4  0x00007fe498cfec95 in opal_list_destruct (list=0x25b06d0) at class/opal_list.c:108
#5  0x00007fe48f0d0fb0 in opal_obj_run_destructors (object=0x25b06d0) at ../../../../opal/class/opal_object.h:460
#6  0x00007fe48f0d132a in mca_pml_ob1_comm_proc_destruct (proc=0x25b05a0) at pml_ob1_comm.c:42
#7  0x00007fe48f0d0fb0 in opal_obj_run_destructors (object=0x25b05a0) at ../../../../opal/class/opal_object.h:460
#8  0x00007fe48f0d17c7 in mca_pml_ob1_comm_destruct (comm=0x25a0b40) at pml_ob1_comm.c:71
#9  0x00007fe48f0cdcd5 in opal_obj_run_destructors (object=0x25a0b40) at ../../../../opal/class/opal_object.h:460
#10 0x00007fe48f0cfb05 in mca_pml_ob1_del_comm (comm=0x259db90) at pml_ob1.c:277
#11 0x00007fe4998ef19f in ompi_comm_destruct (comm=0x259db90) at communicator/comm_init.c:418
#12 0x00007fe4998efa02 in opal_obj_run_destructors (object=0x259db90) at ../opal/class/opal_object.h:460
#13 0x00007fe4998f2bed in ompi_comm_free (comm=0x7ffdb43a6940) at communicator/comm.c:1532
#14 0x00007fe49993c858 in PMPI_Comm_disconnect (comm=0x7ffdb43a6940) at pcomm_disconnect.c:75
#15 0x00000000004014a6 in main (argc=1, argv=0x7ffdb43a6a58) at ibarrier_inter.c:68

This is with 16 procs on 2 nodes. Any ideas?

Ralph

> On Oct 27, 2015, at 12:32 PM, Ralph Castain <r...@open-mpi.org> wrote:
>
> Anyone have an idea of what this is all about?
>
> >> Command: mpirun --hostfile /home/common/hosts -np 16 --prefix
> >> /home/common/openmpi/build/foobar/ collective/alltoall_in_place
> Elapsed: 00:00:00 0.00u 0.00s
> Test: alltoall_in_place, np=16, variant=1: Passed
> *** Error in `collective/alltoallv_somezeros': free(): invalid pointer: 0x000000000127a180 ***
> ======= Backtrace: =========
> /usr/lib64/libc.so.6(+0x7d1fd)[0x7f46e2fda1fd]
> /home/common/openmpi/build/foobar/lib/libopen-pal.so.0(+0x2cd05)[0x7f46e2976d05]
> /home/common/openmpi/build/foobar/lib/openmpi/mca_pml_ob1.so(+0x6f74)[0x7f46dcefaf74]
> /home/common/openmpi/build/foobar/lib/openmpi/mca_pml_ob1.so(+0x72ee)[0x7f46dcefb2ee]
> /home/common/openmpi/build/foobar/lib/openmpi/mca_pml_ob1.so(+0x6f74)[0x7f46dcefaf74]
> /home/common/openmpi/build/foobar/lib/openmpi/mca_pml_ob1.so(+0x76e8)[0x7f46dcefb6e8]
> /home/common/openmpi/build/foobar/lib/openmpi/mca_pml_ob1.so(+0x3c73)[0x7f46dcef7c73]
> /home/common/openmpi/build/foobar/lib/openmpi/mca_pml_ob1.so(mca_pml_ob1_del_comm+0xcf)[0x7f46dcef9acc]
> /home/common/openmpi/build/foobar/lib/libmpi.so.0(+0x2d1df)[0x7f46e35671df]
> /home/common/openmpi/build/foobar/lib/libmpi.so.0(+0x2b473)[0x7f46e3565473]
> /home/common/openmpi/build/foobar/lib/libmpi.so.0(ompi_comm_finalize+0x23f)[0x7f46e3566bbd]
> /home/common/openmpi/build/foobar/lib/libmpi.so.0(ompi_mpi_finalize+0x5fd)[0x7f46e3593df7]
> /home/common/openmpi/build/foobar/lib/libmpi.so.0(PMPI_Finalize+0x59)[0x7f46e35bd6e5]
>
> Then I see a bunch of dump info, followed by:
>
> >> Command: mpirun --hostfile /home/common/hosts -np 16 --prefix
> >> /home/common/openmpi/build/foobar/ collective/alltoallv_somezeros
> Elapsed: 00:00:01 0.00u 0.00s
> Test: alltoallv_somezeros, np=16, variant=1: Passed
>
> Ralph