Adrian -- Can you file this as a GitHub issue? Thanks.
> On Jan 17, 2015, at 12:58 PM, Adrian Reber <adr...@lisas.de> wrote:
>
> This time my bug report is not PSM related:
>
> I was able to reproduce the MTT error from
> http://mtt.open-mpi.org/index.php?do_redir=2228
> on my system with openmpi-dev-720-gf4693c9:
>
> mpi_test_suite: btl_openib.c:1200: mca_btl_openib_alloc: Assertion `qp != 255' failed.
> [n050409:06796] *** Process received signal ***
> [n050409:06796] Signal: Aborted (6)
> [n050409:06796] Signal code: (-6)
> [n050409:06796] [ 0] /lib64/libpthread.so.0(+0xf710)[0x2b036d501710]
> [n050409:06796] [ 1] /lib64/libc.so.6(gsignal+0x35)[0x2b036d741635]
> [n050409:06796] [ 2] /lib64/libc.so.6(abort+0x175)[0x2b036d742e15]
> [n050409:06796] [ 3] /lib64/libc.so.6(+0x2b75e)[0x2b036d73a75e]
> [n050409:06796] [ 4] /lib64/libc.so.6(__assert_perror_fail+0x0)[0x2b036d73a820]
> [n050409:06796] [ 5] /lustre/ws1/ws/adrian-mtt-0/ompi-install/lib/openmpi/mca_btl_openib.so(mca_btl_openib_alloc+0x77)[0x2b03730cf6d0]
> [n050409:06796] [ 6] /lustre/ws1/ws/adrian-mtt-0/ompi-install/lib/openmpi/mca_btl_openib.so(mca_btl_openib_sendi+0x5e5)[0x2b03730d1ae9]
> [n050409:06796] [ 7] /lustre/ws1/ws/adrian-mtt-0/ompi-install/lib/openmpi/mca_pml_ob1.so(+0xd407)[0x2b0373961407]
> [n050409:06796] [ 8] /lustre/ws1/ws/adrian-mtt-0/ompi-install/lib/openmpi/mca_pml_ob1.so(+0xde45)[0x2b0373961e45]
> [n050409:06796] [ 9] /lustre/ws1/ws/adrian-mtt-0/ompi-install/lib/openmpi/mca_pml_ob1.so(mca_pml_ob1_send+0x1ce)[0x2b0373962501]
> [n050409:06796] [10] /lustre/ws1/ws/adrian-mtt-0/ompi-install/lib/libmpi.so.0(PMPI_Send+0x2b4)[0x2b036d20d1bb]
> [n050409:06796] [11] mpi_test_suite[0x464424]
> [n050409:06796] [12] mpi_test_suite[0x470304]
> [n050409:06796] [13] mpi_test_suite[0x444a72]
> [n050409:06796] [14] /lib64/libc.so.6(__libc_start_main+0xfd)[0x2b036d72dd5d]
> [n050409:06796] [15] mpi_test_suite[0x4051a9]
> [n050409:06796] *** End of error message ***
> --------------------------------------------------------------------------
> mpirun noticed that process rank 0 with PID 0 on node n050409 exited on
> signal 6 (Aborted).
> --------------------------------------------------------------------------
>
> Core was generated by `mpi_test_suite -t p2p'.
> Program terminated with signal 6, Aborted.
> (gdb) bt
> #0  0x00002b036d741635 in raise () from /lib64/libc.so.6
> #1  0x00002b036d742d9d in abort () from /lib64/libc.so.6
> #2  0x00002b036d73a75e in __assert_fail_base () from /lib64/libc.so.6
> #3  0x00002b036d73a820 in __assert_fail () from /lib64/libc.so.6
> #4  0x00002b03730cf6d0 in mca_btl_openib_alloc (btl=0x224e740, ep=0x22b66a0,
>     order=255 '\377', size=73014, flags=3) at btl_openib.c:1200
> #5  0x00002b03730d1ae9 in mca_btl_openib_sendi (btl=0x224e740, ep=0x22b66a0,
>     convertor=0x7fff2c527bb0, header=0x7fff2c527cd0, header_size=14,
>     payload_size=73000, order=255 '\377', flags=3,
>     tag=65 'A', descriptor=0x7fff2c527ce8) at btl_openib.c:1829
> #6  0x00002b0373961407 in mca_bml_base_sendi (bml_btl=0x2198850,
>     convertor=0x7fff2c527bb0, header=0x7fff2c527cd0, header_size=14,
>     payload_size=73000, order=255 '\377', flags=3, tag=65 'A',
>     descriptor=0x7fff2c527ce8) at ../../../../ompi/mca/bml/bml.h:305
> #7  0x00002b0373961e45 in mca_pml_ob1_send_inline (buf=0x2b7b760, count=1,
>     datatype=0x2b97440, dst=1, tag=37, seqn=3639, dst_proc=0x21c2940,
>     endpoint=0x22dff00, comm=0x6939e0) at pml_ob1_isend.c:107
> #8  0x00002b0373962501 in mca_pml_ob1_send (buf=0x2b7b760, count=1,
>     datatype=0x2b97440, dst=1, tag=37, sendmode=MCA_PML_BASE_SEND_STANDARD,
>     comm=0x6939e0) at pml_ob1_isend.c:214
> #9  0x00002b036d20d1bb in PMPI_Send (buf=0x2b7b760, count=1, type=0x2b97440,
>     dest=1, tag=37, comm=0x6939e0) at psend.c:78
> #10 0x0000000000464424 in tst_p2p_simple_ring_xsend_run (env=0x7fff2c528530)
>     at p2p/tst_p2p_simple_ring_xsend.c:97
> #11 0x0000000000470304 in tst_test_run_func (env=0x7fff2c528530)
>     at tst_tests.c:1463
> #12 0x0000000000444a72 in main (argc=3, argv=0x7fff2c5287f8)
>     at mpi_test_suite.c:639
> This is with --enable-debug. Without --enable-debug I get a
> segmentation fault, but not always. Using fewer cores it works most
> of the time. With 32 cores on 4 nodes it happens almost
> all the time. If it does not crash using fewer cores I get messages like:
>
> [n050409][[36216,1],1][btl_openib_xrc.c:58:mca_btl_openib_xrc_check_api] XRC
> error: bad XRC API (require XRC from OFED pre 3.12).
>
> 		Adrian
> _______________________________________________
> devel mailing list
> de...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> Link to this post:
> http://www.open-mpi.org/community/lists/devel/2015/01/16797.php

-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: http://www.cisco.com/web/about/doing_business/legal/cri/