This time my bug report is not PSM related:

I was able to reproduce the MTT error from 
http://mtt.open-mpi.org/index.php?do_redir=2228
on my system with openmpi-dev-720-gf4693c9:

mpi_test_suite: btl_openib.c:1200: mca_btl_openib_alloc: Assertion `qp != 255' failed.
[n050409:06796] *** Process received signal ***
[n050409:06796] Signal: Aborted (6)
[n050409:06796] Signal code:  (-6)
[n050409:06796] [ 0] /lib64/libpthread.so.0(+0xf710)[0x2b036d501710]
[n050409:06796] [ 1] /lib64/libc.so.6(gsignal+0x35)[0x2b036d741635]
[n050409:06796] [ 2] /lib64/libc.so.6(abort+0x175)[0x2b036d742e15]
[n050409:06796] [ 3] /lib64/libc.so.6(+0x2b75e)[0x2b036d73a75e]
[n050409:06796] [ 4] /lib64/libc.so.6(__assert_perror_fail+0x0)[0x2b036d73a820]
[n050409:06796] [ 5] /lustre/ws1/ws/adrian-mtt-0/ompi-install/lib/openmpi/mca_btl_openib.so(mca_btl_openib_alloc+0x77)[0x2b03730cf6d0]
[n050409:06796] [ 6] /lustre/ws1/ws/adrian-mtt-0/ompi-install/lib/openmpi/mca_btl_openib.so(mca_btl_openib_sendi+0x5e5)[0x2b03730d1ae9]
[n050409:06796] [ 7] /lustre/ws1/ws/adrian-mtt-0/ompi-install/lib/openmpi/mca_pml_ob1.so(+0xd407)[0x2b0373961407]
[n050409:06796] [ 8] /lustre/ws1/ws/adrian-mtt-0/ompi-install/lib/openmpi/mca_pml_ob1.so(+0xde45)[0x2b0373961e45]
[n050409:06796] [ 9] /lustre/ws1/ws/adrian-mtt-0/ompi-install/lib/openmpi/mca_pml_ob1.so(mca_pml_ob1_send+0x1ce)[0x2b0373962501]
[n050409:06796] [10] /lustre/ws1/ws/adrian-mtt-0/ompi-install/lib/libmpi.so.0(PMPI_Send+0x2b4)[0x2b036d20d1bb]
[n050409:06796] [11] mpi_test_suite[0x464424]
[n050409:06796] [12] mpi_test_suite[0x470304]
[n050409:06796] [13] mpi_test_suite[0x444a72]
[n050409:06796] [14] /lib64/libc.so.6(__libc_start_main+0xfd)[0x2b036d72dd5d]
[n050409:06796] [15] mpi_test_suite[0x4051a9]
[n050409:06796] *** End of error message ***
--------------------------------------------------------------------------
mpirun noticed that process rank 0 with PID 0 on node n050409 exited on signal 6 (Aborted).
--------------------------------------------------------------------------

Core was generated by `mpi_test_suite -t p2p'.
Program terminated with signal 6, Aborted.
(gdb) bt
#0  0x00002b036d741635 in raise () from /lib64/libc.so.6
#1  0x00002b036d742d9d in abort () from /lib64/libc.so.6
#2  0x00002b036d73a75e in __assert_fail_base () from /lib64/libc.so.6
#3  0x00002b036d73a820 in __assert_fail () from /lib64/libc.so.6
#4  0x00002b03730cf6d0 in mca_btl_openib_alloc (btl=0x224e740, ep=0x22b66a0, order=255 '\377', size=73014, flags=3) at btl_openib.c:1200
#5  0x00002b03730d1ae9 in mca_btl_openib_sendi (btl=0x224e740, ep=0x22b66a0, convertor=0x7fff2c527bb0, header=0x7fff2c527cd0, header_size=14, payload_size=73000, order=255 '\377', flags=3, tag=65 'A', descriptor=0x7fff2c527ce8) at btl_openib.c:1829
#6  0x00002b0373961407 in mca_bml_base_sendi (bml_btl=0x2198850, convertor=0x7fff2c527bb0, header=0x7fff2c527cd0, header_size=14, payload_size=73000, order=255 '\377', flags=3, tag=65 'A', descriptor=0x7fff2c527ce8) at ../../../../ompi/mca/bml/bml.h:305
#7  0x00002b0373961e45 in mca_pml_ob1_send_inline (buf=0x2b7b760, count=1, datatype=0x2b97440, dst=1, tag=37, seqn=3639, dst_proc=0x21c2940, endpoint=0x22dff00, comm=0x6939e0) at pml_ob1_isend.c:107
#8  0x00002b0373962501 in mca_pml_ob1_send (buf=0x2b7b760, count=1, datatype=0x2b97440, dst=1, tag=37, sendmode=MCA_PML_BASE_SEND_STANDARD, comm=0x6939e0) at pml_ob1_isend.c:214
#9  0x00002b036d20d1bb in PMPI_Send (buf=0x2b7b760, count=1, type=0x2b97440, dest=1, tag=37, comm=0x6939e0) at psend.c:78
#10 0x0000000000464424 in tst_p2p_simple_ring_xsend_run (env=0x7fff2c528530) at p2p/tst_p2p_simple_ring_xsend.c:97
#11 0x0000000000470304 in tst_test_run_func (env=0x7fff2c528530) at tst_tests.c:1463
#12 0x0000000000444a72 in main (argc=3, argv=0x7fff2c5287f8) at mpi_test_suite.c:639

This is with --enable-debug. Without --enable-debug I get a
segmentation fault instead, though not on every run. With fewer cores
the test passes most of the time; with 32 cores on 4 nodes it fails
almost every time. When a run with fewer cores does not crash, I get
messages like:

[n050409][[36216,1],1][btl_openib_xrc.c:58:mca_btl_openib_xrc_check_api] XRC error: bad XRC API (require XRC from OFED pre 3.12).

                Adrian
