Adrian --

Can you file this as a GitHub issue?  Thanks.


> On Jan 17, 2015, at 12:58 PM, Adrian Reber <adr...@lisas.de> wrote:
> 
> This time my bug report is not PSM related:
> 
> I was able to reproduce the MTT error from 
> http://mtt.open-mpi.org/index.php?do_redir=2228
> on my system with openmpi-dev-720-gf4693c9:
> 
> mpi_test_suite: btl_openib.c:1200: mca_btl_openib_alloc: Assertion `qp != 255' failed.
> [n050409:06796] *** Process received signal ***
> [n050409:06796] Signal: Aborted (6)
> [n050409:06796] Signal code:  (-6)
> [n050409:06796] [ 0] /lib64/libpthread.so.0(+0xf710)[0x2b036d501710]
> [n050409:06796] [ 1] /lib64/libc.so.6(gsignal+0x35)[0x2b036d741635]
> [n050409:06796] [ 2] /lib64/libc.so.6(abort+0x175)[0x2b036d742e15]
> [n050409:06796] [ 3] /lib64/libc.so.6(+0x2b75e)[0x2b036d73a75e]
> [n050409:06796] [ 4] /lib64/libc.so.6(__assert_perror_fail+0x0)[0x2b036d73a820]
> [n050409:06796] [ 5] /lustre/ws1/ws/adrian-mtt-0/ompi-install/lib/openmpi/mca_btl_openib.so(mca_btl_openib_alloc+0x77)[0x2b03730cf6d0]
> [n050409:06796] [ 6] /lustre/ws1/ws/adrian-mtt-0/ompi-install/lib/openmpi/mca_btl_openib.so(mca_btl_openib_sendi+0x5e5)[0x2b03730d1ae9]
> [n050409:06796] [ 7] /lustre/ws1/ws/adrian-mtt-0/ompi-install/lib/openmpi/mca_pml_ob1.so(+0xd407)[0x2b0373961407]
> [n050409:06796] [ 8] /lustre/ws1/ws/adrian-mtt-0/ompi-install/lib/openmpi/mca_pml_ob1.so(+0xde45)[0x2b0373961e45]
> [n050409:06796] [ 9] /lustre/ws1/ws/adrian-mtt-0/ompi-install/lib/openmpi/mca_pml_ob1.so(mca_pml_ob1_send+0x1ce)[0x2b0373962501]
> [n050409:06796] [10] /lustre/ws1/ws/adrian-mtt-0/ompi-install/lib/libmpi.so.0(PMPI_Send+0x2b4)[0x2b036d20d1bb]
> [n050409:06796] [11] mpi_test_suite[0x464424]
> [n050409:06796] [12] mpi_test_suite[0x470304]
> [n050409:06796] [13] mpi_test_suite[0x444a72]
> [n050409:06796] [14] /lib64/libc.so.6(__libc_start_main+0xfd)[0x2b036d72dd5d]
> [n050409:06796] [15] mpi_test_suite[0x4051a9]
> [n050409:06796] *** End of error message ***
> --------------------------------------------------------------------------
> mpirun noticed that process rank 0 with PID 0 on node n050409 exited on 
> signal 6 (Aborted).
> --------------------------------------------------------------------------
> 
> Core was generated by `mpi_test_suite -t p2p'.
> Program terminated with signal 6, Aborted.
> (gdb) bt
> #0  0x00002b036d741635 in raise () from /lib64/libc.so.6
> #1  0x00002b036d742d9d in abort () from /lib64/libc.so.6
> #2  0x00002b036d73a75e in __assert_fail_base () from /lib64/libc.so.6
> #3  0x00002b036d73a820 in __assert_fail () from /lib64/libc.so.6
> #4  0x00002b03730cf6d0 in mca_btl_openib_alloc (btl=0x224e740, ep=0x22b66a0, 
> order=255 '\377', size=73014, flags=3) at btl_openib.c:1200
> #5  0x00002b03730d1ae9 in mca_btl_openib_sendi (btl=0x224e740, ep=0x22b66a0, 
> convertor=0x7fff2c527bb0, header=0x7fff2c527cd0, header_size=14, 
> payload_size=73000, order=255 '\377', flags=3, 
>    tag=65 'A', descriptor=0x7fff2c527ce8) at btl_openib.c:1829
> #6  0x00002b0373961407 in mca_bml_base_sendi (bml_btl=0x2198850, 
> convertor=0x7fff2c527bb0, header=0x7fff2c527cd0, header_size=14, 
> payload_size=73000, order=255 '\377', flags=3, tag=65 'A', 
>    descriptor=0x7fff2c527ce8) at ../../../../ompi/mca/bml/bml.h:305
> #7  0x00002b0373961e45 in mca_pml_ob1_send_inline (buf=0x2b7b760, count=1, 
> datatype=0x2b97440, dst=1, tag=37, seqn=3639, dst_proc=0x21c2940, 
> endpoint=0x22dff00, comm=0x6939e0) at pml_ob1_isend.c:107
> #8  0x00002b0373962501 in mca_pml_ob1_send (buf=0x2b7b760, count=1, 
> datatype=0x2b97440, dst=1, tag=37, sendmode=MCA_PML_BASE_SEND_STANDARD, 
> comm=0x6939e0) at pml_ob1_isend.c:214
> #9  0x00002b036d20d1bb in PMPI_Send (buf=0x2b7b760, count=1, type=0x2b97440, 
> dest=1, tag=37, comm=0x6939e0) at psend.c:78
> #10 0x0000000000464424 in tst_p2p_simple_ring_xsend_run (env=0x7fff2c528530) 
> at p2p/tst_p2p_simple_ring_xsend.c:97
> #11 0x0000000000470304 in tst_test_run_func (env=0x7fff2c528530) at 
> tst_tests.c:1463
> #12 0x0000000000444a72 in main (argc=3, argv=0x7fff2c5287f8) at 
> mpi_test_suite.c:639
> 
> This is with --enable-debug. Without --enable-debug I get a
> segmentation fault, but not always. With fewer cores it works most
> of the time; with 32 cores on 4 nodes it crashes almost every time.
> When it does not crash with fewer cores, I get messages like:
> 
> [n050409][[36216,1],1][btl_openib_xrc.c:58:mca_btl_openib_xrc_check_api] XRC error: bad XRC API (require XRC from OFED pre 3.12).
> 
>               Adrian
> _______________________________________________
> devel mailing list
> de...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> Link to this post: 
> http://www.open-mpi.org/community/lists/devel/2015/01/16797.php


-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: 
http://www.cisco.com/web/about/doing_business/legal/cri/
