I have been testing some code against Open MPI lately that consistently crashes during certain MPI function calls.  The code itself does not seem to be the problem, as it runs fine against MPICH.  I have tested it against Open MPI 1.2.5, 1.2.6, and 1.2.7, and all three exhibit the same behavior.  The problem also only occurs in Open MPI when running on more than 16 processes.  I have posted this stack trace to the list before, but I am now submitting it as a potential bug report.  I need some help debugging it and finding out exactly what is going on inside Open MPI when the segfault occurs.  Are there any suggestions on how best to do this?  Is there an easy way to attach gdb to one of the processes?  I have already built Open MPI with debugging, memory profiling, etc. enabled; how can I best take advantage of these features?
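
To be concrete about what I mean by attaching gdb: the only approach I know of is the usual "print the PID and spin until a debugger attaches" trick, roughly like the sketch below.  This is just an illustration, not code from Openmpi_md_twham; the variable names and the placement right after MPI_Init are my own.

/* Minimal sketch: spin-wait so gdb can be attached to a chosen rank.
 * Illustrative only -- not taken from Openmpi_md_twham itself. */
#include <mpi.h>
#include <stdio.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    int rank;
    volatile int attached = 0;   /* from gdb: set var attached = 1 */
    char hostname[256];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    gethostname(hostname, sizeof(hostname));
    printf("rank %d: PID %d on %s waiting for debugger attach\n",
           rank, (int)getpid(), hostname);
    fflush(stdout);

    /* Each rank spins here until someone runs "gdb -p <PID>" on its
     * node, sets attached to 1, and continues. */
    while (attached == 0)
        sleep(5);

    /* ... rest of the program would go here ... */

    MPI_Finalize();
    return 0;
}

If there is a cleaner way to do this with mpirun, or a better way to exploit Open MPI's built-in debugging and memory-debugging support, that is exactly the kind of advice I am looking for.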

Here is the segfault:
[m4b-1-8:11481] *** Process received signal ***
[m4b-1-8:11481] Signal: Segmentation fault (11)
[m4b-1-8:11481] Signal code: Address not mapped (1)
[m4b-1-8:11481] Failing at address: 0x2b91c69eed
[m4b-1-8:11483] [ 0] /lib64/libpthread.so.0 [0x33e8c0de70]
[m4b-1-8:11483] [ 1] /fslhome/dhansen7/openmpi/lib/libmpi.so.0 [0x2aaaaabea7c0]
[m4b-1-8:11483] [ 2] /fslhome/dhansen7/openmpi/lib/libmpi.so.0 [0x2aaaaabea675]
[m4b-1-8:11483] [ 3] /fslhome/dhansen7/openmpi/lib/libmpi.so.0(mca_pml_ob1_send+0x2da) [0x2aaaaabeaf55]
[m4b-1-8:11483] [ 4] /fslhome/dhansen7/openmpi/lib/libmpi.so.0(MPI_Send+0x28e) [0x2aaaaab52c5a]
[m4b-1-8:11483] [ 5] /fslhome/dhansen7/compute/for_DanielHansen/replica_mpi_marylou2/Openmpi_md_twham(twham_init+0x708) [0x42a8a8]
[m4b-1-8:11483] [ 6] /fslhome/dhansen7/compute/for_DanielHansen/replica_mpi_marylou2/Openmpi_md_twham(repexch+0x73c) [0x425d5c]
[m4b-1-8:11483] [ 7] /fslhome/dhansen7/compute/for_DanielHansen/replica_mpi_marylou2/Openmpi_md_twham(main+0x855) [0x4133a5]
[m4b-1-8:11483] [ 8] /lib64/libc.so.6(__libc_start_main+0xf4) [0x33e841d8a4]
[m4b-1-8:11483] [ 9] /fslhome/dhansen7/compute/for_DanielHansen/replica_mpi_marylou2/Openmpi_md_twham [0x4040b9]
[m4b-1-8:11483] *** End of error message ***

There is also an MPI error message that immediately precedes the segfault:
[m4b-1-7i][0,1,0][btl_tcp_component.c:623:mca_btl_tcp_component_recv_handler] errno=11
[m4b-1-7i][0,1,0][btl_tcp_component.c:623:mca_btl_tcp_component_recv_handler] errno=11
[m4b-1-7i][0,1,0][btl_tcp_component.c:623:mca_btl_tcp_component_recv_handler] errno=11
[m4b-1-7i][0,1,0][btl_tcp_component.c:623:mca_btl_tcp_component_recv_handler] errno=11

Thanks,
Daniel Hansen
Systems Administrator
BYU Fulton Supercomputing Lab
