Hi folks

In reviewing last night's MTT tests for the 1.3 branch, I am seeing several segfault failures in the shared memory BTL when using large messages. This occurred on both IU's sif machine and on Sun's tests.

Here is a typical stack from MTT:

MPITEST info  (0): Starting MPI_Sendrecv: Root to all model test
[burl-ct-v20z-13:14699] *** Process received signal ***
[burl-ct-v20z-13:14699] Signal: Segmentation fault (11)
[burl-ct-v20z-13:14699] Signal code:  (128)
[burl-ct-v20z-13:14699] Failing at address: (nil)
[burl-ct-v20z-13:14699] [ 0] /lib64/tls/libpthread.so.0 [0x2a960bc720]
[burl-ct-v20z-13:14699] [ 1]
/workspace/em162155/hpc/mtt-scratch/burl-ct-v20z-12/ompi-tarball- testing/installs/ZCcL/install/lib/lib64/openmpi/ mca_btl_sm.so(mca_btl_sm_send+0x7b)
[0x2a9786a7d3]
[burl-ct-v20z-13:14699] [ 2]
/workspace/em162155/hpc/mtt-scratch/burl-ct-v20z-12/ompi-tarball- testing/installs/ZCcL/install/lib/lib64/openmpi/ mca_pml_ob1.so(mca_pml_ob1_send_request_start_copy+0x5b2)
[0x2a97453942]
[burl-ct-v20z-13:14699] [ 3]
/workspace/em162155/hpc/mtt-scratch/burl-ct-v20z-12/ompi-tarball- testing/installs/ZCcL/install/lib/lib64/openmpi/ mca_pml_ob1.so(mca_pml_ob1_isend+0x4f2)
[0x2a9744b446]
[burl-ct-v20z-13:14699] [ 4]
/workspace/em162155/hpc/mtt-scratch/burl-ct-v20z-12/ompi-tarball- testing/installs/ZCcL/install/lib/lib64/openmpi/ mca_coll_tuned.so(ompi_coll_tuned_sendrecv_actual+0x7e)
[0x2a98120bca]
[burl-ct-v20z-13:14699] [ 5]
/workspace/em162155/hpc/mtt-scratch/burl-ct-v20z-12/ompi-tarball- testing/installs/ZCcL/install/lib/lib64/openmpi/ mca_coll_tuned.so(ompi_coll_tuned_barrier_intra_recursivedoubling+0x119)
[0x2a9812b111]
[burl-ct-v20z-13:14699] [ 6]
/workspace/em162155/hpc/mtt-scratch/burl-ct-v20z-12/ompi-tarball- testing/installs/ZCcL/install/lib/lib64/libmpi.so.0(PMPI_Barrier+0x8e)
[0x2a9584ca42]
[burl-ct-v20z-13:14699] [ 7] src/MPI_Sendrecv_rtoa_c [0x403009]
[burl-ct-v20z-13:14699] [ 8] /lib64/tls/libc.so.6(__libc_start_main +0xea) [0x2a961e0aaa] [burl-ct-v20z-13:14699] [ 9] src/MPI_Sendrecv_rtoa_c(strtok+0x66) [0x4019f2]
[burl-ct-v20z-13:14699] *** End of error message ***
[burl-ct-v20z-12][[13280,1],0][btl_tcp_endpoint.c: 456:mca_btl_tcp_endpoint_recv_blocking] recv(13)
failed: Connection reset by peer (104)
--------------------------------------------------------------------------
mpirun noticed that process rank 2 with PID 14699 on node burl-ct- v20z-13 exited on signal 11
(Segmentation fault).
--------------------------------------------------------------------------


Seems like this is something we need to address before release - yes?

Ralph

Reply via email to