Different from what?
You and Terry saw something that occurred about 0.01% of the time
during MPI_Init, in add_procs. That does not seem to be what we are
seeing here.
But we have seen failures in 1.3.1 and 1.3.2 that look like the one
here. They occur more like 1% of the time, either during MPI_Init
*OR* later during a collective call. What we're looking at here seems
to be related. E.g., see
http://www.open-mpi.org/community/lists/devel/2009/03/5768.php
Jeff Squyres wrote:
Hmm -- this looks like a different error to me.
The <1% sm error we were seeing was in MPI_INIT. This
looks like it is beyond MPI_INIT and in the sending path...?
On May 4, 2009, at 11:00 AM, Eugene Loh wrote:
Ralph Castain wrote:
> In reviewing last night's MTT tests for the 1.3 branch, I am seeing
> several segfault failures in the shared memory BTL when using large
> messages. This occurred on both IU's sif machine and on Sun's tests.
>
> Here is a typical stack from MTT:
>
> MPITEST info (0): Starting MPI_Sendrecv: Root to all model test
> [burl-ct-v20z-13:14699] *** Process received signal ***
> [burl-ct-v20z-13:14699] Signal: Segmentation fault (11)
> [burl-ct-v20z-13:14699] Signal code: (128)
> [burl-ct-v20z-13:14699] Failing at address: (nil)
> [burl-ct-v20z-13:14699] [ 0] /lib64/tls/libpthread.so.0 [0x2a960bc720]
> [burl-ct-v20z-13:14699] [ 1] /workspace/.../lib/lib64/openmpi/mca_btl_sm.so(mca_btl_sm_send+0x7b) [0x2a9786a7d3]
> [burl-ct-v20z-13:14699] [ 2] /workspace/.../lib/lib64/openmpi/mca_pml_ob1.so(mca_pml_ob1_send_request_start_copy+0x5b2) [0x2a97453942]
> [burl-ct-v20z-13:14699] [ 3] /workspace/.../lib/lib64/openmpi/mca_pml_ob1.so(mca_pml_ob1_isend+0x4f2) [0x2a9744b446]
> [burl-ct-v20z-13:14699] [ 4] /workspace/.../lib/lib64/openmpi/mca_coll_tuned.so(ompi_coll_tuned_sendrecv_actual+0x7e) [0x2a98120bca]
> [burl-ct-v20z-13:14699] [ 5] /workspace/.../lib/lib64/openmpi/mca_coll_tuned.so(ompi_coll_tuned_barrier_intra_recursivedoubling+0x119) [0x2a9812b111]
> [burl-ct-v20z-13:14699] [ 6] /workspace/.../lib/lib64/libmpi.so.0(PMPI_Barrier+0x8e) [0x2a9584ca42]
> [burl-ct-v20z-13:14699] [ 7] src/MPI_Sendrecv_rtoa_c [0x403009]
> [burl-ct-v20z-13:14699] [ 8] /lib64/tls/libc.so.6(__libc_start_main+0xea) [0x2a961e0aaa]
> [burl-ct-v20z-13:14699] [ 9] src/MPI_Sendrecv_rtoa_c(strtok+0x66) [0x4019f2]
> [burl-ct-v20z-13:14699] *** End of error message ***
> --------------------------------------------------------------------------
>
> Seems like this is something we need to address before release - yes?
I don't know if this needs to be addressed before release, but it
was my
impression that we've been living with these errors for a long time.
They're intermittent (1% incidence rate????) and stacks come through
coll_tuned or coll_hierarch or something and end up in the sm BTL. We
discussed them not too long ago on this list. They predate 1.3.2. I
think Terry said they seem hard to reproduce outside of MTT. (Terry is
out this week.)
Anyhow, my impression was that these failures are not new with this
release. It would be nice to get them off the books in any case. We need
to figure out how to improve reproducibility (something like the stress
test sketched below) and then dive into the coll/sm code.
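
For what it's worth, here is a rough sketch of the kind of stress test I
have in mind. It is written from the stack above, not from the actual
MPITEST source (which I don't have in front of me): a root-to-all
MPI_Sendrecv with large messages, with a barrier after each round. The
message size, iteration count, and the run line below are all guesses.

/*
 * Hypothetical reproducer sketch (NOT the MPITEST source): root-to-all
 * MPI_Sendrecv with large messages, followed by a barrier, repeated
 * many times.  Message size and iteration count are arbitrary guesses.
 */
#include <stdlib.h>
#include <mpi.h>

#define MSG_BYTES  (4 * 1024 * 1024)   /* "large" message: 4 MB */
#define ITERS      200                 /* arbitrary; bump up as needed */

int main(int argc, char **argv)
{
    int rank, size, root, peer, iter;
    char *sbuf, *rbuf;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    sbuf = malloc(MSG_BYTES);
    rbuf = malloc(MSG_BYTES);
    if (sbuf == NULL || rbuf == NULL) {
        MPI_Abort(MPI_COMM_WORLD, 1);
    }

    for (iter = 0; iter < ITERS; iter++) {
        /* Each rank takes a turn as root and exchanges a large message
         * with every other rank (root-to-all pattern). */
        for (root = 0; root < size; root++) {
            if (rank == root) {
                for (peer = 0; peer < size; peer++) {
                    if (peer == root) continue;
                    MPI_Sendrecv(sbuf, MSG_BYTES, MPI_CHAR, peer, 0,
                                 rbuf, MSG_BYTES, MPI_CHAR, peer, 0,
                                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);
                }
            } else {
                MPI_Sendrecv(sbuf, MSG_BYTES, MPI_CHAR, root, 0,
                             rbuf, MSG_BYTES, MPI_CHAR, root, 0,
                             MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            }
            /* Matches the PMPI_Barrier / recursive-doubling frames in
             * the stack above. */
            MPI_Barrier(MPI_COMM_WORLD);
        }
    }

    free(sbuf);
    free(rbuf);
    MPI_Finalize();
    return 0;
}

Build with mpicc and run on a single node with the sm BTL forced, e.g.
mpirun -np 8 --mca btl sm,self ./sendrecv_rtoa, and loop it for a while.
Again, this is just a guess at the traffic pattern; if someone can dig
up the actual MPITEST source, we should use that instead.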