Ralph Castain wrote:
Hi folks
Er, perhaps pronounced "Eugene". :^(
It looks like the SM revisions we inserted into 1.3.2 are a great detector for shared memory init failures
How delicately put! I appreciate the gentleness.
- it segfaulted 143 times last night on IU's sif computer, 34 times on Sun/Linux, and 3 times on Sun/SunOS...almost every single time due to "Address not mapped" errors in the sm btl during init.
Any guess as to the failure frequency, or what it'd take for me to reproduce this? I ran np=8 MPI_Init() jobs 200K times with 1.3.1 and saw no failures. I'm starting now with a single-queue version, but wouldn't be surprised if, again, I can't turn anything up.
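For anyone who wants to try the same kind of repeated-launch test, here is a minimal sketch of a stress loop. It is not the script used above. The `stress` helper and the `./init_only` binary (assumed to call only MPI_Init() and MPI_Finalize()) are hypothetical names, and the run count is arbitrary. Depending on the launcher, a crashed rank may show up as a nonzero exit status rather than as a signal, so the tally records both.

```python
import signal
import subprocess
from collections import Counter


def stress(cmd, runs):
    """Run cmd `runs` times and tally how each run ended."""
    tally = Counter()
    for _ in range(runs):
        proc = subprocess.run(
            cmd, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL
        )
        rc = proc.returncode
        if rc == 0:
            tally["ok"] += 1
        elif rc < 0:
            # On POSIX, a negative return code means the child itself
            # was killed by signal number -rc.
            tally[signal.Signals(-rc).name] += 1
        else:
            # Any other nonzero exit status, e.g. reported by the launcher.
            tally[f"exit {rc}"] += 1
    return tally


if __name__ == "__main__":
    # Assumption: ./init_only is a hypothetical MPI_Init()/MPI_Finalize()
    # test binary; 1000 runs is an arbitrary illustrative count.
    print(stress(["mpirun", "-np", "8", "./init_only"], 1000))
```

The printed Counter gives a rough failure frequency per outcome, which is the number asked about above.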
Might be worth someone looking at the MTT output stack traces - this is something that now appears to be reproducible, and should be addressed before we release 1.3.2, as it seems far more likely to happen than in the past.
Great (in a weird way, I guess). Can you tell me how to look at the MTT output stack traces? I found http://www.open-mpi.org/projects/mtt/ but expect it'll take me a while to wade through that.