You are correct - the Sun errors are in a version prior to the
insertion of the SM changes. We didn't relabel the version to 1.3.2
until -after- those changes went in, so you have to look for anything
with an r number >= 20839.
The sif errors are all in that group - I would suggest starting there.
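For what it's worth, here is a rough sketch of that filter. It assumes the "MPI Version" label ends in an r number (something like 1.3.1r20839); that label format, the helper name, and the sample labels are all assumptions for illustration, not taken from MTT itself.

```python
import re

# First r number that includes the SM changes (from the note above)
SM_CHANGE_REVISION = 20839

def has_sm_changes(version):
    """Return True if the version label ends in an r number >= 20839."""
    match = re.search(r"r(\d+)$", version)
    if match is None:
        # No trailing r number (e.g. a plain release label): cannot tell
        return False
    return int(match.group(1)) >= SM_CHANGE_REVISION

# Hypothetical version labels, for illustration only
labels = ["1.3.1r20800", "1.3.1r20839", "1.3.2a1r20880", "1.3.1"]
print([label for label in labels if has_sm_changes(label)])
```

A label with no trailing r number is treated as "unknown" and filtered out, which seems the safer default when the 1.3.1 tag alone does not say whether the SM changes are in.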
I suspect Josh or someone at IU could tell you the compiler. I would
be very surprised if it wasn't gcc, but I don't know what version. I
suspect they could even find a way to run some debugging for you,
if that would help.
The Cisco errors were caused by some config/fabric problems - Jeff is
physically there today, so hopefully those will get fixed and we'll
see his tests. IIRC, he was seeing these problems before, so hopefully
we can see if they are still present.
On Mar 26, 2009, at 3:25 PM, Eugene Loh wrote:
Ralph Castain wrote:
It looks like the SM revisions we inserted into 1.3.2 are a great
detector for shared memory init failures - it segfaulted 143 times
last night on IU's sif computer, 34 times on Sun/Linux, and 3 times
on Sun/SunOS...almost every single time due to "Address not
mapped" errors in the sm btl during init.
Might be worth someone looking at the MTT output stack traces - this
is something that now appears to be reproducible, and should be
addressed before we release 1.3.2 as it seems far more likely to
happen than in the past.
Okay. I looked at http://www.open-mpi.org/mtt/index.php?do_redir=973
If we start with the 3 Sun/SunOS failures (row #7), these seem to be
labeled 1.3.1 ("MPI Version"). So, not 1.3.2. And, I don't know
how to make sense of the stack trace... there is an
"mca_common_sm_mmap_init" ftruncate problem and stuff apparently
much later on. How can this be?
The Sun/Linux problems must be row #6. Yes? Again, the "MPI
Version" is labeled 1.3.1. Is that informative or misleading? Lots
of the stacks look like this is happening during MPI_Init. I tried
running a code that just does MPI_Init on similar configs and seem
unable to trigger this problem.
How do I figure out the compiler used?
I need help reproducing this problem.
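Since the failures look intermittent, one way to chase them might be to launch the same small job many times and count the runs that exit nonzero. The sketch below is only that, a sketch: the function name is made up, and the demo uses a stand-in command so it runs anywhere. A real attempt would pass something like ["mpirun", "-np", "2", "./init_only"], where ./init_only is a hypothetical binary that just calls MPI_Init and MPI_Finalize.

```python
import subprocess
import sys

def count_failures(cmd, runs):
    """Run cmd `runs` times and return how many runs exited nonzero."""
    failures = 0
    for _ in range(runs):
        result = subprocess.run(
            cmd,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
        # A launcher normally reports a crashed child as a nonzero exit
        # status; a negative value means the launched process itself was
        # killed by a signal (POSIX only).
        if result.returncode != 0:
            failures += 1
    return failures

# Stand-in command that always fails, so the demo does not need MPI.
# For a real run, replace it with the mpirun command line to test.
demo_cmd = [sys.executable, "-c", "raise SystemExit(1)"]
print(count_failures(demo_cmd, 3))
```

This only counts failures; the stack traces themselves would still have to come from the job's own output or a core file.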
_______________________________________________
devel mailing list
de...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/devel