You are correct - the Sun errors are in a version prior to the insertion of the SM changes. We didn't relabel the version to 1.3.2 until -after- those changes went in, so you have to look for anything with an r number >= 20839.

The sif errors are all in that group - I would suggest starting there.

I suspect Josh or someone at IU could tell you the compiler. I would be very surprised if it wasn't gcc, but I don't know what version. I suspect they could even find a way to run some debugging for you, if that would help.

The Cisco errors were caused by some config/fabric problems - Jeff is physically there today, so hopefully those will get fixed and we'll see his tests. IIRC, he was seeing these problems before, so hopefully we can see if they are still present.


On Mar 26, 2009, at 3:25 PM, Eugene Loh wrote:

Ralph Castain wrote:

It looks like the SM revisions we inserted into 1.3.2 are a great detector for shared memory init failures - it segfaulted 143 times last night on IU's sif computer, 34 times on Sun/Linux, and 3 times on Sun/SunOS...almost every single time due to "Address not mapped" errors in the sm btl during init.

Might be worth someone looking at the MTT output stack traces - this is something that now appears to be reproducible, and should be addressed before we release 1.3.2 as it seems far more likely to happen than in the past.

Okay. I looked at http://www.open-mpi.org/mtt/index.php?do_redir=973

If we start with the 3 Sun/SunOS failures (row #7), these seem to be labeled 1.3.1 ("MPI Version"). So, not 1.3.2. And, I don't know how to make sense of the stack trace... there's an "mca_common_sm_mmap_init" ftruncate problem and then stuff apparently much later on. How can this be?

The Sun/Linux problems must be row #6. Yes? Again, the "MPI Version" is labeled 1.3.1. Is that informative or misleading? Lots of the stack traces look like this is happening during MPI_Init. I tried running a code that just does MPI_Init on similar configs and seem unable to trigger the problem.

How do I figure out the compiler used?

I need help reproducing this problem.
_______________________________________________
devel mailing list
de...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/devel
