Ralph Castain wrote:
> Hi folks
>
> I encourage people to please look at your MTT outputs. As we are
> preparing to roll the 1.3.3 release, I am seeing a lot of problems on
> the branch:
>
> 1. timeouts, coming in two forms: (a) MPI_Abort hanging, and (b)
> collectives hanging (this is mostly on Solaris)
Can you clarify, or send me a link, that makes you believe (b) is mostly
Solaris? Looking at last night's Sun MTT 1.3 nightly runs, I see 47
timeouts on Linux and 24 timeouts on Solaris. That doesn't constitute
"mostly Solaris" to me. Also, how are you determining that these timeouts
are collective-based? I have a theory that they are, but I don't have a
clear smoking gun as of yet.
I've been looking at some collective hangs and segv's. These seem to
happen across different platforms and OSes (Linux and Solaris), and I've
been finding them really hard to reproduce. I ran MPI_Allreduce_loc_c on
three clusters for 2 days without a hang or segv, so I am really concerned
whether we'll even be able to get this to fail with debugging enabled.
I have not been able to get a core, or time with a hung run, in order to
gather more information.
> 2. segfaults - mostly on sif, but occasionally elsewhere
>
> 3. daemon failed to report back - this was only on sif
>
> We will need to correct many of these for the release - unless these
> prove to be due to trivial errors, I don't see how we will be ready
> to roll release candidates next week.
>
> So let's please start taking a look at these?!
I've actually been looking at ours, though I have not been very vocal.
I was hoping to get more information on our timeouts before requesting
help.
> Ralph
------------------------------------------------------------------------
_______________________________________________
devel mailing list
de...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/devel