Bryan Lally wrote:
I think I've run across a race condition in your latest release.
Since my demonstrator is somewhat large and cumbersome, I'd like to
know if you already know about this issue before we start the process
of providing code and details.
Basics: Open MPI 1.3.2, Fedora 9, two x86_64 quad-core CPUs in one machine.
Symptoms: our code hangs, always in the same vicinity, usually at the
same place, 10-25% of the time. Sometimes more often, sometimes less.
Our code has run reliably with many MPI implementations for years. We
haven't added anything recently that is a likely culprit. While we
have our own issues, this doesn't feel like one of ours.
We see that there is new code in the shared memory transport between
1.3.1 and 1.3.2. Our code doesn't hang with 1.3.1 (or with 1.2.9),
only with 1.3.2.
If we switch to TCP for the transport (with mpirun --mca btl tcp,self ...),
we don't see any hangs. Running with --mca btl sm,self results in
hangs.
If we sprinkle a few (three) MPI_Barrier calls in the vicinity of the
problem, we no longer see hangs.
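Purely to show the shape of the thing (this is not our demonstrator,
which is far larger, and a toy this small almost certainly won't hang
on its own), the pattern with the workaround barriers looks roughly
like this, with barriers immediately around the collective:

  /* Illustrative sketch only; not our actual code. */
  #include <mpi.h>
  #include <stdio.h>

  int main(int argc, char **argv)
  {
      int rank;
      double local, global;

      MPI_Init(&argc, &argv);
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);

      local = (double)rank;           /* stand-in for our real data */

      MPI_Barrier(MPI_COMM_WORLD);    /* workaround barrier */
      MPI_Allreduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM,
                    MPI_COMM_WORLD);  /* the call that otherwise hangs */
      MPI_Barrier(MPI_COMM_WORLD);    /* workaround barrier */

      if (rank == 0)
          printf("sum = %f\n", global);

      MPI_Finalize();
      return 0;
  }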
We demonstrate this with 4 processes. When we attach a debugger to
the hung processes, we see that the hang results from an
MPI_Allreduce: all processes have made the same call to
MPI_Allreduce, and all of them are in opal_progress, called (with
intervening calls) by MPI_Allreduce.
My question is, have you seen anything like this before? If not, what
do we do next?

Another user reported something similar at
http://www.open-mpi.org/community/lists/users/2009/04/9154.php . That
problem seems to be associated with GCC 4.4.0. What compiler are you
using?
In some test runs here, we do see MPI_Allreduce hangs, but only after
about 40K trials (rather than 10-25% of the time).
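For what it's worth, the trials are essentially repeated
MPI_Allreduce calls; a simplified sketch of the sort of loop involved
(not the actual test code) is below, launched with something like
mpirun -np 4 --mca btl sm,self ./a.out :

  /* Simplified sketch; each loop iteration is one "trial". */
  #include <mpi.h>
  #include <stdio.h>

  int main(int argc, char **argv)
  {
      int rank;
      long i, in, out;

      MPI_Init(&argc, &argv);
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);

      for (i = 0; i < 100000; i++) {   /* tens of thousands of trials */
          in = i + rank;
          MPI_Allreduce(&in, &out, 1, MPI_LONG, MPI_SUM, MPI_COMM_WORLD);
          if (rank == 0 && i % 10000 == 0)
              printf("trial %ld\n", i);
      }

      MPI_Finalize();
      return 0;
  }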
So, it may be that others have seen what you are seeing, but we don't (I
don't) currently understand what's going on.