On Nov 20, 2007, at 6:52 AM, Terry Frankcombe wrote:

I posted this to the devel list the other day, but it raised no
responses.  Maybe people will have more to say here.

Sorry Terry; many of us were at the SC conference last week, and this week is short because of the US holiday. Some of the inbox got dropped/delayed as a result...

(case in point: this mail sat unfinished on my laptop until I returned from the holiday today -- sorry!)

Questions:  How much does using the MPI wrappers influence the memory
management at runtime?

I'm not sure what you mean here, but it's not really the MPI wrappers that are at issue. Rather, it's whether support for the memory manager was compiled into the Open MPI libraries or not. For example (and I just double checked this to be sure) -- I compiled OMPI with and without the memory manager on RHEL4U4, and the output from "mpicc --showme" is exactly the same.
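
If you want to double check what a given build actually contains, I believe ompi_info will list the memory manager component when it was compiled in -- something like this (the exact wording varies a bit between versions, so treat it as a guide rather than gospel):

    shell$ ompi_info | grep memory
              MCA memory: ptmalloc2 (...)

If ptmalloc2 doesn't show up there, the Linux memory manager hooks weren't built into that installation.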

What has changed in this regard from 1.2.3 to 1.2.4?

Nothing, AFAIK...? I don't see anything in NEWS w.r.t. the memory manager stuff for v1.2.4.

The reason I ask is that I have an f90 code that does very strange
things. The structure of the code is not all that straightforward, with a "tree" of modules usually allocating their own storage (all with save
applied globally within the module).  Compiling with OpenMPI 1.2.4
coupled to a gcc 4.3.0 prerelease and running as a single process (with no explicit mpirun), the elements of one particular array seem to revert
to previous values between where they are set and a later part of the
code.  (I'll refer to this as The Bug, and having the matrix elements
stay as set as "expected behaviour".)

Yoinks.  :-(

The most obvious explanation would be a coding error.  However,
compiling and running this with OpenMPI 1.2.3 gives me the expected
behaviour!  As does compiling and running with a different MPI
implementation and compiler set.  Replacing the prerelease gcc 4.3.0
with the released 4.2.2 makes no change.

The Bug is unstable. Removing calls to various routines in used modules (that otherwise do not affect the results) restores the expected behaviour at runtime. Removing a call to MPI_Recv that is never even executed restores the
expected behaviour.

Because of this I can't reduce the problem to a small testcase, and so
have not included any code at this stage.

Ugh.  Heisenbugs are the worst.

Have you tried a memory-checking debugger, such as valgrind, or a parallel debugger? Is there a chance that a simple errant posted receive (perhaps in a race condition) is unexpectedly receiving into the Bug's memory location?
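
Just to be concrete about the kind of thing I mean, here's a made-up sketch (none of these names come from your code) of a posted receive clobbering values that were assigned after the receive was posted. Once the MPI_Irecv is posted, MPI owns that buffer until the request completes, so anything you store into it in the meantime can be silently replaced whenever the matching message shows up:

    program recv_race_sketch
      use mpi
      implicit none
      integer :: ierr, me, req, stat(MPI_STATUS_SIZE)
      double precision :: matrix(10), junk(10)

      call MPI_Init(ierr)
      call MPI_Comm_rank(MPI_COMM_WORLD, me, ierr)

      junk = -99.0d0
      ! Receive posted early; MPI owns "matrix" until the request completes
      call MPI_Irecv(matrix, 10, MPI_DOUBLE_PRECISION, me, 42, &
                     MPI_COMM_WORLD, req, ierr)

      matrix = 1.0d0      ! the elements are "set" here...

      ! ...much later a matching message arrives (a self-send here, just
      ! to keep the sketch self-contained on a single rank)...
      call MPI_Send(junk, 10, MPI_DOUBLE_PRECISION, me, 42, &
                    MPI_COMM_WORLD, ierr)
      call MPI_Wait(req, stat, ierr)

      ! ...and matrix no longer holds the values that were just assigned
      print *, 'matrix(1) is now ', matrix(1)
      call MPI_Finalize(ierr)
    end program recv_race_sketch

Something morally equivalent to this, buried behind a module boundary or a receive you believe is never reached, would produce exactly the "elements revert between where they are set and where they are read" symptom.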

If I run the code with mpirun -np 1 the problem goes away. So one could
presumably simply say "always run it with mpirun."  But if this is
required, why does OpenMPI not detect it?

I'm not sure what you're asking -- Open MPI does not *require* you to run with mpirun. Indeed, the memory management stuff in Open MPI doesn't care whether you use mpirun or not. If you run without mpirun, you'll get an MPI_COMM_WORLD size of 1 (known as a "singleton" MPI job).
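
If it helps, this is the sort of trivial test I mean (the file and program names are just for illustration) -- build it with your mpif90 wrapper and run the executable directly, with no mpirun in front of it, and it should still initialize happily and report a world size of 1:

    program singleton_check
      use mpi
      implicit none
      integer :: ierr, nprocs, me

      call MPI_Init(ierr)
      call MPI_Comm_size(MPI_COMM_WORLD, nprocs, ierr)
      call MPI_Comm_rank(MPI_COMM_WORLD, me, ierr)
      print *, 'rank ', me, ' of ', nprocs   ! as a singleton: "rank 0 of 1"
      call MPI_Finalize(ierr)
    end program singleton_check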

And why the difference
between 1.2.3 and 1.2.4?

There are lots of differences between 1.2.3 and 1.2.4 -- see:

    https://svn.open-mpi.org/trac/ompi/browser/branches/v1.2/NEWS

As for what exactly would cause the Bug to show up in 1.2.4 and not in 1.2.3 -- I don't know. As I said above, Heisenbugs are the worst: changing one thing makes the problem [seem to] go away, etc. It could be that the Bug still exists under 1.2.3 but simply isn't visible there. Buffer overflows can be like that, for example -- if you overflow into an area of memory that doesn't matter, you'll never notice the bug. But move some data around, and perhaps that same overflow now lands on some critical memory and you *will* notice the Bug.
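
Here's a contrived illustration of that effect (again, invented names, nothing to do with your actual code). Whether the stray store is ever noticed depends entirely on what the compiler happens to place right after the scratch array, which is exactly the sort of thing that changes when you swap libraries or remove an unrelated call:

    program overflow_sketch
      implicit none
      double precision :: scratch(10), matrix(10)
      integer :: i, n

      n = 11                  ! off by one: should be 10
      matrix = 1.0d0
      do i = 1, n
         scratch(i) = 0.0d0   ! i = 11 writes past the end of scratch
      end do
      ! may print 1.0 or 0.0 for matrix(1), depending on memory layout
      print *, scratch(1), matrix(1)
    end program overflow_sketch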

--
Jeff Squyres
Cisco Systems
