On Aug 24, 2007, at 4:18 PM, Josh Aune wrote:

We are using open-mpi on several 1000+ node clusters.  We received
several new clusters using the Infiniserve 3.X software stack recently
and are having several problems with the vapi btl (yes, I know, it is
very very old and shouldn't be used.  I couldn't agree with you more
but those are my marching orders).

Thankfully, Infiniserve is not within my purview.  But -- FWIW -- you should be using OFED. :-)  (I know you know)

I have a new application that is running into swap for an unknown
reason.  If I run and force it to use the tcp btl I don't seem to run
into swap (the job just takes a very very long time).  I have tried
restricting the size of the free lists, forcing send mode, and using an
Open MPI compiled with no memory manager, but nothing seems to help.
I've profiled with valgrind --tool=massif and the memtrace capabilities
of ptmalloc, but I don't have any smoking guns yet.  It is a Fortran
app and I don't know anything about debugging Fortran memory problems;
can someone point me in the proper direction?

Hmm.  If you compiled Open MPI with no memory manager, then it *shouldn't* be Open MPI's fault (unless there's a leak in the mvapi BTL...?).  Verify that you did not actually compile Open MPI with a memory manager by running "ompi_info | grep ptmalloc2" -- it should come up empty.
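That is, something along these lines (assuming the ompi_info that's first in your PATH comes from the same install your application is linked against):

    ompi_info | grep ptmalloc2

If that prints nothing, the ptmalloc2 memory manager component was not built into that install.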

The fact that you can run this under TCP without memory leaking would seem to indicate that it's not the app that's leaking memory, but rather either the MPI or the network stack.
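One way to narrow that down is to run the same case with each BTL selected explicitly and watch per-process memory growth; the process count and executable name below are just placeholders:

    mpirun --mca btl tcp,self -np <N> ./your_app
    mpirun --mca btl mvapi,self -np <N> ./your_app

If memory only grows in the mvapi run, that points at the mvapi BTL or the underlying VAPI stack rather than your application.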

--
Jeff Squyres
Cisco Systems
