On Aug 24, 2007, at 4:18 PM, Josh Aune wrote:
We are using Open MPI on several 1000+ node clusters. We recently received several new clusters running the Infiniserve 3.X software stack and are having a number of problems with the vapi btl (yes, I know, it is very old and shouldn't be used; I couldn't agree with you more, but those are my marching orders).
Thankfully, Infiniserve is not within my purview. But -- FWIW -- you should be using OFED. :-) (I know you know)
I have a new application that is running into swap for an unknown reason. If I run it forced onto the tcp btl, I don't seem to run into swap (the job just takes a very, very long time). I have tried restricting the size of the free lists, forcing send mode, and using an Open MPI built with no memory manager, but nothing seems to help. I've profiled with valgrind --tool=massif and the memtrace capabilities of ptmalloc, but I don't have any smoking guns yet. It is a Fortran app and I don't know anything about debugging Fortran memory problems; can someone point me in the proper direction?
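For reference, the runs I've been trying look roughly like the following; the MCA parameter names are from memory and may not match your Open MPI version exactly, so please verify them with "ompi_info --param btl mvapi" before copying anything:

    # baseline that avoids the swap problem: force the tcp btl
    mpirun --mca btl tcp,self -np 512 ./app

    # run over mvapi but cap the receive free lists
    # (free-list parameter name assumed; the send-mode flag is similar --
    #  check "ompi_info --param btl mvapi" for the exact names)
    mpirun --mca btl mvapi,self --mca btl_mvapi_free_list_max 1024 -np 512 ./app

    # heap profiling of the ranks with valgrind's massif tool
    mpirun -np 16 valgrind --tool=massif ./app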
Hmm. If you compile Open MPI with no memory manager, then it *shouldn't* be Open MPI's fault (unless there's a leak in the mvapi BTL...?). Verify that you did not actually compile Open MPI with a memory manager by running "ompi_info | grep ptmalloc2" -- it should come up empty.
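Concretely, something like this is what I'd check first; the grep pattern is just what the ptmalloc2 memory component shows up as in ompi_info output:

    # should print nothing if Open MPI was built without the ptmalloc2 memory manager
    ompi_info | grep ptmalloc2

    # list the parameters the mvapi BTL accepts (free list sizes, send/RDMA flags, etc.)
    ompi_info --param btl mvapi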
The fact that you can run this under TCP without the memory footprint growing would seem to indicate that it's not the app that is leaking memory, but rather the MPI layer or the network stack.
--
Jeff Squyres
Cisco Systems