Brad replied with more information here:

    https://svn.open-mpi.org/trac/ompi/ticket/1791#comment:3


On Feb 17, 2009, at 12:54 PM, Jeff Squyres wrote:

Per the call today, I ran the reduce-hang.c test on a single x86 4-core Xeon node with both 1.2.8 and 1.2.9. I see the same behavior with both:

- none of the processes hang; they all keep iterating over MPI_REDUCE
- the iteration numbers sent to stdout are more-or-less in sync
- memory usage (via top) is fairly steady
- ...except that MPI_COMM_WORLD rank 0's memory usage periodically jumps up significantly

I do not believe that this is a 1.2.9 regression; see below.

For example, I've had the test running for 40+ minutes right now; here's the current output from top:

-----
 PID USER      PR  NI %CPU    TIME+  %MEM  VIRT  RES  SHR S COMMAND
7332 jsquyres  25   0  100 43:56.79  36.1 33.8g 1.3g  67m R reduce-hang
7333 jsquyres  25   0  100 43:54.10   0.7  202m  26m  18m R reduce-hang
7335 jsquyres  25   0  100 43:56.94   1.0  203m  35m  26m R reduce-hang
7334 jsquyres  25   0  100 43:56.81   1.0  203m  36m  27m R reduce-hang
-----

33.8GB virtual size, 1.3GB resident. Obviously, this number started off much lower than that; it took several "jumps" to get to that size. It's now more-or-less stable at these sizes; I haven't seen any jumps recently.

This smells quite a bit like MCW rank 0 simply got busy and "fell behind" for some [temporary] reason. It therefore accumulated a whole pile of unexpected messages, thereby pushing memory utilization up. Those unexpected messages eventually got consumed, but the memory is placed on a freelist and not released.
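To make the freelist effect concrete, here is a toy simulation (not Open MPI code; the class and function names are made up for illustration, and the per-step arrival/drain counts are arbitrary). It assumes a pool that recycles buffers through a freelist but never hands them back to the OS, which is the behavior described above:

```python
class FreeListPool:
    """Toy buffer pool: buffers are recycled via a freelist, never freed."""

    def __init__(self):
        self.free = 0      # buffers idle on the freelist
        self.in_use = 0    # buffers holding unexpected messages

    def acquire(self):
        # Reuse a freelist buffer if one exists; otherwise "allocate" a new one.
        if self.free > 0:
            self.free -= 1
        self.in_use += 1

    def release(self):
        # A consumed message returns its buffer to the freelist, not to the OS.
        self.in_use -= 1
        self.free += 1

    @property
    def allocated(self):
        return self.free + self.in_use


def simulate(arrivals_per_step, drains_per_step):
    """Return a list of (in_use, allocated) pairs, one per step."""
    pool = FreeListPool()
    history = []
    for arrive, drain in zip(arrivals_per_step, drains_per_step):
        for _ in range(arrive):
            pool.acquire()
        for _ in range(min(drain, pool.in_use)):
            pool.release()
        history.append((pool.in_use, pool.allocated))
    return history


if __name__ == "__main__":
    # Hypothetical workload: the receiver stalls for steps 2-4, then catches up.
    arrivals = [3] * 10
    drains = [3, 3, 0, 0, 0, 6, 6, 6, 3, 3]
    for step, (in_use, allocated) in enumerate(simulate(arrivals, drains)):
        print(f"step {step}: in_use={in_use} allocated={allocated}")
```

In this sketch the backlog drains back to zero once the receiver catches up, but the allocated count stays at its high-water mark, which matches the "jump, then stay flat" pattern seen in top.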

This is also congruent with Brad's observation that he's much more likely to see this behavior if he's on a machine with SMT enabled -- where it's much more likely that MCW rank 0 can "fall behind", and possibly never catch up.
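The "never catch up" case can be sketched the same way. This is only a back-of-the-envelope model with hypothetical rates, not a measurement: it assumes the other ranks inject messages at a fixed rate per step and rank 0 consumes them at a fixed, possibly lower, rate (as it might when sharing a core under SMT).

```python
def backlog_over_time(send_rate, recv_rate, steps):
    """Unexpected-message backlog at rank 0 after each step.

    send_rate: messages arriving per step (hypothetical)
    recv_rate: messages rank 0 can consume per step (hypothetical)
    """
    backlog = 0
    trace = []
    for _ in range(steps):
        backlog += send_rate
        backlog -= min(recv_rate, backlog)
        trace.append(backlog)
    return trace


if __name__ == "__main__":
    print(backlog_over_time(4, 4, 5))  # receiver keeps pace
    print(backlog_over_time(4, 3, 5))  # receiver slightly slower
```

With equal rates the backlog stays at zero; with even a small sustained deficit it grows linearly with no upper bound, which is what unbounded unexpected-message buffering would look like.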

So my guess is that this is the same old unbounded unexpected message issue that we've known about for a while.

--
Jeff Squyres
Cisco Systems



--
Jeff Squyres
Cisco Systems
