On Mon, 2006-11-27 at 15:57 -0800, Mark A. Grondona wrote:
> > On Mon, 2006-11-27 at 16:29 -0700, Brian W Barrett wrote:
> > > On Nov 27, 2006, at 4:19 PM, Matt Leininger wrote:
> > >
> > > > I've been running more tests of OpenMPI v1.2b. I've run into several
> > > > cases where the app+MPI use too much memory and the OOM handler kills
> > > > off tasks. Sometimes the ompi mpirun shuts down gracefully, but other
> > > > times the OOM handler may kill off 1 to 4 MPI tasks per node (when I'm
> > > > using 8 MPI tasks per node). The remaining MPI tasks keep
> > > > running/polling and have to be killed off by hand. Has anyone seen
> > > > this behavior before?
> > >
> > > Are the orteds also getting killed?
> >
> > Not sure. I'll check the next time I see this.
> >
>
> I haven't seen any evidence that orteds are being killed by the Out of Memory
> killer. Only MPI application processes seem to be the chosen victim(s).

I can confirm this. I'm running a 2-node, 16-task MPI job. On one node all
8 MPI tasks were killed, and on the other node only 1 MPI task was killed.
The orteds are still running on each node, but they are not cleaning up.

- Matt

> >
> > >
> > > I'm not really familiar with the OOM killer -- does it cause the
> > > parent of the killed process to get a SIGCHLD? If not, that could be
> > > a fairly serious problem for us, as we rely on SIGCHLDs being
> > > received by the orteds when things die...
> >
> > Mark Grondona could answer this. His reply to devel-core bounced so
> > I'm including de...@open-mpi.org on this thread.
> >
> No, being killed by the OOM killer should be the same as being sent
> SIGKILL as far as userspace is concerned. SIGCHLD to the parent will still
> be sent (and wait(2) will return, etc.)
>
> mark
>
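
For anyone who wants to check Mark's point outside of MPI, here is a minimal sketch in Python (Linux, single-threaded). It assumes an explicit SIGKILL is a fair stand-in for the OOM killer, as described above. The function name kill_child_and_observe and the sleeping child are hypothetical and have nothing to do with how orted is actually implemented. It only shows that the parent gets a SIGCHLD and that waitpid reports a death by signal.

```python
import os
import signal
import time


def kill_child_and_observe(timeout=5.0):
    """Fork a child, SIGKILL it, and report what the parent sees."""
    # Block SIGCHLD so it stays pending and can be collected synchronously.
    old_mask = signal.pthread_sigmask(signal.SIG_BLOCK, {signal.SIGCHLD})
    try:
        pid = os.fork()
        if pid == 0:
            # Child: hypothetical stand-in for an MPI task; idle until killed.
            time.sleep(60)
            os._exit(0)

        # Assumption: SIGKILL looks the same to userspace as an OOM kill.
        os.kill(pid, signal.SIGKILL)

        # Wait for the pending SIGCHLD (returns None if the timeout expires).
        info = signal.sigtimedwait({signal.SIGCHLD}, timeout)
        # Reap the child; this is the wait(2) side of the story.
        _, status = os.waitpid(pid, 0)
    finally:
        signal.pthread_sigmask(signal.SIG_SETMASK, old_mask)

    killed = os.WIFSIGNALED(status)
    return {
        "got_sigchld": info is not None and info.si_signo == signal.SIGCHLD,
        "sigchld_from_child": info is not None and info.si_pid == pid,
        "killed_by_signal": killed,
        "term_signal": os.WTERMSIG(status) if killed else None,
    }


if __name__ == "__main__":
    print(kill_child_and_observe())
```

If the parent in a real job is not reacting, a check like this at least separates "the SIGCHLD never arrived" from "the SIGCHLD arrived but was not acted on".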