Re: [OMPI devel] Open MPI and CRIU stdout/stderr

2014-03-19 Thread Jeff Squyres (jsquyres)
On Mar 19, 2014, at 9:13 AM, Adrian Reber wrote:

> What does Open MPI do with the file descriptors for stdout/stderr?

We admittedly do funny things with stdin, stdout, and stderr...  The short 
version is that OMPI intercepts all the stdin, stdout, and stderr from each MPI 
process and relays it back up to mpirun through our IOF subsystem (IOF = I/O 
forwarding).

Consider: users launch N processes (potentially on multiple different servers) 
via

   mpirun --hostfile hosts -np N my_mpi_executable

They also expect to be able to use standard shell redirection via the mpirun 
command.  For example:

   mpirun --hostfile hosts -np N my_mpi_executable |& tee out.txt

To explain what happens, we have to explain a little of how OMPI launches 
processes. Let's take the ssh case, for simplicity (there are other mechanisms 
it can use to launch on remote servers, but for the purposes of this 
discussion, they're basically variants of what happens with ssh).

1. mpirun parses the hosts hostfile and extracts the list of servers on which 
to launch.
2. mpirun fork/execs an ssh command to each remote node, and launches the Open 
MPI helper daemon "orted".
3. The orted launches on the remote server, does some housekeeping, and 
eventually receives the launch command from mpirun.
4. The launch command contains the executable and argv to fork/exec, and how 
many copies of it to launch.
5. For example: mpirun --hostfile hosts -np 4 my_mpi_executable.  If the 
"hosts" file contains serverA and serverB, then mpirun would launch 2 ssh's -- 
one each to serverA and serverB.  After some startup negotiation, mpirun would 
send a launch command telling the orted on each of serverA and serverB to 
launch 2 copies of my_mpi_executable.
6. For each child that the orted will create, it:
   - creates (up to) 3 pipes, for: stdin, stdout, stderr
   - forks
   - closes stdin, stdout, stderr
   - dups the pipes into 0, 1, 2
   - (by default, we actually close stdin on all processes except the first one)
   - execs my_mpi_executable
7. In this way, the orted can intercept the stdout/stderr from the process and 
send it back to mpirun, which can then write it on its own stdout/stderr.  And 
therefore shell redirection from mpirun works as expected.
8. Similarly, the stdin from mpirun can be sent to any process where we kept 
stdin open (as mentioned above, by default, this is only the first process).
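The per-child sequence in step 6 can be sketched roughly as follows. This is only an illustration in Python of the general pipe/fork/dup/exec pattern, not the actual orted code; the names spawn_with_pipes and drain are hypothetical, and dup2() is used here to combine the "close" and "dup" steps into one call.

```python
import os
import select


def spawn_with_pipes(argv, keep_stdin=False):
    """Fork/exec argv with its stdout/stderr (optionally stdin) on pipes.

    Returns (pid, stdin_write_fd_or_None, stdout_read_fd, stderr_read_fd).
    Hypothetical helper for illustration; not part of Open MPI.
    """
    # Create (up to) 3 pipes: stdout, stderr, and optionally stdin.
    out_r, out_w = os.pipe()
    err_r, err_w = os.pipe()
    in_r = in_w = None
    if keep_stdin:
        in_r, in_w = os.pipe()

    pid = os.fork()
    if pid == 0:
        # Child: replace fds 0, 1, 2 with the pipe ends, then exec.
        try:
            if keep_stdin:
                os.dup2(in_r, 0)
            else:
                # Mirrors the default of leaving stdin closed.
                try:
                    os.close(0)
                except OSError:
                    pass  # stdin was already closed
            os.dup2(out_w, 1)  # dup2 closes the old fd 1 first
            os.dup2(err_w, 2)  # dup2 closes the old fd 2 first
            os.execvp(argv[0], argv)
        finally:
            # Only reached if something above failed.
            os._exit(127)

    # Parent (the proxy): keep only its own ends of the pipes.
    os.close(out_w)
    os.close(err_w)
    if keep_stdin:
        os.close(in_r)
    return pid, in_w, out_r, err_r


def drain(fds):
    """Read every fd until EOF; returns {fd: bytes} and closes the fds."""
    data = {fd: b"" for fd in fds}
    open_fds = set(fds)
    while open_fds:
        ready, _, _ = select.select(list(open_fds), [], [])
        for fd in ready:
            chunk = os.read(fd, 4096)
            if chunk:
                data[fd] += chunk
            else:
                open_fds.discard(fd)
                os.close(fd)
    return data
```

A caller playing the role of the proxy would call spawn_with_pipes() for each child, use drain() (or an event loop) on the returned read ends, and forward the collected bytes upstream. select() is used so that a child writing heavily to one stream cannot block the reader on the other.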

In short: the orted acts as a proxy for the stdout and stderr (and potentially 
stdin) for all launched processes.

> Would it make sense to close stdout/stderr of each checkpointed process
> before checkpointing it?

Maybe...?

But my gut reaction is that you don't want to because of the "continue" case.  
I.e., having the orted go through all the IOF setup again could be a bit 
tricky...  We didn't need to do this for other checkpointing systems.

-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: 
http://www.cisco.com/web/about/doing_business/legal/cri/



[OMPI devel] Open MPI and CRIU stdout/stderr

2014-03-19 Thread Adrian Reber
Cross-posting to criu and openmpi devel mailing lists.

To get fault tolerance back into Open MPI I added code to use criu as
a checkpoint/restart tool. I can checkpoint a process successfully
but I have trouble restarting it. CRIU currently has problems restoring
the process, which is probably related to stdout/stderr handling.

(00.026198)  15852: Error (tty.c:541): tty: Can't dup SELF_STDIN_OFF: Bad file 
descriptor
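
"Bad file descriptor" is the message for errno EBADF, which is what operations such as dup() or fstat() return when given a descriptor number that is not open. A minimal Python sketch of how one might check which of the standard descriptors are open in a process is shown here. It is only a diagnostic illustration and not CRIU or Open MPI code; the names fd_is_open and closed_standard_fds are hypothetical, and it does not by itself establish the cause of the error above.

```python
import errno
import os


def fd_is_open(fd):
    """Return True if fd refers to an open descriptor in this process."""
    try:
        os.fstat(fd)
    except OSError as exc:
        if exc.errno == errno.EBADF:
            return False
        raise  # some other failure; do not hide it
    return True


def closed_standard_fds():
    """Return the subset of (0, 1, 2) that is currently closed."""
    return [fd for fd in (0, 1, 2) if not fd_is_open(fd)]
```

Calling closed_standard_fds() from inside the process to be checkpointed would show whether stdin, stdout, or stderr is closed at that point.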

What does Open MPI do with the file descriptors for stdout/stderr?

Would it make sense to close stdout/stderr of each checkpointed process
before checkpointing it?

Is there something concerning stdout/stderr which I forgot to handle?

Adrian