Re: [OMPI users] "Connection to lifeline lost" when developing a new rsh agent
Have you looked through the code in orte/mca/plm/rsh/plm_rsh_module.c? It executes a tree-like spawn pattern by default, but there isn't anything magic about what ssh is doing. However, some things are done to prep the remote shell (setting paths, etc.), and the tree spawn passes some additional parameters. It would be worth your while to read through it to see if just replacing ssh is going to be enough for your environment.

The OOB output is telling you that the connection is being attempted, but is being rejected for some reason during the return "ACK". Not sure why that would be happening, unless the remote daemon died during the connection handshake.

--debug-daemons doesn't do anything but (a) turn on the debug output, and (b) cause ssh to leave the session open by telling the orted not to "daemonize" itself. The --leave-session-attached option does (b) without all the debug output.

On Aug 21, 2012, at 8:15 AM, Yann RADENAC wrote:

> On 20/08/2012 at 15:56, Ralph Castain wrote:
>> You might try adding "-mca plm_base_verbose 5 --debug-daemons" to watch the
>> debug output from the daemons as they are launched.
>
> There seems to be an interference here: my problem is "solved" by enabling
> option --debug-daemons with a verbose level > 0!
>
> This command fails (3 processes on 3 different machines):
>
>     mpirun --mca orte_rsh_agent xos-createProcess --leave-session-attached -np 3 -host `xreservation -a $XOS_RSVID` mpi/hello_world_MPI
>
> This command works (just adding --debug-daemons with verbose level > 0):
>
>     mpirun --mca orte_rsh_agent xos-createProcess --leave-session-attached -mca plm_base_verbose 5 --debug-daemons -np 3 -host `xreservation -a $XOS_RSVID` mpi/hello_world_MPI
>
> Anyway, this is just a workaround, and requiring the users to set the
> --debug-daemons option is not acceptable.
>
> So what are ssh and --debug-daemons doing that my agent xos-createProcess
> is not doing?
>
>> The lifeline is a socket connection between the daemons and mpirun. For some
>> reason, the socket from your remote daemon back to mpirun is being closed,
>> which the remote daemon interprets as "lifeline lost" and terminates itself.
>> You could try setting the verbosity on the OOB to get the debug output from
>> it (see "ompi_info --param oob tcp" for the settings), though it's likely to
>> just tell you that the socket closed.
>
> By the way, no firewall is running on any of my machines.
>
> Using the oob_tcp options:
>
>     mpirun --mca orte_rsh_agent xos-createProcess --leave-session-attached -mca oob_tcp_debug 1 -mca oob_tcp_verbose 2 -np 3 -host `xreservation -a $XOS_RSVID` mpi/hello_world_MPI
>
> On the machine running mpirun, the process is still waiting (polling) and
> the standard error output is:
>
>     [paradent-26.rennes.grid5000.fr:27762] [[1338,0],0]-[[1338,0],2] accepted: 172.16.97.26 - 172.16.97.6 nodelay 1 sndbuf 262142 rcvbuf 262142 flags 0802
>     [paradent-26.rennes.grid5000.fr:27762] [[1338,0],0]-[[1338,0],2] mca_oob_tcp_recv_handler: rejected connection from [[1338,0],2] connection state 4
>
> On the remote machine running the orted, the orted fails and the standard
> error output is:
>
>     [paradent-6.rennes.grid5000.fr:10391] [[1338,0],2] routed:binomial: Connection to lifeline [[1338,0],0] lost
Re: [OMPI users] "Connection to lifeline lost" when developing a new rsh agent
On 20/08/2012 at 15:56, Ralph Castain wrote:
> You might try adding "-mca plm_base_verbose 5 --debug-daemons" to watch the
> debug output from the daemons as they are launched.

There seems to be an interference here: my problem is "solved" by enabling option --debug-daemons with a verbose level > 0!

This command fails (3 processes on 3 different machines):

    mpirun --mca orte_rsh_agent xos-createProcess --leave-session-attached -np 3 -host `xreservation -a $XOS_RSVID` mpi/hello_world_MPI

This command works (just adding --debug-daemons with verbose level > 0):

    mpirun --mca orte_rsh_agent xos-createProcess --leave-session-attached -mca plm_base_verbose 5 --debug-daemons -np 3 -host `xreservation -a $XOS_RSVID` mpi/hello_world_MPI

Anyway, this is just a workaround, and requiring the users to set the --debug-daemons option is not acceptable.

So what are ssh and --debug-daemons doing that my agent xos-createProcess is not doing?

> The lifeline is a socket connection between the daemons and mpirun. For some
> reason, the socket from your remote daemon back to mpirun is being closed,
> which the remote daemon interprets as "lifeline lost" and terminates itself.
> You could try setting the verbosity on the OOB to get the debug output from
> it (see "ompi_info --param oob tcp" for the settings), though it's likely to
> just tell you that the socket closed.

By the way, no firewall is running on any of my machines.

Using the oob_tcp options:

    mpirun --mca orte_rsh_agent xos-createProcess --leave-session-attached -mca oob_tcp_debug 1 -mca oob_tcp_verbose 2 -np 3 -host `xreservation -a $XOS_RSVID` mpi/hello_world_MPI

On the machine running mpirun, the process is still waiting (polling) and the standard error output is:

    [paradent-26.rennes.grid5000.fr:27762] [[1338,0],0]-[[1338,0],2] accepted: 172.16.97.26 - 172.16.97.6 nodelay 1 sndbuf 262142 rcvbuf 262142 flags 0802
    [paradent-26.rennes.grid5000.fr:27762] [[1338,0],0]-[[1338,0],2] mca_oob_tcp_recv_handler: rejected connection from [[1338,0],2] connection state 4

On the remote machine running the orted, the orted fails and the standard error output is:

    [paradent-6.rennes.grid5000.fr:10391] [[1338,0],2] routed:binomial: Connection to lifeline [[1338,0],0] lost
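One way to compare what mpirun hands to ssh with what it hands to xos-createProcess is to put a small logging wrapper in front of the agent. The following is only a sketch: the script itself, the REAL_AGENT and AGENT_LOG environment variables, and the default log path are all made up for illustration, and it assumes the agent is invoked the way ssh would be, with its arguments passed straight through.

```python
#!/usr/bin/env python3
"""Hypothetical logging wrapper for an rsh agent (not part of Open MPI)."""
import os
import shlex
import sys
import time


def format_invocation(args):
    """Return one shell-quoted line showing the arguments the agent received."""
    return " ".join(shlex.quote(arg) for arg in args)


def main(argv):
    # REAL_AGENT and AGENT_LOG are invented names used only by this sketch.
    real_agent = os.environ.get("REAL_AGENT", "ssh")
    log_path = os.environ.get("AGENT_LOG", "/tmp/agent-wrapper.log")
    with open(log_path, "a") as log:
        stamp = time.strftime("%H:%M:%S")
        log.write(f"{stamp} pid={os.getpid()}: {format_invocation(argv[1:])}\n")
    # Replace this process with the real agent so that stdin/stdout/stderr
    # and the exit status pass through unchanged.
    os.execvp(real_agent, [real_agent] + argv[1:])


# Only act when called with arguments, as an agent would be.
if __name__ == "__main__" and len(sys.argv) > 1:
    main(sys.argv)
```

Running the same job once with REAL_AGENT=ssh and once with REAL_AGENT=xos-createProcess, with the wrapper given as the orte_rsh_agent value, would leave two sets of log lines that can be compared side by side.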
Re: [OMPI users] "Connection to lifeline lost" when developing a new rsh agent
Just to be clear: what you are launching is an orted daemon, not your application process. Once the daemons are running, we use them to launch the actual application processes. So the issue here is with starting the daemons themselves.

You might try adding "-mca plm_base_verbose 5 --debug-daemons" to watch the debug output from the daemons as they are launched.

The lifeline is a socket connection between the daemons and mpirun. For some reason, the socket from your remote daemon back to mpirun is being closed, which the remote daemon interprets as "lifeline lost" and terminates itself. You could try setting the verbosity on the OOB to get the debug output from it (see "ompi_info --param oob tcp" for the settings), though it's likely to just tell you that the socket closed.

On Aug 20, 2012, at 5:11 AM, Yann RADENAC wrote:

> Hi,
>
> I'm developing MPI support for XtreemOS (www.xtreemos.eu) so that an MPI
> program is managed as a single XtreemOS job.
> To manage all processes as a single XtreemOS job, I've developed the program
> xos-createProcess, which plays the role of the rsh agent (replacing ssh/rsh)
> to start a process on a remote machine that is one of those reserved for the
> current job.
>
> I'm running a simple hello world MPI program where each process sends a
> string to process 0, which prints them on standard output.
>
> When using Open MPI with ssh, this program works perfectly on several
> machines.
>
> When using Open MPI with my launcher xos-createProcess, it works with an MPI
> program of 2 processes on 2 different machines.
>
> However, I cannot get past the following error, which happens when running
> an MPI program of 3 processes on 3 different machines (or any n processes on
> n different machines with n >= 3).
>
> A process started by xos-createProcess on a remote machine ends with the
> following error:
>
>     [paradent-5.rennes.grid5000.fr:08191] [[50627,0],2] routed:binomial: Connection to lifeline [[50627,0],0] lost
>
> But process 0 is still running! The lifeline should not have been lost!
> Actually, process 0 is still waiting for the remote processes to terminate
> (checked with gdb: the initial process is calling libc's poll()).
>
> The run command is:
>
>     -bash -c '(mpirun --mca orte_rsh_agent xos-createProcess --leave-session-attached -np 2 -host `xreservation -a $XOS_RSVID` mpi/hello_world_MPI < /dev/null > mpirun.out) >& mpirun.err'
>
> Same problem with or without option --leave-session-attached.
>
> So, how is the lifeline implemented? Why does it work with 2 processes but
> start failing when using 3 or more processes?
>
> I'm using Open MPI 1.6.
>
> Thanks for your help.
>
> --
> Yann Radenac
> Research Engineer, INRIA
> Myriads research team, INRIA Rennes - Bretagne Atlantique
[OMPI users] "Connection to lifeline lost" when developing a new rsh agent
Hi,

I'm developing MPI support for XtreemOS (www.xtreemos.eu) so that an MPI program is managed as a single XtreemOS job. To manage all processes as a single XtreemOS job, I've developed the program xos-createProcess, which plays the role of the rsh agent (replacing ssh/rsh) to start a process on a remote machine that is one of those reserved for the current job.

I'm running a simple hello world MPI program where each process sends a string to process 0, which prints them on standard output.

When using Open MPI with ssh, this program works perfectly on several machines.

When using Open MPI with my launcher xos-createProcess, it works with an MPI program of 2 processes on 2 different machines.

However, I cannot get past the following error, which happens when running an MPI program of 3 processes on 3 different machines (or any n processes on n different machines with n >= 3).

A process started by xos-createProcess on a remote machine ends with the following error:

    [paradent-5.rennes.grid5000.fr:08191] [[50627,0],2] routed:binomial: Connection to lifeline [[50627,0],0] lost

But process 0 is still running! The lifeline should not have been lost! Actually, process 0 is still waiting for the remote processes to terminate (checked with gdb: the initial process is calling libc's poll()).

The run command is:

    -bash -c '(mpirun --mca orte_rsh_agent xos-createProcess --leave-session-attached -np 2 -host `xreservation -a $XOS_RSVID` mpi/hello_world_MPI < /dev/null > mpirun.out) >& mpirun.err'

Same problem with or without option --leave-session-attached.

So, how is the lifeline implemented? Why does it work with 2 processes but start failing when using 3 or more processes?

I'm using Open MPI 1.6.

Thanks for your help.

--
Yann Radenac
Research Engineer, INRIA
Myriads research team, INRIA Rennes - Bretagne Atlantique
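On the question of how a lifeline can be "lost" while process 0 is still running: the replies describe the lifeline as a socket connection whose closure the remote daemon treats as loss of its parent. A minimal sketch of that general pattern follows. It is not Open MPI's OOB code; socket.socketpair() stands in for the TCP connection between hosts, and the names daemon_loop and run_demo are invented for illustration.

```python
import socket
import threading


def daemon_loop(lifeline, events):
    """Block on the lifeline socket; treat end-of-file as loss of the parent."""
    while True:
        data = lifeline.recv(4096)
        if not data:
            # recv() returning b"" means the peer's end of the socket was closed.
            events.append("lifeline lost")
            return
        events.append("message: " + data.decode())


def run_demo():
    # socketpair() is a local stand-in for the connection back to the launcher.
    parent_end, daemon_end = socket.socketpair()
    events = []
    worker = threading.Thread(target=daemon_loop, args=(daemon_end, events))
    worker.start()
    parent_end.sendall(b"launch")
    # Closing only the socket simulates a dropped connection; the "parent"
    # (this thread) keeps running afterwards.
    parent_end.close()
    worker.join(timeout=5)
    daemon_end.close()
    return events


if __name__ == "__main__":
    print(run_demo())
```

The point of the sketch is that the daemon side only sees the state of the socket. Anything that closes the connection, including a rejected handshake like the one in the OOB output, looks the same to it as the parent exiting.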