[OMPI devel] orterun hanging
I'm running into a hang that is very easy to reproduce. Basically, something like this:

% mpirun -H remote_node hostname
remote_node
^C

That is, I run a program (it doesn't need to be MPI) on a remote node. The program runs, but my local orterun doesn't return.

The problem seems to be correlated with the OS version (some very recent builds of Solaris) running on the remote node. The problem would seem to be in the OS, though arguably it could be a long-standing OMPI problem that is being exposed by a change in the OS. Regardless, does anyone have suggestions where I should be looking?

So far, it looks to me like the HNP orterun forks a child, which launches an ssh process to start the remote orted. Then, the remote orted daemonizes itself (forks a child and kills the parent, thereby detaching the daemon from the controlling terminal) and runs the user binary.

It seems to me that this daemonization is related to the problem. Specifically, if I use "mpirun --debug-daemons", there is no daemonization and the hang does not occur. Perhaps, with some recent OS changes, the daemonized process is no longer alerting the HNP orterun when it's done.

Any suggestions where I should focus my efforts? I'm working with v1.5.
Re: [OMPI devel] Add child to another parent.
Hi all.

I corrected the error with the port. The mistake was that, when it tried to restart the process, the ports are static, so the process was taking a port where an app was already running.

Initially, the process was running on [[65478,0],1] and then it moved to [[65478,0],2].

So now the socket gets bound, but I'm getting a communication failure in [[65478,0],1]. I'm sending my debug output as an attachment (some messages are in Spanish, but the default Open MPI debug output is still there); in it you can see the moment where I kill the process running on clus5 up to the moment where it is restored on clus3. And then I get a TERMINATED WITHOUT SYNC in the restarted proc:

clus3:21615] [[65478,0],2] errmgr:orted got state TERMINATED WITHOUT SYNC for proc [[65478,1],1] pid 21705

Here is the output on my stdout after the socket is bound again when the process restarts.

[1,1]:SOCKET BINDED
[1,1]:[clus5:19425] App) notify_response: Waiting for final handshake.
[1,1]:[clus5:19425] App) update_status: Update checkpoint status (13, /tmp/radic/1) for [[65478,1],1]
[1,0]:INICIEI O BROADCAST (6)
[1,0]:FINALIZEI O BROADCAST (6)
[1,0]:INICIEI O BROADCAST
[1,3]:INICIEI O BROADCAST (6)
[1,2]:INICIEI O BROADCAST (6)
[1,3]:FINALIZEI O BROADCAST (6)
[1,3]:INICIEI O BROADCAST
[1,2]:FINALIZEI O BROADCAST (6)
[1,2]:INICIEI O BROADCAST
[1,1]:[clus5:19425] [[65478,1],1] errmgr:app: job [65478,0] reported state COMMUNICATION FAILURE for proc [[65478,0],1] state COMMUNICATION FAILURE exit_code 1
[1,1]:[clus5:19425] [[65478,1],1] routed:cm: Connection to lifeline [[65478,0],1] lost
[1,1]:[[65478,1],1] assigned port 31256

Any help on how to solve this error, or how to interpret it, will be greatly appreciated.

Best regards.

Hugo

2011/4/5 Hugo Meyer
> Hello Ralph and all.
>
> Ralph, following your recommendations I've already restarted the process on
> another node from its checkpoint. But now I'm having a small problem with
> the oob_tcp.
> Here is the output:
>
> odls_base_default_fns:SETEANDO BLCR CONTEXT
> CKPT-FILE: /tmp/radic/1/ompi_blcr_context.13374
> ODLS_BASE_DEFAULT_FNS: REINICIO PROCESO EN [[34224,0],2]
> [1,1]:INICIEI O BROADCAST (2)
> [1,1]:[clus5:13374] snapc:single:app do_checkpoint: RESTART (3)
> [1,1]:[clus5:13374] mca_oob_tcp_init: creating listen socket
> [1,1]:[clus5:13374] mca_oob_tcp_init: unable to create IPv4 listen socket: Unable to open a TCP socket for out-of-band communications
> [1,1]:[clus5:13374] App) notify_response: Waiting for final handshake.
> [1,1]:[clus5:13374] App) update_status: Update checkpoint status (13, /tmp/radic/1) for [[34224,1],1]
> [1,0]:INICIEI O BROADCAST (6)
> [1,0]:FINALIZEI O BROADCAST (6)
> [1,0]:INICIEI O BROADCAST
> [1,3]:INICIEI O BROADCAST (6)
> [1,3]:FINALIZEI O BROADCAST (6)
> [1,3]:INICIEI O BROADCAST
> [1,1]:[clus5:13374] [[34224,1],1] errmgr:app: job [34224,0] reported state COMMUNICATION FAILURE for proc [[34224,0],1] state COMMUNICATION FAILURE exit_code 1
> [1,1]:[clus5:13374] [[34224,1],1] routed:cm: Connection to lifeline [[34224,0],1] lost
>
> I'm thinking this error occurs because the process wants to create the
> socket using the port that was previously assigned to it. So, if I want to
> restart it using another port or something, how will the other daemons and
> processes find out about it? Is this a good choice?
>
> Best regards.
>
> Hugo Meyer
>
> 2011/3/31 Hugo Meyer
>> Ok Ralph.
>> Thanks a lot, I will resend this message with a new subject.
>>
>> Best regards.
>>
>> Hugo
>>
>> 2011/3/31 Ralph Castain
>>> Sorry - I should have included the devel list when I sent this.
>>>
>>> On Mar 30, 2011, at 6:11 PM, Ralph Castain wrote:
>>>
>>> I'm not the expert on this area - Josh is, so I'll defer to him. I did
>>> take a quick glance at the sstore framework, though, and it looks like there
>>> are some params you could set that might help.
>>>
>>> "ompi_info --param sstore all"
>>>
>>> should tell you what's available. Also, note that Josh created a man page
>>> to explain how sstore works. It's in section 7; it looks like "man orte_sstore"
>>> should get it.
>>>
>>> On Mar 30, 2011, at 3:09 PM, Hugo Meyer wrote:
>>>
>>> Hello again.
>>>
>>> I'm working in the launch code to handle my checkpoints, but I'm a little
>>> stuck on how to set the path to my checkpoint and the executable
>>> (ompi_blcr_context.PID). I took a look at the code in
>>> odls_base_default_fns.c and this piece of code caught my attention:
>>>
>>> #if OPAL_ENABLE_FT_CR == 1
>>>     /*
>>>      * OPAL CRS components need the opportunity to take action before a process
>>>      * is forked.
>>>      * Needs access to:
>>>      *   - Environment
>>>      *   - Rank/ORTE Name
>>>      *   - Binary to exec
>>>      */
>>>     if( NULL != opal_crs.crs_prelaunch ) {
>>>         if( OPAL_SUCCESS != (rc = opal_crs.crs_prelaunch(child->name->vpid,
Re: [OMPI devel] Add child to another parent.
Looks like the lifeline is still pointing to its old daemon instead of being updated to the new one. Look in orte/mca/routed/cm/routed_cm.c - there should be something in there that updates the lifeline during restart of a checkpoint.

On Apr 6, 2011, at 7:50 AM, Hugo Meyer wrote:
> [...]
[OMPI devel] Open MPI Developers Meeting Agenda
Reminder: If you are interested in attending the May 3-5 Open MPI Developers Meeting at ORNL, let Rich (rlgraham -at- ornl -dot- gov) and me know as soon as possible so we can start the paperwork. This is particularly important for non-US citizens, since the paperwork takes considerably more time.

The meeting will be three full days (May 3-5) on the ORNL campus. I intend to set up a teleconf for some/most of the sessions for those who cannot attend in person. Once we have the agenda topics on the table we can start negotiating time allotments.

Below are the agenda items that I have gathered so far (in no particular order):
- MPI 2.2 implementation tickets
- MPI 3.0 implementation planning
- ORNL: Hierarchical Collectives discussion
- Runtime integration discussion
- New Process Affinity functionality
- Update on ORTE development
- Fault tolerance feature development and integration (C/R, logging, replication, FT-MPI, MPI 3.0, message reliability, ...)

Other topics I thought of that folks might want to discuss - is there anyone who wants to include these and lead their discussions?
- Threading design
- Performance tuning (point-to-point and/or collective)
- Testing infrastructure (MTT)

Keep sending agenda items to the list (or to me directly if you would rather). I hope to have the agenda sketched out by the teleconf on 4/12 so we can fine-tune it on the call.

Thanks,
Josh

Joshua Hursey
Postdoctoral Research Associate
Oak Ridge National Laboratory
http://users.nccs.gov/~jjhursey