[OMPI devel] orterun hanging

2011-04-06 Thread Eugene Loh
I'm running into a hang that is very easy to reproduce.  Basically, 
something like this:


% mpirun -H remote_node hostname
remote_node
^C

That is, I run a program (it doesn't need to be MPI) on a remote node.  The 
program runs, but my local orterun doesn't return.  The problem seems to be 
correlated with the OS version (some very recent builds of Solaris) running 
on the remote node.


The problem would seem to be in the OS, though arguably it could be a 
long-time OMPI problem that is being exposed by a change in the OS.  
Regardless, does anyone have suggestions where I should be looking?


So far, it looks to me like the HNP orterun forks a child that launches an 
ssh process to start the remote orted.  The remote orted then daemonizes 
itself (forks a child and kills the parent, thereby detaching the daemon 
from the controlling terminal) and runs the user binary.  This 
daemonization seems to be related to the problem.  Specifically, if I use 
"mpirun --debug-daemons", there is no daemonization and the hang does not 
occur.  Perhaps, with some recent OS changes, the daemonized process is no 
longer alerting the HNP orterun when it's done.
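
For reference, the daemonization I'm describing is essentially the usual 
fork-then-setsid pattern; here is a minimal, hypothetical C sketch (my own 
illustration with a made-up helper name, not the actual orted code):

/* Minimal daemonization sketch -- illustrative only, not the orted code. */
#include <stdlib.h>
#include <sys/types.h>
#include <unistd.h>

static int daemonize(void)        /* hypothetical helper */
{
    pid_t pid = fork();
    if (pid < 0) {
        return -1;                /* fork failed */
    }
    if (pid > 0) {
        exit(0);                  /* parent exits; the child lives on */
    }
    if (setsid() < 0) {           /* child becomes a session leader and so
                                     detaches from the controlling terminal */
        return -1;
    }
    /* From here on, nothing about the terminal/session tells the HNP when
       the daemon's work is done; the daemon must report that explicitly. */
    return 0;
}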


Any suggestions where I should focus my efforts?  I'm working with v1.5.


Re: [OMPI devel] Add child to another parent.

2011-04-06 Thread Hugo Meyer
Hi all.


I corrected the error with the port. The mistake was that, when the process was
restarted, and because the ports are static, it was taking a port where another
app was already running.

Initially, the process was running on [[65478,0],1] and then it moved to
[[65478,0],2].
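
(For context, that failure mode is just the standard bind() error when a fixed
port is already taken; a small generic sketch of it, not Open MPI code, and
bind_fixed_port is a made-up name:)

/* Generic illustration: binding a fixed listen port that is already owned
   by another process fails with EADDRINUSE. Not Open MPI code. */
#include <errno.h>
#include <netinet/in.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

static int bind_fixed_port(uint16_t port)   /* hypothetical helper */
{
    int fd = socket(AF_INET, SOCK_STREAM, 0);
    struct sockaddr_in addr;

    if (fd < 0) {
        return -1;
    }
    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_ANY);
    addr.sin_port = htons(port);
    if (bind(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0) {
        /* When the old instance (or any other app) still owns the port,
           errno is EADDRINUSE -- the "unable to create listen socket" case. */
        perror("bind");
        close(fd);
        return -1;
    }
    return fd;
}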

So now the socket gets bound, but I'm getting a communication failure in
[[65478,0],1]. I'm sending my debug output as an attachment (some of it is in
Spanish, but the default Open MPI debug output is still there); it covers the
moment where I kill the process running on clus5 through the moment where it is
restored on clus3. Then I get a TERMINATED WITHOUT SYNC for the restarted proc:

clus3:21615] [[65478,0],2] errmgr:orted got state TERMINATED WITHOUT SYNC
for proc [[65478,1],1] pid 21705

Here is the output of my stdout from after the socket is bound again when the
process restarts.


[1,1]:SOCKET BINDED
[1,1]:[clus5:19425] App) notify_response: Waiting for final
handshake.
[1,1]:[clus5:19425] App) update_status: Update checkpoint status
(13, /tmp/radic/1) for [[65478,1],1]
[1,0]:INICIEI O BROADCAST (6)
[1,0]:FINALIZEI O BROADCAST (6)
[1,0]:INICIEI O BROADCAST
[1,3]:INICIEI O BROADCAST (6)
[1,2]:INICIEI O BROADCAST (6)
[1,3]:FINALIZEI O BROADCAST (6)
[1,3]:INICIEI O BROADCAST
[1,2]:FINALIZEI O BROADCAST (6)
[1,2]:INICIEI O BROADCAST
[1,1]:[clus5:19425] [[65478,1],1] errmgr:app: job [65478,0] reported
state COMMUNICATION FAILURE for proc [[65478,0],1] state COMMUNICATION
FAILURE exit_code 1
[1,1]:[clus5:19425] [[65478,1],1] routed:cm: Connection to lifeline
[[65478,0],1] lost
[1,1]:[[65478,1],1] assigned port 31256

Any help on how to solve this error, or how to interpret it, will be greatly
appreciated.

Best regards.

Hugo

2011/4/5 Hugo Meyer 

> Hello Ralph and @ll.
>
> Ralph, by following your recommendations I've already restarted the process on
> another node from its checkpoint. But now I'm having a small problem with
> the oob_tcp. Here is the output:
>
> odls_base_default_fns:SETEANDO BLCR CONTEXT
> CKPT-FILE: /tmp/radic/1/ompi_blcr_context.13374
> ODLS_BASE_DEFAULT_FNS: REINICIO PROCESO EN [[34224,0],2]
> [1,1]:INICIEI O BROADCAST (2)
> [1,1]:[clus5:13374] snapc:single:app do_checkpoint: RESTART (3)
> [1,1]:[clus5:13374] mca_oob_tcp_init: creating listen socket
> [1,1]:[clus5:13374] mca_oob_tcp_init: unable to create IPv4
> listen socket: Unable to open a TCP socket for out-of-band communications
> [1,1]:[clus5:13374] App) notify_response: Waiting for final
> handshake.
> [1,1]:[clus5:13374] App) update_status: Update checkpoint status
> (13, /tmp/radic/1) for [[34224,1],1]
> [1,0]:INICIEI O BROADCAST (6)
> [1,0]:FINALIZEI O BROADCAST (6)
> [1,0]:INICIEI O BROADCAST
> [1,3]:INICIEI O BROADCAST (6)
> [1,3]:FINALIZEI O BROADCAST (6)
> [1,3]:INICIEI O BROADCAST
> [1,1]:[clus5:13374] [[34224,1],1] errmgr:app: job [34224,0]
> reported state COMMUNICATION FAILURE for proc [[34224,0],1] state
> COMMUNICATION FAILURE exit_code 1
> [1,1]:[clus5:13374] [[34224,1],1] routed:cm: Connection to
> lifeline [[34224,0],1] lost
>
>
> I'm thinking that this error occurs because the process wants to create the
> socket using the port that was previously assigned to it. So, if I want to
> restart it using another port or something, how will the other daemons and
> processes find out about this? Is this a good approach?
>
> Best regards.
>
> Hugo Meyer
>
> 2011/3/31 Hugo Meyer 
>
>> Ok Ralph.
>> Thanks a lot, I will resend this message with a new subject.
>>
>> Best Regards.
>>
>> Hugo
>>
>>
>> 2011/3/31 Ralph Castain 
>>
>>> Sorry - should have included the devel list when I sent this.
>>>
>>>
>>> On Mar 30, 2011, at 6:11 PM, Ralph Castain wrote:
>>>
>>> I'm not the expert on this area - Josh is, so I'll defer to him. I did
>>> take a quick glance at the sstore framework, though, and it looks like there
>>> are some params you could set that might help.
>>>
>>> "ompi_info --param sstore all"
>>>
>>> should tell you what's available. Also, note that Josh created a man page
>>> to explain how sstore works. It's in section 7, looks like "man orte_sstore"
>>> should get it.
>>>
>>>
>>> On Mar 30, 2011, at 3:09 PM, Hugo Meyer wrote:
>>>
>>> Hello again.
>>>
>>> I'm working in the launch code to handle my checkpoints, but I'm a little
>>> stuck on how to set the path to my checkpoint and the executable
>>> (ompi_blcr_context.PID). I took a look at the code in
>>> odls_base_default_fns.c and this piece of code caught my attention:
>>>
>>> #if OPAL_ENABLE_FT_CR == 1
>>>     /*
>>>      * OPAL CRS components need the opportunity to take action before a
>>>      * process is forked.
>>>      * Needs access to:
>>>      *   - Environment
>>>      *   - Rank/ORTE Name
>>>      *   - Binary to exec
>>>      */
>>>     if( NULL != opal_crs.crs_prelaunch ) {
>>>         if( OPAL_SUCCESS != (rc =
>>>                  opal_crs.crs_prelaunch(child->name->vpid,

Re: [OMPI devel] Add child to another parent.

2011-04-06 Thread Ralph Castain
Looks like the lifeline is still pointing to its old daemon instead of being 
updated to the new one. Look in orte/mca/routed/cm/routed_cm.c - there should be 
something in there that updates the lifeline during restart from a checkpoint.
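
The shape of the fix is roughly the following (a rough sketch only -- the real 
variable and function names live in routed_cm.c and may differ; I'm assuming 
ORTE_PROC_MY_DAEMON already points at the proc's new daemon after the restart):

/* Rough sketch of repointing the lifeline at the new local daemon after a
   restart from checkpoint. Illustrative only; see routed_cm.c for the real
   code. my_lifeline and update_lifeline_after_restart are made-up names. */
#include "orte/runtime/orte_globals.h"   /* ORTE_PROC_MY_DAEMON (assumed) */

static orte_process_name_t my_lifeline;

static void update_lifeline_after_restart(void)
{
    /* The restarted proc should treat its current daemon ([[...,0],2]) as
       its lifeline rather than the daemon it was originally launched
       under ([[...,0],1]). */
    my_lifeline.jobid = ORTE_PROC_MY_DAEMON->jobid;
    my_lifeline.vpid  = ORTE_PROC_MY_DAEMON->vpid;
}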



[OMPI devel] Open MPI Developers Meeting Agenda

2011-04-06 Thread Joshua Hursey
Reminder:
  If you are interested in attending the May 3-5 Open MPI Developers Meeting at 
ORNL, please let Rich (rlgraham -at- ornl -dot- gov) and me know as soon as possible 
so we can start the paperwork. This is of particular importance for non-US 
citizens, since the paperwork takes considerably more time.


The meeting will be three full days (May 3-5) on the ORNL campus. I intend to 
set up a teleconf for some/most of the sessions for those who cannot attend in 
person. Once we have the agenda topics on the table, we can start negotiating 
time allotments.

Below are the agenda items that I have gathered so far (in no particular order):
 - MPI 2.2 implementation tickets
 - MPI 3.0 implementation planning
 - ORNL: Hierarchical Collectives discussion
 - Runtime integration discussion
 - New Process Affinity functionality
 - Update on ORTE development
 - Fault tolerance feature development and integration
   (C/R, logging, replication, FT-MPI, MPI 3.0, message reliability, ...)

Other topics I thought of that folks might want to discuss (is there anyone 
who wants to include these and lead their discussion?):
 - Threading design
 - Performance tuning (Point-to-point and/or Collective)
 - Testing infrastructure (MTT)


Keep sending agenda items to the list (or to me directly if you would rather). I 
hope to have the agenda sketched out by the teleconf on 4/12 so we can fine-tune 
it on the call.

Thanks,
Josh


Joshua Hursey
Postdoctoral Research Associate
Oak Ridge National Laboratory
http://users.nccs.gov/~jjhursey