Everything is driven by the state machine: events move the job from one state 
to the next, and each state is tied to a callback function that implements it. 
If you set the MCA parameter state_base_verbose=5, you’ll see when and where 
each state gets executed.

By default, the launch_app state is handled by a function in plm/base:

https://github.com/open-mpi/ompi/blob/master/orte/mca/plm/base/plm_base_launch_support.c#L477

I suspect the problem is that your plm component isn’t activating the next step 
upon completion of launch_daemons.
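
To make the suspected failure mode concrete, here is a minimal sketch of a callback-driven state machine. It is a toy model, not ORTE code: the state names, job_t, activate(), run_job(), and the plm_activates switch are all hypothetical and only mirror the idea that each state's callback has to request the next state itself. The real machinery is event-driven and considerably more involved.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical job states, loosely named after the steps in this thread. */
typedef enum {
    ST_INIT,
    ST_LAUNCH_DAEMONS,
    ST_DAEMONS_REPORTED,
    ST_LAUNCH_APPS,
    ST_RUNNING,
    ST_COUNT
} job_state_t;

typedef struct job {
    job_state_t state;    /* state whose callback ran most recently */
    job_state_t pending;  /* next state requested by a callback */
    bool has_pending;     /* true while an activation is queued */
    bool plm_activates;   /* toy switch: does launch_daemons request the next state? */
} job_t;

typedef void (*state_cb_t)(job_t *job);

/* Queue the next state; a callback that never calls this stalls the job. */
static void activate(job_t *job, job_state_t next)
{
    assert(job != NULL && next < ST_COUNT);
    job->pending = next;
    job->has_pending = true;
}

static void cb_init(job_t *job) { activate(job, ST_LAUNCH_DAEMONS); }

static void cb_launch_daemons(job_t *job)
{
    /* A well-behaved launcher activates the next state when it is done. */
    if (job->plm_activates) {
        activate(job, ST_DAEMONS_REPORTED);
    }
    /* Otherwise nothing is queued and the machine stops here. */
}

static void cb_daemons_reported(job_t *job) { activate(job, ST_LAUNCH_APPS); }
static void cb_launch_apps(job_t *job) { activate(job, ST_RUNNING); }
static void cb_running(job_t *job) { (void)job; /* terminal state in this toy */ }

/* Each state is tied to the callback that implements it. */
static const state_cb_t callbacks[ST_COUNT] = {
    [ST_INIT] = cb_init,
    [ST_LAUNCH_DAEMONS] = cb_launch_daemons,
    [ST_DAEMONS_REPORTED] = cb_daemons_reported,
    [ST_LAUNCH_APPS] = cb_launch_apps,
    [ST_RUNNING] = cb_running,
};

static const char *names[ST_COUNT] = {
    [ST_INIT] = "init",
    [ST_LAUNCH_DAEMONS] = "launch_daemons",
    [ST_DAEMONS_REPORTED] = "daemons_reported",
    [ST_LAUNCH_APPS] = "launch_apps",
    [ST_RUNNING] = "running",
};

/* Run callbacks until no activation is pending; return the last state executed. */
job_state_t run_job(bool plm_activates, bool verbose)
{
    job_t job = { ST_INIT, ST_INIT, true, plm_activates };

    while (job.has_pending) {
        job.has_pending = false;
        job.state = job.pending;
        if (verbose) {
            printf("executing state %s\n", names[job.state]);
        }
        callbacks[job.state](&job);
    }
    return job.state;
}
```

In this toy, run_job(true, true) walks all the way to the running state, while run_job(false, true) stops after launch_daemons because that callback never queues a successor. That is the kind of stall I suspect in your component.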


> On May 3, 2017, at 8:15 AM, Justin Cinkelj <justin.cink...@xlab.si> wrote:
> 
> So "remote spawn" and children refer to orted daemons only, and I was looking 
> into the wrong modules.
> 
> Which module(s) are then responsible for sending the command to orted to start 
> the MPI application?
> Which event names should I search for?
> 
> Thank you,
> Justin
> 
> ----- Original Message -----
>> From: r...@open-mpi.org
>> To: "OpenMPI Devel" <devel@lists.open-mpi.org>
>> Sent: Wednesday, May 3, 2017 3:29:16 PM
>> Subject: Re: [OMPI devel] remote spawn - have no children
>> 
>> I should have looked more closely as you already have the routed verbose
>> output there. Everything in fact looks correct. The node with mpirun has 1
>> child, which is the daemon on the other node. The vpid=1 daemon on node 250
>> doesn’t have any children as there aren’t any more daemons in the system.
>> 
>> Note that the output has nothing to do with spawning your mpi_hello - it is
>> solely describing the startup of the daemons.
>> 
>> 
>>> On May 3, 2017, at 6:26 AM, r...@open-mpi.org wrote:
>>> 
>>> The orte routed framework does that for you - there is an API for that
>>> purpose.
>>> 
>>> 
>>>> On May 3, 2017, at 12:17 AM, Justin Cinkelj <justin.cink...@xlab.si>
>>>> wrote:
>>>> 
>>>> Important detail first: I get this message from significantly modified
>>>> Open MPI code, so the problem exists solely due to my mistake.
>>>> 
>>>> Orterun on 192.168.122.90 starts orted on remote node 192.168.122.91, then
>>>> orted figures out it has nothing to do.
>>>> If I request to start workers on the same 192.168.122.90 IP, the mpi_hello
>>>> is started.
>>>> 
>>>> Partial log:
>>>> /usr/bin/mpirun -np 1 ... mpi_hello
>>>> #
>>>> [osv:00252] [[50738,0],0] plm:base:setup_job
>>>> [osv:00252] [[50738,0],0] plm:base:setup_vm
>>>> [osv:00252] [[50738,0],0] plm:base:setup_vm creating map
>>>> [osv:00252] [[50738,0],0] setup:vm: working unmanaged allocation
>>>> [osv:00252] [[50738,0],0] using dash_host
>>>> [osv:00252] [[50738,0],0] checking node 192.168.122.91
>>>> [osv:00252] [[50738,0],0] plm:base:setup_vm add new daemon [[50738,0],1]
>>>> [osv:00252] [[50738,0],0] plm:base:setup_vm assigning new daemon
>>>> [[50738,0],1] to node 192.168.122.91
>>>> [osv:00252] [[50738,0],0] routed:binomial rank 0 parent 0 me 0 num_procs 2
>>>> [osv:00252] [[50738,0],0] routed:binomial 0 found child 1
>>>> [osv:00252] [[50738,0],0] routed:binomial rank 0 parent 0 me 1 num_procs 2
>>>> [osv:00252] [[50738,0],0] routed:binomial find children of rank 0
>>>> [osv:00252] [[50738,0],0] routed:binomial find children checking peer 1
>>>> [osv:00252] [[50738,0],0] routed:binomial find children computing tree
>>>> [osv:00252] [[50738,0],0] routed:binomial rank 1 parent 0 me 1 num_procs 2
>>>> [osv:00252] [[50738,0],0] routed:binomial find children returning found
>>>> value 0
>>>> [osv:00252] [[50738,0],0]: parent 0 num_children 1
>>>> [osv:00252] [[50738,0],0]:      child 1
>>>> [osv:00252] [[50738,0],0] plm:osvrest: launching vm
>>>> #
>>>> [osv:00250] [[50738,0],1] plm:osvrest: remote spawn called
>>>> [osv:00250] [[50738,0],1] routed:binomial rank 0 parent 0 me 1 num_procs 2
>>>> [osv:00250] [[50738,0],1] routed:binomial find children of rank 0
>>>> [osv:00250] [[50738,0],1] routed:binomial find children checking peer 1
>>>> [osv:00250] [[50738,0],1] routed:binomial find children computing tree
>>>> [osv:00250] [[50738,0],1] routed:binomial rank 1 parent 0 me 1 num_procs 2
>>>> [osv:00250] [[50738,0],1] routed:binomial find children returning found
>>>> value 0
>>>> [osv:00250] [[50738,0],1]: parent 0 num_children 0
>>>> [osv:00250] [[50738,0],1] plm:osvrest: remote spawn - have no children!
>>>> 
>>>> In the plm mca module remote_spawn() function (my plm is based on
>>>> orte/mca/plm/rsh/), the &coll.targets list has zero length. My question
>>>> is, which module(s) are responsible for filling in the coll.targets? Then
>>>> I will turn on the correct mca xyz_base_verbose level, and hopefully
>>>> narrow down my problem. I have quite a problem guessing/finding out what
>>>> various xyz strings mean :)
>>>> 
>>>> Thank you, Justin
>>>> _______________________________________________
>>>> devel mailing list
>>>> devel@lists.open-mpi.org
>>>> https://rfd.newmexicoconsortium.org/mailman/listinfo/devel
>>> 
>> 

