Everything operates via the state machine: events move the job from one state to the next, and each state is tied to a callback function that implements it. If you set state_base_verbose=5 (e.g., mpirun --mca state_base_verbose 5 ...), you’ll see when and where each state gets executed.
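To make the idea concrete, here is a minimal toy sketch of a callback-driven state machine. It is not ORTE code: the state names, the job_t struct, activate(), and run_job() are all hypothetical and only mimic the pattern described above. The skip_activation flag simulates a callback that finishes its work but never activates the next state, in which case the job simply stays where it is.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical job states for illustration only (not the ORTE definitions). */
typedef enum {
    STATE_LAUNCH_DAEMONS,
    STATE_DAEMONS_REPORTED,
    STATE_LAUNCH_APPS,
    STATE_RUNNING,
    STATE_COUNT
} job_state_t;

typedef struct job {
    job_state_t state;    /* state whose callback runs next */
    int pending;          /* 1 if a state activation is waiting to be processed */
    int skip_activation;  /* simulate a callback that forgets to activate the next state */
} job_t;

typedef void (*state_cb_t)(job_t *job);

/* Post an "event": move the job to the next state and mark it as pending. */
static void activate(job_t *job, job_state_t next)
{
    assert(next < STATE_COUNT);
    job->state = next;
    job->pending = 1;
}

static void launch_daemons(job_t *job)
{
    printf("state: launch_daemons\n");
    if (!job->skip_activation) {
        activate(job, STATE_DAEMONS_REPORTED);
    }
    /* Without the activation above, no further callback is ever invoked. */
}

static void daemons_reported(job_t *job)
{
    printf("state: daemons_reported\n");
    activate(job, STATE_LAUNCH_APPS);
}

static void launch_apps(job_t *job)
{
    printf("state: launch_apps\n");
    activate(job, STATE_RUNNING);
}

static void running(job_t *job)
{
    (void)job;  /* terminal state in this sketch: nothing more to activate */
    printf("state: running\n");
}

/* One callback per state, indexed by the state value. */
static const state_cb_t callbacks[STATE_COUNT] = {
    launch_daemons, daemons_reported, launch_apps, running
};

/* Drive the job until no activation is pending; return the state it ended in. */
job_state_t run_job(int skip_activation)
{
    job_t job = { STATE_LAUNCH_DAEMONS, 1, skip_activation };
    while (job.pending) {
        job.pending = 0;
        callbacks[job.state](&job);
    }
    return job.state;
}
```

Calling run_job(0) walks through every state and ends in STATE_RUNNING, while run_job(1) stops right after the launch_daemons callback because nothing activated the next state.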
By default, the launch_app state goes to a function in the plm/base:

https://github.com/open-mpi/ompi/blob/master/orte/mca/plm/base/plm_base_launch_support.c#L477

I suspect the problem is that your plm component isn’t activating the next step upon completion of launch_daemons.

> On May 3, 2017, at 8:15 AM, Justin Cinkelj <justin.cink...@xlab.si> wrote:
>
> So "remote spawn" and children refer to orted daemons only, and I was looking
> into wrong modules.
>
> Which module(s) are then responsible to send command to orted to start mpi
> application?
> Which event names should I search for?
>
> Thank you,
> Justin
>
> ----- Original Message -----
>> From: r...@open-mpi.org
>> To: "OpenMPI Devel" <devel@lists.open-mpi.org>
>> Sent: Wednesday, May 3, 2017 3:29:16 PM
>> Subject: Re: [OMPI devel] remote spawn - have no children
>>
>> I should have looked more closely as you already have the routed verbose
>> output there. Everything in fact looks correct. The node with mpirun has 1
>> child, which is the daemon on the other node. The vpid=1 daemon on node 250
>> doesn’t have any children as there aren’t any more daemons in the system.
>>
>> Note that the output has nothing to do with spawning your mpi_hello - it is
>> solely describing the startup of the daemons.
>>
>>
>>> On May 3, 2017, at 6:26 AM, r...@open-mpi.org wrote:
>>>
>>> The orte routed framework does that for you - there is an API for that
>>> purpose.
>>>
>>>
>>>> On May 3, 2017, at 12:17 AM, Justin Cinkelj <justin.cink...@xlab.si>
>>>> wrote:
>>>>
>>>> Important detail first: I get this message from significantly modified
>>>> Open MPI code, so problem exists solely due to my mistake.
>>>>
>>>> Orterun on 192.168.122.90 starts orted on remote node 192.168.122.91, then
>>>> orted figures out it has nothing to do.
>>>> If I request to start workers on the same 192.168.122.90 IP, the mpi_hello
>>>> is started.
>>>>
>>>> Partial log:
>>>> /usr/bin/mpirun -np 1 ... mpi_hello
>>>> #
>>>> [osv:00252] [[50738,0],0] plm:base:setup_job
>>>> [osv:00252] [[50738,0],0] plm:base:setup_vm
>>>> [osv:00252] [[50738,0],0] plm:base:setup_vm creating map
>>>> [osv:00252] [[50738,0],0] setup:vm: working unmanaged allocation
>>>> [osv:00252] [[50738,0],0] using dash_host
>>>> [osv:00252] [[50738,0],0] checking node 192.168.122.91
>>>> [osv:00252] [[50738,0],0] plm:base:setup_vm add new daemon [[50738,0],1]
>>>> [osv:00252] [[50738,0],0] plm:base:setup_vm assigning new daemon
>>>> [[50738,0],1] to node 192.168.122.91
>>>> [osv:00252] [[50738,0],0] routed:binomial rank 0 parent 0 me 0 num_procs 2
>>>> [osv:00252] [[50738,0],0] routed:binomial 0 found child 1
>>>> [osv:00252] [[50738,0],0] routed:binomial rank 0 parent 0 me 1 num_procs 2
>>>> [osv:00252] [[50738,0],0] routed:binomial find children of rank 0
>>>> [osv:00252] [[50738,0],0] routed:binomial find children checking peer 1
>>>> [osv:00252] [[50738,0],0] routed:binomial find children computing tree
>>>> [osv:00252] [[50738,0],0] routed:binomial rank 1 parent 0 me 1 num_procs 2
>>>> [osv:00252] [[50738,0],0] routed:binomial find children returning found
>>>> value 0
>>>> [osv:00252] [[50738,0],0]: parent 0 num_children 1
>>>> [osv:00252] [[50738,0],0]: child 1
>>>> [osv:00252] [[50738,0],0] plm:osvrest: launching vm
>>>> #
>>>> [osv:00250] [[50738,0],1] plm:osvrest: remote spawn called
>>>> [osv:00250] [[50738,0],1] routed:binomial rank 0 parent 0 me 1 num_procs 2
>>>> [osv:00250] [[50738,0],1] routed:binomial find children of rank 0
>>>> [osv:00250] [[50738,0],1] routed:binomial find children checking peer 1
>>>> [osv:00250] [[50738,0],1] routed:binomial find children computing tree
>>>> [osv:00250] [[50738,0],1] routed:binomial rank 1 parent 0 me 1 num_procs 2
>>>> [osv:00250] [[50738,0],1] routed:binomial find children returning found
>>>> value 0
>>>> [osv:00250] [[50738,0],1]: parent 0 num_children 0
>>>> [osv:00250] [[50738,0],1] plm:osvrest: remote spawn - have no children!
>>>>
>>>> In the plm mca module remote_spawn() function (my plm is based on
>>>> orte/mca/plm/rsh/), the &coll.targets list has zero length. My question
>>>> is, which module(s) are responsible for filling in the coll.targets? Then
>>>> I will turn on the correct mca xyz_base_verbose level, and hopefully
>>>> narrow down my problem. I have quite a problem guessing/finding out what
>>>> various xyz strings mean :)
>>>>
>>>> Thank you, Justin
>>>> _______________________________________________
>>>> devel mailing list
>>>> devel@lists.open-mpi.org
>>>> https://rfd.newmexicoconsortium.org/mailman/listinfo/devel