Important detail first: I get this message from significantly modified Open MPI code, so the problem is entirely due to my own mistake.

orterun on 192.168.122.90 starts orted on the remote node 192.168.122.91, then orted decides it has nothing to do. If I instead request workers on the same IP as orterun (192.168.122.90), mpi_hello is started.

Partial log:
/usr/bin/mpirun -np 1 ... mpi_hello
#
[osv:00252] [[50738,0],0] plm:base:setup_job
[osv:00252] [[50738,0],0] plm:base:setup_vm
[osv:00252] [[50738,0],0] plm:base:setup_vm creating map
[osv:00252] [[50738,0],0] setup:vm: working unmanaged allocation
[osv:00252] [[50738,0],0] using dash_host
[osv:00252] [[50738,0],0] checking node 192.168.122.91
[osv:00252] [[50738,0],0] plm:base:setup_vm add new daemon [[50738,0],1]
[osv:00252] [[50738,0],0] plm:base:setup_vm assigning new daemon [[50738,0],1] to node 192.168.122.91
[osv:00252] [[50738,0],0] routed:binomial rank 0 parent 0 me 0 num_procs 2
[osv:00252] [[50738,0],0] routed:binomial 0 found child 1
[osv:00252] [[50738,0],0] routed:binomial rank 0 parent 0 me 1 num_procs 2
[osv:00252] [[50738,0],0] routed:binomial find children of rank 0
[osv:00252] [[50738,0],0] routed:binomial find children checking peer 1
[osv:00252] [[50738,0],0] routed:binomial find children computing tree
[osv:00252] [[50738,0],0] routed:binomial rank 1 parent 0 me 1 num_procs 2
[osv:00252] [[50738,0],0] routed:binomial find children returning found value 0
[osv:00252] [[50738,0],0]: parent 0 num_children 1
[osv:00252] [[50738,0],0]:      child 1
[osv:00252] [[50738,0],0] plm:osvrest: launching vm
#
[osv:00250] [[50738,0],1] plm:osvrest: remote spawn called
[osv:00250] [[50738,0],1] routed:binomial rank 0 parent 0 me 1 num_procs 2
[osv:00250] [[50738,0],1] routed:binomial find children of rank 0
[osv:00250] [[50738,0],1] routed:binomial find children checking peer 1
[osv:00250] [[50738,0],1] routed:binomial find children computing tree
[osv:00250] [[50738,0],1] routed:binomial rank 1 parent 0 me 1 num_procs 2
[osv:00250] [[50738,0],1] routed:binomial find children returning found value 0
[osv:00250] [[50738,0],1]: parent 0 num_children 0
[osv:00250] [[50738,0],1] plm:osvrest: remote spawn - have no children!
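
For reference, the child counts in the log are what a binomial tree over two daemons would give. Below is a minimal sketch, assuming the usual bit-mask construction with rank 0 as the root. The names highest_bit() and binomial_children() are hypothetical, and this is not copied from the routed/binomial source.

```c
/* Hypothetical helper: index of the highest set bit in rank,
 * or -1 when rank is 0 (the root has no bits set). */
static int highest_bit(int rank)
{
    int bit = -1;
    while (rank > 0) {
        rank >>= 1;
        ++bit;
    }
    return bit;
}

/* Hypothetical helper: store the direct children of `rank` in a
 * binomial tree of `num_procs` nodes rooted at 0.
 * Returns the number of children written to children[]. */
static int binomial_children(int rank, int num_procs,
                             int *children, int max_children)
{
    int count = 0;
    /* A child is formed by setting one bit above rank's highest set bit. */
    for (int bit = highest_bit(rank) + 1; bit < 31; ++bit) {
        int peer = rank | (1 << bit);
        if (peer >= num_procs || count == max_children) {
            break;  /* higher bits only give larger peers */
        }
        children[count++] = peer;
    }
    return count;
}
```

With num_procs 2, this gives rank 0 the single child 1 and rank 1 no children, which matches the two "num_children" lines above.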

In the remote_spawn() function of my plm MCA module (which is based on orte/mca/plm/rsh/), the &coll.targets list has zero length. My question is: which module(s) are responsible for filling in coll.targets? Once I know, I can turn on the matching xyz_base_verbose MCA parameter and hopefully narrow down my problem. I have quite a hard time guessing or finding out what the various xyz strings mean :)

Thank you, Justin
_______________________________________________
devel mailing list
devel@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/devel
