Important detail first: I get this message from significantly modified
Open MPI code, so the problem exists solely due to my mistake.
Orterun on 192.168.122.90 starts orted on the remote node 192.168.122.91,
then orted figures out it has nothing to do.
If I request to start workers on the same IP, 192.168.122.90,
mpi_hello is started.
Partial log:
/usr/bin/mpirun -np 1 ... mpi_hello
#
[osv:00252] [[50738,0],0] plm:base:setup_job
[osv:00252] [[50738,0],0] plm:base:setup_vm
[osv:00252] [[50738,0],0] plm:base:setup_vm creating map
[osv:00252] [[50738,0],0] setup:vm: working unmanaged allocation
[osv:00252] [[50738,0],0] using dash_host
[osv:00252] [[50738,0],0] checking node 192.168.122.91
[osv:00252] [[50738,0],0] plm:base:setup_vm add new daemon [[50738,0],1]
[osv:00252] [[50738,0],0] plm:base:setup_vm assigning new daemon [[50738,0],1] to node 192.168.122.91
[osv:00252] [[50738,0],0] routed:binomial rank 0 parent 0 me 0 num_procs 2
[osv:00252] [[50738,0],0] routed:binomial 0 found child 1
[osv:00252] [[50738,0],0] routed:binomial rank 0 parent 0 me 1 num_procs 2
[osv:00252] [[50738,0],0] routed:binomial find children of rank 0
[osv:00252] [[50738,0],0] routed:binomial find children checking peer 1
[osv:00252] [[50738,0],0] routed:binomial find children computing tree
[osv:00252] [[50738,0],0] routed:binomial rank 1 parent 0 me 1 num_procs 2
[osv:00252] [[50738,0],0] routed:binomial find children returning found value 0
[osv:00252] [[50738,0],0]: parent 0 num_children 1
[osv:00252] [[50738,0],0]: child 1
[osv:00252] [[50738,0],0] plm:osvrest: launching vm
#
[osv:00250] [[50738,0],1] plm:osvrest: remote spawn called
[osv:00250] [[50738,0],1] routed:binomial rank 0 parent 0 me 1 num_procs 2
[osv:00250] [[50738,0],1] routed:binomial find children of rank 0
[osv:00250] [[50738,0],1] routed:binomial find children checking peer 1
[osv:00250] [[50738,0],1] routed:binomial find children computing tree
[osv:00250] [[50738,0],1] routed:binomial rank 1 parent 0 me 1 num_procs 2
[osv:00250] [[50738,0],1] routed:binomial find children returning found value 0
[osv:00250] [[50738,0],1]: parent 0 num_children 0
[osv:00250] [[50738,0],1] plm:osvrest: remote spawn - have no children!
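To sanity-check the num_children values in the log, here is a minimal sketch of how direct children can be computed in a textbook binomial tree. It is my own illustration, not the ORTE routed:binomial source; the names highest_bit() and binomial_children() are hypothetical, and I am assuming the daemon tree follows the usual scheme where a rank's children are obtained by setting one bit above its highest set bit.

```c
#include <assert.h>

/* Index of the highest set bit in rank, or -1 when rank is 0. */
static int highest_bit(int rank)
{
    int bit = -1;
    while (rank > 0) {
        rank >>= 1;
        ++bit;
    }
    return bit;
}

/*
 * Store the direct children of `rank` in a binomial tree spanning
 * ranks 0..num_procs-1 into children[] (at most max_children entries)
 * and return how many children the rank has.
 * Hypothetical helper for illustration only.
 */
static int binomial_children(int rank, int num_procs,
                             int *children, int max_children)
{
    int count = 0;

    assert(rank >= 0 && num_procs > 0 && rank < num_procs);

    /* Each child is this rank with one extra bit set above its top bit. */
    for (int i = highest_bit(rank) + 1; i < 31; ++i) {
        int peer = rank | (1 << i);
        if (peer >= num_procs) {
            break; /* peers only grow with i, so nothing further fits */
        }
        if (count < max_children) {
            children[count] = peer;
        }
        ++count;
    }
    return count;
}
```

With num_procs 2 this gives rank 0 one child (rank 1) and rank 1 no children, which agrees with "num_children 1" on daemon 0 and "num_children 0" on daemon 1 in the log above.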
In the remote_spawn() function of my plm MCA module (my plm is based on
orte/mca/plm/rsh/), the &coll.targets list has zero length. My question
is: which module(s) are responsible for filling in coll.targets?
Once I know that, I can turn on the correct MCA xyz_base_verbose level
and hopefully narrow down my problem. I have quite a hard time
guessing/finding out what the various xyz strings mean :)
Thank you, Justin