Thank you for the fix.
I was only able to try it today; I confirm it works both with the patch
and with the mca option.
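
For anyone finding this thread later, here is a rough sketch of the command with the workaround option added. It assumes the same hosts and excluded interfaces as in my original report quoted below. The variable names `hosts` and `cmd` are only for illustration, and the command is printed with echo rather than executed:

```shell
# Hosts from the original report; replace with your own user@ip list.
hosts="openmpi@10.10.1.1,openmpi@10.10.1.2,openmpi@10.10.1.3,openmpi@10.10.1.4"

# Same command as in the report, plus the suggested workaround option
# (--mca orte_keep_fqdn_hostnames true).
cmd="mpirun -np 8 --host $hosts --mca orte_keep_fqdn_hostnames true --mca oob_tcp_if_exclude lo,wlp2s0 ompi_info"

# Print the command for inspection; run it yourself once it looks right.
echo "$cmd"
```

With the patch from the PR applied, the extra --mca option should no longer be needed.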


Cheers,
Federico Reghenzani

2015-11-18 6:15 GMT+01:00 Gilles Gouaillardet <gil...@rist.or.jp>:

> Federico,
>
> i made PR #772 https://github.com/open-mpi/ompi-release/pull/772
>
> feel free to manually patch your ompi install or use the workaround i
> previously described
>
> Cheers,
>
> Gilles
>
>
> On 11/18/2015 11:31 AM, Gilles Gouaillardet wrote:
>
> Federico,
>
> thanks for the report, i will push a fix shortly
>
> meanwhile, and as a workaround, you can add the
> --mca orte_keep_fqdn_hostnames true
> to your mpirun command line when using --host user@ip
>
> Cheers,
>
> Gilles
>
> On 11/17/2015 7:19 PM, Federico Reghenzani wrote:
>
> I'm trying to execute this command:
>
>
> *mpirun -np 8 --host
> openmpi@10.10.1.1,openmpi@10.10.1.2,openmpi@10.10.1.3,openmpi@10.10.1.4
> --mca oob_tcp_if_exclude lo,wlp2s0 ompi_info*
>
> Everything goes fine if I execute the same command with only 2 nodes
> (regardless of which nodes).
>
> With 3 or more nodes I obtain:
> *ssh: connect to host 10 port 22: Invalid argument*
> followed by the error "ORTE was unable to reliably start one or more daemons."
>
> Searching with plm_base_verbose, I found:
>
> ...
> [Neptune:22627] [[53718,0],0] plm:base:setup_vm add new daemon
> [[53718,0],1]
> [Neptune:22627] [[53718,0],0] plm:base:setup_vm assigning new daemon
> [[53718,0],1] to node openmpi@10.10.1.1
> [Neptune:22627] [[53718,0],0] plm:base:setup_vm add new daemon
> [[53718,0],2]
> [Neptune:22627] [[53718,0],0] plm:base:setup_vm assigning new daemon
> [[53718,0],2] to node openmpi@10.10.1.2
> [Neptune:22627] [[53718,0],0] plm:base:setup_vm add new daemon
> [[53718,0],3]
> [Neptune:22627] [[53718,0],0] plm:base:setup_vm assigning new daemon
> [[53718,0],3] to node openmpi@10.10.1.3
> [Neptune:22627] [[53718,0],0] plm:base:setup_vm add new daemon
> [[53718,0],4]
> [Neptune:22627] [[53718,0],0] plm:base:setup_vm assigning new daemon
> [[53718,0],4] to node openmpi@10.10.1.4
> ...
> [Neptune:22627] [[53718,0],0] plm:rsh:launch daemon 0 not a child of mine
> [Neptune:22627] [[53718,0],0] plm:rsh: adding node
> openmpi@10.10.1.1 to launch list
> [Neptune:22627] [[53718,0],0] plm:rsh: adding node
> openmpi@10.10.1.2 to launch list
> [Neptune:22627] [[53718,0],0] plm:rsh:launch daemon 3 not a child of mine
> [Neptune:22627] [[53718,0],0] plm:rsh: adding node
> openmpi@10.10.1.4 to launch list
> ...
> [roaster-vm1:00593] [[53718,0],1] plm:rsh: remote spawn called
> [roaster-vm1:00593] [[53718,0],1] plm:rsh: local shell: 0 (bash)
> [roaster-vm1:00593] [[53718,0],1] plm:rsh: assuming same remote shell as
> local shell
> [roaster-vm1:00593] [[53718,0],1] plm:rsh: remote shell: 0 (bash)
> [roaster-vm1:00593] [[53718,0],1] plm:rsh: final template argv:
> /usr/bin/ssh <template>  orted --hnp-topo-sig
> 0N:1S:0L3:1L2:2L1:2C:2H:x86_64 -mca ess "env" -mca orte_ess_jobid
> "3520462848" -mca orte_ess_vpid "<template>" -mca orte_ess_num_procs "5"
> -mca orte_parent_uri "3520462848.1;tcp://10.10.1.1:35489" -mca
> orte_hnp_uri "3520462848.0;tcp://10.10.10.2:43771" --mca
> oob_tcp_if_exclude "lo,wlp2s0" --mca plm_base_verbose "100" -mca plm "rsh"
> --tree-spawn
> [roaster-vm1:00593] [[53718,0],1] plm:rsh: activating launch event
> [roaster-vm1:00593] [[53718,0],1] plm:rsh: recording launch of daemon
> [[53718,0],3]
> [roaster-vm1:00593] [[53718,0],1] plm:rsh: executing: (/usr/bin/ssh) 
> [*/usr/bin/ssh
> openmpi@10  orted* --hnp-topo-sig 0N:1S:0L3:1L2:2L1:2C:2H:x86_64 -mca ess
> "env" -mca orte_ess_jobid "3520462848" -mca orte_ess_vpid 3 -mca
> orte_ess_num_procs "5" -mca orte_parent_uri
> "3520462848.1;tcp://10.10.1.1:35489" -mca orte_hnp_uri
> "3520462848.0;tcp://10.10.10.2:43771"
> --mca oob_tcp_if_exclude "lo,wlp2s0" --mca plm_base_verbose "100" -mca plm
> "rsh" --tree-spawn]
> *ssh: connect to host 10 port 22: Invalid argument*
>
> It seems the IP address gets truncated at the first dot (openmpi@10.10.1.3
> becomes openmpi@10) during the remote spawn. Any idea?
>
> (I'm using version 1.10.0rc7)
>
>
> Cheers,
> Federico
>
> __
> Federico Reghenzani
> M.Eng. Student @ Politecnico di Milano
> Computer Science and Engineering
>
>
>
>
> _______________________________________________
> users mailing list
> us...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
> Link to this post: 
> http://www.open-mpi.org/community/lists/users/2015/11/28042.php
