Thank you for the fix. I was only able to try it today, and I can confirm that it works both with the patch and with the MCA option.
Cheers,
Federico Reghenzani

2015-11-18 6:15 GMT+01:00 Gilles Gouaillardet <gil...@rist.or.jp>:
> Federico,
>
> I made PR #772: https://github.com/open-mpi/ompi-release/pull/772
>
> Feel free to manually patch your Open MPI install or use the workaround I
> described previously.
>
> Cheers,
>
> Gilles
>
>
> On 11/18/2015 11:31 AM, Gilles Gouaillardet wrote:
>
> Federico,
>
> Thanks for the report, I will push a fix shortly.
>
> Meanwhile, as a workaround, you can add
> --mca orte_keep_fqdn_hostnames true
> to your mpirun command line when using --host user@ip.
>
> Cheers,
>
> Gilles
>
> On 11/17/2015 7:19 PM, Federico Reghenzani wrote:
>
> I'm trying to execute this command:
>
> mpirun -np 8 --host openmpi@10.10.1.1,openmpi@10.10.1.2,openmpi@10.10.1.3,openmpi@10.10.1.4 --mca oob_tcp_if_exclude lo,wlp2s0 ompi_info
>
> Everything works fine if I execute the same command with only 2 nodes
> (regardless of which nodes).
>
> With 3 or more nodes I obtain:
> ssh: connect to host 10 port 22: Invalid argument
> followed by the "ORTE was unable to reliably start one or more daemons." error.
>
> Searching with plm_base_verbose, I found:
>
> ...
> [Neptune:22627] [[53718,0],0] plm:base:setup_vm add new daemon [[53718,0],1]
> [Neptune:22627] [[53718,0],0] plm:base:setup_vm assigning new daemon [[53718,0],1] to node openmpi@10.10.1.1
> [Neptune:22627] [[53718,0],0] plm:base:setup_vm add new daemon [[53718,0],2]
> [Neptune:22627] [[53718,0],0] plm:base:setup_vm assigning new daemon [[53718,0],2] to node openmpi@10.10.1.2
> [Neptune:22627] [[53718,0],0] plm:base:setup_vm add new daemon [[53718,0],3]
> [Neptune:22627] [[53718,0],0] plm:base:setup_vm assigning new daemon [[53718,0],3] to node openmpi@10.10.1.3
> [Neptune:22627] [[53718,0],0] plm:base:setup_vm add new daemon [[53718,0],4]
> [Neptune:22627] [[53718,0],0] plm:base:setup_vm assigning new daemon [[53718,0],4] to node openmpi@10.10.1.4
> ...
> [Neptune:22627] [[53718,0],0] plm:rsh:launch daemon 0 not a child of mine
> [Neptune:22627] [[53718,0],0] plm:rsh: adding node openmpi@10.10.1.1 to launch list
> [Neptune:22627] [[53718,0],0] plm:rsh: adding node openmpi@10.10.1.2 to launch list
> [Neptune:22627] [[53718,0],0] plm:rsh:launch daemon 3 not a child of mine
> [Neptune:22627] [[53718,0],0] plm:rsh: adding node openmpi@10.10.1.4 to launch list
> ...
> [roaster-vm1:00593] [[53718,0],1] plm:rsh: remote spawn called
> [roaster-vm1:00593] [[53718,0],1] plm:rsh: local shell: 0 (bash)
> [roaster-vm1:00593] [[53718,0],1] plm:rsh: assuming same remote shell as local shell
> [roaster-vm1:00593] [[53718,0],1] plm:rsh: remote shell: 0 (bash)
> [roaster-vm1:00593] [[53718,0],1] plm:rsh: final template argv:
> /usr/bin/ssh <template> orted --hnp-topo-sig 0N:1S:0L3:1L2:2L1:2C:2H:x86_64 -mca ess "env" -mca orte_ess_jobid "3520462848" -mca orte_ess_vpid "<template>" -mca orte_ess_num_procs "5" -mca orte_parent_uri "3520462848.1;tcp://10.10.1.1:35489" -mca orte_hnp_uri "3520462848.0;tcp://10.10.10.2:43771" --mca oob_tcp_if_exclude "lo,wlp2s0" --mca plm_base_verbose "100" -mca plm "rsh" --tree-spawn
> [roaster-vm1:00593] [[53718,0],1] plm:rsh: activating launch event
> [roaster-vm1:00593] [[53718,0],1] plm:rsh: recording launch of daemon [[53718,0],3]
> [roaster-vm1:00593] [[53718,0],1] plm:rsh: executing: (/usr/bin/ssh) [/usr/bin/ssh openmpi@10 orted --hnp-topo-sig 0N:1S:0L3:1L2:2L1:2C:2H:x86_64 -mca ess "env" -mca orte_ess_jobid "3520462848" -mca orte_ess_vpid 3 -mca orte_ess_num_procs "5" -mca orte_parent_uri "3520462848.1;tcp://10.10.1.1:35489" -mca orte_hnp_uri "3520462848.0;tcp://10.10.10.2:43771" --mca oob_tcp_if_exclude "lo,wlp2s0" --mca plm_base_verbose "100" -mca plm "rsh" --tree-spawn]
> ssh: connect to host 10 port 22: Invalid argument
>
> It seems that the IP address gets corrupted during the remote spawn: the ssh
> target for daemon 3 is "openmpi@10" instead of "openmpi@10.10.1.3". Any idea?
>
> (I'm using version 1.10.0rc7.)
>
>
> Cheers,
> Federico
>
> __
> Federico Reghenzani
> M.Eng. Student @ Politecnico di Milano
> Computer Science and Engineering
>
>
> _______________________________________________
> users mailing list
> us...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
> Link to this post: http://www.open-mpi.org/community/lists/users/2015/11/28042.php
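The verbose log above hints at why two nodes work while three or more fail: mpirun (vpid 0) adds daemons 1, 2 and 4 to its own launch list but reports daemon 3 as "not a child of mine", and daemon 1 then starts daemon 3 itself with --tree-spawn. The Python sketch below reproduces that parent/child pattern. It rests on an assumption that has not been checked against the ORTE source: that a daemon's parent is its vpid with the highest set bit cleared. The helper names `parent`, `launch_plan` and `needs_tree_spawn` are hypothetical.

```python
# Hypothetical model of the launch tree seen in the plm_base_verbose output.
# ASSUMPTION: parent(vpid) = vpid with its highest set bit cleared. This
# matches the log above but is not taken from the ORTE source code.
from __future__ import annotations


def parent(vpid: int) -> int:
    """Return the assumed parent daemon of `vpid` (vpid 0 is mpirun, the HNP)."""
    if vpid <= 0:
        raise ValueError("the HNP has no parent")
    # Clear the most significant set bit, e.g. 3 (0b11) -> 1, 4 (0b100) -> 0.
    return vpid & ~(1 << (vpid.bit_length() - 1))


def launch_plan(num_nodes: int) -> dict[int, list[int]]:
    """Map each launching daemon to the daemons it spawns.

    Remote nodes get vpids 1..num_nodes; vpid 0 is mpirun itself.
    """
    plan: dict[int, list[int]] = {}
    for vpid in range(1, num_nodes + 1):
        plan.setdefault(parent(vpid), []).append(vpid)
    return plan


def needs_tree_spawn(num_nodes: int) -> bool:
    """True if some daemon must be started by another daemon rather than by mpirun."""
    return any(launcher != 0 for launcher in launch_plan(num_nodes))


if __name__ == "__main__":
    for n in (2, 3, 4):
        print(n, launch_plan(n), needs_tree_spawn(n))
    # 2 {0: [1, 2]} False
    # 3 {0: [1, 2], 1: [3]} True
    # 4 {0: [1, 2, 4], 1: [3]} True
```

Under this model, with 2 remote nodes every daemon is launched directly by mpirun, so the user@ip string is never handed to a remote daemon; from 3 nodes on, daemon 3 is spawned by daemon 1, which is exactly where the log shows the broken ssh target.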
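One plausible reading of the "openmpi@10" target, inferred from the log and from the orte_keep_fqdn_hostnames workaround rather than from the patch in PR #772: unless fully qualified names are kept, a host name is cut at its first dot when it is not recognized as an IP address, and "openmpi@10.10.1.3" fails that IP check because of its user@ prefix. The sketch below models the symptom, the effect of the workaround, and one possible fix (splitting off the user part before the check). The functions `shorten_host_buggy` and `shorten_host_fixed` are hypothetical and do not mirror ORTE's C code.

```python
# Hypothetical model of host-name shortening; not ORTE's actual implementation.
import ipaddress


def _is_ip(text: str) -> bool:
    """Return True if `text` parses as an IPv4 or IPv6 address."""
    try:
        ipaddress.ip_address(text)
        return True
    except ValueError:
        return False


def shorten_host_buggy(host: str, keep_fqdn: bool = False) -> str:
    """Assumed buggy behaviour: the IP check sees 'user@ip' and fails."""
    if keep_fqdn or _is_ip(host):
        return host
    # Not recognized as an IP, so the "domain" after the first dot is dropped.
    return host.split(".", 1)[0]


def shorten_host_fixed(host: str, keep_fqdn: bool = False) -> str:
    """Possible fix: split off an optional 'user@' before testing for an IP."""
    user, sep, name = host.rpartition("@")  # ("", "", host) when there is no '@'
    if keep_fqdn or _is_ip(name):
        return host
    return user + sep + name.split(".", 1)[0]


if __name__ == "__main__":
    target = "openmpi@10.10.1.3"
    print(shorten_host_buggy(target))                  # openmpi@10
    print(shorten_host_buggy(target, keep_fqdn=True))  # openmpi@10.10.1.3
    print(shorten_host_fixed(target))                  # openmpi@10.10.1.3
```

In this model, passing keep_fqdn=True plays the role of --mca orte_keep_fqdn_hostnames true: it bypasses the shortening entirely, which is why the workaround helps even without the patch.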