Hello All,

I'm trying to set up OpenMPI 3.0.1 on a pair of Linux machines, but mpirun
hangs when I try to execute a simple command across the two machines:

$ mpirun --host b09-30,b09-32 hostname

I'd appreciate any assistance with this problem. I'm a new MPI user and
suspect I'm just missing something, but have checked the documentation at
www.open-mpi.org and various forums and have not been able to figure it out.

Thanks,
Max

Here are some configuration details:

- Both machines running Ubuntu 16.04
- b09-30 is the local host
- b09-32 is the remote host
- Installed OpenMPI 3.0.1 from .tar on both machines in /usr/local
(following instructions from www.open-mpi.org)
- Configured PATH and LD_LIBRARY_PATH on both machines
- Can ssh without prompt between machines
- UFW firewall is disabled on both machines
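
The PATH, LD_LIBRARY_PATH and ssh items above can be checked from the point of view of a non-interactive ssh session, which is how the rsh launcher starts orted on the remote host (see the "plm:rsh: final template argv" line in the output below). This is only a minimal sketch: the function name check_remote and the SSH_CMD override are hypothetical names introduced here, and BatchMode=yes simply makes ssh fail instead of prompting.

```shell
#!/bin/sh
# check_remote HOST: show what a non-interactive ssh session on HOST sees.
# SSH_CMD is a hypothetical override (for example SSH_CMD=echo for a dry run).
check_remote() {
    host="$1"
    # The single quotes keep the variables unexpanded locally,
    # so they are evaluated by the shell on the remote host.
    ${SSH_CMD:-ssh -o BatchMode=yes} "$host" \
        'command -v orted; echo "PATH=$PATH"; echo "LD_LIBRARY_PATH=$LD_LIBRARY_PATH"'
}
```

Running `check_remote b09-32` from b09-30 should print the path to orted plus the two variables if the remote environment is set up as described; if the first line is missing, the non-interactive session probably does not see /usr/local/bin.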

Here's some terminal output, including the command above run with
--mca plm_base_verbose 100 set:

user@b09-30:~$ sudo ufw status
Status: inactive
user@b09-30:~$ cat .bashrc
export PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/snap/bin
export LD_LIBRARY_PATH=/usr/local/lib
user@b09-30:~$ ssh b09-32 hostname
b09-32
user@b09-30:~$ mpirun --host b09-30 hostname
b09-30
user@b09-30:~$ mpirun --host b09-30,b09-32 --mca plm_base_verbose 100 hostname
[b09-30:76987] mca: base: components_register: registering framework plm components
[b09-30:76987] mca: base: components_register: found loaded component slurm
[b09-30:76987] mca: base: components_register: component slurm register function successful
[b09-30:76987] mca: base: components_register: found loaded component rsh
[b09-30:76987] mca: base: components_register: component rsh register function successful
[b09-30:76987] mca: base: components_register: found loaded component isolated
[b09-30:76987] mca: base: components_register: component isolated has no register or open function
[b09-30:76987] mca: base: components_open: opening plm components
[b09-30:76987] mca: base: components_open: found loaded component slurm
[b09-30:76987] mca: base: components_open: component slurm open function successful
[b09-30:76987] mca: base: components_open: found loaded component rsh
[b09-30:76987] mca: base: components_open: component rsh open function successful
[b09-30:76987] mca: base: components_open: found loaded component isolated
[b09-30:76987] mca: base: components_open: component isolated open function successful
[b09-30:76987] mca:base:select: Auto-selecting plm components
[b09-30:76987] mca:base:select:(  plm) Querying component [slurm]
[b09-30:76987] mca:base:select:(  plm) Querying component [rsh]
[b09-30:76987] mca:base:select:(  plm) Query of component [rsh] set priority to 10
[b09-30:76987] mca:base:select:(  plm) Querying component [isolated]
[b09-30:76987] mca:base:select:(  plm) Query of component [isolated] set priority to 0
[b09-30:76987] mca:base:select:(  plm) Selected component [rsh]
[b09-30:76987] mca: base: close: component slurm closed
[b09-30:76987] mca: base: close: unloading component slurm
[b09-30:76987] mca: base: close: component isolated closed
[b09-30:76987] mca: base: close: unloading component isolated
[b09-30:76987] [[36418,0],0] plm:rsh: final template argv:
        /usr/bin/ssh <template>  orted -mca ess "env" -mca ess_base_jobid "2386690048" -mca ess_base_vpid "<template>" -mca ess_base_num_procs "2" -mca orte_node_regex "b[2:9]-30,b[2:9]-32@0(2)" -mca orte_hnp_uri "2386690048.0;tcp://169.228.66.102,10.1.100.30:55714" --mca plm_base_verbose "100" -mca plm "rsh" -mca pmix "^s1,s2,cray,isolated"
^C[b09-30:76987] mca: base: close: component rsh closed
[b09-30:76987] mca: base: close: unloading component rsh
user@b09-30:~$

(I have to kill the process; otherwise it hangs for an indeterminate amount
of time, longer than 10 minutes.)
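
The orte_hnp_uri value in the output above advertises two addresses for b09-30 (169.228.66.102 and 10.1.100.30), and as far as I understand the remote orted has to connect back to one of them. In case it helps with diagnosis, here is a minimal sketch that pulls the addresses out of such a URI so each one can be tested from b09-32. It assumes the layout seen in this log, "<jobid>.<vpid>;tcp://<addr>,<addr>:<port>"; the function name hnp_addrs is hypothetical.

```shell
#!/bin/sh
# hnp_addrs URI: print one address per line from an orte_hnp_uri value.
# Assumes a single "tcp://" section followed by ":<port>", as in the log above.
hnp_addrs() {
    uri="$1"
    rest="${uri#*tcp://}"   # drop everything up to and including "tcp://"
    addrs="${rest%:*}"      # drop the trailing ":<port>"
    printf '%s\n' "$addrs" | tr ',' '\n'
}

hnp_addrs "2386690048.0;tcp://169.228.66.102,10.1.100.30:55714"
```

Each printed address could then be checked from the remote side, for example with `ssh b09-32 ping -c 1 <address>`, to see whether b09-32 can reach b09-30 on both networks.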