Hello All,

I'm trying to set up Open MPI 3.0.1 on a pair of Linux machines, but mpirun hangs when I try to run a simple command across the two machines:

$ mpirun --host b09-30,b09-32 hostname

I'd appreciate any assistance with this problem. I'm a new MPI user and suspect I'm just missing something, but have checked the documentation at www.open-mpi.org and various forums and have not been able to figure it out.

Thanks,
Max

Here are some configuration details:

- Both machines running Ubuntu 16.04
- b09-30 is the local host
- b09-32 is remote host
- Installed OpenMPI 3.0.1 from .tar on both machines in /usr/local (following instructions from www.open-mpi.org)
- Configured PATH and LD_LIBRARY_PATH on both machines
- Can ssh without prompt between machines
- UFW firewall is disabled on both machines

Here's some terminal output, including running the command above with --mca plm_base_verbose 100 set:

user@b09-30:~$ sudo ufw status
Status: inactive
user@b09-30:~$ cat .bashrc
export PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/snap/bin
export LD_LIBRARY_PATH=/usr/local/lib
user@b09-30:~$ ssh b09-32 hostname
b09-32
user@b09-30:~$ mpirun --host b09-30 hostname
b09-30
user@b09-30:~$ mpirun --host b09-30,b09-32 --mca plm_base_verbose 100 hostname
[b09-30:76987] mca: base: components_register: registering framework plm components
[b09-30:76987] mca: base: components_register: found loaded component slurm
[b09-30:76987] mca: base: components_register: component slurm register function successful
[b09-30:76987] mca: base: components_register: found loaded component rsh
[b09-30:76987] mca: base: components_register: component rsh register function successful
[b09-30:76987] mca: base: components_register: found loaded component isolated
[b09-30:76987] mca: base: components_register: component isolated has no register or open function
[b09-30:76987] mca: base: components_open: opening plm components
[b09-30:76987] mca: base: components_open: found loaded component slurm
[b09-30:76987] mca: base: components_open: component slurm open function successful
[b09-30:76987] mca: base: components_open: found loaded component rsh
[b09-30:76987] mca: base: components_open: component rsh open function successful
[b09-30:76987] mca: base: components_open: found loaded component isolated
[b09-30:76987] mca: base: components_open: component isolated open function successful
[b09-30:76987] mca:base:select: Auto-selecting plm components
[b09-30:76987] mca:base:select:( plm) Querying component [slurm]
[b09-30:76987] mca:base:select:( plm) Querying component [rsh]
[b09-30:76987] mca:base:select:( plm) Query of component [rsh] set priority to 10
[b09-30:76987] mca:base:select:( plm) Querying component [isolated]
[b09-30:76987] mca:base:select:( plm) Query of component [isolated] set priority to 0
[b09-30:76987] mca:base:select:( plm) Selected component [rsh]
[b09-30:76987] mca: base: close: component slurm closed
[b09-30:76987] mca: base: close: unloading component slurm
[b09-30:76987] mca: base: close: component isolated closed
[b09-30:76987] mca: base: close: unloading component isolated
[b09-30:76987] [[36418,0],0] plm:rsh: final template argv:
    /usr/bin/ssh <template> orted -mca ess "env" -mca ess_base_jobid "2386690048" -mca ess_base_vpid "<template>" -mca ess_base_num_procs "2" -mca orte_node_regex "b[2:9]-30,b[2:9]-32@0(2)" -mca orte_hnp_uri "2386690048.0;tcp://169.228.66.102,10.1.100.30:55714" --mca plm_base_verbose "100" -mca plm "rsh" -mca pmix "^s1,s2,cray,isolated"
^C[b09-30:76987] mca: base: close: component rsh closed
[b09-30:76987] mca: base: close: unloading component rsh
user@b09-30:~$

(I have to kill the process or it will hang for an undetermined amount of time > 10 minutes.)
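
In case it helps anyone reading the log: the orte_hnp_uri value in the template argv appears to carry the addresses and port that mpirun advertised for this run (two addresses here, 169.228.66.102 and 10.1.100.30). Below is a minimal sketch for splitting that string into one "address port" pair per line. It assumes the value always has the form `jobid.vpid;tcp://addr[,addr...]:port`, as in the output above; the function name `hnp_endpoints` is hypothetical and not part of Open MPI.

```shell
# Hypothetical helper: split an orte_hnp_uri value into "address port" lines.
# Assumes the form  <jobid>.<vpid>;tcp://<addr>[,<addr>...]:<port>
hnp_endpoints() {
    uri=$1
    endpoint=${uri#*tcp://}   # drop everything up to and including "tcp://"
    addrs=${endpoint%:*}      # comma-separated address list (before the last ":")
    port=${endpoint##*:}      # port (after the last ":")

    # Emit one "address port" pair per line
    printf '%s\n' "$addrs" | tr ',' '\n' | while read -r addr; do
        echo "$addr $port"
    done
}

# Value copied from the verbose output above
hnp_endpoints '2386690048.0;tcp://169.228.66.102,10.1.100.30:55714'
```

This only lists what was advertised in that one run; the port will most likely differ on the next launch.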
_______________________________________________ users mailing list users@lists.open-mpi.org https://lists.open-mpi.org/mailman/listinfo/users