Max,

The 'T' state of the ssh process (i.e. stopped, according to ps) is very puzzling.

Can you try running
/usr/bin/ssh -x b09-32 orted
on b09-30 and see what happens?
(It should fail with an error message instead of hanging.)
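
For reference, here is a minimal sketch for decoding the STAT column that ps prints, since that is where the 'T' comes from. The describe_state helper is a hypothetical name of mine, and the descriptions are paraphrased from the Linux ps(1) man page. Pass the PID of the ssh process (361719 in the listing quoted below) as the first argument.

```shell
#!/bin/sh
# Sketch: explain the first letter of a ps STAT code (Linux procps, see ps(1)).
# describe_state is a hypothetical helper name, not part of any tool.
describe_state() {
    case "$1" in
        R*) echo "running or runnable" ;;
        S*) echo "interruptible sleep" ;;
        D*) echo "uninterruptible sleep" ;;
        T*) echo "stopped by job control signal" ;;
        t*) echo "stopped by debugger during tracing" ;;
        Z*) echo "zombie" ;;
        *)  echo "unknown" ;;
    esac
}

# Optional first argument: the PID to inspect.
pid="${1:-}"
if [ -n "$pid" ]; then
    # "-o stat=" prints only the STAT column, without a header line.
    stat=$(ps -o stat= -p "$pid" | tr -d ' ')
    echo "PID $pid: $stat ($(describe_state "$stat"))"
fi
```

If this reports a stopped process, it would suggest that ssh received a stop signal (for example SIGTSTP, SIGTTIN or SIGTTOU) rather than waiting on the network, but that is only a guess at this point.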

To check that no firewall is interfering, can you also run
iptables -L
Is SELinux enabled as well? There could be some rules that prevent
ssh from working as expected.
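
Both checks can be collected in one small script, sketched below under some assumptions: it is run as root (iptables -L normally needs that), and getenforce is available only where the SELinux tools are installed. The selinux_mode function name is my own.

```shell
#!/bin/sh
# Sketch: report the SELinux mode and list the firewall rules.
# selinux_mode is a hypothetical helper name.
selinux_mode() {
    if command -v getenforce >/dev/null 2>&1; then
        # Prints Enforcing, Permissive or Disabled.
        getenforce 2>/dev/null || echo "unknown"
    else
        echo "getenforce not found"
    fi
}

echo "SELinux: $(selinux_mode)"

if command -v iptables >/dev/null 2>&1; then
    # -n prints numeric addresses and ports instead of resolving names.
    iptables -L -n 2>&1 || true
else
    echo "iptables not found"
fi
```

Running it on both b09-30 and b09-32 would show whether either side filters the connection.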


Cheers,

Gilles

On Sat, May 12, 2018 at 7:38 AM, Max Mellette <wmell...@ucsd.edu> wrote:
> Hi Jeff,
>
> Thanks for the reply. FYI since I originally posted this, I uninstalled
> OpenMPI 3.0.1 and installed 3.1.0, but I'm still experiencing the same
> problem.
>
> When I run the command without the `--mca plm_base_verbose 100` flag, it
> hangs indefinitely with no output.
>
> As far as I can tell, these are the additional processes running on each
> machine while mpirun is hanging (printed using `ps -aux | less`):
>
> On executing host b09-30:
>
> user     361714  0.4  0.0 293016  8444 pts/0    Sl+  15:10   0:00 mpirun
> --host b09-30,b09-32 hostname
> user     361719  0.0  0.0  37092  5112 pts/0    T    15:10   0:00
> /usr/bin/ssh -x b09-32  orted -mca ess "env" -mca ess_base_jobid "638517248"
> -mca ess_base_vpid 1 -mca ess_base_num_procs "2" -mca orte_node_regex
> "b[2:9]-30,b[2:9]-32@0(2)" -mca orte_hnp_uri
> "638517248.0;tcp://169.228.66.102,10.1.100.30:55090" -mca plm "rsh" -mca
> pmix "^s1,s2,cray,isolated"
>
> On remote host b09-32:
>
> root     175273  0.0  0.0  61752  5824 ?        Ss   15:10   0:00 sshd:
> [accepted]
> sshd     175274  0.0  0.0  61752   708 ?        S    15:10   0:00 sshd:
> [net]
>
> I only see orted showing up in the ssh flags on b09-30. Any ideas what I
> should try next?
>
> Thanks,
> Max
>
>
>
> _______________________________________________
> users mailing list
> users@lists.open-mpi.org
> https://lists.open-mpi.org/mailman/listinfo/users