Yes, that "T" state is quite puzzling. You didn't attach a debugger to the ssh process or send it a signal, did you?
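For reference, here is a minimal sketch of how the first character of the STAT column in `ps` output can be read. The table is my recollection of the PROCESS STATE CODES section of the Linux procps ps(1) man page, so treat it as an assumption and check it against the man page on your system. The names `PS_STATE` and `describe_stat` are made up for this illustration.

```python
# First character of the ps STAT column, as listed (to my recollection) in the
# PROCESS STATE CODES section of the Linux procps ps(1) man page.
# Assumption: other platforms or older procps versions may differ.
PS_STATE = {
    "D": "uninterruptible sleep (usually IO)",
    "I": "idle kernel thread",
    "R": "running or runnable",
    "S": "interruptible sleep (waiting for an event)",
    "T": "stopped by job control signal",
    "t": "stopped by debugger during tracing",
    "Z": "defunct (zombie)",
}


def describe_stat(stat: str) -> str:
    """Describe a STAT value such as 'T', 'Sl+' or 'Ss'.

    Only the first character is the process state; the remaining
    characters are modifiers (session leader, multi-threaded, ...).
    """
    if not stat:
        raise ValueError("empty STAT value")
    return PS_STATE.get(stat[0], "unknown state")


if __name__ == "__main__":
    # STAT values taken from the ps output quoted in this thread.
    for stat in ("Sl+", "T", "Ss", "S"):
        print(f"{stat:4s} {describe_stat(stat)}")
```

Under that table, the `T` on the ssh process means it was stopped by a signal, while the `Sl+` on mpirun and the `Ss`/`S` on the sshd processes are ordinary sleeping states.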
(We had a similar situation on the devel list recently, but it only happened with a very old version of Slurm. We concluded that it was a Slurm bug that has since been fixed. And just to be sure, I double-checked: the srun that hangs in that case is *not* in the "T" state -- it's in the "S" state, which appears to be a normal state.)

> On May 12, 2018, at 4:56 AM, Gilles Gouaillardet <gilles.gouaillar...@gmail.com> wrote:
>
> Max,
>
> the 'T' state of the ssh process is very puzzling.
>
> can you try to run
> /usr/bin/ssh -x b09-32 orted
> on b09-30 and see what happens ?
> (it should fail with an error message, instead of hanging)
>
> In order to check there is no firewall, can you run instead
> iptables -L
> Also, is 'selinux' enabled ? there could be some rules that prevent
> 'ssh' from working as expected
>
>
> Cheers,
>
> Gilles
>
> On Sat, May 12, 2018 at 7:38 AM, Max Mellette <wmell...@ucsd.edu> wrote:
>> Hi Jeff,
>>
>> Thanks for the reply. FYI since I originally posted this, I uninstalled
>> OpenMPI 3.0.1 and installed 3.1.0, but I'm still experiencing the same
>> problem.
>>
>> When I run the command without the `--mca plm_base_verbose 100` flag, it
>> hangs indefinitely with no output.
>>
>> As far as I can tell, these are the additional processes running on each
>> machine while mpirun is hanging (printed using `ps -aux | less`):
>>
>> On executing host b09-30:
>>
>> user 361714 0.4 0.0 293016 8444 pts/0 Sl+ 15:10 0:00 mpirun
>> --host b09-30,b09-32 hostname
>> user 361719 0.0 0.0 37092 5112 pts/0 T 15:10 0:00
>> /usr/bin/ssh -x b09-32 orted -mca ess "env" -mca ess_base_jobid "638517248"
>> -mca ess_base_vpid 1 -mca ess_base_num_procs "2" -mca orte_node_regex
>> "b[2:9]-30,b[2:9]-32@0(2)" -mca orte_hnp_uri
>> "638517248.0;tcp://169.228.66.102,10.1.100.30:55090" -mca plm "rsh" -mca
>> pmix "^s1,s2,cray,isolated"
>>
>> On remote host b09-32:
>>
>> root 175273 0.0 0.0 61752 5824 ? Ss 15:10 0:00 sshd:
>> [accepted]
>> sshd 175274 0.0 0.0 61752 708 ? S 15:10 0:00 sshd:
>> [net]
>>
>> I only see orted showing up in the ssh flags on b09-30. Any ideas what I
>> should try next?
>>
>> Thanks,
>> Max
>>
>>
>>
>> _______________________________________________
>> users mailing list
>> users@lists.open-mpi.org
>> https://lists.open-mpi.org/mailman/listinfo/users

--
Jeff Squyres
jsquy...@cisco.com