Yes, that "T" state is quite puzzling.  You didn't attach a debugger to the 
ssh process or send it a signal, did you?
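
One way to tell those two cases apart on Linux is to look at /proc/<pid>/status: the State field shows "T (stopped)" for a job-control stop and "t (tracing stop)" under a tracer, and TracerPid is non-zero when a debugger or strace is attached. Below is a minimal sketch, assuming a Linux host with /proc mounted; the helper names parse_status and describe are made up for illustration and are not part of Open MPI.

```python
#!/usr/bin/env python3
"""Report why a process is stopped, using Linux /proc/<pid>/status."""
import sys


def parse_status(text):
    """Turn the 'Key:<tab>value' lines of /proc/<pid>/status into a dict."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip()
    return fields


def describe(fields):
    """Summarize the State and TracerPid fields in one line."""
    state = fields.get("State", "?")
    tracer = int(fields.get("TracerPid", "0"))
    if tracer:
        # A non-zero TracerPid means some process is ptrace-attached.
        return f"state {state}; traced by PID {tracer} (debugger or strace attached)"
    if state.startswith("T"):
        # Upper-case T is a job-control stop, not a tracing stop.
        return (f"state {state}; stopped by a signal such as "
                "SIGSTOP, SIGTSTP, SIGTTIN or SIGTTOU")
    return f"state {state}; not stopped"


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage: pass the PID of the process to inspect as the only argument.
    with open(f"/proc/{sys.argv[1]}/status") as fh:
        print(describe(parse_status(fh.read())))
```

Saved under a hypothetical name such as procstate.py, this would be run on b09-30 with the PID of the stopped ssh as its argument (361719 in the listing quoted below). It only reports what the kernel records; it does not say which signal actually arrived.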

(We had a similar situation on the devel list recently, but it only happened 
with a very old version of Slurm.  We concluded that it was a Slurm bug that 
has since been fixed.  And just to be sure, I double-checked: the srun that 
hangs in that case is *not* in the "T" state -- it's in the "S" state, which 
appears to be a normal state.)


> On May 12, 2018, at 4:56 AM, Gilles Gouaillardet 
> <gilles.gouaillar...@gmail.com> wrote:
> 
> Max,
> 
> The 'T' state of the ssh process is very puzzling.
> 
> Can you try to run
> /usr/bin/ssh -x b09-32 orted
> on b09-30 and see what happens?
> (It should fail with an error message instead of hanging.)
> 
> To check that there is no firewall in the way, can you run
> iptables -L
> Also, is SELinux enabled? There could be some rules that prevent
> ssh from working as expected.
> 
> 
> Cheers,
> 
> Gilles
> 
> On Sat, May 12, 2018 at 7:38 AM, Max Mellette <wmell...@ucsd.edu> wrote:
>> Hi Jeff,
>> 
>> Thanks for the reply. FYI since I originally posted this, I uninstalled
>> OpenMPI 3.0.1 and installed 3.1.0, but I'm still experiencing the same
>> problem.
>> 
>> When I run the command without the `--mca plm_base_verbose 100` flag, it
>> hangs indefinitely with no output.
>> 
>> As far as I can tell, these are the additional processes running on each
>> machine while mpirun is hanging (printed using `ps -aux | less`):
>> 
>> On executing host b09-30:
>> 
>> user     361714  0.4  0.0 293016  8444 pts/0    Sl+  15:10   0:00 mpirun --host b09-30,b09-32 hostname
>> user     361719  0.0  0.0  37092  5112 pts/0    T    15:10   0:00
>> /usr/bin/ssh -x b09-32  orted -mca ess "env" -mca ess_base_jobid "638517248"
>> -mca ess_base_vpid 1 -mca ess_base_num_procs "2" -mca orte_node_regex
>> "b[2:9]-30,b[2:9]-32@0(2)" -mca orte_hnp_uri
>> "638517248.0;tcp://169.228.66.102,10.1.100.30:55090" -mca plm "rsh" -mca
>> pmix "^s1,s2,cray,isolated"
>> 
>> On remote host b09-32:
>> 
>> root     175273  0.0  0.0  61752  5824 ?        Ss   15:10   0:00 sshd: [accepted]
>> sshd     175274  0.0  0.0  61752   708 ?        S    15:10   0:00 sshd: [net]
>> 
>> I only see orted showing up in the ssh flags on b09-30. Any ideas what I
>> should try next?
>> 
>> Thanks,
>> Max
>> 
>> 
>> 
>> _______________________________________________
>> users mailing list
>> users@lists.open-mpi.org
>> https://lists.open-mpi.org/mailman/listinfo/users


-- 
Jeff Squyres
jsquy...@cisco.com

