Re: [OMPI users] mpirun issue using more than 64 hosts

2018-02-12 Thread Adam Sylvester
Ah... thanks Gilles. That makes sense. I was stuck thinking there was an ssh problem on rank 0; it never occurred to me that mpirun was doing something clever there and that those ssh errors were from a different instance altogether. It's no problem to put my private key on all instances - I'll go

Re: [OMPI users] mpirun issue using more than 64 hosts

2018-02-12 Thread Gilles Gouaillardet
Adam, by default, when more than 64 hosts are involved, mpirun uses a tree spawn to remotely launch the orted daemons, so some daemons are started over ssh from other compute nodes rather than from the node running mpirun. That means you have two options here: - allow all compute nodes to ssh to each other (e.g. the ssh public key of *all* the nodes should be in *all* the authorized_keys -
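
To illustrate why a tree spawn needs ssh access between compute nodes, here is a toy model in Python. It is only a sketch: the function name `launch_plan`, the host names, and the fan-out of 64 (borrowed from the "more than 64 hosts" threshold above) are assumptions for illustration, and Open MPI's real launch topology may differ.

```python
def launch_plan(hosts, fanout=64):
    """Toy k-ary tree: map each launching host to the hosts it starts over ssh.

    hosts[0] is the node running mpirun. This is NOT Open MPI's actual
    algorithm, only an illustration of the idea of a tree spawn.
    """
    plan = {}
    for i, host in enumerate(hosts[1:], start=1):
        # In a k-ary tree stored as a list, the parent of index i is (i - 1) // k.
        parent = hosts[(i - 1) // fanout]
        plan.setdefault(parent, []).append(host)
    return plan


# Hypothetical host names: node000 runs mpirun, 69 more compute nodes follow.
hosts = [f"node{i:03d}" for i in range(70)]
plan = launch_plan(hosts)

for launcher, children in plan.items():
    print(f"{launcher} starts {len(children)} daemon(s) over ssh")
```

In this model node000 starts the first 64 remote daemons and node001 starts the rest, so node001 must itself be able to ssh to other nodes. That matches the symptom in this thread: ssh errors that come from an instance other than the one running mpirun.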

[OMPI users] mpirun issue using more than 64 hosts

2018-02-12 Thread Adam Sylvester
I'm running OpenMPI 2.1.0, built from source, on RHEL 7. I'm using the default ssh-based launcher, where I have my private ssh key on rank 0 and the associated public key on all ranks. I create a hosts file with a list of unique IPs, with the host that I'm running mpirun from on the first line, a

Re: [OMPI users] tcp_peer_send_blocking: send() to socket 9 failed: Broken pipe (32)

2018-02-12 Thread Gilles Gouaillardet
William, on a typical HPC cluster, the internal interface is not protected by the firewall. If this is eth0, then you can mpirun --mca oob_tcp_if_include eth0 --mca btl_tcp_if_include eth0 ... If only a small range of ports is available, then you will also need to use the oob_tcp_dynamic_ipv4_po
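
As a small sketch of the command above, the Python helper below assembles the mpirun argument list with both MCA parameters set to the same interface. The helper name `mpirun_command` and the program path `./my_app` are hypothetical; only the two `--mca` parameters and the interface name come from the message.

```python
import shlex


def mpirun_command(interface, program, extra_mca=None):
    """Build an mpirun argument list that restricts Open MPI to one interface.

    oob_tcp_if_include and btl_tcp_if_include are the two MCA parameters
    named in the message; extra_mca is an optional dict of further
    parameter names and values supplied by the caller.
    """
    args = [
        "mpirun",
        "--mca", "oob_tcp_if_include", interface,
        "--mca", "btl_tcp_if_include", interface,
    ]
    for name, value in (extra_mca or {}).items():
        args += ["--mca", name, str(value)]
    args.append(program)
    return args


# "./my_app" is a placeholder for the real MPI program.
cmd = mpirun_command("eth0", "./my_app")
print(shlex.join(cmd))
```

Keeping the arguments as a list avoids shell quoting problems if the command is later passed to `subprocess.run`.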

Re: [OMPI users] tcp_peer_send_blocking: send() to socket 9 failed: Broken pipe (32)

2018-02-12 Thread William Mitchell
Thanks, George. My sysadmin now says he is pretty sure it is the firewall, but that "isn't going to change" so we need to find a solution.
On 9 February 2018 at 16:58, George Bosilca wrote:
> What are the settings of the firewall on your 2 nodes?
>
> George.
>
> On Fri, Feb 9, 2018 at 3: