Re: [OMPI users] Open MPI over RoCE using breakout cable and switch

2017-01-20 Thread Howard Pritchard
Hi Brendan

I doubt this kind of config has gotten any testing with OMPI.  Could you
rerun with

--mca btl_base_verbose 100

added to the command line and post the output to the list?
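
For example, reusing the options from your command below (just a sketch, with
your hostfile and benchmark path unchanged):

mpirun --mca btl openib,self,sm \
       --mca btl_openib_receive_queues P,65536,120,64,32 \
       --mca btl_openib_cpc_include rdmacm \
       --mca btl_base_verbose 100 \
       -hostfile mpi-hosts-ce /usr/local/bin/IMB-MPI1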

Howard


Brendan Myers  wrote on Fri, Jan 20, 2017 at 15:04:

> Hello,
>
> I am attempting to get Open MPI to run over 2 nodes using a switch and a
> single breakout cable with this design:
>
> (100GbE)QSFP <-> 2x (50GbE)QSFP
>
>
>
> Hardware Layout:
>
> Breakout cable module A connects to switch (100GbE)
>
> Breakout cable module B1 connects to node 1 RoCE NIC (50GbE)
>
> Breakout cable module B2 connects to node 2 RoCE NIC (50GbE)
>
> Switch is Mellanox SN 2700 100GbE RoCE switch
>
>
>
> · I am able to pass RDMA traffic between the nodes with perftest
> (ib_write_bw) when using the breakout cable as the interconnect (IC) from
> both nodes to the switch.
>
> · When attempting to run a job using the breakout cable as the IC,
> Open MPI aborts with errors about failing to initialize the OpenFabrics
> device.
>
> · If I replace the breakout cable with 2 standard QSFP cables, the
> Open MPI job completes correctly.
>
>
>
>
>
> This is the command I use; it works unless I attempt a run with the
> breakout cable used as the IC:
>
> mpirun --mca btl openib,self,sm --mca btl_openib_receive_queues
> P,65536,120,64,32 --mca btl_openib_cpc_include rdmacm -hostfile
> mpi-hosts-ce /usr/local/bin/IMB-MPI1
>
>
>
> If anyone has any idea as to why using a breakout cable is causing my jobs
> to fail, please let me know.
>
>
>
> Thank you,
>
>
>
> Brendan T. W. Myers
>
> brendan.my...@soft-forge.com
>
> Software Forge Inc
>
>
___
users mailing list
users@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/users

[OMPI users] Open MPI over RoCE using breakout cable and switch

2017-01-20 Thread Brendan Myers
Hello,

I am attempting to get Open MPI to run over 2 nodes using a switch and a
single breakout cable with this design:

(100GbE)QSFP <-> 2x (50GbE)QSFP

 

Hardware Layout:

Breakout cable module A connects to switch (100GbE)

Breakout cable module B1 connects to node 1 RoCE NIC (50GbE)

Breakout cable module B2 connects to node 2 RoCE NIC (50GbE)

Switch is Mellanox SN 2700 100GbE RoCE switch

 

* I am able to pass RDMA traffic between the nodes with perftest
(ib_write_bw) when using the breakout cable as the interconnect (IC) from
both nodes to the switch.

* When attempting to run a job using the breakout cable as the IC, Open MPI
aborts with errors about failing to initialize the OpenFabrics device.

* If I replace the breakout cable with 2 standard QSFP cables, the Open MPI
job completes correctly.

 

 

This is the command I use; it works unless I attempt a run with the breakout
cable used as the IC:

mpirun --mca btl openib,self,sm --mca btl_openib_receive_queues
P,65536,120,64,32 --mca btl_openib_cpc_include rdmacm  -hostfile
mpi-hosts-ce /usr/local/bin/IMB-MPI1
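
In case the MCA options need unpacking, here is the same command annotated
(the receive-queue field breakdown is my reading of the openib BTL
convention, so treat it as a sketch rather than gospel):

# Per-peer (P) receive queue: 65536-byte buffers, 120 of them,
# low watermark 64, credit window 32.  rdmacm is the connection
# manager normally needed for RoCE.
mpirun --mca btl openib,self,sm \
       --mca btl_openib_receive_queues P,65536,120,64,32 \
       --mca btl_openib_cpc_include rdmacm \
       -hostfile mpi-hosts-ce /usr/local/bin/IMB-MPI1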

 

If anyone has any idea as to why using a breakout cable is causing my jobs
to fail, please let me know.

 

Thank you,

 

Brendan T. W. Myers

brendan.my...@soft-forge.com  

Software Forge Inc

 

___
users mailing list
users@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/users

Re: [OMPI users] Startup limited to 128 remote hosts in some situations?

2017-01-20 Thread r...@open-mpi.org
Well, it appears we are already forwarding all envars, which should include
PATH. Here is the qrsh command line we use:

"qrsh --inherit --nostdin -V"
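
A quick way to check whether that alone makes qrsh findable on a backend node
(just a sketch run from inside a job; the node name is made up):

qrsh -inherit -nostdin -V node001 which qrsh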

So would you please try the following patch:

diff --git a/orte/mca/plm/rsh/plm_rsh_component.c b/orte/mca/plm/rsh/plm_rsh_component.c
index 0183bcc..1cc5aa4 100644
--- a/orte/mca/plm/rsh/plm_rsh_component.c
+++ b/orte/mca/plm/rsh/plm_rsh_component.c
@@ -288,8 +288,6 @@ static int rsh_component_query(mca_base_module_t **module, int *priority)
         }
         mca_plm_rsh_component.agent = tmp;
         mca_plm_rsh_component.using_qrsh = true;
-        /* no tree spawn allowed under qrsh */
-        mca_plm_rsh_component.no_tree_spawn = true;
         goto success;
     } else if (!mca_plm_rsh_component.disable_llspawn &&
                NULL != getenv("LOADL_STEP_ID")) {
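
One way to try it, assuming an already-configured Open MPI 2.0.1 source tree
(the patch file name here is made up):

cd openmpi-2.0.1
patch -p1 < no-tree-spawn.patch
make all install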


> On Jan 19, 2017, at 5:29 PM, r...@open-mpi.org wrote:
> 
> I’ll create a patch that you can try - if it works okay, we can commit it
> 
>> On Jan 18, 2017, at 3:29 AM, William Hay  wrote:
>> 
>> On Tue, Jan 17, 2017 at 09:56:54AM -0800, r...@open-mpi.org wrote:
>>> As I recall, the problem was that qrsh isn't available on the backend
>>> compute nodes, and so we can't use a tree for launch. If that isn't
>>> true, then we can certainly adjust it.
>>> 
>> qrsh should be available on all nodes of a SoGE cluster but, depending on
>> how things are set up, may not be findable (i.e. not in the PATH) when you
>> qrsh -inherit into a node.  A workaround would be to start backend
>> processes with qrsh -inherit -v PATH, which will copy the PATH from the
>> master node to the slave node process, or otherwise pass the location of
>> qrsh from one node to another.  That of course assumes that qrsh is in the
>> same location on all nodes.
>> 
>> I've tested that it is possible to qrsh from the head node of a job to a
>> slave node and then on to another slave node by this method.
>> 
>> William
>> 
>> 
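
A rough sketch of the chained test described above (node names are made up;
run from the head node of a job):

qrsh -inherit -v PATH node001 qrsh -inherit -v PATH node002 hostname
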
>>>> On Jan 17, 2017, at 9:37 AM, Mark Dixon  wrote:
>>>> 
>>>> Hi,
>>>> 
>>>> While commissioning a new cluster, I wanted to run HPL across the whole
>>>> thing using openmpi 2.0.1.
>>>> 
>>>> I couldn't get it to start on more than 129 hosts under Son of Gridengine
>>>> (128 remote plus the localhost running the mpirun command). openmpi would
>>>> sit there, waiting for all the orted's to check in; however, there were
>>>> "only" a maximum of 128 qrsh processes, therefore a maximum of 128
>>>> orted's, therefore waiting a long time.
>>>> 
>>>> Increasing plm_rsh_num_concurrent beyond the default of 128 gets the job
>>>> to launch.
>>>> 
>>>> Is this intentional, please?
>>>> 
>>>> Doesn't openmpi use a tree-like startup sometimes - any particular reason
>>>> it's not using it here?
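
Until a fix lands, the workaround Mark mentions can be set per run, e.g. (the
value, hostfile and binary names below are only placeholders):

mpirun --mca plm_rsh_num_concurrent 256 -hostfile myhosts ./xhpl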

___
users mailing list
users@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/users