Re: [OMPI users] How to select specific out of multiple interfaces for communication and support for heterogeneous fabrics

2013-07-05 Thread Ralph Castain
I can't speak for MVAPICH - you probably need to ask them about this scenario. 
OMPI will automatically select whatever available transport can reach the 
intended process. This requires that each communicating pair of processes have 
access to at least one common transport.

So if a process that is on a node with only 1G-E wants to communicate with 
another process, then the node where that other process is running must also 
have access to a compatible Ethernet interface (1G can talk to 10G, so they can 
have different capabilities) on that subnet (or on a subnet that knows how to 
route to the other one). If both nodes have 10G-E as well as 1G-E interfaces, 
then OMPI will automatically take the 10G interface as it is the faster of the 
two.
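
For reference, the transport and interface choice can also be pinned down
explicitly at run time with MCA parameters. A minimal sketch for the 1.6
series (the BTL names are standard; the Ethernet interface name "eth2" and
the application name are only placeholders for your site):

    # use only IB (openib) plus shared memory and self loopback; no TCP at all
    mpirun --mca btl openib,sm,self -np 64 ./my_mpi_app

    # use TCP instead, restricted to one specific interface
    mpirun --mca btl tcp,sm,self --mca btl_tcp_if_include eth2 -np 64 ./my_mpi_app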

Note this means that if a process is on a node that only has IB, and wants to 
communicate with a process on a node that only has 1G-E, then the two processes 
cannot communicate.

HTH
Ralph

On Jul 5, 2013, at 2:34 PM, Michael Thomadakis  wrote:

> Hello OpenMPI
> 
> We are seriously considering deploying OpenMPI 1.6.5 for production (and 
> 1.7.2 for testing) on HPC clusters that consist of nodes with different 
> types of networking interfaces.
> 
> 
> 1) Interface selection
> 
> We are using OpenMPI 1.6.5 and were wondering how one would go about selecting 
> at run time which networking interface to use for MPI communications when IB, 
> 10GigE, and 1GigE are all present. 
> 
> This issue arises in a cluster with nodes that are equipped with different 
> types of interfaces:
> 
> Some have IB (QDR or FDR) as well as 10GigE and 1GigE. Others have only 10GigE 
> and 1GigE, and still others have only 1GigE.
> 
> 
> 2) OpenMPI 1.6.5 level of support for Heterogeneous Fabric
> 
> Can OpenMPI support running an MPI application using a mix of nodes with all 
> of the above networking interface combinations? 
> 
>   2.a) Can the same MPI code (SPMD or MPMD) have a subset of its ranks run on 
> nodes with QDR IB and another subset on FDR IB simultaneously? These are 
> Mellanox QDR and FDR HCAs. 
> 
> Mellanox mentioned to us that they support both QDR and FDR HCAs attached to 
> the same IB subnet. Do you think MVAPICH2 will have any issue with this?
> 
> 2.b) Can the same MPI code (SPMD or MPMD) have a subset of its ranks run on 
> nodes with IB and another subset over 10GigE simultaneously? 
> 
> That is, imagine nodes I1, I2, ..., IN having, say, QDR HCAs and nodes G1, G2, 
> ..., GM having only 10GigE interfaces. Could we have the same MPI application run 
> across both types of nodes? 
> 
> Or should there be, say, two communicators, one of them explicitly overlaid 
> on an IB-only subnet and the other on a 10GigE-only subnet? 
> 
> 
> Please let me know if any of the above is not clear.
> 
> Thank you much
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users



Re: [OMPI users] How to select specific out of multiple interfaces for communication and support for heterogeneous fabrics

2013-07-05 Thread Michael Thomadakis
Sorry about the MVAPICH2 reference :)

All nodes are attached over a common 1GigE network. We wish, of course, that
if a node pair is connected via a higher-speed fabric *as well* (IB FDR or
10GigE), then that fabric would be used instead of the common 1GigE.

One question: suppose that we use nodes having either FDR or QDR IB
interfaces, connected to one common IB fabric, all defined over a common IP
subnet: will OpenMPI have any problem with this? Can MPI communication take
place over this type of hybrid IB fabric? We already have a sub-cluster with
QDR HCAs, and we are attaching it, along with another cluster that has FDR
HCAs, to an IB fabric with an FDR "backbone".

Do you think there may be some issue with this? The HCAs are FDR and QDR
Mellanox devices and the switching is also over FDR Mellanox fabric.
Mellanox claims that at the IB level this is doable (i.e., FDR link pairs
talk to each other at FDR speeds and QDR link pairs at QDR).

I guess that if we use RC connection types, then it does not matter to
OpenMPI.

thanks 
Michael


Re: [OMPI users] How to select specific out of multiple interfaces for communication and support for heterogeneous fabrics

2013-07-05 Thread Ralph Castain
As long as the IB interfaces can communicate with each other, you should be fine.
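
As a quick sanity check (assuming the standard OFED diagnostics are installed
and that the HCA shows up as "mlx4_0", which is typical for Mellanox ConnectX
cards but may differ on your nodes), the negotiated rate of each port can be
read with ibstat; QDR ports should report 40 and FDR ports 56:

    ibstat mlx4_0 1 | grep Rate    # expect "Rate: 40" on QDR, "Rate: 56" on FDR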



Re: [OMPI users] How to select specific out of multiple interfaces for communication and support for heterogeneous fabrics

2013-07-05 Thread Michael Thomadakis
Great ... thanks. We will try it out as soon as the common IB backbone is
in place.

cheers
Michael


Re: [OMPI users] How to select specific out of multiple interfaces for communication and support for heterogeneous fabrics

2013-07-08 Thread Jeff Squyres (jsquyres)
Open MPI may get confused if you end up having different receive queue 
specifications in your IB setup (in the "openib" Byte Transfer Layer (BTL) 
plugin that is used for point-to-point MPI communication transport in OMPI).  

If Open MPI doesn't work out of the box for you in a job that utilizes both QDR 
and FDR, you may need to override some defaults so that all receive queues are 
the same on both QDR-enabled and FDR-enabled nodes.
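
A sketch of what that could look like (the queue specification below is only
illustrative; the defaults typically come from the per-device entries in
mca-btl-openib-device-params.ini, so take the value that ompi_info reports on
one class of your nodes and force it everywhere, either on the command line or
in openmpi-mca-params.conf):

    # show the receive-queue spec in effect on this node
    ompi_info --param btl openib | grep receive_queues

    # force one common spec on all nodes (illustrative value -- use your own)
    mpirun --mca btl_openib_receive_queues \
        P,128,256,192,128:S,2048,1024,1008,64:S,12288,1024,1008,64:S,65536,1024,1008,64 \
        -np 64 ./my_mpi_app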

