[OMPI devel] Porting the underlying fabric interface
Hi developers,

I am trying to add support for a new (proprietary) RDMA-capable fabric to Open MPI and have the following questions:

As I understand it, some networks are implemented as a PML framework component and some are implemented as a BTL framework component. There even seems to be overlap, as Myrinet appears to exist in both.

My question is: what is the difference between these two frameworks? When adding support for a new fabric, what factors should one consider when choosing one type of framework over the other?

And, with apologies for asking a summary question: is there any kind of documentation and/or book that explains the internal details of the implementation (which looks a little like voodoo to a newcomer like me)?

Thanks for your help.

Durga Choudhury

Life is complex. It has real and imaginary parts.
Re: [OMPI devel] Porting the underlying fabric interface
Durga,

did you perhaps confuse PML and MTL?

Basically, a BTL (Byte Transfer Layer) is used with "primitive" interconnects that can only send bytes: if you need to transmit a tagged message, it is up to you to send/recv the tag and manually match it on the receiver side so you can put the message into the right place. On the other hand, an MTL (Matching Transport Layer) can be used with more advanced interconnects that can "natively" send/recv tagged messages.

For example, with InfiniBand you can use the openib BTL or the mxm MTL. (Note that the openib BTL only requires the free ibverbs libraries, while the mxm MTL requires proprietary extensions provided by Mellanox.)

A good starting point is the video Jeff posted at https://www.open-mpi.org/video/?category=internals

Cheers,

Gilles
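(To make the byte-oriented vs. matching distinction above concrete, here is a minimal standalone C sketch. This is not Open MPI code: the wire-header layout and function names are made up for illustration. A BTL-style transport moves raw bytes, so the tag must travel inside the payload and be matched by the caller; an MTL-style interconnect does this matching natively.)

-----
#include <stdio.h>
#include <string.h>
#include <stdint.h>

/* Hypothetical wire header: a byte-only transport carries no message
 * semantics, so the tag must travel inside the payload itself. */
struct wire_hdr {
    uint32_t tag;    /* matching key the receiver will inspect */
    uint32_t len;    /* payload length in bytes */
};

/* Sender side: prepend the header, then hand raw bytes to the fabric.
 * Here the "fabric" is simulated by memcpy into a shared buffer. */
static size_t send_tagged(uint8_t *wire, uint32_t tag,
                          const void *payload, uint32_t len)
{
    struct wire_hdr hdr = { tag, len };
    memcpy(wire, &hdr, sizeof hdr);
    memcpy(wire + sizeof hdr, payload, len);
    return sizeof hdr + len;
}

/* Receiver side: parse the header and match the tag by hand -- exactly
 * the work a matching-capable interconnect would do for you. */
static int recv_tagged(const uint8_t *wire, uint32_t want_tag,
                       void *out, uint32_t max_len)
{
    struct wire_hdr hdr;
    memcpy(&hdr, wire, sizeof hdr);
    if (hdr.tag != want_tag || hdr.len > max_len)
        return -1;   /* no match: a real BTL would queue it as unexpected */
    memcpy(out, wire + sizeof hdr, hdr.len);
    return (int) hdr.len;
}

int main(void)
{
    uint8_t wire[256];
    char msg[64];
    send_tagged(wire, 42, "hello", 6);
    int n = recv_tagged(wire, 42, msg, sizeof msg);
    printf("matched %d bytes: %s\n", n, msg);
    return 0;
}
-----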
[OMPI devel] Use OMPI on another network interface
Hello,

Using a new network interface and its ad hoc routing algorithms, I would like to try my own custom implementation of some collective communication patterns (MPI_Bcast, MPI_Alltoall, ...) without expanding those collectives into series of point-to-point operations based on a given predefined process topology.

In addition, my routing methods might require additional parameters beyond the basic destination lists obtained from that topology and the kind of collective communication considered.

How would I do that? In which component should I modify something?

Regards
Re: [OMPI devel] Use OMPI on another network interface
Hi,

it is difficult to answer such a generic request.

MPI symbols (MPI_Bcast, ...) are defined as weak symbols, so the simplest option is to redefine them and implement them the way you like. You are always able to invoke PMPI_Bcast if you want to invoke the Open MPI implementation.

A more OMPI-ish way is to create your own collective module; for example, the default module is in ompi/mca/coll/tuned.

Cheers,

Gilles
Re: [OMPI devel] Use OMPI on another network interface
+1 on what Gilles said. A little more detail:

1. You can simply write your own "MPI_Bcast" and interpose your version before Open MPI's version. E.g.:

-----
$ cat your_program.c
#include <mpi.h>

int MPI_Bcast(void *buffer, int count, MPI_Datatype datatype,
              int root, MPI_Comm comm)
{
    // Whatever you want your Bcast to do
    return MPI_SUCCESS;
}

int main(int argc, char* argv[])
{
    int value = 0;
    MPI_Init(NULL, NULL);
    MPI_Bcast(&value, 1, MPI_INT, 0, MPI_COMM_WORLD);
    MPI_Finalize();
    return 0;
}
-----

If you need to call MPI functions inside your MPI_Bcast, call them with "PMPI" instead of "MPI". E.g., call "PMPI_Send(...)" instead of "MPI_Send(...)". This guarantees that the back-end Open MPI versions of those functions will be called instead of your versions (if you end up overriding more than MPI_Bcast, for example).

I showed a trivial example above where everything is in one file, but you can also do more complicated setups where you group all your MPI_* function overrides in a library that you link before/to the left of the actual Open MPI library on the command line.

2. As Gilles mentioned, you can write your own Open MPI collectives component. This will have the back-end Open MPI infrastructure call your routine(s) when MPI_Bcast (and friends) are invoked by the application.

Option #2 is a bit more complex than option #1. If you're just looking to test some algorithms and generally play around a little, option #1 is probably what you want to do.

--
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: http://www.cisco.com/web/about/doing_business/legal/cri/
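(Building on option #1 above, here is a hedged sketch of the "separate library / link to the left" approach described in the post. The file names and the trivial fall-through body are illustrative only; the key points are that the override reaches Open MPI's real implementation via PMPI_Bcast, and that the override object is listed before the MPI library when linking.)

-----
$ cat my_bcast.c
#include <mpi.h>

/* Interposed MPI_Bcast: called instead of Open MPI's version because
 * the MPI_* symbols are weak.  Forward to PMPI_Bcast to reuse the
 * real implementation (or substitute your own algorithm here). */
int MPI_Bcast(void *buffer, int count, MPI_Datatype datatype,
              int root, MPI_Comm comm)
{
    return PMPI_Bcast(buffer, count, datatype, root, comm);
}

$ mpicc -c my_bcast.c
$ mpicc -o app app.c my_bcast.o    # override object precedes libmpi
$ mpirun -np 4 ./app
-----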
Re: [OMPI devel] Porting the underlying fabric interface
+1 on what Gilles said. :-)

Check out this part of the v1.10 README file:

https://github.com/open-mpi/ompi-release/blob/v1.10/README#L585-L625

Basically:

- The PML is the back-end to functions like MPI_Send and MPI_Recv.
- The ob1 PML uses BTL plugins, potentially several at once, to utilize multiple networks.
- The cm PML uses matching-style network APIs in MTL plugins to utilize a single underlying network.
- The yalla PML was written by Mellanox as a replacement for cm and ob1, in that it directly utilizes the MXM network library without going through any of the abstractions in ob1 and cm. It was written at a time when cm was not well optimized and basically just added a latency penalty before dispatching to the underlying MTL module. Since then, cm has been optimized such that its abstraction penalty before invoking the underlying MTL module is negligible.

So the question really comes down to:

- If you have a network stack API that does MPI-style matching, you should write an MTL.
- If not, you should write a BTL.

Does that help?

--
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: http://www.cisco.com/web/about/doing_business/legal/cri/
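(For experimenting with the layers described above, the selection can be forced at runtime with MCA parameters. The component names below are the standard ones from this era, assuming they were built into your installation; the application name and process count are placeholders.)

-----
# ob1 PML driving one or more BTLs (here InfiniBand verbs + loopback)
$ mpirun --mca pml ob1 --mca btl openib,self -np 4 ./app

# cm PML driving a single matching-capable MTL (here Mellanox MXM)
$ mpirun --mca pml cm --mca mtl mxm -np 4 ./app
-----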
Re: [OMPI devel] RFC: set MCA param mpi_add_procs_cutoff default to 32
+1

On Wed, Feb 3, 2016 at 9:54 PM, Jeff Squyres (jsquyres) wrote:
> WHAT: Decrease default value of mpi_add_procs_cutoff from 1024 to 32
>
> WHY: The "partial add procs" behavior is supposed to be a key feature of v2.0.0
>
> WHERE: ompi/mpi/runtime/ompi_mpi_params.c
>
> TIMEOUT: Next Tuesday teleconf (9 Feb 2016)
>
> MORE DETAIL:
>
> The mpi_add_procs_cutoff MCA param controls the crossover to when we start doing "partial" add_procs() behavior (i.e., don't just pml.add_procs(ALL_PROCS) during MPI_INIT). Currently, this value defaults to 1024, meaning that we don't get the "partial add_procs" behavior until you run 1025 processes.
>
> Does anyone have an issue with reducing this value to a lower value? I picked 32 somewhat arbitrarily. See the PR for master:
>
> https://github.com/open-mpi/ompi/pull/1340
>
> --
> Jeff Squyres
> jsquy...@cisco.com
> For corporate legal information go to: http://www.cisco.com/web/about/doing_business/legal/cri/
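(For anyone who wants to try the tradeoff before the default changes, the cutoff can be overridden per run. This assumes the semantics implied by the RFC above, i.e., that partial add_procs kicks in once the job size exceeds the cutoff; the application name and process counts are placeholders.)

-----
# Force partial add_procs behavior even for a small job
$ mpirun --mca mpi_add_procs_cutoff 0 -np 64 ./app

# Keep the old eager add_procs behavior regardless of job size
$ mpirun --mca mpi_add_procs_cutoff 1048576 -np 64 ./app
-----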
Re: [OMPI devel] RFC: set MCA param mpi_add_procs_cutoff default to 32
+1

Should we also enable sparse groups by default? (Or at least on master, and then v2.x later.)

Cheers,

Gilles
Re: [OMPI devel] RFC: set MCA param mpi_add_procs_cutoff default to 32
+1, with an addition and a modification:

* turn async_modex on by default
* make the change in master and let it "stew" for a while before moving to 2.0; I believe only Cisco has been running MTT against that setup so far.
Re: [OMPI devel] Porting the underlying fabric interface
Hi Durga,

as an alternative, you could implement a libfabric provider for your network. In theory, if you can implement the reliable datagram endpoint type and a tag-matching mechanism on your network, you could then just use the ofi MTL and not have to do much, if anything, in Open MPI or MPICH etc.

https://github.com/ofiwg/libfabric

You may also want to see if the Open UCX TL model might work for your network. It may be less work than implementing a libfabric provider.

Good luck,

Howard

--
sent from my smart phone, so excuse the typing.
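(As a quick illustration of the libfabric suggestion above, the probe below uses the standard fi_getinfo() call to list providers offering a reliable-datagram endpoint with tagged messaging, the two capabilities named above that the ofi MTL relies on. The requested API version is just an example; build with -lfabric.)

-----
#include <stdio.h>
#include <rdma/fabric.h>

int main(void)
{
    struct fi_info *hints, *info, *cur;
    int ret;

    hints = fi_allocinfo();
    if (!hints)
        return 1;

    /* The ofi MTL wants a reliable datagram endpoint (FI_EP_RDM)
     * with native tag matching (FI_TAGGED). */
    hints->ep_attr->type = FI_EP_RDM;
    hints->caps          = FI_TAGGED;

    ret = fi_getinfo(FI_VERSION(1, 1), NULL, NULL, 0, hints, &info);
    if (ret != 0) {
        printf("no matching provider: %s\n", fi_strerror(-ret));
        fi_freeinfo(hints);
        return 1;
    }

    for (cur = info; cur != NULL; cur = cur->next)
        printf("provider: %s\n", cur->fabric_attr->prov_name);

    fi_freeinfo(info);
    fi_freeinfo(hints);
    return 0;
}
-----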
Re: [OMPI devel] RFC: set MCA param mpi_add_procs_cutoff default to 32
It's been a little while, and I forget exactly what the async modex is; can you refresh my memory?

I'd be OK with enabling the async_modex, but that's not a dependency of this 1024->32 change in either direction, right? I.e., does the "enable async_modex" change need to be tied to this change?

Regardless, I'm fine letting this stuff cook on master for a little bit before PR'ing to v2.x.

--
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: http://www.cisco.com/web/about/doing_business/legal/cri/
Re: [OMPI devel] RFC: set MCA param mpi_add_procs_cutoff default to 32
Yes and no re the dependency. Without async_modex, the cutoff will save you memory footprint but not result in any launch performance benefit. Likewise, turning on async_modex without being over the cutoff won't do you any good, as you'll immediately demand all the modex data.

So they are kinda related, but not in a rigid sense. Maybe they should be...?