[OMPI devel] Porting the underlying fabric interface
Hi developers,

I am trying to add support for a new (proprietary) RDMA-capable fabric to Open MPI and have the following questions:

As I understand it, some networks are implemented as a PML framework component and some are implemented as a BTL framework component. There even seems to be overlap, as Myrinet appears to exist in both.

My question is: what is the difference between these two frameworks? When adding support for a new fabric, what factors should one consider when choosing one type of framework over the other?

And, with apologies for asking a summary question: is there any kind of documentation and/or book that explains the internal details of the implementation (which looks a little like voodoo to a newcomer like me)?

Thanks for your help.

Durga Choudhury

Life is complex. It has real and imaginary parts.
Re: [OMPI devel] Porting the underlying fabric interface
Durga,

did you perhaps confuse PML and MTL?

Basically, a BTL (Byte Transfer Layer) is used with "primitive" interconnects that can only send bytes: if you need to transmit a tagged message, it is up to you to send/recv the tag and manually match it on the receiver side so you can put the message into the right place. On the other hand, an MTL (Matching Transport Layer) can be used with more advanced interconnects that can "natively" send/recv tagged messages.

For example, with InfiniBand you can use the openib BTL or the mxm MTL. (Note that the openib BTL only requires the free ibverbs libraries, while the mxm MTL requires proprietary extensions provided by Mellanox.)

A good starting point is the video Jeff posted at https://www.open-mpi.org/video/?category=internals

Cheers,

Gilles
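(To make the byte-oriented vs. matching distinction above concrete, here is a minimal standalone C sketch. This is not Open MPI code: the wire-header layout and function names are made up for illustration. A BTL-style transport moves raw bytes, so the tag must travel inside the payload and be matched by the caller; an MTL-style interconnect does this matching natively.)

-----
#include <stdio.h>
#include <string.h>
#include <stdint.h>

/* Hypothetical wire header: a byte-only transport carries no message
 * semantics, so the tag must travel inside the payload itself. */
struct wire_hdr {
    uint32_t tag;    /* matching key the receiver will inspect */
    uint32_t len;    /* payload length in bytes */
};

/* Sender side: prepend the header, then hand raw bytes to the fabric.
 * Here the "fabric" is simulated by memcpy into a shared buffer. */
static size_t send_tagged(uint8_t *wire, uint32_t tag,
                          const void *payload, uint32_t len)
{
    struct wire_hdr hdr = { tag, len };
    memcpy(wire, &hdr, sizeof hdr);
    memcpy(wire + sizeof hdr, payload, len);
    return sizeof hdr + len;
}

/* Receiver side: parse the header and match the tag by hand -- exactly
 * the work a matching-capable interconnect would do for you. */
static int recv_tagged(const uint8_t *wire, uint32_t want_tag,
                       void *out, uint32_t max_len)
{
    struct wire_hdr hdr;
    memcpy(&hdr, wire, sizeof hdr);
    if (hdr.tag != want_tag || hdr.len > max_len)
        return -1;   /* no match: a real BTL would queue it as unexpected */
    memcpy(out, wire + sizeof hdr, hdr.len);
    return (int) hdr.len;
}

int main(void)
{
    uint8_t wire[256];
    char msg[64];
    send_tagged(wire, 42, "hello", 6);
    int n = recv_tagged(wire, 42, msg, sizeof msg);
    printf("matched %d bytes: %s\n", n, msg);
    return 0;
}
-----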
[OMPI devel] Use OMPI on another network interface
Hello,

Using a new network interface and its ad hoc routing algorithms, I would like to try my own custom implementation of some collective communication patterns (MPI_Bcast, MPI_Alltoall, ...) without expanding those collectives into series of point-to-point operations based on a given predefined process topology.

In addition, my routing methods might require additional parameters beyond the basic destination lists obtained from that topology and the kind of collective communication considered.

How would I do that? In which component should I modify something?

Regards
Re: [OMPI devel] Use OMPI on another network interface
Hi,

it is difficult to answer such a generic request.

MPI symbols (MPI_Bcast, ...) are defined as weak symbols, so the simplest option is to redefine them and implement them the way you like. You are always able to invoke PMPI_Bcast if you want to invoke the Open MPI implementation.

A more OMPI-ish way is to create your own collective module; for example, the default module is in ompi/mca/coll/tuned.

Cheers,

Gilles
Re: [OMPI devel] Use OMPI on another network interface
+1 on what Gilles said. A little more detail:

1. You can simply write your own "MPI_Bcast" and interpose your version before Open MPI's version. E.g.:

-----
$ cat your_program.c
#include <mpi.h>

int MPI_Bcast(void *buffer, int count, MPI_Datatype datatype,
              int root, MPI_Comm comm)
{
    // Whatever you want your Bcast to do
    return MPI_SUCCESS;
}

int main(int argc, char* argv[])
{
    int value = 0;
    MPI_Init(NULL, NULL);
    MPI_Bcast(&value, 1, MPI_INT, 0, MPI_COMM_WORLD);
    MPI_Finalize();
    return 0;
}
-----

If you need to call MPI functions inside your MPI_Bcast, call them with "PMPI" instead of "MPI". E.g., call "PMPI_Send(...)" instead of "MPI_Send(...)". This guarantees that the back-end Open MPI versions of those functions will be called instead of your versions (if you end up overriding more than MPI_Bcast, for example).

I showed a trivial example above where everything is in one file, but you can also do more complicated setups where you group all your MPI_* function overrides in a library that you link before/to the left of the actual Open MPI library on the command line.

2. As Gilles mentioned, you can write your own Open MPI collectives component. This will have the back-end Open MPI infrastructure call your routine(s) when MPI_Bcast (and friends) are invoked by the application.

Option #2 is a bit more complex than option #1. If you're just looking to test some algorithms and generally play around a little, option #1 is probably what you want to do.

--
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: http://www.cisco.com/web/about/doing_business/legal/cri/
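(Building on option #1 above, here is a hedged sketch of the "separate library / link to the left" approach described in the post. The file names and the trivial fall-through body are illustrative only; the key points are that the override reaches Open MPI's real implementation via PMPI_Bcast, and that the override object is listed before the MPI library when linking.)

-----
$ cat my_bcast.c
#include <mpi.h>

/* Interposed MPI_Bcast: called instead of Open MPI's version because
 * the MPI_* symbols are weak.  Forward to PMPI_Bcast to reuse the
 * real implementation (or substitute your own algorithm here). */
int MPI_Bcast(void *buffer, int count, MPI_Datatype datatype,
              int root, MPI_Comm comm)
{
    return PMPI_Bcast(buffer, count, datatype, root, comm);
}

$ mpicc -c my_bcast.c
$ mpicc -o app app.c my_bcast.o    # override object precedes libmpi
$ mpirun -np 4 ./app
-----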
Re: [OMPI devel] Porting the underlying fabric interface
+1 on what Gilles said. :-)

Check out this part of the v1.10 README file:

https://github.com/open-mpi/ompi-release/blob/v1.10/README#L585-L625

Basically:

- The PML is the back-end to functions like MPI_Send and MPI_Recv.
- The ob1 PML uses BTL plugins, potentially several at once, to utilize multiple networks.
- The cm PML uses matching-style network APIs in MTL plugins to utilize a single underlying network.
- The yalla PML was written by Mellanox as a replacement for cm and ob1, in that it directly utilizes the MXM network library without going through any of the abstractions in ob1 and cm. It was written at a time when cm was not well optimized and basically just added a latency penalty before dispatching to the underlying MTL module. Since then, cm has been optimized such that its abstraction penalty before invoking the underlying MTL module is negligible.

So the question really comes down to:

- If you have a network stack API that does MPI-style matching, you should write an MTL.
- If not, you should write a BTL.

Does that help?

--
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: http://www.cisco.com/web/about/doing_business/legal/cri/
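(For experimenting with the layers described above, the selection can be forced at runtime with MCA parameters. The component names below are the standard ones from this era, assuming they were built into your installation; the application name and process count are placeholders.)

-----
# ob1 PML driving one or more BTLs (here InfiniBand verbs + loopback)
$ mpirun --mca pml ob1 --mca btl openib,self -np 4 ./app

# cm PML driving a single matching-capable MTL (here Mellanox MXM)
$ mpirun --mca pml cm --mca mtl mxm -np 4 ./app
-----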
Re: [OMPI devel] RFC: set MCA param mpi_add_procs_cutoff default to 32
+1

On Wed, Feb 3, 2016 at 9:54 PM, Jeff Squyres (jsquyres) wrote:
> WHAT: Decrease default value of mpi_add_procs_cutoff from 1024 to 32
>
> WHY: The "partial add procs" behavior is supposed to be a key feature of v2.0.0
>
> WHERE: ompi/mpi/runtime/ompi_mpi_params.c
>
> TIMEOUT: Next Tuesday teleconf (9 Feb 2016)
>
> MORE DETAIL:
>
> The mpi_add_procs_cutoff MCA param controls the crossover to when we start doing "partial" add_procs() behavior (i.e., don't just pml.add_procs(ALL_PROCS) during MPI_INIT). Currently, this value defaults to 1024, meaning that we don't get the "partial add_procs" behavior until you run 1025 processes.
>
> Does anyone have an issue with reducing this value to a lower value? I picked 32 somewhat arbitrarily. See the PR for master:
>
> https://github.com/open-mpi/ompi/pull/1340
>
> --
> Jeff Squyres
> jsquy...@cisco.com
> For corporate legal information go to: http://www.cisco.com/web/about/doing_business/legal/cri/
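(For anyone who wants to try the tradeoff before the default changes, the cutoff can be overridden per run. This assumes the semantics implied by the RFC above, i.e., that partial add_procs kicks in once the job size exceeds the cutoff; the application name and process counts are placeholders.)

-----
# Force partial add_procs behavior even for a small job
$ mpirun --mca mpi_add_procs_cutoff 0 -np 64 ./app

# Keep the old eager add_procs behavior regardless of job size
$ mpirun --mca mpi_add_procs_cutoff 1048576 -np 64 ./app
-----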
Re: [OMPI devel] RFC: set MCA param mpi_add_procs_cutoff default to 32
+1

Should we also enable sparse groups by default? (Or at least on master, and then v2.x later.)

Cheers,

Gilles
Re: [OMPI devel] RFC: set MCA param mpi_add_procs_cutoff default to 32
+1, with an addition and a modification:

* turn async_modex on by default
* make the change in master and let it "stew" for a while before moving to 2.0; I believe only Cisco has been running MTT against that setup so far.
Re: [OMPI devel] Porting the underlying fabric interface
Hi Durga,

as an alternative, you could implement a libfabric provider for your network. In theory, if you can implement the reliable datagram endpoint type and a tag-matching mechanism on your network, you could then just use the ofi MTL and not have to do much, if anything, in Open MPI or MPICH etc.

https://github.com/ofiwg/libfabric

You may also want to see if the Open UCX TL model might work for your network. It may be less work than implementing a libfabric provider.

Good luck,

Howard

--
sent from my smart phone, so excuse the typing.
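(As a quick illustration of the libfabric suggestion above, the probe below uses the standard fi_getinfo() call to list providers offering a reliable-datagram endpoint with tagged messaging, the two capabilities named above that the ofi MTL relies on. The requested API version is just an example; build with -lfabric.)

-----
#include <stdio.h>
#include <rdma/fabric.h>

int main(void)
{
    struct fi_info *hints, *info, *cur;
    int ret;

    hints = fi_allocinfo();
    if (!hints)
        return 1;

    /* The ofi MTL wants a reliable datagram endpoint (FI_EP_RDM)
     * with native tag matching (FI_TAGGED). */
    hints->ep_attr->type = FI_EP_RDM;
    hints->caps          = FI_TAGGED;

    ret = fi_getinfo(FI_VERSION(1, 1), NULL, NULL, 0, hints, &info);
    if (ret != 0) {
        printf("no matching provider: %s\n", fi_strerror(-ret));
        fi_freeinfo(hints);
        return 1;
    }

    for (cur = info; cur != NULL; cur = cur->next)
        printf("provider: %s\n", cur->fabric_attr->prov_name);

    fi_freeinfo(info);
    fi_freeinfo(hints);
    return 0;
}
-----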
Re: [OMPI devel] RFC: set MCA param mpi_add_procs_cutoff default to 32
It's been a little while, and I forget exactly what the async modex is; can you refresh my memory?

I'd be OK with enabling the async_modex, but that's not a dependency of this 1024->32 change in either direction, right? I.e., does the "enable async_modex" change need to be tied to this change?

Regardless, I'm fine letting this stuff cook on master for a little bit before PR'ing to v2.x.

--
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: http://www.cisco.com/web/about/doing_business/legal/cri/
Re: [OMPI devel] RFC: set MCA param mpi_add_procs_cutoff default to 32
Yes and no re the dependency. Without async_modex, the cutoff will save you memory footprint but not result in any launch performance benefit. Likewise, turning on async_modex without being over the cutoff won't do you any good, as you'll immediately demand all the modex data.

So they are kinda related, but not in a rigid sense. Maybe they should be...?