Dave -- Just out of curiosity, what kind of performance do you get when you use MXM (e.g., the yalla PML on master)?
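(For anyone wanting to run that comparison, a minimal sketch of the launch lines follows; the benchmark binary ./NPmpi and the host names are placeholders, and this assumes an MXM-enabled build of the master branch.)

    # Route point-to-point traffic through MXM via the yalla PML:
    mpirun -np 2 --host node1,node2 --mca pml yalla ./NPmpi

    # Alternative path through the cm PML and the mxm MTL, if yalla is not built:
    mpirun -np 2 --host node1,node2 --mca pml cm --mca mtl mxm ./NPmpi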
> On Feb 19, 2015, at 6:41 PM, Dave Turner <drdavetur...@gmail.com> wrote:
>
> I've downloaded the Open MPI master as suggested and rerun all my aggregate
> tests across my system with QDR IB and 10 Gbps RoCE.
>
> The attached unidirectional.pdf graph is the ping-pong performance for 1 core
> on 1 machine to 1 core on the 2nd. The red curve for Open MPI 1.8.3 shows lower
> performance for small and also medium message sizes for the base test without
> any tuning parameters. The green line for the Open MPI master shows lower
> performance only for small messages, but great performance for medium sizes.
> Turning off the 10 Gbps card entirely produces great performance for all
> message sizes. So the fixes in master at least help, but it still seems to be
> choosing RoCE rather than QDR IB for small messages. They both use the openib
> BTL, so I assume it just chooses one at random, which is probably not that
> surprising. Since there are no tunable parameters for multiple openib BTLs,
> this cannot be manually tuned.
>
> The bidirectional ping-pong tests show basically the same thing, with lower
> performance for small message sizes for both 1.8.3 and master. However, I'm
> also seeing the maximum bandwidth limited to 44 Gbps instead of 60 Gbps on
> master for some reason.
>
> The aggregate tests in the 3rd graph are for 20 cores on one machine yelling
> at 20 cores on the 2nd machine (also bidirectional). They likewise show the
> lower 10 Gbps RoCE performance for small messages, and also show the maximum
> bandwidth limited to 45 Gbps for master.
>
> Our solution for now is to simply exclude mlx4_1, the 10 Gbps card, which
> gives us QDR performance but does not let us use the extra 10 Gbps to channel
> bond for large messages. More worrisome is that the maximum bandwidth on the
> bidirectional and aggregate tests using master is slower than it should be.
>
> Dave
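(For reference, a per-job way to exclude the 10 Gbps HCA is sketched below; it assumes the card really is enumerated as mlx4_1, and the host names, process count, and benchmark binary ./NPmpi are placeholders.)

    # Drop mlx4_1 from the openib BTL so only the QDR HCA carries verbs traffic:
    mpirun -np 40 --host node1,node2 \
        --mca btl_openib_if_exclude mlx4_1 \
        ./NPmpi

    # Or make it the default for every job in $HOME/.openmpi/mca-params.conf:
    #   btl_openib_if_exclude = mlx4_1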
> On Wed, Feb 11, 2015 at 11:00 AM, <devel-requ...@open-mpi.org> wrote:
>
> ------------------------------
>
> Message: 1
> Date: Tue, 10 Feb 2015 20:41:30 -0500
> From: George Bosilca <bosi...@icl.utk.edu>
> Subject: Re: [OMPI devel] RoCE plus QDR IB tunable parameters
>
> Somehow one of the most basic pieces of information about the capabilities of
> the BTLs (bandwidth) disappeared from the MCA parameters, and the one left
> (latency) was mislabeled. This mishap not only prevented the communication
> engine from correctly ordering the BTLs for small messages (the latency-bound
> part), but also introduced an undesirable bias in the load-balancing logic
> across multiple devices (the bandwidth part).
>
> I just pushed a fix to master:
> https://github.com/open-mpi/ompi/commit/e173f9b0c0c63c3ea24b8d8bc0ebafe1f1736acb.
> Once validated, this should be moved over to the 1.8 branch.
>
> Dave, do you think it would be possible to rerun your experiment with the
> current master?
>
> Thanks,
> George.
>
> On Mon, Feb 9, 2015 at 2:57 PM, Dave Turner <drdavetur...@gmail.com> wrote:
> >
> > Gilles,
> >
> > I tried running with btl_openib_cpc_include rdmacm and saw no change.
> >
> > Let's simplify the problem by forgetting about the channel bonding.
> > If I just do an aggregate test of 16 cores on one machine talking to 16 on
> > a second machine, without changing any settings from the default install of
> > Open MPI, I see that RoCE over the 10 Gbps link is used for small messages,
> > then it switches over to QDR IB for large messages. I don't see channel
> > bonding for large messages, but I can turn this on with the
> > btl_tcp_exclusivity parameter.
> >
> > I think there are 2 problems here, both related to the fact that the QDR IB
> > link and RoCE both use the same openib BTL. The first problem is that the
> > slower RoCE link is being chosen for small messages, which lowers performance
> > significantly. The second problem is that I don't think there are parameters
> > that allow tuning of multiple openib BTLs to manually select one over the
> > other.
> >
> > Dave
> >
> > On Fri, Feb 6, 2015 at 8:24 PM, Gilles Gouaillardet
> > <gilles.gouaillar...@gmail.com> wrote:
> >
> >> Dave,
> >>
> >> These settings tell OMPI to use native InfiniBand on the QDR IB port and
> >> TCP/IP on the other port.
> >>
> >> From the FAQ, RoCE is implemented in the openib BTL:
> >> http://www.open-mpi.org/faq/?category=openfabrics#ompi-over-roce
> >>
> >> Did you use
> >> --mca btl_openib_cpc_include rdmacm
> >> in your first tests?
> >>
> >> I had some second thoughts about the bandwidth values, and imho they should
> >> be 327680 and 81920 because of the 8/10 encoding (that being said, it should
> >> not change the measured performance).
> >>
> >> Also, could you try again forcing the same btl_tcp_latency and
> >> btl_openib_latency?
> >>
> >> Cheers,
> >>
> >> Gilles
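(For concreteness, Gilles' two suggestions could be combined on one command line; this is only a sketch, with the benchmark binary ./NPmpi and the hosts as placeholders, and the latency value of 4 chosen arbitrarily: the point is merely that both BTLs report the same number.)

    # Use the RDMA connection manager for the RoCE port, and pin both BTLs
    # to the same latency so neither is preferred on that basis alone.
    mpirun -np 2 --host node1,node2 \
        --mca btl_openib_cpc_include rdmacm \
        --mca btl_openib_latency 4 \
        --mca btl_tcp_latency 4 \
        ./NPmpi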
> >> Dave Turner <drdavetur...@gmail.com> wrote:
> >>
> >> George,
> >>
> >> I can check with my guys on Monday, but I think the bandwidth parameters
> >> are the defaults. I did alter these to 40960 and 10240, as someone else
> >> suggested to me. The attached graph shows the base red line, along with the
> >> manually balanced blue line and the auto-balanced green line (0's for both).
> >> The shift lower suggests to me that the higher TCP latency is being pulled
> >> in. I'm not sure why the curves are shifted right.
> >>
> >> Dave
> >>
> >> On Fri, Feb 6, 2015 at 5:32 PM, George Bosilca <bosi...@icl.utk.edu> wrote:
> >>
> >>> Dave,
> >>>
> >>> Based on your ompi_info.all, the following bandwidths are reported on your
> >>> system:
> >>>
> >>> MCA btl: parameter "btl_openib_bandwidth" (current value: "4", data source:
> >>>          default, level: 5 tuner/detail, type: unsigned)
> >>>          Approximate maximum bandwidth of interconnect (0 = auto-detect
> >>>          value at run-time [not supported in all BTL modules],
> >>>          >= 1 = bandwidth in Mbps)
> >>>
> >>> MCA btl: parameter "btl_tcp_bandwidth" (current value: "100", data source:
> >>>          default, level: 5 tuner/detail, type: unsigned)
> >>>          Approximate maximum bandwidth of interconnect (0 = auto-detect
> >>>          value at run-time [not supported in all BTL modules],
> >>>          >= 1 = bandwidth in Mbps)
> >>>
> >>> This basically means the default values for these parameters are wrong on
> >>> your system: they make your TCP network look much faster than the IB
> >>> network, which explains the somewhat unexpected decision of OMPI.
> >>>
> >>> As a possible solution I suggest you set these bandwidth values to
> >>> something more meaningful (directly in your configuration file). As an
> >>> example,
> >>>
> >>> btl_openib_bandwidth = 40000
> >>> btl_tcp_bandwidth = 10000
> >>>
> >>> would make more sense given your HPC system description.
> >>>
> >>> George.
> >>>
> >>> On Fri, Feb 6, 2015 at 5:37 PM, Dave Turner <drdavetur...@gmail.com> wrote:
> >>>
> >>>> We have nodes in our HPC system that have 2 NICs, one being QDR IB and
> >>>> the second being a slower 10 Gbps card configured for both RoCE and TCP.
> >>>> Aggregate bandwidth tests with 20 cores on one node yelling at 20 cores
> >>>> on a second node (attached roce.ib.aggregate.pdf) show that, without
> >>>> tuning, the slower RoCE interface is used for small messages and QDR IB
> >>>> is used for larger messages (red line). Tuning the tcp_exclusivity to
> >>>> 1024 to match the openib_exclusivity adds another 20 Gbps of bidirectional
> >>>> bandwidth at the high end (green line), and I'm guessing this is TCP
> >>>> traffic and not RoCE.
> >>>>
> >>>> So by default the slower interface is being chosen on the low end, and I
> >>>> don't think there are tunable parameters that let me choose the QDR
> >>>> interface as the default. Going forward we'll probably just disable RoCE
> >>>> on these nodes and go with QDR IB plus 10 Gbps TCP for large messages.
> >>>>
> >>>> However, I do think these issues will come up more in the future. With
> >>>> the low latency of RoCE matching IB, there are more opportunities to do
> >>>> channel bonding or to allow multiple interfaces to aggregate traffic for
> >>>> even smaller message sizes.
> >>>>
> >>>> Dave Turner
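(Pulling the concrete numbers from this sub-thread together, one place they could live is the MCA parameter file; a sketch only, assuming the stock file locations of an Open MPI install and the link speeds described above.)

    # <prefix>/etc/openmpi-mca-params.conf  (or per-user: $HOME/.openmpi/mca-params.conf)

    # Advertise realistic link speeds so OMPI does not treat the TCP network
    # as the faster one:
    btl_openib_bandwidth = 40000
    btl_tcp_bandwidth = 10000

    # Match the openib BTL's exclusivity so the tcp BTL is not excluded and
    # both interfaces carry large messages (channel bonding):
    btl_tcp_exclusivity = 1024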
>
> ------------------------------
>
> Message: 2
> Date: Tue, 10 Feb 2015 20:34:59 -0700
> From: Howard Pritchard <hpprit...@gmail.com>
> Subject: Re: [OMPI devel] RoCE plus QDR IB tunable parameters
>
> Hi George,
>
> I'd say commit cf377db82 explains the vanishing of the bandwidth metric as
> well as the mislabeling of the latency metric.
>
> Howard
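(As a quick way to confirm that the bandwidth parameter is back, and the latency one correctly labeled, once the fix lands, ompi_info can dump the tuner-level parameters of both BTLs; a sketch, with the grep pattern chosen loosely.)

    # Show all parameter levels for the openib and tcp BTLs and keep only
    # the bandwidth/latency entries.
    ompi_info --param btl openib --level 9 | grep -E 'bandwidth|latency'
    ompi_info --param btl tcp --level 9 | grep -E 'bandwidth|latency'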
>
> --
> Work: davetur...@ksu.edu (785) 532-7791
> 118 Nichols Hall, Manhattan KS 66502
> Home: drdavetur...@gmail.com
> cell: (785) 770-5929
>
> <unidirectional.pdf> <bidirectional.pdf> <aggregate.pdf>

--
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/