Dave --

Just out of curiosity, what kind of performance do you get when you use MXM?  
(e.g., the yalla PML on master)
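
For example, something along these lines (host names and benchmark binary
are placeholders):

    mpirun --mca pml yalla -np 2 --host node1,node2 ./pingpong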


> On Feb 19, 2015, at 6:41 PM, Dave Turner <drdavetur...@gmail.com> wrote:
> 
> 
>      I've downloaded the OpenMPI master as suggested and rerun all my
> aggregate tests across my system with QDR IB and 10 Gbps RoCE.
> 
>      The attached unidirectional.pdf graph is the ping-pong performance for
> 1 core on one machine to 1 core on the second.  The red curve for OpenMPI
> 1.8.3 shows lower performance for small and also medium message sizes in the
> base test without any tuning parameters.  The green line from the OpenMPI
> master shows lower performance only for small messages, but great
> performance for medium sizes.  Turning off the 10 Gbps card entirely
> produces great performance for all message sizes.  So the fixes in the
> master at least help, but it still seems to be choosing RoCE over QDR IB
> for small messages.  Both links use the openib btl, so I assume one is
> simply chosen at random, which is probably not that surprising.  Since
> there are no tunable parameters for multiple openib btl's, this cannot be
> tuned manually.
> 
>      The bi-directional ping-pong tests show basically the same thing:
> lower performance for small message sizes for both 1.8.3 and the master.
> However, for the master I'm also seeing the max bandwidth limited to
> 44 Gbps instead of 60 Gbps, for some reason.
> 
>      The aggregate tests in the 3rd graph are for 20 cores on one machine
> yelling at 20 cores on the 2nd machine (bi-directional as well).  They
> likewise show the lower 10 Gbps RoCE performance for small messages, and
> also show the max bandwidth limited to 45 Gbps for the master.
> 
>      Our solution for now is to simply exclude mlx4_1, the 10 Gbps card.
> That gives us QDR performance but doesn't let us use the extra 10 Gbps
> for channel bonding on large messages.  More worrisome is that the max
> bandwidth on the bi-directional and aggregate tests using the master is
> slower than it should be.
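> 
> For reference, the exclusion is just (the rest of the command line is a
> placeholder):
> 
>     mpirun --mca btl_openib_if_exclude mlx4_1 -np 40 ./aggregate_test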
> 
>                        Dave
> 
> On Wed, Feb 11, 2015 at 11:00 AM, <devel-requ...@open-mpi.org> wrote:
> 
> 
> Today's Topics:
> 
>    1. Re: [OMPI devel] RoCE plus QDR IB tunable parameters
>       (George Bosilca)
>    2. Re: [OMPI devel] RoCE plus QDR IB tunable parameters
>       (Howard Pritchard)
> 
> 
> ----------------------------------------------------------------------
> 
> Message: 1
> Date: Tue, 10 Feb 2015 20:41:30 -0500
> From: George Bosilca <bosi...@icl.utk.edu>
> To: drdavetur...@gmail.com, Open MPI Developers <de...@open-mpi.org>
> Subject: Re: [OMPI devel] RoCE plus QDR IB tunable parameters
> 
> Somehow one of the most basic pieces of information about the capabilities
> of the BTLs (bandwidth) disappeared from the MCA parameters, and the one
> left (latency) was mislabeled.  This mishap not only prevented the
> communication engine from correctly ordering the BTLs for small messages
> (the latency-bound part), but also introduced an undesirable bias into the
> logic that load-balances between multiple devices (the bandwidth part).
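> 
> As a rough illustration of the bandwidth part (numbers are illustrative,
> not measurements): with btl_openib_bandwidth = 40000 and btl_tcp_bandwidth
> = 10000, large messages get striped across the two devices in proportion
> to bandwidth, i.e. about 40000 / (40000 + 10000) = 80% over IB and 20%
> over TCP.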
> 
> I just pushed a fix in master:
> https://github.com/open-mpi/ompi/commit/e173f9b0c0c63c3ea24b8d8bc0ebafe1f1736acb
> Once validated, this should be moved over to the 1.8 branch.
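> 
> To verify the parameter is visible again, something like this should work
> (ompi_info option spelling may vary slightly across versions):
> 
>     ompi_info --param btl openib --level 9 | grep bandwidth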
> 
> Dave, do you think it would be possible to rerun your experiment with the
> current master?
> 
>   Thanks,
>     George.
> 
> 
> 
> On Mon, Feb 9, 2015 at 2:57 PM, Dave Turner <drdavetur...@gmail.com> wrote:
> 
> > Gilles,
> >
> >      I tried running with btl_openib_cpc_include rdmacm and saw no change.
> >
> >      Let's simplify the problem by forgetting about the channel bonding.
> > If I just do an aggregate test of 16 cores on one machine talking to 16 on
> > a second machine, without any settings changed from the default install of
> > OpenMPI, I see that RoCE over the 10 Gbps link is used for small messages,
> > then it switches over to QDR IB for large messages.  I don't see channel
> > bonding for large messages, but can turn it on with the
> > btl_tcp_exclusivity parameter.
> >
> >      I think there are 2 problems here, both related to the fact that the
> > QDR IB link and RoCE both use the same openib btl.  The first problem is
> > that the slower RoCE link is being chosen for small messages, which lowers
> > performance significantly.  The second problem is that I don't think there
> > are parameters that allow tuning of multiple openib btl's to manually
> > select one over the other.
> >
> >                        Dave
> >
> > On Fri, Feb 6, 2015 at 8:24 PM, Gilles Gouaillardet <
> > gilles.gouaillar...@gmail.com> wrote:
> >
> >> Dave,
> >>
> >> These settings tell ompi to use native InfiniBand on the QDR IB port
> >> and TCP/IP on the other port.
> >>
> >> From the FAQ, RoCE is implemented in the openib btl:
> >> http://www.open-mpi.org/faq/?category=openfabrics#ompi-over-roce
> >>
> >> Did you use
> >> --mca btl_openib_cpc_include rdmacm
> >> in your first tests?
> >>
> >> I had some second thoughts about the bandwidth values, and imho they
> >> should be 32768 and 8192 because of the 8/10 encoding (40960 and 10240
> >> are the raw signaling rates).
> >> (That being said, it should not change the measured performance.)
> >>
> >> Also, could you try again forcing the same btl_tcp_latency and
> >> btl_openib_latency?
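> >>
> >> For example (100 is only an illustrative value; the point is that both
> >> latencies match):
> >>
> >>     mpirun --mca btl_tcp_latency 100 --mca btl_openib_latency 100 ...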
> >>
> >> Cheers,
> >>
> >> Gilles
> >>
> >> Dave Turner <drdavetur...@gmail.com> wrote:
> >> George,
> >>
> >>      I can check with my guys on Monday, but I think the bandwidth
> >> parameters are the defaults.  I did alter these to 40960 and 10240, as
> >> someone else suggested to me.  The attached graph shows the base red
> >> line, along with the manually balanced blue line and the auto-balanced
> >> green line (0's for both).  The downward shift suggests to me that the
> >> higher TCP latency is being pulled in.  I'm not sure why the curves are
> >> shifted right.
> >>
> >>                         Dave
> >>
> >> On Fri, Feb 6, 2015 at 5:32 PM, George Bosilca <bosi...@icl.utk.edu>
> >> wrote:
> >>
> >>> Dave,
> >>>
> >>> Based on your ompi_info.all, the following bandwidths are reported on
> >>> your system:
> >>>
> >>>   MCA btl: parameter "btl_openib_bandwidth" (current value: "4", data
> >>>            source: default, level: 5 tuner/detail, type: unsigned)
> >>>            Approximate maximum bandwidth of interconnect (0 =
> >>>            auto-detect value at run-time [not supported in all BTL
> >>>            modules], >= 1 = bandwidth in Mbps)
> >>>
> >>>   MCA btl: parameter "btl_tcp_bandwidth" (current value: "100", data
> >>>            source: default, level: 5 tuner/detail, type: unsigned)
> >>>            Approximate maximum bandwidth of interconnect (0 =
> >>>            auto-detect value at run-time [not supported in all BTL
> >>>            modules], >= 1 = bandwidth in Mbps)
> >>>
> >>> This basically states that on your system the default values for these
> >>> parameters are wrong: they claim your TCP network is much faster than
> >>> the IB one, which explains the somewhat unexpected decision by OMPI.
> >>>
> >>> As a possible solution, I suggest you set these bandwidth values to
> >>> something more meaningful (directly in your configuration file). As an
> >>> example,
> >>>
> >>> btl_openib_bandwidth = 40000
> >>> btl_tcp_bandwidth = 10000
> >>>
> >>> would make more sense based on your HPC system description.
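> >>>
> >>> (A sketch of where to put these, assuming a default install: per-user
> >>> in $HOME/.openmpi/mca-params.conf, or system-wide in
> >>> <installdir>/etc/openmpi-mca-params.conf, one "name = value" per line.)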
> >>>
> >>>   George.
> >>>
> >>>
> >>>
> >>>
> >>> On Fri, Feb 6, 2015 at 5:37 PM, Dave Turner <drdavetur...@gmail.com>
> >>> wrote:
> >>>
> >>>>
> >>>>      We have nodes in our HPC system that have two NICs, one being QDR
> >>>> IB and the second being a slower 10 Gbps card configured for both RoCE
> >>>> and TCP.  Aggregate bandwidth tests with 20 cores on one node yelling
> >>>> at 20 cores on a second node (attached roce.ib.aggregate.pdf) show
> >>>> that without tuning, the slower RoCE interface is used for small
> >>>> messages and QDR IB for larger messages (red line).  Tuning the
> >>>> tcp_exclusivity to 1024 to match the openib_exclusivity adds another
> >>>> 20 Gbps of bidirectional bandwidth at the high end (green line), and
> >>>> I'm guessing this is TCP traffic and not RoCE.
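> >>>>
> >>>> For reference, the tuned run was along these lines (the rest of the
> >>>> command line is a placeholder; 1024 matches the openib default):
> >>>>
> >>>>     mpirun --mca btl_tcp_exclusivity 1024 -np 40 ./aggregate_test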
> >>>>
> >>>>      So by default the slower interface is being chosen on the low
> >>>> end, and I don't think there are tunable parameters that would let me
> >>>> choose the QDR interface as the default.  Going forward we'll probably
> >>>> just disable RoCE on these nodes and go with QDR IB plus 10 Gbps TCP
> >>>> for large messages.
> >>>>
> >>>>      However, I do think these issues will come up more in the future.
> >>>> With the low latency of RoCE matching IB, there are more opportunities
> >>>> to do channel bonding or to allow multiple interfaces to carry
> >>>> aggregate traffic for even smaller message sizes.
> >>>>
> >>>>                 Dave Turner
> >>>>
> >>>> --
> >>>> Work:     davetur...@ksu.edu     (785) 532-7791
> >>>>              118 Nichols Hall, Manhattan KS  66502
> >>>> Home:    drdavetur...@gmail.com
> >>>>               cell: (785) 770-5929
> >>>>
> >>>> _______________________________________________
> >>>> devel mailing list
> >>>> de...@open-mpi.org
> >>>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> >>>> Link to this post:
> >>>> http://www.open-mpi.org/community/lists/devel/2015/02/16951.php
> >>>>
> >
> > _______________________________________________
> > devel mailing list
> > de...@open-mpi.org
> > Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> > Link to this post:
> > http://www.open-mpi.org/community/lists/devel/2015/02/16963.php
> >
> 
> ------------------------------
> 
> Message: 2
> Date: Tue, 10 Feb 2015 20:34:59 -0700
> From: Howard Pritchard <hpprit...@gmail.com>
> To: Open MPI Developers <de...@open-mpi.org>
> Subject: Re: [OMPI devel] RoCE plus QDR IB tunable parameters
> 
> Hi George,
> 
> I'd say commit cf377db82 explains the vanishing of the bandwidth metric as
> well as the mislabeling of the latency metric.
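> 
> For anyone who wants to look at that change, from an ompi checkout it's
> simply:
> 
>     git show cf377db82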
> 
> Howard
> 
> 
> ------------------------------
> 
> End of devel Digest, Vol 2917, Issue 1
> **************************************
> 
> 
> 
> -- 
> Work:     davetur...@ksu.edu     (785) 532-7791
>              118 Nichols Hall, Manhattan KS  66502
> Home:    drdavetur...@gmail.com
>               cell: (785) 770-5929
> <unidirectional.pdf><bidirectional.pdf><aggregate.pdf>
> _______________________________________________
> devel mailing list
> de...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> Link to this post: 
> http://www.open-mpi.org/community/lists/devel/2015/02/17004.php


-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: 
http://www.cisco.com/web/about/doing_business/legal/cri/
