I've downloaded the OpenMPI master as suggested and rerun all my aggregate
tests across my system with QDR IB and 10 Gbps RoCE.

     The attached unidirectional.pdf graph shows the ping-pong performance
from 1 core on one machine to 1 core on the second.  The red curve for
OpenMPI 1.8.3 shows lower performance for both small and medium message
sizes in the base test without any tuning parameters.  The green line for
the OpenMPI master shows lower performance only for small messages, with
good performance at medium sizes.  Turning off the 10 Gbps card entirely
produces good performance at all message sizes.  So the fixes in the master
at least help, but it still seems to be choosing RoCE rather than QDR IB for
small messages.  Both interfaces use the same openib btl, so I assume one is
simply being chosen at random, which is probably not that surprising.  Since
there are no tunable parameters for distinguishing between multiple openib
btl's, this cannot be tuned manually.
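
     For anyone trying to reproduce this, one rough way to at least see
which devices and ports the openib btl brings up on a run is to raise the
BTL verbosity.  This is only a sketch; the hostfile and test binary names
below are placeholders, not our actual files:

    mpirun --mca btl_base_verbose 100 -np 2 --hostfile ./hosts ./pingpong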

     The bi-directional ping-pong tests (bidirectional.pdf) show basically
the same thing, with lower performance for small message sizes for both
1.8.3 and the master.  However, I'm also seeing the maximum bandwidth
limited to 44 Gbps instead of 60 Gbps for the master, for some reason.

     The aggregate tests in the third graph (aggregate.pdf) are for 20 cores
on one machine yelling at 20 cores on the second machine (bi-directional as
well).  They likewise show the lower 10 Gbps RoCE performance for small
messages, and again the maximum bandwidth is limited to 45 Gbps for the
master.

     Our solution for now is simply to exclude mlx4_1, the 10 Gbps card.
That gives us QDR performance, but it means we cannot use the extra 10 Gbps
for channel bonding on large messages.  More worrisome is that the maximum
bandwidth in the bi-directional and aggregate tests with the master is lower
than it should be.
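
     For anyone who wants to do the same, here is a sketch of the exclusion
(mlx4_1 is just what our 10 Gbps card shows up as; the process count,
hostfile, and binary names are placeholders):

    mpirun --mca btl_openib_if_exclude mlx4_1 -np 40 --hostfile ./hosts ./aggregate_test

The same parameter can also go in the MCA parameter file so that every job
picks it up by default:

    # $HOME/.openmpi/mca-params.conf
    btl_openib_if_exclude = mlx4_1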

                       Dave

On Wed, Feb 11, 2015 at 11:00 AM, <devel-requ...@open-mpi.org> wrote:

> Today's Topics:
>
>    1. Re: OMPI devel] RoCE plus QDR IB tunable parameters
>       (George Bosilca)
>    2. Re: OMPI devel] RoCE plus QDR IB tunable parameters
>       (Howard Pritchard)
>
>
> ----------------------------------------------------------------------
>
> Message: 1
> Date: Tue, 10 Feb 2015 20:41:30 -0500
> From: George Bosilca <bosi...@icl.utk.edu>
> To: drdavetur...@gmail.com, Open MPI Developers <de...@open-mpi.org>
> Subject: Re: [OMPI devel] OMPI devel] RoCE plus QDR IB tunable
>         parameters
> Message-ID:
>         <
> camjjpkxc6e_y34fu5vej0uhrrj2z4ca89mn7wfwa5dsfx52...@mail.gmail.com>
> Content-Type: text/plain; charset="utf-8"
>
> Somehow one of the most basic pieces of information about the capabilities
> of the BTLs (bandwidth) disappeared from the MCA parameters, and the one
> left (latency) was mislabeled. This mishap not only prevented the
> communication engine from correctly ordering the BTLs for small messages
> (the latency-bound part), but also introduced an undesirable bias in the
> load-balancing logic across multiple devices (the bandwidth part).
>
> I just pushed a fix in master:
> https://github.com/open-mpi/ompi/commit/e173f9b0c0c63c3ea24b8d8bc0ebafe1f1736acb
> Once validated this should be moved over to the 1.8 branch.
>
> Dave, do you think it is possible to rerun your experiment with the current
> master?
>
>   Thanks,
>     George.
>
>
>
> On Mon, Feb 9, 2015 at 2:57 PM, Dave Turner <drdavetur...@gmail.com>
> wrote:
>
> > Gilles,
> >
> >      I tried running with btl_openib_cpc_include rdmacm and saw no change.
> >
> >       Let's simplify the problem by forgetting about the channel bonding.
> > If I just do an aggregate test of 16 cores on one machine talking to 16
> > on a second machine without any settings changed from the default install
> > of OpenMPI, I see that RoCE over the 10 Gbps link is used for small
> > messages then it switches over to QDR IB for large messages.  I don't see
> > channel bonding for large messages, but can turn this on with the
> > btl_tcp_exclusivity parameter.
> >
> >      I think there are 2 problems here, both related to the fact that QDR
> > IB link and RoCE both use the same openib btl.  The first problem is that
> > the slower RoCE link is being chosen for small messages, which does lower
> > performance significantly.  The second problem is that I don't think there
> > are parameters to allow for tuning of multiple openib btl's to manually
> > select one over the other.
> >
> >                        Dave
> >
> > On Fri, Feb 6, 2015 at 8:24 PM, Gilles Gouaillardet <
> > gilles.gouaillar...@gmail.com> wrote:
> >
> >> Dave,
> >>
> >> These settings tell ompi to use native infiniband on the ib qdr port and
> >> tcp/ip on the other port.
> >>
> >> From the faq, roce is implemented in the openib btl
> >> http://www.open-mpi.org/faq/?category=openfabrics#ompi-over-roce
> >>
> >> Did you use
> >> --mca btl_openib_cpc_include rdmacm
> >> in your first tests ?
> >>
> >> I had some second thoughts about the bandwidth values, and imho they
> >> should be 327680 and 81920 because of the 8/10 encoding
> >> (And that being said, that should not change the measured performance)
> >>
> >> Also, could you try again by forcing the same btl_tcp_latency and
> >> btl_openib_latency ?
> >>
> >> Cheers,
> >>
> >> Gilles
> >>
> >> Dave Turner <drdavetur...@gmail.com> wrote:
> >> George,
> >>
> >>      I can check with my guys on Monday but I think the bandwidth
> >> parameters
> >> are the defaults.  I did alter these to 40960 and 10240 as someone else
> >> suggested to me.  The attached graph shows the base red line, along with
> >> the manual balanced blue line and auto balanced green line (0's for both).
> >> This shift lower suggests to me that the higher TCP latency is being
> >> pulled in.  I'm not sure why the curves are shifted right.
> >>
> >>                         Dave
> >>
> >> On Fri, Feb 6, 2015 at 5:32 PM, George Bosilca <bosi...@icl.utk.edu>
> >> wrote:
> >>
> >>> Dave,
> >>>
> >>> Based on your ompi_info.all the following bandwidth are reported on
> >>> your system:
> >>>
> >>>   MCA btl: parameter "btl_openib_bandwidth" (current value: "4",
> >>>   data source: default, level: 5 tuner/detail, type: unsigned)
> >>>     Approximate maximum bandwidth of interconnect (0 = auto-detect
> >>>     value at run-time [not supported in all BTL modules], >= 1 =
> >>>     bandwidth in Mbps)
> >>>
> >>>   MCA btl: parameter "btl_tcp_bandwidth" (current value: "100",
> >>>   data source: default, level: 5 tuner/detail, type: unsigned)
> >>>     Approximate maximum bandwidth of interconnect (0 = auto-detect
> >>>     value at run-time [not supported in all BTL modules], >= 1 =
> >>>     bandwidth in Mbps)
> >>>
> >>> This basically states that on your system the default values for these
> >>> parameters are wrong, your TCP network being much faster than the IB.
> >>> This explains the somewhat unexpected decision of OMPI.
> >>>
> >>> As a possible solution I suggest you set these bandwidth values to
> >>> something more meaningful (directly in your configuration file). As an
> >>> example,
> >>>
> >>> btl_openib_bandwidth = 40000
> >>> btl_tcp_bandwidth = 10000
> >>>
> >>> make more sense based on your HPC system description.
> >>>
> >>>   George.
> >>>
> >>>
> >>>
> >>>
> >>> On Fri, Feb 6, 2015 at 5:37 PM, Dave Turner <drdavetur...@gmail.com>
> >>> wrote:
> >>>
> >>>>
> >>>>      We have nodes in our HPC system that have 2 NIC's,
> >>>> one being QDR IB and the second being a slower 10 Gbps card
> >>>> configured for both RoCE and TCP.  Aggregate bandwidth
> >>>> tests with 20 cores on one node yelling at 20 cores on a second
> >>>> node (attached roce.ib.aggregate.pdf) show that without tuning
> >>>> the slower RoCE interface is being used for small messages
> >>>> then QDR IB is used for larger messages (red line).  Tuning
> >>>> the tcp_exclusivity to 1024 to match the openib_exclusivity
> >>>> adds another 20 Gbps of bidirectional bandwidth to the high end (green
> >>>> line),
> >>>> and I'm guessing this is TCP traffic and not RoCE.
> >>>>
> >>>>      So by default the slower interface is being chosen on the low end,
> >>>> and I don't think there are tunable parameters to allow me to choose
> >>>> the QDR interface as the default.  Going forward we'll probably just
> >>>> disable RoCE on these nodes and go with QDR IB plus 10 Gbps TCP for
> >>>> large messages.
> >>>>
> >>>>       However, I do think these issues will come up more in the future.
> >>>> With the low latency of RoCE matching IB, there are more opportunities
> >>>> to do channel bonding or allowing multiple interfaces for aggregate
> >>>> traffic
> >>>> for even smaller message sizes.
> >>>>
> >>>>                 Dave Turner
> >>>>
> >>>> --
> >>>> Work:     davetur...@ksu.edu     (785) 532-7791
> >>>>              118 Nichols Hall, Manhattan KS  66502
> >>>> Home:    drdavetur...@gmail.com
> >>>>               cell: (785) 770-5929
> >>>>
> >>>> _______________________________________________
> >>>> devel mailing list
> >>>> de...@open-mpi.org
> >>>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> >>>> Link to this post:
> >>>> http://www.open-mpi.org/community/lists/devel/2015/02/16951.php
> >>>>
> >>>
> >>>
> >>
> >>
> >> --
> >> Work:     davetur...@ksu.edu     (785) 532-7791
> >>              118 Nichols Hall, Manhattan KS  66502
> >> Home:    drdavetur...@gmail.com
> >>               cell: (785) 770-5929
> >>
> >
> >
> >
> > --
> > Work:     davetur...@ksu.edu     (785) 532-7791
> >              118 Nichols Hall, Manhattan KS  66502
> > Home:    drdavetur...@gmail.com
> >               cell: (785) 770-5929
> >
> > _______________________________________________
> > devel mailing list
> > de...@open-mpi.org
> > Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> > Link to this post:
> > http://www.open-mpi.org/community/lists/devel/2015/02/16963.php
> >
>
> ------------------------------
>
> Message: 2
> Date: Tue, 10 Feb 2015 20:34:59 -0700
> From: Howard Pritchard <hpprit...@gmail.com>
> To: Open MPI Developers <de...@open-mpi.org>
> Subject: Re: [OMPI devel] OMPI devel] RoCE plus QDR IB tunable
>         parameters
> Message-ID:
>         <CAF1Cqj5=GPfi=t8Jw6SSUBKjqut0ChgntTyXfU0diM=
> mxs+...@mail.gmail.com>
> Content-Type: text/plain; charset="utf-8"
>
> Hi George,
>
> I'd say commit cf377db82 explains the vanishing of the bandwidth metric as
> well as the mis-labeling of the latency metric.
>
> Howard
>
>



-- 
Work:     davetur...@ksu.edu     (785) 532-7791
             118 Nichols Hall, Manhattan KS  66502
Home:    drdavetur...@gmail.com
              cell: (785) 770-5929

Attachment: unidirectional.pdf
Description: Adobe PDF document

Attachment: bidirectional.pdf
Description: Adobe PDF document

Attachment: aggregate.pdf
Description: Adobe PDF document
