Hi George,

I'd say commit cf377db82 explains the vanishing of the bandwidth metric as
well as the mis-labeling of the latency metric.
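
For anyone who wants to double-check a build after that fix, a quick way to
confirm both metrics are registered again (the grep pattern here is just an
illustration) is:

ompi_info --all | grep -E "btl_(openib|tcp)_(bandwidth|latency)"

Both the bandwidth and latency parameters should show up for each of the BTLs.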

Howard


2015-02-10 18:41 GMT-07:00 George Bosilca <bosi...@icl.utk.edu>:

> Somehow one of the most basic pieces of information about the capabilities
> of the BTLs (bandwidth) disappeared from the MCA parameters, and the one
> left (latency) was mislabeled. This mishap not only prevented the
> communication engine from correctly ordering the BTLs for small messages
> (the latency-bound part), but also introduced an undesirable bias into the
> logic that load-balances traffic across multiple devices (the bandwidth
> part).
>
> I just pushed a fix to master:
> https://github.com/open-mpi/ompi/commit/e173f9b0c0c63c3ea24b8d8bc0ebafe1f1736acb
> Once validated, this should be moved over to the 1.8 branch.
>
> Dave, do you think it would be possible to rerun your experiment with the
> current master?
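>
> In case it helps, a minimal sketch of grabbing and building master for the
> retest (the install prefix below is just a placeholder):
>
> git clone https://github.com/open-mpi/ompi.git && cd ompi
> ./autogen.pl && ./configure --prefix=$HOME/ompi-master && make install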
>
>   Thanks,
>     George.
>
>
>
> On Mon, Feb 9, 2015 at 2:57 PM, Dave Turner <drdavetur...@gmail.com>
> wrote:
>
>> Gilles,
>>
>>      I tried running with btl_openib_cpc_include rdmacm and saw no
>> change.
>>
>>       Let's simplify the problem by forgetting about the channel bonding.
>>
>> If I just do an aggregate test of 16 cores on one machine talking to 16 on
>> a second machine without any settings changed from the default install of
>> OpenMPI, I see that RoCE over the 10 Gbps link is used for small messages,
>> then it switches over to QDR IB for large messages.  I don't see channel
>> bonding for large messages, but can turn this on with the
>> btl_tcp_exclusivity parameter.
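>>
>> (For reference, the sort of run line I mean, with the process count and
>> binary name being just placeholders, is something like
>>
>> mpirun -np 32 --mca btl_tcp_exclusivity 1024 ./aggregate_test
>>
>> which raises the TCP BTL's exclusivity to match the openib default of 1024
>> so the two are considered together.)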
>>
>>      I think there are 2 problems here, both related to the fact that the
>> QDR IB link and RoCE both use the same openib BTL.  The first problem is
>> that the slower RoCE link is being chosen for small messages, which lowers
>> performance significantly.  The second problem is that I don't think there
>> are parameters that allow tuning of multiple openib BTLs so that one can
>> be manually selected over the other.
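>>
>> (One partial workaround, sketched here with a placeholder device name, is
>> to restrict which ports the openib BTL will use at all, e.g.
>>
>> mpirun --mca btl_openib_if_include mlx4_0:1 ...
>>
>> but that drops the RoCE port entirely rather than de-prioritizing it, so
>> it isn't really tuning.)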
>>
>>                        Dave
>>
>> On Fri, Feb 6, 2015 at 8:24 PM, Gilles Gouaillardet <
>> gilles.gouaillar...@gmail.com> wrote:
>>
>>> Dave,
>>>
>>> These settings tell OMPI to use native InfiniBand on the QDR IB port and
>>> TCP/IP on the other port.
>>>
>>> From the FAQ, RoCE is implemented in the openib BTL:
>>> http://www.open-mpi.org/faq/?category=openfabrics#ompi-over-roce
>>>
>>> Did you use
>>> --mca btl_openib_cpc_include rdmacm
>>> in your first tests?
>>>
>>> I had some second thoughts about the bandwidth values, and IMHO they
>>> should be 327680 and 81920 because of the 8b/10b encoding.
>>> (That being said, this should not change the measured performance.)
>>>
>>> Also, could you try again, forcing the same value for btl_tcp_latency and
>>> btl_openib_latency?
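>>>
>>> Something along these lines should do; the value itself is arbitrary as
>>> long as it is the same for both:
>>>
>>> mpirun --mca btl_tcp_latency 4 --mca btl_openib_latency 4 ...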
>>>
>>> Cheers,
>>>
>>> Gilles
>>>
>>> Dave Turner <drdavetur...@gmail.com> wrote:
>>> George,
>>>
>>>      I can check with my guys on Monday, but I think the bandwidth
>>> parameters are the defaults.  I did alter these to 40960 and 10240, as
>>> someone else suggested to me.  The attached graph shows the base red line,
>>> along with the manually balanced blue line and the auto-balanced green
>>> line (0's for both).  The shift lower suggests to me that the higher TCP
>>> latency is being pulled in.  I'm not sure why the curves are shifted right.
>>>
>>>                         Dave
>>>
>>> On Fri, Feb 6, 2015 at 5:32 PM, George Bosilca <bosi...@icl.utk.edu>
>>> wrote:
>>>
>>>> Dave,
>>>>
>>>> Based on your ompi_info.all the following bandwidth are reported on
>>>> your system:
>>>>
>>>>   MCA btl: parameter "btl_openib_bandwidth" (current value: "4",
>>>>            data source: default, level: 5 tuner/detail, type: unsigned)
>>>>            Approximate maximum bandwidth of interconnect (0 = auto-detect
>>>>            value at run-time [not supported in all BTL modules],
>>>>            >= 1 = bandwidth in Mbps)
>>>>
>>>>   MCA btl: parameter "btl_tcp_bandwidth" (current value: "100",
>>>>            data source: default, level: 5 tuner/detail, type: unsigned)
>>>>            Approximate maximum bandwidth of interconnect (0 = auto-detect
>>>>            value at run-time [not supported in all BTL modules],
>>>>            >= 1 = bandwidth in Mbps)
>>>>
>>>> This basically states that on your system the default values for these
>>>> parameters are wrong: they make your TCP network look much faster than
>>>> the IB one. This explains the somewhat unexpected decision by OMPI.
>>>>
>>>> As a possible solution I suggest you set these bandwidth values to
>>>> something more meaningful (directly in your configuration file). As an
>>>> example,
>>>>
>>>> btl_openib_bandwidth = 40000
>>>> btl_tcp_bandwidth = 10000
>>>>
>>>> would make more sense based on your HPC system description.
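>>>>
>>>> A hypothetical command-line equivalent, if you want to test the values
>>>> before putting them in the configuration file, would be something like
>>>>
>>>> mpirun --mca btl_openib_bandwidth 40000 --mca btl_tcp_bandwidth 10000 ...
>>>>
>>>> with "..." standing in for your usual options.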
>>>>
>>>>   George.
>>>>
>>>>
>>>>
>>>>
>>>> On Fri, Feb 6, 2015 at 5:37 PM, Dave Turner <drdavetur...@gmail.com>
>>>> wrote:
>>>>
>>>>>
>>>>>      We have nodes in our HPC system that have 2 NICs, one being QDR IB
>>>>> and the second being a slower 10 Gbps card configured for both RoCE and
>>>>> TCP.  Aggregate bandwidth tests with 20 cores on one node yelling at 20
>>>>> cores on a second node (attached roce.ib.aggregate.pdf) show that,
>>>>> without tuning, the slower RoCE interface is used for small messages and
>>>>> then QDR IB is used for larger messages (red line).  Tuning the
>>>>> tcp_exclusivity to 1024 to match the openib_exclusivity adds another
>>>>> 20 Gbps of bidirectional bandwidth at the high end (green line), and I'm
>>>>> guessing this is TCP traffic and not RoCE.
>>>>>
>>>>>      So by default the slower interface is being chosen on the low end,
>>>>> and I don't think there are tunable parameters that allow me to choose
>>>>> the QDR interface as the default.  Going forward, we'll probably just
>>>>> disable RoCE on these nodes and go with QDR IB plus 10 Gbps TCP for
>>>>> large messages.
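>>>>>
>>>>> A sketch of the kind of run line that setup implies, with the interface
>>>>> name being only a placeholder, might be
>>>>>
>>>>> mpirun --mca btl openib,sm,self,tcp --mca btl_tcp_if_include eth2 ...
>>>>>
>>>>> so the TCP BTL is pinned to the 10 Gbps port while openib handles the
>>>>> QDR IB link.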
>>>>>
>>>>>      However, I do think these issues will come up more in the future.
>>>>> With the low latency of RoCE matching IB, there are more opportunities
>>>>> to do channel bonding, or to allow multiple interfaces to carry
>>>>> aggregate traffic at even smaller message sizes.
>>>>>
>>>>>                 Dave Turner
>>>>>
>>>>> --
>>>>> Work:     davetur...@ksu.edu     (785) 532-7791
>>>>>              118 Nichols Hall, Manhattan KS  66502
>>>>> Home:    drdavetur...@gmail.com
>>>>>               cell: (785) 770-5929
>>>>>
>>>>>
>>>>
>>>>
>>>
>>>
>>> --
>>> Work:     davetur...@ksu.edu     (785) 532-7791
>>>              118 Nichols Hall, Manhattan KS  66502
>>> Home:    drdavetur...@gmail.com
>>>               cell: (785) 770-5929
>>>
>>
>>
>>
>> --
>> Work:     davetur...@ksu.edu     (785) 532-7791
>>              118 Nichols Hall, Manhattan KS  66502
>> Home:    drdavetur...@gmail.com
>>               cell: (785) 770-5929
>>
>>
>
>
>
