Hi George,

I'd say commit cf377db82 explains both the vanishing of the bandwidth metric
and the mislabeling of the latency metric.

Howard
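As an aside, for anyone who wants to double-check what their own build
exposes here: since these parameters sit at level 5 ("tuner/detail", per the
ompi_info output quoted further down), a plain ompi_info hides them on recent
releases. Something along these lines should list them (option spelling
assumes an ompi_info new enough to understand --level, i.e. 1.7 or later):

    ompi_info --param btl openib --level 9 | grep -i -e bandwidth -e latency
    ompi_info --param btl tcp    --level 9 | grep -i -e bandwidth -e latency

On older builds, ompi_info --all piped through the same grep gets you the
same information.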
2015-02-10 18:41 GMT-07:00 George Bosilca <bosi...@icl.utk.edu>:

> Somehow one of the most basic pieces of information about the capabilities
> of the BTLs (bandwidth) disappeared from the MCA parameters, and the one
> left (latency) was mislabeled. This mishap not only prevented the
> communication engine from correctly ordering the BTLs for small messages
> (the latency-bound part), but also introduced an undesirable bias in the
> load-balancing logic across multiple devices (the bandwidth part).
>
> I just pushed a fix to master:
> https://github.com/open-mpi/ompi/commit/e173f9b0c0c63c3ea24b8d8bc0ebafe1f1736acb
> Once validated, this should be moved over to the 1.8 branch.
>
> Dave, do you think it is possible to renew your experiment with the
> current master?
>
> Thanks,
> George.
>
> On Mon, Feb 9, 2015 at 2:57 PM, Dave Turner <drdavetur...@gmail.com> wrote:
>
>> Gilles,
>>
>> I tried running with btl_openib_cpc_include rdmacm and saw no change.
>>
>> Let's simplify the problem by forgetting about the channel bonding.
>> If I just do an aggregate test of 16 cores on one machine talking to 16
>> cores on a second machine, without any settings changed from the default
>> install of Open MPI, I see that RoCE over the 10 Gbps link is used for
>> small messages, then it switches over to QDR IB for large messages. I
>> don't see channel bonding for large messages, but I can turn this on with
>> the btl_tcp_exclusivity parameter.
>>
>> I think there are two problems here, both related to the fact that the
>> QDR IB link and RoCE both use the same openib btl. The first is that the
>> slower RoCE link is being chosen for small messages, which lowers
>> performance significantly. The second is that I don't think there are
>> parameters that allow tuning of multiple openib btl's to manually select
>> one over the other.
>>
>> Dave
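For what it's worth, both of the knobs mentioned above can be combined on a
single command line; a rough sketch, where the hostfile, process count, and
benchmark binary are placeholders:

    mpirun -np 32 --hostfile hosts.txt \
        --mca btl_openib_cpc_include rdmacm \
        --mca btl_tcp_exclusivity 1024 \
        ./aggregate_bandwidth_test

Whether that actually changes which port carries the small messages is
exactly what this thread is trying to sort out.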
>> On Fri, Feb 6, 2015 at 8:24 PM, Gilles Gouaillardet
>> <gilles.gouaillar...@gmail.com> wrote:
>>
>>> Dave,
>>>
>>> These settings tell ompi to use native InfiniBand on the QDR IB port and
>>> TCP/IP on the other port.
>>>
>>> From the FAQ, RoCE is implemented in the openib btl:
>>> http://www.open-mpi.org/faq/?category=openfabrics#ompi-over-roce
>>>
>>> Did you use
>>> --mca btl_openib_cpc_include rdmacm
>>> in your first tests?
>>>
>>> I had some second thoughts about the bandwidth values, and imho they
>>> should be 327680 and 81920 because of the 8/10 encoding
>>> (and, that being said, that should not change the measured performance).
>>>
>>> Also, could you try again forcing the same btl_tcp_latency and
>>> btl_openib_latency?
>>>
>>> Cheers,
>>>
>>> Gilles
>>>
>>> Dave Turner <drdavetur...@gmail.com> wrote:
>>>
>>> George,
>>>
>>> I can check with my guys on Monday, but I think the bandwidth parameters
>>> are the defaults. I did alter these to 40960 and 10240, as someone else
>>> suggested to me. The attached graph shows the base red line, along with
>>> the manually balanced blue line and the auto-balanced green line (0's for
>>> both). This downward shift suggests to me that the higher TCP latency is
>>> being pulled in. I'm not sure why the curves are shifted right.
>>>
>>> Dave
>>>
>>> On Fri, Feb 6, 2015 at 5:32 PM, George Bosilca <bosi...@icl.utk.edu> wrote:
>>>
>>>> Dave,
>>>>
>>>> Based on your ompi_info.all, the following bandwidths are reported on
>>>> your system:
>>>>
>>>> MCA btl: parameter "btl_openib_bandwidth" (current value: "4", data
>>>> source: default, level: 5 tuner/detail, type: unsigned)
>>>> Approximate maximum bandwidth of interconnect (0 = auto-detect value at
>>>> run-time [not supported in all BTL modules], >= 1 = bandwidth in Mbps)
>>>>
>>>> MCA btl: parameter "btl_tcp_bandwidth" (current value: "100", data
>>>> source: default, level: 5 tuner/detail, type: unsigned)
>>>> Approximate maximum bandwidth of interconnect (0 = auto-detect value at
>>>> run-time [not supported in all BTL modules], >= 1 = bandwidth in Mbps)
>>>>
>>>> This basically says that on your system the default values for these
>>>> parameters are wrong: according to them, your TCP network is much faster
>>>> than the IB one, which explains the somewhat unexpected decision made by
>>>> OMPI.
>>>>
>>>> As a possible solution, I suggest you set these bandwidth values to
>>>> something more meaningful (directly in your configuration file). As an
>>>> example,
>>>>
>>>> btl_openib_bandwidth = 40000
>>>> btl_tcp_bandwidth = 10000
>>>>
>>>> would make more sense given your description of the HPC system.
>>>>
>>>> George.
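For completeness, one way to make such overrides stick is the per-user MCA
parameter file; a minimal sketch, assuming the default location
$HOME/.openmpi/mca-params.conf and using the values George suggests above
(adjust them to the real link speeds):

    # $HOME/.openmpi/mca-params.conf -- per-user MCA parameter defaults
    # Rough link speeds in Mbps, so the BTL ordering and load-balancing
    # logic prefers the QDR IB port over the 10 Gbps RoCE/TCP port.
    btl_openib_bandwidth = 40000
    btl_tcp_bandwidth = 10000

The same values can of course be passed per run with
--mca btl_openib_bandwidth 40000 --mca btl_tcp_bandwidth 10000.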
>>>> On Fri, Feb 6, 2015 at 5:37 PM, Dave Turner <drdavetur...@gmail.com> wrote:
>>>>
>>>>> We have nodes in our HPC system that have two NICs, one being QDR IB
>>>>> and the second being a slower 10 Gbps card configured for both RoCE
>>>>> and TCP. Aggregate bandwidth tests with 20 cores on one node yelling
>>>>> at 20 cores on a second node (attached roce.ib.aggregate.pdf) show
>>>>> that, without tuning, the slower RoCE interface is used for small
>>>>> messages and QDR IB is used for larger messages (red line). Tuning the
>>>>> tcp_exclusivity to 1024 to match the openib_exclusivity adds another
>>>>> 20 Gbps of bidirectional bandwidth at the high end (green line), and
>>>>> I'm guessing this is TCP traffic and not RoCE.
>>>>>
>>>>> So by default the slower interface is being chosen at the low end, and
>>>>> I don't think there are tunable parameters that would let me choose
>>>>> the QDR interface as the default. Going forward we'll probably just
>>>>> disable RoCE on these nodes and go with QDR IB plus 10 Gbps TCP for
>>>>> large messages.
>>>>>
>>>>> However, I do think these issues will come up more often in the
>>>>> future. With the low latency of RoCE matching IB, there are more
>>>>> opportunities to do channel bonding, or to let multiple interfaces
>>>>> carry aggregate traffic for even smaller message sizes.
>>>>>
>>>>> Dave Turner
>>>>>
>>>>> --
>>>>> Work: davetur...@ksu.edu (785) 532-7791
>>>>> 118 Nichols Hall, Manhattan KS 66502
>>>>> Home: drdavetur...@gmail.com
>>>>> cell: (785) 770-5929