Gilles,

I tried running with btl_openib_cpc_include rdmacm and saw no change.
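For reference, the invocation was roughly of this form (the rank count,
hostfile name, and benchmark binary are placeholders rather than the exact
ones I used):

    mpirun -np 32 --hostfile two_nodes \
           --mca btl_openib_cpc_include rdmacm \
           ./aggregate_bandwidth_test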
Let's simplify the problem by forgetting about the channel bonding. If I
just do an aggregate test of 16 cores on one machine talking to 16 on a
second machine, without any settings changed from the default install of
OpenMPI, I see that RoCE over the 10 Gbps link is used for small messages
and then it switches over to QDR IB for large messages. I don't see channel
bonding for large messages, but I can turn this on with the
btl_tcp_exclusivity parameter.

I think there are two problems here, both related to the fact that the QDR
IB link and RoCE both use the same openib BTL. The first problem is that the
slower RoCE link is being chosen for small messages, which does lower
performance significantly. The second problem is that I don't think there
are parameters that allow tuning of multiple openib BTLs to manually select
one over the other.

                 Dave

On Fri, Feb 6, 2015 at 8:24 PM, Gilles Gouaillardet <gilles.gouaillar...@gmail.com> wrote:

> Dave,
>
> These settings tell OMPI to use native InfiniBand on the QDR IB port and
> TCP/IP on the other port.
>
> From the FAQ, RoCE is implemented in the openib BTL:
> http://www.open-mpi.org/faq/?category=openfabrics#ompi-over-roce
>
> Did you use
>     --mca btl_openib_cpc_include rdmacm
> in your first tests?
>
> I had some second thoughts about the bandwidth values, and IMHO they
> should be 327680 and 81920 because of the 8b/10b encoding (that being
> said, it should not change the measured performance).
>
> Also, could you try again by forcing the same btl_tcp_latency and
> btl_openib_latency?
>
> Cheers,
>
> Gilles
>
> Dave Turner <drdavetur...@gmail.com> wrote:
>
> George,
>
> I can check with my guys on Monday, but I think the bandwidth parameters
> are the defaults. I did alter these to 40960 and 10240 as someone else
> suggested to me. The attached graph shows the base red line, along with
> the manually balanced blue line and the auto-balanced green line (0's for
> both). This shift lower suggests to me that the higher TCP latency is
> being pulled in. I'm not sure why the curves are shifted right.
>
> Dave
>
> On Fri, Feb 6, 2015 at 5:32 PM, George Bosilca <bosi...@icl.utk.edu> wrote:
>
>> Dave,
>>
>> Based on your ompi_info.all, the following bandwidths are reported on
>> your system:
>>
>>     MCA btl: parameter "btl_openib_bandwidth" (current value: "4",
>>     data source: default, level: 5 tuner/detail, type: unsigned)
>>     Approximate maximum bandwidth of interconnect (0 = auto-detect
>>     value at run-time [not supported in all BTL modules], >= 1 =
>>     bandwidth in Mbps)
>>
>>     MCA btl: parameter "btl_tcp_bandwidth" (current value: "100",
>>     data source: default, level: 5 tuner/detail, type: unsigned)
>>     Approximate maximum bandwidth of interconnect (0 = auto-detect
>>     value at run-time [not supported in all BTL modules], >= 1 =
>>     bandwidth in Mbps)
>>
>> This basically states that on your system the default values for these
>> parameters are wrong, your TCP network being reported as much faster
>> than the IB. This explains the somewhat unexpected decision of OMPI.
>>
>> As a possible solution, I suggest you set these bandwidth values to
>> something more meaningful (directly in your configuration file). As an
>> example,
>>
>>     btl_openib_bandwidth = 40000
>>     btl_tcp_bandwidth = 10000
>>
>> make more sense based on your HPC system description.
>>
>> George.
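>>
>> As a quick sketch, assuming a default install layout, those two lines can
>> go in the system-wide $prefix/etc/openmpi-mca-params.conf or a per-user
>> ~/.openmpi/mca-params.conf so that every run picks them up without extra
>> command-line flags:
>>
>>     # values as suggested above; bandwidth is in Mbps
>>     btl_openib_bandwidth = 40000
>>     btl_tcp_bandwidth = 10000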
>>
>> On Fri, Feb 6, 2015 at 5:37 PM, Dave Turner <drdavetur...@gmail.com> wrote:
>>
>>> We have nodes in our HPC system that have 2 NICs, one being QDR IB and
>>> the second being a slower 10 Gbps card configured for both RoCE and TCP.
>>> Aggregate bandwidth tests with 20 cores on one node yelling at 20 cores
>>> on a second node (attached roce.ib.aggregate.pdf) show that without
>>> tuning, the slower RoCE interface is used for small messages and QDR IB
>>> is used for larger messages (red line). Tuning the tcp_exclusivity to
>>> 1024 to match the openib_exclusivity adds another 20 Gbps of
>>> bidirectional bandwidth at the high end (green line), and I'm guessing
>>> this is TCP traffic and not RoCE.
>>>
>>> So by default the slower interface is being chosen on the low end, and
>>> I don't think there are tunable parameters to allow me to choose the
>>> QDR interface as the default. Going forward we'll probably just disable
>>> RoCE on these nodes and go with QDR IB plus 10 Gbps TCP for large
>>> messages.
>>>
>>> However, I do think these issues will come up more in the future. With
>>> the low latency of RoCE matching IB, there are more opportunities to do
>>> channel bonding or to allow multiple interfaces to carry aggregate
>>> traffic at even smaller message sizes.
>>>
>>> Dave Turner

--
Work: davetur...@ksu.edu   (785) 532-7791
      118 Nichols Hall, Manhattan KS 66502
Home: drdavetur...@gmail.com
      cell: (785) 770-5929
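P.S. For anyone who wants to reproduce the exclusivity change described in
the quoted post, it is just a matter of raising the TCP BTL exclusivity to
match the openib value of 1024, roughly like this (the rank count, hostfile
name, and benchmark binary are again placeholders):

    mpirun -np 40 --hostfile two_nodes \
           --mca btl_tcp_exclusivity 1024 \
           ./aggregate_bandwidth_test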