On 16:55 Mon 09 Feb, Nicolas Morey Chaisemartin wrote:
> This patch fixes a bug in index port incrementation in the fat-tree
> algorithm.
> Problem happens (at least) with a 4 level Fat tree as below:
>
>
>                     L3  L3
>    ___________________|__|____________________
>   /            /               \              \     <= All the L2 are connected on 2 L3 switches
> L2-1         L2-2             L2-1          L2-2
>   /            /               \              \     <== The Nth L1 of a set leads only to the Nth L2 (L2-N). With some pruning.
>  L1           L1               L1             L1
>  /|\          /|\              /|\            /|\
> ==Fully mixed to L1==       ==Fully mixed to L1==   <=== We have multiple sets. In each set, all L0 lead to all L1 of their set.
>
>   L0      L0                   L0      L0
>  /  \    /  \                 /  \    /  \
> CN CN .. CN CN     ....      CN CN .. CN CN
>
>
> To detail:
> We have a bunch of sets. Each set contains compute nodes, L0 and L1
> switches.
> Plus a common top of L2 and L3 switches.
>
> In each set, there are groups of compute nodes. Each group is connected to
> a single L0 switch.
> In a given set, all L0 are connected to all L1.
>
> The Nth L1 of a set is connected to the Nth L2 and only to this one (so
> through an L2, the Nth L1 can only see the Nth L1 of the other sets).
> All the L2 are connected to a couple of L3.
>
>
> If we don't put the L3, we have a perfectly balanced fat tree and
> well-balanced routes.
> But when we add the L3, it introduces a huge difference. As it is not
> necessary, no route is going through L3 (which is fine).
> However 1/4 of L2->L1 routes is not used at all, 1/2 is half used and 1/4
> is twice overused (compared to the balanced state).
>
> This comes from the down_port_groups_idx, which is incremented each time the
> algorithm goes down through a node, whether it creates routes to HCA (port
> != switch) or not.
> As a route coming up from an L1 reaches only one L2, the algorithm
> goes through all the other L2 while going down, incrementing their index.
> Our case here is a bit specific, but in a case where your L1 doesn't have
> full connectivity to all your L2, and there is another switch rank above,
> the problem may appear.
>
> To avoid this problem, __osm_ftree_fabric_route_upgoing_by_going_down
> function has been changed so it returns a value to indicate if routes to
> HCA (in fact to leaf switch) were created.
> With this information, we only increase the index when the algorithm has
> created routes to HCA.
> After applying this patch and measuring the link usage, we are perfectly
> balanced (L2<->L3 links are still not used but that is to be expected).
>
> Signed-off-by: Nicolas Morey-Chaisemartin <[email protected]>

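For readers following the thread, the effect described above can be reproduced with a small toy model. This is only a sketch in Python, not the OpenSM code: the function `assign_routes`, the `visits` pattern and the `advance_always` flag are all hypothetical names made up for illustration. The only thing borrowed from the patch description is the idea of a round-robin index (in the role of down_port_groups_idx) that is advanced either on every visit or only when a visit actually created routes.

```python
def assign_routes(visits, num_groups, advance_always):
    """Toy round-robin over the down port groups of a single switch.

    visits: iterable of bools. True means the visit creates routes towards
            a leaf switch, False means the visit creates nothing.
    advance_always: True mimics the behaviour described before the patch
            (index advanced on every visit), False mimics the behaviour
            after the patch (index advanced only when routes were created).
    Returns the number of routes placed on each down port group.
    """
    load = [0] * num_groups
    idx = 0  # plays the role of down_port_groups_idx (assumption: one per switch)
    for creates_routes in visits:
        if creates_routes:
            load[idx] += 1
        if advance_always or creates_routes:
            idx = (idx + 1) % num_groups
    return load


# Hypothetical visit pattern: each productive visit is followed by one
# visit that creates no routes.
visits = [True, False] * 4

before = assign_routes(visits, num_groups=2, advance_always=True)
after = assign_routes(visits, num_groups=2, advance_always=False)
print("before patch:", before)
print("after patch: ", after)
```

With this made-up pattern the "before" run puts all four routes on the first port group and none on the second, while the "after" run splits them two and two. The real imbalance (1/4 unused, 1/2 half used, 1/4 overused) depends on the actual topology and traversal order, which this sketch does not attempt to model.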
Applied. Thanks.

Sasha
