This patch fixes a bug in index port incrementation in the fat-tree algorithm.
Problem happens (at least) with a 4 level Fat tree as below:
L3 L3
___________________|__|____________________
/ / \ \ <= All the L2
are connected on 2 L3 switches
L2-1 L2-2 L2-1 L2-2
/ / \ \ <== The Nth
L1 of a set leads only to the Nth L2 (L2-N). With some pruning.
L1 L1 L1 L1
/|\ /|\ /|\ /|\
==Fully mixed to L1== ==Fully mixed to L1== <=== We have
multiple set. In each set, all L0 lead to all L1 of their set.
L0 L0 L0 L0
/ \ / \ / \ / \
CN CN .. CN CN .... CN CN .. CN CN
To detail:
We have a bunch of sets. Each set contains compute node, L0 and L1 switches.
Plus a common top of L2 and L3 switches.
In each set, there are groups of compute nodes. Each group is connected to a
single L0 switch.
In a given set, all L0 are connected to all L1.
The Nth L1 of a set is connected to the Nth L2 and only to this one. (so
through a L2, the Nth L1 can only see the Nth L1 of the other sets)
All the L2 are connected to a couple of L3.
If we dont put the L3. We have a perfectly balanced fat tree and well
equilibrated routes.
But when we add the L3, it introduce a huge difference. As it is not necessary,
no route is going through L3 (which is fine).
However 1/4 of L2->L1 routes is not used at all, 1/2 is half used and 1/4 is
twice overused (compared to the balanced state).
This comes from the down_port_groups_idx which is incremented each time the
algorithm goes down through a node whether it creates routes to HCA (port !=
switch)
or not. As route coming up from a L1 reaches only one L2, the algorithm goes
through all the other L2 while going down, incrementing their index.
Our case here is a bit specific but in a case where your L1 doesn't have full
connectivity to all your L2, and another switch rank above, the problem may
appear.
To avoid this problem, __osm_ftree_fabric_route_upgoing_by_going_down function
has been changed so it returns a value to indicate if routes to HCA (in fact to
leaf switch) were created.
With this information, we only increase the index when the algorithm has
created routes to HCA.
After applying this patch and measuring the link usage, we are perfectly balanced
(L2<->L3 links are still not used but that is to be expected).
Signed-off-by: Nicolas Morey-Chaisemartin
<[email protected]>
---
opensm/opensm/osm_ucast_ftree.c | 39 +++++++++++++++++++++++----------------
1 files changed, 23 insertions(+), 16 deletions(-)
Repost of the patch with Yevgeni's comment and a more complete description :)
Hope it's good this time.
diff --git a/opensm/opensm/osm_ucast_ftree.c b/opensm/opensm/osm_ucast_ftree.c
index 68900d8..3ea61a1 100644
--- a/opensm/opensm/osm_ucast_ftree.c
+++ b/opensm/opensm/osm_ucast_ftree.c
@@ -1914,7 +1914,7 @@ static void __osm_ftree_set_sw_fwd_table(IN cl_map_item_t * const p_map_item,
* assign-up-going-port-by-descending-down to r-port node (recursion)
*/
-static void
+static boolean_t
__osm_ftree_fabric_route_upgoing_by_going_down(IN ftree_fabric_t * p_ftree,
IN ftree_sw_t * p_sw,
IN ftree_sw_t * p_prev_sw,
@@ -1932,18 +1932,14 @@ __osm_ftree_fabric_route_upgoing_by_going_down(IN ftree_fabric_t * p_ftree,
uint16_t i;
uint16_t j;
uint16_t k;
+ boolean_t created_route = FALSE;
/* we shouldn't enter here if both real_lid and main_path are false */
CL_ASSERT(is_real_lid || is_main_path);
/* if there is no down-going ports */
if (p_sw->down_port_groups_num == 0)
- return;
-
- /* promote the index that indicates which group should we
- start with when going through all the downgoing groups */
- p_sw->down_port_groups_idx =
- (p_sw->down_port_groups_idx + 1) % p_sw->down_port_groups_num;
+ return FALSE;;
/* foreach down-going port group (in indexing order) */
i = p_sw->down_port_groups_idx;
@@ -1952,9 +1948,12 @@ __osm_ftree_fabric_route_upgoing_by_going_down(IN ftree_fabric_t * p_ftree,
p_group = p_sw->down_port_groups[i];
i = (i + 1) % p_sw->down_port_groups_num;
- /* Skip this port group unless it points to a switch */
- if (p_group->remote_node_type != IB_NODE_TYPE_SWITCH)
+ /* If this port group doesn't point to a switch, mark
+ that the route was created and skip to the next group */
+ if (p_group->remote_node_type != IB_NODE_TYPE_SWITCH){
+ created_route = TRUE;
continue;
+ }
if (p_prev_sw
&& (p_group->remote_base_lid == p_prev_sw->base_lid)) {
@@ -2073,16 +2072,24 @@ __osm_ftree_fabric_route_upgoing_by_going_down(IN ftree_fabric_t * p_ftree,
/* Recursion step:
Assign upgoing ports by stepping down, starting on REMOTE switch */
- __osm_ftree_fabric_route_upgoing_by_going_down(p_ftree, p_remote_sw, /* remote switch - used as a route-upgoing alg. start point */
- NULL, /* prev. position - NULL to mark that we went down and not up */
- target_lid, /* LID that we're routing to */
- target_rank, /* rank of the LID that we're routing to */
- is_real_lid, /* whether the target LID is real or dummy */
- is_main_path, /* whether this is path to HCA that should by tracked by counters */
- highest_rank_in_route); /* highest visited point in the tree before going down */
+ created_route |= __osm_ftree_fabric_route_upgoing_by_going_down(p_ftree, p_remote_sw, /* remote switch - used as a route-upgoing alg. start point */
+ NULL, /* prev. position - NULL to mark that we went down and not up */
+ target_lid, /* LID that we're routing to */
+ target_rank, /* rank of the LID that we're routing to */
+ is_real_lid, /* whether the target LID is real or dummy */
+ is_main_path, /* whether this is path to HCA that should by tracked by counters */
+ highest_rank_in_route); /* highest visited point in the tree before going down */
}
/* done scanning all the down-going port groups */
+ /* if the route was created, promote the index that
+ indicates which group should we start with when
+ going through all the downgoing groups */
+ if (created_route)
+ p_sw->down_port_groups_idx =
+ (p_sw->down_port_groups_idx + 1) % p_sw->down_port_groups_num;
+
+ return created_route;
} /* __osm_ftree_fabric_route_upgoing_by_going_down() */
/***************************************************/
_______________________________________________
general mailing list
[email protected]
http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general
To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general