Hi Nicolas,
Nicolas Morey Chaisemartin wrote:
Yevgeny Kliteynik wrote:
Nicolas Morey Chaisemartin wrote:
Hi everyone,
We have been working quite a lot at Bull lately on the Ftree algorithm
and we have made some upgrades.
However, as they modify the behavior of the ftree algorithm, we
haven't pushed them until now.
I'm just going to detail which upgrades we have done and let you
decide if you are interested, if and how they should be pushed
upstream (new routing algorithm, option in the ftree, etc.)
Here is a simplify model of the topology we have been working on
L3 L3
___________________|__|____________________
/ / \ \ <=
All the L2 are connected on 2 L3 switches
L2-1 L2-2 L2-1 L2-2 <=
There are service nodes connected directly on L2 switches
/ \S1 / \S2 S3/ \ S4/ \ <==
The Nth L1 of a set leads only to the Nth L2 (L2-N). With some pruning.
L1 L1 L1 L1
/|\ /|\ /|\ /|\
==Fully mixed to L1== ==Fully mixed to L1== <=== We have
multiple set. In each set, all L0 lead to all L1 of their set.
L0 L0 L0 L0
/ \ / \ / \ / \
CN CN .. CN CN .... CN CN .. CN CN
To detail:
We have a bunch of sets. Each set contains compute node, L0 and L1
switches.
Plus a common top of L2 and L3 switches.
In each set, there are groups of compute nodes. Each group is
connected to a single L0 switch.
In a given set, all L0 are connected to all L1.
The Nth L1 of a set is connected to the Nth L2 and only to this one.
(so through a L2, the Nth L1 can only see the Nth L1 of the other sets)
There are Services nodes connected to the L2 switches.
All the L2 are connected to a couple of L3.
The problem we have seen when routing on this topology is that most of
the routes from CN to SN (service nodes) go through the L3 switches.
With the current algorithm, the less loaded link is choosed to go down
by going up. Therefore, the primary path goes through a L2, then a L3
from where it covers all the network.
Not sure I understand why.
There are two stages of routing:
1. Route the downward paths, which is done by climbing up the tree.
2. Route the upward paths, which is done by descending.
Before each climbing, the routing first configures all the upward
paths: the first acll in the beginning of the
__osm_ftree_fabric_route_downgoing_by_going_up() function is
__osm_ftree_fabric_route_upgoing_by_going_down().
OK
So in your example, if some Service Node (SN) is connected to L2-i
switch, the routing should first configure all the i'th switches
in the L1 sets, and then all the L0 switches in all the sets because
L1 is fully mixed to L0 in each set.
Where am I wrong?
You're right. Routes towards the service nodes are OK.
Problem is the routes from the service node.
Let's take a CN (called C0) attached to a L0.
Algorithm go one step up (downward route) to L1-1 (could be any L1 in its set),
covers all the upward routes so we have all the routes CN->C0 where the CN
belong to the same set.
Next step: Algorithm doesn't have a choice here, it goes to L2-1, covers the upward
routes so we have all the CN->C0 routes and S1->C0
However we don't have any routes from S[234]->C0 yet.
Next and final step, algorithm go to either one of the L3, covers upwards routes
and create routes from S[234]->C0. These routes have more hops than the minimum
possible.
OK, I get it now.
So there's is a general issue with the existing algorithm:
in topologies with redundant switch level(s), such as L3 in
your example, fat-tree might create switch-to-CN routes (or
in your example, SN-to-CN when SN is on a higher tree level
than CN) that are longer than min-hop.
Currently, when the algorithm goes up, it checks whether or
not the upper switch already has this LID in the LFT, and
if it has - the route is abandoned.
What you're suggesting is to check the path length, and if
the new path is shorter, then it should replace the old one.
Sounds like a great idea.
This wasn't acceptable for us as L3 switches would be overloaded when
there were less loaded/shorter paths to achieve the same HCA.
So what we have introduced here is a "balanced min_hop" within the
ftree algorithm.
Basically, instead of just leaving when we reach a LFT which has
already been configured for the target lid, we check the hops count of
the switch toward this lid, and the hop count on the path we came
through. If we have found a shorter path, we update the LFT and minhop
tables to use this new path.
And what do you do after that? Do you continue climbing
the tree and checking the route length? I guess you need
to continue climbing till all the upper switches will
have LFT entries with route length equal to min-hop.
I do have some concerns about the CN-to-CN routes in this
case. Any chance these routes might be affected?
My intuition says that they won't, because they are required
to reside at the same tree level, and since there is full
connectivity between them, they will be all configured with
the shortest path. But I'm not done convincing myself...
-- Yevgeny
I don't think that fat-tree routing was supposed to configure a route
with more hops that min-hop would do it. Either I don't understand the
example, or there is a bug in the routing.
If it isn't suppose to, then it's a bug :)
However this bug only happens with nodes not at the same level as compute nodes.
-- Yevgeny
_______________________________________________
general mailing list
[email protected]
http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general
To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general