Yevgeny Kliteynik wrote: > Nicolas Morey Chaisemartin wrote: >> Hi everyone, >> >> We have been working quite a lot at Bull lately on the Ftree algorithm >> and we have made some upgrades. >> However, as they modify the behavior of the ftree algorithm, we >> haven't pushed them until now. >> I'm just going to detail which upgrades we have done and let you >> decide if you are interested, if and how they should be pushed >> upstream (new routing algorithm, option in the ftree, etc.) >> >> Here is a simplify model of the topology we have been working on >> >> L3 L3 >> ___________________|__|____________________ >> / / \ \ <= >> All the L2 are connected on 2 L3 switches >> L2-1 L2-2 L2-1 L2-2 <= >> There are service nodes connected directly on L2 switches >> / \S1 / \S2 S3/ \ S4/ \ <== >> The Nth L1 of a set leads only to the Nth L2 (L2-N). With some pruning. >> L1 L1 L1 L1 >> /|\ /|\ /|\ /|\ >> ==Fully mixed to L1== ==Fully mixed to L1== <=== We have >> multiple set. In each set, all L0 lead to all L1 of their set. >> >> L0 L0 L0 L0 >> / \ / \ / \ / \ >> CN CN .. CN CN .... CN CN .. CN CN >> >> To detail: >> We have a bunch of sets. Each set contains compute node, L0 and L1 >> switches. >> Plus a common top of L2 and L3 switches. >> >> In each set, there are groups of compute nodes. Each group is >> connected to a single L0 switch. >> In a given set, all L0 are connected to all L1. >> >> The Nth L1 of a set is connected to the Nth L2 and only to this one. >> (so through a L2, the Nth L1 can only see the Nth L1 of the other sets) >> There are Services nodes connected to the L2 switches. >> All the L2 are connected to a couple of L3. >> >> The problem we have seen when routing on this topology is that most of >> the routes from CN to SN (service nodes) go through the L3 switches. >> With the current algorithm, the less loaded link is choosed to go down >> by going up. Therefore, the primary path goes through a L2, then a L3 >> from where it covers all the network. > > Not sure I understand why. > There are two stages of routing: > 1. Route the downward paths, which is done by climbing up the tree. > 2. Route the upward paths, which is done by descending. > Before each climbing, the routing first configures all the upward > paths: the first acll in the beginning of the > __osm_ftree_fabric_route_downgoing_by_going_up() function is > __osm_ftree_fabric_route_upgoing_by_going_down().
OK > > So in your example, if some Service Node (SN) is connected to L2-i > switch, the routing should first configure all the i'th switches > in the L1 sets, and then all the L0 switches in all the sets because > L1 is fully mixed to L0 in each set. > > Where am I wrong? You're right. Routes towards the service nodes are OK. Problem is the routes from the service node. Let's take a CN (called C0) attached to a L0. Algorithm go one step up (downward route) to L1-1 (could be any L1 in its set), covers all the upward routes so we have all the routes CN->C0 where the CN belong to the same set. Next step: Algorithm doesn't have a choice here, it goes to L2-1, covers the upward routes so we have all the CN->C0 routes and S1->C0 However we don't have any routes from S[234]->C0 yet. Next and final step, algorithm go to either one of the L3, covers upwards routes and create routes from S[234]->C0. These routes have more hops than the minimum possible. > >> This wasn't acceptable for us as L3 switches would be overloaded when >> there were less loaded/shorter paths to achieve the same HCA. >> >> So what we have introduced here is a "balanced min_hop" within the >> ftree algorithm. >> Basically, instead of just leaving when we reach a LFT which has >> already been configured for the target lid, we check the hops count of >> the switch toward this lid, and the hop count on the path we came >> through. If we have found a shorter path, we update the LFT and minhop >> tables to use this new path. > > I don't think that fat-tree routing was supposed to configure a route > with more hops that min-hop would do it. Either I don't understand the > example, or there is a bug in the routing. If it isn't suppose to, then it's a bug :) However this bug only happens with nodes not at the same level as compute nodes. > > -- Yevgeny > _______________________________________________ general mailing list [email protected] http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
