Generating routes for a torus that are free of credit loops requires the use of multiple virtual lanes, and thus SLs on IB. For IB fabrics it also requires that _every_ application use path record queries: any application that uses an SL that was not obtained via a path record query may cause credit loops.
In addition, if a fabric topology change (e.g. a failed switch or link) changes the path SL values needed to prevent credit loops, then _every_ application needs to repath every connection whose SL has changed. AFAIK there is no good way to do this yet in the general case. Also, the requirement for a path SL query on every connection places a heavy load on subnet administration, and the possibility that path SL values can change makes caching them as a performance enhancement more difficult.

Since multiple VL/SL values are required to prevent credit loops on a torus, supporting QoS means that QoS and routing need to share the small pool of available SL values, and the even smaller pool of available VL values.

The torus-2QoS engine addresses these issues for a 2D/3D torus fabric by providing the following functionality:

 - routing that is free of credit loops
 - two levels of QoS, assuming switches support 8 data VLs
 - the ability to route around a single failed switch, and/or multiple
   failed links, without
    - introducing credit loops
    - changing path SL values
 - very short run times, with good scaling properties as fabric size
   increases

The routing engine currently in opensm that is most functional for a torus-connected fabric is LASH. In comparison with torus-2QoS, LASH has the following issues:

 - LASH does not support QoS.
 - An inter-switch topology change (adding or removing a switch, or
   removing all the links to a switch) can change many path SL values,
   potentially leading to credit loops if running applications do not
   repath.
 - Its running time to calculate routes scales poorly with increasing
   fabric size.

The basic algorithm used by torus-2QoS is DOR (dimension-order routing). It also uses SL bits 0-2, one SL bit per torus dimension, to encode whether a path crosses a dateline (where the coordinate value wraps to zero) in each of the three dimensions, in order to avoid the credit loops that otherwise result on a torus. It uses SL bit 3 to distinguish between the two QoS levels.
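The SL encoding just described can be sketched as follows. This is an illustrative helper, not code from the patch; the function name and signature are hypothetical, but the bit layout matches the description above (SL bits 0-2 carry one dateline-crossing flag per torus dimension, SL bit 3 selects the QoS level):

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical sketch of the path SL encoding described above; the
 * name and signature are illustrative, not taken from osm_torus.c.
 *
 * crosses_dateline[d] is nonzero if the DOR path wraps past the
 * dateline in torus dimension d (0 = x, 1 = y, 2 = z).
 */
static uint8_t torus_path_sl(const int crosses_dateline[3], int qos_level)
{
	uint8_t sl = 0;
	int d;

	for (d = 0; d < 3; d++)
		if (crosses_dateline[d])
			sl |= 1u << d;	/* SL bits 0-2: per-dimension dateline flag */

	if (qos_level)
		sl |= 1u << 3;	/* SL bit 3: QoS level */

	return sl;		/* 0-7 for QoS level 0, 8-15 for QoS level 1 */
}
```

For example, a path that wraps only in the y dimension at QoS level 1 would get SL 0b1010 = 10 under this encoding.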
It uses the SL2VL tables to map those eight SL values per QoS level into two VL values per QoS level, based on the coordinate direction in which a link points. For two QoS levels this consumes four data VLs: VL bit 0 encodes whether the path crosses the dateline for the coordinate direction in which the link points, and VL bit 2 encodes the QoS level.

In the event of a link failure, torus-2QoS routes the long way around the 1-D ring containing the failed link; i.e., no turns are introduced into a path in order to route around a failed link. Note that as a consequence, torus-2QoS cannot route a torus with link failures that break a 1-D ring into two disjoint segments.

Under DOR routing in a torus with a failed switch, paths that would otherwise turn at the failed switch cannot be routed without introducing an "illegal" turn into the path. Such turns are "illegal" in the sense that allowing them permits credit loops, unless something more is done. The routes produced by torus-2QoS will introduce such "illegal" turns when a switch fails. It makes use of the input/output port dependence in the SL2VL maps to set the otherwise unused VL bit 1 for the path hop following such an illegal turn. This is enough to avoid credit loops in the presence of a single failed switch.

As an example, consider the following 2D torus, and consider routes from S to D, both when the switch at F is operational and when it has failed. torus-2QoS will generate routes such that the path S-F-D is followed if F is operational, and the path S-E-I-L-D if F has failed:

    |    |    |    |    |    |    |
  --+----+----+----+----+----+----+--
    |    |    |    |    |    |    |
  --+----+----+----+----+----D----+--
    |    |    |    |    |    |    |
  --+----+----+----+----I----L----+--
    |    |    |    |    |    |    |
  --+----+----S----+----E----F----+--
    |    |    |    |    |    |    |
  --+----+----+----+----+----+----+--

The turn in S-E-I-L-D at switch I is the illegal turn introduced into the path.
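The SL-to-VL mapping described above can be sketched like this. Again this is an illustrative helper, not code from the patch; in the real engine the mapping lives in the per-port SL2VL tables, and the "hop after an illegal turn" condition is expressed through the input/output-port dependence of those tables rather than a function argument:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical sketch of the SL2VL mapping described above; the name
 * and signature are illustrative, not taken from osm_torus.c.
 *
 * link_dim is the torus dimension (0 = x, 1 = y, 2 = z) in which the
 * link points; after_illegal_turn models the input/output-port
 * dependence used to mark the hop that follows an "illegal" turn
 * around a failed switch.
 */
static uint8_t torus_sl2vl(uint8_t sl, int link_dim, int after_illegal_turn)
{
	uint8_t vl = 0;

	/* VL bit 0: dateline-crossing flag for this link's dimension */
	vl |= (sl >> link_dim) & 1u;

	/* VL bit 1: set only on the hop following an illegal turn */
	if (after_illegal_turn)
		vl |= 1u << 1;

	/* VL bit 2: QoS level, copied from SL bit 3 */
	vl |= ((sl >> 3) & 1u) << 2;

	return vl;
}
```

With no failed switches, VL bit 1 is never set, so only VLs 0, 1, 4, and 5 are used, which is the "four data VLs for two QoS levels" mentioned above.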
The turns at E and L are extra turns introduced into the path that are legal in the sense that no credit loops can be constructed using them. The path hop after the turn at switch I has VL bit 1 set, which marks it as the hop after an illegal turn.

I've used the latest development version of ibdmchk, which can use path SL values and SL2VL tables, to check for credit loops in cases like the above routed with torus-2QoS, and it finds none. I've also looked for credit loops in a torus with multiple failed switches routed with torus-2QoS, and found that there are no credit loops if and only if the failed switches are adjacent in the last DOR dimension.

Since torus-2QoS uses all available SL values for unicast traffic, multicast traffic must share SL values with unicast traffic. This in turn means that multicast routing must be compatible with unicast routing in order to prevent credit loops. Since torus-2QoS unicast routing is based on DOR, it turns out to be possible to construct spanning trees such that when multicast and unicast traffic are overlaid, credit loops are not possible.

Here is a 2D example of such a spanning tree, where "x" is the root switch and each "+" is a non-root switch:

  +    +    +    +    +
  |    |    |    |    |
  +    +    +    +    +
  |    |    |    |    |
  +----+----x----+----+
  |    |    |    |    |
  +    +    +    +    +

For multicast traffic routed from root to tip, every turn in the above spanning tree is a legal DOR turn. For traffic routed from tip to root, and for traffic routed through the root, turns are not legal DOR turns. However, in order to construct a credit loop, the union of multicast routing on this spanning tree with DOR unicast routing can only provide 3 of the 4 turns needed for the loop.
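The turn classification used in this credit-loop argument can be sketched as a simple predicate (illustrative, not code from the patch). Under dimension-order routing the only legal turn is from travel in a lower-numbered dimension into a higher-numbered one (x-to-y in 2D); the reverse turn is "illegal", and is what VL bit 1 must isolate:

```c
#include <assert.h>

/* Sketch (not code from the patch) of the turn classification used in
 * the credit-loop argument above.  A turn is legal under DOR exactly
 * when it goes from a lower-numbered dimension into a higher-numbered
 * one, e.g. x (0) to y (1) in 2D.
 */
static int is_legal_dor_turn(int in_dim, int out_dim)
{
	return in_dim < out_dim;
}

/* Any credit loop in a 2D mesh must alternate x and y travel, so its
 * four turns are x-to-y, y-to-x, x-to-y, y-to-x: two of them are
 * illegal under DOR, which is why DOR unicast routing alone cannot
 * supply all four turns of a loop.
 */
```

A straight-through hop (in_dim == out_dim) is not a turn at all, so the predicate is only meaningful when the two dimensions differ.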
In addition, if none of the above spanning tree branches crosses a dateline used for unicast credit loop avoidance on the torus, and multicast traffic is confined to SL 0 or SL 8 (recall that torus-2QoS uses SL bit 3 to differentiate QoS level), then multicast traffic also cannot contribute to the "ring" credit loops that are otherwise possible in a torus.

Torus-2QoS uses these ideas to create a master spanning tree. Every multicast group spanning tree will be constructed as a subset of the master tree, with the same root as the master tree. Such multicast group spanning trees will in general not be optimal for groups which are a subset of the full fabric. However, this compromise must be made to enable support for two QoS levels on a torus while preventing credit loops.

To build the spanning tree for a particular MLID, torus-2QoS just needs to mark all the ports that participate in that multicast group, then walk the master spanning tree and add the switches hosting marked ports to the multicast group spanning tree. A depth-first search of the master spanning tree is used for this.

Signed-off-by: Jim Schutt <jasc...@sandia.gov>
---

I've attached the patch as a compressed file, as otherwise it is too large to make it through the list.

-- Jim

 opensm/opensm/Makefile.am |    2 +-
 opensm/opensm/osm_torus.c | 9114 +++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 9115 insertions(+), 1 deletions(-)
 create mode 100644 opensm/opensm/osm_torus.c
0007-opensm-Add-torus-2QoS-routing-engine.patch.bz2