This patch series adds a new routing engine designed to handle large fabrics connected with a 2D/3D torus topology.

Patches 1-4 do some preparation to handle new SL-related features of the routing engine, patches 5/6 add and enable the engine, and patches 7-11 have some fixups that only make sense in the presence of the new engine.

So why a new torus routing engine? Because I believe none of the existing routing engines can provide a satisfactory operational experience on a large-scale torus, i.e. one with hundreds of switches.

Generating routes for a torus that are free of credit loops requires the use of multiple virtual lanes, and thus SLs on IB. For IB fabrics it also requires that _every_ application use path record queries - any application that uses an SL that was not obtained via a path record query may cause credit loops.

In addition, if a fabric topology change (e.g. failed switch/link) causes a change in the path SL values needed to prevent credit loops, then _every_ application needs to repath for every path whose SL has changed. AFAIK there is no good way to do this as yet in general.

Also, the requirement for path SL queries on every connection places a heavy load on subnet administration, and the possibility that path SL values can change makes caching as a performance enhancement more difficult.

Since multiple VL/SL values are required to prevent credit loops on a torus, supporting QoS means that QoS and routing need to share the small pool of available SL values, and the even smaller pool of available VL values.

This patch series, and the routing engine it introduces, addresses these issues for a 2D/3D torus fabric.

The torus-2QoS engine can provide the following functionality on a 2D/3D torus:
- routing that is free of credit loops
- two levels of QoS, assuming switches support 8 data VLs
- ability to route around a single failed switch, and/or multiple failed links, without
  - introducing credit loops
  - changing path SL values
- very short run times, with good scaling properties as fabric size increases

The routing engine currently in opensm that is most functional for a torus-connected fabric is LASH. In comparison with torus-2QoS, LASH has the following issues:
- LASH does not support QoS.
- changing inter-switch topology (adding/removing a switch, or removing all the links between a pair of switches) can change many path SL values, potentially leading to credit loops if running applications do not repath.
- running time to calculate routes scales poorly with increasing fabric size.

The basic algorithm used by torus-2QoS is DOR (dimension-order routing). It also uses SL bits 0-2, one SL bit per torus dimension, to encode whether a path crosses a dateline (where the coordinate value wraps to zero) for each of the three dimensions, in order to avoid the credit loops that otherwise result on a torus. It uses SL bit 3 to distinguish between two QoS levels.

It uses the SL2VL tables to map those eight SL values per QoS level into two VL values per QoS level, based on which coordinate direction a link points. For two QoS levels, this consumes four data VLs, where VL bit 0 encodes whether the path crosses the dateline for the coordinate direction in which the link points, and VL bit 2 encodes QoS level.

In the event of link failure, it routes the long way around the 1-D ring containing the failed link, i.e. no turns are introduced into a path in order to route around a failed link. Note that due to this implementation, torus-2QoS cannot route a torus with link failures that break a 1-D ring into two disjoint segments.
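
To make the SL/VL encoding described above concrete, here is a minimal C sketch. It is not taken from osm_ucast_torus.c: the names crosses_dateline(), path_sl() and sl_to_vl() are made up for illustration, and it assumes a fault-free torus where each dimension is traversed in its shorter direction (ties go to the increasing direction).

```c
#include <assert.h>

/* Illustrative names only; none of these come from osm_ucast_torus.c. */
#define TORUS_DIMS 3
#define SL_QOS_BIT (1u << 3)	/* SL bit 3 selects the QoS level */
#define VL_QOS_BIT (1u << 2)	/* VL bit 2 selects the QoS level */

/*
 * Return 1 if the hop sequence from src to dst on a 1-D ring of
 * `radix` switches wraps through the dateline (coordinate wraps
 * between radix-1 and 0).  Assumption: the shorter direction is
 * taken, and a tie goes to the increasing direction.
 */
static int crosses_dateline(unsigned src, unsigned dst, unsigned radix)
{
	unsigned fwd;

	assert(radix > 0 && src < radix && dst < radix);
	fwd = (dst + radix - src) % radix;	/* hops in + direction */
	if (fwd == 0)
		return 0;			/* no movement in this dimension */
	if (fwd <= radix - fwd)
		return dst < src;		/* moving +: wrapped if dst is "behind" */
	return dst > src;			/* moving -: wrapped if dst is "ahead" */
}

/* SL bits 0-2: one dateline bit per dimension; SL bit 3: QoS level. */
static unsigned path_sl(const unsigned src[TORUS_DIMS],
			const unsigned dst[TORUS_DIMS],
			const unsigned radix[TORUS_DIMS], int qos_level)
{
	unsigned sl = 0;
	int d;

	for (d = 0; d < TORUS_DIMS; d++)
		if (crosses_dateline(src[d], dst[d], radix[d]))
			sl |= 1u << d;
	if (qos_level)
		sl |= SL_QOS_BIT;
	return sl;
}

/*
 * SL2VL for a link pointing along dimension link_dim: VL bit 0 is
 * the dateline bit for that dimension only, VL bit 2 is the QoS level.
 */
static unsigned sl_to_vl(unsigned sl, int link_dim)
{
	unsigned vl;

	assert(link_dim >= 0 && link_dim < TORUS_DIMS);
	vl = (sl >> link_dim) & 1u;
	if (sl & SL_QOS_BIT)
		vl |= VL_QOS_BIT;
	return vl;
}
```

For example, in a 6x6x6 torus a path from (4,0,2) to (1,0,3) at the second QoS level wraps only in x, so this sketch gives SL 9; that SL maps to VL 5 on links pointing in x and VL 4 on links pointing in z.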

Under DOR routing in a torus with a failed switch, paths that would otherwise turn at the failed switch cannot be routed without introducing an "illegal" turn into the path. Such turns are "illegal" in the sense that allowing them will allow credit loops, unless additional measures are taken to prevent them.

The routes produced by torus-2QoS will introduce such "illegal" turns when a switch fails. It makes use of the input/output port dependence in the SL2VL maps to set the otherwise unused VL bit 1 for the path hop following such an illegal turn. This is enough to avoid credit loops in the presence of a single failed switch.

As an example, consider the following 2D torus, and consider routes from S to D, both when the switch at F is operational, and when it has failed. torus-2QoS will generate routes such that the path S-F-D is followed if F is operational, and the path S-E-I-L-D if F has failed:

      |    |    |    |    |    |    |
    --+----+----+----+----+----+----+--
      |    |    |    |    |    |    |
    --+----+----+----+----+----D----+--
      |    |    |    |    |    |    |
    --+----+----+----+----I----L----+--
      |    |    |    |    |    |    |
    --+----+----S----+----E----F----+--
      |    |    |    |    |    |    |
    --+----+----+----+----+----+----+--

The turn in S-E-I-L-D at switch I is the illegal turn introduced into the path. The turns at E and L are extra turns introduced into the path that are legal in the sense that no credit loops can be constructed using them. The path hop after the turn at switch I has VL bit 1 set, which marks it as a hop after an illegal turn.

I've used the latest development version of ibdmchk, because it can use path SL values and SL2VL tables, to check for credit loops in cases like the above routed with torus-2QoS, and it finds none.

I've also looked for credit loops in a torus with multiple failed switches routed with torus-2QoS, and learned that if and only if the failed switches are adjacent in the last DOR dimension, there will be no credit loops.
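
The illegal-turn marking described above can be sketched in C as follows. This is only an illustration, not the code in osm_ucast_torus.c: the names is_illegal_turn() and sl_to_vl_with_turn() are hypothetical, and it assumes DOR visits the dimensions in the order x (0), y (1), z (2), so that a turn from a later dimension back into an earlier one is the "illegal" kind.

```c
#include <assert.h>

/* Illustrative names only; none of these come from osm_ucast_torus.c. */
#define DIM_NONE (-1)		/* input port is not a torus link (e.g. a CA port) */
#define SL_QOS_BIT   (1u << 3)	/* SL bit 3: QoS level */
#define VL_TURN_BIT  (1u << 1)	/* VL bit 1: hop follows an illegal turn */
#define VL_QOS_BIT   (1u << 2)	/* VL bit 2: QoS level */

/*
 * Assumption: DOR order is dimension 0, then 1, then 2.  A turn is
 * then "illegal" when the output link lies in an earlier dimension
 * than the input link (y->x, z->x, z->y).
 */
static int is_illegal_turn(int in_dim, int out_dim)
{
	return in_dim != DIM_NONE && out_dim < in_dim;
}

/*
 * SL2VL lookup keyed on (input dimension, output dimension, SL).
 * VL bit 0: dateline bit for the output link's dimension.
 * VL bit 1: set only for the hop leaving an illegal turn.
 * VL bit 2: QoS level.
 */
static unsigned sl_to_vl_with_turn(unsigned sl, int in_dim, int out_dim)
{
	unsigned vl;

	assert(out_dim >= 0 && out_dim <= 2);
	vl = (sl >> out_dim) & 1u;
	if (is_illegal_turn(in_dim, out_dim))
		vl |= VL_TURN_BIT;
	if (sl & SL_QOS_BIT)
		vl |= VL_QOS_BIT;
	return vl;
}
```

Taking x as the horizontal and y as the vertical direction in the diagram above, the turns at E and L go from x into y and leave VL bit 1 clear, while the turn at I goes from y back into x, so the hop I-L is the one that gets VL bit 1 set.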

Since torus-2QoS makes use of all available SL values when supporting 2 QoS levels, there are none left over on which to confine multicast. It turns out there is a way to construct a spanning tree which can overlay a DOR-routed mesh, so that multicast and unicast can coexist on the same SL/VL without causing credit loops. I'm working on that but don't have it implemented yet.

In the meantime, if you do not request QoS using opensm -Q, then torus-2QoS will only use SLs 8-15, and thus VLs 4-7, leaving SL0/VL0 free for multicast.

Jim Schutt (11):
  opensm: Prepare for routing engine input to path record SL lookup and SL2VL map setup.
  opensm: Allow the routing engine to influence SL2VL calculations.
  opensm: Allow the routing engine to participate in path SL calculations.
  opensm: Track the minimum value in the fabric of data VLs supported.
  opensm: Add torus-2QoS routing engine.
  opensm: Enable torus-2QoS routing engine.
  opensm: Add opensm option to specify file name for extra torus-2QoS configuration information.
  opensm: Do not require -Q option for torus-2QoS routing engine.
  opensm: Make it possible to configure no fallback routing engine.
  opensm: Avoid havoc in minhop caused by torus-2QoS persistent use of osm_port_t:priv.
  opensm: Update documentation to describe torus-2QoS.

 opensm/doc/current-routing.txt         |  154 +-
 opensm/include/opensm/osm_base.h       |   18 +
 opensm/include/opensm/osm_opensm.h     |   24 +-
 opensm/include/opensm/osm_subnet.h     |    7 +
 opensm/include/opensm/osm_ucast_lash.h |    3 -
 opensm/man/opensm.8.in                 |    9 +-
 opensm/opensm/Makefile.am              |    2 +-
 opensm/opensm/main.c                   |    8 +
 opensm/opensm/osm_console.c            |   10 +-
 opensm/opensm/osm_dump.c               |    3 +-
 opensm/opensm/osm_link_mgr.c           |   16 +-
 opensm/opensm/osm_opensm.c             |   54 +-
 opensm/opensm/osm_port_info_rcv.c      |   13 +-
 opensm/opensm/osm_qos.c                |   26 +-
 opensm/opensm/osm_sa_path_record.c     |   33 +-
 opensm/opensm/osm_state_mgr.c          |   10 +-
 opensm/opensm/osm_subnet.c             |   20 +-
 opensm/opensm/osm_ucast_lash.c         |   11 +-
 opensm/opensm/osm_ucast_mgr.c          |   44 +-
 opensm/opensm/osm_ucast_torus.c        | 8665 ++++++++++++++++++++++++++++++++
 20 files changed, 9038 insertions(+), 92 deletions(-)
 create mode 100644 opensm/opensm/osm_ucast_torus.c