Generating routes for a torus that are free of credit loops requires
the use of multiple virtual lanes, and thus multiple SLs on IB.  For
IB fabrics it also requires that _every_ application use path record
queries - any application that uses an SL not obtained via a path
record query may cause credit loops.

In addition, if a fabric topology change (e.g. failed switch/link)
causes a change in the path SL values needed to prevent credit loops,
then _every_ application needs to repath for every path whose SL has
changed.  AFAIK there is no good way to do this as yet in general.

Also, the requirement for path SL queries on every connection places a
heavy load on subnet administration, and the possibility that path SL
values can change makes it more difficult to use caching as a
performance enhancement.

Since multiple VL/SL values are required to prevent credit loops on a
torus, supporting QoS means that QoS and routing need to share the small
pool of available SL values, and the even smaller pool of available VL
values.

The torus-2QoS engine addresses these issues for a 2D/3D torus fabric
by providing the following functionality:
- routing that is free of credit loops
- two levels of QoS, assuming switches support 8 data VLs
- ability to route around a single failed switch, and/or multiple failed
    links, without
    - introducing credit loops
    - changing path SL values
- very short run times, with good scaling properties as fabric size
    increases

The routing engine currently in opensm that is most functional for a
torus-connected fabric is LASH.  In comparison with torus-2QoS, LASH
has the following issues:
- LASH does not support QoS.
- changing the inter-switch topology (adding/removing a switch, or
    removing all the links connected to a switch) can change many
    path SL values, potentially leading to credit loops if
    running applications do not repath.
- running time to calculate routes scales poorly with increasing
    fabric size.

The basic algorithm used by torus-2QoS is DOR (dimension-order
routing).  It also uses SL bits 0-2, one SL bit per torus dimension,
to encode whether a path crosses a dateline (where the coordinate
value wraps to zero) in each of the three dimensions; this avoids the
credit loops that otherwise result on a torus.  It uses SL bit 3 to
distinguish between two QoS levels.
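
As a concrete illustration, here is a minimal sketch of how a path SL
could be composed under this scheme.  This is not the patch's actual
code; the function name and parameters are hypothetical:

    /*
     * crossed[i] is nonzero if the path wraps through the dateline
     * in torus dimension i (0 <= i < 3); qos is 0 or 1.
     */
    unsigned torus2qos_path_sl(const int crossed[3], unsigned qos)
    {
        unsigned sl = 0;

        sl |= (crossed[0] ? 1 : 0) << 0;  /* dateline, dimension 0 */
        sl |= (crossed[1] ? 1 : 0) << 1;  /* dateline, dimension 1 */
        sl |= (crossed[2] ? 1 : 0) << 2;  /* dateline, dimension 2 */
        sl |= (qos & 1) << 3;             /* QoS level */

        return sl;  /* 0..15 */
    }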

It uses the SL2VL tables to map those eight SL values per QoS level
into two VL values per QoS level, based on the coordinate direction in
which a link points.  For two QoS levels this consumes four data VLs:
VL bit 0 encodes whether the path crosses the dateline for the
coordinate direction in which the link points, and VL bit 2 encodes
the QoS level.
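
Again as a hypothetical sketch (the patch builds SL2VL tables rather
than computing this on the fly; this only shows the bit-level
mapping), where dim is the coordinate direction (0, 1, or 2) in which
the output link points:

    unsigned torus2qos_sl2vl(unsigned sl, unsigned dim)
    {
        unsigned vl = 0;

        vl |= (sl >> dim) & 1;       /* VL bit 0: dateline crossing
                                        for this link's dimension */
        vl |= ((sl >> 3) & 1) << 2;  /* VL bit 2: QoS level */

        return vl;  /* one of 0, 1, 4, 5 - four data VLs */
    }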

In the event of link failure, it routes the long way around the 1-D
ring containing the failed link.  I.e. no turns are introduced into a
path in order to route around a failed link.  Note that as a
consequence of this implementation, torus-2QoS cannot route a torus in
which link failures break a 1-D ring into two disjoint segments.
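
To make that decision concrete, here is a hypothetical sketch of
picking the hop direction in a single ring dimension when the link
between switches fail and fail+1 (mod len) has failed.  The default
"shorter way" policy and all names here are my assumptions, not taken
from the patch:

    /* true if going +1 from src to dst crosses link (i, i+1 mod len) */
    static int crosses(int src, int dst, int i, int len)
    {
        int di = (i - src + len) % len;
        int dd = (dst - src + len) % len;

        return di < dd;
    }

    /* returns +1 or -1 (or 0 if src == dst) */
    int ring_direction(int src, int dst, int fail, int len)
    {
        int dir;

        if (src == dst)
            return 0;
        /* default: the shorter way around the ring */
        dir = ((dst - src + len) % len <= len / 2) ? 1 : -1;
        /* if that way crosses the failed link, go the long way */
        if (dir == 1 && crosses(src, dst, fail, len))
            dir = -1;
        else if (dir == -1 && !crosses(src, dst, fail, len))
            dir = 1;
        return dir;
    }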

Under DOR routing in a torus with a failed switch, paths that would
otherwise turn at the failed switch cannot be routed without
introducing an "illegal" turn into the path.  Such turns are "illegal"
in the sense that allowing them permits credit loops unless additional
measures are taken.

The routes produced by torus-2QoS will introduce such "illegal" turns when
a switch fails.  It makes use of the input/output port dependence in the
SL2VL maps to set the otherwise unused VL bit 1 for the path hop following
such an illegal turn.  This is enough to avoid credit loops in the
presence of a single failed switch.
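
Extending the earlier hypothetical SL2VL sketch, the effect is as if
the map computed the following, with after_illegal_turn standing in
for the input/output port dependence of the real tables:

    unsigned torus2qos_sl2vl_hop(unsigned sl, unsigned dim,
                                 int after_illegal_turn)
    {
        unsigned vl = 0;

        vl |= (sl >> dim) & 1;       /* VL bit 0: dateline crossing */
        if (after_illegal_turn)
            vl |= 1 << 1;            /* VL bit 1: hop following an
                                        illegal turn */
        vl |= ((sl >> 3) & 1) << 2;  /* VL bit 2: QoS level */

        return vl;  /* 0..7 - all eight data VLs */
    }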

As an example, consider the following 2D torus, and consider routes
from S to D, both when the switch at F is operational, and when it
has failed.  torus-2QoS will generate routes such that the path
S-F-D is followed if F is operational, and the path S-E-I-L-D
if F has failed:

    |    |    |    |    |    |    |
  --+----+----+----+----+----+----+--
    |    |    |    |    |    |    |
  --+----+----+----+----+----D----+--
    |    |    |    |    |    |    |
  --+----+----+----+----I----L----+--
    |    |    |    |    |    |    |
  --+----+----S----+----E----F----+--
    |    |    |    |    |    |    |
  --+----+----+----+----+----+----+--

The turn in S-E-I-L-D at switch I is the illegal turn introduced
into the path.  The turns at E and L are extra turns introduced
into the path that are legal in the sense that no credit loops
can be constructed using them.

The path hop after the turn at switch I has VL bit 1 set, which marks
it as a hop after an illegal turn.

I've used the latest development version of ibdmchk, which can use
path SL values and SL2VL tables, to check for credit loops in cases
like the above when routed with torus-2QoS, and it finds none.

I've also looked for credit loops in a torus with multiple failed
switches routed with torus-2QoS, and found that there are no credit
loops if and only if the failed switches are adjacent in the last DOR
dimension.

Since torus-2QoS uses all available SL values for unicast traffic,
multicast traffic must share SL values with unicast traffic.  This
in turn means that multicast routing must be compatible with unicast
routing to prevent credit loops.

Since torus-2QoS unicast routing is based on DOR, it turns out to
be possible to construct spanning trees so that when multicast
and unicast traffic are overlaid, credit loops are not possible.

Here is a 2D example of such a spanning tree, where "x" is the
root switch, and each "+" is a non-root switch:

   +  +  +  +  +
   |  |  |  |  |
   +  +  +  +  +
   |  |  |  |  |
   +--+--x--+--+
   |  |  |  |  |
   +  +  +  +  +

For multicast traffic routed from root to tip, every turn in the
above spanning tree is a legal DOR turn.

For traffic routed from tip to root, and for traffic routed through
the root, some turns are not legal DOR turns.  However, the union of
multicast routing on this spanning tree with DOR unicast routing can
provide only 3 of the 4 turns needed to construct a credit loop.

In addition, if none of the above spanning tree branches crosses
a dateline used for unicast credit loop avoidance on a torus,
and multicast traffic is confined to SL 0 or SL 8 (recall that
torus-2QoS uses SL bit 3 to differentiate QoS level), then
multicast traffic also cannot contribute to the "ring" credit
loops that are otherwise possible in a torus.

Torus-2QoS uses these ideas to create a master spanning tree.
Every multicast group spanning tree will be constructed as a
subset of the master tree, with the same root as the master
tree.

Such multicast group spanning trees will in general not be optimal
for groups whose members span only a subset of the fabric.  However,
this compromise must be made to enable support for two QoS levels on
a torus while preventing credit loops.

To build a spanning tree for a particular MLID, torus-2QoS just
needs to mark all the ports that participate in that multicast
group, then walk the master spanning tree and add switches
hosting the marked ports to the multicast group spanning tree.
A depth-first search of the master spanning tree is used for this.
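
In outline, that walk might look like the following hypothetical
sketch (struct tree_node and its fields are assumptions, not the
patch's actual data structures):

    struct tree_node {
        int marked;                /* switch hosts a member port */
        int in_group;              /* result: kept in group's tree */
        unsigned nchild;
        struct tree_node **child;
    };

    /* returns nonzero if node or any descendant hosts a marked port */
    static int build_group_tree(struct tree_node *node)
    {
        int needed = node->marked;
        unsigned i;

        for (i = 0; i < node->nchild; i++)
            if (build_group_tree(node->child[i]))
                needed = 1;  /* keep the branch to that descendant */

        node->in_group = needed;
        return needed;
    }

Starting this at the master tree's root keeps exactly the switches
hosting marked ports, plus the branches connecting them to the root.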

Signed-off-by: Jim Schutt <jasc...@sandia.gov>
---

I've attached the patch as a compressed file, as otherwise
it is too large to make it through the list.

-- Jim

 opensm/opensm/Makefile.am |    2 +-
 opensm/opensm/osm_torus.c | 9114 +++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 9115 insertions(+), 1 deletions(-)
 create mode 100644 opensm/opensm/osm_torus.c


Attachment: 0007-opensm-Add-torus-2QoS-routing-engine.patch.bz2