The whole point of torus-2QoS is to provide deadlock-free routing
for a torus while enabling two quality-of-service levels.  The
ability to route around a failed switch provides a window to
repair the fabric with minimal impact on running applications.

So if the possibility of message deadlock is detected due to
the topology of missing switches, torus-2QoS should fail
to route.

Users of torus-2QoS can either configure multiple routing
algorithms, so another algorithm with different properties can
attempt to route the fabric, or configure no fallback algorithm
so that the last good torus-2QoS tables are left in the switches.
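
As a rough illustration of those two choices, the opensm.conf
fragments below assume the comma-separated routing_engine list and
the no_fallback keyword described in the opensm man page; the exact
spelling may differ by OpenSM version, and minhop is only an example
of a second engine, not a recommendation:

    # Try torus-2QoS first; if it fails to route, let another
    # engine attempt to route the fabric.
    routing_engine torus-2QoS,minhop

    # Or: no fallback, so the last good torus-2QoS tables are
    # left in the switches when torus-2QoS fails to route.
    routing_engine torus-2QoS,no_fallback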

None of the alternatives are great:
- Having torus-2QoS route the fabric even though the missing
    switch topology allows message deadlock means applications
    may encounter poor performance due to message deadlock.
- Having another engine route the fabric means that any
    application that doesn't repath may trigger message deadlock
    due to inconsistencies between path SL values in use and
    path SL values required by the new engine for deadlock-free
    routing.
- Leaving the last good torus-2QoS tables in the switches means
    that traffic through the newly failed switch cannot be
    delivered.

It isn't clear which of these options has the least impact on
running applications, but the operational imperative is clear:
failures in a torus fabric routed with torus-2QoS need to be
repaired ASAP.

Signed-off-by: Jim Schutt <jasc...@sandia.gov>
---
 opensm/opensm/osm_ucast_torus.c |    8 +++++---
 1 files changed, 5 insertions(+), 3 deletions(-)

diff --git a/opensm/opensm/osm_ucast_torus.c b/opensm/opensm/osm_ucast_torus.c
index 7108394..bc87757 100644
--- a/opensm/opensm/osm_ucast_torus.c
+++ b/opensm/opensm/osm_ucast_torus.c
@@ -7659,10 +7659,12 @@ bool routable_torus(struct torus *t, struct fabric *f)
                        }
                }
 
-       if (t->flags & MSG_DEADLOCK)
+       if (t->flags & MSG_DEADLOCK) {
                OSM_LOG(&t->osm->log, OSM_LOG_ERROR,
-                       "Warning: missing switch topology "
-                       "==> message deadlock possible!\n");
+                       "Error: missing switch topology "
+                       "==> message deadlock!\n");
+               success = false;
+       }
        return success;
 }
 
-- 
1.5.6.GIT

