The whole point of torus-2QoS is to provide deadlock-free routing for a torus while enabling two quality of service levels. The ability to route around a failed switch provides a window to repair the fabric with minimal impact to running applications.
So if the possibility of message deadlock is detected due to the topology
of missing switches, torus-2QoS should fail to route.

Users of torus-2QoS can either configure multiple routing algorithms, so
another algorithm with different properties can attempt to route the
fabric, or configure no fallback algorithm so that the last good
torus-2QoS tables are left in the switches.

None of the alternatives is great:

- Having torus-2QoS route the fabric even though the missing switch
  topology allows message deadlock means applications may encounter poor
  performance due to message deadlock.

- Having another engine route the fabric means that any application that
  doesn't repath may trigger message deadlock due to inconsistencies
  between the path SL values in use and the path SL values required by
  the new engine for deadlock-free routing.

- Leaving the last good torus-2QoS tables in the switches means that
  traffic through the newly failed switch cannot be delivered.

It isn't clear which of these options has the least impact on running
applications, but the operational imperative is clear: failures in a
torus fabric routed with torus-2QoS need to be repaired ASAP.
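The behavior change described above can be sketched in isolation. This is only an illustration, not the OpenSM code: `struct torus_sketch`, `routable_sketch()`, and the value chosen for `MSG_DEADLOCK` are hypothetical stand-ins, and `fprintf()` takes the place of `OSM_LOG()`.

```c
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical stand-in; the real flag value is defined in OpenSM. */
#define MSG_DEADLOCK (1U << 0)

/* Hypothetical reduced form of struct torus: only the flags matter here. */
struct torus_sketch {
	unsigned flags;
};

/*
 * Models the patched tail of routable_torus(): when the missing-switch
 * topology permits message deadlock, report an error and force failure
 * rather than only warning and returning the earlier result.
 */
static bool routable_sketch(const struct torus_sketch *t, bool success)
{
	if (t->flags & MSG_DEADLOCK) {
		fprintf(stderr, "Error: missing switch topology "
			"==> message deadlock!\n");
		success = false;
	}
	return success;
}
```

A caller that sees `false` would then either move on to the next configured routing algorithm or, with no fallback configured, leave the last good tables in place.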

Signed-off-by: Jim Schutt <jasc...@sandia.gov>
---
 opensm/opensm/osm_ucast_torus.c |    8 +++++---
 1 files changed, 5 insertions(+), 3 deletions(-)

diff --git a/opensm/opensm/osm_ucast_torus.c b/opensm/opensm/osm_ucast_torus.c
index 7108394..bc87757 100644
--- a/opensm/opensm/osm_ucast_torus.c
+++ b/opensm/opensm/osm_ucast_torus.c
@@ -7659,10 +7659,12 @@ bool routable_torus(struct torus *t, struct fabric *f)
 		}
 	}
 
-	if (t->flags & MSG_DEADLOCK)
+	if (t->flags & MSG_DEADLOCK) {
 		OSM_LOG(&t->osm->log, OSM_LOG_ERROR,
-			"Warning: missing switch topology "
-			"==> message deadlock possible!\n");
+			"Error: missing switch topology "
+			"==> message deadlock!\n");
+		success = false;
+	}
 	return success;
 }
 
-- 
1.5.6.GIT