Ketan,

Many thanks for your decisive leadership on this very important problem. (Thanks too to the folks who brought this to the IETF.)

The first thing I’d like to say is that it is high time the IETF tackled the broader issues of building efficient, resilient, responsive, and performant HPC clusters, and in particular AI (ML) clusters. These are multi-hop networks that operate at Layer 3, but many current attempts to solve this problem approach it at Layer 2, as an Ethernet problem. Some of the problems may indeed be fixable at the Ethernet layer, but many quite crucial ones require Layer 3 solutions.

I would like to see the IETF look at the problem more holistically, from an end-to-end point of view (workflow-wise as well as traffic-wise). I have a contribution to the RTGWG to get a higher-level conversation going. I understand this problem will be tackled in RTGWG until we have a better sense of the charter, and the IETF may then propose a WG-forming BoF if appropriate.

My contribution suggests that, in parallel with tackling the problem of congestion detection and notification, the IETF also focus on reducing the probability that congestion occurs in the first place; similarly for network failures. The approach proposes scheduling network resources in conjunction with compute resources; scheduling network resources should hopefully reduce the chances of congestion, and can also put in place protection paths should a node or link fail. Pavan Beeram will be presenting the elements of this idea in RTGWG. The hope is that the ensuing discussion will help the group come up with a holistic solution to optimizing network resources (in backend networks, in DCI networks and in the WAN) for ML workloads.

Cheers,
Kireeti.
