Hi all,

Packet-granular adaptive routing, also known as congestion-aware packet spray,
has been widely recognized as an ideal load-balancing mechanism for AI Ethernet
networks. Some cloud providers have implemented in-house packet spray
approaches, which are mainly built on proactive, real-time congestion
detection along all possible ECMP paths.

To achieve a non-blocking network fabric, it seems more suitable for
network switches to perform packet spray, since they can obtain
information about congestion between switches more quickly and easily.
Some major network chip vendors have developed proprietary congestion
notification mechanisms built on their own data-plane signaling.
However, to meet the aim of the UEC to deliver an Ethernet-based open,
interoperable, high-performance full-communications stack for the growing
network demands of AI and HPC at scale, it is worthwhile for us to pursue an
open, standards-based approach to packet spray. This draft is a step towards
that goal. Any comments and suggestions are welcome.

Best regards,
Xiaohu

From: [email protected] <[email protected]>
Date: Monday, January 29, 2024 16:36
To: Hang Wu <[email protected]>, Hongyi Huang <[email protected]>,
Junjie Wang <[email protected]>, Qingliang Zhang <[email protected]>,
Xiaohu Xu <[email protected]>, Yadong Liu <[email protected]>, Yinben
Xia <[email protected]>, Zongying He <[email protected]>
Subject: New Version Notification for draft-xu-lsr-fare-01.txt
A new version of Internet-Draft draft-xu-lsr-fare-01.txt has been successfully
submitted by Xiaohu Xu and posted to the
IETF repository.

Name:     draft-xu-lsr-fare
Revision: 01
Title:    Fully Adaptive Routing Ethernet
Date:     2024-01-29
Group:    Individual Submission
Pages:    9
URL:      https://www.ietf.org/archive/id/draft-xu-lsr-fare-01.txt
Status:   https://datatracker.ietf.org/doc/draft-xu-lsr-fare/
HTMLized: https://datatracker.ietf.org/doc/html/draft-xu-lsr-fare
Diff:     https://author-tools.ietf.org/iddiff?url2=draft-xu-lsr-fare-01

Abstract:

   Large language models (LLMs) like ChatGPT have become increasingly
   popular in recent years due to their impressive performance in
   various natural language processing tasks.  These models are built by
   training deep neural networks on massive amounts of text data, often
   consisting of billions or even trillions of parameters.  However, the
   training process for these models can be extremely resource-
   intensive, requiring the deployment of thousands or even tens of
   thousands of GPUs in a single AI training cluster.  Therefore, three-
   stage or even five-stage CLOS networks are commonly adopted for AI
   networks.  The non-blocking nature of the network becomes increasingly
   critical for large-scale AI models.  Therefore, adaptive routing is
   necessary to dynamically load balance traffic to the same destination
   over multiple ECMP paths, based on network capacity and even
   congestion information along those paths.



The IETF Secretariat

_______________________________________________
rtgwg mailing list
[email protected]
https://www.ietf.org/mailman/listinfo/rtgwg
