I have a few questions and observations after attending multiple AIDC side meetings over the last year or so. I am ordering the questions from the scale-out perspective and then zooming into each point down the supply chain. My comments relate to interoperability.
I think Petr may be able to give some perspective on this since he worked at a hyperscaler and now at a vendor that supports multiple hyperscalers. But deployers of models, please chime in. Thanks.

(1) Across clusters: How many deployments use a mixture of vendor switches and routers? In the core network, if clusters are connected via layer-3, multiple router vendors are likely and wouldn't surprise me. But if the clusters are layer-2, do you see cases where multiple switch vendors would be deployed? And in either case, is there anything special or new these devices need to do other than move IP bits? I see LLMs, or any AI models that need to be distributed, as just another IP application. I have stated the same about blockchain applications. The only feature I see that could be supported by existing/new protocols in the network is if the neural-net compute needs to be mobile. I can see this happening with general app-level agents, but in this data-center deployment (for training and inference at scale), I don't see the requirement.

(2) Intra-cluster: Will top-of-rack switches in different racks use different vendor switches? I would define an NVDA cluster with NVLink-72 as a piece of an intra-cluster and not a whole intra-cluster itself.

(3) Inter-server within a cluster: So now you have an NVLink-72 cluster and you want it to talk to another one, either intra-cluster or inter-cluster. Will those switches and/or routers be the same vendor? Would anyone deploy a model where an NVDA cluster and an AMD cluster operate on the same neural-net? And if they could, maybe the model could be split up between MoEs, where NVDA runs half of the MoEs and AMD the other half? I know this gets way complex when you look at all the versions of parallelism possible, but let's leave that out for now. Or, because it will be really hard, wanting the parallelism drives deployers to be single-vendor.
(4) NICs across servers within a cluster: Is it even possible to have two NVDA GPU-based servers where one runs an NVDA DPU and the other an AMD NIC?

(5) One neural-net supported by multiple GPU vendors: Can a Dell rack with NVDA GPUs interoperate with an SMCI rack with AMD GPUs? Is this desirable, or are the specs and timing so different you could never get this to work?

Having said this, I make the observation that if the answer to any of the questions above is yes (multiple vendors), then the IETF has a role in making interoperability work (the traditional IETF high-level charter). And is the IETF too late, and do the vendors actually want to interoperate? As always, interoperability may be good for the customer, but it gives less control to the vendor and more complex support issues for both the vendor and the customer. If interoperability is not desirable to the vendors, then what role does the IETF have?

I could be way off here on the goal of the side meeting. If it's to advance work in the IETF, then my questions and observations stand. If the goal of the side meeting is to present and learn about what's deployed, my vote says Yingzhen and Jeff should keep these meetings going.

Cheers,
Dino

> On Jul 23, 2025, at 11:40 AM, Yingzhen Qu <[email protected]> wrote:
>
> Hi all,
>
> Thanks to everyone who participated in the side meeting.
>
> We now have all the meeting materials including recording at the GitHub
> repository: https://github.com/Yingzhen-ietf/AIDC-IETF123
>
> Please let us know if you have any questions.
>
> Thanks,
> Jeff and Yingzhen
>
> On Mon, Jul 21, 2025 at 12:05 AM Yingzhen Qu <[email protected]> wrote:
> Hi all,
>
> We will continue the AIDC side meeting at IETF 123. The meeting is scheduled
> on Monday, July 21, 17:00 - 19:00 Madrid time.
>
> We will have two in-depth technical talks from Nvidia and Broadcom about
> networking for AI Data Centers.
>
> Petr Lapukhov (Nvidia): LLM Inference and Networking: Scale-up and Scale-out
> Costin Raiciu (Broadcom): Load Balancing Approaches in AI/ML Networks
>
> Room: El Escorial
> IETF Webex: https://ietf.webex.com/meet/ietfsidemeeting2
>
> Looking forward to seeing you there!
>
> Thanks,
> Jeff and Yingzhen
>
> --
> aidc mailing list -- [email protected]
> To unsubscribe send an email to [email protected]
_______________________________________________
rtgwg mailing list -- [email protected]
To unsubscribe send an email to [email protected]
