I have a few questions and observations after attending multiple AIDC side 
meetings over the last year or so. I am ordering the questions from the 
scale-out perspective and then zooming into each point down the supply chain. 
My comments relate to interoperability.

I think Petr may be able to give some perspective on this since he worked at a 
hyperscaler and now a vendor that supports multiple hyperscalers. But deployers 
of models, please chime in. Thanks.

(1) Across clusters: How many deployments use a mixture of vendor switches and 
routers? In the core network, if clusters are connected via layer 3, multiple 
router vendors are likely and wouldn't surprise me. But if the clusters are 
layer 2, do you see cases where multiple switch vendors would be deployed? And 
in either case, is there anything special or new these devices need to do 
other than move IP bits? I see LLMs, or any AI models that need to be 
distributed, as just another IP application. I have stated the same about 
blockchain applications. The only feature I see that could be supported by 
existing/new protocols in the network is if the neural-net compute needs to be 
mobile. I can see this happening with general app-level agents, but in this 
data-center deployment (for training and inference at scale), I don't see the 
requirement.

(2) Intra-cluster: Will top-of-rack switches in different racks use different 
vendor switches? I would define an NVDA cluster with NVLink-72 as a piece of 
an intra-cluster and not a whole intra-cluster itself.

(3) Inter-server within a cluster: So now you have an NVLink-72 cluster and 
you want it to talk to another one, either intra-cluster or inter-cluster; 
will those switches and/or routers be the same vendor? Would anyone deploy a 
model where an NVDA cluster and an AMD cluster operate on the same neural-net? 
And if they could, maybe the model could be split up between MoEs, where NVDA 
does half of the MoEs and AMD the other half? I know this gets very complex 
when you look at all the versions of parallelism possible, but let's leave 
that out for now. Or, because it will be really hard, the desire for 
parallelism drives deployers to be single-vendor.
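For what it's worth, here is a minimal sketch of what that MoE split could 
look like at the placement level. This is purely illustrative under my own 
assumptions: the pool names ("nvda_pool", "amd_pool") and the route() helper 
are made up, not any real framework's API. The point is that the router's 
token-to-expert dispatch is exactly the traffic that would have to cross the 
vendor boundary:

```python
# Illustrative only: static placement of MoE experts across two
# hypothetical accelerator pools. Pool names are made-up labels.

NUM_EXPERTS = 8

# First half of the experts on one vendor's pool, second half on the other.
placement = {
    e: ("nvda_pool" if e < NUM_EXPERTS // 2 else "amd_pool")
    for e in range(NUM_EXPERTS)
}

def route(token_expert_ids):
    """Given the expert id the gating network picked for each token,
    return which pool each token's activations must be shipped to."""
    return [placement[e] for e in token_expert_ids]

# Tokens whose gate chose experts 0, 5, 3, 7 would land on:
print(route([0, 5, 3, 7]))
# ['nvda_pool', 'amd_pool', 'nvda_pool', 'amd_pool']
```

Every token routed to an expert on the other pool implies a cross-vendor hop, 
so the all-to-all exchange in each MoE layer is where interoperability (or the 
lack of it) would bite.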

(4) NICs across servers within a cluster: Is it even possible to have two NVDA 
GPU-based servers where one runs an NVDA DPU and the other an AMD NIC? 

(5) One neural-net supported by multiple GPU vendors: Can a Dell rack with 
NVDA GPUs interoperate with an SMCI rack with AMD GPUs? Is this desirable, or 
are the specs and timing so different you could never get this to work?

Having said this, I make the observation that if the answer to any of the 
questions above is yes (multiple vendors), then the IETF has a role in making 
interoperability work (the traditional IETF high-level charter). But is the 
IETF too late, and do the vendors actually want to interoperate? As always, 
interoperability may be good for the customer, but it gives the vendor less 
control and creates more complex support issues for both the vendor and the 
customer. If interoperability is not desirable to the vendors, then what role 
does the IETF have?

I could be way off here on the goal of the side meeting. If it's to advance 
work in the IETF, then my questions and observations stand. If the goal of the 
side meeting is to present and learn about what's deployed, my vote says 
Yingzhen and Jeff should keep these meetings going.

Cheers,
Dino

> On Jul 23, 2025, at 11:40 AM, Yingzhen Qu <[email protected]> wrote:
> 
> Hi all,
> 
> Thanks to everyone who participated in the side meeting.
> 
> We now have all the meeting materials including recording at the GitHub 
> repository: https://github.com/Yingzhen-ietf/AIDC-IETF123
> 
> Please let us know if you have any questions.
> 
> Thanks,
> Jeff and Yingzhen
> 
> On Mon, Jul 21, 2025 at 12:05 AM Yingzhen Qu <[email protected]> wrote:
> Hi all,
> 
> We will continue the AIDC side meeting at IETF 123. The meeting is scheduled 
> on Monday, July 21, 17:00 - 19:00 Madrid time.
> 
> We will have two in-depth technical talks from Nvidia and Broadcom about 
> networking for AI Data Centers.
> 
> Petr Lapukhov (Nvidia): LLM Inference and Networking: Scale-up and Scale-out
> Costin Raiciu (Broadcom): Load Balancing Approaches in AI/ML Networks
> 
> Room: El Escorial
> IETF Webex: https://ietf.webex.com/meet/ietfsidemeeting2
> 
> Looking forward to seeing you there!
> 
> Thanks,
> Jeff and Yingzhen
> 
> -- 
> aidc mailing list -- [email protected]
> To unsubscribe send an email to [email protected]

_______________________________________________
rtgwg mailing list -- [email protected]
To unsubscribe send an email to [email protected]