Cool.  :)  https://en.wikipedia.org/wiki/Amdahl%27s_law suggests clusters
should not be too large, but from the sounds of it some of these training
clusters get pretty big.
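
For anyone following along, the back-of-the-envelope version of that (the
parallel fractions below are made-up numbers, just to show the shape of
the curve):

    # Amdahl's law: speedup S(N) = 1 / ((1 - p) + p / N)
    # p = parallelizable fraction of the work, N = number of nodes
    def amdahl_speedup(p, n):
        return 1.0 / ((1.0 - p) + p / n)

    for p in (0.95, 0.99):
        for n in (64, 1024, 65536):
            print(f"p={p}, N={n}: ~{amdahl_speedup(p, n):.0f}x")
    # Even at p=0.99 the speedup tops out near 100x no matter how many
    # nodes you add, which is why the serial/communication fraction
    # matters so much at these cluster sizes.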

All makes sense.  Incorporating these comments into the draft will
strengthen the document.

Multicast
a)
At AT&T/DirecTV, I worked on a massive multicast network.  There were
~3000-4000 multicast groups, with a mix of *,G and S,G.  They had ~300
places that picked up video, encoded it, converted it to multicast and then
delivered it across the WAN to multiple satellite uplink facilities.  Even
crazier, they had remote stat muxes in the uplink facilities that sent *,G
messages back to the encoders to adjust the bandwidth.  The first time I saw
the multicast route table, it blew my mind.
b)
Don't some machine learning scatter/gather paths need to communicate to
update the weights?  If we're ultra-focused on bandwidth, maybe this *is*
actually a case for multicast?
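
Rough sketch of what I mean, purely illustrative -- the addresses and
payload below are made up, and real collective libraries do this very
differently (ring/tree all-reduce, RDMA, etc.):

    import socket

    payload = b"\x00" * 256                     # stand-in for a weight update
    workers = [("10.0.0.%d" % i, 5000) for i in range(1, 9)]  # made-up addrs

    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)

    # Unicast fan-out: len(workers) copies of the same bytes on the wire.
    for addr in workers:
        sock.sendto(payload, addr)

    # Multicast: one copy on the wire, replication happens in the fabric.
    sock.setsockopt(socket.IPPROTO_IP, socket.IP_MULTICAST_TTL, 2)
    sock.sendto(payload, ("239.1.1.1", 5000))   # made-up group address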

On Fri, Jan 9, 2026 at 2:53 PM Tom Herbert <[email protected]> wrote:

> On Fri, Jan 9, 2026 at 2:03 PM dave seddon <[email protected]>
> wrote:
> >
> > G'day Tom and Haoyu,
> >
> > I'm trying to join the discussion about "draft: Scale-Up Network
> > Header (SUNH)", but I just joined the mail list, so I don't know if
> > posting to the subject line will do it.  ( Apologies if this breaks
> > threading )
>
> Hi Dave,
>
> Thanks for the comments!
>
> >
> > Drafts:
> > https://datatracker.ietf.org/doc/draft-herbert-sunh/
> > https://datatracker.ietf.org/doc/html/draft-song-ship-edge-05
> >
> > It seems like the discussion centers on the address length.
> >
> > The SUNH "1.1.  Problem statement" is very clear
> > "
> > 8% overhead in a 256 byte packet, and the forty bytes of IPv6 header
> > would be about 16% overhead
> > "
> >
> > Absolutely minimizing overhead makes sense currently, but for how long
> > do we expect this to be true?  Tom, since you've been talking to
> > people who run the largest AI clusters in the world, do you expect this
> > to hold true for the foreseeable future?
>
> It's actually working in the other direction. Header overhead in the
> data center is an emerging problem. For a long time, we didn't really
> care too much about header overhead in the data center, basically
> because link utilization wasn't very high and there were a lot of
> large packets to amortize the header overhead. In fact, many
> hyperscalers moved to IPv6, thereby doubling network header overhead,
> without even thinking twice.
>
> The problem we have in AI is that people are trying to drive
> utilization to 100%, there's much less heterogeneity in workloads, and
> packet sizes can be well under an MTU for some workloads. The 10%
> overhead that we didn't care about in the past is now popping up as a
> problem. (Of course, some of those who transitioned to IPv6 might be
> regretting that now ;-) )
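>
> To put rough numbers on it (back-of-the-envelope only, packet sizes
> picked just for illustration):
>
>     # Header bytes as a fraction of on-the-wire packet size
>     for hdr, name in ((20, "IPv4"), (40, "IPv6")):
>         for pkt in (256, 1500, 9000):
>             print(f"{name} header in a {pkt}B packet: {hdr / pkt:.1%}")
>     # IPv4 in a 256B packet is ~7.8%, IPv6 ~15.6%; at 9000B jumbo
>     # frames both drop below 0.5%, which is why nobody noticed until
>     # small packets at high utilization became the norm.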
>
> There are also secondary issues concerning the length of addresses. It's
> much more efficient to switch on 16-bit addresses than on 32- or
> 128-bit addresses.
>
> >
> >
> > Tom - I wonder if draft-herbert-sunh would benefit from a small
> > summary, maybe with a table, that compares the proposed addressing to
> > other protocols that are common within data centers?
> >
> > For example, comparing protocols by their header size, address length, and
> > "overhead":
> > - PCIe ( the PCI-SIG specs are paywalled, so it's hard to find a good source.
> > Maybe this:
> https://www.pearsonhighered.com/assets/samplechapter/0/3/2/1/0321156307.pdf
> > )
>
> Okay.
>
> > - InfiniBand ( addressing scheme found here on page 625
> > https://hjemmesider.diku.dk/~vinter/CC/Infinibandchap42.pdf )
> > - Ethernet
> > - Ethernet with 802.1Q ( and QinQ )
> > - IPv4
> > - IPv6
> > - SUNH
> > ...
> >
> > Now that the context is established, explain why 16 bits were chosen
> > for the source/destination address.  I can guess, but it's not in the
> > document: presumably you were considering the number of hosts in the domain.
>
> The numbers being thrown around for scale-up networks seem to be a
> couple of thousand nodes at most. 16 bits is a nice power-of-two size
> and, at 2^16 = 65,536 addresses, allows plenty of space to scale to
> reasonably large GPU clusters. Also, for scale-up we anticipate pretty
> flat networks with maybe two or three hops at most (which justifies
> smaller Hop Limits in the protocol).
>
> >
> > Nit pick (sorry). "care must be taken to ensure the minimum packet
> > size is maintained".  Might help to explain why.
>
> It's the minimum Ethernet packet size of 64 bytes. Without a payload
> length field like IPv4 has, we need some way to be able to send
> packets of less than 64 bytes logical length in 64 bytes on-the-wire
> without ambiguity as to what the real size is. I can add some text.
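>
> To make the ambiguity concrete (toy sketch only, not the actual SUNH
> wire format):
>
>     MIN_ETH_FRAME = 64
>
>     logical_pkt = b"\x01" * 30                           # 30-byte logical packet
>     wire_pkt = logical_pkt.ljust(MIN_ETH_FRAME, b"\x00") # padded up by the MAC
>
>     # The receiver sees 64 bytes either way.  With an explicit payload
>     # length field (as IPv4 has) it can recover the original 30 bytes;
>     # without one, it cannot tell padding from payload, hence the care
>     # needed around the minimum packet size.
>     assert len(wire_pkt) == MIN_ETH_FRAME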
>
> >
> > Re section "TCP and UDP in SUNH".  I remember recently Stuart from
> > Apple saying something pretty interesting about UDP: "If IP had port
> > numbers, you wouldn't really need a UDP header at all."
>
> Hee, I remember at my first IETF in the '90s Brian Carpenter was
> saying the same thing :-)
>
> >
> > Multicast?  It might be worth mentioning multicast and explaining why
> > it isn't discussed, e.g. there's no requirement for it, or it might be
> > considered in the future if a need arises.
>
> Well, we haven't needed it for the past forty years so why start using
> it now :-) Seriously though, AI applications aren't typically using
> multicast. I suppose someone might envision using multicast for
> collective offloads, but that sounds like something that might never
> get past the experimental stage. Also, since the SUNH address space is
> private, nothing precludes anyone from defining their own multicast
> addresses in the existing space. What I don't think we want is to
> define a multicast prefix or any mandated structure on SUNH addresses.
> Likewise, if someone wants to do hierarchical addressing in 16 bits
> that's their local decision.
>
> Tom
>
> >
> >
> >
> > Haoyu - I really like your draft-song-ship-edge-05 hierarchical
> > addressing stuff:
> > a)
> > This reminds me of good old Fibre Channel addressing, and I suppose
> > the more modern InfiniBand/RDMA.
> > b)
> > The words "variable length" are scary because variability clearly
> > isn't ideal for hardware.  I guess when you say "variable length" you
> > don't actually mean the addresses would vary dynamically, but that
> > there could be a set of fixed address lengths that could be
> > selected for different deployment scenarios?
>
> +1. "Variable length" is anathema to hardware engineers!
>
> > c)
> > One core concept of draft-song-ship-edge-05 is that traffic destined
> > for IoT devices needs a long, unique address, while the traffic
> > _sourced_ from these devices towards the data center can have a much
> > smaller destination address.
> > I recall Geoff Huston discussing IPv6 at a recent NANOG, where he
> > commented that, because of the pervasive use of anycast by a relatively
> > small number of CDNs, the Internet might only need a /24 worth of
> > addresses for 99% of all traffic.
> > Other network protocols with asymmetric addresses include:
> > - PCIe (Requester vs Completer addressing)
> > - In InfiniBand / RDMA, requests carry full destination addressing
> > (QPN + LID/GID + path), while responses omit it and are routed
> > implicitly using the established queue-pair and path state, making the
> > addressing directionally asymmetric.
> > - QUIC has explicit directional asymmetry in connection IDs
> >
> >
> > --
> > Regards,
> > Dave Seddon
> >
>


-- 
Regards,
Dave Seddon
+1 415 857 5102
_______________________________________________
Int-area mailing list -- [email protected]
To unsubscribe send an email to [email protected]
