Hi Dave and Tom,

In my previous email, I tried to explain the issue with using EtherType to define different address lengths. It’s doable if we confine the problem to supporting only intra-cluster communication (i.e., an independent scale-up network).
But if we already assume Ethernet as the L2 technology for the scale-up network, I see no reason to maintain a separate scale-out network, because it’s already Ethernet based. The two should converge for each node and share the same NIC per node. Thus, we have a new problem to solve. From the same NIC, a node should be able to communicate with any other node in the same cluster, in a different cluster, and even in a different data center. The hierarchical address scheme provides an IPv6-ready way to make each node IPv6 addressable, yet the node itself only keeps a much shorter address suffix whose length is determined by the cluster size.

I’d like people to think along this line to see if it makes sense. Otherwise, focusing on optimizing AFH may be too narrow. After all, that’s promoted by a switch silicon vendor. If they can already support it in their silicon, an alternative solution is less attractive unless significant new features are introduced.

Best regards,
Haoyu

From: dave seddon <[email protected]>
Sent: Saturday, January 10, 2026 10:29 AM
To: Tom Herbert <[email protected]>
Cc: Haoyu Song <[email protected]>; [email protected]
Subject: Re: [Int-area] Re: Regarding the draft: Scale-Up Network Header (SUNH)

Interesting discussion. Sorry, I hope I'm not bringing up too many things. Usually I find standards more helpful when they clearly rule things out, as much as they define the included items.

If you like I could try writing some minor updates. What's the best way to collaborate? Google doc or git?

> With all this in mind, if there were to be another address size in addition to 16 bits, I might opt for 24 bits. It breaks four byte alignment, but has the nice property that it maps 10/8 IPv4 addresses. SUNH with 24 bit addresses is 10 bytes, compared to 20 bytes IPv4 header. I suppose that might be worth it.

> Like I said, adding different address sizes could just be a matter of getting different EtherTypes for SUNH. But, I would only want to add support for more address sizes sparingly.

Makes sense. The draft could provide a hypothetical example of the 24 bit address case and the extra EtherTypes, and just leave the door open to this in future.

I don't know, do these data center networks use 802.1q? The 802.1q header is pretty chunky at 32 bits, with a 12-bit VLAN ID. QinQ would be even worse.

Actually, SUNH doesn't mention ECN yet. Certainly DCTCP, TCP Prague, and BBRv3 can all leverage ECN. Most switches can manipulate bits based on buffer depth, and networking teams would love the enhanced debuggability. Google's "PLB: congestion signals are simple and effective for network load balancing" is an interesting example of using ECN.
https://dl.acm.org/doi/pdf/10.1145/3544216.3544226

Another related topic is path tracing, which you can probably just rule out.
https://www.ietf.org/archive/id/draft-filsfils-spring-path-tracing-05.html
https://www.youtube.com/watch?v=X0J2Gz57Lds

On Sat, Jan 10, 2026 at 7:55 AM Tom Herbert <[email protected]> wrote:

On Fri, Jan 9, 2026 at 4:15 PM Haoyu Song <[email protected]> wrote:
>
> Hi Dave,
>
> Thank you for the comments. We are on the same page that a compact header
> with just enough address bits is critical in AI DCN (I would argue this also
> applies to the scale-out networks).
>
> I want to further discuss two points:

Hi Haoyu, thanks for the discussion!

>
> 1. The variable size address isn't that "scary" actually. We have verified
> the scheme with P4 and it's doable. Once it's realized in switch ASIC,
> there are no performance implications at all.

The size of addresses in lookups will have performance implications and cost effects as well. For instance, with 16 bit addresses a switch could do route lookup with a simple array lookup in SRAM, for 32 bit addresses we need a CAM or TCAM, for 128 bit addresses we need a CAM or TCAM 4x the size of the one for IPv4.
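As a rough illustration of the lookup point above, here is a minimal Python sketch of a direct-indexed forwarding table for 16-bit addresses. It is not taken from either draft: the names `make_table`, `add_route`, `lookup`, `direct_table_bytes`, the `DROP` marker, and the one-byte-per-entry port encoding are all assumptions made for the example. It only shows why a flat array is practical at 16 bits and stops being practical at 32 bits.

```python
# Hypothetical sketch: direct-indexed route lookup for 16-bit addresses.
# All names and the 1-byte port encoding are assumptions for illustration.

ADDR_BITS = 16
DROP = 0xFF  # assumed "no route" marker; real hardware would differ


def make_table(addr_bits=ADDR_BITS):
    """Allocate one entry per possible address, all initialised to DROP."""
    return bytearray([DROP]) * (1 << addr_bits)


def add_route(table, addr, port):
    """Install an output port for a destination address."""
    if not 0 <= addr < len(table):
        raise ValueError("address out of range for this table")
    if not 0 <= port < DROP:
        raise ValueError("port must fit below the DROP marker")
    table[addr] = port


def lookup(table, addr):
    """Route lookup is a single array index: no search, no hashing."""
    return table[addr]


def direct_table_bytes(addr_bits, entry_bytes=1):
    """Memory needed to direct-index every address of the given width."""
    return (1 << addr_bits) * entry_bytes


table = make_table()
add_route(table, 0x1234, 7)
print(lookup(table, 0x1234))
print(direct_table_bytes(16), "bytes for 16-bit addresses")
print(direct_table_bytes(32), "bytes for 32-bit addresses")
```

Under these assumptions a 16-bit table is 64 KiB, while direct-indexing 32-bit addresses would take 4 GiB, which is why wider addresses push the lookup toward CAM/TCAM or other search structures.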
> On the other hand, supporting different lengths has many advantages: it can
> scale with the cluster size without any waste, it supports communication
> between clusters with different sizes, it doesn't need to respin the chips
> in case the network scale changes, and the same standard would be applied to
> any scenarios as laid out in our paper "Adaptive Addresses for Next
> Generation IP Protocol in Hierarchical Networks" (ICNP 2020). Of course,
> there's a tradeoff on how fine the address length step should be supported
> (e.g., 1 bit, 2 bits, 4 bits, or 8 bits). This is subject to further study.

Waste is relative and we get diminishing returns in both directions. For instance, if we halve an IPv6 address then we save sixteen bytes per packet. That's significant. But if we halve the sixteen bit addresses of SUNH we'd save a whole two bytes per packet and that's nothing to write home about. On the other hand, suppose we double the sixteen bit addresses of SUNH then we have addresses the same size as IPv4 addresses. Granted, the IPv4 header has some other stuff in it, but at some point it starts to be a question of why not just use IPv4?

As for the granularity, my strong preference is to first keep addresses in units of eight bits. It's unpleasant for a lot of processors to deal with anything smaller, and eight-bit units are a long-held convention in IP addresses, port numbers, and Ethernet addresses. I'd also prefer maintaining four byte alignment of the transport layer like IPv4 and IPv6, but I suppose alignment is mostly historical at this time so maybe it's not super critical.

With all this in mind, if there were to be another address size in addition to 16 bits, I might opt for 24 bits. It breaks four byte alignment, but has the nice property that it maps 10/8 IPv4 addresses. SUNH with 24 bit addresses is 10 bytes, compared to 20 bytes IPv4 header. I suppose that might be worth it.

>
> 2. If we assume the scale up network would take Ethernet as the L2
> technology, it can be envisioned that the scale up and scale out network
> would eventually converge into a single network. Then we would consider that
> the L3 should also have a common standard (strictly speaking, if we only have
> a separate scale up network, we don't need L3 at all, because an L2 fabric is
> enough).

Strictly speaking, yes. But people also want network layer functionality like TOS and Hop Limits, so L3 enters the picture and we see people go down the path to reinventing L3 like AFH does.

> Thus, the variable size address can support a hierarchical network naturally
> mapping to the DCN topology and, more important, it allows seamless
> connection with the Internet, which runs IPv4/IPv6, so the inter-DC
> communication can be supported without any modification to the public
> network. I think this is a reason we need an IP-like L3 header which can
> translate into IPv4/v6. Note the SUNH proposal supports this already; the only
> issue is that the 16-bit address is overkill for the current cluster size, and
> the fixed length is not flexible.

Like I said, adding different address sizes could just be a matter of getting different EtherTypes for SUNH. But, I would only want to add support for more address sizes sparingly.

Tom

>
>
> Best regards,
> Haoyu
>
> -----Original Message-----
> From: dave seddon <[email protected]>
> Sent: Friday, January 9, 2026 2:03 PM
> To: [email protected]
> Subject: [Int-area] Regarding the draft: Scale-Up Network Header (SUNH)
>
> G'day Tom and Haoyu,
>
> I'm trying to join the discussion about "draft: Scale-Up Network Header
> (SUNH)", but I just joined the mail list, so I don't know if posting to the
> subject line will do it.
> (Apologies if this breaks threading)
>
> Drafts:
> https://datatracker.ietf.org/doc/draft-herbert-sunh/
> https://datatracker.ietf.org/doc/html/draft-song-ship-edge-05
>
> It seems like the discussion centers on the address length.
>
> The SUNH "1.1. Problem statement" is very clear "
> 8% overhead in a 256 byte packet, and the forty bytes of IPv6 header would be
> about 16% overhead "
>
> Absolutely minimizing overhead makes sense currently, but for how long do we
> expect this to be true? Tom, since you've been talking to people who run the
> largest AI clusters in the world, you expect this to hold true for the
> foreseeable future.
>
>
> Tom - I wonder if draft-herbert-sunh would benefit from a small summary,
> maybe with a table, that compares the proposed addressing to other protocols
> that are common within data centers?
>
> For example, comparing protocols by their header, address lengths, and
> "overhead"
> - PCIe ( IEEE have paywalls, so it's hard to find a good source.
> Maybe this:
> https://www.pearsonhighered.com/assets/samplechapter/0/3/2/1/0321156307.pdf
> )
> - InfiniBand ( addressing scheme found here on page 625
> https://hjemmesider.diku.dk/~vinter/CC/Infinibandchap42.pdf )
> - Ethernet
> - Ethernet with 802.1q ( and QinQ )
> - IPv4
> - IPv6
> - SUNH
> ...
>
> Now that the context is established, explain why 16 bits were chosen for the
> source/destination address. I guess, though it's not in the document, that you
> were considering the number of hosts in the domain.
>
> Nit pick (sorry). "care must be taken to ensure the minimum packet size is
> maintained". Might help to explain why.
>
> Re section "TCP and UDP in SUNH". I remember recently Stuart from Apple
> saying something pretty interesting about UDP: "If IP had port numbers, you
> wouldn't really need a UDP header at all."
>
> Multicast? It might be worth mentioning multicast and explaining why it
> isn't discussed, e.g. no requirement for this, or it might be considered in
> the future if a need arises.
>
>
>
> Haoyu - I really like your draft-song-ship-edge-05 hierarchical addressing
> stuff:
> a)
> This reminds me of good old Fibre Channel addressing, and I suppose the more
> modern InfiniBand/RDMA.
> b)
> The words "variable length" are scary because variability clearly isn't ideal
> for hardware. I guess when you say "variable length" you don't actually mean
> the addresses would vary dynamically, but that there could be a range of set
> fixed length addressing that could be selected for different deployment
> scenarios?
> c)
> One core concept of draft-song-ship-edge-05 is that traffic destined for IoT
> devices needs a long, unique address, while the traffic _sourced_ from these
> devices towards the data center can have a much smaller destination address.
> I recall Geoff Huston discussing IPv6 at a recent NANOG, where he commented
> that because of the pervasive use of anycast by a relatively small number of
> CDNs, the Internet might only need a /24 worth of addresses for 99% of
> all traffic.
> Other network protocols with asymmetric addresses include:
> - PCIe (Requester vs Completer addressing)
> - In InfiniBand / RDMA, requests carry full destination addressing (QPN +
> LID/GID + path), while responses omit it and are routed implicitly using the
> established queue-pair and path state, making the addressing directionally
> asymmetric.
> - QUIC has explicit directional asymmetry in connection IDs
>
>
> --
> Regards,
> Dave Seddon
>
> _______________________________________________
> Int-area mailing list -- [email protected]
> To unsubscribe send an email to [email protected]

--
Regards,
Dave Seddon
+1 415 857 5102
