On Mon, Jan 12, 2026 at 4:54 PM Haoyu Song <[email protected]> wrote:
>
> inline
>
> From: Tom Herbert <[email protected]>
> Sent: Monday, January 12, 2026 4:31 PM
> To: Haoyu Song <[email protected]>
> Cc: dave seddon <[email protected]>; int-area <[email protected]>
> Subject: Re: [Int-area] Re: Regarding the draft: Scale-Up Network Header
> (SUNH)
>
> On Mon, Jan 12, 2026, 3:47 PM Haoyu Song <[email protected]> wrote:
>
> Hi Tom,
>
> Please see my response inline.
>
> Best regards,
> Haoyu
>
> -----Original Message-----
> From: Tom Herbert <[email protected]>
> Sent: Saturday, January 10, 2026 7:55 AM
> To: Haoyu Song <[email protected]>
> Cc: dave seddon <[email protected]>; [email protected]
> Subject: Re: [Int-area] Re: Regarding the draft: Scale-Up Network Header
> (SUNH)
>
> On Fri, Jan 9, 2026 at 4:15 PM Haoyu Song <[email protected]> wrote:
> >
> > Hi Dave,
> >
> > Thank you for the comments. We are on the same page that a compact header
> > with just enough address bits is critical in AI DCN (I would argue this
> > also applies to the scale-out networks).
> >
> > I want to further discuss two points:
>
> Hi Haoyu, thanks for the discussion!
>
> >
> > 1. The variable size address isn't that "scary" actually. We have verified
> > the scheme with P4 and it's doable. Once it's realized in a switch ASIC,
> > there are no performance implications at all.
>
> The size of addresses in lookups will have performance implications and cost
> effects as well. For instance, with 16 bit addresses a switch could do route
> lookup with a simple array lookup in SRAM; for 32 bit addresses we need a CAM
> or TCAM; for 128 bit addresses we need a CAM or TCAM 4x the size of the one
> for IPv4.
>
> [HS] The hierarchical addressing scheme never looks up full addresses. At each
> level, it only searches the prefix assigned to the level. For example, each
> cluster has 1K nodes and we have 1K clusters in a DC.
> The lowest level node
> in a cluster has a 10-bit address and a 10-bit prefix. In a cluster, the nodes
> only use 10-bit addresses. If a node needs to talk to a node in another
> cluster, its address needs to be augmented with the 10-bit prefix (but the
> prefix is only stored in the gateway switch, which is transparent to the nodes
> in a cluster). Finally, the data center gateway switch holds a 108-bit
> prefix, which can be used to augment all the addresses in the data center
> to 128-bit IPv6 addresses. A small TCAM or a small direct index table for
> lookups is enough at each level. (Details can be found at
> https://www.researchgate.net/profile/Haoyu-Song/publication/347085487_Adaptive_Addresses_for_Next_Generation_IP_Protocol_in_Hierarchical_Networks/links/6070858da6fdcc5f77948ec2/Adaptive-Addresses-for-Next-Generation-IP-Protocol-in-Hierarchical-Networks.pdf?origin=publication_detail&_tp=eyJjb250ZXh0Ijp7ImZpcnN0UGFnZSI6InByb2ZpbGUiLCJwYWdlIjoicHVibGljYXRpb25Eb3dubG9hZCIsInByZXZpb3VzUGFnZSI6InB1YmxpY2F0aW9uIn19&__cf_chl_tk=SIPGEnoIS.7WtTH63F1auQRyVbZzJP7AprlQYXN7wGE-1768260809-1.0.1.1-lNehHWVj1D3bK0gRlaT4qD4Bw5rgC8_sVglRL1DD73A)
>
> >
> > On the other hand, supporting different lengths has many advantages: it
> > can scale with the cluster size without any waste, it supports
> > communication between clusters with different sizes, it avoids
> > respinning the chips if the network scale changes, and the same standard
> > would apply to any scenario, as laid out in our paper "Adaptive
> > Addresses for Next Generation IP Protocol in Hierarchical
> > Networks" (ICNP 2020). Of course, there's a tradeoff on how fine an address
> > length step should be supported (e.g., 1 bit, 2 bits, 4 bits, or 8 bits).
> > This is subject to further study.
>
> Waste is relative and we get diminishing returns in both directions.
> For instance, if we halve an IPv6 address then we save sixteen bytes per
> packet. That's significant.
> But if we halve the sixteen bit addresses of SUNH
> we'd save a whole two bytes per packet and that's nothing to write home
> about. On the other hand, suppose we double the sixteen bit addresses of SUNH;
> then we have addresses the same size as
> IPv4 addresses. Granted, the IPv4 header has some other stuff in it, but
> at some point it starts to be a question of why not just use IPv4?
>
> [HS] The benefits of hierarchical addressing are twofold: flexibility
> and compatibility with IPv6. We make each node IPv6 accessible, but
> internally we avoid the IPv6 header overhead. We see that in AI data
> centers, the supernode size ranges from 8 to 1K. I don't know how large the
> size can become in the future. A fine granularity can minimize the waste yet
> be ready to scale to any size.
>
> Haoyu,
>
> As I said, you get diminishing returns in finer granularity. For instance, if
> you halve sixteen bit addresses then the savings are only two bytes per packet.
> That's not even 1% of a 256 byte packet. I think it's going to be hard to
> justify those minuscule savings against the complexity of supporting variable
> length addresses and headers.
>
> [HS] I think we can assume the minimum packet size is 64 bytes. And I'm not
> arguing that we must have very fine length granularity. That can be
> determined as a tradeoff. The reason to support variable length addresses is to
> support inter-cluster communication while maintaining header efficiency.
>
> As for the granularity, my strong preference is to first keep addresses in
> units of eight bits. It's unpleasant for a lot of processors to deal with
> anything smaller, and byte units are a long held convention in IP addresses,
> port numbers, and Ethernet addresses. I'd also prefer maintaining four byte
> alignment of the transport layer like IPv4 and IPv6, but I suppose alignment
> is mostly historical at this time so maybe it's not super critical.
>
> With all this in mind, if there were to be another address size in addition
> to 16 bits, I might opt for 24 bits. It breaks four byte alignment, but has
> the nice property that it maps 10/8 IPv4 addresses.
> SUNH with 24 bit addresses is 10 bytes, compared to the 20 byte IPv4 header. I
> suppose that might be worth it.
>
> >
> > 2. If we assume the scale up network would take Ethernet as the L2
> > technology, it can be envisioned that the scale up and scale out networks
> > would eventually converge into a single network. Then we would consider
> > that the L3 should also have a common standard (strictly speaking, if we
> > only have a separate scale up network, we don't need L3 at all, because an
> > L2 fabric is enough).
>
> Strictly speaking, yes. But people also want network layer functionality like
> TOS and Hop Limits, so L3 enters the picture and we see people go down the
> path to reinventing L3 like AHF does.
>
> [HS] Up to now most scale up networks are like a full mesh point to point
> fabric. If L3 is needed, and Ethernet is used, I think it makes sense to
> converge the scale out and scale up networks into one. Then we may want to see
> any node IPv6 reachable, but within the data center we don't want the IPv6
> overhead. The hierarchical address provides a simple solution.
>
> > Thus, the variable size address can support a hierarchical network
> > naturally mapping to the DCN topology and, more importantly, it allows
> > seamless connection with the Internet, which runs IPv4/IPv6, so
> > inter-DC communication can be supported without any modification to the
> > public network. I think this is a reason we need an IP-like L3 header which
> > can translate into IPv4/v6. Note the SUNH proposal supports this already;
> > the only issue is that the 16-bit address is overkill for the current cluster
> > size, and the fixed length is not flexible.
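
As an illustration of the hierarchical scheme described in the [HS] comments
above, here is a minimal sketch using the widths given in the thread (10-bit
node address, 10-bit cluster prefix, 108-bit data center prefix). The function
name augment and the concrete prefix values are hypothetical and chosen only
for the example; this is not taken from either draft.

```python
import ipaddress

NODE_BITS = 10        # node address width inside a cluster (from the thread)
CLUSTER_BITS = 10     # cluster prefix held by the cluster gateway (from the thread)
DC_PREFIX_BITS = 108  # prefix held by the data center gateway (from the thread)


def augment(prefix: int, addr: int, addr_bits: int) -> int:
    """Prepend a prefix to an address that is addr_bits wide."""
    if not 0 <= addr < (1 << addr_bits):
        raise ValueError("address does not fit in addr_bits")
    return (prefix << addr_bits) | addr


# Hypothetical values, for illustration only.
# 2001:db8::/32 is the IPv6 documentation prefix, zero-padded here to 108 bits.
dc_prefix = 0x20010DB8 << (DC_PREFIX_BITS - 32)
cluster_prefix = 5
node = 7

# Inside the cluster only the 10-bit node address is carried.
intra_cluster = node
# Between clusters the cluster gateway's prefix is prepended (20 bits total).
inter_cluster = augment(cluster_prefix, node, NODE_BITS)
# Leaving the data center, the gateway's prefix completes a 128-bit address.
global_addr = augment(dc_prefix, inter_cluster, NODE_BITS + CLUSTER_BITS)

ipv6 = ipaddress.IPv6Address(global_addr)
print(intra_cluster, hex(inter_cluster), ipv6)
```

The widths add up to 10 + 10 + 108 = 128 bits, which is why the fully
augmented value can be read as an ordinary IPv6 address.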
>
> Like I said, adding different address sizes could just be a matter of getting
> different EtherTypes for SUNH. But I would only want to add support for more
> address sizes sparingly.
>
> [HS] Using the EtherType based solution, you will have a fixed header which
> can only be used for intra-cluster communication. Using hierarchical
> addresses, a node A in cluster X can use the same protocol header to
> communicate with a node B in cluster Y. In this case, the source address and
> the destination address are different in length because the destination
> address needs to include B's cluster prefix.
>
> The different EtherTypes allow for different sized addresses. I can imagine
> at most we ever need four sizes: 1, 2, 3, or 4 byte addresses. For anything
> bigger just use IPv6; for any odd number of bits just round up to the nearest
> byte size. If nodes in two clusters want to talk then a gateway can map
> addresses from one cluster to another, which is what anyone would need to do
> when connecting domains with different address spaces.
>
> [HS] hmmm…how is that achieved? Assume we have two clusters and each cluster
> uses 1 byte addresses (i.e., each cluster can have up to 256 nodes). Now node 0
> in cluster A wants to send a packet to node 0 in cluster B. What does the
> packet header look like in this case?
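
To make the two-cluster question concrete, here is a minimal sketch of one
flat numbering plan, assuming each cluster is given a contiguous 256-address
block of a 16 bit address space. The names CLUSTER_BASE, flat_address, and
pack_addresses are hypothetical, and the packed bytes show only a source and
destination address pair in an assumed order, not the actual SUNH header
layout.

```python
import struct

CLUSTER_SIZE = 256  # one-byte local addresses, as in the question

# Assumed flat plan: each cluster owns one contiguous block of the
# 16-bit address space (block bases are illustrative).
CLUSTER_BASE = {"A": 0x0000, "B": 0x0100}


def flat_address(cluster: str, node: int) -> int:
    """Map (cluster, local node number) to a 16-bit flat address."""
    if not 0 <= node < CLUSTER_SIZE:
        raise ValueError("node number must fit in one byte")
    return CLUSTER_BASE[cluster] + node


def pack_addresses(src: int, dst: int) -> bytes:
    """Pack two 16-bit addresses in network byte order.

    Only the address pair is shown; field order and the rest of the
    header are assumptions, not the draft's wire format.
    """
    return struct.pack("!HH", src, dst)


src = flat_address("A", 0)  # node 0 in cluster A
dst = flat_address("B", 0)  # node 0 in cluster B
wire = pack_addresses(src, dst)
print(hex(src), hex(dst), wire.hex())
```

Under this assumed plan node 0 in cluster A is 0x0000 and node 0 in cluster B
is 0x0100, and both fit the same fixed-size address fields.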
Just use 16 bit addresses. Cluster A's addresses are 0x0-0xff and cluster
B's addresses are 0x100-0x1ff. Packet header is just a SUNH header.

Tom

>
> Tom
>
> Tom
> >
> > Best regards,
> > Haoyu
> >
> > -----Original Message-----
> > From: dave seddon <[email protected]>
> > Sent: Friday, January 9, 2026 2:03 PM
> > To: [email protected]
> > Subject: [Int-area] Regarding the draft: Scale-Up Network Header
> > (SUNH)
> >
> > G'day Tom and Haoyu,
> >
> > I'm trying to join the discussion about "draft: Scale-Up Network
> > Header (SUNH)", but I just joined the mail list, so I don't know if
> > posting to the subject line will do it. ( Apologies if this breaks
> > threading )
> >
> > Drafts:
> > https://datatracker.ietf.org/doc/draft-herbert-sunh/
> > https://datatracker.ietf.org/doc/html/draft-song-ship-edge-05
> >
> > It seems like the discussion centers on the address length.
> >
> > The SUNH "1.1.
> > Problem statement" is very clear "
> > 8% overhead in a 256 byte packet, and the forty bytes of IPv6 header would
> > be about 16% overhead "
> >
> > Absolutely minimizing overhead makes sense currently, but for how long do
> > we expect this to be true? Tom, since you've been talking to people who
> > run the largest AI clusters in the world, you expect this to hold true for
> > the foreseeable future.
> >
> > Tom - I wonder if draft-herbert-sunh would benefit from a small summary,
> > maybe with a table, that compares the proposed addressing to other
> > protocols that are common within data centers?
> >
> > For example, comparing protocols by their header, address lengths, and
> > "overhead"
> > - PCIe ( IEEE have paywalls, so it's hard to find a good source.
> > Maybe this:
> > https://www.pearsonhighered.com/assets/samplechapter/0/3/2/1/0321156307.pdf
> > )
> > - InfiniBand ( addressing scheme found here on page 625
> > https://hjemmesider.diku.dk/~vinter/CC/Infinibandchap42.pdf )
> > - Ethernet
> > - Ethernet with 802.1Q ( and QinQ )
> > - IPv4
> > - IPv6
> > - SUNH
> > ...
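
As a rough illustration of the kind of comparison being asked for here, the
sketch below computes header overhead as header bytes divided by total packet
bytes. The 20, 40, and 10 byte header sizes and the 64 and 256 byte packet
sizes come from this thread; the 8 byte figure for SUNH with 16 bit addresses
is an inference (the 10 byte header minus one byte per address), not a number
quoted from the draft. The names HEADER_BYTES and overhead_percent are
hypothetical.

```python
# Header sizes in bytes. All but the last are stated in the thread.
HEADER_BYTES = {
    "IPv4": 20,
    "IPv6": 40,
    "SUNH, 24-bit addresses": 10,
    "SUNH, 16-bit addresses": 8,  # assumption: inferred, not quoted from the draft
}


def overhead_percent(header_bytes: int, packet_bytes: int) -> float:
    """Header size as a percentage of the whole packet."""
    return 100.0 * header_bytes / packet_bytes


for packet_bytes in (64, 256):
    for name, header_bytes in HEADER_BYTES.items():
        pct = overhead_percent(header_bytes, packet_bytes)
        print(f"{name:24s} {packet_bytes:4d} byte packet: {pct:5.1f}%")
```

With these inputs the IPv4 and IPv6 rows for a 256 byte packet round to the
8% and 16% quoted from the problem statement, and a two byte saving on a 256
byte packet comes out below 1%.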
> >
> > Now that the context is established, explain why 16 bits were chosen for
> > the source/destination address. My guess, though it's not in the document,
> > is that you were considering the number of hosts in the domain.
> >
> > Nitpick (sorry). "care must be taken to ensure the minimum packet size is
> > maintained". Might help to explain why.
> >
> > Re section "TCP and UDP in SUNH". I remember recently Stuart from Apple
> > saying something pretty interesting about UDP: "If IP had port numbers, you
> > wouldn't really need a UDP header at all."
> >
> > Multicast? It might be worth mentioning multicast and explaining why it
> > isn't discussed, e.g. no requirement for this, or it might be considered
> > in the future if a need arises.
> >
> > Haoyu - I really like your draft-song-ship-edge-05 Hierarchical addressing
> > stuff:
> > a)
> > This reminds me of good old Fibre Channel addressing, and I suppose the
> > more modern InfiniBand/RDMA.
> > b)
> > The words "variable length" are scary because variability clearly isn't
> > ideal for hardware. I guess when you say "variable length" you don't
> > actually mean the addresses would vary dynamically, but that there could be
> > a range of fixed length address formats that could be selected for different
> > deployment scenarios?
> > c)
> > One core concept of draft-song-ship-edge-05 is that traffic destined for
> > IoT devices needs a long, unique address, while the traffic _sourced_ from
> > these devices towards the data center can have a much smaller destination
> > address.
> > I recall Geoff Huston discussing IPv6 at a recent NANOG, where he commented
> > that because of the pervasive use of anycast by a relatively small number
> > of CDNs, the Internet might only need a /24 worth of addresses for 99%
> > of all traffic.
> > Other network protocols with asymmetric addresses include:
> > - PCIe (Requester vs Completer addressing)
> > - In InfiniBand / RDMA, requests carry full destination addressing (QPN +
> > LID/GID + path), while responses omit it and are routed implicitly using
> > the established queue-pair and path state, making the addressing
> > directionally asymmetric.
> > - QUIC has explicit directional asymmetry in connection IDs
> >
> > --
> > Regards,
> > Dave Seddon
> >
> > _______________________________________________
> > Int-area mailing list -- [email protected] To unsubscribe send an
> > email to [email protected]
