On Mon, Jan 12, 2026, 3:47 PM Haoyu Song <[email protected]> wrote:
> Hi Tom,
>
> Please see my response inline.
>
> Best regards,
> Haoyu
>
> -----Original Message-----
> From: Tom Herbert <[email protected]>
> Sent: Saturday, January 10, 2026 7:55 AM
> To: Haoyu Song <[email protected]>
> Cc: dave seddon <[email protected]>; [email protected]
> Subject: Re: [Int-area] Re: Regarding the draft: Scale-Up Network Header (SUNH)
>
> On Fri, Jan 9, 2026 at 4:15 PM Haoyu Song <[email protected]> wrote:
> >
> > Hi Dave,
> >
> > Thank you for the comments. We are on the same page that a compact
> > header with just enough address bits is critical in AI DCN (I would argue
> > this also applies to the scale-out networks).
> >
> > I want to further discuss two points:
>
> Hi Haoyu, thanks for the discussion!
>
> >
> > 1. The variable size address isn't that "scary" actually. We have
> > verified the scheme with P4 and it's doable. Once it's realized in switch
> > ASIC, there's no performance implications at all.
>
> The size of addresses in lookups will have performance implications and
> cost effects as well. For instance, with 16 bit addresses a switch could do
> route lookup with a simple array lookup in SRAM, for 32 bit addresses we
> need a CAM or TCAM, for 128 bit addresses we need a CAM or TCAM 4x the size
> of the one for IPv4.
>
> [HS] The hierarchical addressing scheme never lookups full addresses. At
> each level, it only searches the prefix assigned to the level. For example,
> each cluster has 1K nodes and we have 1K clusters in a DC. The lowest level
> node in a cluster has a 10bit address and 10bit prefix. In a cluster, the
> nodes only uses 10bit addresses. If it needs to talk to another node in
> another cluster, its address needs to be augmented with the 10bit prefix
> (but the prefix is only stored in the gateway switch, which is oblivious to
> the nodes in a cluster). Finally, the data center gateway switch holds a
> 108bit prefixes, which can be used to augment all the addresses in the data
> center to 128bit IPv6 addresses.
> A small TCAM or a small direct index table
> for lookups is enough at each level. (details can be found at
> https://www.researchgate.net/profile/Haoyu-Song/publication/347085487_Adaptive_Addresses_for_Next_Generation_IP_Protocol_in_Hierarchical_Networks/links/6070858da6fdcc5f77948ec2/Adaptive-Addresses-for-Next-Generation-IP-Protocol-in-Hierarchical-Networks.pdf?origin=publication_detail&_tp=eyJjb250ZXh0Ijp7ImZpcnN0UGFnZSI6InByb2ZpbGUiLCJwYWdlIjoicHVibGljYXRpb25Eb3dubG9hZCIsInByZXZpb3VzUGFnZSI6InB1YmxpY2F0aW9uIn19&__cf_chl_tk=SIPGEnoIS.7WtTH63F1auQRyVbZzJP7AprlQYXN7wGE-1768260809-1.0.1.1-lNehHWVj1D3bK0gRlaT4qD4Bw5rgC8_sVglRL1DD73A
> )
> >
> > On the other hand, supporting different lengths have many advantages: it
> > can scale with the cluster size without any waste, it supports
> > communication between clusters with different sizes, it doesn't need to
> > respin the chips in case the network scale changes, and the same standard
> > would be applied to any scenarios as laid out in our paper "Adaptive
> > Addresses for Next Generation IP Protocol in Hierarchical
> > Networks"(ICNP2020). Of course, there's a tradeoff on how fine the address
> > length step should be supported (e.g., 1 bit, 2 bits, 4 bits, or 8 bits).
> > This is subject to further study.
>
> Waste is relative and we get diminishing returns in both directions.
> For instance, if we halve an IPv6 address then we save sixteen bytes per
> packet. That's significant. But if we halve the sixteen bit addresses of
> SUNH we'd save a whole two bytes per packet and that's nothing to write
> home about. On the other hand, suppose we double the sixteen bit addresses
> of SUNH then we have addresses the same size as IPv4 addresses. Grant it,
> IPv4 header has some other stuff in the header,
> but at some point it starts to be a question of why not just use IPv4?
>
> [HS] The benefits of the hierarchical addressing are two folds:
> flexibility and the compatibility to IPv6.
> We make each node IPv6
> accessible but internally, it avoids the IPv6 header overhead. We see that
> in AI data center, the supernode size ranges from 8 to 1K. I don't know how
> large the size can become in the future. A fine granularity can minimize
> the waste yet be ready to scale to any size.

Haoyu,

As I said, you get diminishing returns with finer granularity. For instance,
if you halve sixteen-bit addresses, then the savings is only two bytes per
packet. That's not even 1% of a 256-byte packet. I think it's going to be
hard to justify those minuscule savings against the complexity of supporting
variable-length addresses and headers.

> As for the granularity, my strong preference is to first keep addresses in
> units of eight bits. It's unpleasant for a lot of processors to deal with
> anything smaller and is a long held convention in IP addresses, port
> numbers, and Ethernet addresses. I'd also prefer maintaining four byte
> alignment of the transport layer like IPv4 and IPv6, but I suppose
> alignment is mostly historical at this time so maybe it's not super
> critical.
>
> With all this in mind, if there were to be another address size in
> addition to 16 bits, I might opt for 24 bits. It breaks four byte
> alignment, but has the nice property that it maps 10/8 IPv4 addresses.
> SUNH with 24 bit addresses is 10 bytes, compared to 20 bytes IPv4 header.
> I suppose that might be worth it.
>
> >
> > 2. If we assume the scale up network would take ethernet as the L2
> > technology, it can be envisioned that the scale up and scale out network
> > would eventually converge into a single network. Then we would consider
> > that the L3 should also have a common standard (strictly speaking, if we
> > only have a separate scale up network, we don't need L3 at all, because an
> > L2 fabric is enough).
>
> Strictly speaking, yes. But people also want network layer functionality
> like TOS and Hop Limits so L3 enters the picture and we see people go down
> the path to reinventing L3 like AHF does.
>
> [HS] Up to now most scale up networks are like full mesh point to point
> fabric. If L3 are needed, and Ethernet is used, I think it makes sense to
> converge the scale out and scale up network into one. Then we may want to
> see any node IPv6 reachable but within the data center, we don't want the
> IPv6 overhead. The hierarchical address provides a simple solution.
>
> > Thus, the variable size address can support a hierarchical network
> > naturally mapping to the DCN topology and more important, it allows the
> > seamlessly connecting with the Internet which runs IPv4/IPv6 so the
> > inter-DC communication can be supported without any modification to the
> > public network. I think this is a reason we need an IP-like L3 header which
> > can translate into IPv4/v6. Note the SUNH proposal support this already,
> > the only issue is the 16-bit address is an overkill to the current cluster
> > size, and the fixed length is not flexible.
>
> Like I said, adding different address sizes could just be a matter of
> getting different EtherTypes for SUNH. But, I would only want to add
> support for more address sizes sparingly.
>
> [HS] Using the EtherType based solution, you will have a fixed header
> which can only be used for intra-cluster communication. Using hierarchical
> addresses, a node A in cluster X can uses the same protocol header to
> communicate with a node B in cluster Y. In this case, the source address
> and the destination address are different in length because the destination
> address needs to include B's cluster prefix.

The different EtherTypes allow for different-sized addresses. I can imagine
that at most we would ever need four sizes: 1, 2, 3, or 4 byte addresses. For
anything bigger, just use IPv6; for any odd number of bits, just round up to
the nearest byte size.
If nodes in two clusters want to talk then a gateway can map addresses from
one cluster to another, which is what anyone would need to do when
connecting domains with different address spaces.

Tom

>
> Tom
>
> >
> >
> > Best regards,
> > Haoyu
> >
> > -----Original Message-----
> > From: dave seddon <[email protected]>
> > Sent: Friday, January 9, 2026 2:03 PM
> > To: [email protected]
> > Subject: [Int-area] Regarding the draft: Scale-Up Network Header
> > (SUNH)
> >
> > G'day Tom and Haoyu,
> >
> > I'm trying to join the discussion about "draft: Scale-Up Network
> > Header (SUNH)", but I just joined the mail list, so I don't know if
> > posting to the subject line will do it. ( Apologies if this breaks
> > threading )
> >
> > Drafts:
> > https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Fdata
> > tracker.ietf.org%2Fdoc%2Fdraft-herbert-sunh%2F&data=05%7C02%7Chaoyu.so
> > ng%40futurewei.com%7C761f7d9f68e44860485008de50609d54%7C0fee8ff2a3b240
> > 189c753a1d5591fedc%7C1%7C1%7C639036573084186972%7CUnknown%7CTWFpbGZsb3
> > d8eyJFbXB0eU1hcGkiOnRydWUsIlYiOiIwLjAuMDAwMCIsIlAiOiJXaW4zMiIsIkFOIjoi
> > TWFpbCIsIldUIjoyfQ%3D%3D%7C0%7C%7C%7C&sdata=%2BNQeccfibgnpwtlX0mjTdVFp
> > ILI7xZlFP6Qh6KTuNHE%3D&reserved=0
> > https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Fdata
> > tracker.ietf.org%2Fdoc%2Fhtml%2Fdraft-song-ship-edge-05&data=05%7C02%7
> > Chaoyu.song%40futurewei.com%7C761f7d9f68e44860485008de50609d54%7C0fee8
> > ff2a3b240189c753a1d5591fedc%7C1%7C1%7C639036573084212755%7CUnknown%7CT
> > WFpbGZsb3d8eyJFbXB0eU1hcGkiOnRydWUsIlYiOiIwLjAuMDAwMCIsIlAiOiJXaW4zMiI
> > sIkFOIjoiTWFpbCIsIldUIjoyfQ%3D%3D%7C0%7C%7C%7C&sdata=O6TigIDyIICJv%2F%
> > 2Fabg49jSaZlz7%2B1aKKYyVc3elDI5U%3D&reserved=0
> >
> > It seems like the discussion centers on the address length.
> >
> > The SUNH "1.1.
> > Problem statement" is very clear "
> > 8% overhead in a 256 byte packet, and the forty bytes of IPv6 header
> > would be about 16% overhead "
> >
> > Absolutely minimizing overhead makes sense currently, but for how long
> > do we expect this to be true? Tom, since you've been talking to people who
> > run the largest AI clusters in the world, you expect this to hold true for
> > the foreseeable future.
> >
> >
> > Tom - I wonder if draft-herbert-sunh would benefit from a small summary,
> > maybe with a table, that compares the proposed addressing to other
> > protocols that are common within data centers?
> >
> > For example, comparing protocols by their header, address lengths, and
> > "overhead"
> > - PCIe ( IEEE have paywalls, so it's hard to find a good source.
> > Maybe this:
> > https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Fwww.
> > pearsonhighered.com%2Fassets%2Fsamplechapter%2F0%2F3%2F2%2F1%2F0321156
> > 307.pdf&data=05%7C02%7Chaoyu.song%40futurewei.com%7C761f7d9f68e4486048
> > 5008de50609d54%7C0fee8ff2a3b240189c753a1d5591fedc%7C1%7C1%7C6390365730
> > 84230845%7CUnknown%7CTWFpbGZsb3d8eyJFbXB0eU1hcGkiOnRydWUsIlYiOiIwLjAuM
> > DAwMCIsIlAiOiJXaW4zMiIsIkFOIjoiTWFpbCIsIldUIjoyfQ%3D%3D%7C0%7C%7C%7C&s
> > data=N5UpkmqCvA8hpe7ou2p%2B4cTeV6SZhS5C%2B6ZJZiWyuHQ%3D&reserved=0
> > )
> > - Infiniband ( addressing scheme found here on page 625
> > https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Fhjem
> > mesider.diku.dk%2F~vinter%2FCC%2FInfinibandchap42.pdf&data=05%7C02%7Ch
> > aoyu.song%40futurewei.com%7C761f7d9f68e44860485008de50609d54%7C0fee8ff
> > 2a3b240189c753a1d5591fedc%7C1%7C1%7C639036573084453477%7CUnknown%7CTWF
> > pbGZsb3d8eyJFbXB0eU1hcGkiOnRydWUsIlYiOiIwLjAuMDAwMCIsIlAiOiJXaW4zMiIsI
> > kFOIjoiTWFpbCIsIldUIjoyfQ%3D%3D%7C0%7C%7C%7C&sdata=0qEGPX7uWcVMChB7YuR
> > 4WdGX1Cdxea9BCUCqfArnpJA%3D&reserved=0 )
> > - Ethernet
> > - Ethernet with 802.1q ( and qnq )
> > - IPv4
> > - IPv6
> > - SUNH
> > ...
> >
> > Now that the context is established, explain why 16 bits were chosen for
> > the source/destination address. I guess, but it's not in the document; You
> > were considering the number of hosts in the domain.
> >
> > Nit pick (sorry). "care must be taken to ensure the minimum packet size
> > is maintained". Might help to explain why.
> >
> > Re section "TCP and UDP in SUNH". I remember recently Stuart from Apple
> > saying something pretty interesting about UDP: "If IP had port numbers, you
> > wouldn't really need a UDP header at all."
> >
> > Multicast? It might be worth mentioning multicast and explaining why it
> > isn't discussed. e.g. No requirement for this, or it might be considered
> > in the future if a need arises.
> >
> >
> >
> > Haoyu - I really like your draft-song-ship-edge-05 Hierarchical
> > addressing stuff:
> > a)
> > This reminds me of good old fiber channel addressing, and I suppose the
> > more modern Infiniband/RDMA.
> > b)
> > The words "variable length" are scary because variability clearly isn't
> > ideal for hardware. I guess when you say "variable length" you don't
> > actually mean the addresses would vary dynamically, but that there could be
> > a range of set fixed length addressing that could be selected for different
> > deployment scenarios?
> > c)
> > One core concept of draft-song-ship-edge-05, is that traffic destined
> > for IoT devices needs a long, unique address, while the traffic _sourced_
> > from these devices towards the data center can have a much smaller
> > destination address.
> > I recall Geoff Huston discussing IPv6 at a recent NANOG, where he
> > commented that because of the pervasive use of anycast by a relatively
> > small number of CDNs, that the Internet might only need a /24 worth of
> > addresses for 99% of all traffic.
> > Other network protocols with asymmetric addresses include:
> > - PCIe (Requester vs Completer addressing)
> > - In InfiniBand / RDMA, requests carry full destination addressing (QPN
> > + LID/GID + path), while responses omit it and are routed implicitly using
> > the established queue-pair and path state, making the addressing
> > directionally asymmetric.
> > - QUIC has explicit directional asymmetry in connection IDs
> >
> >
> > --
> > Regards,
> > Dave Seddon
> >
> > _______________________________________________
> > Int-area mailing list -- [email protected] To unsubscribe send an
> > email to [email protected]
>
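
As a rough check of the arithmetic in this thread, here is a minimal Python sketch. It uses only figures quoted above: a 20-byte IPv4 header, a 40-byte IPv6 header, a 256-byte packet, and a 10-byte SUNH header when addresses are 24 bits. The 4-byte fixed portion of SUNH is inferred from that last figure and is an assumption, not something taken from the draft. The names sunh_header_bytes, overhead_pct and augment are hypothetical helpers for illustration. The last function mimics the 10 + 10 + 108 bit prefix example from the [HS] comments and is not an implementation of either draft.

```python
# Figures quoted in the thread.
IPV4_HEADER_BYTES = 20
IPV6_HEADER_BYTES = 40

# Assumption: inferred from "SUNH with 24 bit addresses is 10 bytes",
# i.e. 10 - 2 * 3 = 4 bytes that are not source/destination address.
SUNH_FIXED_BYTES = 4


def sunh_header_bytes(addr_bytes):
    """Header size for a SUNH-like header with two addresses of addr_bytes each."""
    return SUNH_FIXED_BYTES + 2 * addr_bytes


def overhead_pct(header_bytes, packet_bytes=256):
    """Header bytes as a percentage of the whole packet."""
    return 100.0 * header_bytes / packet_bytes


def augment(prefix, local, local_bits):
    """Prepend a prefix to a shorter local address (hierarchical addressing)."""
    if local >> local_bits:
        raise ValueError("local address does not fit in local_bits")
    return (prefix << local_bits) | local


if __name__ == "__main__":
    for name, size in (("IPv4", IPV4_HEADER_BYTES), ("IPv6", IPV6_HEADER_BYTES)):
        print(f"{name}: {size} bytes, {overhead_pct(size):.1f}% of a 256-byte packet")
    for addr_bytes in (1, 2, 3, 4):
        size = sunh_header_bytes(addr_bytes)
        print(f"SUNH-like, {addr_bytes}-byte addresses: {size} bytes, "
              f"{overhead_pct(size):.1f}%")

    # Halving 16-bit addresses saves two bytes per packet.
    saved = sunh_header_bytes(2) - sunh_header_bytes(1)
    print(f"halving 16-bit addresses saves {saved} bytes ({overhead_pct(saved):.2f}%)")

    # 10-bit node address, 10-bit cluster prefix, 108-bit data center prefix.
    in_dc = augment(prefix=3, local=5, local_bits=10)            # 20 bits inside the DC
    full = augment(prefix=1 << 107, local=in_dc, local_bits=20)  # 108 + 20 bits
    print(f"full address is {full.bit_length()} bits wide")
```

Under these assumptions, the one-byte through four-byte address sizes differ by two bytes per step, which is the diminishing-returns point made above.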
