On Mon, Jan 12, 2026 at 4:54 PM Haoyu Song <[email protected]> wrote:
>
>
>
> inline
>
> From: Tom Herbert <[email protected]>
> Sent: Monday, January 12, 2026 4:31 PM
> To: Haoyu Song <[email protected]>
> Cc: dave seddon <[email protected]>; int-area <[email protected]>
> Subject: Re: [Int-area] Re: Regarding the draft: Scale-Up Network Header 
> (SUNH)
>
>
>
>
>
> On Mon, Jan 12, 2026, 3:47 PM Haoyu Song <[email protected]> wrote:
>
> Hi Tom,
>
> Please see my response inline.
>
> Best regards,
> Haoyu
>
> -----Original Message-----
> From: Tom Herbert <[email protected]>
> Sent: Saturday, January 10, 2026 7:55 AM
> To: Haoyu Song <[email protected]>
> Cc: dave seddon <[email protected]>; [email protected]
> Subject: Re: [Int-area] Re: Regarding the draft: Scale-Up Network Header 
> (SUNH)
>
> On Fri, Jan 9, 2026 at 4:15 PM Haoyu Song <[email protected]> wrote:
> >
> > Hi Dave,
> >
> > Thank you for the comments. We are on the same page that a compact header 
> > with just enough address bits is critical in AI DCN (I would argue this 
> > also applies to the scale-out networks).
> >
> > I want to further discuss two points:
>
> Hi Haoyu, thanks for the discussion!
>
> >
> > 1. The variable size address isn't actually that "scary". We have verified 
> > the scheme with P4 and it's doable. Once it's realized in a switch ASIC, 
> > there are no performance implications at all.
>
> The size of addresses in lookups will have performance implications and cost 
> effects as well. For instance, with 16 bit addresses a switch could do route 
> lookup with a simple array lookup in SRAM, for 32 bit addresses we need a CAM 
> or TCAM, for 128 bit addresses we need a CAM or TCAM 4x the size of the one 
> for IPv4.
>
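As an illustration of the direct array lookup mentioned above, here is a minimal sketch of an exact-match forwarding table indexed by a 16 bit destination address. The names `NO_ROUTE`, `make_table`, `add_route` and `lookup` are hypothetical and not taken from any draft; a software array only stands in for the SRAM table.

```python
from array import array

ADDRESS_BITS = 16   # address width discussed in the thread
NO_ROUTE = 0xFFFF   # hypothetical sentinel meaning "no entry installed"


def make_table():
    """Allocate one 16 bit slot per possible address (2**16 slots)."""
    return array("H", [NO_ROUTE]) * (1 << ADDRESS_BITS)


def add_route(table, dst_addr, out_port):
    """Install an output port for one destination address."""
    table[dst_addr] = out_port


def lookup(table, dst_addr):
    """Exact-match lookup is a single indexed read; nothing is searched."""
    port = table[dst_addr]
    return None if port == NO_ROUTE else port


table = make_table()
add_route(table, 0x0102, 7)
print(lookup(table, 0x0102))  # installed entry
print(lookup(table, 0x0103))  # no entry
print(len(table))             # number of slots
```

The same approach at 32 bits would need 2**32 slots, which is the reason the text above turns to a CAM or TCAM at that width.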
> [HS] The hierarchical addressing scheme never looks up full addresses. At each 
> level, it only searches the prefix assigned to that level. For example, say 
> each cluster has 1K nodes and we have 1K clusters in a DC. The lowest level 
> node in a cluster has a 10-bit address and a 10-bit prefix. Within a cluster, 
> the nodes only use 10-bit addresses. If a node needs to talk to a node in 
> another cluster, its address needs to be augmented with the 10-bit prefix (but 
> the prefix is only stored in the gateway switch, so the nodes in a cluster are 
> unaware of it). Finally, the data center gateway switch holds a 108-bit 
> prefix, which can be used to augment all the addresses in the data center 
> to 128-bit IPv6 addresses. A small TCAM or a small direct index table for 
> lookups is enough at each level. (Details can be found at 
> https://www.researchgate.net/profile/Haoyu-Song/publication/347085487_Adaptive_Addresses_for_Next_Generation_IP_Protocol_in_Hierarchical_Networks/links/6070858da6fdcc5f77948ec2/Adaptive-Addresses-for-Next-Generation-IP-Protocol-in-Hierarchical-Networks.pdf?origin=publication_detail&_tp=eyJjb250ZXh0Ijp7ImZpcnN0UGFnZSI6InByb2ZpbGUiLCJwYWdlIjoicHVibGljYXRpb25Eb3dubG9hZCIsInByZXZpb3VzUGFnZSI6InB1YmxpY2F0aW9uIn19&__cf_chl_tk=SIPGEnoIS.7WtTH63F1auQRyVbZzJP7AprlQYXN7wGE-1768260809-1.0.1.1-lNehHWVj1D3bK0gRlaT4qD4Bw5rgC8_sVglRL1DD73A)
>
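The 10 + 10 + 108 bit example above can be sketched as follows. This is only an illustration under stated assumptions: the function names are hypothetical, the data center prefix is taken from the 2001:db8::/32 documentation range, and the real encoding is defined by the paper and draft, not by this code.

```python
import ipaddress

NODE_BITS = 10        # node address width within a cluster (from the example)
CLUSTER_BITS = 10     # cluster prefix width (from the example)
DC_PREFIX_BITS = 108  # data center prefix width (from the example)
LOCAL_BITS = NODE_BITS + CLUSTER_BITS


def to_inter_cluster(cluster_prefix, node_addr):
    """Cluster gateway step: prepend the cluster prefix (20 bits total)."""
    if not 0 <= node_addr < (1 << NODE_BITS):
        raise ValueError("node address does not fit in 10 bits")
    if not 0 <= cluster_prefix < (1 << CLUSTER_BITS):
        raise ValueError("cluster prefix does not fit in 10 bits")
    return (cluster_prefix << NODE_BITS) | node_addr


def to_ipv6(dc_prefix, dc_local_addr):
    """Data center gateway step: prepend the 108 bit prefix (128 bits total)."""
    if not 0 <= dc_prefix < (1 << DC_PREFIX_BITS):
        raise ValueError("data center prefix does not fit in 108 bits")
    if not 0 <= dc_local_addr < (1 << LOCAL_BITS):
        raise ValueError("local address does not fit in 20 bits")
    return ipaddress.IPv6Address((dc_prefix << LOCAL_BITS) | dc_local_addr)


# Hypothetical prefix: the top 108 bits of 2001:db8:: (documentation range).
dc_prefix = int(ipaddress.IPv6Address("2001:db8::")) >> LOCAL_BITS

local = to_inter_cluster(cluster_prefix=3, node_addr=5)
print(hex(local))                 # 0xc05
print(to_ipv6(dc_prefix, local))  # 2001:db8::c05
```

Each step only touches the bits added at that level, which matches the claim that no level needs a lookup on the full 128 bit address.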
>
> > On the other hand, supporting different lengths has many advantages: it 
> > can scale with the cluster size without any waste, it supports 
> > communication between clusters with different sizes, it avoids 
> > respinning the chips when the network scale changes, and the same standard 
> > can be applied to any scenario, as laid out in our paper "Adaptive 
> > Addresses for Next Generation IP Protocol in Hierarchical 
> > Networks" (ICNP 2020). Of course, there's a tradeoff on how fine an address 
> > length step should be supported (e.g., 1 bit, 2 bits, 4 bits, or 8 bits). 
> > This is subject to further study.
>
> Waste is relative and we get diminishing returns in both directions.
> For instance, if we halve an IPv6 address then we save sixteen bytes per 
> packet. That's significant. But if we halve the sixteen bit addresses of SUNH 
> we'd save a whole two bytes per packet and that's nothing to write home 
> about. On the other hand, suppose we double the sixteen bit addresses of SUNH; 
> then we have addresses the same size as IPv4 addresses. Granted, the IPv4 
> header has some other stuff in it, but at some point it starts to be a 
> question of why not just use IPv4?
>
> [HS] The benefits of hierarchical addressing are twofold: flexibility 
> and compatibility with IPv6. We make each node IPv6 accessible, but 
> internally it avoids the IPv6 header overhead. We see that in AI data 
> centers, the supernode size ranges from 8 to 1K. I don't know how large the 
> size can become in the future. A fine granularity can minimize the waste yet 
> be ready to scale to any size.
>
>
>
> Haoyu,
>
>
>
> As I said, you get diminishing returns with finer granularity. For instance, if 
> you halve sixteen bit addresses then the savings is only two bytes per packet. 
> That's not even 1% of a 256 byte packet. I think it's going to be hard to 
> justify those minuscule savings against the complexity of supporting variable 
> length addresses and headers.
>
>
>
> [HS] I think we can assume the minimum packet size is 64 bytes. And I’m not 
> arguing that we must have very fine length granularity. That can be 
> determined as a tradeoff. The reason to support variable length addresses is to 
> support inter-cluster communication while maintaining header efficiency.
>
>
>
>
>
>
>
> As for the granularity, my strong preference is, first, to keep addresses in 
> units of eight bits. It's unpleasant for a lot of processors to deal with 
> anything smaller, and byte granularity is a long held convention in IP 
> addresses, port numbers, and Ethernet addresses. I'd also prefer maintaining 
> four byte alignment of the transport layer like IPv4 and IPv6, but I suppose 
> alignment is mostly historical at this time so maybe it's not super critical.
>
> With all this in mind, if there were to be another address size in addition 
> to 16 bits, I might opt for 24 bits. It breaks four byte alignment, but has 
> the nice property that it maps 10/8 IPv4 addresses. SUNH with 24 bit 
> addresses is 10 bytes, compared to the 20 byte IPv4 header. I suppose that 
> might be worth it.
>
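The overhead figures traded in this thread can be checked with a few lines of arithmetic. The header sizes below are the ones quoted in the discussion (20 byte IPv4, 40 byte IPv6, 10 byte SUNH with 24 bit addresses); the packet sizes of 64 and 256 bytes are the two mentioned above, and the helper name `overhead_percent` is only illustrative.

```python
HEADER_BYTES = {
    "IPv4": 20,
    "IPv6": 40,
    "SUNH, 24 bit addresses": 10,
}


def overhead_percent(header_bytes, packet_bytes):
    """Header size as a percentage of the whole packet."""
    return 100.0 * header_bytes / packet_bytes


for packet_bytes in (64, 256):
    for name, size in HEADER_BYTES.items():
        pct = overhead_percent(size, packet_bytes)
        print(f"{packet_bytes:>3} byte packet, {name}: {pct:.1f}%")

# Halving 16 bit source and destination addresses saves one byte each.
saved_bytes = 2
for packet_bytes in (64, 256):
    pct = overhead_percent(saved_bytes, packet_bytes)
    print(f"{packet_bytes:>3} byte packet, 2 bytes saved: {pct:.2f}%")
```

At 256 bytes the two saved bytes are about 0.78% of the packet; at 64 bytes they are about 3.1%, which is why the assumed packet size matters to this argument.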
> >
> > 2. If we assume the scale up network would take Ethernet as the L2 
> > technology, it can be envisioned that the scale up and scale out networks 
> > would eventually converge into a single network. Then we would consider 
> > that the L3 should also have a common standard (strictly speaking, if we 
> > only have a separate scale up network, we don't need L3 at all, because an 
> > L2 fabric is enough).
>
> Strictly speaking, yes. But people also want network layer functionality like 
> TOS and Hop Limits, so L3 enters the picture and we see people go down the 
> path of reinventing L3 like AHF does.
>
> [HS] Up to now most scale up networks have been full mesh, point to point 
> fabrics. If L3 is needed, and Ethernet is used, I think it makes sense to 
> converge the scale out and scale up networks into one. Then we may want to make 
> any node IPv6 reachable, but within the data center we don't want the IPv6 
> overhead. The hierarchical address provides a simple solution.
>
> > Thus, the variable size address can support a hierarchical network 
> > naturally mapping to the DCN topology and, more importantly, it allows 
> > seamless connection with the Internet, which runs IPv4/IPv6, so 
> > inter-DC communication can be supported without any modification to the 
> > public network. I think this is a reason we need an IP-like L3 header which 
> > can translate into IPv4/v6. Note the SUNH proposal supports this already; 
> > the only issue is that the 16-bit address is overkill for the current 
> > cluster size, and the fixed length is not flexible.
>
> Like I said, adding different address sizes could just be a matter of getting 
> different EtherTypes for SUNH. But, I would only want to add support for more 
> address sizes sparingly.
>
> [HS] Using the EtherType based solution, you will have a fixed header which 
> can only be used for intra-cluster communication. Using hierarchical 
> addresses, a node A in cluster X can use the same protocol header to 
> communicate with a node B in cluster Y. In this case, the source address and 
> the destination address are different in length because the destination 
> address needs to include B's cluster prefix.
>
>
>
> The different EtherTypes allow for different sized addresses. I can imagine 
> at most we ever need four sizes: 1, 2, 3, or 4 byte addresses. For anything 
> bigger just use IPv6; for any odd number of bits just round up to the nearest 
> byte size. If nodes in two clusters want to talk then a gateway can map 
> addresses from one cluster to another, which is what anyone would need to do 
> when connecting domains with different address spaces.
>
>
>
> [HS] Hmmm… how is that achieved? Assume we have two clusters and each cluster 
> uses 1 byte addresses (i.e., each cluster can have up to 256 nodes). Now node 0 
> in cluster A wants to send a packet to node 0 in cluster B. What does the 
> packet header look like in this case?

Just use 16 bit addresses. Cluster A's addresses are 0x0-0xff and
cluster B's addresses are 0x100-0x1ff. Packet header is just a SUNH
header.
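
A minimal sketch of that flat numbering, assuming 256 nodes per cluster as in the question. The helper names `flat_address` and `split` are hypothetical and only illustrate how the 0x0-0xff and 0x100-0x1ff ranges fall out of one shared 16 bit space.

```python
CLUSTER_SIZE = 0x100    # 256 nodes per cluster, as in the question
MAX_ADDRESS = 0xFFFF    # largest 16 bit address


def flat_address(cluster_index, node):
    """Number a node inside one shared 16 bit address space."""
    if cluster_index < 0 or not 0 <= node < CLUSTER_SIZE:
        raise ValueError("cluster index or node number out of range")
    addr = cluster_index * CLUSTER_SIZE + node
    if addr > MAX_ADDRESS:
        raise ValueError("address does not fit in 16 bits")
    return addr


def split(addr):
    """Recover (cluster_index, node) from a flat address."""
    return divmod(addr, CLUSTER_SIZE)


CLUSTER_A, CLUSTER_B = 0, 1

src = flat_address(CLUSTER_A, 0)  # node 0 in cluster A
dst = flat_address(CLUSTER_B, 0)  # node 0 in cluster B
print(hex(src), hex(dst))         # 0x0 0x100
print(split(0x1FF))               # (1, 255)
```

Both source and destination stay the same width here, so the header format does not change between intra-cluster and inter-cluster traffic.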

Tom

>
>
>
> Tom
>
>
>
>
>
> Tom
> >
> >
> > Best regards,
> > Haoyu
> >
> > -----Original Message-----
> > From: dave seddon <[email protected]>
> > Sent: Friday, January 9, 2026 2:03 PM
> > To: [email protected]
> > Subject: [Int-area] Regarding the draft: Scale-Up Network Header
> > (SUNH)
> >
> > G'day Tom and Haoyu,
> >
> > I'm trying to join the discussion about "draft: Scale-Up Network
> > Header (SUNH)", but I just joined the mail list, so I don't know if
> > posting to the subject line will do it.  ( Apologies if this breaks
> > threading )
> >
> > Drafts:
> > https://datatracker.ietf.org/doc/draft-herbert-sunh/
> > https://datatracker.ietf.org/doc/html/draft-song-ship-edge-05
> >
> > It seems like the discussion centers on the address length.
> >
> > The SUNH "1.1.  Problem statement" is very clear "
> > 8% overhead in a 256 byte packet, and the forty bytes of IPv6 header would 
> > be about 16% overhead "
> >
> > Absolutely minimizing overhead makes sense currently, but for how long do 
> > we expect this to be true?  Tom, since you've been talking to people who 
> > run the largest AI clusters in the world, you expect this to hold true for 
> > the foreseeable future.
> >
> >
> > Tom - I wonder if draft-herbert-sunh would benefit from a small summary, 
> > maybe with a table, that compares the proposed addressing to other 
> > protocols that are common within data centers?
> >
> > For example, comparing protocols by their header, address lengths, and 
> > "overhead"
> > - PCIe ( IEEE have paywalls, so it's hard to find a good source.
> > Maybe this:
> > https://www.pearsonhighered.com/assets/samplechapter/0/3/2/1/0321156307.pdf
> > )
> > - Infiniband ( addressing scheme found here on page 625
> > https://hjemmesider.diku.dk/~vinter/CC/Infinibandchap42.pdf )
> > - Ethernet
> > - Ethernet with 802.1Q ( and QinQ )
> > - IPv4
> > - IPv6
> > - SUNH
> > ...
> >
> > Now that the context is established, explain why 16 bits were chosen for 
> > the source/destination address.  I guess, though it's not in the document, 
> > that you were considering the number of hosts in the domain.
> >
> > Nit pick (sorry). "care must be taken to ensure the minimum packet size is 
> > maintained".  Might help to explain why.
> >
> > Re section "TCP and UDP in SUNH".  I remember recently Stuart from Apple 
> > saying something pretty interesting about UDP: "If IP had port numbers, you 
> > wouldn't really need a UDP header at all."
> >
> > Multicast?  It might be worth mentioning multicast and explaining why it 
> > isn't discussed.  e.g. No requirement for this, or it might be considered 
> > in the future if a need arises.
> >
> >
> >
> > Haoyu - I really like your draft-song-ship-edge-05 Hierarchical addressing 
> > stuff:
> > a)
> > This reminds me of good old Fibre Channel addressing, and I suppose the 
> > more modern InfiniBand/RDMA.
> > b)
> > The words "variable length" are scary because variability clearly isn't 
> > ideal for hardware.  I guess when you say "variable length" you don't 
> > actually mean the addresses would vary dynamically, but that there could be 
> > a range of set fixed length addressing that could be selected for different 
> > deployment scenarios?
> > c)
> > One core concept of draft-song-ship-edge-05, is that traffic destined for 
> > IoT devices needs a long, unique address, while the traffic _sourced_ from 
> > these devices towards the data center can have a much smaller destination 
> > address.
> > I recall Geoff Huston discussing IPv6 at a recent NANOG, where he commented 
> > that because of the pervasive use of anycast by a relatively small number 
> > of CDNs, the Internet might only need a /24 worth of addresses for 99% 
> > of all traffic.
> > Other network protocols with asymmetric addresses include:
> > - PCIe (Requester vs Completer addressing)
> > - In InfiniBand / RDMA, requests carry full destination addressing (QPN + 
> > LID/GID + path), while responses omit it and are routed implicitly using 
> > the established queue-pair and path state, making the addressing 
> > directionally asymmetric.
> > - QUIC has explicit directional asymmetry in connection IDs
> >
> >
> > --
> > Regards,
> > Dave Seddon
> >
> > _______________________________________________
> > Int-area mailing list -- [email protected] To unsubscribe send an
> > email to [email protected]
> >
