On Mon, Jan 12, 2026, 3:47 PM Haoyu Song <[email protected]> wrote:

> Hi Tom,
>
> Please see my response inline.
>
> Best regards,
> Haoyu
>
> -----Original Message-----
> From: Tom Herbert <[email protected]>
> Sent: Saturday, January 10, 2026 7:55 AM
> To: Haoyu Song <[email protected]>
> Cc: dave seddon <[email protected]>; [email protected]
> Subject: Re: [Int-area] Re: Regarding the draft: Scale-Up Network Header
> (SUNH)
>
> On Fri, Jan 9, 2026 at 4:15 PM Haoyu Song <[email protected]>
> wrote:
> >
> > Hi Dave,
> >
> > Thank you for the comments. We are on the same page that a compact
> header with just enough address bits is critical in AI DCN (I would argue
> this also applies to the scale-out networks).
> >
> > I want to further discuss two points:
>
> Hi Haoyu, thanks for the discussion!
>
> >
> > 1. The variable size address isn't that "scary" actually. We have
> > verified the scheme with P4 and it's doable. Once it's realized in a switch
> > ASIC, there are no performance implications at all.
>
> The size of addresses in lookups will have performance implications and
> cost effects as well. For instance, with 16 bit addresses a switch could do
> route lookup with a simple array lookup in SRAM, for 32 bit addresses we
> need a CAM or TCAM, for 128 bit addresses we need a CAM or TCAM 4x the size
> of the one for IPv4.
>
> [HS] The hierarchical addressing scheme never looks up full addresses. At
> each level, it searches only the prefix assigned to that level. For example,
> say each cluster has 1K nodes and we have 1K clusters in a DC. The lowest-level
> node in a cluster has a 10-bit address and a 10-bit prefix. Within a cluster, the
> nodes use only 10-bit addresses. If a node needs to talk to a node in
> another cluster, its address needs to be augmented with the 10-bit prefix
> (but the prefix is stored only in the gateway switch, which is oblivious to
> the nodes in a cluster). Finally, the data center gateway switch holds a
> 108-bit prefix, which can be used to augment all the addresses in the data
> center to 128-bit IPv6 addresses. A small TCAM or a small direct index table
> is enough for lookups at each level. (details can be found at
> https://www.researchgate.net/profile/Haoyu-Song/publication/347085487_Adaptive_Addresses_for_Next_Generation_IP_Protocol_in_Hierarchical_Networks/links/6070858da6fdcc5f77948ec2/Adaptive-Addresses-for-Next-Generation-IP-Protocol-in-Hierarchical-Networks.pdf?origin=publication_detail&_tp=eyJjb250ZXh0Ijp7ImZpcnN0UGFnZSI6InByb2ZpbGUiLCJwYWdlIjoicHVibGljYXRpb25Eb3dubG9hZCIsInByZXZpb3VzUGFnZSI6InB1YmxpY2F0aW9uIn19&__cf_chl_tk=SIPGEnoIS.7WtTH63F1auQRyVbZzJP7AprlQYXN7wGE-1768260809-1.0.1.1-lNehHWVj1D3bK0gRlaT4qD4Bw5rgC8_sVglRL1DD73A
> )
>
>
> > On the other hand, supporting different lengths has many advantages: it
> > can scale with the cluster size without any waste, it supports
> > communication between clusters of different sizes, it doesn't need a
> > chip respin if the network scale changes, and the same standard
> > applies to every scenario laid out in our paper "Adaptive
> > Addresses for Next Generation IP Protocol in Hierarchical
> > Networks" (ICNP 2020). Of course, there's a tradeoff in how fine an address
> > length step should be supported (e.g., 1 bit, 2 bits, 4 bits, or 8 bits).
> > This is subject to further study.
>
> Waste is relative and we get diminishing returns in both directions.
> For instance, if we halve an IPv6 address then we save sixteen bytes per
> packet. That's significant. But if we halve the sixteen bit addresses of
> SUNH we'd save a whole two bytes per packet and that's nothing to write
> home about. On the other hand, suppose we double the sixteen bit addresses
> of SUNH; then we have addresses the same size as IPv4 addresses. Granted,
> the IPv4 header has some other stuff in it, but at some point it starts
> to be a question of why not just use IPv4?
>
> [HS] The benefits of hierarchical addressing are twofold:
> flexibility and compatibility with IPv6. We make each node IPv6
> accessible, but internally we avoid the IPv6 header overhead. We see that
> in AI data centers, the supernode size ranges from 8 to 1K. I don't know how
> large it can become in the future. A fine granularity can minimize
> the waste yet be ready to scale to any size.


Haoyu,

As I said, you get diminishing returns with finer granularity. For instance,
if you halve sixteen bit addresses then the savings is only two bytes per
packet. That's not even 1% of a 256 byte packet. I think it's going to be
hard to justify those minuscule savings against the complexity of
supporting variable length addresses and headers.
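
The arithmetic above can be checked with a rough sketch like the one below. The function names are hypothetical and not part of SUNH; the only assumptions are that a header carries one source and one destination address and that the packet is 256 bytes, as in the example above.

```python
# Illustrative only: names are hypothetical, not taken from the SUNH draft.

def header_bytes_saved(old_addr_bits: int, new_addr_bits: int) -> float:
    """Bytes saved per packet when both the source and the destination
    address shrink from old_addr_bits to new_addr_bits."""
    addresses_per_header = 2  # assumption: one source plus one destination
    return addresses_per_header * (old_addr_bits - new_addr_bits) / 8


def percent_of_packet(saved_bytes: float, packet_bytes: int) -> float:
    """Savings expressed as a percentage of the whole packet."""
    return 100.0 * saved_bytes / packet_bytes


if __name__ == "__main__":
    packet = 256  # bytes, the packet size used in the discussion
    for label, old, new in [("IPv6 halved", 128, 64), ("SUNH halved", 16, 8)]:
        saved = header_bytes_saved(old, new)
        print(f"{label}: {saved:g} bytes, "
              f"{percent_of_packet(saved, packet):.2f}% of a {packet} byte packet")
```

Halving 128 bit addresses saves sixteen bytes per packet, while halving sixteen bit addresses saves two, which is where the "not even 1%" figure comes from.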

>
>
> As for the granularity, my strong preference is to first keep addresses in
> units of eight bits. It's unpleasant for a lot of processors to deal with
> anything smaller and is a long held convention in IP addresses, port
> numbers, and Ethernet addresses. I'd also prefer maintaining four byte
> alignment of the transport layer like IPv4 and IPv6, but I suppose
> alignment is mostly historical at this time so maybe it's not super
> critical.
>
> With all this in mind, if there were to be another address size in
> addition to 16 bits, I might opt for 24 bits. It breaks four byte
> alignment, but has the nice property that it maps 10/8 IPv4 addresses.
> SUNH with 24 bit addresses is 10 bytes, compared to 20 bytes IPv4 header.
> I suppose that might be worth it.
>
> >
> > 2. If we assume the scale up network would take Ethernet as the L2
> > technology, it can be envisioned that the scale up and scale out networks
> > would eventually converge into a single network. Then we would expect
> > the L3 to have a common standard as well (strictly speaking, if we
> > only have a separate scale up network, we don't need L3 at all, because an
> > L2 fabric is enough).
>
> Strictly speaking, yes. But people also want network layer functionality
> like TOS and Hop Limits so L3 enters the picture and we see people go down
> the path to reinventing L3 like AHF does.
>
> [HS] Up to now most scale up networks have been full mesh point to point
> fabrics. If L3 is needed and Ethernet is used, I think it makes sense to
> converge the scale out and scale up networks into one. Then we may want
> every node to be IPv6 reachable, but within the data center we don't want the
> IPv6 overhead. Hierarchical addressing provides a simple solution.
>
> > Thus, the variable size address can support a hierarchical network
> > that naturally maps to the DCN topology and, more importantly, it allows
> > seamless connection with the Internet, which runs IPv4/IPv6, so
> > inter-DC communication can be supported without any modification to the
> > public network. I think this is a reason we need an IP-like L3 header that
> > can translate into IPv4/v6. Note the SUNH proposal supports this already;
> > the only issues are that the 16-bit address is overkill for the current cluster
> > size, and the fixed length is not flexible.
>
> Like I said, adding different address sizes could just be a matter of
> getting different EtherTypes for SUNH. But, I would only want to add
> support for more address sizes sparingly.
>
> [HS] Using the EtherType based solution, you will have a fixed header
> which can only be used for intra-cluster communication. Using hierarchical
> addresses, a node A in cluster X can use the same protocol header to
> communicate with a node B in cluster Y. In this case, the source address
> and the destination address differ in length because the destination
> address needs to include B's cluster prefix.


The different EtherTypes allow for different sized addresses. I can imagine
that at most we'd ever need four sizes: 1, 2, 3, or 4 byte addresses. For anything
bigger, just use IPv6; for any odd number of bits, just round up to the nearest
byte size. If nodes in two clusters want to talk then a gateway can map
addresses from one cluster to another, which is what anyone would need to
do when connecting domains with different address spaces.
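
A minimal sketch of that sizing rule and of a gateway mapping, for illustration only. The names, the table layout, and the sample addresses are all hypothetical; nothing here comes from the SUNH draft, and a real gateway would do this in the data path rather than in a Python dict.

```python
from typing import Dict, Optional, Tuple

# Assumption: at most four address sizes (1, 2, 3, or 4 bytes), each of
# which would be signalled by its own EtherType.
MAX_ADDRESS_BYTES = 4


def address_bytes_for(bits_needed: int) -> Optional[int]:
    """Round an address width in bits up to whole bytes.

    Returns None when more than MAX_ADDRESS_BYTES would be needed,
    meaning "just use IPv6" in the terms used above.
    """
    if bits_needed < 1:
        raise ValueError("bits_needed must be positive")
    nbytes = (bits_needed + 7) // 8  # ceiling division
    return nbytes if nbytes <= MAX_ADDRESS_BYTES else None


def map_at_gateway(table: Dict[int, Tuple[str, int]],
                   local_addr: int) -> Tuple[str, int]:
    """Translate an address used inside one cluster into the
    (remote cluster, remote address) pair it stands for."""
    return table[local_addr]


# Hypothetical example: inside cluster X, address 0x0201 stands for
# node 0x17 in cluster Y.
gateway_table = {0x0201: ("cluster-Y", 0x17)}

if __name__ == "__main__":
    print(address_bytes_for(10))   # a 1K-node cluster needs 10 bits
    print(address_bytes_for(40))   # too wide for this scheme
    print(map_at_gateway(gateway_table, 0x0201))
```

With this rule a 1K-node cluster (10 bits) lands on two byte addresses, and anything needing more than 32 bits falls through to IPv6.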

Tom


>
> Tom
> >
> >
> > Best regards,
> > Haoyu
> >
> > -----Original Message-----
> > From: dave seddon <[email protected]>
> > Sent: Friday, January 9, 2026 2:03 PM
> > To: [email protected]
> > Subject: [Int-area] Regarding the draft: Scale-Up Network Header
> > (SUNH)
> >
> > G'day Tom and Haoyu,
> >
> > I'm trying to join the discussion about "draft: Scale-Up Network
> > Header (SUNH)", but I just joined the mail list, so I don't know if
> > posting to the subject line will do it.  ( Apologies if this breaks
> > threading )
> >
> > Drafts:
> > https://datatracker.ietf.org/doc/draft-herbert-sunh/
> > https://datatracker.ietf.org/doc/html/draft-song-ship-edge-05
> >
> > It seems like the discussion centers on the address length.
> >
> > The SUNH "1.1.  Problem statement" is very clear "
> > 8% overhead in a 256 byte packet, and the forty bytes of IPv6 header
> would be about 16% overhead "
> >
> > Absolutely minimizing overhead makes sense currently, but for how long
> > do we expect this to be true? Tom, since you've been talking to people who
> > run the largest AI clusters in the world, do you expect this to hold true for
> > the foreseeable future?
> >
> >
> > Tom - I wonder if draft-herbert-sunh would benefit from a small summary,
> maybe with a table, that compares the proposed addressing to other
> protocols that are common within data centers?
> >
> > For example, comparing protocols by their header, address lengths, and
> "overhead"
> > - PCIe ( IEEE have paywalls, so it's hard to find a good source.
> > Maybe this:
> > https://www.pearsonhighered.com/assets/samplechapter/0/3/2/1/0321156307.pdf
> > )
> > - InfiniBand ( addressing scheme found here on page 625
> > https://hjemmesider.diku.dk/~vinter/CC/Infinibandchap42.pdf )
> > - Ethernet
> > - Ethernet with 802.1Q ( and QinQ )
> > - IPv4
> > - IPv6
> > - SUNH
> > ...
> >
> > Now that the context is established, explain why 16 bits were chosen for
> > the source/destination address. My guess, though it's not in the document, is
> > that you were considering the number of hosts in the domain.
> >
> > Nitpick (sorry): "care must be taken to ensure the minimum packet size
> > is maintained". It might help to explain why.
> >
> > Re section "TCP and UDP in SUNH".  I remember recently Stuart from Apple
> saying something pretty interesting about UDP: "If IP had port numbers, you
> wouldn't really need a UDP header at all."
> >
> > Multicast?  It might be worth mentioning multicast and explaining why it
> isn't discussed.  e.g. No requirement for this, or it might be considered
> in the future if a need arises.
> >
> >
> >
> > Haoyu - I really like your draft-song-ship-edge-05 Hierarchical
> addressing stuff:
> > a)
> > This reminds me of good old Fibre Channel addressing, and I suppose the
> > more modern InfiniBand/RDMA.
> > b)
> > The words "variable length" are scary because variability clearly isn't
> > ideal for hardware. I guess when you say "variable length" you don't
> > actually mean the addresses would vary dynamically, but that there could be
> > a set of fixed address lengths that could be selected for different
> > deployment scenarios?
> > c)
> > One core concept of draft-song-ship-edge-05, is that traffic destined
> for IoT devices needs a long, unique address, while the traffic _sourced_
> from these devices towards the data center can have a much smaller
> destination address.
> > I recall Geoff Huston discussing IPv6 at a recent NANOG, where he
> > commented that, because of the pervasive use of anycast by a relatively
> > small number of CDNs, the Internet might only need a /24 worth of
> > addresses for 99% of all traffic.
> > Other network protocols with asymmetric addresses include:
> > - PCIe (Requester vs Completer addressing)
> > - In InfiniBand / RDMA, requests carry full destination addressing (QPN
> + LID/GID + path), while responses omit it and are routed implicitly using
> the established queue-pair and path state, making the addressing
> directionally asymmetric.
> > - QUIC has explicit directional asymmetry in connection IDs
> >
> >
> > --
> > Regards,
> > Dave Seddon
> >
> > _______________________________________________
> > Int-area mailing list -- [email protected] To unsubscribe send an
> > email to [email protected]
>
_______________________________________________
Int-area mailing list -- [email protected]
To unsubscribe send an email to [email protected]
