[bess] Re: Some comments on carrying bandwidth in BGP, and also on draft-xu-idr-fare-04 (was Re: [Idr] Working Group Last Call on draft-ietf-bess-ebgp-dmz)

Tiger Xu Tue, 19 May 2026 20:16:54 -0700

Hi Jeff,

Sorry for my late response. My replies are inline, marked with [Tiger].

From: Jeffrey Haas <[email protected]>
Date: Wednesday, May 6, 2026 22:29
To: Tiger Xu <[email protected]>
Cc: BESS <[email protected]>, idr@ietf.org <[email protected]>
Subject: Some comments on carrying bandwidth in BGP, and also on
draft-xu-idr-fare-04 (was Re: [Idr] Working Group Last Call on
draft-ietf-bess-ebgp-dmz)

[Speaking as an individual contributor in this response]

> On May 6, 2026, at 03:17, Tiger Xu <[email protected]> wrote:
> In essence, this changes the extended community from non‑transitive to
> transitive and introduces the concept of bandwidth aggregation – both of
> which were already present in draft-xu-idr-fare version -00.

First, a few words on the general use cases of carrying bandwidth/capacity in
BGP routes:

The link-bandwidth feature, its varying uses over the years, and the
varying transitivities used for it[1] have long been a mess. At a fundamental
level, the feature of "we've sent a value, apply a multipath ratio across all
paths for that destination based on the received values" has been broadly
consistent across the implementations.
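
As a rough illustration of that common behavior, here is a minimal Python sketch. It assumes each multipath entry carries one received bandwidth value; the helper name `multipath_weights` and the next-hop labels are made up for this example and do not come from any implementation.

```python
from fractions import Fraction

def multipath_weights(received_bw):
    """Map each next hop to its share of traffic, proportional to the
    bandwidth value received with that path (hypothetical helper)."""
    total = sum(received_bw.values())
    if total <= 0:
        # No usable values: fall back to equal-cost sharing.
        n = len(received_bw)
        return {nh: Fraction(1, n) for nh in received_bw}
    return {nh: Fraction(bw, total) for nh, bw in received_bw.items()}

# Three paths to one destination; only the ratio matters, not the unit.
weights = multipath_weights({"nh-a": 100, "nh-b": 100, "nh-c": 200})
for next_hop, share in weights.items():
    print(next_hop, share)
```

Because only the ratio is used, the values could be scaled by any common factor without changing the result.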

As the use cases started to split between underlay and overlay topology, how
multipath was handled at each layer, and how it interacts with load balancing,
became messier. One could observe that there are benefits to splitting the feature
carrying the signaling for the bandwidth/capacity based on the role the routes
are intended to serve, and also where they are applied. The fact that the
"role" of a given route is generally clear in most BGP contexts where BGP is
carrying the underlay routing has made it less of a deployment problem to use
the same signaling mechanism for both underlay and overlay. However, it does
mean that in places where "math" on those values has been necessary that having
an overloaded signaling mechanism complicates implementation and operational
logic.

The ready example covered in many places is that having hop-by-hop underlay
bandwidth capacity is great for load balancing across nexthops. However, when
it comes time to consider multipath load balancing for individual
overlay/service routes passing over links of disparate capacities, there tends
to be a need to apply math based on the desired network-wide load balancing.
Is it that you want to have a receiver acquire the minimal functional bandwidth
that the path can use? Or, is it a ratio for traffic to be broadly load balanced
behind a set of paths? And certainly there are more use cases.
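
The two questions above lead to different arithmetic. A hedged Python sketch, assuming a path is described simply as a list of per-hop capacities: the function names and the numbers are invented for illustration, and summing bottlenecks is only one possible reading of "aggregation".

```python
def bottleneck(hop_capacities):
    """Usable bandwidth of a single path: its slowest hop."""
    return min(hop_capacities)

def aggregate(paths):
    """Bandwidth behind a set of parallel paths: the sum of their bottlenecks."""
    return sum(bottleneck(p) for p in paths)

# Two parallel paths, each listed hop by hop (arbitrary units).
paths = [[400, 100, 400], [400, 400, 400]]

per_path = [bottleneck(p) for p in paths]  # input to a load-balancing ratio
total = aggregate(paths)                   # value a node might re-advertise
print(per_path, total)
```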

The various use cases have been broadly solved on a single signaling mechanism
and - frustrating to some - by operational paradigm and discipline. Simply
having more than one signaling mechanism would offer some flexibility to
operators and implementors. This has been mentioned in multiple contexts over
the years.

[Tiger] I fully agree with your opinion.

There has also been appropriate criticism of the link-bw encoding. The choice
of IEEE 754 32-bit floating point numbers provided a useful way to carry big
numbers across BGP in an existing encoding - extended communities. However,
the poor granularity of that type for the numbers we use these days in networks
leads to mostly operational issues. For example, you can configure one number
and the closest rounded number is what is encoded on the wire. Similarly, how
do you do policy on numbers where rounding may be in place? And finally, such
numbers don't encode or interact nicely with YANG. There have been some
proposals to simply change the encoding to get us out of this particular bit of
unpleasantness.
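
The rounding issue is easy to reproduce. The sketch below packs a configured value as a big-endian IEEE 754 single-precision float and reads it back. It assumes the value is expressed in bytes per second, as the link-bandwidth community does; the helper name `to_wire_float32` is made up for this example.

```python
import struct

def to_wire_float32(value):
    """Pack as big-endian IEEE 754 single precision, then decode it again."""
    return struct.unpack("!f", struct.pack("!f", value))[0]

configured = 50_000_000_000  # 400 Gbit/s expressed in bytes per second
on_wire = to_wire_float32(configured)

# Show the configured value, what the receiver sees, and the difference.
print(configured, int(on_wire), int(on_wire) - configured)

# A policy that matches on the configured number would not match the wire value.
print(on_wire == configured)
```

Single precision keeps 24 significant bits, so not every integer of this magnitude can be represented exactly.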

I think there is room for further work to provide for a less insane encoding.
However, that will also lead us to figuring out how a new such mechanism
(possibly a new community) interops with the existing stuff. Since most of the
use cases for link-bw are satisfied with being a ratio rather than carrying
precise numbers, the pressure to address the deficiencies above hasn't been
high. However, once there's a desire for more precise capacity encoding, we'll
likely see the appropriate mechanisms being proposed for those use cases - and
those use cases may overlap with the existing ones.

I think there's more room for work to provide cleaner separation of overlay and
underlay use cases. In this respect, I'm supportive of continuing discussion
on the work you've begun with FARE. But like the other comments above, much of
that discussion will be whether a separate signaling mechanism makes our lives
easier at the implementation and at the operations level. I look forward to
that discussion.

[Tiger] Although the current version of the link-bandwidth draft has eliminated
several limitations associated with the link-bandwidth extended community as
mentioned in Section 1.1 of the FARE draft, the requirement on the use of both
transitive and non-transitive link-bandwidth extended communities is not
suitable for FARE, especially in a 5-stage Clos environment (see Section
4.2 of the FARE draft for more details).

"Generally, a single Link Bandwidth Extended Community of the transitivity type
desired in a deployment is attached to a route. However during transition
(refer Section
7<https://datatracker.ietf.org/doc/html/draft-ietf-idr-link-bandwidth-22#Operational_Condiderations>
for details), a BGP speaker MAY attach one Link Bandwidth Extended Community
per transitivity (transitive/non-transitive); the bandwidth value field in both
communities SHOULD be the same."

[Tiger] In short, the use of both transitive and non-transitive types for the
link-bandwidth attributes in the link-bandwidth draft is to enable
interoperation between old and new implementations, whereas the use of both
transitive and non-transitive types for the path-bandwidth attribute is to
distinguish two different kinds of path bandwidth values (see Section 4.2 of
the FARE draft for more details). Defining two distinct bandwidth-specific
attributes (one for DMZ external link bandwidth and another for path
bandwidth) would simplify matters, unless sub-type values are extremely
scarce, IMHO.

A few terse technical comments on the draft itself:

Section 3: Your requested encoding is impossible in RFC 4360 extended
communities. You have six octets to work with. You have both global and
local-admin fields that require 4 octets each.

[Tiger] In our current implementation, the 2-byte local-admin field of the
IPv4-address-specific extended community is filled with the path bandwidth
value in units of GB/s, using the IEEE 16-bit half-precision floating-point
format. However, your above comment makes me rethink the possibility of using
the ASN-specific extended community, which has a 4-byte field to convey the
path bandwidth value in the IEEE 32-bit single-precision floating-point format.
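
To make the two layouts concrete, here is a minimal Python sketch of both encodings. The type octets are the transitive values from RFC 4360; the sub-type is a placeholder rather than an assigned code point, and the function names, router ID, and AS number are illustrative only.

```python
import ipaddress
import struct

TYPE_IPV4_SPECIFIC = 0x01   # transitive IPv4-address-specific (RFC 4360)
TYPE_AS2_SPECIFIC = 0x00    # transitive two-octet-AS-specific (RFC 4360)
SUBTYPE_PLACEHOLDER = 0xFF  # hypothetical, not an assigned sub-type

def encode_ipv4_specific(router_id, bw):
    """type(1) + sub-type(1) + IPv4 global admin(4) + half-float local admin(2)."""
    global_admin = ipaddress.IPv4Address(router_id).packed
    return struct.pack("!BB4se", TYPE_IPV4_SPECIFIC, SUBTYPE_PLACEHOLDER,
                       global_admin, bw)

def encode_as2_specific(asn, bw):
    """type(1) + sub-type(1) + 2-octet AS global admin(2) + float32 local admin(4)."""
    return struct.pack("!BBHf", TYPE_AS2_SPECIFIC, SUBTYPE_PLACEHOLDER, asn, bw)

ipv4_form = encode_ipv4_specific("192.0.2.1", 2049.0)
as2_form = encode_as2_specific(64512, 2049.0)

# Decode the bandwidth field of each form to compare precision.
half = struct.unpack("!e", ipv4_form[6:])[0]
single = struct.unpack("!f", as2_form[4:])[0]
print(len(ipv4_form), len(as2_form), half, single)
```

Both forms fit the 8-octet extended community. Half precision keeps only 11 significant bits and cannot represent values above 65504, which is the trade-off of the 2-byte field.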

Security/Operational considerations: Your desire in this draft is to use
transitive extended communities. Unlike the hop-by-hop (re-)generated
non-transitive extended communities used by DMZ, you have attribute escape
issues to address:
- If a given node doesn't "do math" on the community because it doesn't
understand it, how does that impact the use case?
- You need to protect the deployment against receiving such communities from
outside the deployment.
- You need to discuss how you remove the communities when the routes are being
sent outside the deployment.
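
The last two points amount to scrubbing the community at the deployment boundary in both directions. A hedged Python sketch, assuming communities are handled as raw 8-octet values and that the path-bandwidth community is identified by a (type, sub-type) pair; all names and the code points used in the usage lines are hypothetical.

```python
NON_TRANSITIVE_BIT = 0x40  # T bit in the type octet (RFC 4360)

def is_path_bw(community, base_type, subtype):
    """Match the community in both its transitive and non-transitive form."""
    return (len(community) == 8
            and (community[0] & ~NON_TRANSITIVE_BIT & 0xFF) == base_type
            and community[1] == subtype)

def scrub(communities, base_type, subtype):
    """Drop path-bandwidth communities from a route's extended community list."""
    return [c for c in communities if not is_path_bw(c, base_type, subtype)]

def on_receive(communities, peer_is_internal, base_type, subtype):
    # Do not trust values injected from outside the deployment.
    if peer_is_internal:
        return list(communities)
    return scrub(communities, base_type, subtype)

def on_send(communities, peer_is_internal, base_type, subtype):
    # Do not leak the values when routes leave the deployment.
    if peer_is_internal:
        return list(communities)
    return scrub(communities, base_type, subtype)

# Usage with placeholder code points: base type 0x00, sub-type 0xFF.
path_bw = bytes([0x00, 0xFF, 0xFC, 0x00, 0, 0, 0, 1])
other = bytes([0x00, 0x02, 0xFC, 0x00, 0, 0, 0, 1])
print(on_send([path_bw, other], peer_is_internal=False,
              base_type=0x00, subtype=0xFF))
```

This does not answer the first point: a node inside the deployment that passes the community through without doing the math still needs separate treatment.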

[Tiger] The path bandwidth attribute is targeted for AI back-end data center
network scenarios, and therefore there is no such risk. Anyway, I will add some
text to explain this consideration.

Best regards,
Tiger

Some of these considerations are already addressed as part of the link-bw
document.

-- Jeff

[1] Juniper issued the first version as non-transitive, and then immediately
started shipping code where it was transitive while squatting on the transitive
code point - sloppiness from my forebears that has made for unfortunate cleanup
work in IETF along with interop issues.

_______________________________________________
BESS mailing list -- [email protected]
To unsubscribe send an email to [email protected]
