Hi Jeff,

Sorry for my late response. Please see my replies inline, marked with [Tiger].

From: Jeffrey Haas <[email protected]>
Date: Wednesday, May 6, 2026 22:29
To: Tiger Xu <[email protected]>
Cc: BESS <[email protected]>, idr@ietf. org <[email protected]>
Subject: Some comments on carrying bandwidth in BGP, and also on 
draft-xu-idr-fare-04 (was Re: [Idr] Working Group Last Call on 
draft-ietf-bess-ebgp-dmz)

[Speaking as an individual contributor in this response]

> On May 6, 2026, at 03:17, Tiger Xu <[email protected]> wrote:
> In essence, this changes the extended community from non‑transitive to 
> transitive and introduces the concept of bandwidth aggregation – both of 
> which were already present in draft-xu-idr-fare version -00.


First, a few words on the general use cases of carrying bandwidth/capacity in 
BGP routes:

The link-bandwidth feature, and its varying uses over the years, and the 
varying transitivities used for it[1] have been a long mess.  At a fundamental 
level, the feature of "we've sent a value, apply a multipath ratio across all 
paths for that destination based on the received values" has been broadly 
consistent across the implementations.
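
The ratio behaviour described above can be sketched roughly as follows. This is a minimal illustration, not any vendor's implementation; the function name `multipath_weights`, the next-hop labels, and the equal-split fallback for all-zero values are assumptions made for the example.

```python
from fractions import Fraction

def multipath_weights(received_bw):
    """Return each path's traffic share, proportional to its received bandwidth value."""
    total = sum(received_bw.values())
    if total == 0:
        # Assumption: with no usable values, fall back to equal sharing.
        return {path: Fraction(1, len(received_bw)) for path in received_bw}
    return {path: Fraction(bw, total) for path, bw in received_bw.items()}

# Hypothetical next hops; the unit of the values does not matter for the ratio.
weights = multipath_weights({"nh-a": 100, "nh-b": 100, "nh-c": 200})
```

Only the relative sizes of the received values matter here, which is why a ratio-only use of the feature tolerates imprecise encodings.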

As the use cases started to split between underlay and overlay topology, how 
multipath was handled at each layer, and how it interacted with load balancing, 
became messier.  One could observe that there are benefits for splitting the feature 
carrying the signaling for the bandwidth/capacity based on the role the routes 
are intended to serve, and also where they are applied.  The fact that the 
"role" of a given route is generally clear in most BGP contexts where BGP is 
carrying the underlay routing has made it less of a deployment problem to use 
the same signaling mechanism for both underlay and overlay.  However, it does 
mean that in places where "math" on those values has been necessary that having 
an overloaded signaling mechanism complicates implementation and operational 
logic.

The ready example covered in many places is that having hop-by-hop underlay 
bandwidth capacity is great for load balancing across nexthops.  However, when 
it comes time to consider multipath load balancing for individual 
overlay/service routes passing over links of disparate capacities, there tends 
to be a need to apply math based on the desired network-wide load balancing.  
Is it that you want to have a receiver acquire the minimal functional bandwidth 
that path can use?  Or, is it a ratio for traffic to be broadly load balanced 
behind a set of paths?  And certainly there are more use cases.
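
The two questions above correspond to two different kinds of "math" on the same values. A minimal sketch, assuming a path is represented simply as a list of per-link capacities (the function names and the numbers are hypothetical):

```python
def bottleneck_bandwidth(link_capacities):
    """Minimal functional bandwidth of one path: its slowest link."""
    return min(link_capacities)

def aggregate_bandwidth(path_bandwidths):
    """Combined capacity behind a node that can reach the destination over several paths."""
    return sum(path_bandwidths)

# Two hypothetical paths towards the same destination, listed hop by hop.
paths = [[400, 100], [400, 400]]
per_path = [bottleneck_bandwidth(p) for p in paths]
behind_node = aggregate_bandwidth(per_path)
```

Which of these a receiver should compute, and what it should re-advertise, depends on the use case; a single overloaded signal does not say.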

The various use cases have been broadly solved on a single signaling mechanism 
and - frustrating to some - by operational paradigm and discipline.  Simply 
having more than one signaling mechanism would offer some flexibility to 
operators and implementors.  This has been mentioned in multiple contexts over 
the years.

[Tiger] I fully agree with your opinion.

There has also been appropriate criticism of the link-bw encoding.  The choice 
of IEEE 754 32-bit floating point numbers provided a useful way to carry big 
numbers across BGP in an existing encoding - extended communities.  However, 
the poor granularity of that type for the numbers we use these days in networks 
leads to mostly operational issues.  For example, you can configure one number 
and the closest rounded number is what is encoded on the wire.  Similarly, how 
do you do policy on numbers where rounding may be in place?  And finally, such 
numbers don't encode or interact nicely with YANG.  There have been some 
proposals to simply change the encoding to get us out of this particular bit of 
unpleasantness.
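
The rounding issue can be reproduced with a short sketch. It assumes the configured value is expressed in bytes per second and round-trips it through IEEE 754 single precision; the helper name `as_float32` is made up for the example.

```python
import struct

def as_float32(value):
    """Round-trip a number through IEEE 754 single precision, as it would appear on the wire."""
    return struct.unpack("!f", struct.pack("!f", value))[0]

# 100 Gbit/s expressed in bytes per second (an assumed unit for this example).
configured = 12_500_000_000
on_wire = as_float32(configured)

# A policy that matches on the exact configured value would not match the wire value.
exact_match = (on_wire == configured)
```

With only 24 bits of significand, values of this size are spaced 1024 apart, so the configured number and the encoded number differ and an exact-match policy fails.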

I think there is room for further work to provide for a less insane encoding.  
However, that will also lead us to figuring out how a new such mechanism 
(possibly a new community) interops with the existing stuff.  Since most of the 
use cases for link-bw are satisfied with being a ratio rather than carrying 
precise numbers, the pressure to address the deficiencies above hasn't been 
high.  However, once there's a desire for more precise capacity encoding, we'll 
likely see the appropriate mechanisms being proposed for those use cases - and 
those use cases may overlap with the existing ones.

I think there's more room for work to provide cleaner separation of overlay and 
underlay use cases.  In this respect, I'm supportive of continuing discussion 
on the work you've begun with FARE.  But like the other comments above, much of 
that discussion will be whether a separate signaling mechanism makes our lives 
easier at the implementation and at the operations level.  I look forward to 
that discussion.

[Tiger] Although the current version of the link-bandwidth draft has eliminated 
several limitations associated with the link-bandwidth extended community as 
mentioned in Section 1.1 of the FARE draft, the requirement on the use of both 
transitive and non-transitive link-bandwidth extended communities is not 
suitable for FARE, especially in a 5-stage Clos environment (see Section 
4.2 of the FARE draft for more details).

"Generally, a single Link Bandwidth Extended Community of the transitivity type 
desired in a deployment is attached to a route. However during transition 
(refer Section 
7<https://datatracker.ietf.org/doc/html/draft-ietf-idr-link-bandwidth-22#Operational_Condiderations>
 for details), a BGP speaker MAY attach one Link Bandwidth Extended Community 
per transitivity (transitive/non-transitive); the bandwidth value field in both 
communities SHOULD be the same."

[Tiger]  In short, the use of both transitive and non-transitive types for the 
link-bandwidth attributes in the link-bandwidth draft is to enable 
interoperation between old and new implementations, whereas the use of both 
transitive and non-transitive types for the path-bandwidth attribute is to 
distinguish two different kinds of path bandwidth values (see Section 4.2 of 
the FARE draft for more details). Defining two distinct bandwidth-specific 
attributes—one for DMZ external link bandwidth and another for path 
bandwidth—would simplify matters, unless the availability of sub-type values is 
extremely scarce, IMHO.

A few terse technical comments on the draft itself:

Section 3: Your requested encoding is impossible in RFC 4360 extended 
communities.  You have six octets to work with. You have both global and 
local-admin fields that require 4 octets each.

[Tiger] In our current implementation, the 2-byte local-admin field of the 
IPv4-address-specific extended community is filled with the path bandwidth 
value in units of GB/s, using the IEEE 16-bit half-precision floating-point 
format. However, your above comment makes me rethink the possibility of using 
the ASN-specific extended community, which has a 4-byte field to convey the 
path bandwidth value in the IEEE 32-bit single-precision floating-point format.
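
The two layouts discussed here can be compared with a small packing sketch. It follows the RFC 4360 structure (one type octet, one sub-type octet, six value octets); the sub-type `0xFF`, the function names, and the example address and AS number are placeholders for illustration, not assigned code points or anything taken from the FARE draft.

```python
import ipaddress
import struct

# Hypothetical sub-type used only for illustration; not an assigned code point.
SUBTYPE_EXAMPLE = 0xFF

def ipv4_specific(ipv4, bandwidth, transitive=True):
    """IPv4-address-specific: 4-octet global admin, 2-octet local admin (half precision here)."""
    type_high = 0x01 if transitive else 0x41
    global_admin = ipaddress.IPv4Address(ipv4).packed
    # "e" is IEEE 754 half precision: 2 octets, largest finite value 65504.
    return struct.pack("!BB4se", type_high, SUBTYPE_EXAMPLE, global_admin, bandwidth)

def two_octet_as_specific(asn, bandwidth, transitive=True):
    """Two-octet-AS-specific: 2-octet global admin, 4-octet local admin (single precision here)."""
    type_high = 0x00 if transitive else 0x40
    return struct.pack("!BBHf", type_high, SUBTYPE_EXAMPLE, asn, bandwidth)

half = ipv4_specific("192.0.2.1", 400.0)
single = two_octet_as_specific(64512, 400.0)
```

Both results are exactly 8 octets, so either fits; the trade-off is a 4-octet identifier with a 2-octet value versus a 2-octet identifier with a 4-octet value.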

Security/Operational considerations: Your desire in this draft is to use 
transitive extended communities.  Unlike the hop-by-hop (re-)generated 
non-transitive extended communities used by DMZ, you have attribute escape 
issues to address:
- If a given node doesn't "do math" on the community because it doesn't 
understand it, how does that impact the use case?
- You need to protect the deployment against receiving such communities from 
outside the deployment.
- You need to discuss how you remove the communities when the routes are being 
sent outside the deployment.
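
The last two points amount to scrubbing the community at the edge of the deployment, in both directions. A minimal sketch, assuming extended communities are handled as raw 8-octet values and that the (type, sub-type) pairs below stand in for whatever code points the deployment actually uses:

```python
# Hypothetical (type, sub-type) pairs for bandwidth communities used inside the deployment.
INTERNAL_ONLY = {(0x00, 0xFF), (0x01, 0xFF)}

def scrub(ext_communities, blocked=INTERNAL_ONLY):
    """Drop 8-octet extended communities whose (type, sub-type) pair is in `blocked`."""
    return [c for c in ext_communities if (c[0], c[1]) not in blocked]

# Example route carrying one internal bandwidth community and one unrelated community.
route_ecs = [
    bytes([0x00, 0xFF, 0xFC, 0x00, 0x43, 0xC8, 0x00, 0x00]),
    bytes([0x00, 0x02, 0xFC, 0x00, 0x00, 0x00, 0x00, 0x64]),
]
cleaned = scrub(route_ecs)
```

The same filter would be applied on import from peers outside the deployment and on export towards them; it does not address the first point, where a node inside the deployment passes the value on without updating it.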

[Tiger] The path bandwidth attribute is targeted for AI back-end data center 
network scenarios, and therefore there is no such risk. Anyway, I will add some 
text to explain this consideration.

Best regards,
Tiger

Some of these considerations are already addressed as part of the link-bw 
document.


-- Jeff

[1] Juniper issued the first version as non-transitive, and then immediately 
started shipping code where it was transitive while squatting on the transitive 
code point - sloppiness from my forebears that has made for unfortunate cleanup 
work in IETF along with interop issues.
_______________________________________________
BESS mailing list -- [email protected]
To unsubscribe send an email to [email protected]
