Dear authors:
Thank you for documenting this important feature.
After reading the document, I think that it still needs work:
(1) The terminology used is not aligned with rfc4271. Of major importance is
the description of the Routing Table, the Decision Process, the use of best
routes (not paths!), etc. I pointed out multiple occurrences below, buy I
need you to check the whole docuent for consistency.
(2) The choice to describe the mechanism using a complex scenario (VPNs, etc.)
has resulted in the text being more complex than it needs to be.
It is probably too late for this, but a better approach (especially because
this is an Informational document) would have been to describe the
functionality using a simple IP-only network. A more elaborate example
could have then been included in an Appendix.
(3) The text (mainly in §5) related to hardware limitations is interesting but
also highly speculative. Given that it is not part of the main mechanism,
I would suggest putting this information in an appendix.
I put detailed comments below.
To the Chairs: I couldn't find a request to the idr WG for review.
Did I miss it? Given the subject, we need to give idr the opportunity
to comment.
Thanks!
Alvaro.
[Line numbers from idnits.]
...
12 Abstract
14 In a network comprising thousands of BGP peers exchanging millions of
15 routes, many routes are reachable via more than one next-hop. Given
16 the large scaling targets, it is desirable to restore traffic after
17 failure in a time period that does not depend on the number of BGP
18 prefixes. In this document we proposed an architecture by which
19 traffic can be re-routed to ECMP or pre-calculated backup paths in a
20 timeframe that does not depend on the number of BGP prefixes. The
21 objective is achieved through organizing the forwarding data
22 structures in a hierarchical manner and sharing forwarding elements
23 among the maximum possible number of routes. The proposed technique
24 achieves prefix independent convergence while ensuring incremental
25 deployment, complete automation, and zero management and provisioning
26 effort. It is noteworthy to mention that the benefits of BGP Prefix
27 Independent Convergence (BGP-PIC) are hinged on the existence of more
28 than one path whether as ECMP or primary-backup.
[style nit] "we proposed"
Please don't write using the first person ("we propose"), use the
third person instead ("this document proposes").
[minor] s/we proposed an architecture/describes an architecture
Other places also say that the document "proposes" -- at this point in
the process we're past proposals. :-)
[minor] Please expand ECMP in the Abstract, and later on first use.
[] "The proposed technique achieves prefix independent convergence
while ensuring incremental deployment, complete automation, and zero
management and provisioning effort."
Great marketing! This is a technical document, please avoid selling
the solution.
[minor] Consider breaking the Abstract into two paragraphs.
...
109 1. Introduction
111 BGP speakers exchange reachability information about
112 prefixes[1][2] and, for labeled address families, namely AFI/SAFI
113 1/4, 2/4, 1/128, and 2/128, an edge router assigns local labels to
114 prefixes and associates the local label with each advertised
115 prefix using technologies such as L3VPN [9], 6PE [10], and
116 Softwire [8] using BGP label unicast (BGP-LU) technique[3]. A BGP
117 speaker then applies the path selection steps to choose the best
118 path. In modern networks, it is not uncommon to have a prefix
119 reachable via multiple edge routers. In addition to proprietary
120 techniques, multiple techniques have been proposed to allow for
121 BGP to advertise more than one path for a given prefix
122 [7][12][13], whether in the form of equal cost multipath or
123 primary-backup. Another common and widely deployed scenario is
124 L3VPN with multi-homed VPN sites with unique Route Distinguisher.
125 It is advantageous to utilize the commonality among paths used by
126 NLRIs[1] to significantly improve convergence in case of topology
127 modifications.
[] I still haven't read the rest of the document... PIC is a valuable
and simple technique. Just from the first sentence (and peeking ahead
at some of the figures and examples), it seems to me that the
description could have been significantly simpler. :-(
[major] rfc7322 requires that RFCs be referenced as [RFCXXXX], please
update the citations.
[minor nit] Multiple citations don't have spaces around the brackets ("[]").
[minor] The reference to rfc4271 is the general reference for BGP, no
need to include a reference to rfc4760 with it.
[] "for labeled address families...L3VPN with multi-homed VPN sites
with unique Route Distinguisher"
The detail about labeled AFs/L3VPN seems unnecessary given that PIC
doesn't just apply to them.
[major] "BGP label unicast (BGP-LU) technique[3]"
Note that rfc8277 doesn't use the terms BGP-LU, "label unicast", or
"labeled unicast". I know that BGP-LU is commonly used, but it is not
defined in any IETF document (that I could find). Please define it in
the terminology section. Also, don't refer to rfc8277 every time you
use the term.
[major] "A BGP speaker then applies the path selection steps to choose
the best path."
rfc4271 doesn't talk about "path selection" or finding a "best path".
It specifies a Decision Process that results in best routes. Please
use consistent terminology.
[minor] "In addition to proprietary techniques..."
These proprietary techniques are not described anywhere. Please don't
mention them -- it will only raise unnecessary questions.
[nit] "techniques have been proposed...[7][12][13]"
Add-path and Diverse paths are not just "proposed".
[minor] "It is advantageous to utilize the commonality among paths
used by NLRIs[1] to significantly improve convergence in case of
topology modifications."
You may have to be more explicit about which commonality -- mention it
explicitly.
129 This document proposes a hierarchical and shared forwarding chain
130 organization that allows traffic to be restored to pre-calculated
131 alternative equal cost primary path or backup path in a time
132 period that does not depend on the number of BGP prefixes. The
133 technique relies on internal router behavior that is completely
134 transparent to the operator and can be incrementally deployed and
135 enabled with zero operator intervention. In other words, once it
136 is implemented and deployed on a router, nothing is required from
137 the operator to make it work. It is noteworthy to mention that
138 this document describes FIB architecture that can be implemented
139 in both hardware and/or software.
[nit] s/to pre-calculated/to a pre-calculated
[nit] s/describes FIB/describes a FIB
[minor] Expand FIB on first mention.
141 1.1. Terminology
[major] Please don't deviate (at least) from terminology used in
standards track specifications. Please don't redefine terms to mean
something different in this document.
If using terminology defined elsewhere, please just reference those
documents, and don't redefine the terms here.
143 This section defines the terms used in this document. For ease of
144 use, we will use terms similar to those used by L3VPN [9].
[major] Similar? If you're using the terminology from rfc4364, then
please just use it. Redefining terms doesn't make anything easier.
To be clear: remove any terms that are already defined in rfc4364 and
make the reference Normative.
146 o BGP prefix: A prefix P/m (of any AFI/SAFI) that a BGP speaker
147 has a path for.
[major] rfc4271 uses the concept of an "IP prefix" carried in a
"route" and described by a "path" (from §1.1):
Route
A unit of information that pairs a set of destinations with the
attributes of a path to those destinations. The set of
destinations are systems whose IP addresses are contained in one
IP address prefix carried in the Network Layer Reachability
Information (NLRI) field of an UPDATE message. The path is the
information reported in the path attributes field of the same
UPDATE message.
149 o IGP prefix: A prefix P/m (of any AFI/SAFI) that is learnt via
150 an Interior Gateway Protocol (IGP), such as OSPF and ISIS. The
151 prefix may be learnt directly through the IGP or redistributed
152 from other protocol(s).
[major] rfc4271 uses the term "IGP route".
[minor] The "P/m" notation is not explained -- and is not used outside
of this section.
[major] "is learnt via an...IGP...may be learnt directly through the
IGP or redistributed from other protocol(s)"
There's a conflict in the definition: does it come directly from the
IGP or from something else? Might might those "other protocol(s)" be?
154 o CE[7]: An external router through which an egress PE can reach
155 a prefix P/m.
[minor] Please expand CE.
[major] The reference to draft-ietf-idr-best-external seems incorrect;
that draft doesn't use CE at all. This seems to be a term that comes
from RFC4364.
[minor] Please expand PE in first use.
157 o Egress PE[7], "ePE": A BGP speaker that learns about a prefix
158 through an eBGP peer and chooses that eBGP peer as the next-hop
159 for that prefix.
[major] The reference to draft-ietf-idr-best-external seems incorrect;
that draft doesn't use PE at all. This seems to be a term that comes
from RFC4364.
[minor] s/eBGP/EBGP/g
That is how rfc4271 uses it.
[minor] Expand EBGP on first use.
161 o Ingress PE, "iPE": A BGP speaker that learns about a prefix
162 through a iBGP peer and chooses an egress PE as the next-hop for
163 the prefix.
[minor] s/iBGP/IBGP/g
That is how rfc4271 uses it.
[minor] Expand IBGP on first use.
165 o Path: The next-hop in a sequence of nodes starting from the
166 current node and ending with the destination node or network
167 identified by the prefix. The nodes may not be directly
168 connected.
[major] See the definition above from rfc4271.
[major] Some of the definitions refer to BGP, but others to the FIB
characteristics....differentiate. "FIB Path" vs "BGP Path", for
example. Note that rfc4271 uses the term "Routing Table" as follows
(from §3.2: Routing Information Base):
Routing information that the BGP speaker uses to forward packets (or
to construct the forwarding table used for packet forwarding) is
maintained in the Routing Table. The Routing Table accumulates
routes to directly connected networks, static routes, routes learned
from the IGP protocols, and routes learned from BGP. Whether a
specific BGP route should be installed in the Routing Table, and
whether a BGP route should override a route to the same destination
installed by another source, is a local policy decision, and is not
specified in this document. In addition to actual packet forwarding,
the Routing Table is used for resolution of the next-hop addresses
specified in BGP updates (see Section 5.1.3).
170 o Recursive path: A path consisting only of the IP address of the
171 next-hop without the outgoing interface. Subsequent lookups are
172 necessary to determine the outgoing interface and a directly
173 connected next-hop.
[major] rfc4271 already used "recursive route".
175 o Non-recursive path: A path consisting of the IP address of a
176 directly connected next-hop and outgoing interface.
[] See above.
...
181 o Primary path: A recursive or non-recursive path that can be
182 used all the time as long as a walk starting from this path can
183 end to an adjacency. A prefix can have more than one primary
184 path.
[minor] "can be used all the time"
I'm not sure what you mean here.
[minor] The concept of a "walk" hasn't been introduced -- please put a
pointer to where that is discussed.
...
189 o Leaf: A container data structure for a prefix or local label.
190 Alternatively, it is the data structure that contains prefix
191 specific information.
[] So a leaf can be many things: "prefix or local label...[or] prefix
specific information". Hmmm....
...
198 o Pathlist: An array of paths used by one or more prefixes to
199 forward traffic to destination(s) covered by an IP prefix. Each
200 path in the pathlist carries its "path-index" that identifies its
201 position in the array of paths. In general, the value of the
202 "path-index" stored in path may not necessarily have the same
203 value of the location of the path in the pathlist. For example
204 the 3rd path may carry path-index value of 1. A pathlist may
205 contain a mix of primary and backup paths.
[nit] s/its "path-index"/a "path-index"
207 o OutLabel-List: Each labeled prefix is associated with an
208 OutLabel-List. The OutLabel-List is an array of one or more
209 outgoing labels and/or label actions where each label or label
210 action has 1-to-1 correspondence to a path in the pathlist.
211 Label actions[6] are: push the label, pop the label, swap the
212 incoming label with the label in the OutLabel-List entry, or
213 don't push anything at all in case of "unlabeled". The prefix
214 may be an IGP or BGP prefix.
[major] If the label label actions are defined in rfc3031, please just
point to it -- is the "swap the incoming label with the label in the
OutLabel-List entry" defined there too?
It seems to me that the action defined here is part of a general
action taken by the FIB, and not specific only to labels (as in a
different encapsulation, for example).
216 o Forwarding chain: It is a compound data structure consisting of
217 multiple connected block that a forwarding engine walks one
218 block at a time to forward the packet out of an interface.
219 Section 2.2 explains an example of a forwarding chain.
220 Subsequent sections provide additional examples
[nit] s/multiple connected block/multiple connected blocks
[nit] s/additional examples/additional examples.
[minor] Note that both "forwarding chain" and "forwarding Chain" are
used throughout the text. Please be consistent.
...
229 o Route: A prefix with one or more paths associated with it. The
230 minimum set of objects needed to construct a route is a leaf
231 and a pathlist.
[major] See the comments above about rfc4271 and the FIB in general.
233 2. Overview
235 The idea of BGP-PIC is based on two pillars
[minor] The descriptions below are very thick (hard to read through)
without more context. Suggestion: keep the description simple in this
section and point to where there is more detail.
236 o A shared hierarchical forwarding chain: It is not uncommon to see
237 multiple destinations reachable via the same list of next-hops.
238 Instead of having a separate list of next-hops for each
239 destination, all destinations sharing the same list of next-hops
240 can point to a single copy of this list thereby allowing fast
241 convergence by making changes to a single shared list of next-
242 hops rather than possibly a large number of destinations. Because
243 paths in a pathlist may be recursive, a hierarchy is formed
244 between pathlist and the resolving prefix whereby the pathlist
245 depends on the resolving prefix.
247 o A forwarding plane that supports multiple levels of indirection:
248 A forwarding chain that starts with a destination and ends with
249 an outgoing interface is not a simple flat structure. Instead a
250 forwarding entry is constructed via multiple levels of
251 dependency. A BGP NLRI uses a recursive next-hop, which in turn
252 resolves via an IGP next-hop, which in turn resolves via an
253 adjacency consisting of one or more outgoing interface(s) and
254 next-hop(s).
[nit] s/Instead a/Instead, a
[minor] Are "multiple levels of indirection" and "multiple levels of
dependency" the same thing? The latter is used to describe the
former.
[] Next-hop is also a term that needs to be differentiated. In this
paragraph is it used in the context of a BGP route, an IGP route, and
a local address on the FIB. See rfc4271.
256 Designing a forwarding plane that constructs multi-level forwarding
257 chains with maximal sharing of forwarding objects allows rerouting a
258 large number of destinations by modifying a small number of objects
259 thereby achieving convergence in a time frame that does not depend
260 on the number of destinations. For example, if the IGP prefix that
261 resolves a recursive next-hop is updated there is no need to update
262 the possibly large number of BGP NLRIs that use this recursive next-
263 hop.
[] The example is orders of magnitue simpler that the actual
explanation. BTW, none of the words used ("multi-level forwarding
chains with maximal sharing of forwarding objects") are mentioned in
the description of the pillars.
265 2.1. Dependency
267 This section describes the required functionalities in the
268 forwarding and control planes to support BGP-PIC described in this
269 document.
[nit] s/BGP-PIC described/BGP-PIC as described
271 2.1.1. Hierarchical Hardware FIB (Forwarding Information Base)
[major] The Introduction says that "this document describes FIB
architecture that can be implemented in both hardware and/or
software", but this section requires a hardware implementation. ??
273 BGP-PIC requires a hierarchical hardware FIB support: for each BGP
274 forwarded packet, a BGP leaf is looked up, then a BGP Pathlist is
275 consulted, then an IGP Pathlist, then an Adjacency.
[major] "BGP forwarded packet"
I think you mean something along the lines of a packet destined to an
external destination, or to a destination learned through BGP, or ...
??
[major] The definitions of both a leaf and a pathlist don't assume
protocol origin. IOW, the reader has to make significant assumptions
to understand that you mean "a leaf/pathlist that resulted from
BGP/IGP-learned information" (or something like that). If the
identification of the origin protocol is required, please also say so.
[major] "Pathlist" and "Adjancency" are capitalized, but "leaf" isn't.
Personally, I like capitalizing terms with specific meaning. Not all
instances (throughout the document) are treated the same. Please be
consistent throughout.
[major] The general process above: lookup the destination (BGP leaf),
which points to possible alternatives (BGP pathlist), which
recursively resolve... assumes knowledge of how a FIB is generally
built and the operations that forwarding a packet requires. This may
not be common knowledge to every reader. Please either explain the
assumptions or reference a place where the explanation exists already.
277 An alternative method consists in "flattening" the dependencies when
278 programming the BGP destinations into HW FIB resulting in
279 potentially eliminating both the BGP Path-List and IGP Path-List
280 consultation. Such an approach decreases the number of memory
281 lookups per forwarding operation at the expense of HW FIB memory
282 increase (flattening means less sharing thereby less duplication),
283 loss of ECMP properties (flattening means less pathlist entropy) and
284 loss of BGP-PIC properties.
[minor] HW is mentioned here again...and the concept of programming,
which goes back to the general knowledge assumptions that you're
making.
[minor] s/Path-List/Pathlist/g
[major] Am I to assume that this "alternative method" shouldn't be
considered/used, especially if it results in loss of the "BGP-PIC
properties"?? This section started by mentioning a requirement -- if
this other approach is not what is required the why even mention it?
286 2.1.2. Availability of more than one BGP next-hops
288 When the primary BGP next-hop fails, BGP-PIC depends on the
289 availability of one or more pre-computed and pre-installed secondary
290 BGP next-hop(s) in the BGP Pathlist.
[major] "primary BGP next-hop"
At this point in the document you've hinted at the possibility of
having multiple BGP routes (with potentially different NEXT_HOPs), but
the concept of "primary BGP next-hop" hasn't been explained. I'm
assuming that by this you mean the BGP best route [rfc4271]. (Again,
you're making assumptions of the knowledge of the reader.)
I can see that this section talks about "secondary next-hops" later.
Perhaps change the order of the text so that the conditions precede
the explanation of the actions.
[major] "BGP next-hop fails"
rfc4271 talks about the the address in the NEXT_HOP attribute becoming
not resolvable. Please use established terminology!
[major] "pre-computed and pre-installed...in the BGP Pathlist"
There's an example of "pre-computed" later on, but I couldn't find an
explanation of "pre-installed...in the BGP Pathlist".
292 The existence of a secondary next-hop is clearly required for the
293 following reason: a service caring for network availability will
294 require two disjoint network connections resulting in two BGP next-
295 hops.
[minor] You mean "clearly required" for BGP-PIC, right? Not for the service.
297 The BGP distribution of secondary next-hops is available thanks to
298 the following BGP mechanisms: Add-Path [12], BGP Best-External [7],
299 diverse path [13], and the frequent use in VPN deployments of
300 different VPN RD's per PE. Another option to learn multiple BGP
301 NH/path is simply to receive IBGP paths from multiple BGP RR
302 selection a different path as best. It is noteworthy to mention that
303 the availability of another BGP path does not mean that all failure
304 scenarios can be covered by simply forwarding traffic to the
305 available secondary path. The discussion of how to cover various
306 failure scenarios is beyond the scope of this document.
[minor] "BGP NH/path" is a new term. Even if using "NH" seems
obvious, please mention it at least once. Also, you've been talking
(incorrectly) about BGP paths (instead of routes) or BGP next-hops and
now you're combining the two. This is the only instance.
[nit] s/is simply to receive/is to receive
[nit] s/multiple BGP RR selection/multiple BGP RRs selecting
[minor] You might want to reference rfc9107 as an example of "multiple
BGP RRs selecting a different path as best".
308 2.2. BGP-PIC Illustration
310 To illustrate the two pillars above as well as the platform
311 dependency, we will use an example of a simple multihomed L3VPN [9]
312 prefix in a BGP-free core running LDP [4] or segment routing over
313 MPLS forwarding plane [5].
[nit] s/a simple multihomed/a multihomed
What is "simple" to you may not be simple to others. Just state the facts.
[minor] s/L3VPN [9]/L3VPN
The citation is not needed all the time.
[minor] "core running LDP [4] or segment routing over MPLS forwarding plane [5]"
This is an example. To simplify, pick one.
315 +--------------------------------+
316 | |
317 | ePE2 (IGP-IP1 192.0.2.1, Loopback)
318 | | \
319 | | \
320 | | \
321 iPE | CE....VRF "Blue", ASnum 65000
322 | | / (VPN-IP1 198.51.100.0/24)
323 | | / (VPN-IP2 203.0.113.0/24)
324 | LDP/Segment-Routing Core | /
325 | ePE1 (IGP-IP2 192.0.2.2, Loopback)
326 | |
327 +--------------------------------+
328 Figure 1 VPN prefix reachable via multiple PEs
[nit] s/ASnum/ASN
Expand on first use.
330 Referring to Figure 1, suppose the iPE (the ingress PE) receives
331 NLRIs for the VPN prefixes VPN-IP1 and VPN-IP2 from two egress PEs,
332 ePE1 and ePE2 with next-hop BGP-NH1 and BGP-NH2, respectively.
333 Assume that ePE1 advertise the VPN labels VPN-L11 and VPN-L12 while
334 ePE2 advertise the VPN labels VPN-L21 and VPN-L22 for VPN-IP1 and
335 VPN-IP2, respectively. Suppose that BGP-NH1 and BGP-NH2 are resolved
336 via the IGP prefixes IGP-IP1 and IGP-IP2, where each happen to have
337 2 equal cost paths with IGP-NH1 and IGP-NH2 reachable via the
338 interfaces I1 and I2, respectively. Suppose that local labels
339 (whether LDP [4] or segment routing [5]) on the downstream LSRs[4]
340 for IGP-IP1 are IGP-L11 and IGP-L12 while for IGP-IP2 are IGP-L21
341 and IGP-L22. As such, the routing table at iPE is as follows:
[minor] "receives NLRIs"
I other places you're talked about paths or routes...please be consistent!
[minor] "IGP-NH1 and IGP-NH2 reachable via the interfaces I1 and I2"
I'm sure you're talking about interfaces on the iPE, but others may
find that the picture is lacking detail when compared to the
description: the core is just a box..
343 65000: 198.51.100.0/24
344 via ePE1 (192.0.2.1), VPN Label: VPN-L11
345 via ePE2 (192.0.2.2), VPN Label: VPN-L21
[minor] You need to explain the notation and, for example, why some
entries have an ASN and others don't.
[minor] It might help if the description included the IP addresses.
For example, when mentioning BGP-NH1, indicate that it refers to
192.0.2.1.
...
362 IP Leaf: Pathlist: IP Leaf: Pathlist:
363 -------- +-------+ -------- +----------+
364 VPN-IP1-->|BGP-NH1|-->IGP-IP1(BGP NH1)--->|IGP-NH1,I1|--->Adjacency1
365 | |BGP-NH2|-->.... | |IGP-NH2,I2|--->Adjacency2
366 | +-------+ | +----------+
367 | |
368 | |
369 v v
370 OutLabel-List: OutLabel-List:
371 +----------------------+ +----------------------+
372 |VPN-L11 (VPN-IP1, NH1)| |IGP-L11 (IGP-IP1, NH1)|
373 |VPN-L21 (VPN-IP1, NH2)| |IGP-L12 (IGP-IP1, NH2)|
374 +----------------------+ +----------------------+
376 Figure 2 Shared Hierarchical Forwarding Chain at iPE
378 The forwarding chain depicted in Figure 2 illustrates the first
379 pillar, which is sharing and hierarchy. We can see that the BGP
380 pathlist consisting of BGP-NH1 and BGP-NH2 is shared by all NLRIs
381 reachable via ePE1 and ePE2. As such, it is possible to make changes
382 to the pathlist without having to make changes to the NLRIs. For
383 example, if BGP-NH2 becomes unreachable, there is no need to modify
384 any of the possibly large number of NLRIs. Instead only the shared
385 pathlist needs to be modified. Likewise, due to the hierarchical
386 structure of the forwarding chain, it is possible to make
387 modifications to the IGP routes without having to make any changes
388 to the BGP NLRIs. For example, if the interface "I2" goes down, only
389 the shared IGP pathlist needs to be updated, but none of the IGP
390 prefixes sharing the IGP pathlist nor the BGP NLRIs using the IGP
391 prefixes for resolution need to be modified.
[minor] "We can see that the BGP pathlist consisting of BGP-NH1 and
BGP-NH2 is shared by all NLRIs reachable via ePE1 and ePE2."
I only see VPN-IP1 using that pathlist.
I know it is not easy to draw using ASCII, but you may want to provide
a conceptual figure showing how the leafs and Pathlists relate to each
other and then describe the contents. OR You can also integrate an
SVI drawing in the XML.
[minor] The representation of the Pathlists is not consistent: the
first one shows just BGP-NHs, while the second one has an extra
element. It may be obvious to you that the interface element points
at the corresponding Adjacencies, but please explain that too.
[major] The description is not complete -- it jumps from using one
Pathlist to the other, but it doesn't cover the recursive nature of
the resolution: resolving the BGP-NHs in the Pathlist through another
leaf.
[major] The outlabel-list is not explained anywhere.
393 Figure 2 can also be used to illustrate the second BGP-PIC pillar.
394 Having a deep forwarding chain such as the one illustrated in Figure
395 2 requires a forwarding plane that is capable of accessing multiple
396 levels of indirection in order to calculate the outgoing
397 interface(s) and next-hops(s). While a deeper forwarding chain
398 minimizes the re-convergence time on topology change, there will
399 always exist platforms with limited capabilities and hence imposing
400 a limit on the depth of the forwarding chain. Section 5 describes
401 how to gracefully trade off convergence speed with the number of
402 hierarchical levels to support platforms with different
403 capabilities.
[minor] "Figure 2 can also be used to illustrate the second BGP-PIC pillar."
Ok. But the rest of the paragraph doesn't explain how that is needed
in the example.
405 Another example using IPv6 addresses can be something like the
406 following
408 65000: 2001:DB8:1::/48
409 via ePE1 (192::1), VPN Label: VPN6-L11
410 via ePE2 (192::2), VPN Label: VPN6-L21
412 65000: 2001:DB8:2:/48
413 via ePE1 (192::1), VPN Label: VPN6-L12
414 via ePE2 (192::2), VPN Label: VPN6-L22
416 192::1/128
417 via Core, Label: IGP6-L11
418 via Core, Label: IGP6-L12
420 192::2/128
421 via Core, Label: IGP6-L21
422 via Core, Label: IGP6-L22
424 The same hierarchical forwarding chain described can be constructed
425 for IPv6 addresses/prefixes.
[major] Please use the addresses from rfc3849.
[] BTW, sorry to be blunt, but this is a lazy attempt at using IPv6 in
an example. You should at least show a figure and maybe a different
scenario...not just similar numbers.
427 3. Constructing the Shared Hierarchical Forwarding Chain
429 Constructing the forwarding chain is an application of the two
430 pillars described in Section 2. This section describes how to
431 construct the forwarding chain in hierarchical shared manner.
[nit] s/in hierarchical/in a hierarchical
433 3.1. Constructing the BGP-PIC forwarding Chain
435 The whole process starts when BGP downloads a prefix to FIB. The
436 prefix contains one or more outgoing paths. For certain labeled
437 prefixes, such as VPN [9] prefixes, each path may be associated with
438 an outgoing label and the prefix itself may be assigned a local
439 label. The list of outgoing paths defines a pathlist. If such
440 pathlist does not already exist, then FIB manager (software or
441 hardware entity responsible for managing the FIB) creates a new
442 pathlist, otherwise the existing pathlist is used. The BGP prefix is
443 added as a dependent of the pathlist.
[major] "BGP downloads a prefix to FIB"
rfc4271 talks about "routes...installed in the Routing Table".
Please, as much as possible, use the terminology from established
RFCS. Note that this document doesn't define what a FIB is.
[major] "The prefix contains one or more outgoing paths."
This may be another terminology issue: a BGP route doesn't contain
multiple next-hops. If the routes came from ADD_PATH or ORR, etc.,
there will be multiple BGP routes with a common NLRI and different
next-hops. IOW, it looks like the Pathlist would be built
incrementally as routes for the same NLRI are installed.
The end result should be the same. Terminology matters!
[minor] "VPN [9]"
The other references to rfc4364 use "L3VPN", should this one use it too?
[major] "If such pathlist does not already exist..."
It should be mentioned somewhere the reason for this Pathlist to
already exist: there's probably another BGP route using it. This is
where the connection to the shared nature of the chain needs to be
made. Otherwise, the description sounds as if there might simply be a
Pathlist structure available...
[nit] s/FIB manager/the FIB manager/g
[] It might be easier/better/clearer to explain the process as a
series of steps. Note that the description above is really a series
of steps: BGP has a route to install, the route is accepted in the
routing table (local policy), a leaf for this NLRI exists (or not),
the Pathlist exists (or not), etc...
445 The previous step constructs the upper part of the hierarchical
446 forwarding chain. The forwarding chain is completed by resolving the
447 paths of the pathlist. A BGP path usually consists of a next-hop.
448 The next-hop is resolved by finding a matching IGP prefix.
[major] "The forwarding chain is completed by resolving the paths of
the pathlist."
The example in the last section shows that this "lower part" (is that
the right term?) could also include a Pathlist. The explanation of
how the "upper part" is constructed didn't mention that the process
was to "resolve" the BGP NLRI. As a result, there's a disconnect
between what was done before and what is assumed here.
Again, a series of steps would serve the purpose better.
Also, be mindful of the use of "resolve" in rfc4271.
[major] "A BGP path usually consists of a next-hop."
As opposed to what? I mean the NEXT_HOP attribute is mandatory, a
route includes multiple other Path Attributes... Not sure what the
point it.
[minor] "The next-hop is resolved by finding a matching IGP prefix."
Yes, as an extreme simplification. §9.1.2.1/rfc4271 offers a better
description.
450 The end result is a hierarchical shared forwarding chain where the
451 BGP pathlist is shared by all BGP prefixes that use the same list of
452 paths and the IGP prefix is shared by all pathlists that have a path
453 resolving via that IGP prefix. It is noteworthy to mention that the
454 forwarding chain is constructed without any operator intervention at
455 all.
[minor] "The end result is a hierarchical shared forwarding chain
where the BGP pathlist is shared..."
As I mentioned above, more emphasis should be places on (when
explaining the steps) the shared nature of the Pathlists and how its
dependents are independent of each other too.
[minor] "without any operator intervention"
Yes, this was mentioned as an advantage somewhere. It would be better
if the document was explicit from the start on the point that it is
describing a set of abstract structures and procedures to be
internally instantiated. Just like protocol structures (neighbors,
routes, etc.) that are transparent to the user.
...
460 3.2. Example: Primary-Backup Path Scenario
462 Consider the egress PE ePE1 in the case of the multi-homed VPN
463 prefixes in the BGP-free core depicted in Figure 1. Suppose ePE1
464 determines that the primary path is the external path but the backup
465 path is the iBGP path to the other PE ePE2 with next-hop BGP-NH2.
466 ePE2 constructs the forwarding chain depicted in Figure 3. We are
467 only showing a single VPN prefix for simplicity. But all prefixes
468 that are multihomed to ePE1 and ePE2 share the BGP pathlist.
[minor] "BGP-free core depicted in Figure 1"
It doesn't really make a difference, but it wasn't mentioned before
that Figure 1 has a BGP-free core. I'm mentioning this because there
are inconsistencies in the description. Also, this is a detail that
is not important to the description, but that can create confusion for
other readers and generate unnecessary questions and comments.
[minor] "Suppose ePE1 determines..."
It would be great if there was either a pointer to how ePE1 does all
this (where are backups specified?), or a little bit more explanation
of the assumptions.
[nit] s/external path but the backup/external path, while the backup
470 BGP OutLabel-List
471 VPN-L11 +---------+
472 (Label-leaf)---+---->|Unlabeled|
473 | +---------+
474 v | VPN-L21 |
475 | | (swap) |
476 | +---------+
477 |
478 | BGP Pathlist
479 | +------------+ Connected route
480 | | CE-NH |------>(to the CE)
481 | |path-index=0|
482 | +------------+
483 | | VPN-NH2 |
484 VPN-IP1 -----+------------------>| (backup) |------>IGP Leaf
485 (IP leaf) |path-index=1| (Towards ePE2)
486 | +------------+
487 |
488 | BGP OutLabel-List
489 | +---------+
490 +------------->|Unlabeled|
491 +---------+
492 | VPN-L21 |
493 | (push) |
494 +---------+
496 Figure 3 : VPN Prefix Forwarding Chain with eiBGP paths on egress PE
[minor] Figure 2 provided an easy-to-follow dependency order: Leaf to
Pathlist to Leaf to Pathlist. The dependencies in Figure 3 are not
straightforward, please provide a description of the figure before
explaining the differences with respect to the other example.
[major] The definition of Pathlist (§1.1) implies that all its entries
can be used for forwarding, but the Pathlist above (and this use case)
clearly indicate that not all the entries in a Pathlist are used in
the same way, or at the same time.
498 The example depicted in Figure 3 differs from the example in Figure
499 2 in two main aspects. First, as long as the primary path towards
500 the CE (external path) is useable, it will be the only path used for
501 forwarding while the OutLabel-List contains both the unlabeled label
502 (primary path) and the VPN label (backup path) advertised by the
503 backup path ePE2. The second aspect is presence of the label leaf
504 corresponding to the VPN prefix. This label leaf is used to match
505 VPN traffic arriving from the core. Note that the label leaf shares
506 the pathlist with the IP prefix.
[major] "the...path...is useable"
What does "useable" mean? Please describe this term in the terminology section.
[major] "primary path towards the CE (external path) is useable, it
will be the only path used"
The arrow from VPN-IP1 points to the VPN-NH2 entry in the Pathlist, so
the description doesn't seem to match.
[major] "the OutLabel-List"
There are two boxes named "BGP OutLabel-List", and they contain
different backup actions (swap vs push).
[minor] "unlabeled label"
rfc3031 talks about unlabeled packets, not labels.
[minor] A third difference is that this figure includes a "path-index"
-- but that is not explained.
508 4. Forwarding Behavior
...
513 When a packet arrives at a router, it matches a leaf. A labeled
514 packet matches a label leaf while an IP packet matches an IP leaf.
515 The forwarding engines walks the forwarding chain starting from the
516 leaf until the walk terminates on an adjacency. Thus when a packet
517 arrives, the chain is walked as follows:
[minor] Instead of summarizing, simply start with the steps.
519 1. Lookup the leaf based on the destination address or the label at
520 the top of the packet.
[major] Lookup how? How do you know you found the right leaf?
I think rfc3031 has some text related to the forwarding of labeled
packets. But I don't know of a reference for IP forwarding, so you
should at least mention something around a longest-prefix-match.
522 2. Retrieve the parent pathlist of the leaf.
[minor] While it is related, the terminology section defines the
dependency relationship as the leaf being the child of the Pathlist --
keeping consistency is important.
524 3. Pick the outgoing path "Pi" from the list of resolved paths in
525 the pathlist. The method by which the outgoing path is picked is
526 beyond the scope of this document (e.g. flow-preserving hash
527 exploiting entropy within the MPLS stack and IP header). Let the
528 "path-index" of the outgoing path "Pi" be "j".
[minor] s/Pick the outgoing path "Pi"/Pick an outgoing path "Pi"
[] "Let the "path-index" of the outgoing path "Pi" be "j"."
I'm not sure if you mean that "j is the path-index of Pi" or "assign a
path-index of j". If the former, how is the path-index assigned?
530 4. If the prefix is labeled, use the "path-index" "j" to retrieve
531 the jth label "Lj" stored the jth entry in the OutLabel-List and
532 apply the label action of the label on the packet (e.g. for VPN
533 label on the ingress PE, the label action is "push"). As
534 mentioned in Section 1.1 the value of the "path-index" stored
535 in path may not necessarily be the same value of the location of
536 the path in the pathlist.
[minor] "retrieve the jth label "Lj" stored the jth entry in the OutLabel-List"
This sounds as if there were (at least) j labels in each of (at least)
j entries in the OutLabel-List. But I assume that there is really
only one label (or label stack) in that entry. ??
538 5. Move to the parent of the chosen path "Pi".
540 6. If the chosen path "Pi" is recursive, move to its parent prefix
541 and go to step 2.
543 7. If the chosen path is non-recursive move to its parent adjacency.
544 Otherwise go to the next step.
[minor] These 2 steps say the same thing: "move to the parent
of..."Pi"", "path "Pi"...move to its parent", and "move to its
parent". I think you need to better differentiate (?) the options.
...
549 Let's apply the above forwarding steps to the forwarding chain
550 depicted in Figure 2 in Section 2. Suppose a packet arrives at
551 ingress PE iPE from an external neighbor. Assume the packet matches
552 the VPN prefix VPN-IP1. While walking the forwarding chain, the
553 forwarding engine applies a hashing algorithm to choose the path and
554 the hashing at the BGP level yields path 0 while the hashing at the
555 IGP level yields path 1. In that case, the packet will be sent out
556 of interface I2 with the label stack "IGP-L12,VPN-L11".
[minor] s/path 0...path 1/path-index 0...path-index 1
[minor] The example would be better understood if it was presented as
steps matching the process described above (vs a narrative
explanation). Also, be specific in which path is the one with each
path-index...
[major] Figure 2 doesn't contain any indication of a path-index.
558 5. Handling Platforms with Limited Levels of Hierarchy
[] This whole section, which is longer than the text describing the
"normal" mechanism (!), seems to be more appropriate for an Appendix
than the main body of the document. Please move it.
...
564 o Being able to reduce the number of hierarchical levels from any
565 arbitrary value to a smaller arbitrary value that can be
566 supported by the forwarding engine.
[?] What is the worst case in terms of number of levels? 3? Just
curious. I'm asking simply because an "arbitrary value" sounds like it
could be 10s o even 1000s, when in reality it is not more than a
handful. Are there many platforms out there that wouldn't support 2-3
levels (for the number of routes/paths they normally would carry)?
[minor] "smaller arbitrary value"
This smaller value is not really arbitrary (as in random), it is
determined by the platform. Maybe something along the lines of a
"value supported by the platform" would be better.
...
571 5.1. Flattening the Forwarding Chain
...
584 1. The forwarding engine is now at leaf "R1".
...
605 9. Now the forwarding engine continues the walk to the parent of
606 "Qj".
[minor] This list of steps is just a restatement of the process in the
last section. IMO there's no need to repeat the steps here. Point at
the 2-level example, and jump directly to the next paragraph.
To reference from the explanation below, "a picture is worth a
thousand words". ;-)
...
614 1. FIB manager wants to reduce the number of levels used by "Pi" by
615 1.
[nit] This is not really a step...
...
622 4. FIB manager also extracts the OutLabel-list(R2) associated with
623 the leaf "R2". Remember that OutLabel-list(R2) = <L1, L2,...,
624 Lm>.
[] New syntax: "OutLabel-list(R2)".
...
643 It is noteworthy to mention that the labels in the OutLabel-list
644 associated with the "flattened" pathlist may be stored in the same
645 memory location as the path itself to avoid additional memory
646 access. But that is an implementation detail that is beyond the
647 scope of this document.
[] Why even mention it then?
649 The same steps can be applied to all paths in the pathlist <P1,
650 P2,..., Pn> so that all paths are "flattened" thereby reducing the
651 number of hierarchical levels by one. Note that that "flattening" a
652 pathlist pulls in all paths of the parent paths, a desired feature
653 to utilize all ECMP/UCMP paths at all levels. A platform that has a
654 limit on the number of paths in a pathlist for any given leaf may
655 choose to reduce the number paths using methods that are beyond the
656 scope of this document.
[minor] Expand UCMP.
...
663 Because a flattened pathlist may have an associated OutLabel-list
664 the forwarding behavior has to be slightly modified. The
665 modification is done by adding the following step right after step 4
666 in Section 4.
668 5. If there is an OutLabel-list associated with the pathlist, then
669 if the path "Pi" is chosen by the hashing algorithm, retrieve the
670 label at location "i" in that OutLabel-list and apply the label
671 action of that label on the packet.
[major] Step 4 says this:
4. If the prefix is labeled, use the "path-index" "j" to retrieve
the jth label "Lj" stored the jth entry in the OutLabel-List and
apply the label action of the label on the packet...
Besides using "j" instead of "i" as the path-index, both seem to say
the same thing. Am I missing something?
...
676 5.2. Example: Flattening a forwarding chain.
678 This example uses a case of inter-AS option C [9] where there are 3
679 levels of hierarchy. Figure 4 illustrates the sample topology. To
680 force 3 levels of hierarchy, the ASBRs[9] on the ingress domain
681 (domain 1) advertise the core routers of the egress domain (domain
682 2) to the ingress PE (iPE) via BGP-LU [3] instead of redistributing
683 them into the IGP of domain 1. The end result is that the ingress PE
684 (iPE) has 2 levels of recursion for the VPN prefixes VPN-IP1 and
685 VPN-IP2.
[] Suggestion:
OLD>
To force 3 levels of hierarchy, the ASBRs[9] on the ingress
domain (domain 1) advertise the core routers of the egress
domain (domain 2) to the ingress PE (iPE) via BGP-LU [3]
instead of redistributing them into the IGP of domain 1.
NEW>
The ASBRs on the ingress domain (Domain 1) use BGP to advertise
the core routers (ASBRs and ePEs) of the egress domain (Domain
2) to the iPE.
[minor] Expand ASBR on first use. No need to add a reference.
687 Domain 1 Domain 2
688 +-------------+ +-------------+
689 | | | |
690 | LDP/SR Core | | LDP/SR core |
691 | | | |
692 | (192.0.2.4) | |
693 | ASBR11-------ASBR21........ePE1(192.0.2.1)
694 | | \ / | . . |\
695 | | \ / | . . | \
696 | | \ / | . . | \
697 | | \/ | .. | \VPN-IP1(198.51.100.0/24)
698 | | /\ | . . | /VRF "Blue" ASn: 65000
699 | | / \ | . . | /
700 | | / \ | . . | /
701 | | / \ | . . |/
702 iPE ASBR12-------ASBR22........ePE2 (192.0.2.2)
703 | (192.0.2.5) | |\
704 | | | | \
705 | | | | \
706 | | | | \VRF "Blue" ASn: 65000
707 | | | | /VPN-IP2(203.0.113.0/24)
708 | | | | /
709 | | | | /
710 | | | |/
711 | ASBR13-------ASBR23........ePE3(192.0.2.3)
712 | (192.0.2.6) | |
713 | | | |
714 | | | |
715 +-------------+ +-------------+
716 <=========== <========= <============
717 Advertise ePEx Advertise Redistribute
718 Using iBGP-LU ePEx Using IGP into
719 eBGP-LU BGP
721 Figure 4 : Sample 3-level hierarchy topology
[minor] s/ASn/ASN/g
[major] "Redistribute IGP into BGP" ??
I'm sure you don't mean that, but probably more something like
"redistribute the ePEx routes only"...
723 We will make the following assumptions about connectivity
[nit] s/connectivity/connectivity:
725 o In "domain 2", both ASBR21 and ASBR22 can reach both ePE1 and
726 ePE2 using the same distance.
[nit] The figure capitalizes "Domain", but the text doesn't. Please
be consistent!
[minor] By "distance" I assume you mean "IGP metric". Please use that
term instead. Apply to other occurences.
...
733 We will make the following assumptions about the labels
[nit] s/labels/labels:
...
740 o The labels advertised by ASBR11 to iPE using BGP-LU [3] for the
741 egress PEs ePE1 and ePE2 are LASBR111(ePE1) and LASBR112(ePE2),
742 respectively.
[] Why change the notation of the labels? This new notation
("LASBR111(ePE1)") just makes the explanation more complex. :-(
...
764 65000: 198.51.100.0/24
765 via ePE1 (192.0.2.1), VPN Label: VPN-L11
766 via ePE2 (192.0.2.2), VPN Label: VPN-L21
767 65000: 203.0.113.0/24
768 via ePE2 (192.0.2.2), VPN Label: VPN-L22
769 via ePE3 (192.0.2.3), VPN Label: VPN-L32
771 192.0.2.1/32 (ePE1)
772 Via ASBR11, BGP-LU Label: LASBR111(ePE1)
773 Via ASBR12, BGP-LU Label: LASBR121(ePE1)
774 192.0.2.2/32 (ePE2)
775 Via ASBR11, BGP-LU Label: LASBR112(ePE2)
776 Via ASBR12, BGP-LU Label: LASBR122(ePE2)
777 192.0.2.3/32 (ePE3)
778 Via ASBR13, BGP-LU Label: LASBR13(ePE3)
780 192.0.2.4/32 (ASBR11)
781 via Core, Label: IGP-L11
782 192.0.2.5/32 (ASBR12)
783 via Core, Label: IGP-L12
784 192.0.2.6/32 (ASBR13)
785 via Core, Label: IGP-L13
[nit] s/Via/via/g
[minor] Only some of the addresses are annotated ("192.0.2.1/32
(ePE1)"). This is helpful, but inconsistent -- both within this
example and with previous ones.
...
797 o The index inside the pathlist entry indicates the label that will
798 be picked from the OutLabel-List associated with the child leaf
799 if that path is chosen by the forwarding engine hashing function.
[] s/index/path-index
801 OutLabel-List OutLabel-List
802 For VPN-IP1 For VPN-IP2
803 +------------+ +--------+ +-------+ +------------+
804 | VPN-L11 |<---| VPN-IP1| |VPN-IP2|-->| VPN-L22 |
805 +------------+ +---+----+ +---+---+ +------------+
806 | VPN-L21 | | | | VPN-L32 |
807 +------------+ | | +------------+
808 | |
809 V V
810 +---+---+ +---+---+
811 | 0 | 1 | | 0 | 1 |
812 +-|-+-\-+ +-/-+-\-+
813 | \ / \
814 | \ / \
815 | \ / \
816 | \ / \
817 v \ / \
818 +-----+ +-----+ +-----+
819 +----+ ePE1| |ePE2 +-----+ | ePE3+-----+
820 | +--+--+ +-----+ | +--+--+ |
821 v | / v | v
822 +--------------+ | / +--------------+ | +-------------+
823 |LASBR111(ePE1)| | / |LASBR112(ePE2)| | |LASBR13(ePE3)|
824 +--------------+ | / +--------------+ | +-------------+
825 |LASBR121(ePE1)| | / |LASBR122(ePE2)| | OutLabel-List
826 +--------------+ | / +--------------+ | For ePE3
827 OutLabel-List | / OutLabel-List |
828 For ePE1 | / For ePE2 |
829 | / |
830 | / |
831 | / |
832 v / v
833 +---+---+ Shared Pathlist +---+ Pathlist
834 | 0 | 1 | For ePE1 and ePE2 | 0 | For ePE3
835 +-|-+-\-+ +-|-+
836 | \ |
837 | \ |
838 | \ |
839 | \ |
840 v \ v
841 +------+ +------+ +------+
842 +---+ASBR11| |ASBR12+--+ |ASBR13+---+
843 | +------+ +------+ | +------+ |
844 v v v
845 +-------+ +-------+ +-------+
846 |IGP-L11| |IGP-L12| |IGP-L13|
847 +-------+ +-------+ +-------+
849 Figure 5 : Forwarding Chain for hardware supporting 3 Levels
[] The OutLabel-Lists at the bottom are not tagged as such.
[] Some down arrows ("v") are missing -- on the non-vertical lines.
851 Now suppose the hardware on iPE (the ingress PE) supports 2 levels
852 of hierarchy only. In that case, the 3-levels forwarding chain in
853 Figure 5 needs to be "flattened" into 2 levels only.
855 OutLabel-List OutLabel-List
856 For VPN-IP1 For VPN-IP2
857 +------------+ +-------+ +-------+ +------------+
858 | VPN-L11 |<---|VPN-IP1| | VPN-IP2|--->| VPN-L22 |
859 +------------+ +---+---+ +---+---+ +------------+
860 | VPN-L21 | | | | VPN-L32 |
861 +------------+ | | +------------+
862 | |
863 | |
864 | |
865 Flattened | | Flattened
866 pathlist V V pathlist
867 +===+===+ +===+===+===+ +==============+
868 +--------+ 0 | 1 | | 0 | 0 | 1 +---->|LASBR112(ePE2)|
869 | +=|=+=\=+ +=/=+=/=+=\=+ +==============+
870 v | \ / / \ |LASBR122(ePE2)|
871 +==============+ | \ +-----+ / \ +==============+
872 |LASBR111(ePE1)| | \/ / \ |LASBR13(ePE3) |
873 +==============+ | /\ / \ +==============+
874 |LASBR121(ePE1)| | / \ / \
875 +==============+ | / \ / \
876 | / \ / \
877 | / + + \
878 | + | | \
879 | | | | \
880 v v v v \
881 +------+ +------+ +------+
882 +----|ASBR11| |ASBR12+---+ |ASBR13+---+
883 | +------+ +------+ | +------+ |
884 v v v
885 +-------+ +-------+ +-------+
886 |IGP-L11| |IGP-L12| |IGP-L13|
887 +-------+ +-------+ +-------+
[] Same comments as above.
889 Figure 6 : Flattening 3 levels to 2 levels of Hierarchy on iPE
891 Figure 6 represents one way to "flatten" a 3 levels hierarchy into
892 two levels. There are few important points:
[nit] s/are few/are a few
...
908 o Let's take a look at the flattened pathlist used by the prefix
909 "VPN-IP2", The pathlist associated with the prefix "VPN-IP2" has
910 three entries.
[nit] s/"VPN-IP2", The/"VPN-IP2". The
912 o The first and second entry have index "0". This is because
913 both entries correspond to ePE2. Thus when hashing performed
914 by the forwarding engine results in using first or the second
915 entry in the pathlist, the forwarding engine will pick the
916 correct VPN label "VPN-L22", which is the label advertised by
917 ePE2 for the prefix "VPN-IP2".
[nit] s/using first/using the first
...
952 o So the packet is forwarded towards the ASBR "ASBR12" and the IGP
953 label at the top will be "L12".
[minor] The figure calls the label "IGP-L12", not just "L12". There
are other instances of this below.
...
982 6. Forwarding Chain Adjustment at a Failure
984 The hierarchical and shared structure of the forwarding chain
985 explained in the previous section allows modifying a small number of
986 forwarding chain objects to re-route traffic to a pre-calculated
987 equal-cost or backup path without the need to modify the possibly
988 very large number of BGP prefixes. In this section, we go over
989 various core and edge failure scenarios to illustrate how FIB
990 manager can utilize the forwarding chain structure to achieve BGP
991 prefix independent convergence.
[nit] s/how FIB manager/how FIB the manager
993 6.1. BGP-PIC core
[minor] The meaning of "core" and "edge" should be explained somwehre.
...
1001 When a remote link or node fails, IGP on the ingress PE receives
1002 advertisement indicating a topology change so IGP re-converges to
1003 either find a new next-hop and/or outgoing interface or remove the
1004 path completely from the IGP prefix used to resolve BGP next-hops.
1005 IGP and/or LDP download the modified IGP leaves with modified
1006 outgoing labels for labeled core.
[nit] s/IGP/the IGP/g
[nit] s/receives advertisement/receives an advertisement
[nit] s/for labeled core/for the labeled core
[] "download"
This is the second time that the "download" concept is used without
explanation. See my comment in §3.1.
1008 When a local link fails, FIB manager detects the failure almost
1009 immediately. The FIB manager marks the impacted path(s) as unusable
1010 so that only useable paths are used to forward packets. Hence only
1011 IGP pathlists with paths using the failed local link need to be
1012 modified. All other pathlists are not impacted. Note that in this
1013 particular case there is actually no need even to backwalk to IGP
1014 leaves to adjust the OutLabel-Lists because FIB can rely on the
1015 path-index stored in the useable paths in the pathlist to pick the
1016 right label.
[nit] s/is actually no need even to/is no need to
[nit] s/to IGP leaves/to the IGP leaves
[major] The term "backwalk" is introduced here, but not explained.
1018 It is noteworthy to mention that because FIB manager modifies the
1019 forwarding chain starting from the IGP leaves only. BGP pathlists
1020 and leaves are not modified. Hence traffic restoration occurs within
1021 the time frame of IGP convergence, and, for local link failure,
1022 assuming a backup path has been precomputed, within the timeframe of
1023 local detection (e.g. 50ms). Examples of solutions that pre-
1024 computing backup paths are IP FRR [16] remote LFA [17], Ti-LFA [15]
1025 and MRT [18] or eBGP path having a backup path [11].
[major] The case on the previous paragraph deals with a local link
failure that results in the routes being "unusable", but this
paragraph mentions restoration "assuming a backup path has been
precomputed" -- the restoration timeframe also applies to the case
where there is no backup path (as explained above).
Note that the next paragraph also talks about a backup route and seems
to ignore the "unusable" case above.
[nit] s/solutions that pre- computing/solutions that can pre-compute
[nit] s/Ti-LFA/TI-LFA
[minor] Expand FRR, LFA, TI-LFA, and MRT on first use.
[nit] s/IP FRR [16] remote LFA...[18] or eBGP/IP FRR [16], remote
LFA...[18], or eBGP
[minor] "eBGP path having a backup path [11]"
I don't think that's the right reference.
1027 Let's apply the procedure mentioned in this subsection to the
1028 forwarding chain depicted in Figure 2. Suppose a remote link failure
1029 occurs and impacts the first ECMP IGP path to the remote BGP next-
1030 hop. Upon IGP convergence, the IGP pathlist used by the BGP next-hop
1031 is updated to reflect the new topology (one path instead of two). As
1032 soon as the IGP convergence is effective for the BGP next-hop entry,
1033 the new forwarding state is immediately available to all dependent
1034 BGP prefixes. The same behavior would occur if the failure was local
1035 such as an interface going down. As soon as the IGP convergence is
1036 complete for the BGP next-hop IGP route, all its BGP depending
1037 routes benefit from the new path. In fact, upon local failure, if
1038 LFA protection is enabled for the IGP route to the BGP next-hop and
1039 a backup path was pre-computed and installed in the pathlist, upon
1040 the local interface failure, the LFA backup path is immediately
1041 activated (e.g. sub-50msec) and thus protection benefits all the
1042 depending BGP traffic through the hierarchical forwarding dependency
1043 between the routes.
[minor] "Upon IGP convergence... As soon as the IGP convergence is
effective for the BGP next-hop entry,"
There's some redundancy in the text. Also, by using different text,
you're implying that there is a difference between "IGP convergence"
and "IGP convergence...effective for the BGP next-hop entry".
[major] "The same behavior would occur if the failure was local such
as an interface going down. As soon as the IGP convergence..."
According to the text above (2 or 3 paragraphs before), IGP
convergence is not necessary in the local failure case.
...
1050 6.2.1. Adjusting forwarding Chain in egress node failure
[minor] Most of the text talks about an "egress node". I assume that
is the same as an "edge node" -- is it? If so, please be consistent
and/or explain somewhere that you're referring to the same thing.
1052 When an edge node fails, IGP on neighboring core nodes send route
1053 updates indicating that the edge node is no longer reachable. IGP
1054 running on the iPE instructs FIB to remove the IP and label leaves
1055 corresponding to the failed edge node from FIB. So FIB manager
1056 performs the following steps:
[major] "IGP on neighboring core nodes send route updates indicating
that the edge node is no longer reachable"
§6.1 says that "Node failures are treated as link failures." Is that
the same in this section?
The neighbor nodes can't detect that the node is down -- they can only
detect the link being down. The rest of the explanation assumes that
there was a single link connecting the egress node to the core, or
that all the links were reported as down. Neither case is explained.
If there is more than one link there shouldn't be a failure in
reaching the next-hop -- except for the change in the local router(s),
similar to a core failure.
...
1073 6.2.2. Adjusting Forwarding Chain on PE-CE link Failure
...
1082 In the first case, the rest of iBGP peers will remain unaware of the
1083 link failure and will continue to forward traffic to the edge node
1084 until the edge node attached to the failed link withdraws the BGP
1085 prefixes. If the destination prefixes are multi-homed to another
1086 iBGP peer, say ePE2, then FIB manager on the edge router detecting
1087 the link failure applies the following steps (see Figure 3):
[minor] Figure 3 is an example -- it does show a sample chain. Figure
4 may be used as a reference for the type of backup described here.
...
1100 o The label entry in OutLabel-List corresponding to the
1101 internal path to backup egress PE has swap action to the
1102 label advertised by backup egress PE.
[nit] s/has swap action/has a swap action
[nit] s/by backup/by the backup
1104 o For an arriving label packet (e.g. VPN), the top label is
1105 swapped with the label advertised by backup egress PE and the
1106 packet is sent towards that backup egress PE.
[nit] s/label packet/labeled packet
[major] Even though you're expanding on an example, it is important
for the text to be precise.
- The label mentioned is "e.g. VPN", but you're been using more
specific labels -- from figure 3, "VPN-L21"...
- The OutLabel-Lists in Figure 3 show both "VPN-L21 (swap)" and
"VPN-L21 (push)". The same label is used, only the swap action is
mentioned here. What am I missing?
- As I mentioned before, the diagram in Figure 3 is not clear.
1108 o For unlabeled traffic, packets are simply redirected towards
1109 backup egress PE.
[nit] s/backup/the backups
[major] "ackets are simply redirected towards backup egress PE"
How? The backup ePE is presumably not connected (which is why a label
was needed)...??
1111 In the second case where the edge router uses the IP address of the
1112 failed link as the BGP next-hop, the edge router will still perform
1113 the previous steps. But, unlike the case of next-hop self, IGP on
1114 failed edge node informs the rest of the iBGP peers that IP address
1115 of the failed link is no longer reachable. Hence the FIB manager on
1116 iBGP peers will delete the IGP leaf corresponding to the IP prefix
1117 of the failed link. The behavior of the iBGP peers will be identical
1118 to the case of edge node failure outlined in Section 6.2.1.
[nit] s/IGP on failed edge node/the IGP on the failed edge node
[nit] s/that IP address/that the IP address
1120 It is noteworthy to mention that because the edge link failure is
1121 local to the edge router, sub-50 msec convergence can be achieved as
1122 described in [11].
[] But only at the edge router -- not in other places of the network
where the IGP has to converge.
Please keep the numbers in the Appendix. IOW, there's no need to
claim this performance here.
1124 Let's try to apply the case of next-hop self to the forwarding chain
1125 depicted in Figure 3. After failure of the link between ePE1 and CE,
1126 the forwarding engine will route traffic arriving from the core
1127 towards VPN-NH2 with path-index=1. A packet arriving from the core
1128 will contain the label VPN-L11 at top. The label VPN-L11 is swapped
1129 with the label VPN-L21 and the packet is forwarded towards ePE2.
[minor] This paragraph is out of place: you already tried to explain
this case a few paragraphs above. Please move the text and
deduplicate.
1131 6.3. Handling Failures for Flattened Forwarding Chains
1133 As explained in the in Section 5 if the number of hierarchy levels
1134 of a platform cannot support the native number of hierarchy levels
1135 of a recursive forwarding chain, the instantiated forwarding chain
1136 is constructed by flattening two or more levels. Hence a 3 levels
1137 chain in Figure 5 is flattened into the 2 levels chain in Figure 6.
[nit] s/a 3 levels/the 3-level
[nit] s/2 levels/2-level
1139 While reducing the benefits of BGP-PIC, flattening one hierarchy
1140 into a shallower hierarchy does not always result in a complete loss
1141 of the benefits of the BGP-PIC. To illustrate this fact suppose
1142 ASBR12 is no longer reachable in domain 1. If the platform supports
1143 the full hierarchy depth, the forwarding chain is the one depicted
1144 in Figure 5 and hence the FIB manager needs to backwalk one level to
1145 the pathlist shared by "ePE1" and "ePE2" and adjust it. If the
1146 platform supports 2 levels of hierarchy, then a useable forwarding
1147 chain is the one depicted in Figure 6. In that case, if ASBR12 is no
1148 longer reachable, the FIB manager has to backwalk to the two
1149 flattened pathlists and updates both of them.
1151 The main observation is that the loss of convergence speed due to
1152 the loss of hierarchy depth depends on the structure of the
1153 forwarding chain itself. To illustrate this fact, let's take two
1154 extremes. Suppose the forwarding objects in level i+1 depend on the
1155 forwarding objects in level i. If every object on level i+1 depends
1156 on a separate object in level i, then flattening level i into level
1157 i+1 will not result in loss of convergence speed. Now let's take the
1158 other extreme. Suppose "n" objects in level i+1 depend on 1 object
1159 in level i. Now suppose FIB flattens level i into level i+1. If a
1160 topology change results in modifying the single object in level i,
1161 then FIB has to backwalk and modify "n" objects in the flattened
1162 level, thereby losing all the benefit of BGP-PIC. Experience shows
1163 that flattening forwarding chains usually results in moderate loss
1164 of BGP-PIC benefits. Further analysis is needed to corroborate and
1165 quantify this statement.
[major] "The main observation is that the loss of convergence speed
due to the loss of hierarchy depth... Further analysis is needed to
corroborate and quantify this statement."
The explanation above sounds plausible. However, if you really don't
know, and there is a dependency on the structure, and it depends on
the failure, and there's no alternative (if the platform supports less
levels)... Why even include this information? It feels like the
equivalent of "hand waving". :-(
1167 7. Properties
[minor] This section is really a summary. I wonder if it would serve
a better purpose at the start of the document as a "road map".
1169 7.1. Coverage
1171 All the possible failures, except CE node failure, are covered,
1172 whether they impact a local or remote IGP path or a local or remote
1173 BGP next-hop as described in Section 6. This section provides
1174 details for each failure and how the hierarchical and shared FIB
1175 structure proposed in this document allows recovery that does not
1176 depend on number of BGP prefixes.
[major] "except CE node failure"
Isn't that covered in §6.2.2?
1178 7.1.1. A remote failure on the path to a BGP next-hop
1180 Upon IGP convergence, the IGP leaf for the BGP next-hop is updated
1181 upon IGP convergence and all the BGP depending routes leverage the
1182 new IGP forwarding state immediately. Details of this behavior can
1183 be found in Section 6.1.
[minor] Redundant text: "Upon IGP convergence, the IGP leaf for the
BGP next-hop is updated upon IGP convergence..."
1185 This BGP resiliency property only depends on IGP convergence and is
1186 independent of the number of BGP prefixes impacted.
[] "BGP resiliency property"
The rest of the text (and the name of the document!) refers to
convergence. It isn't until this section that you change the
terminology. Please be consistent.
1188 7.1.2. A local failure on the path to a BGP next-hop
1190 Upon LFA protection, the IGP leaf for the BGP next-hop is updated to
1191 use the precomputed LFA backup path and all the BGP depending routes
1192 leverage this LFA protection. Details of this behavior can be found
1193 in Section 6.1.
[minor] Redundant text: LFA is mentioned 3 times in the first sentence.
[major] What about the case where an LFA doesn't exist?
...
1238 7.3. Automated
1240 The BGP-PIC solution does not require any operator involvement. The
1241 process is entirely automated as part of the FIB implementation.
[] Ok -- it is not as much an automation as it is an implementation choice.
1243 The salient points enabling this automation are:
[minor] I"m not sure if you're saying that the list below is what
enables the "automation", or if the "automation" results in that.
1245 o Extension of the BGP Best Path to compute more than one primary
1246 ([12]and [13]) or backup BGP next-hop ([7] and [14]).
[major] This functionality is available regardless of the structure of the FIB.
1248 o Sharing of BGP Path-list across BGP destinations with same
1249 primary and backup BGP next-hop.
[nit] s/same/the same
1251 o Hierarchical indirection and dependency between BGP pathlist and
1252 IGP pathlist.
[] The last two bullets are a characteristic of the implementation.
1254 7.4. Incremental Deployment
1256 As soon as one router supports BGP-PIC solution, it benefits from
1257 all its benefits (most notably convergence that does not depend in
1258 the number of prefixes) without any requirement for other routers to
1259 support BGP-PIC.
[nit] Reduncant text: "it benefits from all its benefits"
[major] The assertion above is true. However, (in the extreme) a
single router supporting BGP-PIC doesn't necessarily translate in to
better overall convergence in the network (it depends on where the
router is, the failure, etc.). Also, having a mix of routers could
result in transient micro loops as the speed of convergence is not the
same.
1261 8. Security Considerations
1263 The behavior described in this document is internal functionality
1264 to a router that result in significant improvement to convergence
1265 time as well as reduction in CPU and memory used by FIB while not
1266 showing change in basic routing and forwarding functionality. As
1267 such no additional security risk is introduced by using the
1268 mechanisms proposed in this document.
[] I can't think of anything else to say here, but you did not show
(nor do I need you to) "significant...reduction in CPU and memory
used". It sounds more like a marketing statement. :-(
1270 9. IANA Considerations
1272 No requirements for IANA
[major] s/.../This document has no IANA actions.
1274 10. Conclusions
[] This section is not needed, it contains redundant information.
Please remove it.
...
1283 11. References
[major] The RFC Editor requires that the OID be listed for all
references [1]. Please update the references -- here's an example of
what they should look like [2]:
[RFC4271] Rekhter, Y., Ed., Li, T., Ed., and S. Hares, Ed., "A Border
Gateway Protocol 4 (BGP-4)", RFC 4271, DOI 10.17487/RFC4271,
January 2006, <https://www.rfc-editor.org/info/rfc4271>.
[1] https://www.rfc-editor.org/styleguide/part2/
[2] https://www.rfc-editor.org/refs/ref4271.txt
1285 11.1. Normative References
...
1293 [3] E. Rosen, " Carrying Label Information in BGP-4", RFC 8277,
1294 October 2017
[minor] This reference can be Informative.
1296 [4] Andersson, L., Minei, I., and B. Thomas, "LDP Specification",
1297 RFC 5036, October 2007
[minor] This reference can be Informative.
1299 [5] A. Bashandy, C. Filsfils, S. Previdi, B. Decraene, S.
1300 Litkowski, M. Horneffer, R. Shakir, "Segment Routing with MPLS
1301 data plane", RFC 8660, December 2019
[minor] This reference can be Informative.
...
1377 Appendix A. Perspective
...
1392 o BGP Convergence per BGP destination ~ 200usec conservative,
1394 ~ 100usec optimistic
[minor] "BGP Convergence"
The explanation below talks about recalculating the best route --
which is different than convergence (which includes sending BGP
Updates, etc.).
[EoR -18]
_______________________________________________
rtgwg mailing list
[email protected]
https://www.ietf.org/mailman/listinfo/rtgwg