On 8/15/2018 4:07 AM, Ashutosh Gupta wrote:
Hi Folks,
I have the following comments on draft-ietf-bess-evpn-irb-mcast. I also
compare it to draft-sajassi-bess-evpn-mvpn-seamless-interop, which
utilizes existing MVPN technology to achieve mcast-irb functionality
in EVPN.
*1. Re-branding MVPN constructs into EVPN*
The /evpn-irb/ draft proposes a lot of MVPN constructs in EVPN.
Originating multicast receiver interest "per PE" instead of "per BD"
and the use of selective tunnels are a few examples. If the solution
really is achievable through MVPN, why do we need to re-brand it in EVPN?
As I and others have endeavored to explain in several messages to this
list, the solution is not achievable by simply using MVPN routes and
procedures. Correct emulation of the ethernet muliticast service
requires the addition of BD-specific information and procedures.
Without this information, the distinction between "ES" and "BD" is
lost. This results in such problems as:
- Failure to correctly emulate ethernet behavior, and
- Inability to summarize routes on a per-subnet basis, thus requiring
the MVPN PEs to be bombarded with host routes.
The seamless-mcast draft also severely underspecifies the behavior
needed when interconnecting an EVPN domain to an MVPN domain that uses a
different tunnel type, and does not appear to recognize either how
common that scenario is, or how tricky it is to get that right. (That's
why the MVPN P-tunnel segmentation procedures are so intricate; that's
a whole area that seems to be ignored in the seamless-mcast draft.)
It's also worth reminding folks that 90% of EVPN is based on messages
and procedures borrowed from L3VPN/MVPN, with the addition of
information and procedures that are needed to do EVPN-specific things
involving Ethernet Segments and Broadcast Domains. This is certainly
true of the EVPN procedures for advertising unicast IP routes, which do
not "just use L3VPN". The OISM proposal in the irb-mcast draft just
follows along these lines.
Furthermore, it's worth noting that the so-called "seamless-mcast" draft
has EVPN-specific procedures as well. And more keep getting added as
additional problems get discovered (e.g., the new intra-ES tunnels).
It's very far from being "just use MVPN protocols as is".
Finally, one might also point out that the folks with the most in-depth
knowledge of MVPN seem to be the least enthusiastic about the
"seamless-mcast" draft. Just saying ;-)
In the remainder of this message I want to focus on Ashutosh's
criticisms of the OISM proposal (as we call the proposal in the
irb-mcast draft).
*4. Data Plane considerations*
*4.1.* The data-plane nuances of the solution have been underplayed.
For example, suppose PE1 has (S,G) receivers in BD2, BD3, and so on
through BD10, whereas source S belongs to the BD1 subnet on PE2. If BD1
is not configured locally on PE1, a special BD (called the SBD) is
programmed as the IIF in the forwarding entry. Later, if BD1 gets
configured on PE1, the IIF on PE1 would change from the SBD to BD1.
So far correct.
This would result in traffic disruption for all existing receivers in
BD2 through BD10.
The IIF in an (S,G) or (*,G) state can change at any time, due to
distant (or not so distant) link failures, and/or due to other routing
changes. It changes whenever the UMH selection changes, and it changes
if the ingress PE decides to move a flow from an I-PMSI to an S-PMSI.
Of all the events that might cause a particular multicast state to
change its IIF, the addition of a BD on the local PE does not seem to be
one of the more frequent.
Some PIM implementations try to delay changing the IIF in a given (S,G)
state until an (S,G) packet actually arrives from the new IIF. However,
such data-driven state changes introduce a lot of complexity. MVPN has
always tried to avoid data-driven state changes, and MVPN
implementations will generally see some small amount of disruption when
the IIF changes. Methods for minimizing this are really an
implementation matter.
In the case where a new BD is configured, there may be some risk of an
(S,G) packet getting lost in the following situation:
- Time t0: The packet arrives and gets marked as being from the SBD.
- Time t1: The IIF changes from SBD to BD1
- Time t2: The packet is processed against the (S,G) state and discarded
as having arrived from the wrong IIF
Given (a) how short the time t2-t0 is, (b) how infrequently new BDs get
configured on a PE, and (c) the fact that there are many other causes of
IIF changes, this does not seem like a real problem. If it is regarded
as a problem, resolving it would be an implementation matter.
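To make the t0/t1/t2 sequence above concrete, here is a minimal sketch (in Python, with invented class and interface names; it is not from the draft) of an (S,G) forwarding state with an RPF/IIF check, showing why a packet already marked as arriving from the SBD can be discarded once the IIF changes to the newly configured BD:

```python
# Hypothetical model of (S,G) forwarding state. The IIF is the expected
# incoming interface; packets arriving on any other interface fail the
# RPF check and are dropped.

class MulticastState:
    def __init__(self, iif, oifs):
        self.iif = iif      # expected incoming interface, e.g. "SBD" or "BD1"
        self.oifs = oifs    # outgoing interfaces, e.g. ["BD2", ..., "BD10"]

    def forward(self, packet_iif):
        # RPF check: discard packets marked with the wrong incoming interface.
        if packet_iif != self.iif:
            return []       # dropped
        return self.oifs    # replicated to all receivers

state = MulticastState("SBD", ["BD2", "BD3", "BD10"])

# t0: a packet arrives and is marked as coming from the SBD; it passes.
assert state.forward("SBD") == ["BD2", "BD3", "BD10"]

# t1: BD1 is configured locally; the IIF changes from the SBD to BD1.
state.iif = "BD1"

# t2: the in-flight packet, still marked "SBD", now fails the RPF check.
# The loss window is t2 - t0, which is very short.
assert state.forward("SBD") == []
```

The sketch only illustrates why the loss window exists; as the text says, shrinking it further is an implementation matter.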
*4.2.* Also, the /evpn-irb/ solution proposes to relax the RPF check
for (*,G) multicast entries. This poses a great risk of traffic loops,
especially under transient network conditions, in addition to poor
debuggability.
In the case where the receiving tenant systems ask for (*,G), and all
the sources are known to be in the EVPN Tenant Domain, a Rendezvous
Point (RP) is not required, as there is no real need for "source
discovery", "shared trees", or "switching from shared trees to source
trees". Thus some of the most complicated and error-prone features of
IP multicast can be avoided. Note that we also avoid the need to create
an (S,G) state for every source, which may result in a considerable
reduction in state. (Though obviously this optimization can only be
used if there is a priori knowledge that all sources for a given group
are within the EVPN domain.)
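The state reduction mentioned above is easy to quantify. A back-of-the-envelope comparison (the counts are hypothetical, chosen only for illustration):

```python
# Illustrative arithmetic: if all sources for each group are known to be
# inside the EVPN Tenant Domain, one (*,G) state per group can replace
# one (S,G) state per source.

num_groups = 50
sources_per_group = 20

sg_states = num_groups * sources_per_group   # one (S,G) state per source
star_g_states = num_groups                   # one (*,G) state per group

print(sg_states, star_g_states)  # 1000 vs. 50
```

The larger the number of sources per group, the bigger the saving from avoiding per-source state.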
I don't think I've ever heard anybody say "I wish I had more PIM RPs,
and more switching between shared trees and source trees; that makes
things really easy to debug!". I've heard the opposite though, quite a
lot ;-)
I've heard it suggested that the use of MVPN's SPT-ONLY mode provides an
equivalent simplification. However, that is not true at all. The
SPT-ONLY mode (RFC 6514 section 14) requires each PE to function as an
RP. This creates a whole new set of RPs that need to be managed,
creates more work for the PEs, and requires each PE to originate more
routes. It requires the PE to create a lot of (S,G) states that are not
otherwise needed. And if an MVPN customer (or EVPN tenant) already has
its own RP infrastructure, the PEs may need to participate in BSR or
Auto-RP, and/or may need to talk MSDP to the other RPs.
If a given multicast group has all its sources and receivers in the EVPN
domain, it's much simpler if one can avoid RPs entirely for that group.
And if that EVPN domain has other multicast groups with sources or
receivers in an MVPN domain, it may even be impossible to configure the
EVPN-PEs to be in SPT-ONLY mode. The MVPN nodes are likely to be
configured with RPT-SPT mode, and the two modes do not interoperate.
With regard to the "great risk of traffic loops", I'd like to hear more
specifics. To get a loop, there'd have to be a situation in which a PE
gets a (*,G) packet, sends it out one of its local ACs, and then the
same packet is received back (at the same or at another PE) over an AC.
Given that the all-active multi-homing procedures are in place, I just
don't see how this will happen. (Remember that we're not talking about
the multicast states used to create trees in the underlay; when creating
trees in the underlay, looping is more of a worry, but that's not
relevant to the current discussion.)
*3. Control plane scale in fabric / core*
... each PE needs one additional tunnel per BD apart from the existing
BUM tunnel. Essentially, one tunnel for B+U and another for M. This is
proposed to avoid all B+U traffic in BD1 indiscriminately reaching all
PEs in the domain, irrespective of whether they have BD1 configured
locally or not. This increases the state in the fabric by *"num of
PEs" x "num of BDs"*.
I think the issue alluded to here is actually the data plane state,
i.e., the amount of state that has to be maintained in order to do the
forwarding.
We should also make clear that when we say things like "one per PE",
what we really mean is "one per PE per Tenant Domain" (just as MVPN
tunnels are "one per PE per VPN"). Everything below will be in the
context of a given Tenant Domain.
Let me point out first that the number of BUM tunnels that exist per RFC
7432 (assuming P2MP LSPs are used) is *O(number of PEs x number of
BDs)*. If one adds a second per-BD tunnel so that one can carry the IP
multicast frames separately, the number of tunnels is still *O(number of
PEs x number of BDs)*.
However, for the IP multicast tunnels, I think it would be better to use
aggregated tunnels, so that a single P2MP LSP from, say, PE1, carries
the IP multicast frames from all of PE1's locally attached ACs (in the
given Tenant Domain), no matter which BD the AC is attached to. That
is, one really needs only one additional P2MP LSP per PE per Tenant
Domain, not one per BD.
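The difference between per-BD tunnels and aggregated tunnels can be sketched numerically (the PE and BD counts below are hypothetical, not from the draft):

```python
# Illustrative arithmetic: per-BD IP multicast tunnels versus one
# aggregated P2MP LSP per PE per Tenant Domain.

num_pes = 100
bds_per_pe = 1000

# RFC 7432 style: one inclusive tunnel per PE per BD.
per_bd_tunnels = num_pes * bds_per_pe   # O(PEs x BDs)

# Aggregated: one additional P2MP LSP per PE per Tenant Domain, carrying
# the IP multicast frames from all of that PE's ACs, whatever the BD.
aggregated_tunnels = num_pes            # O(PEs)

print(per_bd_tunnels, aggregated_tunnels)  # 100000 vs. 100
```

With aggregation, the number of BDs per PE drops out of the tunnel count entirely.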
In order to properly perform the ethernet emulation, an egress PE, say
PE2, has to be able to determine whether an IP multicast frame from PE1
came from one of the BDs to which PE1 is locally attached. This means
that the data plane encapsulation has to contain information that allows
PE2 to make this determination. In the draft, this is done by using an
MPLS label or a VNID (depending upon encapsulation) that identifies the
frame's source BD.
When using MPLS encapsulation, the assignment of labels to BDs is best
done by using the technique described in
draft-zzhang-bess-mvpn-evpn-aggregation-label ("MVPN/EVPN Tunnel
Aggregation with Common Labels"). Per that draft, each BD would be
assigned a domain-wide unique label (from a "Domain-Wide Common Block",
or DCB). Then the number of BD-identifying labels that each PE needs to
know is proportional to the number of BDs in the Tenant Domains to which
it is attached. So in the data plane the number of labels one has to
know for receiving is just *"number of BDs in locally attached Tenant
Domains"*; the number of PEs does not factor in. (Note that when using
VXLAN encapsulation, the VNIDs are already domain-wide unique.)
When one is just doing RFC 7432, the number of data plane labels one
needs is *"number of locally attached BDs"*, which will certainly be a
smaller number, but the difference between these two numbers does not
seem very scary.
One might well ask, though, whether the number of labels a receiving PE
has to know could be reduced from *"number of BDs in locally attached
Tenant Domains"* to *"number of locally attached BDs"*. This is
certainly worth exploring. If one receives a frame on a given P2MP LSP,
one knows from the top label what Tenant Domain it belongs to. The
second label tells one what BD the frame belongs to. If that second
label is unrecognized, one could decide to treat the frame as belonging
to the SBD of the Tenant Domain identified by the top label. This sort
of technique would really minimize the data plane state, while still
allowing the source BD to be identified. This is worth some further
investigation.
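The two-label egress lookup just described can be sketched as a simple table lookup with an SBD fallback (label values and BD names are invented for illustration; this is not a specified procedure):

```python
# Hypothetical egress-PE lookup: the top label has already identified the
# Tenant Domain; the second label identifies the frame's source BD. The
# egress PE installs entries only for its locally attached BDs.

local_bd_labels = {1001: "BD1", 1002: "BD2"}  # DCB-assigned, domain-wide unique

def classify_source_bd(second_label, sbd="SBD"):
    # Known label: the frame came from a locally attached BD.
    # Unknown label: treat the frame as belonging to the Tenant Domain's SBD.
    return local_bd_labels.get(second_label, sbd)

assert classify_source_bd(1001) == "BD1"
assert classify_source_bd(1003) == "SBD"   # not locally attached -> SBD
```

This keeps the egress label table proportional to the number of locally attached BDs while still letting the PE distinguish "local BD" from "somewhere else in the Tenant Domain".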
*2. Scale of BGP routes*
The /evpn-irb/ solution mandates that a PE process and store all IMET
NLRIs from all peer PEs in the tenant domain (as opposed to processing
and storing only NLRIs for the BDs it has locally present).
Now we're talking about control plane scale.
Just for clarity, please note that when Ingress Replication is used, the
BD-specific IMETs do not carry the SBD-RT (see section 3.2.2) and hence
are not distributed to all PEs in the Tenant Domain. (There are some
sections of the draft that are inconsistent about this; that will be
fixed.) So this issue does not arise in that case.
This is proposed because multicast traffic could be originating from
any PE in any BD. To put this in perspective, let's take the example of
a tenant domain with 101 PEs, each PE having 1000 BDs. Each PE has at
most 10 BDs in common with any other PE in the network. In this case
PE1 will have to process and store 100 (remote PEs) x 1000 (BDs per
PE) x 1 (IMET per BD) = 0.1 million IMET routes. Essentially, it is of
order *"Num of BDs" x "Num of PEs"*.
I'd like to see the example deployment where a single Tenant Domain
attaches to 100 PEs, each of those PEs attaches to 1000 BDs of that
Tenant Domain, and each individual BD attaches to only 10 of those PEs!
However, the basic point is well taken. In the context of a given
Tenant Domain, the number of IMET routes a PE receives will be
*O(number of PEs x average number of BDs per PE)*. Whether this really
presents a significant control plane scaling issue is debatable, but
let's see if we can reduce this.
If we use aggregate tunnels, the tunnel will be specified in the
SBD-IMET route. When a PE sends an SBD-IMET route, it can include in
that route the label used to identify each of the BDs to which that PE
is locally attached. If we do that, we don't need to send the per-BD
IMET routes to all PEs in the Tenant Domain. (If there are too many
locally attached BDs of a given Tenant Domain to fit into a single
SBD-IMET route, we could borrow the technique of RFC 7432 Section 8.2 to
send a small number of additional SBD-IMET routes.) With this minor
change, we could reduce the number of IMET routes so that it is
proportional only to the number of PEs.
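Using the numbers from the example above, the reduction looks like this (the counts are illustrative, and ignore the occasional extra SBD-IMET route needed when the label list overflows a single route):

```python
# Illustrative arithmetic: per-BD IMET routes flooded domain-wide versus
# carrying the per-BD labels inside each PE's SBD-IMET route.

remote_pes = 100
bds_per_pe = 1000

# Without aggregation: one IMET route per BD per remote PE.
per_bd_imets = remote_pes * bds_per_pe   # 0.1 million routes

# With aggregation: roughly one SBD-IMET route per remote PE, listing the
# labels of that PE's locally attached BDs.
sbd_imets = remote_pes                   # proportional to number of PEs

print(per_bd_imets, sbd_imets)  # 100000 vs. 100
```

That is a three-orders-of-magnitude reduction in routes for this example, at the cost of packing BD labels into the SBD-IMET route.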
Further, if the labels used to identify the BDs are domain-wide unique,
it may be possible to omit them from the SBD-IMET routes altogether.
Or perhaps only the DR for each BD will advertise the label for that
BD. Or perhaps they can be learned directly from a controller. These
are all options worth exploring as a way of reducing the overhead.
So I think that by making some small adjustments, we can greatly reduce
the control plane overhead without incurring all the problems of the
"seamless-mcast" scheme.
_______________________________________________
BESS mailing list
BESS@ietf.org
https://www.ietf.org/mailman/listinfo/bess