On 8/15/2018 4:07 AM, Ashutosh Gupta wrote:
Hi Folks,

I have the following comments on draft-ietf-bess-evpn-irb-mcast. I also compare it to draft-sajassi-bess-evpn-mvpn-seamless-interop, which utilizes existing MVPN technology to achieve mcast-irb functionality in EVPN.


*1. Re-branding MVPN constructs into EVPN*
The /evpn-irb/ draft proposes a lot of MVPN constructs into EVPN. Originating multicast receiver interest "per PE" instead of "per BD", and the use of selective tunnels, are a few examples. If the solution really is achievable through MVPN, why do we need to re-brand it in EVPN?

As I and others have endeavored to explain in several messages to this list, the solution is not achievable by simply using MVPN routes and procedures.  Correct emulation of the ethernet multicast service requires the addition of BD-specific information and procedures.  Without this information, the distinction between "ES" and "BD" is lost.  This results in such problems as:

- Failure to correctly emulate ethernet behavior, and

- Inability to summarize routes on a per-subnet basis, thus requiring the MVPN PEs to be bombarded with host routes.

The seamless-mcast draft also severely underspecifies the behavior needed when interconnecting an EVPN domain to an MVPN domain that uses a different tunnel type, and does not appear to recognize either how common that scenario is, or how tricky it is to get that right.  (That's why the MVPN P-tunnel segmentation procedures are so intricate;  that's a whole area that seems to be ignored in the seamless-mcast draft.)

It's also worth reminding folks that 90% of EVPN is based on messages and procedures borrowed from L3VPN/MVPN, with the addition of information and procedures that are needed to do EVPN-specific things involving Ethernet Segments and Broadcast Domains.  This is certainly true of the EVPN procedures for advertising unicast IP routes, which do not "just use L3VPN".  The OISM proposal in the irb-mcast draft just follows along these lines.

Furthermore, it's worth noting that the so-called "seamless-mcast" draft has EVPN-specific procedures as well.  And more keep getting added as additional problems get discovered (e.g., the new intra-ES tunnels).   It's very far from being "just use MVPN protocols as is".

Finally, one might also point out that the folks with the most in-depth knowledge of MVPN seem to be the least enthusiastic about the "seamless-mcast" draft.  Just saying ;-)

In the remainder of this message I want to focus on Ashutosh's criticisms of the OISM proposal (as we call the proposal in the irb-mcast draft).

*4. Data Plane considerations*
*4.1.* The data-plane nuances of the solution have been underplayed. For example, suppose PE1 has (S,G) receivers in BD2, BD3, and so on through BD10, whereas source S belongs to the BD1 subnet on PE2. If BD1 is not configured locally on PE1, a special BD (called the SBD) is programmed as the IIF in the forwarding entry. Later, if BD1 gets configured on PE1, the IIF on PE1 would change from the SBD to BD1.

So far correct.

This would result in traffic disruption for all existing receivers in BD2 through BD10.

The IIF in an (S,G) or (*,G) state can change at any time, due to distant (or not so distant) link failures, and/or due to other routing changes.  It changes whenever the UMH selection changes, and it changes if the ingress PE decides to move a flow from an I-PMSI to an S-PMSI.  Of all the events that might cause a particular multicast state to change its IIF, the addition of a BD on the local PE does not seem to be one of the more frequent.

Some PIM implementations try to delay changing the IIF in a given (S,G) state until an (S,G) packet actually arrives from the new IIF.  However, such data-driven state changes introduce a lot of complexity.  MVPN has always tried to avoid data-driven state changes, and MVPN implementations will generally see some small amount of disruption when the IIF changes.  Methods for minimizing this are really an implementation matter.

In the case where a new BD is configured, there may be some risk of an (S,G) packet getting lost in the following situation:

- Time t0: The packet arrives and gets marked as being from the SBD.
- Time t1: The IIF changes from the SBD to BD1.
- Time t2: The packet is processed against the (S,G) state and discarded as having arrived from the wrong IIF.

Given (a) how short the time t2-t0 is, (b) how infrequently new BDs get configured on a PE, and (c) the fact that there are many other causes of IIF changes, this does not seem like a real problem.  If it is regarded as a problem, resolving it would be an implementation matter.
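Just to make the race concrete, here is a rough sketch (in Python-style pseudocode; the names and data structures are purely illustrative, not anything specified in the draft) of the kind of IIF check that would discard the one in-flight packet at time t2:

    # Hypothetical (S,G) forwarding state; "iif" is the expected source BD.
    class SGState:
        def __init__(self, iif, oifs):
            self.iif = iif        # e.g. "SBD" before t1, "BD1" after t1
            self.oifs = oifs      # e.g. ["BD2", "BD3", ..., "BD10"]

    def forward(packet_src_bd, state):
        # The packet was classified at time t0 as arriving from packet_src_bd.
        # If the state's IIF changed at time t1, this check fails at time t2.
        if packet_src_bd != state.iif:
            return []             # discard: wrong IIF
        return state.oifs         # otherwise replicate to the outgoing BDs

    state = SGState(iif="BD1", oifs=["BD2", "BD3", "BD10"])
    print(forward("SBD", state))  # [] -- the in-flight packet is lost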


*4.2.* Also, the /evpn-irb/ solution proposes to relax the RPF check for a (*,G) multicast entry. This poses a great risk of traffic loops, especially in transient network conditions, in addition to poor debuggability.

In the case where the receiving tenant systems ask for (*,G), and all the sources are known to be in the EVPN Tenant Domain, a Rendezvous Point (RP) is not required, as there is no real need for "source discovery", "shared trees", or "switching from shared trees to source trees".  Thus some of the most complicated and error-prone features of IP multicast can be avoided.  Note that we also avoid the need to create an (S,G) state for every source, which may result in a considerable reduction in state.  (Though obviously this optimization can only be used if there is a priori knowledge that all sources for a given group are within the EVPN domain.)
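To put a rough number on that state reduction (the figures below are purely illustrative, not from the draft):

    # If every source needs its own (S,G) state versus one (*,G) per group:
    groups = 200            # hypothetical number of groups
    sources_per_group = 5   # hypothetical number of sources per group

    per_source_states = groups * sources_per_group   # 1000 (S,G) states
    star_g_states = groups                           #  200 (*,G) states
    print(per_source_states, star_g_states)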

I don't think I've ever heard anybody say "I wish I had more PIM RPs, and more switching between shared trees and source trees; that makes things really easy to debug!".  I've heard the opposite though, quite a lot ;-)

I've heard it suggested that the use of MVPN's SPT-ONLY mode provides an equivalent simplification.  However, that is not true at all.  The SPT-ONLY mode (RFC 6514 section 14) requires each PE to function as an RP.   This creates a whole new set of RPs that need to be managed, creates more work for the PEs, and requires each PE to originate more routes.  It requires the PE to create a lot of (S,G) states that are not otherwise needed.  And if an MVPN customer (or EVPN tenant) already has its own RP infrastructure, the PEs may need to participate in BSR or Auto-RP, and/or may need to talk MSDP to the other RPs.

If a given multicast group has all its sources and receivers in the EVPN domain, it's much simpler if one can avoid RPs entirely for that group.  And if that EVPN domain has other multicast groups with sources or receivers in an MVPN domain, it may even be impossible to configure the EVPN-PEs to be in SPT-ONLY mode.  The MVPN nodes are likely to be configured with RPT-SPT mode, and the two modes do not interoperate.

With regard to the "great risk of traffic loops", I'd like to hear more specifics.  To get a loop, there'd have to be a situation in which a PE gets a (*,G) packet, sends it out one of its local ACs, and then the same packet is received back (at the same or at another PE) over an AC.  Given that the all-active multi-homing procedures are in place, I just don't see how this will happen.  (Remember that we're not talking about the multicast states used to create trees in the underlay; when creating trees in the underlay, looping is more of a worry, but that's not relevant to the current discussion.)

*3. Control plane scale in fabric / core*
... each PE one additional tunnel per BD, apart from the existing BUM tunnel. Essentially one tunnel for B+U and another for M. This is proposed to avoid all B+U traffic in BD1 indiscriminately reaching all PEs in the domain, irrespective of whether they have BD1 configured locally or not. This increases the state in the fabric by *"num of PEs" x "num of BDs"*.

I think the issue alluded to here is actually the data plane state, i.e.,  the amount of state that has to be maintained in order to do the forwarding.

We should also make clear that when we say things like "one per PE", what we really mean is "one per PE per Tenant Domain" (just as MVPN tunnels are "one per PE per VPN").  Everything below will be in the context of a given Tenant Domain.

Let me point out first that the number of BUM tunnels that exist per RFC 7432 (assuming P2MP LSPs are used) is *O(number of PEs x number of BDs)*.  If one adds a second per-BD tunnel so that one can carry the IP multicast frames separately, the number of tunnels is still *O(number of PEs x number of BDs)*.

However, for the IP multicast tunnels, I think it would be better to use aggregated tunnels, so that a single P2MP LSP from, say, PE1, carries the IP multicast frames from all of PE1's locally attached ACs (in the given Tenant Domain), no matter which BD the AC is attached to.  That is, one really needs only one additional P2MP LSP per PE per Tenant Domain, not one per BD.
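A back-of-the-envelope comparison may help; the numbers below are invented purely for illustration:

    pes = 100            # PEs attached to the Tenant Domain
    bds_per_pe = 50      # BDs of that Tenant Domain on each PE

    # RFC 7432 style: one inclusive P2MP LSP per PE per locally attached BD.
    bum_tunnels = pes * bds_per_pe                  # 5000

    # A second per-BD tunnel for IP multicast keeps it O(PEs x BDs).
    per_bd_scheme = 2 * bum_tunnels                 # 10000

    # Aggregated IP multicast tunnels: one extra P2MP LSP per PE per
    # Tenant Domain, regardless of how many BDs it attaches to.
    aggregated_scheme = bum_tunnels + pes           # 5100
    print(per_bd_scheme, aggregated_scheme)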

In order to properly perform the ethernet emulation, an egress PE, say PE2, has to be able to determine whether an IP multicast frame from PE1 came from one of the BDs to which PE1 is locally attached. This means that the data plane encapsulation has to contain information that allows PE2 to make this determination.  In the draft, this is done by using an MPLS label or a VNID (depending upon encapsulation) that identifies the frame's source BD.

When using MPLS encapsulation, the assignment of labels to BDs is best done by using the technique described in draft-zzhang-bess-mvpn-evpn-aggregation-label ("MVPN/EVPN Tunnel Aggregation with Common Labels").  Per that draft, each BD would be assigned a domain-wide unique label (from a "Domain-Wide Common Block", or DCB).  Then the number of BD-identifying labels that each PE needs to know is proportional to the number of BDs in the Tenant Domains to which it is attached.  So in the data plane the number of labels one has to know for receiving is just *"number of BDs in locally attached Tenant Domains"*; the number of PEs does not factor in.  (Note that when using VXLAN encapsulation, the VNIDs are already domain-wide unique.)

When one is just doing RFC 7432, the number of data plane labels one needs is *"number of locally attached BDs"*, which will certainly be a smaller number, but the difference between these two numbers does not seem very scary.

One might well ask though whether the number of labels a receiving PE has to know could be reduced from *"number of BDs in locally attached Tenant Domains"* to *"number of locally attached BDs"*.  This is certainly worth exploring.  If one receives a frame on a given P2MP LSP, one knows from the top label what Tenant Domain it belongs to.  The second label tells one what BD the frame belongs to.  If that second label is unrecognized, one could decide to treat the frame as belonging to the SBD of the Tenant Domain identified by the top label.  This sort of technique would really minimize the data plane state, while still allowing the source BD to be identified.  This is worth some further investigation.
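A minimal sketch of that receive-side lookup might look like the following (the label values, table names, and SBD naming are all invented for illustration; nothing here is specified in the draft):

    # Top label identifies the Tenant Domain; second label (a DCB label or
    # VNID) identifies the source BD, but only locally attached BDs are
    # programmed in the table.
    tenant_by_top_label = {1000: "TD-A"}
    bd_by_second_label = {2001: "BD1", 2002: "BD2"}

    def classify(top_label, second_label):
        tenant = tenant_by_top_label[top_label]
        # Unrecognized BD label: treat the frame as coming from the SBD of
        # the Tenant Domain identified by the top label.
        source_bd = bd_by_second_label.get(second_label, "SBD-" + tenant)
        return tenant, source_bd

    print(classify(1000, 2001))   # ('TD-A', 'BD1')
    print(classify(1000, 2999))   # ('TD-A', 'SBD-TD-A')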


*2. Scale of BGP routes*
The /evpn-irb/ solution mandates that a PE process and store all IMET NLRIs from all peer PEs in the tenant domain (as opposed to processing and storing only the NLRIs for the BDs it has locally present).

Now we're talking about control plane scale.

Just for clarity, please note that when Ingress Replication is used, the BD-specific IMETs do not carry the SBD-RT  (see section 3.2.2) and hence are not distributed to all PEs in the Tenant Domain. (There are some sections of the draft that are inconsistent about this; that will be fixed.)  So this issue does not arise in that case.

This is proposed because multicast traffic could be originating from any PE in any BD. To put this in perspective, let's take the example of a tenant domain with 101 PEs, with each PE having 1000 BDs. Each PE has at most 10 BDs in common with any other PE in the network. In this case PE1 will have to process and store 100 (remote PEs) x 1000 (BDs per PE) x 1 (IMET per BD) = 0.1 million IMET routes. Essentially, it is of order *"Num of BDs" x "Num of PEs"*.

I'd like to see the example deployment where a single Tenant Domain attaches to 100 PEs, each of those PEs attaches to 1000 BDs of that Tenant Domain, and each individual BD attaches to only 10 of those PEs!

However, the basic point is well taken.  In the context of a given Tenant Domain, the number of IMET routes a PE receives will be *O(number of PEs x average number of BDs per PE)*.

Whether this really presents a significant control plane scaling issue is debatable, but let's see if we can reduce this.

If we use aggregate tunnels, the tunnel will be specified in the SBD-IMET route.  When a PE sends an SBD-IMET route, it can include in that route the label used to identify each of the BDs to which that PE is locally attached.  If we do that, we don't need to send the per-BD IMET routes to all PEs in the Tenant Domain.  (If there are too many locally attached BDs of a given Tenant Domain to fit into a single SBD-IMET route, we could borrow the technique of RFC 7432 Section 8.2 to send a small number of additional SBD-IMET routes.)  With this minor change, we could reduce the number of IMET routes so that it is proportional only to the number of PEs.
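Using the (hypothetical) numbers from the example above, the effect would be roughly:

    remote_pes = 100
    bds_per_pe = 1000

    per_bd_imets = remote_pes * bds_per_pe   # 100,000 IMET routes at PE1
    sbd_imets_only = remote_pes              # ~100 SBD-IMET routes, each
                                             # carrying the sender's per-BD
                                             # labels (a few more if they
                                             # don't all fit in one route)
    print(per_bd_imets, sbd_imets_only)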

Further, if the labels used to identify the BDs are domain-wide unique, it may be possible to omit them from the SBD-IMET routes altogether.   Or perhaps only the DR for each BD will advertise the label for that BD.  Or perhaps they can be learned directly from a controller.  These are all options worth exploring as a way of reducing the overhead.

So I think that by making some small adjustments, we can greatly reduce the control plane overhead without incurring all the problems of the "seamless-mcast" scheme.


