On 8/15/2018 4:07 AM, Ashutosh Gupta wrote:
Hi Folks,
I have the following comments on draft-ietf-bess-evpn-irb-mcast. I also
compare it to draft-sajassi-bess-evpn-mvpn-seamless-interop, which
utilizes existing MVPN technology to achieve mcast-irb functionality
in EVPN.
*1. Re-branding MVPN constructs into EVPN*
The /evpn-irb/ draft proposes a lot of MVPN constructs in EVPN.
Originating multicast receiver interest "per PE" instead of "per BD"
and the use of selective tunnels are a few examples. If the solution
really is achievable through MVPN, why do we need to re-brand it in EVPN?
As I and others have endeavored to explain in several messages to this
list, the solution is not achievable by simply using MVPN routes and
procedures. Correct emulation of the ethernet muliticast service
requires the addition of BD-specific information and procedures.
Without this information, the distinction between "ES" and "BD" is
lost. This results in such problems as:
- Failure to correctly emulate ethernet behavior, and
- Inability to summarize routes on a per-subnet basis, thus requiring
the MVPN PEs to be bombarded with host routes.
The seamless-mcast draft also severely underspecifies the behavior
needed when interconnecting an EVPN domain to an MVPN domain that uses a
different tunnel type, and does not appear to recognize either how
common that scenario is, or how tricky it is to get that right. (That's
why the MVPN P-tunnel segmentation procedures are so intricate; that's
a whole area that seems to be ignored in the seamless-mcast draft.)
It's also worth reminding folks that 90% of EVPN is based on messages
and procedures borrowed from L3VPN/MVPN, with the addition of
information and procedures that are needed to do EVPN-specific things
involving Ethernet Segments and Broadcast Domains. This is certainly
true of the EVPN procedures for advertising unicast IP routes, which do
not "just use L3VPN". The OISM proposal in the irb-mcast draft just
follows along these lines.
Furthermore, it's worth noting that the so-called "seamless-mcast" draft
has EVPN-specific procedures as well. And more keep getting added as
additional problems get discovered (e.g., the new intra-ES tunnels).
It's very far from being "just use MVPN protocols as is".
Finally, one might also point out that the folks with the most in-depth
knowledge of MVPN seem to be the least enthusiastic about the
"seamless-mcast" draft. Just saying ;-)
In the remainder of this message I want to focus on Ashutosh's
criticisms of the OISM proposal (as we call the proposal in the
irb-mcast draft).
*4. Data Plane considerations*
*4.1.* The data-plane nuances of the solution have been underplayed.
For example, suppose PE1 has (S,G) receivers in BD2, BD3, and so on
through BD10, whereas source S belongs to the BD1 subnet on PE2. If BD1
is not configured locally on PE1, a special BD (called the SBD) is
programmed as the IIF in the forwarding entry. Later, if BD1 gets
configured on PE1, the IIF on PE1 would change from the SBD to BD1.
So far correct.
This would result in traffic disruption for all existing receivers in
BD2 through BD10.
The IIF in an (S,G) or (*,G) state can change at any time, due to
distant (or not so distant) link failures, and/or due to other routing
changes. It changes whenever the UMH selection changes, and it changes
if the ingress PE decides to move a flow from an I-PMSI to an S-PMSI.
Of all the events that might cause a particular multicast state to
change its IIF, the addition of a BD on the local PE does not seem to be
one of the more frequent.
Some PIM implementations try to delay changing the IIF in a given (S,G)
state until an (S,G) packet actually arrives from the new IIF. However,
such data-driven state changes introduce a lot of complexity. MVPN has
always tried to avoid data-driven state changes, and MVPN
implementations will generally see some small amount of disruption when
the IIF changes. Methods for minimizing this are really an
implementation matter.
In the case where a new BD is configured, there may be some risk of an
(S,G) packet getting lost in the following situation:
- Time t0: The packet arrives and gets marked as being from the SBD.
- Time t1: The IIF changes from SBD to BD1
- Time t2: The packet is processed against the (S,G) state and discarded
as having arrived from the wrong IIF
Given (a) how short the time t2-t0 is, (b) how infrequently new BDs get
configured on a PE, and (c) the fact that there are many other causes of
IIF changes, this does not seem like a real problem. If it is regarded
as a problem, resolving it would be an implementation matter.
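To make the t0/t1/t2 sequence above concrete, here is a minimal sketch (in Python, with invented class and interface names; it is not from the draft) of an (S,G) forwarding state with an RPF/IIF check, showing why a packet already marked as arriving from the SBD can be discarded once the IIF changes to the newly configured BD:

```python
# Hypothetical model of (S,G) forwarding state. The IIF is the expected
# incoming interface; packets arriving on any other interface fail the
# RPF check and are dropped.

class MulticastState:
    def __init__(self, iif, oifs):
        self.iif = iif      # expected incoming interface, e.g. "SBD" or "BD1"
        self.oifs = oifs    # outgoing interfaces, e.g. ["BD2", ..., "BD10"]

    def forward(self, packet_iif):
        # RPF check: discard packets marked with the wrong incoming interface.
        if packet_iif != self.iif:
            return []       # dropped
        return self.oifs    # replicated to all receivers

state = MulticastState("SBD", ["BD2", "BD3", "BD10"])

# t0: a packet arrives and is marked as coming from the SBD; it passes.
assert state.forward("SBD") == ["BD2", "BD3", "BD10"]

# t1: BD1 is configured locally; the IIF changes from the SBD to BD1.
state.iif = "BD1"

# t2: the in-flight packet, still marked "SBD", now fails the RPF check.
# The loss window is t2 - t0, which is very short.
assert state.forward("SBD") == []
```

The sketch only illustrates why the loss window exists; as the text says, shrinking it further is an implementation matter.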
*4.2.* Also, the /evpn-irb/ solution proposes to relax the RPF check
for (*,G) multicast entries. This poses a great risk of traffic loops,
especially under transient network conditions, in addition to poor
debuggability.
In the case where the receiving tenant systems ask for (*,G), and all
the sources are known to be in the EVPN Tenant Domain, a Rendezvous
Point (RP) is not required, as there is no real need for "source
discovery", "shared trees", or "switching from shared trees to source
trees". Thus some of the most complicated and error-prone features of
IP multicast can be avoided. Note that we also avoid the need to create
an (S,G) state for every source, which may result in a considerable
reduction in state. (Though obviously this optimization can only be
used if there is a priori knowledge that all sources for a given group
are within the EVPN domain.)
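The state reduction mentioned above is easy to quantify. A back-of-the-envelope comparison (the counts are hypothetical, chosen only for illustration):

```python
# Illustrative arithmetic: if all sources for each group are known to be
# inside the EVPN Tenant Domain, one (*,G) state per group can replace
# one (S,G) state per source.

num_groups = 50
sources_per_group = 20

sg_states = num_groups * sources_per_group   # one (S,G) state per source
star_g_states = num_groups                   # one (*,G) state per group

print(sg_states, star_g_states)  # 1000 vs. 50
```

The larger the number of sources per group, the bigger the saving from avoiding per-source state.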
I don't think I've ever heard anybody say "I wish I had more PIM RPs,
and more switching between shared trees and source trees; that makes
things really easy to debug!". I've heard the opposite though, quite a
lot ;-)
I've heard it suggested that the use of MVPN's SPT-ONLY mode provides an
equivalent simplification. However, that is not true at all. The
SPT-ONLY mode (RFC 6514 section 14) requires each PE to function as an
RP. This creates a whole new set of RPs that need to be managed,
creates more work for the PEs, and requires each PE to originate more
routes. It requires the PE to create a lot of (S,G) states that are not
otherwise needed. And if an MVPN customer (or EVPN tenant) already has
its own RP infrastructure, the PEs may need to participate in BSR or
Auto-RP, and/or may need to talk MSDP to the other RPs.
If a given multicast group has all its sources and receivers in the EVPN
domain, it's much simpler if one can avoid RPs entirely for that group.
And if that EVPN domain has other multicast groups with sources or
receivers in an MVPN domain, it may even be impossible to configure the
EVPN-PEs to be in SPT-ONLY mode. The MVPN nodes are likely to be
configured with RPT-SPT mode, and the two modes do not interoperate.
With regard to the "great risk of traffic loops", I'd like to hear more
specifics. To get a loop, there'd have to be a situation in which a PE
gets a (*,G) packet, sends it out one of its local ACs, and then the
same packet is received back (at the same or at another PE) over an AC.
Given that the all-active multi-homing procedures are in place, I just
don't see how this will happen. (Remember that we're not talking about
the multicast states used to create trees in the underlay; when creating
trees in the underlay, looping is more of a worry, but that's not
relevant to the current discussion.)
*3. Control plane scale in fabric / core*
... each PE needs one additional tunnel per BD apart from the existing
BUM tunnel. Essentially, one tunnel for B+U and another for M. This is
proposed to avoid all B+U traffic in BD1 indiscriminately reaching all
PEs in the domain, irrespective of whether they have BD1 configured
locally or not. This increases the state in the fabric by *"num of
PEs" x "num of BDs"*.
I think the issue alluded to here is actually the data plane state,
i.e., the amount of state that has to be maintained in order to do the
forwarding.
We should also make clear that when we say things like "one per PE",
what we really mean is "one per PE per Tenant Domain" (just as MVPN
tunnels are "one per PE per VPN"). Everything below will be in the
context of a given Tenant Domain.
Let me point out first that the number of BUM tunnels that exist per RFC
7432 (assuming P2MP LSPs are used) is *O(number of PEs x number of
BDs)*. If one adds a second per-BD tunnel so that one can carry the IP
multicast frames separately, the number of tunnels is still *O(number of
PEs x number of BDs)*.
However, for the IP multicast tunnels, I think it would be better to use
aggregated tunnels, so that a single P2MP LSP from, say, PE1, carries
the IP multicast frames from all of PE1's locally attached ACs (in the
given Tenant Domain), no matter which BD the AC is attached to. That
is, one really needs only one additional P2MP LSP per PE per Tenant
Domain, not one per BD.
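The difference between per-BD tunnels and aggregated tunnels can be sketched numerically (the PE and BD counts below are hypothetical, not from the draft):

```python
# Illustrative arithmetic: per-BD IP multicast tunnels versus one
# aggregated P2MP LSP per PE per Tenant Domain.

num_pes = 100
bds_per_pe = 1000

# RFC 7432 style: one inclusive tunnel per PE per BD.
per_bd_tunnels = num_pes * bds_per_pe   # O(PEs x BDs)

# Aggregated: one additional P2MP LSP per PE per Tenant Domain, carrying
# the IP multicast frames from all of that PE's ACs, whatever the BD.
aggregated_tunnels = num_pes            # O(PEs)

print(per_bd_tunnels, aggregated_tunnels)  # 100000 vs. 100
```

With aggregation, the number of BDs per PE drops out of the tunnel count entirely.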
In order to properly perform the ethernet emulation, an egress PE, say
PE2, has to be able to determine whether an IP multicast frame from PE1
came from one of the BDs to which PE1 is locally attached. This means
that the data plane encapsulation has to contain information that allows
PE2 to make this determination. In the draft, this is done by using an
MPLS label or a VNID (depending upon encapsulation) that identifies the
frame's source BD.
When using MPLS encapsulation, the assignment of labels to BDs is best
done by using the technique described in
draft-zzhang-bess-mvpn-evpn-aggregation-label ("MVPN/EVPN Tunnel
Aggregation with Common Labels"). Per that draft, each BD would be
assigned a domain-wide unique label (from a "Domain-Wide Common Block",
or DCB). Then the number of BD-identifying labels that each PE needs to
know is proportional to the number of BDs in the Tenant Domains to which
it is attached. So in the data plane the number of labels one has to
know for receiving is just *"number of BDs in locally attached Tenant
Domains"*; the number of PEs does not factor in. (Note that when using
VXLAN encapsulation, the VNIDs are already domain-wide unique.)
When one is just doing RFC 7432, the number of data plane labels one
needs is *"number of locally attached BDs"*, which will certainly be a
smaller number, but the difference between these two numbers does not
seem very scary.
One might well ask, though, whether the number of labels a receiving PE
has to know could be reduced from *"number of BDs in locally attached
Tenant Domains"* to *"number of locally attached BDs"*. This is
certainly worth exploring. If one receives a frame on a given P2MP LSP,
one knows from the top label what Tenant Domain it belongs to. The
second label tells one what BD the frame belongs to. If that second
label is unrecognized, one could decide to treat the frame as belonging
to the SBD of the Tenant Domain identified by the top label. This sort
of technique would really minimize the data plane state, while still
allowing the source BD to be identified. This is worth some further
investigation.
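The two-label egress lookup just described can be sketched as a simple table lookup with an SBD fallback (label values and BD names are invented for illustration; this is not a specified procedure):

```python
# Hypothetical egress-PE lookup: the top label has already identified the
# Tenant Domain; the second label identifies the frame's source BD. The
# egress PE installs entries only for its locally attached BDs.

local_bd_labels = {1001: "BD1", 1002: "BD2"}  # DCB-assigned, domain-wide unique

def classify_source_bd(second_label, sbd="SBD"):
    # Known label: the frame came from a locally attached BD.
    # Unknown label: treat the frame as belonging to the Tenant Domain's SBD.
    return local_bd_labels.get(second_label, sbd)

assert classify_source_bd(1001) == "BD1"
assert classify_source_bd(1003) == "SBD"   # not locally attached -> SBD
```

This keeps the egress label table proportional to the number of locally attached BDs while still letting the PE distinguish "local BD" from "somewhere else in the Tenant Domain".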
*2. Scale of BGP routes*
The /evpn-irb/ solution mandates that a PE process and store all IMET
NLRIs from all peer PEs in the tenant domain (as opposed to processing
and storing only NLRIs for the BDs it has locally present).
Now we're talking about control plane scale.
Just for clarity, please note that when Ingress Replication is used, the
BD-specific IMETs do not carry the SBD-RT (see section 3.2.2) and hence
are not distributed to all PEs in the Tenant Domain. (There are some
sections of the draft that are inconsistent about this; that will be
fixed.) So this issue does not arise in that case.
This is proposed because multicast traffic could be originating from
any PE in any BD. To put this in perspective, let's take the example of
a tenant domain with 101 PEs, each PE having 1000 BDs. Each PE has at
most 10 BDs in common with any other PE in the network. In this case
PE1 will have to process and store 100 (remote PEs) x 1000 (BDs per
PE) x 1 (IMET per BD) = 0.1 million IMET routes. Essentially, it is of
order *"Num of BDs" x "Num of PEs"*.
I'd like to see the example deployment where a single Tenant Domain
attaches to 100 PEs, each of those PEs attaches to 1000 BDs of that
Tenant Domain, and each individual BD attaches to only 10 of those PEs!
However, the basic point is well taken. In the context of a given
Tenant Domain, the number of IMET routes a PE receives will be
*O(number of PEs x average number of BDs per PE)*. Whether this really
presents a significant control plane scaling issue is debatable, but
let's see if we can reduce this.
If we use aggregate tunnels, the tunnel will be specified in the
SBD-IMET route. When a PE sends an SBD-IMET route, it can include in
that route the label used to identify each of the BDs to which that PE
is locally attached. If we do that, we don't need to send the per-BD
IMET routes to all PEs in the Tenant Domain. (If there are too many
locally attached BDs of a given Tenant Domain to fit into a single
SBD-IMET route, we could borrow the technique of RFC 7432 Section 8.2 to
send a small number of additional SBD-IMET routes.) With this minor
change, we could reduce the number of IMET routes so that it is
proportional only to the number of PEs.
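Using the numbers from the example above, the reduction looks like this (the counts are illustrative, and ignore the occasional extra SBD-IMET route needed when the label list overflows a single route):

```python
# Illustrative arithmetic: per-BD IMET routes flooded domain-wide versus
# carrying the per-BD labels inside each PE's SBD-IMET route.

remote_pes = 100
bds_per_pe = 1000

# Without aggregation: one IMET route per BD per remote PE.
per_bd_imets = remote_pes * bds_per_pe   # 0.1 million routes

# With aggregation: roughly one SBD-IMET route per remote PE, listing the
# labels of that PE's locally attached BDs.
sbd_imets = remote_pes                   # proportional to number of PEs

print(per_bd_imets, sbd_imets)  # 100000 vs. 100
```

That is a three-orders-of-magnitude reduction in routes for this example, at the cost of packing BD labels into the SBD-IMET route.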
Further, if the labels used to identify the BDs are domain-wide unique,
it may be possible to omit them from the SBD-IMET routes altogether.
Or perhaps only the DR for each BD will advertise the label for that
BD. Or perhaps they can be learned directly from a controller. These
are all options worth exploring as a way of reducing the overhead.
So I think that by making some small adjustments, we can greatly reduce
the control plane overhead without incurring all the problems of the
"seamless-mcast" scheme.
_______________________________________________
BESS mailing list
BESS@ietf.org
https://www.ietf.org/mailman/listinfo/bess