Re: Is there any data on packet duplication?
On Mon, Jun 22, 2020 at 10:21 PM Saku Ytti wrote:
> On Tue, 23 Jun 2020 at 08:12, William Herrin wrote:
> > That's what spanning tree and its compatriots are for. Otherwise,
> > ordinary broadcast traffic (like those arp packets) would travel in a
> > loop, flooding the network and it would just about instantly collapse
> > when you first turned it on.
>
> Metro: S1-S2-S3-S1
> PE1: S1
> PE2: S2
> Customer: S3
> STP blocking: ANY
>
> S3 sends frame, it is unknown unicast flooded, S1+S2 both get it
> (regardless of which metro port blocks), which will send it via PE to
> Internet.

There's a link in the chain you haven't explained. The packet which
entered at S3 has a unicast destination MAC address. That's what was in
the arp table. If they're following the standards, only one of PE1 and
PE2 will accept packets with that destination mac address. The other,
recognizing that the packet is not addressed to it, drops it.

Recall that ethernet worked without duplicating packets back in the
days of hubs when all stations received all packets. This is how.

That having been said, I've seen vendors creatively breach the boundary
between L2 and L3 with some really peculiar results. AWS VPCs for
example. But then this ring configuration doesn't exist in an AWS VPC,
and I've not particularly observed a lot of packet duplication out of
Amazon.

Regards,
Bill Herrin

--
William Herrin
b...@herrin.us
https://bill.herrin.us/
Re: Is there any data on packet duplication?
On Tue, 23 Jun 2020 at 09:32, Sabri Berisha wrote:
> Aaah yes, fair point! Thanks $deity for default timers that make no
> sense.

Add a low-traffic connection and the default 1024s maxpoll of NTP, and
this duplication is guaranteed to happen for 97.9% of packets.

--
++ytti
Re: Is there any data on packet duplication?
- On Jun 22, 2020, at 11:21 PM, Saku Ytti s...@ytti.fi wrote:

Hi Saku,

> On Tue, 23 Jun 2020 at 09:15, Sabri Berisha wrote:
>
>> Yeah, except that unless you use static ARP entries, I can't come up
>> with a plausible scenario in which this would happen for NTP. Assuming
>> we're talking about a non-local NTP server, S3 will not send an NTP
>> packet without first sending an ARP. Yes, your ARP will be flooded,
>> but your NTP packet won't be transmitted until there is an ARP reply.
>> By that time MACs have been learned, and the NTP packet will not be
>> considered BUM traffic, right?
>
> The plausible scenario is the one I explained. The crucial detail is
> MAC timeout (catalyst 300s) being shorter than ARP timeout (cisco 4h).
> So the device generating the packet knows the MAC address, the L2 does
> not.

Aaah yes, fair point! Thanks $deity for default timers that make no
sense.

Thanks,

Sabri
Re: Is there any data on packet duplication?
On Tue, 23 Jun 2020 at 09:15, Sabri Berisha wrote:
> Yeah, except that unless you use static ARP entries, I can't come up
> with a plausible scenario in which this would happen for NTP. Assuming
> we're talking about a non-local NTP server, S3 will not send an NTP
> packet without first sending an ARP. Yes, your ARP will be flooded,
> but your NTP packet won't be transmitted until there is an ARP reply.
> By that time MACs have been learned, and the NTP packet will not be
> considered BUM traffic, right?

The plausible scenario is the one I explained. The crucial detail is
MAC timeout (catalyst 300s) being shorter than ARP timeout (cisco 4h).
So the device generating the packet knows the MAC address, the L2 does
not.

Hope this helps!

--
++ytti
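[The timer mismatch described above can be sanity-checked with a short
sketch. The 300s MAC aging and 4h ARP timeout are the defaults quoted
in the thread; the helper function and its name are illustrative, not
any vendor's actual logic.]

```python
# Sketch: with Catalyst MAC aging (300s) shorter than Cisco ARP timeout
# (4h), any idle gap inside the window below leaves the host with a valid
# ARP entry while the switch has forgotten the MAC, so the next upstream
# frame is flooded as unknown unicast.
MAC_AGING_S = 300          # switch forgets the MAC after this idle time
ARP_TIMEOUT_S = 4 * 3600   # host keeps the destination MAC this long
NTP_MAXPOLL_S = 1024       # default NTP maximum poll interval

def frame_is_flooded(idle_gap_s: int) -> bool:
    """True if the gap falls in the 'ARP alive, MAC forgotten' window."""
    return MAC_AGING_S < idle_gap_s < ARP_TIMEOUT_S

print(frame_is_flooded(NTP_MAXPOLL_S))  # NTP at maxpoll lands in the window
```

NTP's 1024s maxpoll sits squarely between the two defaults, which is why
a quiet NTP association keeps re-triggering the flood.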
Re: Is there any data on packet duplication?
- On Jun 22, 2020, at 10:21 PM, Saku Ytti s...@ytti.fi wrote:

Hi,

> Metro: S1-S2-S3-S1
> PE1: S1
> PE2: S2
> Customer: S3
> STP blocking: ANY
>
> S3 sends frame, it is unknown unicast flooded, S1+S2 both get it
> (regardless of which metro port blocks), which will send it via PE to
> Internet.
>
> STP doesn't help, at all. Hope this helps.

Yeah, except that unless you use static ARP entries, I can't come up
with a plausible scenario in which this would happen for NTP. Assuming
we're talking about a non-local NTP server, S3 will not send an NTP
packet without first sending an ARP. Yes, your ARP will be flooded, but
your NTP packet won't be transmitted until there is an ARP reply. By
that time MACs have been learned, and the NTP packet will not be
considered BUM traffic, right?

That said, I have seen packet duplication in L2-only networks that I've
worked on myself, but that was because I disregarded a lot of rules
from the imaginary networking handbook.

Thanks,

Sabri
Re: Is there any data on packet duplication?
On 23/Jun/20 07:52, Saku Ytti wrote:
> S1-S2-S3-S1 is operator L2 metro-ring, which connects customers and
> 2xPE routers. It VLAN backhauls customers to PE.

Okay.

In 2014, we hit a similar issue, although not in a ring.

Our previous architecture was to interconnect edge routers via
downstream, interconnected aggregation switches to which customers
connected, in order to support VRRP. Since customers do strange things,
both edge routers received the same traffic, which caused pain.

Since then, we don't support VRRP for customers any longer, nor do we
interconnect aggregation switches that map to different edge routers.

Your example scenario describes what we experienced back then.

Mark.
Re: Is there any data on packet duplication?
On Tue, 23 Jun 2020 at 08:36, Mark Tinka wrote:
> To be clear, is the customer's device S3, or is S3 the ISP's device
> that terminates the customer's service?

S1-S2-S3-S1 is operator L2 metro-ring, which connects customers and
2xPE routers. It VLAN backhauls customers to PE.

--
++ytti
Re: Is there any data on packet duplication?
On 23/Jun/20 07:32, Saku Ytti wrote:
> Ring of 3 switches, minimum possible topology to explain the issue
> for people not familiar with L2.

To be clear, is the customer's device S3, or is S3 the ISP's device
that terminates the customer's service?

Mark.
Re: Is there any data on packet duplication?
On Tue, 23 Jun 2020 at 08:29, Mark Tinka wrote:
> In the above, is S3 part of the Metro-E ring, or simply downstream of
> S1 and S2?

Ring of 3 switches, minimum possible topology to explain the issue for
people not familiar with L2.

--
++ytti
Re: Is there any data on packet duplication?
On 23/Jun/20 07:21, Saku Ytti wrote:
> Metro: S1-S2-S3-S1
> PE1: S1
> PE2: S2
> Customer: S3
> STP blocking: ANY
>
> S3 sends frame, it is unknown unicast flooded, S1+S2 both get it
> (regardless of which metro port blocks), which will send it via PE to
> Internet.
>
> STP doesn't help, at all. Hope this helps.

In the above, is S3 part of the Metro-E ring, or simply downstream of
S1 and S2?

Mark.
Re: Is there any data on packet duplication?
On Tue, 23 Jun 2020 at 08:12, William Herrin wrote:

Hey Bill,

> That's what spanning tree and its compatriots are for. Otherwise,
> ordinary broadcast traffic (like those arp packets) would travel in a
> loop, flooding the network and it would just about instantly collapse
> when you first turned it on.

Metro: S1-S2-S3-S1
PE1: S1
PE2: S2
Customer: S3
STP blocking: ANY

S3 sends frame, it is unknown unicast flooded, S1+S2 both get it
(regardless of which metro port blocks), which will send it via PE to
Internet.

STP doesn't help, at all. Hope this helps.

--
++ytti
Re: Is there any data on packet duplication?
On 23/Jun/20 06:41, Saku Ytti wrote:
> I can't tell you how common it is, because that type of visibility is
> not easy to acquire, but I can explain at least one scenario when it
> occasionally happens.
>
> 1) Imagine a ring of L2 metro ethernet
> 2) Ring is connected to two PE routers, for redundancy
> 3) Customers are connected to ring ports and backhauled over VLAN to PE
>
> If there is very little traffic from Network=>Customer, the L2 metro
> forgets the MAC of customer subinterfaces (or VRRP) on the PE routers.
> Then when the client sends a packet to the Internet, the L2 floods it
> to all eligible ports, and it'll arrive to both PE routers, which will
> continue to forward it to the Internet.
>
> This requires an unfortunate (but typical) combination of ARP timeout
> and MAC timeout, so that sender still has ARP cache, while switch
> doesn't have MAC cache.
>
> In the opposite direction this same topology can cause loops, when PE
> routers still have a customer MAC in the ARP table, but L2 switch
> doesn't have the MAC.
>
> I wouldn't personally add code in applications to handle this case
> more gracefully.

My understanding of Layer 2-based Metro-E networks is that
multi-directional traffic would be prevented by way of Spanning Tree.

Mark.
Re: Is there any data on packet duplication?
On Mon, Jun 22, 2020 at 9:43 PM Saku Ytti wrote:
> I can't tell you how common it is, because that type of visibility is
> not easy to acquire, but I can explain at least one scenario when it
> occasionally happens.
>
> 1) Imagine a ring of L2 metro ethernet
> 2) Ring is connected to two PE routers, for redundancy
> 3) Customers are connected to ring ports and backhauled over VLAN to PE
>
> If there is very little traffic from Network=>Customer, the L2 metro
> forgets the MAC of customer subinterfaces (or VRRP) on the PE routers.
> Then when the client sends a packet to the Internet, the L2 floods it
> to all eligible ports, and it'll arrive to both PE routers, which will
> continue to forward it to the Internet.

Hi Saku,

That's what spanning tree and its compatriots are for. Otherwise,
ordinary broadcast traffic (like those arp packets) would travel in a
loop, flooding the network and it would just about instantly collapse
when you first turned it on.

A slightly more likely scenario is a wifi link. 802.11 employs layer-2
retries across the wireless segment. When the packet is successfully
transmitted but the ack is garbled, the packet may be sent a second
time. Even then I wouldn't expect duplicated packets to be more than a
very small fraction of a percent.

Hal, if you're seeing a non-trivial amount of identical packets, my
best guess is that the client is sending identical packets for some
reason. NTP you say? How does iburst work during initial sync up?

Regards,
Bill Herrin

--
William Herrin
b...@herrin.us
https://bill.herrin.us/
Re: Is there any data on packet duplication?
Hey Hal,

> How often do packets magically get duplicated within the network so
> that the target receives 2 copies? That seems like something somebody
> at NANOG might have studied and given a talk on.

I can't tell you how common it is, because that type of visibility is
not easy to acquire, but I can explain at least one scenario when it
occasionally happens.

1) Imagine a ring of L2 metro ethernet
2) Ring is connected to two PE routers, for redundancy
3) Customers are connected to ring ports and backhauled over VLAN to PE

If there is very little traffic from Network=>Customer, the L2 metro
forgets the MAC of customer subinterfaces (or VRRP) on the PE routers.
Then when the client sends a packet to the Internet, the L2 floods it
to all eligible ports, and it'll arrive to both PE routers, which will
continue to forward it to the Internet.

This requires an unfortunate (but typical) combination of ARP timeout
and MAC timeout, so that sender still has ARP cache, while switch
doesn't have MAC cache.

In the opposite direction this same topology can cause loops, when PE
routers still have a customer MAC in the ARP table, but L2 switch
doesn't have the MAC.

I wouldn't personally add code in applications to handle this case
more gracefully.

--
++ytti
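[The scenario above reduces to a MAC table with aging. A toy model, in
Python; the class, port names and timer value are illustrative only:]

```python
# Toy model: a shared L2 segment with two PE-facing ports. Once a MAC
# entry has aged out, an upstream frame is unknown-unicast flooded and
# reaches *both* PEs (whether both then forward it toward the Internet
# is the point debated in this thread).
class L2Switch:
    def __init__(self, aging_s=300):
        self.aging_s = aging_s
        self.mac_table = {}  # mac -> (port, last_seen)

    def learn(self, mac, port, now):
        self.mac_table[mac] = (port, now)

    def forward(self, dst_mac, now, ports):
        entry = self.mac_table.get(dst_mac)
        if entry and now - entry[1] < self.aging_s:
            return [entry[0]]   # known unicast: single egress port
        return list(ports)      # unknown unicast: flood everywhere

sw = L2Switch()
sw.learn("pe1-mac", "port-to-pe1", now=0)  # PE1 last spoke at t=0
# One NTP maxpoll (1024s) later, the entry has aged out:
egress = sw.forward("pe1-mac", now=1024,
                    ports=["port-to-pe1", "port-to-pe2"])
print(egress)  # ['port-to-pe1', 'port-to-pe2']
```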
Is there any data on packet duplication?
How often do packets magically get duplicated within the network so
that the target receives 2 copies? That seems like something somebody
at NANOG might have studied and given a talk on. Any suggestions for
other places to look?

Context is NTP. If a client gets an answer, should it keep the socket
around for a short time so that any late responses or duplicates from
the network don't turn into ICMP port unreachables back at the server?
Nothing critical, just general clutter reduction.

I have packet captures from an NTP server. I'm trying to sort things
out. There are a surprising (to me) number of duplicates that arrive
back-to-back, sometimes with the same microsecond timestamp. They could
come from buggy clients, but that seems like an unlikely sort of bug.

--
These are my opinions. I hate spam.
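[The mitigation being asked about can be sketched in a few lines. This
is a hypothetical client, not ntpd's actual behaviour; the function
name, port parameter and timer values are all assumptions:]

```python
# Sketch: after the first reply, keep the UDP socket open briefly and
# silently drain any late or duplicate responses, so they don't bounce
# back to the server as ICMP port unreachable.
import socket
import time

def query_with_linger(server, port, request, linger_s=2.0, timeout_s=1.0):
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.settimeout(timeout_s)
    sock.sendto(request, (server, port))
    reply, _ = sock.recvfrom(512)           # first answer wins
    deadline = time.monotonic() + linger_s
    sock.setblocking(False)
    while time.monotonic() < deadline:      # linger: absorb duplicates
        try:
            sock.recvfrom(512)              # discard late copies
        except BlockingIOError:
            time.sleep(0.05)
    sock.close()
    return reply

# usage (hypothetical): query_with_linger("pool.example.net", 123, pkt)
```

The cost is one open socket per query for a couple of seconds; the
benefit is that network-level duplicates die quietly at the client.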
Re: 60 ms cross-continent
Microwave is used for long-haul wireless transmission by the
ultra-low-latency crowd. Free-space laser has more bandwidth, but is
sensitive to fog and, at least until the last few years, had much less
range.

I sell ULL routes to financial players. A 10 meg microwave circuit,
CME/Secaucus Equinix, ranges from $185K per month to $20K a month.

From: NANOG on behalf of Joe Hamelin
Sent: Saturday, June 20, 2020 10:19 PM
To: Alejandro Acosta
Cc: NANOG list
Subject: Re: 60 ms cross-continent

On Sat, Jun 20, 2020 at 12:56 PM Alejandro Acosta
<alejandroacostaal...@gmail.com> wrote:

> Hello,
>
> Taking advantage of this thread, may I ask something? I have heard of
> "wireless fiber optic", something like an antenna with a laser
> pointing from one building to the other. Having said this, can I
> assume this link will have lower RTT than a laser thru a fiber optic
> made of glass?

See: Terrabeam from about the year 2000.

--
Joe Hamelin, W7COM, Tulalip, WA, +1 (360) 474-7474
Re: 60 ms cross-continent
Have you accounted for glass as opposed to vacuum? And the fact that
fiber optic networks can't be straight lines if their purpose is to
aggregate traffic along the way, and they also need to follow some
less-than-straight rights of way.

Regards,
Roderick.

From: NANOG on behalf of Stephen Satchell via NANOG
Sent: Monday, June 22, 2020 9:37 PM
To: nanog@nanog.org
Subject: Re: 60 ms cross-continent

On 6/22/20 12:59 AM, adamv0...@netconsultings.com wrote:
>> William Herrin
>>
>> Howdy,
>>
>> Why is latency between the east and west coasts so bad? Speed of
>> light accounts for about 15ms each direction for a 30ms round trip.
>> Where does the other 30ms come from and why haven't we gotten rid of
>> it?
>
> Wallstreet did :)
> https://www.wired.com/2012/08/ff_wallstreet_trading/

“Of course, you’d need a particle accelerator to make it work.”

So THAT'S why CERN wants to build an even bigger accelerator than the
LHC!
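[The glass-vs-vacuum and right-of-way points above are easy to put in
numbers. The refractive index is a standard figure for silica fiber;
the great-circle distance and the 1.3 route-detour factor are rough
assumptions for illustration:]

```python
# Back-of-the-envelope: light in silica fiber travels at roughly c/1.468,
# and real fiber routes run longer than the great-circle path.
C_KM_PER_MS = 299.792      # vacuum speed of light, km per millisecond
FIBER_INDEX = 1.468        # approximate group index of silica fiber
GREAT_CIRCLE_KM = 4100     # roughly New York <-> San Francisco
ROUTE_FACTOR = 1.3         # assumed detour for rights-of-way (a guess)

one_way_vacuum = GREAT_CIRCLE_KM / C_KM_PER_MS
one_way_fiber = GREAT_CIRCLE_KM * ROUTE_FACTOR * FIBER_INDEX / C_KM_PER_MS
print(f"vacuum, straight line: {one_way_vacuum:.1f} ms one way")
print(f"fiber, detoured route: {one_way_fiber:.1f} ms one way")
```

Glass plus detours alone roughly doubles the "15ms each direction"
figure, before any queueing or equipment delay is counted.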
Re: why am i in this handbasket? (was Devil's Advocate - Segment Routing, Why?)
>> The requirement from the E2E principle is that routers should be
>> dumb and hosts should be clever, or the entire system does not
>> scale reliably.
>
> And yet in the PTT world, it was the other way around. Clever
> switching and dumb telephone boxes.

how did that work out for the ptts? :)
Re: 60 ms cross-continent
On 6/22/20 12:59 AM, adamv0...@netconsultings.com wrote:
>> William Herrin
>>
>> Howdy,
>>
>> Why is latency between the east and west coasts so bad? Speed of
>> light accounts for about 15ms each direction for a 30ms round trip.
>> Where does the other 30ms come from and why haven't we gotten rid of
>> it?
>
> Wallstreet did :)
> https://www.wired.com/2012/08/ff_wallstreet_trading/

“Of course, you’d need a particle accelerator to make it work.”

So THAT'S why CERN wants to build an even bigger accelerator than the
LHC!
infrastructure focused interviews and questions
hi there,

A few weeks ago I started hosting a youtube/twitch/twitter live video
show (simultaneous stream), with key people who are involved in the
exec/operations/engineering of internet infrastructure companies,
either as consumers or service providers.

My idea is to create a platform where questions/concerns can be asked
directly of executives/key decision-makers and hopefully get answers.
Very similar to a Reddit AMA, but with a focus on
telecom/datacenter/infrastructure/DNS/etc.

For example, I will have execs from Zayo, Windstream, Aqua Comms and
PacketFabric this week as my guests, where you will be able to ask
questions directly. There will also be a 1-1 discussion with APNIC's
Chief Scientist Geoff Huston about internet growth during lockdowns,
and more.

You can watch the shows live, ask questions and be part of the
conversation from all these platforms.

example: https://www.youtube.com/watch?v=ff4x9IwBHEI starting in 15
minutes with Windstream.

fblive https://www.facebook.com/infrapediainc/videos/270701217367462/
twitter https://twitter.com/mhm7kcn

If you have recommendations on speakers you want to see, or want to
ask questions, you can contact me off-list.

I thought I would share this here; I am sorry if this is off-topic.
Invite people to participate, ask questions, and more.

mehmet
Re: why am i in this handbasket? (was Devil's Advocate - Segment Routing, Why?)
On 22/Jun/20 16:30, adamv0...@netconsultings.com wrote:
> Not quite,
> The routing information is flooded by default, but the receivers will
> cherry pick what they need and drop the rest.
> And even if the default flooding of all and dropping most is a concern
> - it can be addressed where only the relevant subset of all the
> routing info is sent to each receiver.
> The key takeaway however is that no single entity in SP network, be it
> PE, or RR, or ASBR, ever needs everything, you can always slice and
> dice indefinitely.
> So to sum it up you simply cannot run into any scaling ceiling with
> MP-BGP architecture.

The only nodes in our network that have ALL the NLRI are our RRs.
Depending on the function of the egress/ingress router, the RR sends it
only what it needs for its function. This is how we get away with using
communities in lieu of VRFs :-).

And as Adam points out, those RRs will swallow anything and everything,
and still remain asleep.

Mark.
Re: 60 ms cross-continent
On Sat, Jun 20, 2020 at 12:56 PM Alejandro Acosta
<alejandroacostaal...@gmail.com> wrote:

> Hello,
>
> Taking advantage of this thread, may I ask something? I have heard of
> "wireless fiber optic", something like an antenna with a laser
> pointing from one building to the other. Having said this, can I
> assume this link will have lower RTT than a laser thru a fiber optic
> made of glass?

See: Terrabeam from about the year 2000.

--
Joe Hamelin, W7COM, Tulalip, WA, +1 (360) 474-7474
RE: why am i in this handbasket? (was Devil's Advocate - Segment Routing, Why?)
> Masataka Ohta
> Sent: Monday, June 22, 2020 1:49 PM
>
> Robert Raszuk wrote:
>
> > Moreover if you have 1000 PEs and those three sites are attached
> > only to 6 of them - only those 6 PEs will need to learn those
> > routes (Hint: RTC - RFC4684)
>
> If you have 1000 PEs, you should be serving for somewhere around 1000
> customers.
>
> And, if I understand BGP-MP correctly, all the routing information of
> all the customers is flooded by BGP-MP in the ISP.

Not quite.
The routing information is flooded by default, but the receivers will
cherry pick what they need and drop the rest.
And even if the default flooding of all and dropping most is a concern
- it can be addressed so that only the relevant subset of all the
routing info is sent to each receiver.
The key takeaway however is that no single entity in an SP network, be
it PE, or RR, or ASBR, ever needs everything; you can always slice and
dice indefinitely.
So to sum it up, you simply cannot run into any scaling ceiling with
the MP-BGP architecture.

adam
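[The RTC (RFC 4684) behaviour referenced above can be modelled in a few
lines: the route reflector only sends a PE the VPN routes whose
route-targets that PE has advertised interest in. The function and data
below are a toy illustration, not a BGP implementation:]

```python
# Sketch of route-target constrained distribution: each PE advertises
# the RTs it imports, and the RR filters per-PE accordingly, so no
# single node ever needs the full VPN table.
def rr_advertise(vpn_routes, pe_interests):
    """vpn_routes: [(prefix, route_target)]; pe_interests: pe -> set(RT)."""
    return {
        pe: [(p, rt) for (p, rt) in vpn_routes if rt in wanted]
        for pe, wanted in pe_interests.items()
    }

routes = [("10.0.0.0/24", "RT:65000:1"),
          ("10.1.0.0/24", "RT:65000:2"),
          ("10.2.0.0/24", "RT:65000:1")]
interests = {"PE1": {"RT:65000:1"}, "PE2": {"RT:65000:2"}}
per_pe = rr_advertise(routes, interests)
print({pe: len(v) for pe, v in per_pe.items()})  # {'PE1': 2, 'PE2': 1}
```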
Re: why am i in this handbasket? (was Devil's Advocate - Segment Routing, Why?)
Masataka Ohta wrote on 22/06/2020 13:49:
> But, it should be noted that a single class B routing table entry

"a single class B routing table entry"? Did 1993 just call and ask for
its addressing back? :-)

> But, it should be noted that a single class B routing table entry
> often serves for an organization with 1s of users, which is at least
> our case here at titech.ac.jp.
>
> It should also be noted that, my concern is scalability in ISP side.

This entire conversation is puzzling: we already have "hierarchical
routing" to a large degree, to the extent that the public DFZ only sees
aggregate routes exported by ASNs. Inside ASNs, there will be internal
aggregation of individual routes (e.g. an ISP DHCP pool), and possibly
multiple levels of aggregation, depending on how this is configured.
Aggregation is usually continued right down to the end-host edge, e.g.
a router might have a /26 assigned on an interface, but the hosts will
be aggregated within this /26.

> If you have 1000 PEs, you should be serving for somewhere around 1000
> customers.
>
> And, if I understand BGP-MP correctly, all the routing information of
> all the customers is flooded by BGP-MP in the ISP.

Well, maybe. Or maybe not. This depends on lots of things.

> Then, it should be a lot better to let customer edges encapsulate L2
> or L3 over IP, with which, routing information within customers is
> exchanged by customer provided VPN without requiring extra overhead
> of maintaining customer local routing information by the ISP.

If you have 1000 or even 1s of PEs, injecting simplistic
non-aggregated routing information is unlikely to be an issue. If you
have 1,000,000 PEs, you'll probably need to rethink that position.

If your proposition is that the nature of the internet be changed so
that route disaggregation is prevented, or that addressing policy be
changed so that organisations are exclusively handed out IP address
space by their upstream providers, then this is a simple
misunderstanding of how impractical the proposition is: that horse
bolted from the barn 30 years ago; no organisation would accept
exclusive connectivity provided by a single upstream; and today's world
of dense interconnection would be impossible on the terms you suggest.

You may not like that there are lots of entries in the DFZ, and many
operators view this as a bit of a drag, but on today's technology, this
can scale to significantly more than what we foresee in the
medium-to-long-term future.

Nick
Re: why am i in this handbasket? (was Devil's Advocate - Segment Routing, Why?)
On 22/Jun/20 15:17, Masataka Ohta wrote:
> The point of Yakov on day one was that, flow driven approach of
> Ipsilon does not scale and is unacceptable.
>
> Though I agree with Yakov here, we must also eliminate all the flow
> driven approaches by MPLS or whatever.

I still don't see them in practice, even though they may have been
proposed.

Mark.
Re: why am i in this handbasket? (was Devil's Advocate - Segment Routing, Why?)
On 22/Jun/20 15:08, Masataka Ohta wrote:
> The requirement from the E2E principle is that routers should be dumb
> and hosts should be clever, or the entire system does not scale
> reliably.

And yet in the PTT world, it was the other way around. Clever switching
and dumb telephone boxes. How things have since evened out.

I can understand the concern about making the network smart. But even a
smart network is not as smart as a host. My laptop can do a lot of
things more cleverly than any of the routers in my network. It just
can't do them at scale, consistently, for a bunch of users.

So the responsibility gets to be shared, with the number of users being
served diminishing as you enter and exit the edge of the network. It's
probably not yet an ideal networking paradigm, but it's the one we have
now that is a reasonably fair compromise.

> In this case, such clever router can ever exist only near the
> destination unless very detailed routing information is flooded all
> over the network to all the possible sources.

I will admit that bloating router code over recent years to become
terribly smart (CGN, Acceleration, DoS mitigation, VPN's, SD-WAN, IDS,
Video Monitoring, etc.) can become a runaway problem. I've often joked
that with all the things being thrown into BGP, we may just see it
carrying DNS too, hehe.

Personally, the level of intelligence we have in routers now, beyond
being just Layer 1, 2, 3 - and maybe 4 - crunching machines, is just as
far as I'm willing to go. If, like me, you keep pushing back on vendors
trying to make your routers also clean your dishes, they'll take the
hint and stop bloating the code.

Are MPLS/VPN's overly clever? I think so. But considering the pay-off
and how much worse it could get, I'm willing to accept that.

> A router can't be clever on something, unless it is provided with
> very detailed information on all the possible destinations, which
> needs a lot of routing traffic making the entire system not scale.

Well, if you can propose a better way to locate hosts on a global
network not owned by anyone, in a connectionless manner, I'm sure we'd
all be interested.

Mark.
RE: why am i in this handbasket? (was Devil's Advocate - Segment Routing, Why?)
> From: Masataka Ohta
> Sent: Monday, June 22, 2020 2:17 PM
>
> adamv0...@netconsultings.com wrote:
>
> > But MPLS can be made flow driven (it can be made whatever the policy
> > dictates), for instance DSCP driven.
>
> The point of Yakov on day one was that, flow driven approach of
> Ipsilon does not scale and is unacceptable.
>
> Though I agree with Yakov here, we must also eliminate all the flow
> driven approaches by MPLS or whatever.

First I'd need a definition of what flow means in this discussion: are
we considering 5-tuple or 4-tuple, or just SRC-IP & DST-IP, and is DSCP
marking part of it?

Second, although I agree that ~1M unique identifiers is not ideal, can
you provide examples of MPLS applications where 1M is limiting? What
particular aspect? Is it 1M interfaces per MPLS switching fabric box?
Or 1M unique flows (or better, flow groups) originated by a given
VM/Container/CPE? Or 1M destination entities (IPs or apps on those IPs)
that any particular VM/Container/CPE needs to talk to? Or 1M customer
VPNs, or 1M PE-CPE links, if the PE acts as a bottleneck?

adam
Re: why am i in this handbasket? (was Devil's Advocate - Segment Routing, Why?)
On 22/Jun/20 14:49, Masataka Ohta wrote:
> But, it should be noted that a single class B...

CIDR - let's not teach the kids old news :-).

> If you have 1000 PEs, you should be serving for somewhere around 1000
> customers.

It's not linear. We probably have 1 edge router serving
several-thousand customers.

> And, if I understand BGP-MP correctly, all the routing information of
> all the customers is flooded by BGP-MP in the ISP.

Yes, best practice is in iBGP. Some operators may still be using an IGP
for this. It would work, but scales poorly.

> Then, it should be a lot better to let customer edges encapsulate L2
> or L3 over IP, with which, routing information within customers is
> exchanged by customer provided VPN without requiring extra overhead
> of maintaining customer local routing information by the ISP.

You mean like IP-in-IP or GRE? That already happens today, without any
intervention from the ISP.

> If a customer want customer-specific SLA, it can be described as SLA
> between customer edge routers, for which, intra-ISP MPLS may or may
> not be used.

l2vpn's and l3vpn's attract a higher SLA because the services are
mostly provisioned on-net. If an off-net component exists, it would be
via a trusted NNI partner. Regular IP or GRE tunnels don't come with
these kinds of SLA's because the ISP isn't involved, and the B-end
would very likely be off-net with no SLA guarantees between the A-end
customer's ISP and the remote ISP hosting the B-end.

> For the ISP, it can be as profitable as PE-based VRF solutions,
> because customers so relying on ISPs will let the ISP provide and
> maintain customer edges.

There are few ISP's who would be able to terminate an IP or GRE tunnel
on-net, end-to-end. And even then, they might be reluctant to offer any
SLA's because those tunnels are built on the CPE, typically outside of
their control.

> The only difference should be on profitability for router makers,
> which want to make routing system as complex as possible or even a
> lot more than that to make backbone routers a lot profitable product.

If ISP's didn't make money from MPLS/VPN's, router vendors would not be
as keen on adding the capability in their boxes.

> Label stack was there, because of, now recognized to be wrong,
> statement of Yakov on day one and I can see no reason still to keep
> it.

Label stacking is fundamental to the "MP" part of MPLS. Whether your
payload is IP, ATM, Ethernet, Frame Relay, PPP, HDLC, etc., the ability
to stack labels is what makes an MPLS network payload agnostic. There
is value in that.

Mark.
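[For readers following along, the label stack being argued about has a
simple wire format (RFC 3032): each 32-bit entry is label(20) | TC(3) |
S(1) | TTL(8), and the S bit marks the bottom of stack. A sketch, with
illustrative label values:]

```python
# Encode/decode an MPLS label stack. The bottom-of-stack (S) bit is
# what lets entries nest, e.g. a transport label over a service label.
import struct

def encode_stack(labels, ttl=64):
    out = b""
    for i, label in enumerate(labels):
        s = 1 if i == len(labels) - 1 else 0     # bottom-of-stack bit
        entry = (label << 12) | (s << 8) | ttl   # TC bits left at 0
        out += struct.pack("!I", entry)
    return out

def decode_stack(data):
    labels = []
    for (entry,) in struct.iter_unpack("!I", data):
        labels.append(entry >> 12)
        if (entry >> 8) & 1:                     # S bit set: stop
            break
    return labels

wire = encode_stack([16001, 299792])  # e.g. transport + service label
print(decode_stack(wire))  # [16001, 299792]
```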
Re: why am i in this handbasket? (was Devil's Advocate - Segment Routing, Why?)
adamv0...@netconsultings.com wrote:
> But MPLS can be made flow driven (it can be made whatever the policy
> dictates), for instance DSCP driven…

The point of Yakov on day one was that, flow driven approach of Ipsilon
does not scale and is unacceptable.

Though I agree with Yakov here, we must also eliminate all the flow
driven approaches by MPLS or whatever.

Masataka Ohta
RE: Devil's Advocate - Segment Routing, Why?
Hi Baldur,

From memory, mx204 FIB is 10M (v4/v6) and RIB 30M for each of v4 and
v6. And remember the FIB is hierarchical - so it's the next-hops per
prefix you are referring to with BGP FRR. And also, going from memory
of past scaling testing, if pfx1+NH1 == x, then pfx1+NH1+NH2 !== 2x,
where x is used FIB space.

adam

From: NANOG On Behalf Of Baldur Norddahl
Sent: Saturday, June 20, 2020 9:00 PM

I can't speak for the year 2000 as I was not doing networking at this
level at that time. But when I check the specs for the base mx204 it
says something like 32 VRFs, 2 million routes in FIB and 6 million
routes in RIB. Clearly those numbers are the total of routes across all
VRFs, otherwise you arrive at silly numbers (64 million FIB if you
multiply, 128k FIB if you divide by 32). My conclusion is that, scale
wise, you are OK as long as you do not try to have more than one VRF
with a complete copy of the DFZ. More worrying is that 2 million routes
will soon not be enough to install all routes with a backup route,
invalidating BGP FRR.
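[Adam's "pfx1+NH1+NH2 !== 2x" point is the hierarchical-FIB property: a
prefix stores a pointer to a shared next-hop group, so adding a backup
next-hop to N prefixes adds one object, not N entries. A toy model (the
class and names are illustrative, not any vendor's data structure):]

```python
# Sketch: 1000 prefixes all point at ONE shared next-hop group object,
# so a backup next-hop (BGP FRR/PIC style) costs one object update, and
# FIB memory grows far slower than prefixes * next-hops.
class NextHopGroup:
    def __init__(self, *hops):
        self.hops = hops  # primary first, backups after

fib = {}
core_group = NextHopGroup("NH1", "NH2")  # single shared object
for i in range(1000):
    fib[f"10.{i // 256}.{i % 256}.0/24"] = core_group  # pointer, no copy

distinct_groups = len({id(g) for g in fib.values()})
print(len(fib), distinct_groups)  # 1000 1
```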
Re: why am i in this handbasket? (was Devil's Advocate - Segment Routing, Why?)
Mark Tinka wrote:
>> So, with hierarchical routing, routing protocols can carry only
>> rough information around destinations, from which, source side can
>> not construct detailed (often purposelessly nested) labels required
>> for MPLS.
>
> But hosts often point default to a clever router.

The requirement from the E2E principle is that routers should be dumb
and hosts should be clever, or the entire system does not scale
reliably.

In this case, such clever router can ever exist only near the
destination unless very detailed routing information is flooded all
over the network to all the possible sources.

A router can't be clever on something, unless it is provided with very
detailed information on all the possible destinations, which needs a
lot of routing traffic making the entire system not scale.

Masataka Ohta
Re: why am i in this handbasket? (was Devil's Advocate - Segment Routing, Why?)
Robert Raszuk wrote: Neither link wise nor host wise information is required to accomplish say L3VPN services. Imagine you have three sites which would like to interconnect each with 1000s of users. For a single customer of an ISP with 1000s of end users. OK. But, it should be noted that a single class B routing table entry often serves for an organization with 1s of users, which is at least our case here at titech.ac.jp. It should also be noted that, my concern is scalability in ISP side. Moreover if you have 1000 PEs and those three sites are attached only to 6 of them - only those 6 PEs will need to learn those routes (Hint: RTC - RFC4684) If you have 1000 PEs, you should be serving for somewhere around 1000 customers. And, if I understand BGP-MP correctly, all the routing information of all the customers is flooded by BGP-MP in the ISP. Then, it should be a lot better to let customer edges encapsulate L2 or L3 over IP, with which, routing information within customers is exchanged by customer provided VPN without requiring extra overhead of maintaining customer local routing information by the ISP. If a customer want customer-specific SLA, it can be described as SLA between customer edge routers, for which, intra-ISP MPLS may or may not be used. For the ISP, it can be as profitable as PE-based VRF solutions, because customers so relying on ISPs will let the ISP provide and maintain customer edges. The only difference should be on profitability for router makers, which want to make routing system as complex as possible or even a lot more than that to make backbone routers a lot profitable product. With nested labels, you don't need so much labels at certain nesting level, which was the point of Yakov, which does not mean you don't need so much information to create entire nested labels at or near the sources. Label stack is here from day one. 
Label stack was there because of Yakov's day-one statement, which is now recognized to be wrong, and I can see no reason to keep it. Masataka Ohta
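The RT-Constrain behavior Robert hints at (RTC, RFC 4684) can be sketched as a toy filter (PE names, the RT value, and counts are invented for illustration): the route reflector sends a customer's VPN routes only to PEs that import the matching route target, so with 1000 PEs and sites on 6 of them, the other 994 never carry those routes.

```python
# Each PE advertises which route targets it imports.
pe_imports = {f"PE{i}": set() for i in range(1, 1001)}
for pe in ("PE1", "PE2", "PE3", "PE4", "PE5", "PE6"):
    pe_imports[pe].add("RT:65000:42")     # the customer's route target

def advertise(route_rt):
    """PEs the reflector would send a route with this RT to."""
    return sorted(pe for pe, rts in pe_imports.items() if route_rt in rts)

interested = advertise("RT:65000:42")
print(len(interested))                    # only the attached PEs
print(len(pe_imports) - len(interested))  # PEs that never see the routes
```

This is the mechanism behind "only those 6 PEs will need to learn those routes": route distribution is constrained by RT membership rather than flooded to every PE.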
RE: 60 ms cross-continent
> William Herrin > > Howdy, > > Why is latency between the east and west coasts so bad? Speed of light > accounts for about 15ms each direction for a 30ms round trip. Where does > the other 30ms come from and why haven't we gotten rid of it? > Wallstreet did :) https://www.wired.com/2012/08/ff_wallstreet_trading/ adam
RE: Devil's Advocate - Segment Routing, Why?
> From: NANOG On Behalf Of Masataka Ohta > Sent: Friday, June 19, 2020 5:01 PM > > Robert Raszuk wrote: > > > So I think Ohta-san's point is about scalability of services, not flat > > underlay RIB and FIB sizes. Many years ago we had requests to support > > 5M L3VPN routes while underlay was just 500K IPv4. > > That is certainly a problem. However, a worse problem is knowing the label values > nested deeply in the MPLS label chain. > > Even worse, if the router near the destination expected to pop the label chain > goes down, how can the source know that the router went down and > choose an alternative router near the destination? > Via IGP or controller, but for sub-50ms convergence there are edge-node protection mechanisms, so the point is the source doesn't even need to know about it for the restoration to happen. adam
RE: why am i in this handbasket? (was Devil's Advocate - Segment Routing, Why?)
> Masataka Ohta > Sent: Sunday, June 21, 2020 1:37 PM > > > Whether you do it manually or use a label distribution protocol, FECs > > are pre-computed ahead of time. > > > > What am I missing? > > If all the link-wise (or, worse, host-wise) information on possible destinations > is distributed in advance to all the possible sources, it is not hierarchical but > flat (host) routing, which scales poorly. > > Right? > On the Internet, yes; in controlled environments, no, as in those environments the set of possible destinations is well scoped. Take an MPLS-enabled DC, for instance: every VM needs to talk to only a small subset of all the VMs hosted in the DC. Hence each VM gets flow transport labels programmed via centralized end-to-end flow controllers on a need-to-know basis (not everything to everyone). (E.g. dear vm1, this is how you get your EF/BE flows via load-balancer and FW to backend VMs in your local pod, this is how you get via the local pod FW to the internet GW, etc... done.) Now that you have these neat "pipes" all over the place connecting VMs, it's easy for the switching fabric controller to shuffle elephant and mice flows around in order to avoid any link saturation. And now imagine going a bit further and doing the same as above but with CPEs on a Service Provider network... yep, no PEs acting as chokepoints for MPLS label-switched-path-to-flow assignment and needing massive FIBs, or even bigger; just a dumb MPLS switch fabric, with all the "hard work" offloaded to centralized controllers (and to CPEs for label stack imposition), but only on a need-to-know basis (not everything to everyone). Now in both cases you're free to choose to what extent the MPLS switch fabric should be involved with the end-to-end flows by imposing hierarchies on the MPLS stack. In light of the above, does it suck to have just 20 bits of MPLS label space? Absolutely. Adam
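The need-to-know controller model Adam describes can be sketched as a toy path database (VM names, label values, and flow classes below are all invented): the controller knows every path, but programs each VM with only the label stacks for flows that VM is allowed to originate, instead of flooding everything to everyone.

```python
# Controller's full path database: (source, destination, class) -> stack.
paths = {
    ("vm1", "backend",  "EF"): [100, 200, 300],  # via LB and FW, in-pod
    ("vm1", "internet", "BE"): [100, 400],       # via pod FW to the GW
    ("vm2", "backend",  "BE"): [150, 200, 300],  # another VM's flow
}

def program_vm(vm):
    """Need-to-know: hand a VM only the entries it can source."""
    return {key: stack for key, stack in paths.items() if key[0] == vm}

vm1_table = program_vm("vm1")
print(len(vm1_table))                       # vm1's entries only
print(vm1_table[("vm1", "backend", "EF")])  # stack vm1 imposes itself
```

With the stack imposed at the edge, the fabric in between only switches labels, which is the "dumb MPLS switch fabric" in the text.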
RE: why am i in this handbasket? (was Devil's Advocate - Segment Routing, Why?)
But MPLS can be made flow driven (it can be made whatever the policy dictates), for instance DSCP driven… adam From: NANOG On Behalf Of Robert Raszuk Sent: Saturday, June 20, 2020 4:13 PM To: Masataka Ohta Cc: North American Network Operators' Group Subject: Re: why am i in this handbasket? (was Devil's Advocate - Segment Routing, Why?) The problem of MPLS, however, is that it must also be flow driven, because detailed route information at the destination is necessary to prepare nested labels at the source, which costs a lot and should be attempted only for detected flows. MPLS is not flow driven. I sent some mail about it but perhaps it bounced. MPLS LDP or L3VPNs were NEVER flow driven. From day one till today they were, and still are, purely destination based. Transport uses an LSP to the egress PE (dst IP). L3VPNs use either per-dst-prefix, per-CE, or per-VRF labels. No implementation does anything upon "flow detection" to prepare any nested labels. Even in FIBs, all information is preprogrammed in hierarchical fashion well before any flow packet arrives. Thx, R. > there is the argument that switching MPLS is faster than IP; when the > pressure points i see are more at routing (BGP/LDP/RSVP/whatever), > recovery, and convergence. The routing table of the IPv4 backbone today needs at most 16M entries, which can be looked up in simple SRAM as fast as an MPLS lookup, which is one of the reasons why we should obsolete IPv6. Though resource-reserved flows need their own routing table entries, they should be charged in proportion to the duration of the reservation, which can scale to afford the cost of having the entries. Masataka Ohta
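Ohta-san's 16M-entry figure follows from backbone routes being /24 or shorter: a flat table indexed by the top 24 bits of an IPv4 address has 2**24 = 16,777,216 slots, so any lookup is a single memory read, comparable to an MPLS label lookup. A toy sketch (prefixes and next-hop names invented; shorter prefixes must be installed before longer ones for correct longest-match):

```python
import ipaddress

nexthops = ["<no route>", "if0", "if1"]  # id 0 means no route
fib = bytearray(1 << 24)                 # one byte per /24 slot

def install(prefix, nh_id):
    """Expand a /24-or-shorter prefix into its /24 slots."""
    net = ipaddress.ip_network(prefix)
    assert net.prefixlen <= 24           # the claim assumes no longer ones
    base = int(net.network_address) >> 8
    for slot in range(base, base + (1 << (24 - net.prefixlen))):
        fib[slot] = nh_id

def lookup(dst):
    # One direct-indexed read, like SRAM addressed by the top 24 bits.
    return nexthops[fib[int(ipaddress.ip_address(dst)) >> 8]]

install("192.0.2.0/23", 2)               # a /23 fills two slots
install("198.51.100.0/24", 1)
print(lookup("198.51.100.9"))            # if0
print(lookup("192.0.3.1"))               # if1
print(lookup("203.0.113.1"))             # <no route>
```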