Re: Is there any data on packet duplication?

2020-06-22 Thread William Herrin
On Mon, Jun 22, 2020 at 10:21 PM Saku Ytti  wrote:
> On Tue, 23 Jun 2020 at 08:12, William Herrin  wrote:
> > That's what spanning tree and its compatriots are for. Otherwise,
> > ordinary broadcast traffic (like those arp packets) would travel in a
> > loop, flooding the network and it would just about instantly collapse
> > when you first turned it on.
>
> Metro: S1-S2-S3-S1
> PE1: S1
> PE2: S2
> Customer: S3
> STP blocking: ANY
>
> S3 sends frame, it is unknown unicast flooded, S1+S2 both get it
> (regardless of which metro port blocks), which will send it via PE to
> Internet.

There's a link in the chain you haven't explained. The packet which
entered at S3 has a unicast destination MAC address. That's what was
in the arp table. If they're following the standards, only one of PE1
and PE2 will accept packets with that destination mac address. The
other, recognizing that the packet is not addressed to it, drops it.

Recall that ethernet worked without duplicating packets back in the
days of hubs when all stations received all packets. This is how.


That having been said, I've seen vendors creatively breach the
boundary between L2 and L3 with some really peculiar results. AWS VPCs
for example. But then this ring configuration doesn't exist in an AWS
VPC and I've not particularly observed a lot of packet duplication out
of Amazon.

Regards,
Bill Herrin



-- 
William Herrin
b...@herrin.us
https://bill.herrin.us/


Re: Is there any data on packet duplication?

2020-06-22 Thread Saku Ytti
On Tue, 23 Jun 2020 at 09:32, Sabri Berisha  wrote:

> Aaah yes, fair point! Thanks $deity for default timers that make no sense.

Add a low-traffic connection and NTP's default 1024s maxpoll, and this
duplication is guaranteed to happen for 97.9% of packets.

-- 
  ++ytti


Re: Is there any data on packet duplication?

2020-06-22 Thread Sabri Berisha
- On Jun 22, 2020, at 11:21 PM, Saku Ytti s...@ytti.fi wrote:

Hi Saku,

> On Tue, 23 Jun 2020 at 09:15, Sabri Berisha  wrote:
> 
>> Yeah, except that unless you use static ARP entries, I can't come up
>> with a plausible scenario in which this would happen for NTP. Assuming
>> we're talking about a non-local NTP server, S3 will not send an NTP
>> packet without first sending an ARP. Yes, your ARP will be flooded,
>> but your NTP packet won't be transmitted until there is an ARP reply.
>> By that time MACs have been learned, and the NTP packet will not be
>> considered BUM traffic, right?
> 
> The plausible scenario is the one I explained. The crucial detail is
> MAC timeout (catalyst 300s) being shorter than ARP timeout (cisco 4h).
> So the device generating the packet knows the MAC address, the L2 does
> not.

Aaah yes, fair point! Thanks $deity for default timers that make no sense.

Thanks,

Sabri


Re: Is there any data on packet duplication?

2020-06-22 Thread Saku Ytti
On Tue, 23 Jun 2020 at 09:15, Sabri Berisha  wrote:

> Yeah, except that unless you use static ARP entries, I can't come up
> with a plausible scenario in which this would happen for NTP. Assuming
> we're talking about a non-local NTP server, S3 will not send an NTP
> packet without first sending an ARP. Yes, your ARP will be flooded,
> but your NTP packet won't be transmitted until there is an ARP reply.
> By that time MACs have been learned, and the NTP packet will not be
> considered BUM traffic, right?

The plausible scenario is the one I explained. The crucial detail is
MAC timeout (catalyst 300s) being shorter than ARP timeout (cisco 4h).
So the device generating the packet knows the MAC address, the L2 does
not.

Hope this helps!
-- 
  ++ytti


Re: Is there any data on packet duplication?

2020-06-22 Thread Sabri Berisha
- On Jun 22, 2020, at 10:21 PM, Saku Ytti s...@ytti.fi wrote:

Hi,

> Metro: S1-S2-S3-S1
> PE1: S1
> PE2: S2
> Customer: S3
> STP blocking: ANY
> 
> S3 sends frame, it is unknown unicast flooded, S1+S2 both get it
> (regardless of which metro port blocks), which will send it via PE to
> Internet.
> 
> STP doesn't help, at all. Hope this helps.

Yeah, except that unless you use static ARP entries, I can't come up
with a plausible scenario in which this would happen for NTP. Assuming
we're talking about a non-local NTP server, S3 will not send an NTP
packet without first sending an ARP. Yes, your ARP will be flooded, 
but your NTP packet won't be transmitted until there is an ARP reply.
By that time MACs have been learned, and the NTP packet will not be
considered BUM traffic, right? 

That said, I have seen packet duplication in L2-only networks that
I've worked on myself, but that was because I disregarded a lot of
rules from the imaginary networking handbook.

Thanks,

Sabri


Re: Is there any data on packet duplication?

2020-06-22 Thread Mark Tinka



On 23/Jun/20 07:52, Saku Ytti wrote:

>
> S1-S2-S3-S1 is operator L2 metro-ring, which connects customers and
> 2xPE routers. It VLAN backhauls customers to PE.

Okay.

In 2014, we hit a similar issue, although not in a ring.

Our previous architecture was to interconnect edge routers via
downstream, interconnected aggregation switches to which customers
connected in order to support VRRP.

Since customers do strange things, both edge routers received the same
traffic, which caused pain.

Since then, we don't support VRRP for customers any longer, nor do we
interconnect aggregation switches that map to different edge routers.

Your example scenario describes what we experienced back then.

Mark.


Re: Is there any data on packet duplication?

2020-06-22 Thread Saku Ytti
On Tue, 23 Jun 2020 at 08:36, Mark Tinka  wrote:

> To be clear, is the customer's device S3, or is S3 the ISP's device that
> terminates the customer's service?

S1-S2-S3-S1 is operator L2 metro-ring, which connects customers and
2xPE routers. It VLAN backhauls customers to PE.

-- 
  ++ytti


Re: Is there any data on packet duplication?

2020-06-22 Thread Mark Tinka



On 23/Jun/20 07:32, Saku Ytti wrote:

>
> Ring of 3 switches, minimum possible topology to explain the issue for
> people not familiar with L2.

To be clear, is the customer's device S3, or is S3 the ISP's device that
terminates the customer's service?

Mark.


Re: Is there any data on packet duplication?

2020-06-22 Thread Saku Ytti
On Tue, 23 Jun 2020 at 08:29, Mark Tinka  wrote:

> In the above, is S3 part of the Metro-E ring, or simply downstream of S1
> and S2?

Ring of 3 switches, minimum possible topology to explain the issue for
people not familiar with L2.

-- 
  ++ytti


Re: Is there any data on packet duplication?

2020-06-22 Thread Mark Tinka



On 23/Jun/20 07:21, Saku Ytti wrote:

> Metro: S1-S2-S3-S1
> PE1: S1
> PE2: S2
> Customer: S3
> STP blocking: ANY
>
> S3 sends frame, it is unknown unicast flooded, S1+S2 both get it
> (regardless of which metro port blocks), which will send it via PE to
> Internet.
>
> STP doesn't help, at all. Hope this helps.

In the above, is S3 part of the Metro-E ring, or simply downstream of S1
and S2?

Mark.


Re: Is there any data on packet duplication?

2020-06-22 Thread Saku Ytti
On Tue, 23 Jun 2020 at 08:12, William Herrin  wrote:

Hey Bill,

> That's what spanning tree and its compatriots are for. Otherwise,
> ordinary broadcast traffic (like those arp packets) would travel in a
> loop, flooding the network and it would just about instantly collapse
> when you first turned it on.

Metro: S1-S2-S3-S1
PE1: S1
PE2: S2
Customer: S3
STP blocking: ANY

S3 sends frame, it is unknown unicast flooded, S1+S2 both get it
(regardless of which metro port blocks), which will send it via PE to
Internet.

STP doesn't help, at all. Hope this helps.


-- 
  ++ytti


Re: Is there any data on packet duplication?

2020-06-22 Thread Mark Tinka



On 23/Jun/20 06:41, Saku Ytti wrote:

>
> I can't tell you how common it is, because that type of visibility is
> not easy to acquire, but I can explain at least one scenario when it
> occasionally happens.
>
> 1) Imagine a ring of L2 metro ethernet
> 2) Ring is connected to two PE routers, for redundancy
> 3) Customers are connected to ring ports and backhauled over VLAN to PE
>
> If there is very little traffic from Network=>Customer, the L2 metro
> forgets the MAC of customer subinterfaces (or VRRP) on the PE routers.
> Then when the client sends a packet to the Internet, the L2 floods it
> to all eligible ports, and it'll arrive to both PE routers, which will
> continue to forward it to the Internet.
> This requires an unfortunate (but typical) combination of ARP timeout
> and MAC timeout, so that sender still has ARP cache, while switch
> doesn't have MAC cache.
>
> In the opposite direction this same topology can cause loops, when PE
> routers still have a customer MAC in the ARP table, but L2 switch
> doesn't have the MAC.
>
> I wouldn't personally add code in applications to handle this case
> more gracefully.

My understanding of Layer 2-based Metro-E networks is that
multi-directional traffic would be prevented by way of Spanning Tree.

Mark.


Re: Is there any data on packet duplication?

2020-06-22 Thread William Herrin
On Mon, Jun 22, 2020 at 9:43 PM Saku Ytti  wrote:
> I can't tell you how common it is, because that type of visibility is
> not easy to acquire, but I can explain at least one scenario when it
> occasionally happens.
>
> 1) Imagine a ring of L2 metro ethernet
> 2) Ring is connected to two PE routers, for redundancy
> 3) Customers are connected to ring ports and backhauled over VLAN to PE
>
> If there is very little traffic from Network=>Customer, the L2 metro
> forgets the MAC of customer subinterfaces (or VRRP) on the PE routers.
> Then when the client sends a packet to the Internet, the L2 floods it
> to all eligible ports, and it'll arrive to both PE routers, which will
> continue to forward it to the Internet.

Hi Saku,

That's what spanning tree and its compatriots are for. Otherwise,
ordinary broadcast traffic (like those arp packets) would travel in a
loop, flooding the network and it would just about instantly collapse
when you first turned it on.

A slightly more likely scenario is a wifi link. 802.11 employs layer-2
retries across the wireless segment. When the packet is successfully
transmitted but the ack is garbled, the packet may be sent a second
time.

Even then I wouldn't expect duplicated packets to be more than a very
small fraction of a percent. Hal, if you're seeing a non-trivial
amount of identical packets, my best guess is that the client is
sending identical packets for some reason. NTP you say? How does
iburst work during initial sync up?

Regards,
Bill Herrin

-- 
William Herrin
b...@herrin.us
https://bill.herrin.us/


Re: Is there any data on packet duplication?

2020-06-22 Thread Saku Ytti
Hey Hal,

> How often do packets magically get duplicated within the network so that the
> target receives 2 copies?  That seems like something somebody at NANOG might
> have studied and given a talk on.

I can't tell you how common it is, because that type of visibility is
not easy to acquire, but I can explain at least one scenario when it
occasionally happens.

1) Imagine a ring of L2 metro ethernet
2) Ring is connected to two PE routers, for redundancy
3) Customers are connected to ring ports and backhauled over VLAN to PE

If there is very little traffic from Network=>Customer, the L2 metro
forgets the MAC of customer subinterfaces (or VRRP) on the PE routers.
Then when the client sends a packet to the Internet, the L2 floods it
to all eligible ports, and it'll arrive to both PE routers, which will
continue to forward it to the Internet.
This requires an unfortunate (but typical) combination of ARP timeout
and MAC timeout, so that sender still has ARP cache, while switch
doesn't have MAC cache.
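That timer interaction can be sketched as a toy timeline simulation. The numbers are the defaults quoted in this thread (Catalyst 300s MAC aging, Cisco 4h ARP timeout, NTP 1024s maxpoll); the assumption that only the NTP exchange itself touches the switch's MAC entry for the upstream router is an illustrative simplification, not a claim about any particular platform:

```python
# Toy model of the duplication scenario: host's ARP cache outlives the
# switch's MAC table, so each poll leaves as unknown-unicast flood.
MAC_AGE = 300       # Catalyst default MAC table aging (seconds)
ARP_AGE = 4 * 3600  # Cisco default ARP cache timeout (seconds)
POLL = 1024         # NTP maxpoll interval (seconds)

polls = 100
duplicated = 0
last_seen = 0.0     # when the switch last saw a frame toward the PE's MAC
for i in range(1, polls + 1):
    t = i * POLL
    # Host still holds the ARP entry (ARP_AGE >> POLL), so it sends a
    # unicast frame with no preceding ARP broadcast to re-teach the switch.
    assert POLL < ARP_AGE
    if t - last_seen > MAC_AGE:
        # Switch aged out the MAC: unknown-unicast flood; both PEs
        # receive the frame and both forward it toward the Internet.
        duplicated += 1
    last_seen = t   # the exchange re-teaches the MAC, but only briefly

print(f"{duplicated}/{polls} polls duplicated")  # 100/100 polls duplicated
```

With 1024s between polls and a 300s MAC aging timer, every single poll goes out flooded, which is why the combination is described above as unfortunate but typical.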

In the opposite direction this same topology can cause loops, when PE
routers still have a customer MAC in the ARP table, but L2 switch
doesn't have the MAC.

I wouldn't personally add code in applications to handle this case
more gracefully.
-- 
  ++ytti


Is there any data on packet duplication?

2020-06-22 Thread Hal Murray


How often do packets magically get duplicated within the network so that the 
target receives 2 copies?  That seems like something somebody at NANOG might 
have studied and given a talk on.

Any suggestions for other places to look?

Context is NTP.  If a client gets an answer, should it keep the socket around 
for a short time so that any late responses or duplicates from the network 
don't turn into ICMP port unreachables back at the server?  Nothing critical, 
just general clutter reduction.
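The socket-lingering idea can be sketched roughly as follows. This is a minimal illustration, not how any real NTP client does it; the server name and the 2-second linger window are arbitrary assumptions:

```python
# Sketch: after taking the first NTP reply, keep the socket open briefly
# so late or duplicate responses are consumed here instead of being
# bounced back to the server as ICMP port-unreachable.
import socket
import time

def make_request():
    # LI=0, VN=4, Mode=3 (client) -> first byte 0x23, then 47 zero bytes.
    return b'\x23' + 47 * b'\0'

def query_with_linger(server, linger=2.0):
    s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    s.settimeout(1.0)
    s.sendto(make_request(), (server, 123))
    try:
        reply, _ = s.recvfrom(512)       # the answer we actually use
    except socket.timeout:
        reply = None
    # Grace period: silently drain any duplicates before closing.
    deadline = time.monotonic() + linger
    s.settimeout(0.2)
    while time.monotonic() < deadline:
        try:
            s.recvfrom(512)              # discard late copies
        except socket.timeout:
            pass
    s.close()
    return reply
```

The trade-off is holding a file descriptor (and port) open slightly longer per query in exchange for less ICMP clutter on the server side.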

I have packet captures from a NTP server.  I'm trying to sort things out.  
There are a surprising (to me) number of duplicates that arrive back-to-back, 
sometimes the timestamp is the same microsecond.  They could come from buggy 
clients, but that seems like an unlikely sort of bug.

-- 
These are my opinions.  I hate spam.





Re: 60 ms cross-continent

2020-06-22 Thread Rod Beck
Microwave is used for long-haul wireless transmission by the ultra-low-latency 
crowd. Free-space laser has more bandwidth, but is sensitive to fog and, at 
least until the last few years, had much less range. I sell ULL routes to financial 
players. A 10 meg microwave circuit CME/Secaucus Equinix ranges from $185K per 
month to $20K a month.


From: NANOG  on behalf 
of Joe Hamelin 
Sent: Saturday, June 20, 2020 10:19 PM
To: Alejandro Acosta 
Cc: NANOG list 
Subject: Re: 60 ms cross-continent

On Sat, Jun 20, 2020 at 12:56 PM Alejandro Acosta <alejandroacostaal...@gmail.com> wrote:

Hello,

  Taking advantage of this thread, may I ask something? I have heard of 
"wireless fiber optic", something like an antenna with a laser pointing from 
one building to the other. Having said this, can I assume this link will have 
lower RTT than a laser through a fiber optic made of glass?

See: Terrabeam from about the year 2000.

--
Joe Hamelin, W7COM, Tulalip, WA, +1 (360) 474-7474





Re: 60 ms cross-continent

2020-06-22 Thread Rod Beck
Have you accounted for glass as opposed to vacuum? And the fact that fiber 
optic networks can't be straight lines if their purpose is to aggregate traffic 
along the way and they also need to follow some less-than-straight right of way.

Regards,

Roderick.


From: NANOG  on behalf 
of Stephen Satchell via NANOG 
Sent: Monday, June 22, 2020 9:37 PM
To: nanog@nanog.org 
Subject: Re: 60 ms cross-continent

On 6/22/20 12:59 AM, adamv0...@netconsultings.com wrote:
>> William Herrin
>>
>> Howdy,
>>
>> Why is latency between the east and west coasts so bad? Speed of light
>> accounts for about 15ms each direction for a 30ms round trip. Where does
>> the other 30ms come from and why haven't we gotten rid of it?
>>
> Wallstreet did :)
> https://www.wired.com/2012/08/ff_wallstreet_trading/

“Of course, you’d need a particle accelerator to make it work.”

So THAT'S why CERN wants to build an even bigger accelerator than the LHC!


Re: why am i in this handbasket? (was Devil's Advocate - Segment Routing, Why?)

2020-06-22 Thread Randy Bush
>> The requirement from the E2E principle is that routers should be
>> dumb and hosts should be clever or the entire system does not
>> scale reliably.
> 
> And yet in the PTT world, it was the other way around. Clever switching
> and dumb telephone boxes.

how did that work out for the ptts?  :)


Re: 60 ms cross-continent

2020-06-22 Thread Stephen Satchell via NANOG

On 6/22/20 12:59 AM, adamv0...@netconsultings.com wrote:

>> William Herrin
>>
>> Howdy,
>>
>> Why is latency between the east and west coasts so bad? Speed of light
>> accounts for about 15ms each direction for a 30ms round trip. Where does
>> the other 30ms come from and why haven't we gotten rid of it?
>>
> Wallstreet did :)
> https://www.wired.com/2012/08/ff_wallstreet_trading/


“Of course, you’d need a particle accelerator to make it work.”

So THAT'S why CERN wants to build an even bigger accelerator than the LHC!
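The 15 ms figure in the quoted question is easy to sanity-check. The distance and refractive index below are round-number assumptions (roughly a NYC-to-SF path, typical silica fiber):

```python
# Back-of-the-envelope check of the ~15 ms one-way coast-to-coast figure.
C_VACUUM_KM_S = 299_792    # speed of light in vacuum, km/s
FIBER_INDEX = 1.47         # typical refractive index of silica fiber
distance_km = 4_500        # rough NYC-to-SF path length (assumption)

one_way_vacuum_ms = distance_km / C_VACUUM_KM_S * 1000
one_way_fiber_ms = one_way_vacuum_ms * FIBER_INDEX  # light is slower in glass

print(f"vacuum: {one_way_vacuum_ms:.1f} ms, fiber: {one_way_fiber_ms:.1f} ms")
# vacuum: 15.0 ms, fiber: 22.1 ms
```

So glass alone turns the 15 ms vacuum figure into roughly 22 ms one way before any routing detours or queueing, which is much of the "missing" 30 ms.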


infrastructure focused interviews and questions

2020-06-22 Thread Mehmet Akcin
hi there,

A few weeks ago I started hosting a youtube/twitch/twitter live video
show (simultaneous stream) featuring key people who are involved in the
exec/operations/engineering of internet infrastructure companies, either as
consumers or service providers.

my idea is to create a platform where questions/concerns can be asked
directly to executives/key decision-makers and hopefully get answers. Very
similar to Reddit AMA but with focus on
telecom/datacenter/infrastructure/DNS/etc.

For example, I will have execs from Zayo, Windstream, Aqua Comms, and Packet
Fabric this week as my guests, where you will be able to ask
questions directly. There will also be a 1-1 discussion with APNIC's Chief
Scientist Geoff Huston about internet growth during lockdowns, and more.

you can watch the shows live, ask questions and be part of the conversation
from all these platforms.

example: https://www.youtube.com/watch?v=ff4x9IwBHEI starting in 15 minutes
with Windstream.

fblive https://www.facebook.com/infrapediainc/videos/270701217367462/

twitter https://twitter.com/mhm7kcn

If you have recommendations on speakers you want to see, or questions to ask, you
can contact me offlist.

I thought I would share this here, I am sorry if this is off-topic. Invite
people to participate, ask questions, and more.

mehmet


Re: why am i in this handbasket? (was Devil's Advocate - Segment Routing, Why?)

2020-06-22 Thread Mark Tinka



On 22/Jun/20 16:30, adamv0...@netconsultings.com wrote:

> Not quite,
> The routing information is flooded by default, but the receivers will cherry
> pick what they need and drop the rest. 
> And even if the default flooding of all and dropping most is a concern -it
> can be addressed where only the relevant subset of all the routing info is
> sent to each receiver.
> The key takeaway however is that no single entity in SP network, be it PE,
> or RR, or ASBR, ever needs everything, you can always slice and dice
> indefinitely.
> So to sum it up you simply can not run into any scaling ceiling with MP-BGP
> architecture.   

The only nodes in our network that have ALL the NLRI is our RR's.

Depending on the function of the egress/ingress router, the RR sends it
only what it needs for its function.

This is how we get away using communities in lieu of VRF's :-).

And as Adam points out, those RR's will swallow anything and everything,
and still remain asleep.

Mark.


Re: 60 ms cross-continent

2020-06-22 Thread Joe Hamelin
On Sat, Jun 20, 2020 at 12:56 PM Alejandro Acosta <
alejandroacostaal...@gmail.com> wrote:

> Hello,
>
> Taking advantage of this thread, may I ask something? I have heard of
> "wireless fiber optic", something like an antenna with a laser pointing
> from one building to the other. Having said this, can I assume this link
> will have lower RTT than a laser through a fiber optic made of glass?
>
>
See: Terrabeam from about the year 2000.

--
Joe Hamelin, W7COM, Tulalip, WA, +1 (360) 474-7474


RE: why am i in this handbasket? (was Devil's Advocate - Segment Routing, Why?)

2020-06-22 Thread adamv0025
> Masataka Ohta
> Sent: Monday, June 22, 2020 1:49 PM
> 
> Robert Raszuk wrote:
> 
> > Moreover if you have 1000 PEs and those three sites are attached only
> > to 6 of them - only those 6 PEs will need to learn those routes (Hint:
> > RTC -
> > RFC4684)
> 
> If you have 1000 PEs, you should be serving for somewhere around 1000
> customers.
> 
> And, if I understand BGP-MP correctly, all the routing information of all
the
> customers is flooded by BGP-MP in the ISP.
> 
Not quite,
The routing information is flooded by default, but the receivers will cherry
pick what they need and drop the rest. 
And even if the default flooding of all and dropping most is a concern -it
can be addressed where only the relevant subset of all the routing info is
sent to each receiver.
The key takeaway however is that no single entity in SP network, be it PE,
or RR, or ASBR, ever needs everything, you can always slice and dice
indefinitely.
So to sum it up you simply can not run into any scaling ceiling with MP-BGP
architecture.   
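The RTC behaviour referenced earlier in this thread (RFC 4684) can be illustrated with a toy model. This is only a conceptual sketch of the filtering outcome, not of the BGP machinery itself; the AS number, prefixes, and PE names are made up:

```python
# Toy illustration of RT-constrained route distribution (RFC 4684): a
# route reflector advertises a VPN route to a PE only if that PE has
# signalled interest in one of the route's route-targets.
routes = [
    {"prefix": "10.0.1.0/24", "rt": "64500:100"},
    {"prefix": "10.0.2.0/24", "rt": "64500:200"},
    {"prefix": "10.0.3.0/24", "rt": "64500:100"},
]

# RTs each PE imports, as learned by the RR via the rtfilter family.
pe_interest = {"PE1": {"64500:100"}, "PE2": {"64500:200"}}

def routes_for(pe):
    """Routes the RR actually sends to this PE after RTC filtering."""
    return [r["prefix"] for r in routes if r["rt"] in pe_interest[pe]]

print(routes_for("PE1"))  # ['10.0.1.0/24', '10.0.3.0/24']
print(routes_for("PE2"))  # ['10.0.2.0/24']
```

Each PE carries only the slice of VPN state it needs, which is the "slice and dice indefinitely" point above.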

adam




Re: why am i in this handbasket? (was Devil's Advocate - Segment Routing, Why?)

2020-06-22 Thread Nick Hilliard

Masataka Ohta wrote on 22/06/2020 13:49:

> But, it should be noted that a single class B routing table entry


"a single class B routing table entry"?  Did 1993 just call and ask for  
its addressing back? :-)



> But, it should be noted that a single class B routing table entry
> often serves for an organization with 10,000s of users, which is
> at least our case here at titech.ac.jp.
>
> It should also be noted that, my concern is scalability in ISP side.


This entire conversation is puzzling: we already have "hierarchical  
routing" to a large degree, to the extent that the public DFZ only sees  
aggregate routes exported by ASNs.  Inside ASNs, there will be internal  
aggregation of individual routes (e.g. an ISP DHCP pool), and possibly  
multiple levels of aggregation, depending on how this is configured.  
Aggregation is usually continued right down to the end-host edge, e.g. a  
router might have a /26 assigned on an interface, but the hosts will be  
aggregated within this /26.
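The aggregation point above is easy to demonstrate with Python's `ipaddress` module (the prefix is a documentation-range example, not from the thread):

```python
# Many host routes behind one interface collapse into the interface's /26:
# the rest of the network only ever needs the aggregate.
import ipaddress

# 64 individual /32 host entries, 203.0.113.64 through 203.0.113.127.
hosts = [ipaddress.ip_network(f"203.0.113.{i}/32") for i in range(64, 128)]

aggregate = list(ipaddress.collapse_addresses(hosts))
print(aggregate)  # [IPv4Network('203.0.113.64/26')]
```

One /26 in the routing table stands in for 64 hosts, and the same collapsing repeats at every level up to the ASN's exported aggregates.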



> If you have 1000 PEs, you should be serving for somewhere around 1000
> customers.
>
> And, if I understand BGP-MP correctly, all the routing information of
> all the customers is flooded by BGP-MP in the ISP.


Well, maybe.  Or maybe not.  This depend on lots of things.


> Then, it should be a lot better to let customer edges encapsulate
> L2 or L3 over IP, with which, routing information within customers
> is exchanged by customer provided VPN without requiring extra
> overhead of maintaining customer local routing information by the
> ISP.


If you have 1000 or even 10,000s of PEs, injecting simplistic  
non-aggregated routing information is unlikely to be an issue.  If you  
have 1,000,000 PEs, you'll probably need to rethink that position.


If your proposition is that the nature of the internet be changed so  
that route disaggregation is prevented, or that addressing policy be  
changed so that organisations are exclusively handed out IP address  
space by their upstream providers, then this is simple matter of  
misunderstanding of how impractical the proposition is: that horse  
bolted from the barn 30 years ago; no organisation would accept  
exclusive connectivity provided by a single upstream; and today's world  
of dense interconnection would be impossible on the terms you suggest.  
You may not like that there are lots of entries in the DFZ and many  
operators view this as a bit of a drag, but on today's technology, this  
can scale to significantly more than what we foresee in the medium-long  
term future.


Nick


Re: why am i in this handbasket? (was Devil's Advocate - Segment Routing, Why?)

2020-06-22 Thread Mark Tinka



On 22/Jun/20 15:17, Masataka Ohta wrote:

>  
>
> The point of Yakov on day one was that, flow driven approach of
> Ipsilon does not scale and is unacceptable.
>
> Though I agree with Yakov here, we must also eliminate all the
> flow driven approaches by MPLS or whatever.

I still don't see them in practice, even though they may have been proposed.

Mark.


Re: why am i in this handbasket? (was Devil's Advocate - Segment Routing, Why?)

2020-06-22 Thread Mark Tinka



On 22/Jun/20 15:08, Masataka Ohta wrote:

>  
> The requirement from the E2E principle is that routers should be
> dumb and hosts should be clever or the entire system does not
> scale reliably.

And yet in the PTT world, it was the other way around. Clever switching
and dumb telephone boxes. How things have since evened out.

I can understand the concern about making the network smart. But even a
smart network is not as smart as a host. My laptop can do a lot of
things more cleverly than any of the routers in my network. It just
can't do them at scale, consistently, for a bunch of users. So the
responsibility gets to be shared, with the number of users being served
diminishing as you enter and exit the edge of the network.

It's probably not yet an ideal networking paradigm, but it's the one we
have now that is a reasonably fair compromise.


>
> In this case, such clever router can ever exist only near the
> destination unless very detailed routing information is flooded
> all over the network to all the possible sources.

I will admit that bloating router code over recent years to become
terribly smart (CGN, Acceleration, DoS mitigation, VPN's, SD-WAN, IDS,
Video Monitoring, e.t.c.) can become a run away problem. I've often
joked that with all the things being thrown into BGP, we may just see it
carrying DNS too, hehe.

Personally, the level of intelligence we have in routers now beyond
being just Layer 1, 2, 3 - and maybe 4 - crunching machines is just as
far as I'm willing to go. If, like me, you keep pushing back on vendors
trying to make your routers also clean your dishes, they'll take the
hint and stop bloating the code.

Are MPLS/VPN's overly clever? I think so. But considering the pay-off
and how much worse it could get, I'm willing to accept that.


>
> A router can't be clever on something, unless it is provided
> with very detailed information on all the possible destinations,
> which needs a lot of routing traffic making entire system not
> to scale.

Well, if you can propose a better way to locate hosts on a global
network not owned by anyone, in a connectionless manner, I'm sure we'd
all be interested.

Mark.



RE: why am i in this handbasket? (was Devil's Advocate - Segment Routing, Why?)

2020-06-22 Thread adamv0025
> From: Masataka Ohta 
> Sent: Monday, June 22, 2020 2:17 PM
> 
> adamv0...@netconsultings.com wrote:
> 
> > But MPLS can be made flow driven (it can be made whatever the policy
> > dictates), for instance DSCP driven.
> 
> The point of Yakov on day one was that, flow driven approach of Ipsilon
does
> not scale and is unacceptable.
> 
> Though I agree with Yakov here, we must also eliminate all the flow driven
> approaches by MPLS or whatever.
> 
First I'd need a definition of what flow means in this discussion are we
considering 5-tuple or 4-tuple or just SRC-IP & DST-IP, is DSCP marking part
of it? 
Second, although I agree that ~1M unique identifiers is not ideal, can you
provide examples of MPLS applications where 1M is limiting?
What particular aspect?
Is it 1M interfaces per MPLS switching fabric box?
Or 1M unique flows (or better flow groups) originated by a given
VM/Container/CPE?
Or 1M destination entities (IPs or apps on those IPs) that any particular
VM/Container/CPE needs to talk to?
Or 1M customer VPNs or 1M PE-CPE links, if PE acts as a bottleneck?

adam
  



Re: why am i in this handbasket? (was Devil's Advocate - Segment Routing, Why?)

2020-06-22 Thread Mark Tinka



On 22/Jun/20 14:49, Masataka Ohta wrote:

>  
> But, it should be noted that a single class B...

CIDR - let's not teach the kids old news :-).

 
> If you have 1000 PEs, you should be serving for somewhere around 1000
> customers.

It's not linear.

We probably have 1 edge router serving several-thousand customers.


>
> And, if I understand BGP-MP correctly, all the routing information of
> all the customers is flooded by BGP-MP in the ISP.

Yes, best practice is in iBGP.

Some operators may still be using an IGP for this. It would work, but
scales poorly.


>
> Then, it should be a lot better to let customer edges encapsulate
> L2 or L3 over IP, with which, routing information within customers
> is exchanged by customer provided VPN without requiring extra
> overhead of maintaining customer local routing information by the
> ISP.

You mean like IP-in-IP or GRE? That already happens today, without any
intervention from the ISP.


>
> If a customer want customer-specific SLA, it can be described
> as SLA between customer edge routers, for which, intra-ISP MPLS
> may or may not be used.

l2vpn's and l3vpn's attract a higher SLA because the services are mostly
provisioned on-net. If an off-net component exists, it would be via a
trusted NNI partner.

Regular IP or GRE tunnels don't come with these kinds of SLA's because
the ISP isn't involved, and the B-end would very likely be off-net with
no SLA guarantees between the A-end customer's ISP and the remote ISP
hosting the B-end.


>
> For the ISP, it can be as profitable as PE-based VRF solutions,
> because customers so relying on ISPs will let the ISP provide
> and maintain customer edges.

There are few ISP's who would be able to terminate an IP or GRE tunnel
on-net, end-to-end.

And even then, they might be reluctant to offer any SLA's because those
tunnels are built on the CPE, typically outside of their control.


>
> The only difference should be on profitability for router makers,
> which want to make routing system as complex as possible or even
> a lot more than that to make backbone routers a lot profitable
> product.

If ISP's didn't make money from MPLS/VPN's, router vendors would not be
as keen on adding the capability in their boxes.


>
> Label stack was there, because of, now recognized to be wrong,
> statement of Yakov on day one and I can see no reason still to
> keep it.

Label stacking is fundamental to the "MP" part of MPLS. Whether your
payload is IP, ATM, Ethernet, Frame Relay, PPP, HDLC, e.t.c., the
ability to stack labels is what makes an MPLS network payload agnostic.
There is value in that.
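The stacking mechanism itself is simple: each stack entry is a fixed 4-byte word per RFC 3032, and only the innermost entry sets the bottom-of-stack bit. A minimal encoder (the label values below are arbitrary examples):

```python
# Minimal MPLS label-stack encoding per RFC 3032: each entry packs a
# 20-bit label, 3-bit TC, 1-bit bottom-of-stack (S), and 8-bit TTL
# into one 32-bit word. Stacking an inner service label under an outer
# transport label is what makes the core payload-agnostic.
import struct

def encode_entry(label, tc=0, bottom=False, ttl=64):
    assert label < 2**20, "labels are 20 bits"
    word = (label << 12) | (tc << 9) | (int(bottom) << 8) | ttl
    return struct.pack("!I", word)

def encode_stack(labels):
    # Only the last (innermost) entry carries the bottom-of-stack bit.
    return b"".join(
        encode_entry(lbl, bottom=(i == len(labels) - 1))
        for i, lbl in enumerate(labels)
    )

# Example: transport label 16001 on top, VPN service label 42 underneath.
stack = encode_stack([16001, 42])
print(len(stack))  # 8 -- two 4-byte entries
```

A P router in the core switches on the outer entry alone; whatever sits below the bottom-of-stack entry, IP or a pseudowire payload, is opaque to it.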

Mark.



Re: why am i in this handbasket? (was Devil's Advocate - Segment Routing, Why?)

2020-06-22 Thread Masataka Ohta

adamv0...@netconsultings.com wrote:


> But MPLS can be made flow driven (it can be made whatever the policy
> dictates), for instance DSCP driven…


The point of Yakov on day one was that, flow driven approach of
Ipsilon does not scale and is unacceptable.

Though I agree with Yakov here, we must also eliminate all the
flow driven approaches by MPLS or whatever.

Masataka Ohta


RE: Devil's Advocate - Segment Routing, Why?

2020-06-22 Thread adamv0025
Hi Baldur,

 

>From memory mx204 FIB is 10M (v4/v6) and RIB 30M for each v4 and v6.

And remember the FIB is hierarchical, so it's the next-hops per prefix you are 
referring to with BGP FRR. And also, going from memory of past scaling testing, 
if Pfx1+NH1 == x, then Pfx1+NH1+NH2 !== 2x, where x is the FIB space used. 

  

adam

 

From: NANOG  On Behalf Of 
Baldur Norddahl
Sent: Saturday, June 20, 2020 9:00 PM



I can't speak for the year 2000 as I was not doing networking at this level at 
that time. But when I check the specs for the base mx204 it says something like 
32 VRFs, 2 million routes in FIB and 6 million routes in RIB. Clearly those 
numbers are the total of routes across all VRFs otherwise you arrive at silly 
numbers (64 million FIB if you multiply, 128k FIB if you divide by 32). My 
conclusion is that scale wise you are ok as long you do not try to have more 
than one VRF with a complete copy of the DFZ.

 

More worrying is that 2 million routes will soon not be enough to install all 
routes with a backup route, invalidating BGP FRR.

 



Re: why am i in this handbasket? (was Devil's Advocate - Segment Routing, Why?)

2020-06-22 Thread Masataka Ohta

Mark Tinka wrote:


>> So, with hierarchical routing, routing protocols can
>> carry only rough information around destinations, from
>> which, source side can not construct detailed (often
>> purposelessly nested) labels required for MPLS.
>
> But hosts often point default to a clever router.

The requirement from the E2E principle is that routers should be
dumb and hosts should be clever or the entire system does not
scale reliably.

In this case, such clever router can ever exist only near the
destination unless very detailed routing information is flooded
all over the network to all the possible sources.

A router can't be clever about something unless it is provided
with very detailed information on all the possible destinations,
which requires a lot of routing traffic, making the entire system
not scale.


Masataka Ohta


Re: why am i in this handbasket? (was Devil's Advocate - Segment Routing, Why?)

2020-06-22 Thread Masataka Ohta

Robert Raszuk wrote:


Neither link wise nor host wise information is required to accomplish say
L3VPN services. Imagine you have three sites which would like to
interconnect each with 1000s of users.


For a single customer of an ISP with 1000s of end users. OK.

But, it should be noted that a single class B routing table entry
often serves an organization with tens of thousands of users, which is
at least our case here at titech.ac.jp.

It should also be noted that, my concern is scalability in ISP side.


Moreover if you have 1000 PEs and those three sites are attached only to 6
of them - only those 6 PEs will need to learn those routes (Hint: RTC -
RFC4684)


If you have 1000 PEs, you should be serving somewhere around 1000
customers.

And, if I understand BGP-MP correctly, all the routing information of
all the customers is flooded by BGP-MP in the ISP.

Then, it should be a lot better to let customer edges encapsulate
L2 or L3 over IP, with which routing information within customers
is exchanged by a customer-provided VPN, without the extra
overhead of the ISP maintaining customer-local routing
information.

If a customer wants a customer-specific SLA, it can be described
as an SLA between customer edge routers, for which intra-ISP MPLS
may or may not be used.

For the ISP, it can be as profitable as PE-based VRF solutions,
because customers relying on the ISP this way will let the ISP
provide and maintain the customer edges.

The only difference should be in profitability for router makers,
who want to make the routing system as complex as possible, or even
a lot more than that, to make backbone routers a far more profitable
product.


With nested labels, you don't need so many labels at a certain nesting
level, which was the point of Yakov; that does not mean you don't
need much information to create the entire nested label stack at or
near the sources.



Label stack is here from day one.


The label stack was there because of a day-one statement of Yakov,
now recognized to be wrong, and I can see no reason to still
keep it.

Masataka Ohta


RE: 60 ms cross-continent

2020-06-22 Thread adamv0025
> William Herrin
> 
> Howdy,
> 
> Why is latency between the east and west coasts so bad? Speed of light
> accounts for about 15ms each direction for a 30ms round trip. Where does
> the other 30ms come from and why haven't we gotten rid of it?
> 
Wall Street did :) 
https://www.wired.com/2012/08/ff_wallstreet_trading/
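For reference, the propagation math behind the question can be worked out directly; the NYC-LA great-circle distance and the fiber refractive index below are assumptions for illustration:

```python
# Rough propagation-delay math for coast-to-coast latency.
C_VACUUM_KM_S = 299_792                 # speed of light in vacuum, km/s
C_FIBER_KM_S = C_VACUUM_KM_S / 1.47     # light in silica fiber (~0.68c)

distance_km = 4_100                     # NYC-LA great circle (approx.)

one_way_vacuum_ms = distance_km / C_VACUUM_KM_S * 1000
one_way_fiber_ms = distance_km / C_FIBER_KM_S * 1000

print(round(one_way_vacuum_ms, 1))  # 13.7 ms: the vacuum "speed of light" floor
print(round(one_way_fiber_ms, 1))   # 20.1 ms: already higher in real fiber
```

So fiber's refractive index alone pushes the one-way floor from ~14 ms to ~20 ms, and real fiber routes are considerably longer than the great circle, which accounts for much of the "missing" latency before any equipment delay is counted.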

adam



RE: Devil's Advocate - Segment Routing, Why?

2020-06-22 Thread adamv0025
> From: NANOG  On Behalf Of Masataka Ohta
> Sent: Friday, June 19, 2020 5:01 PM
> 
> Robert Raszuk wrote:
> 
> > So I think Ohta-san's point is about scalability services not flat
> > underlay RIB and FIB sizes. Many years ago we had requests to support
> > 5M L3VPN routes while underlay was just 500K IPv4.
> 
> That is certainly a problem. However, a worse problem is to know the
> label values nested deeply in the MPLS label chain.
> 
> Even worse, if the router near the destination that is expected to pop
> the label chain goes down, how can the source know that the router went
> down and choose an alternative router near the destination?
> 
Via IGP or controller; but for sub-50ms convergence there are edge-node
protection mechanisms, so the point is the source doesn't even need to know
about it for the restoration to happen. 

adam
 



RE: why am i in this handbasket? (was Devil's Advocate - Segment Routing, Why?)

2020-06-22 Thread adamv0025
> Masataka Ohta
> Sent: Sunday, June 21, 2020 1:37 PM
> 
> > Whether you do it manually or use a label distribution protocol, FEC's
> > are pre-computed ahead of time.
> >
> > What am I missing?
> 
> If all the link-wise (or, worse, host-wise) information of possible
> destinations is distributed in advance to all the possible sources, it
> is not hierarchical but flat (host) routing, which scales poorly.
> 
> Right?
> 
> 
On the Internet, yes; in controlled environments, no, as in those
environments the set of possible destinations is well scoped. 

Take an MPLS-enabled DC, for instance: every VM needs to talk to only a
small subset of all the VMs hosted in the DC. 
Hence each VM gets flow-transport labels programmed via centralized
end-to-end flow controllers on a need-to-know basis (not everything to
everyone).
(E.g., dear vm1, this is how you get your EF/BE flows via the load-balancer
and FW to backend VMs in your local pod, this is how you get via the local
pod FW to the internet GW, etc..., done.)
Now that you have these neat "pipes" all over the place connecting VMs, it's
easy for the switching-fabric controller to shuffle elephant and mice flows
around in order to avoid any link saturation. 
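A minimal sketch of that "need to know" programming model: a central controller hands each VM only the label stacks for the few destinations it actually talks to. All VM names, destinations, and label values here are made up for illustration.

```python
# Toy flow controller: per-VM, per-destination label stacks are pushed
# on a need-to-know basis, rather than flooding everything to everyone.

from collections import defaultdict

class FlowController:
    def __init__(self):
        # vm -> {destination: precomputed MPLS label stack}
        self.programmed = defaultdict(dict)

    def program_flow(self, vm, dest, label_stack):
        """Push one precomputed label stack to one VM for one destination."""
        self.programmed[vm][dest] = label_stack

    def stacks_for(self, vm):
        return self.programmed[vm]

ctl = FlowController()
# vm1 only needs its local LB/FW path and the internet gateway:
ctl.program_flow("vm1", "backend-pool", [3001, 2001, 1001])  # LB -> FW -> pod
ctl.program_flow("vm1", "internet", [3002, 1999])            # pod FW -> GW

print(ctl.stacks_for("vm1"))
# vm1 never learns paths to the thousands of unrelated VMs in the DC.
```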

And now imagine going a bit further, doing the same as above but with CPEs
on a Service Provider network... yep, no PEs acting as chokepoints for MPLS
label-switched-path-to-flow assignment and needing massive FIBs; just a
dumb MPLS switch fabric, with all the "hard work" offloaded to centralized
controllers (and to CPEs for label-stack imposition), but only on a
need-to-know basis (not everything to everyone).

Now, in both cases you're free to choose to what extent the MPLS switch
fabric should be involved with the end-to-end flows, by imposing
hierarchies on the MPLS stack.  

In light of the above, does it suck to have just 20 bits of MPLS label space?
Absolutely.
 
Adam





RE: why am i in this handbasket? (was Devil's Advocate - Segment Routing, Why?)

2020-06-22 Thread adamv0025
But MPLS can be made flow driven (it can be made whatever the policy dictates), 
for instance DSCP driven…

 

adam

 

From: NANOG  On Behalf Of 
Robert Raszuk
Sent: Saturday, June 20, 2020 4:13 PM
To: Masataka Ohta 
Cc: North American Network Operators' Group 
Subject: Re: why am i in this handbasket? (was Devil's Advocate - Segment 
Routing, Why?)

 

 

The problem of MPLS, however, is that it must also be flow driven,
because detailed route information at the destination is necessary
to prepare nested labels at the source, which costs a lot and should
be attempted only for detected flows.

 

MPLS is not flow driven. I sent some mail about it but perhaps it bounced. 

 

MPLS LDP or L3VPNs was NEVER flow driven. 

 

Since day one till today it was and still is purely destination based. 

 

Transport is using LSP to egress PE (dst IP). 

 

L3VPNs use either per-dst-prefix, per-CE, or per-VRF labels. No 
implementation does anything upon "flow detection" to prepare any nested 
labels. Even in FIBs, all information is preprogrammed in hierarchical fashion 
well before any flow packet arrives. 
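A rough count may make the three allocation modes concrete; the VRF, CE, and prefix counts below are made-up example figures:

```python
# Label-count comparison for the three L3VPN label-allocation modes
# mentioned above. All counts are illustrative assumptions.

vrfs = 100
ces_per_vrf = 10
prefixes_per_vrf = 5_000

labels_per_vrf_mode = vrfs                        # one label per VRF (aggregate)
labels_per_ce_mode = vrfs * ces_per_vrf           # one label per attached CE
labels_per_prefix_mode = vrfs * prefixes_per_vrf  # one label per VPN prefix

print(labels_per_vrf_mode, labels_per_ce_mode, labels_per_prefix_mode)
# 100 1000 500000 -- all allocated ahead of time, none on flow detection
```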

 

Thx,
R.

 

 

 


 > there is the argument that switching MPLS is faster than IP; when the
 > pressure points i see are more at routing (BGP/LDP/RSVP/whatever),
 > recovery, and convergence.

The routing table of an IPv4 backbone today needs at most 16M entries,
which can be looked up in simple SRAM as fast as an MPLS lookup, which
is one of the reasons why we should obsolete IPv6.
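The 16M figure corresponds to 2^24, i.e. a flat table with one slot per /24, indexed directly by the top 24 bits of the destination address. A sketch of that direct-indexed lookup, with a Python dict standing in for the flat SRAM array (prefixes longer than /24 would need a second-level table, which this sketch omits):

```python
# Flat /24-indexed IPv4 lookup: one "SRAM read", no longest-match walk.
import ipaddress

TABLE_SIZE = 2 ** 24   # 16,777,216 slots: one per /24

fib = {}               # sparse stand-in for the flat hardware table

def install(prefix_24, nexthop):
    """Install a /24 route: the index is simply the first three octets."""
    net = ipaddress.ip_network(prefix_24)
    fib[int(net.network_address) >> 8] = nexthop

def lookup(dst):
    """Shift off the host octet, index the table, done."""
    return fib.get(int(ipaddress.ip_address(dst)) >> 8)

install("192.0.2.0/24", "nh-A")
print(lookup("192.0.2.55"))   # nh-A
print(TABLE_SIZE)             # 16777216
```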

Though resource-reserved flows need their own routing table entries,
they should be charged proportionally to the duration of the reservation,
which can scale to cover the cost of holding the entries.

Masataka Ohta