Re: Centurylink having a bad morning?

2020-09-02 Thread Mark Tinka


On 31/Aug/20 16:33, Tomas Lynch wrote:

> Maybe we are idealizing these so-called tier-1 carriers and we,
> tier-ns, should treat them as what they really are: another AS. Accept
> that they are going to fail and do our best to mitigate the impact on
> our own networks, i.e. more peering.

Bingo!

Mark.


Re: Does anyone actually like CenturyLink?

2020-09-02 Thread Mark Tinka



On 30/Aug/20 17:20, Matt Hoppes wrote:

> No clue. They’ve been progressively  getting worse since 2010. I have no idea 
> why anyone chooses them and they shouldn’t be considered a Tier1 carrier with 
> the level of issues they have. 

For us, account management took a turn for the worse after CL picked
them up. But we only need the account team during contract renewals, so
it's not a drama.

We get transit from all the top 7 providers, so an issue with one of
them is never a problem.

Moments like this, when they offer a "marvelous" price to put all your
eggs in their one basket across all of your interconnect PoPs, should
remind you to choose diversity of connectivity instead.

Mark.


Re: Does anyone actually like CenturyLink?

2020-09-02 Thread Mark Tinka



On 30/Aug/20 17:15, vidister via NANOG wrote:

> Operating a CDN inside a Tier1 network is just shitty behaviour.

What's a Tier 1 network :-)?

Mark.


Re: [outages] Major Level3 (CenturyLink) Issues

2020-09-02 Thread Jon Lewis

On Wed, 2 Sep 2020, Warren Kumari wrote:


> The root issue here is that the *public* RFO is incomplete / unclear.
> Something something flowspec something, blocked flowspec, no more
> something does indeed explain that something bad happened, but not
> what caused the lack of withdraws / cascading churn.
> As with many interesting outages, I suspect that we will never get the
> full story, and "Something bad happened, we fixed it and now it's all
> better and will never happen ever again, trust us..." seems to be the
> new normal for public postmortems...


It's possible Level3's people don't fully understand what happened, or 
that the "bad flowspec rule", by causing BGP sessions to repeatedly flap 
network wide, triggered software bugs on their routers.  You've never seen 
rpd stuck at 100% CPU for hours, or an MX960 advertise history routes to 
external peers even after the internal session that had advertised the 
route to it had been cleared?


To quote Zaphod Beeblebrox: "Listen, three eyes, don't you try to outweird 
me. I get stranger things than you free with my breakfast cereal."


Kick a BGP implementation hard enough, and weird shit is likely to happen.

--
 Jon Lewis, MCP :)   |  I route
 StackPath, Sr. Neteng   |  therefore you are
_ http://www.lewis.org/~jlewis/pgp for PGP public key_


Re: [outages] Major Level3 (CenturyLink) Issues

2020-09-02 Thread Warren Kumari
On Wed, Sep 2, 2020 at 3:04 PM Vincent Bernat  wrote:
>
>  ❦  2 September 2020 16:35 +03, Saku Ytti:
>
> >> I am not buying it. No normal implementation of BGP stays online,
> >> replying to heart beat and accepting updates from ebgp peers, yet
> >> after 5 hours failed to process withdrawal from customers.
> >
> > I can imagine writing BGP implementation like this
> >
> >  a) own queue for keepalives, which i always serve first fully
> >  b) own queue for update, which i serve second
> >  c) own queue for withdraw, which i serve last
>
> Or maybe, graceful restart configured without a timeout on IPv4/IPv6?
> The flowspec rule severed the BGP session abruptly, stale routes are
> kept due to graceful restart (except flowspec rules), BGP sessions are
> reestablished, but the flowspec rule is handled before reaching EoR and
> we loop from there.

... or all routes are fed into some magic route optimization box which
is designed to keep things more stable and take advantage of cisco's
"step-10" to suck more traffic, or

The root issue here is that the *public* RFO is incomplete / unclear.
Something something flowspec something, blocked flowspec, no more
something does indeed explain that something bad happened, but not
what caused the lack of withdraws / cascading churn.
As with many interesting outages, I suspect that we will never get the
full story, and "Something bad happened, we fixed it and now it's all
better and will never happen ever again, trust us..." seems to be the
new normal for public postmortems...

W

> --
> Make sure your code "does nothing" gracefully.
> - The Elements of Programming Style (Kernighan & Plauger)



-- 
I don't think the execution is relevant when it was obviously a bad
idea in the first place.
This is like putting rabid weasels in your pants, and later expressing
regret at having chosen those particular rabid weasels and that pair
of pants.
   ---maf


Re: [outages] Major Level3 (CenturyLink) Issues

2020-09-02 Thread Vincent Bernat
 ❦  2 September 2020 16:35 +03, Saku Ytti:

>> I am not buying it. No normal implementation of BGP stays online,
>> replying to heart beat and accepting updates from ebgp peers, yet
>> after 5 hours failed to process withdrawal from customers.
>
> I can imagine writing BGP implementation like this
>
>  a) own queue for keepalives, which i always serve first fully
>  b) own queue for update, which i serve second
>  c) own queue for withdraw, which i serve last

Or maybe, graceful restart configured without a timeout on IPv4/IPv6?
The flowspec rule severed the BGP session abruptly, stale routes are
kept due to graceful restart (except flowspec rules), BGP sessions are
reestablished, but the flowspec rule is handled before reaching EoR and
we loop from there.
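
For concreteness, here is a rough Python sketch of the loop being
hypothesised; the RIB model, prefixes and rule name are made up for
illustration, and this is not how any vendor actually implements
graceful restart:

from collections import namedtuple

Route = namedtuple("Route", "prefix kind")

def restart_cycle(rib, replayed_routes, is_bad_flowspec):
    # session drops: keep everything as stale, except flowspec rules
    stale = {p: r for p, r in rib.items() if r.kind != "flowspec"}
    rib.clear()
    rib.update(stale)
    # session re-establishes and the peer replays its routes
    for r in replayed_routes:
        rib[r.prefix] = r
        if is_bad_flowspec(r):
            return False            # rule kills the transport before EoR
    # stale sweep only happens once End-of-RIB is reached
    for prefix in set(stale) - {r.prefix for r in replayed_routes}:
        del rib[prefix]
    return True

rib = {"198.51.100.0/24": Route("198.51.100.0/24", "unicast")}  # should be withdrawn
replay = [Route("203.0.113.0/24", "unicast"), Route("rule-1", "flowspec")]
for attempt in range(3):            # every reconnect attempt fails the same way
    if restart_cycle(rib, replay, lambda r: r.kind == "flowspec"):
        break
print("198.51.100.0/24 still present:", "198.51.100.0/24" in rib)  # True

Run as written, End-of-RIB is never reached, so the stale route is never
swept and withdrawals effectively never happen.
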
-- 
Make sure your code "does nothing" gracefully.
- The Elements of Programming Style (Kernighan & Plauger)


Re: Centurylink having a bad morning?

2020-09-02 Thread Alarig Le Lay
https://www.youtube.com/watch?v=vQ5MA685ApE

On Wed 02 Sep 2020 20:40:35 GMT, Baldur Norddahl wrote:
> That is what the 5G router is for...
> 
> On Wed, 2 Sep 2020 at 19:47, Michael Hallgren wrote:
> 
> > While conserving connectivity? 😂
> >
> >
> > --
> > *From:* Shawn L via NANOG
> > *Sent:* Wednesday, 2 September 2020 13:15
> > *To:* nanog
> > *Subject:* Re: Centurylink having a bad morning?
> >
> > We once moved a 3u server 30 miles between data centers this way.  Plug
> > redundant psu into a ups and 2 people carried it out and put them in a
> > vehicle.
> >
> >
> > Sent from my iPhone
> >
> > > On Sep 1, 2020, at 11:58 PM, Christopher Morrow 
> > wrote:
> > >
> > > On Tue, Sep 1, 2020 at 11:53 PM Alain Hebert 
> > wrote:
> > >>
> > >>As a coincidence...  I was *thinking* of moving a 90TB SAN (with
> > mechanical's) to another rack that way...  skateboard, long fibers and long
> > power cords =D
> > >>
> > >
> > > well, what you REALLY need is one of these:
> > >  https://www.cru-inc.com/products/wiebetech/hotplug_field_kit_product/
> > >
> > > and 2-3 UPS... swap to the UPS, then just roll the stack over, plug to
> > > utility and done. (minus network transfer)
> >


Re: AANP Akamai

2020-09-02 Thread Patrick W. Gilmore
netsupp...@akamai.com

-- 
TTFN,
patrick

> On Sep 2, 2020, at 2:40 PM, ahmed.dala...@hrins.net wrote:
> 
> Hello NANOG, 
> 
> Could somebody from Akamai AANP’s network team contact me off-list? I’ve 
> tried the peering and NOC and got no replies in months. 
> 
> Thanks
> Ahmed



AANP Akamai

2020-09-02 Thread ahmed.dala...@hrins.net
Hello NANOG, 

Could somebody from Akamai AANP’s network team contact me off-list? I’ve tried 
the peering and NOC and got no replies in months. 

Thanks
Ahmed 

Re: Centurylink having a bad morning?

2020-09-02 Thread Baldur Norddahl
That is what the 5G router is for...

On Wed, 2 Sep 2020 at 19:47, Michael Hallgren wrote:

> While conserving connectivity? 😂
>
>
> --
> *From:* Shawn L via NANOG
> *Sent:* Wednesday, 2 September 2020 13:15
> *To:* nanog
> *Subject:* Re: Centurylink having a bad morning?
>
> We once moved a 3u server 30 miles between data centers this way.  Plug
> redundant psu into a ups and 2 people carried it out and put them in a
> vehicle.
>
>
> Sent from my iPhone
>
> > On Sep 1, 2020, at 11:58 PM, Christopher Morrow 
> wrote:
> >
> > On Tue, Sep 1, 2020 at 11:53 PM Alain Hebert 
> wrote:
> >>
> >>As a coincidence...  I was *thinking* of moving a 90TB SAN (with
> mechanical's) to another rack that way...  skateboard, long fibers and long
> power cords =D
> >>
> >
> > well, what you REALLY need is one of these:
> >  https://www.cru-inc.com/products/wiebetech/hotplug_field_kit_product/
> >
> > and 2-3 UPS... swap to the UPS, then just roll the stack over, plug to
> > utility and done. (minus network transfer)
>


Re: Centurylink having a bad morning?

2020-09-02 Thread Michael Hallgren
While conserving connectivity? 😂



From: Shawn L via NANOG
Sent: Wednesday, 2 September 2020 13:15
To: nanog
Subject: Re: Centurylink having a bad morning?

We once moved a 3u server 30 miles between data centers this way.  Plug 
redundant psu into a ups and 2 people carried it out and put them in a vehicle. 
 


Sent from my iPhone 

> On Sep 1, 2020, at 11:58 PM, Christopher Morrow  
> wrote: 
> 
> On Tue, Sep 1, 2020 at 11:53 PM Alain Hebert  wrote: 
>> 
>>    As a coincidence...  I was *thinking* of moving a 90TB SAN (with 
>>mechanical's) to another rack that way...  skateboard, long fibers and long 
>>power cords =D 
>> 
> 
> well, what you REALLY need is one of these: 
>  https://www.cru-inc.com/products/wiebetech/hotplug_field_kit_product/ 
> 
> and 2-3 UPS... swap to the UPS, then just roll the stack over, plug to 
> utility and done. (minus network transfer) 


Re: [outages] Major Level3 (CenturyLink) Issues

2020-09-02 Thread Luke Guillory
Detailed explanation can be found below.


https://blog.thousandeyes.com/centurylink-level-3-outage-analysis/





From: NANOG  on behalf of 
Baldur Norddahl 
Date: Wednesday, September 2, 2020 at 12:09 PM
To: "nanog@nanog.org" 
Subject: Re: [outages] Major Level3 (CenturyLink) Issues

I believe someone on this list reported that updates were also broken. They 
could not add prepending nor modify communities.

Anyway I am not saying it cannot happen because clearly something did happen. I 
just don't believe it is a simple case of overload. There has to be more to it.
On Wed, 2 Sep 2020 at 15:36, Saku Ytti <s...@ytti.fi> wrote:
On Wed, 2 Sep 2020 at 16:16, Baldur Norddahl
<baldur.nordd...@gmail.com> wrote:

> I am not buying it. No normal implementation of BGP stays online, replying to 
> heart beat and accepting updates from ebgp peers, yet after 5 hours failed to 
> process withdrawal from customers.

I can imagine writing BGP implementation like this

 a) own queue for keepalives, which i always serve first fully
 b) own queue for update, which i serve second
 c) own queue for withdraw, which i serve last

Why I might think this makes sense, is perhaps I just received from
RR2 prefix I'm pulling from RR1, if I don't handle all my updates
first, I'm causing outage that should not happen, because I already
actually received the update telling I don't need to withdraw it.

Is this the right way to do it? Maybe not, but it's easy to imagine
why it might seem like a good idea.

How well BGP works in common cases and how it works in pathologically
scaled and busy cases are very different cases.

I know that even in stable states commonly run vendors on commonly run
hardware can take +2h to finish converging iBGP on initial turn-up.

--
  ++ytti


Re: [outages] Major Level3 (CenturyLink) Issues

2020-09-02 Thread Baldur Norddahl
I believe someone on this list reported that updates were also broken. They
could not add prepending nor modify communities.

Anyway I am not saying it cannot happen because clearly something did
happen. I just don't believe it is a simple case of overload. There has to
be more to it.

On Wed, 2 Sep 2020 at 15:36, Saku Ytti wrote:

> On Wed, 2 Sep 2020 at 16:16, Baldur Norddahl 
> wrote:
>
> > I am not buying it. No normal implementation of BGP stays online,
> replying to heart beat and accepting updates from ebgp peers, yet after 5
> hours failed to process withdrawal from customers.
>
> I can imagine writing BGP implementation like this
>
>  a) own queue for keepalives, which i always serve first fully
>  b) own queue for update, which i serve second
>  c) own queue for withdraw, which i serve last
>
> Why I might think this makes sense, is perhaps I just received from
> RR2 prefix I'm pulling from RR1, if I don't handle all my updates
> first, I'm causing outage that should not happen, because I already
> actually received the update telling I don't need to withdraw it.
>
> Is this the right way to do it? Maybe not, but it's easy to imagine
> why it might seem like a good idea.
>
> How well BGP works in common cases and how it works in pathologically
> scaled and busy cases are very different cases.
>
> I know that even in stable states commonly run vendors on commonly run
> hardware can take +2h to finish converging iBGP on initial turn-up.
>
> --
>   ++ytti
>


Re: [outages] Major Level3 (CenturyLink) Issues

2020-09-02 Thread Randy Bush
> we don't form disaster response plans by saying "well, we could think
> about what *could* happen for days, but we'll just wait for something
> to occur".

from an old talk of mine, if it was part of the “plan” it’s an “event,”
if it is not then it’s a “disaster.”



Re: Centurylink having a bad morning?

2020-09-02 Thread Warren Kumari
On Wed, Sep 2, 2020 at 12:00 AM Christopher Morrow
 wrote:
>
> On Tue, Sep 1, 2020 at 11:53 PM Alain Hebert  wrote:
> >
> > As a coincidence...  I was *thinking* of moving a 90TB SAN (with 
> > mechanical's) to another rack that way...  skateboard, long fibers and long 
> > power cords =D
> >
>
> well, what you REALLY need is one of these:
>   https://www.cru-inc.com/products/wiebetech/hotplug_field_kit_product/
>

Yeah, no... actually, hell no!

That setup scares me, and I'm surprised that it can be sold at all,
even with many warning labels and disclaimers...
After the first time I saw it (I suspect also due to Chris!) I tried
doing something similar -- I cut the ends off a power cord, attached
alligator clips and moved a lamp from one outlet to another -- this
was all on the same circuit (no UPS, no difference in potential, etc)
and so it doesn't need anything to switch between supplies. I checked
with a multimeter before making the connections (to triple check) that
I had live and neutral correct, had an in-circuit GFCI, and was
wearing rubber gloves. It *worked*, but having a plug with live,
exposed pins is not something I want to repeat.

On a related note - my wife once spent much time trying to explain to
one of her clients why they cannot just plug the input of their power
strip into the output of the same powerstrip, and get free
'lectricity... "But power comes out of the socket!!!" , "Well, yes,
but it has to get into the powerstrip" , "Yah! That's why I plug the
plug into it..."... I think that eventually she just demonstrated
(again!) that it doesn't work, and then muttered something about
"Magic"...

W

> and 2-3 UPS... swap to the UPS, then just roll the stack over, plug to
> utility and done. (minus network transfer)



-- 
I don't think the execution is relevant when it was obviously a bad
idea in the first place.
This is like putting rabid weasels in your pants, and later expressing
regret at having chosen those particular rabid weasels and that pair
of pants.
   ---maf


Re: [outages] Major Level3 (CenturyLink) Issues

2020-09-02 Thread Aaron C. de Bruyn via NANOG
Sure.  But being good engineers, we love to exercise our brains by thinking
about possibilities and probabilities.
For example, we don't form disaster response plans by saying "well, we
could think about what *could* happen for days, but we'll just wait for
something to occur".

-A

On Wed, Sep 2, 2020 at 8:51 AM Randy Bush  wrote:

> creative engineers can conjecturbate for days on how some turtle in the
> pond might write code what did not withdraw for a month, or other
> delightful reasons CL might have had this really really bad behavior.
>
> the point is that the actual symptoms and cause really really should be
> in the RFO
>
> randy
>


Re: [outages] Major Level3 (CenturyLink) Issues

2020-09-02 Thread Randy Bush
creative engineers can conjecturbate for days on how some turtle in the
pond might write code what did not withdraw for a month, or other
delightful reasons CL might have had this really really bad behavior.

the point is that the actual symptoms and cause really really should be
in the RFO

randy


Re: [outages] Major Level3 (CenturyLink) Issues

2020-09-02 Thread Dantzig, Brian
Cisco had a bug a few years back that affected metro switches such that they 
would not withdraw routes upstream. We had an internal outage and one of my 
carriers kept advertising our prefixes even though we withdrew the routes. We 
tried downing the neighbor and even shutting down the physical interface to no 
avail. The carrier kept blackholing us until they shut down on their metro 
switch.


Re: [outages] Major Level3 (CenturyLink) Issues

2020-09-02 Thread Tom Beecher
Yeah. This actually would be a fascinating study to understand exactly what
happened. The volume of BGP messages flying around because of the session
churn must have been absolutely massive, especially in a complex internal
infrastructure like 3356 has.

I would say the scale of such an event has to be many orders of magnitude
beyond what anyone ever designed for, so it doesn't shock me at all that
unexpected behavior occurred. But that's why we're engineers; we want to
understand such things.

On Wed, Sep 2, 2020 at 9:37 AM Saku Ytti  wrote:

> On Wed, 2 Sep 2020 at 16:16, Baldur Norddahl 
> wrote:
>
> > I am not buying it. No normal implementation of BGP stays online,
> replying to heart beat and accepting updates from ebgp peers, yet after 5
> hours failed to process withdrawal from customers.
>
> I can imagine writing BGP implementation like this
>
>  a) own queue for keepalives, which i always serve first fully
>  b) own queue for update, which i serve second
>  c) own queue for withdraw, which i serve last
>
> Why I might think this makes sense, is perhaps I just received from
> RR2 prefix I'm pulling from RR1, if I don't handle all my updates
> first, I'm causing outage that should not happen, because I already
> actually received the update telling I don't need to withdraw it.
>
> Is this the right way to do it? Maybe not, but it's easy to imagine
> why it might seem like a good idea.
>
> How well BGP works in common cases and how it works in pathologically
> scaled and busy cases are very different cases.
>
> I know that even in stable states commonly run vendors on commonly run
> hardware can take +2h to finish converging iBGP on initial turn-up.
>
> --
>   ++ytti
>


Re: Centurylink having a bad morning?

2020-09-02 Thread Jason Kuehl
If the client pays me a shit ton of money to make sure the server
won't turn off, and they pay for the hardware to make it happen, I'd
think about it. It's like a colo move on hard mode.

It's extremely stupid, and I would advise not doing it.

Hell, even when I migrated an e911 server, we had a 20-minute outage to
move the physical server. If that server can't be shut off, something
was built wrong.

On Wed, Sep 2, 2020 at 9:33 AM Bryan Holloway  wrote:

>
> On 9/2/20 1:49 PM, Nick Hilliard wrote:
> > Shawn L via NANOG wrote on 02/09/2020 12:15:
> >> We once moved a 3u server 30 miles between data centers this way.
> >> Plug redundant psu into a ups and 2 people carried it out and put
> >> them in a vehicle.
> >
> > hopefully none of these server moves that people have been talking about
> > involved spinning disks.  If they did, kit damage is one of the likely
> > outcomes - you seriously do not want to bump active spindles:
> >
> > www.google.com/search?q=disk+platter+damage&tbm=isch
> >
> > SSDs are a different story. In that case it's just a bit odd as to why
> > you wouldn't want to power down a system to physically move it - in the
> > sense that if your service delivery model can't withstand periodic
> > maintenance and loss of availability of individual components,
> > rethinking the model might be productive.
> >
> > Nick
> >
>
> If it's your server, moving beyond (very) local facilities, and time is
> not of the essence, then sure: power down.
>
> If you're law-enforcement mid-raid, or trying to preserve your Frogger
> high-score, well, ...
>


-- 
Sincerely,

Jason W Kuehl
Cell 920-419-8983
jason.w.ku...@gmail.com


Re: [outages] Major Level3 (CenturyLink) Issues

2020-09-02 Thread Saku Ytti
On Wed, 2 Sep 2020 at 16:16, Baldur Norddahl  wrote:

> I am not buying it. No normal implementation of BGP stays online, replying to 
> heart beat and accepting updates from ebgp peers, yet after 5 hours failed to 
> process withdrawal from customers.

I can imagine writing a BGP implementation like this:

 a) own queue for keepalives, which I always serve first, fully
 b) own queue for updates, which I serve second
 c) own queue for withdraws, which I serve last

Why might this seem to make sense? Perhaps I just received from RR2 a
prefix that I'm pulling from RR1; if I don't handle all my updates
first, I cause an outage that should not happen, because I already
received the update telling me I don't need to withdraw it.

Is this the right way to do it? Maybe not, but it's easy to imagine
why it might seem like a good idea.
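
As a purely hypothetical illustration of that ordering (a toy Python
sketch, not any vendor's BGP code), a dispatcher built this way starves
the withdraw queue for as long as keepalives and updates keep arriving:

import collections

class ToyBgpSpeaker:
    """Toy dispatcher: keepalives first, then updates, then withdraws."""
    def __init__(self):
        self.keepalives = collections.deque()
        self.updates = collections.deque()
        self.withdraws = collections.deque()

    def enqueue(self, msg_type, payload=None):
        {"KEEPALIVE": self.keepalives,
         "UPDATE": self.updates,
         "WITHDRAW": self.withdraws}[msg_type].append(payload)

    def run_once(self):
        # 1) always answer every pending keepalive, so sessions never drop
        while self.keepalives:
            self.keepalives.popleft()      # pretend we replied
        # 2) then one update; it may supersede a queued withdraw
        if self.updates:
            self.updates.popleft()         # pretend we installed it
            return "update"
        # 3) withdraws only get touched when nothing else is pending
        if self.withdraws:
            self.withdraws.popleft()       # pretend we removed the route
            return "withdraw"
        return "idle"

speaker = ToyBgpSpeaker()
for _ in range(5):
    speaker.enqueue("WITHDRAW")
for _ in range(100):
    speaker.enqueue("UPDATE")
while speaker.updates:                     # sustained churn keeps refilling this
    speaker.run_once()
print(len(speaker.withdraws))              # still 5: no withdraw processed yet

From the outside that looks exactly like "keepalives answered, updates
accepted, withdrawals ignored for hours".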

How well BGP works in common cases and how it works in pathologically
scaled and busy cases are very different cases.

I know that even in stable states commonly run vendors on commonly run
hardware can take +2h to finish converging iBGP on initial turn-up.

-- 
  ++ytti


Re: Centurylink having a bad morning?

2020-09-02 Thread Bryan Holloway



On 9/2/20 1:49 PM, Nick Hilliard wrote:

Shawn L via NANOG wrote on 02/09/2020 12:15:

We once moved a 3u server 30 miles between data centers this way.
Plug redundant psu into a ups and 2 people carried it out and put
them in a vehicle.


hopefully none of these server moves that people have been talking about 
involved spinning disks.  If they did, kit damage is one of the likely 
outcomes - you seriously do not want to bump active spindles:


www.google.com/search?q=disk+platter+damage&tbm=isch

SSDs are a different story. In that case it's just a bit odd as to why 
you wouldn't want to power down a system to physically move it - in the 
sense that if your service delivery model can't withstand periodic 
maintenance and loss of availability of individual components, 
rethinking the model might be productive.


Nick



If it's your server, moving beyond (very) local facilities, and time is 
not of the essence, then sure: power down.


If you're law-enforcement mid-raid, or trying to preserve your Frogger 
high-score, well, ...


Re: [outages] Major Level3 (CenturyLink) Issues

2020-09-02 Thread Baldur Norddahl
I am not buying it. No normal implementation of BGP stays online, replying
to heartbeats and accepting updates from eBGP peers, yet after 5 hours has
failed to process withdrawals from customers.


On Wed, 2 Sep 2020 at 14:11, Saku Ytti wrote:

> On Wed, 2 Sep 2020 at 14:40, Mike Hammett  wrote:
>
> > Sure, but I don't care how busy your router is, it shouldn't take hours
> to withdraw routes.
>
> Quite, discussion is less about how we feel about it and more about
> why it happens and what could be done to it.
>
> --
>   ++ytti
>


Re: [outages] Major Level3 (CenturyLink) Issues

2020-09-02 Thread Saku Ytti
On Wed, 2 Sep 2020 at 14:40, Mike Hammett  wrote:

> Sure, but I don't care how busy your router is, it shouldn't take hours to 
> withdraw routes.

Quite; the discussion is less about how we feel about it and more about
why it happens and what could be done about it.

-- 
  ++ytti


Re: Centurylink having a bad morning?

2020-09-02 Thread Nick Hilliard

Shawn L via NANOG wrote on 02/09/2020 12:15:

We once moved a 3u server 30 miles between data centers this way.
Plug redundant psu into a ups and 2 people carried it out and put
them in a vehicle.


hopefully none of these server moves that people have been talking about 
involved spinning disks.  If they did, kit damage is one of the likely 
outcomes - you seriously do not want to bump active spindles:


www.google.com/search?q=disk+platter+damage&tbm=isch

SSDs are a different story. In that case it's just a bit odd as to why 
you wouldn't want to power down a system to physically move it - in the 
sense that if your service delivery model can't withstand periodic 
maintenance and loss of availability of individual components, 
rethinking the model might be productive.


Nick



Re: [outages] Major Level3 (CenturyLink) Issues

2020-09-02 Thread Mike Hammett
Sure, but I don't care how busy your router is, it shouldn't take hours to 
withdraw routes. 




- 
Mike Hammett 
Intelligent Computing Solutions 
http://www.ics-il.com 

Midwest-IX 
http://www.midwest-ix.com 

- Original Message -

From: "Saku Ytti"  
To: "Martijn Schmidt"  
Cc: "Outages" , "North American Network Operators' Group" 
 
Sent: Wednesday, September 2, 2020 2:15:46 AM 
Subject: Re: [outages] Major Level3 (CenturyLink) Issues 

On Wed, 2 Sep 2020 at 10:00, Martijn Schmidt via NANOG  wrote: 

> I suppose now would be a good time for everyone to re-open their Centurylink 
> ticket and ask why the RFO doesn't address the most important defect, e.g. 
> the inability to withdraw announcements even by shutting down the session? 

The more work the BGP process has the longer it takes to complete that 
work. You could try in your RFP/RFQ if some provider will commit on 
specific convergence time, which would improve your position 
contractually and might make you eligible for some compensations or 
termination of contract, but realistically every operator can run into 
a situation where you will see what most would agree pathologically 
long convergence times. 

The more BGP sessions, more RIB entries the higher the probability 
that these issues manifest. Perhaps protocol level work can be 
justified as well. BGP doesn't have concept of initial convergence, if 
you have lot of peers, your initial convergence contains massive 
amount of useless work, because you keep changing best route, while 
you keep receiving new best routes, the higher the scale the more 
useless work you do and the longer stability you require to eventually 
~converge. Practical devices operators run may require hours during 
_normal operation_ to do initial converge. 

RFC7313 might show us way to reduce amount of useless work. You might 
want to add signal that initial convergence is done, you might want to 
add signal that no installation or best path algo happens until all 
route are loaded, this would massively improve scaled convergence as 
you wouldn't do that throwaway work, which ultimately inflates your 
work queue and pushes your useful work far to the future. 

The main thing as a customer I would ask, how can we fix it faster 
than 5h in future. Did we lose access to control-plane? Could we 
reasonably avoid losing it? 
-- 
++ytti 



Re: Centurylink having a bad morning?

2020-09-02 Thread Shawn L via NANOG
We once moved a 3U server 30 miles between data centers this way.  Plug the 
redundant PSU into a UPS, and 2 people carried it out and put it in a vehicle. 


Sent from my iPhone

> On Sep 1, 2020, at 11:58 PM, Christopher Morrow  
> wrote:
> 
> On Tue, Sep 1, 2020 at 11:53 PM Alain Hebert  wrote:
>> 
>>As a coincidence...  I was *thinking* of moving a 90TB SAN (with 
>> mechanical's) to another rack that way...  skateboard, long fibers and long 
>> power cords =D
>> 
> 
> well, what you REALLY need is one of these:
>  https://www.cru-inc.com/products/wiebetech/hotplug_field_kit_product/
> 
> and 2-3 UPS... swap to the UPS, then just roll the stack over, plug to
> utility and done. (minus network transfer)


Re: [outages] Major Level3 (CenturyLink) Issues

2020-09-02 Thread Saku Ytti
On Wed, 2 Sep 2020 at 12:50, Vincent Bernat  wrote:

> It seems BIRD contains an implementation for RFC7313. From the source
> code, it delays removal of stale route until EoRR, but it doesn't seem
> to delay the work on updating the kernel. Juniper doesn't seem to
> implement it. Cisco seems to implement it, but only on refresh, not on
> the initial connection. Is there some survey around this RFC?

Correct, it doesn't do anything for the initial connection, but I took it
as an example of how we might approach the problem of initial convergence
cost in scaled environments.

-- 
  ++ytti


Re: [outages] Major Level3 (CenturyLink) Issues

2020-09-02 Thread Vincent Bernat
 ❦  2 September 2020 10:15 +03, Saku Ytti:

> RFC7313 might show us way to reduce amount of useless work. You might
> want to add signal that initial convergence is done, you might want to
> add signal that no installation or best path algo happens until all
> route are loaded, this would massively improve scaled convergence as
> you wouldn't do that throwaway work, which ultimately inflates your
> work queue and pushes your useful work far to the future.

It seems BIRD contains an implementation for RFC7313. From the source
code, it delays removal of stale route until EoRR, but it doesn't seem
to delay the work on updating the kernel. Juniper doesn't seem to
implement it. Cisco seems to implement it, but only on refresh, not on
the initial connection. Is there some survey around this RFC?
-- 
Don't patch bad code - rewrite it.
- The Elements of Programming Style (Kernighan & Plauger)


Re: [outages] Major Level3 (CenturyLink) Issues

2020-09-02 Thread Saku Ytti
On Wed, 2 Sep 2020 at 10:00, Martijn Schmidt via NANOG  wrote:

> I suppose now would be a good time for everyone to re-open their Centurylink 
> ticket and ask why the RFO doesn't address the most important defect, e.g. 
> the inability to withdraw announcements even by shutting down the session?

The more work the BGP process has, the longer it takes to complete that
work. You could try in your RFP/RFQ to get a provider to commit to a
specific convergence time, which would improve your position
contractually and might make you eligible for some compensation or
termination of contract, but realistically every operator can run into
a situation where you will see what most would agree are pathologically
long convergence times.

The more BGP sessions and the more RIB entries, the higher the
probability that these issues manifest. Perhaps protocol-level work can
be justified as well. BGP doesn't have a concept of initial convergence:
if you have a lot of peers, your initial convergence contains a massive
amount of useless work, because you keep changing the best route while
you keep receiving new best routes; the higher the scale, the more
useless work you do and the longer the stability you require to
eventually ~converge. Practical devices operators run may require hours
during _normal operation_ to do the initial convergence.

RFC7313 might show us a way to reduce the amount of useless work. You
might want to add a signal that initial convergence is done, or a signal
that no installation or best-path algorithm runs until all routes are
loaded; this would massively improve scaled convergence, as you wouldn't
do that throwaway work, which ultimately inflates your work queue and
pushes your useful work far into the future.
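
As a toy illustration of the "no best path until all routes are loaded"
idea (a Python sketch of the concept only; it does not model RFC7313's
actual mechanics or any real implementation), deferring selection to an
End-of-RIB marker collapses per-update throwaway runs into a single pass:

class Rib:
    def __init__(self):
        self.paths = {}            # prefix -> {peer: as_path}
        self.best = {}
        self.bestpath_runs = 0     # count how much work we did

    def consider(self, prefix, peer, as_path):
        self.paths.setdefault(prefix, {})[peer] = as_path

    def run_best_path(self):
        self.bestpath_runs += 1
        for prefix, cands in self.paths.items():
            # toy tie-break: shortest AS path wins
            self.best[prefix] = min(cands, key=lambda p: len(cands[p]))

class DeferringPeer:
    """Buffer a peer's routes and converge once, at End-of-RIB."""
    def __init__(self, name, rib):
        self.name, self.rib, self.pending = name, rib, []

    def on_update(self, prefix, as_path):
        self.pending.append((prefix, as_path))   # no best-path work yet

    def on_end_of_rib(self):
        for prefix, as_path in self.pending:
            self.rib.consider(prefix, self.name, as_path)
        self.pending.clear()
        self.rib.run_best_path()                 # one pass instead of N

rib = Rib()
peer = DeferringPeer("RR1", rib)
for i in range(200):
    peer.on_update("192.0.2.%d/32" % i, [65001, 65002])
peer.on_end_of_rib()
print(rib.bestpath_runs)   # 1, versus 200 if we converged per update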

The main thing I would ask as a customer is how we can fix it faster
than 5h in future. Did we lose access to the control plane? Could we
reasonably have avoided losing it?
-- 
  ++ytti