Re: [outages] Major Level3 (CenturyLink) Issues

2020-09-03 Thread Robert Raszuk
And just to add a little bit of fuel to this fire, let me share that the base principle of the BGP spec, which mandates withdrawing the routes when the session goes down, could soon be history, in the glory of the IETF :( It started with the proposal to make BGP state "persistent": https://tools.ietf.org/ht

Re: [outages] Major Level3 (CenturyLink) Issues

2020-09-03 Thread Mark Tinka
On 2/Sep/20 15:12, Baldur Norddahl wrote: > I am not buying it. No normal implementation of BGP stays online, > replying to heart beat and accepting updates from ebgp peers, yet > after 5 hours failed to process withdrawal from customers. A BGP RFC spec. is not the same thing as a vendor trans

Re: [outages] Major Level3 (CenturyLink) Issues

2020-09-03 Thread Mark Tinka
On 2/Sep/20 13:36, Mike Hammett wrote: > Sure, but I don't care how busy your router is, it shouldn't take > hours to withdraw routes. If only routers had feelings... Mark.

Re: [outages] Major Level3 (CenturyLink) Issues

2020-09-02 Thread Jon Lewis
On Wed, 2 Sep 2020, Warren Kumari wrote: The root issue here is that the *public* RFO is incomplete / unclear. Something something flowspec something, blocked flowspec, no more something does indeed explain that something bad happened, but not what caused the lack of withdraws / cascading churn

Re: [outages] Major Level3 (CenturyLink) Issues

2020-09-02 Thread Warren Kumari
On Wed, Sep 2, 2020 at 3:04 PM Vincent Bernat wrote: > > ❦ 2 September 2020 16:35 +03, Saku Ytti: > > >> I am not buying it. No normal implementation of BGP stays online, > >> replying to heart beat and accepting updates from ebgp peers, yet > >> after 5 hours failed to process withdrawal from c

Re: [outages] Major Level3 (CenturyLink) Issues

2020-09-02 Thread Vincent Bernat
❦ 2 September 2020 16:35 +03, Saku Ytti: >> I am not buying it. No normal implementation of BGP stays online, >> replying to heart beat and accepting updates from ebgp peers, yet >> after 5 hours failed to process withdrawal from customers. > > I can imagine writing BGP implementation like this

Re: [outages] Major Level3 (CenturyLink) Issues

2020-09-02 Thread Luke Guillory
Detailed explanation can be found below. https://blog.thousandeyes.com/centurylink-level-3-outage-analysis/ From: NANOG on behalf of Baldur Norddahl Date: Wednesday, September 2, 2020 at 12:09 PM To: "nanog@nanog.org" Subject: Re: [outages] Major Level3 (CenturyLink) Issues

Re: [outages] Major Level3 (CenturyLink) Issues

2020-09-02 Thread Baldur Norddahl
I believe someone on this list reported that updates were also broken. They could not add prepending nor modify communities. Anyway I am not saying it cannot happen because clearly something did happen. I just don't believe it is a simple case of overload. There has to be more to it. Wed, 2 Sep.

Re: [outages] Major Level3 (CenturyLink) Issues

2020-09-02 Thread Randy Bush
> we don't form disaster response plans by saying "well, we could think > about what *could* happen for days, but we'll just wait for something > to occur". from an old talk of mine, if it was part of the “plan” it’s an “event,” if it is not then it’s a “disaster.”

Re: [outages] Major Level3 (CenturyLink) Issues

2020-09-02 Thread Aaron C. de Bruyn via NANOG
Sure. But being good engineers, we love to exercise our brains by thinking about possibilities and probabilities. For example, we don't form disaster response plans by saying "well, we could think about what *could* happen for days, but we'll just wait for something to occur". -A On Wed, Sep 2,

Re: [outages] Major Level3 (CenturyLink) Issues

2020-09-02 Thread Randy Bush
creative engineers can conjecturbate for days on how some turtle in the pond might write code what did not withdraw for a month, or other delightful reasons CL might have had this really really bad behavior. the point is that the actual symptoms and cause really really should be in the RFO randy

Re: [outages] Major Level3 (CenturyLink) Issues

2020-09-02 Thread Dantzig, Brian
Cisco had a bug a few years back that affected metro switches such that they would not withdraw routes upstream. We had an internal outage and one of my carriers kept advertising our prefixes even though we withdrew the routes. We tried downing the neighbor and even shutting down the physical in

Re: [outages] Major Level3 (CenturyLink) Issues

2020-09-02 Thread Tom Beecher
Yeah. This actually would be a fascinating study to understand exactly what happened. The volume of BGP messages flying around because of the session churn must have been absolutely massive, especially in a complex internal infrastructure like 3356 has. I would say the scale of such an event has t

Re: [outages] Major Level3 (CenturyLink) Issues

2020-09-02 Thread Saku Ytti
On Wed, 2 Sep 2020 at 16:16, Baldur Norddahl wrote: > I am not buying it. No normal implementation of BGP stays online, replying to > heart beat and accepting updates from ebgp peers, yet after 5 hours failed to > process withdrawal from customers. I can imagine writing BGP implementation like

Re: [outages] Major Level3 (CenturyLink) Issues

2020-09-02 Thread Baldur Norddahl
I am not buying it. No normal implementation of BGP stays online, replying to heart beat and accepting updates from ebgp peers, yet after 5 hours failed to process withdrawal from customers. On Wed, 2 Sep 2020 at 14:11, Saku Ytti wrote: > On Wed, 2 Sep 2020 at 14:40, Mike Hammett wrote: > > > Sure,

Re: [outages] Major Level3 (CenturyLink) Issues

2020-09-02 Thread Saku Ytti
On Wed, 2 Sep 2020 at 14:40, Mike Hammett wrote: > Sure, but I don't care how busy your router is, it shouldn't take hours to > withdraw routes. Quite, discussion is less about how we feel about it and more about why it happens and what could be done to it. -- ++ytti

Re: [outages] Major Level3 (CenturyLink) Issues

2020-09-02 Thread Mike Hammett
"Martijn Schmidt" Cc: "Outages", "North American Network Operators' Group" Sent: Wednesday, September 2, 2020 2:15:46 AM Subject: Re: [outages] Major Level3 (CenturyLink) Issues On Wed, 2 Sep 2020 at 10:00, Martijn Schmidt via NANOG wrote: > I supp

Re: [outages] Major Level3 (CenturyLink) Issues

2020-09-02 Thread Saku Ytti
On Wed, 2 Sep 2020 at 12:50, Vincent Bernat wrote: > It seems BIRD contains an implementation for RFC7313. From the source > code, it delays removal of stale route until EoRR, but it doesn't seem > to delay the work on updating the kernel. Juniper doesn't seem to > implement it. Cisco seems to im

Re: [outages] Major Level3 (CenturyLink) Issues

2020-09-02 Thread Vincent Bernat
❦ 2 September 2020 10:15 +03, Saku Ytti: > RFC7313 might show us way to reduce amount of useless work. You might > want to add signal that initial convergence is done, you might want to > add signal that no installation or best path algo happens until all > route are loaded, this would massively

Re: [outages] Major Level3 (CenturyLink) Issues

2020-09-02 Thread Saku Ytti
On Wed, 2 Sep 2020 at 10:00, Martijn Schmidt via NANOG wrote: > I suppose now would be a good time for everyone to re-open their Centurylink > ticket and ask why the RFO doesn't address the most important defect, i.e. > the inability to withdraw announcements even by shutting down the session?

Re: [outages] Major Level3 (CenturyLink) Issues

2020-09-01 Thread Martijn Schmidt via NANOG
NANOG on behalf of Randy Bush Sent: 02 September 2020 08:17 To: Outages; North American Network Operators' Group Subject: Re: [outages] Major Level3 (CenturyLink) Issues the RFO is making the rounds http://seele.lamehost.it/~m

Re: [outages] Major Level3 (CenturyLink) Issues

2020-09-01 Thread Randy Bush
the RFO is making the rounds http://seele.lamehost.it/~marco/blind/Network_Event_Formal_RFO_Multiple_Markets_19543671_19544042_30_August.pdf it kinda explains the flowspec issue but completely ignores the stuck routes, which imiho was the more damaging problem. randy