And just to add a little bit of fuel to this fire, let me share that
the base principle of the BGP spec mandating that routes be withdrawn when
the session goes down could, in the glory of the IETF, soon be history :(
It started with the proposal to make BGP state "persistent":
https://tools.ietf.org/ht
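(To make the principle concrete, a toy sketch in Python; this is not from the thread or from any implementation, and the class and its methods are made up. It only shows what the spec's rule means today and what a "persistent" variant would change.)

    class Rib:
        # Toy RIB; real implementations are vastly more involved.

        def __init__(self):
            self.routes = {}                      # prefix -> (peer, attrs)

        def learn(self, peer, prefix, attrs):
            self.routes[prefix] = (peer, attrs)

        def session_down(self, peer, persistent=False):
            if persistent:
                return                            # "persistent" BGP: keep the stale paths
            gone = [p for p, (src, _) in self.routes.items() if src == peer]
            for prefix in gone:
                del self.routes[prefix]           # classic rule: implicit withdraw on session loss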
On 2/Sep/20 15:12, Baldur Norddahl wrote:
> I am not buying it. No normal implementation of BGP stays online,
> replying to heart beat and accepting updates from ebgp peers, yet
> after 5 hours failed to process withdrawal from customers.
A BGP RFC spec. is not the same thing as a vendor's translation of it into running code.
On 2/Sep/20 13:36, Mike Hammett wrote:
> Sure, but I don't care how busy your router is, it shouldn't take
> hours to withdraw routes.
If only routers had feelings...
Mark.
On Wed, 2 Sep 2020, Warren Kumari wrote:
The root issue here is that the *public* RFO is incomplete / unclear.
Something something flowspec something, blocked flowspec, no more
something does indeed explain that something bad happened, but not
what caused the lack of withdraws / cascading churn
On Wed, Sep 2, 2020 at 3:04 PM Vincent Bernat wrote:
>
> ❦ 2 September 2020 16:35 +03, Saku Ytti:
>
> >> I am not buying it. No normal implementation of BGP stays online,
> >> replying to heart beat and accepting updates from ebgp peers, yet
> >> after 5 hours failed to process withdrawal from c
❦ 2 September 2020 16:35 +03, Saku Ytti:
>> I am not buying it. No normal implementation of BGP stays online,
>> replying to heart beat and accepting updates from ebgp peers, yet
>> after 5 hours failed to process withdrawal from customers.
>
I can imagine writing a BGP implementation like this
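(One way to picture that, as a purely hypothetical sketch in Python; rib and send are placeholders, and nothing here is claimed about CenturyLink's code: keepalives are answered out of band so the session always looks healthy, while every UPDATE, withdrawals included, waits its turn in one deep FIFO queue.)

    import queue
    import time

    update_queue = queue.Queue()          # announcements and withdrawals, in arrival order

    def keepalive_loop(send):
        # Runs on its own thread: the peer always sees a live, responsive session.
        while True:
            send("KEEPALIVE")
            time.sleep(30)

    def update_worker(rib):
        # Single consumer: a withdrawal queued behind hours of churn is acted on
        # hours late, even though no heartbeat was ever missed.
        while True:
            msg = update_queue.get()
            rib.apply(msg)                # policy, best path, FIB download, ...
            update_queue.task_done()

Under enough session churn that queue only grows, which is one (speculative) way "online but not withdrawing for hours" could come about.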
A detailed explanation can be found below.
https://blog.thousandeyes.com/centurylink-level-3-outage-analysis/
From: NANOG on behalf of Baldur Norddahl
Date: Wednesday, September 2, 2020 at 12:09 PM
To: "nanog@nanog.org"
Subject: Re: [outages] Major Level3 (CenturyLink) Issues
I believe someone on this list reported that updates were also broken. They
could not add prepending nor modify communities.
Anyway I am not saying it cannot happen because clearly something did
happen. I just don't believe it is a simple case of overload. There has to
be more to it.
Wed. 2 Sep.
> we don't form disaster response plans by saying "well, we could think
> about what *could* happen for days, but we'll just wait for something
> to occur".
from an old talk of mine, if it was part of the “plan” it’s an “event,”
if it is not then it’s a “disaster.”
Sure. But being good engineers, we love to exercise our brains by thinking
about possibilities and probabilities.
For example, we don't form disaster response plans by saying "well, we
could think about what *could* happen for days, but we'll just wait for
something to occur".
-A
On Wed, Sep 2,
creative engineers can conjecturbate for days on how some turtle in the
pond might write code that did not withdraw for a month, or other
delightful reasons CL might have had for this really really bad behavior.
the point is that the actual symptoms and cause really really should be
in the RFO
randy
Cisco had a bug a few years back that affected metro switches such that they
would not withdraw routes upstream. We had an internal outage and one of my
carriers kept advertising our prefixes even though we withdrew the routes. We
tried downing the neighbor and even shutting down the physical interface.
Yeah. This actually would be a fascinating study to understand exactly what
happened. The volume of BGP messages flying around because of the session
churn must have been absolutely massive, especially in a complex internal
infrastructure like 3356 has.
I would say the scale of such an event has t
On Wed, 2 Sep 2020 at 16:16, Baldur Norddahl wrote:
> I am not buying it. No normal implementation of BGP stays online, replying to
> heart beat and accepting updates from ebgp peers, yet after 5 hours failed to
> process withdrawal from customers.
I can imagine writing a BGP implementation like
I am not buying it. No normal implementation of BGP stays online, replying
to heart beat and accepting updates from ebgp peers, yet after 5 hours
failed to process withdrawal from customers.
On Wed, 2 Sep 2020 at 14:11, Saku Ytti wrote:
> On Wed, 2 Sep 2020 at 14:40, Mike Hammett wrote:
>
> > Sure,
On Wed, 2 Sep 2020 at 14:40, Mike Hammett wrote:
> Sure, but I don't care how busy your router is, it shouldn't take hours to
> withdraw routes.
Quite, the discussion is less about how we feel about it and more about
why it happens and what could be done about it.
--
++ytti
"Martijn Schmidt"
Cc: "Outages" , "North American Network Operators' Group"
Sent: Wednesday, September 2, 2020 2:15:46 AM
Subject: Re: [outages] Major Level3 (CenturyLink) Issues
On Wed, 2 Sep 2020 at 10:00, Martijn Schmidt via NANOG wrote:
> I supp
On Wed, 2 Sep 2020 at 12:50, Vincent Bernat wrote:
> It seems BIRD contains an implementation for RFC7313. From the source
> code, it delays removal of stale route until EoRR, but it doesn't seem
> to delay the work on updating the kernel. Juniper doesn't seem to
> implement it. Cisco seems to im
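(Roughly what "delays removal of stale routes until EoRR" means, as a sketch; this is not BIRD's actual code and the rib methods are invented: routes get marked stale at Begin-of-Route-Refresh, refreshed ones are unmarked as they re-arrive, and only the leftovers are withdrawn at End-of-Route-Refresh.)

    def begin_route_refresh(rib, peer):
        for route in rib.routes_from(peer):
            route.stale = True                # candidate for removal

    def on_update(rib, peer, prefix, attrs):
        route = rib.learn(peer, prefix, attrs)
        route.stale = False                   # re-announced routes survive the purge

    def end_route_refresh(rib, peer):
        for route in rib.routes_from(peer):
            if route.stale:
                rib.withdraw(route)           # only now do the missing routes go away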
❦ 2 September 2020 10:15 +03, Saku Ytti:
> RFC7313 might show us a way to reduce the amount of useless work. You might
> want to add a signal that initial convergence is done, you might want to
> add a signal that no installation or best path algo happens until all
> routes are loaded, this would massively
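(A sketch of that "no best path until everything is loaded" idea, with a hypothetical API and claimed for no particular vendor: buffer updates during initial convergence and run best-path selection and installation once, after End-of-RIB, instead of after every message.)

    def initial_convergence(session, rib):
        pending = []
        for msg in session.messages():
            if msg.is_end_of_rib:
                break
            pending.append(msg)               # no best path, no FIB churn yet
        for msg in pending:
            rib.add_path(msg)                 # load everything first...
        rib.run_best_path()                   # ...decide once
        rib.install()                         # ...and install once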
On Wed, 2 Sep 2020 at 10:00, Martijn Schmidt via NANOG wrote:
> I suppose now would be a good time for everyone to re-open their Centurylink
> ticket and ask why the RFO doesn't address the most important defect, e.g.
> the inability to withdraw announcements even by shutting down the session?
From: NANOG on behalf of Randy Bush
Sent: 02 September 2020 08:17
To: Outages ; North American Network Operators' Group
Subject: Re: [outages] Major Level3 (CenturyLink) Issues
the RFO is making the rounds
http://seele.lamehost.it/~marco/blind/Network_Event_Formal_RFO_Multiple_Markets_19543671_19544042_30_August.pdf
the RFO is making the rounds
http://seele.lamehost.it/~marco/blind/Network_Event_Formal_RFO_Multiple_Markets_19543671_19544042_30_August.pdf
it kinda explains the flowspec issue but completely ignores the stuck
routes, which imho was the more damaging problem.
randy