Re: Centurylink having a bad morning?
On 31/Aug/20 16:33, Tomas Lynch wrote: > Maybe we are idealizing these so-called tier-1 carriers and we, > tier-ns, should treat them as what they really are: another AS. Accept > that they are going to fail and do our best to mitigate the impact on > our own networks, i.e. more peering. Bingo! Mark.
Re: Does anyone actually like CenturyLink?
On 30/Aug/20 17:20, Matt Hoppes wrote: > No clue. They’ve been progressively getting worse since 2010. I have no idea > why anyone chooses them and they shouldn’t be considered a Tier1 carrier with > the level of issues they have. For us, the account management took a turn for the worse after CL picked them up. But we only need them during contract renewals. So it's not a drama. We get transit from all the top 7 providers, so an issue with one of them is never a problem. It's moments like this when their offers to put all your eggs in their one basket, for all of your interconnect PoPs at a "marvelous" price, should remind you to choose diversity of connectivity instead. Mark.
Re: Does anyone actually like CenturyLink?
On 30/Aug/20 17:15, vidister via NANOG wrote: > Operating a CDN inside a Tier1 network is just shitty behaviour. What's a Tier 1 network :-)? Mark.
Re: [outages] Major Level3 (CenturyLink) Issues
On Wed, 2 Sep 2020, Warren Kumari wrote:
> The root issue here is that the *public* RFO is incomplete / unclear. Something something flowspec something, blocked flowspec, no more something does indeed explain that something bad happened, but not what caused the lack of withdraws / cascading churn. As with many interesting outages, I suspect that we will never get the full story, and "Something bad happened, we fixed it and now it's all better and will never happen ever again, trust us..." seems to be the new normal for public postmortems...
It's possible Level3's people don't fully understand what happened, or that the "bad flowspec rule" causing BGP sessions to repeatedly flap network-wide triggered software bugs on their routers. You've never seen rpd stuck at 100% CPU for hours, or an MX960 advertise history routes to external peers even after the internal session that had advertised the route to it has been cleared? To quote Zaphod Beeblebrox: "Listen, three eyes, don't you try to outweird me. I get stranger things than you free with my breakfast cereal." Kick a BGP implementation hard enough, and weird shit is likely to happen.
--
 Jon Lewis, MCP :)           |  I route
 StackPath, Sr. Neteng       |  therefore you are
_________ http://www.lewis.org/~jlewis/pgp for PGP public key_________
Re: [outages] Major Level3 (CenturyLink) Issues
On Wed, Sep 2, 2020 at 3:04 PM Vincent Bernat wrote: > > ❦ 2 September 2020 16:35 +03, Saku Ytti: > > >> I am not buying it. No normal implementation of BGP stays online, > >> replying to heart beat and accepting updates from ebgp peers, yet > >> after 5 hours failed to process withdrawal from customers. > > > > I can imagine writing BGP implementation like this > > > > a) own queue for keepalives, which i always serve first fully > > b) own queue for update, which i serve second > > c) own queue for withdraw, which i serve last > > Or maybe, graceful restart configured without a timeout on IPv4/IPv6? > The flowspec rule severed the BGP session abruptly, stale routes are > kept due to graceful restart (except flowspec rules), BGP sessions are > reestablished but the flowspec rule is handled before reaching > EoR and we loop from there. ... or all routes are fed into some magic route optimization box which is designed to keep things more stable and take advantage of cisco's "step-10" to suck more traffic, or The root issue here is that the *public* RFO is incomplete / unclear. Something something flowspec something, blocked flowspec, no more something does indeed explain that something bad happened, but not what caused the lack of withdraws / cascading churn. As with many interesting outages, I suspect that we will never get the full story, and "Something bad happened, we fixed it and now it's all better and will never happen ever again, trust us..." seems to be the new normal for public postmortems... W > -- > Make sure your code "does nothing" gracefully. > - The Elements of Programming Style (Kernighan & Plauger) -- I don't think the execution is relevant when it was obviously a bad idea in the first place. This is like putting rabid weasels in your pants, and later expressing regret at having chosen those particular rabid weasels and that pair of pants. ---maf
Re: [outages] Major Level3 (CenturyLink) Issues
❦ 2 September 2020 16:35 +03, Saku Ytti: >> I am not buying it. No normal implementation of BGP stays online, >> replying to heart beat and accepting updates from ebgp peers, yet >> after 5 hours failed to process withdrawal from customers. > > I can imagine writing BGP implementation like this > > a) own queue for keepalives, which i always serve first fully > b) own queue for update, which i serve second > c) own queue for withdraw, which i serve last Or maybe, graceful restart configured without a timeout on IPv4/IPv6? The flowspec rule severed the BGP session abruptly, stale routes are kept due to graceful restart (except flowspec rules), BGP sessions are re-established, but the flowspec rule is handled before reaching EoR and we loop from there. -- Make sure your code "does nothing" gracefully. - The Elements of Programming Style (Kernighan & Plauger)
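To make the loop Vincent hypothesises easier to follow, here is a rough, purely illustrative Python sketch of that sequence; the function and variable names are invented for the example and are not drawn from any vendor's code or from CenturyLink's RFO.

    STALE_TIMEOUT = None  # assumption: graceful restart configured with no stale-route timeout

    def session_cycle(rib, peer_routes, bad_rule_position):
        """One iteration of the hypothesised failure loop."""
        # 1. The bad flowspec rule severs the BGP session abruptly.
        # 2. Graceful restart keeps the learned routes as 'stale'; with no
        #    timeout configured, they never expire on their own.
        stale_routes = dict(rib)
        # 3. The session re-establishes and the peer replays its routes.
        for position, (prefix, attrs) in enumerate(peer_routes.items()):
            rib[prefix] = attrs
            if position == bad_rule_position:
                # 4. The same flowspec rule is processed again before End-of-RIB,
                #    so the stale routes are never flushed -- back to step 1.
                return stale_routes, "looping"
        return rib, "converged"  # EoR reached: stale leftovers would be flushed

Nothing here is specific to any router OS; it just shows why, under this reading, stale routes could persist for hours without a withdraw ever being processed.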
Re: Centurylink having a bad morning?
https://www.youtube.com/watch?v=vQ5MA685ApE On Wed 02 Sep 2020 20:40:35 GMT, Baldur Norddahl wrote: > That is what the 5G router is for... > > On Wed, 2 Sep 2020 at 19:47, Michael Hallgren wrote: > > > While conserving connectivity? 😂 > > > > > > -- > > *From:* Shawn L via NANOG > > *Sent:* Wednesday, 2 September 2020 13:15 > > *To:* nanog > > *Subject:* Re: Centurylink having a bad morning? > > > > We once moved a 3u server 30 miles between data centers this way. Plug > > redundant psu into a ups and 2 people carried it out and put them in a > > vehicle. > > > > > > Sent from my iPhone > > > > > On Sep 1, 2020, at 11:58 PM, Christopher Morrow > > wrote: > > > > > > On Tue, Sep 1, 2020 at 11:53 PM Alain Hebert > > wrote: > > >> > > >>As a coincidence... I was *thinking* of moving a 90TB SAN (with > > mechanical's) to another rack that way... skateboard, long fibers and long > > power cords =D > > >> > > > > > > well, what you REALLY need is one of these: > > > https://www.cru-inc.com/products/wiebetech/hotplug_field_kit_product/ > > > > > > and 2-3 UPS... swap to the UPS, then just roll the stack over, plug to > > > utility and done. (minus network transfer) > >
Re: AANP Akamai
netsupp...@akamai.com -- TTFN, patrick > On Sep 2, 2020, at 2:40 PM, ahmed.dala...@hrins.net wrote: > > Hello NANOG, > > Could somebody from Akamai AANP’s network team contact me off-list? I’ve > tried the peering and NOC and got no replies in months. > > Thanks > Ahmed
AANP Akamai
Hello NANOG, Could somebody from Akamai AANP’s network team contact me off-list? I’ve tried the peering and NOC and got no replies in months. Thanks Ahmed
Re: Centurylink having a bad morning?
That is what the 5G router is for... On Wed, 2 Sep 2020 at 19:47, Michael Hallgren wrote: > While conserving connectivity? 😂 > > > -- > *From:* Shawn L via NANOG > *Sent:* Wednesday, 2 September 2020 13:15 > *To:* nanog > *Subject:* Re: Centurylink having a bad morning? > > We once moved a 3u server 30 miles between data centers this way. Plug > redundant psu into a ups and 2 people carried it out and put them in a > vehicle. > > > Sent from my iPhone > > > On Sep 1, 2020, at 11:58 PM, Christopher Morrow > wrote: > > > > On Tue, Sep 1, 2020 at 11:53 PM Alain Hebert > wrote: > >> > >>As a coincidence... I was *thinking* of moving a 90TB SAN (with > mechanical's) to another rack that way... skateboard, long fibers and long > power cords =D > >> > > > > well, what you REALLY need is one of these: > > https://www.cru-inc.com/products/wiebetech/hotplug_field_kit_product/ > > > > and 2-3 UPS... swap to the UPS, then just roll the stack over, plug to > > utility and done. (minus network transfer) >
Re: Centurylink having a bad morning?
While conserving connectivity? 😂 From: Shawn L via NANOG Sent: Wednesday, 2 September 2020 13:15 To: nanog Subject: Re: Centurylink having a bad morning? We once moved a 3u server 30 miles between data centers this way. Plug redundant psu into a ups and 2 people carried it out and put them in a vehicle. Sent from my iPhone > On Sep 1, 2020, at 11:58 PM, Christopher Morrow > wrote: > > On Tue, Sep 1, 2020 at 11:53 PM Alain Hebert wrote: >> >> As a coincidence... I was *thinking* of moving a 90TB SAN (with >>mechanical's) to another rack that way... skateboard, long fibers and long >>power cords =D >> > > well, what you REALLY need is one of these: > https://www.cru-inc.com/products/wiebetech/hotplug_field_kit_product/ > > and 2-3 UPS... swap to the UPS, then just roll the stack over, plug to > utility and done. (minus network transfer)
Re: [outages] Major Level3 (CenturyLink) Issues
Detailed explanation can be found below. https://blog.thousandeyes.com/centurylink-level-3-outage-analysis/ From: NANOG on behalf of Baldur Norddahl Date: Wednesday, September 2, 2020 at 12:09 PM To: "nanog@nanog.org" Subject: Re: [outages] Major Level3 (CenturyLink) Issues I believe someone on this list reported that updates were also broken. They could not add prepending nor modify communities. Anyway I am not saying it cannot happen because clearly something did happen. I just don't believe it is a simple case of overload. There has to be more to it. On Wed, 2 Sep 2020 at 15:36, Saku Ytti wrote: On Wed, 2 Sep 2020 at 16:16, Baldur Norddahl wrote: > I am not buying it. No normal implementation of BGP stays online, replying to > heart beat and accepting updates from ebgp peers, yet after 5 hours failed to > process withdrawal from customers. I can imagine writing BGP implementation like this a) own queue for keepalives, which i always serve first fully b) own queue for update, which i serve second c) own queue for withdraw, which i serve last Why I might think this makes sense, is perhaps I just received from RR2 prefix I'm pulling from RR1, if I don't handle all my updates first, I'm causing outage that should not happen, because I already actually received the update telling I don't need to withdraw it. Is this the right way to do it? Maybe not, but it's easy to imagine why it might seem like a good idea. How well BGP works in common cases and how it works in pathologically scaled and busy cases are very different cases. I know that even in stable states commonly run vendors on commonly run hardware can take +2h to finish converging iBGP on initial turn-up. -- ++ytti
Re: [outages] Major Level3 (CenturyLink) Issues
I believe someone on this list reported that updates were also broken. They could not add prepending nor modify communities. Anyway I am not saying it cannot happen because clearly something did happen. I just don't believe it is a simple case of overload. There has to be more to it. On Wed, 2 Sep 2020 at 15:36, Saku Ytti wrote: > On Wed, 2 Sep 2020 at 16:16, Baldur Norddahl > wrote: > > > I am not buying it. No normal implementation of BGP stays online, > replying to heart beat and accepting updates from ebgp peers, yet after 5 > hours failed to process withdrawal from customers. > > I can imagine writing BGP implementation like this > > a) own queue for keepalives, which i always serve first fully > b) own queue for update, which i serve second > c) own queue for withdraw, which i serve last > > Why I might think this makes sense, is perhaps I just received from > RR2 prefix I'm pulling from RR1, if I don't handle all my updates > first, I'm causing outage that should not happen, because I already > actually received the update telling I don't need to withdraw it. > > Is this the right way to do it? Maybe not, but it's easy to imagine > why it might seem like a good idea. > > How well BGP works in common cases and how it works in pathologically > scaled and busy cases are very different cases. > > I know that even in stable states commonly run vendors on commonly run > hardware can take +2h to finish converging iBGP on initial turn-up. > > -- > ++ytti >
Re: [outages] Major Level3 (CenturyLink) Issues
> we don't form disaster response plans by saying "well, we could think > about what *could* happen for days, but we'll just wait for something > to occur". from an old talk of mine, if it was part of the “plan” it’s an “event,” if it is not then it’s a “disaster.”
Re: Centurylink having a bad morning?
On Wed, Sep 2, 2020 at 12:00 AM Christopher Morrow wrote: > > On Tue, Sep 1, 2020 at 11:53 PM Alain Hebert wrote: > > > > As a coincidence... I was *thinking* of moving a 90TB SAN (with > > mechanical's) to another rack that way... skateboard, long fibers and long > > power cords =D > > > > well, what you REALLY need is one of these: > https://www.cru-inc.com/products/wiebetech/hotplug_field_kit_product/ > Yeah, no... actually, hell no! That setup scares me, and I'm surprised that it can be sold at all, even with many warning labels and disclaimers... After the first time I saw it (I suspect also due to Chris!) I tried doing something similar -- I cut the ends off a power cord, attached alligator clips and moved a lamp from one outlet to another -- this was all on the same circuit (no UPS, no difference in potential, etc) and so it doesn't need anything to switch between supplies. I checked with a multimeter before making the connections (to triple check) that I had live and neutral correct, had an in-circuit GFCI, and was wearing rubber gloves. It *worked*, but having a plug with live, exposed pins is not something I want to repeat. On a related note - my wife once spent much time trying to explain to one of her clients why they cannot just plug the input of their power strip into the output of the same power strip, and get free 'lectricity... "But power comes out of the socket!!!", "Well, yes, but it has to get into the power strip", "Yah! That's why I plug the plug into it..."... I think that eventually she just demonstrated (again!) that it doesn't work, and then muttered something about "Magic"... W > and 2-3 UPS... swap to the UPS, then just roll the stack over, plug to > utility and done. (minus network transfer) -- I don't think the execution is relevant when it was obviously a bad idea in the first place. This is like putting rabid weasels in your pants, and later expressing regret at having chosen those particular rabid weasels and that pair of pants. ---maf
Re: [outages] Major Level3 (CenturyLink) Issues
Sure. But being good engineers, we love to exercise our brains by thinking about possibilities and probabilities. For example, we don't form disaster response plans by saying "well, we could think about what *could* happen for days, but we'll just wait for something to occur". -A On Wed, Sep 2, 2020 at 8:51 AM Randy Bush wrote: > creative engineers can conjecturbate for days on how some turtle in the > pond might write code what did not withdraw for a month, or other > delightful reasons CL might have had this really really bad behavior. > > the point is that the actual symptoms and cause really really should be > in the RFO > > randy >
Re: [outages] Major Level3 (CenturyLink) Issues
creative engineers can conjecturbate for days on how some turtle in the pond might write code what did not withdraw for a month, or other delightful reasons CL might have had this really really bad behavior. the point is that the actual symptoms and cause really really should be in the RFO randy
Re: [outages] Major Level3 (CenturyLink) Issues
Cisco had a bug a few years back that affected metro switches such that they would not withdraw routes upstream. We had an internal outage and one of my carriers kept advertising our prefixes even though we withdrew the routes. We tried downing the neighbor and even shutting down the physical interface to no avail. The carrier kept blackholing us until they shut down on their metro switch.
Re: [outages] Major Level3 (CenturyLink) Issues
Yeah. This actually would be a fascinating study to understand exactly what happened. The volume of BGP messages flying around because of the session churn must have been absolutely massive, especially in a complex internal infrastructure like 3356 has. I would say the scale of such an event has to be many orders of magnitude beyond what anyone ever designed for, so it doesn't shock me at all that unexpected behavior occurred. But that's why we're engineers; we want to understand such things. On Wed, Sep 2, 2020 at 9:37 AM Saku Ytti wrote: > On Wed, 2 Sep 2020 at 16:16, Baldur Norddahl > wrote: > > > I am not buying it. No normal implementation of BGP stays online, > replying to heart beat and accepting updates from ebgp peers, yet after 5 > hours failed to process withdrawal from customers. > > I can imagine writing BGP implementation like this > > a) own queue for keepalives, which i always serve first fully > b) own queue for update, which i serve second > c) own queue for withdraw, which i serve last > > Why I might think this makes sense, is perhaps I just received from > RR2 prefix I'm pulling from RR1, if I don't handle all my updates > first, I'm causing outage that should not happen, because I already > actually received the update telling I don't need to withdraw it. > > Is this the right way to do it? Maybe not, but it's easy to imagine > why it might seem like a good idea. > > How well BGP works in common cases and how it works in pathologically > scaled and busy cases are very different cases. > > I know that even in stable states commonly run vendors on commonly run > hardware can take +2h to finish converging iBGP on initial turn-up. > > -- > ++ytti >
Re: Centurylink having a bad morning?
If the client pays me a shit ton of money to make sure the server won't turn off, and they pay for the hardware to make it happen, I'd think about it. It's like a colo move on hard mode. It's extremely stupid, and I would advise not doing it. Hell, even when I migrated an e911 server, we had a 20-minute outage to move the physical server. If that server can't be shut off, something was built wrong. On Wed, Sep 2, 2020 at 9:33 AM Bryan Holloway wrote: > > On 9/2/20 1:49 PM, Nick Hilliard wrote: > > Shawn L via NANOG wrote on 02/09/2020 12:15: > >> We once moved a 3u server 30 miles between data centers this way. > >> Plug redundant psu into a ups and 2 people carried it out and put > >> them in a vehicle. > > > > hopefully none of these server moves that people have been talking about > > involved spinning disks. If they did, kit damage is one of the likely > > outcomes - you seriously do not want to bump active spindles: > > > > www.google.com/search?q=disk+platter+damage&tbm=isch > > > > SSDs are a different story. In that case it's just a bit odd as to why > > you wouldn't want to power down a system to physically move it - in the > > sense that if your service delivery model can't withstand periodic > > maintenance and loss of availability of individual components, > > rethinking the model might be productive. > > > > Nick > > > > If it's your server, moving beyond (very) local facilities, and time is > not of the essence, then sure: power down. > > If you're law-enforcement mid-raid, or trying to preserve your Frogger > high-score, well, ... > -- Sincerely, Jason W Kuehl Cell 920-419-8983 jason.w.ku...@gmail.com
Re: [outages] Major Level3 (CenturyLink) Issues
On Wed, 2 Sep 2020 at 16:16, Baldur Norddahl wrote: > I am not buying it. No normal implementation of BGP stays online, replying to > heart beat and accepting updates from ebgp peers, yet after 5 hours failed to > process withdrawal from customers. I can imagine writing a BGP implementation like this: a) own queue for keepalives, which I always serve first, fully b) own queue for updates, which I serve second c) own queue for withdraws, which I serve last Why I might think this makes sense: perhaps I just received from RR2 a prefix I'm pulling from RR1; if I don't handle all my updates first, I'm causing an outage that should not happen, because I already actually received the update telling me I don't need to withdraw it. Is this the right way to do it? Maybe not, but it's easy to imagine why it might seem like a good idea. How well BGP works in common cases and how it works in pathologically scaled and busy cases are very different things. I know that even in stable states, commonly run vendors on commonly run hardware can take +2h to finish converging iBGP on initial turn-up. -- ++ytti
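Purely to illustrate the scheduling idea Saku sketches above (this is a toy model, not any real BGP implementation; all names are invented), a minimal Python sketch of the three queues with strict priority:

    import collections

    class ToyBgpSpeaker:
        """Toy model of the prioritisation described above: keepalives are
        always drained first, then updates, then withdraws."""

        def __init__(self):
            self.keepalives = collections.deque()
            self.updates = collections.deque()
            self.withdraws = collections.deque()
            self.rib = {}   # prefix -> attributes

        def enqueue(self, kind, payload):
            {"keepalive": self.keepalives,
             "update": self.updates,
             "withdraw": self.withdraws}[kind].append(payload)

        def run_cycle(self, budget):
            """Process at most `budget` messages, strictly in priority order."""
            for queue, handle in ((self.keepalives, self._keepalive),
                                  (self.updates, self._update),
                                  (self.withdraws, self._withdraw)):
                while queue and budget:
                    handle(queue.popleft())
                    budget -= 1

        def _keepalive(self, peer):
            pass                      # refresh the peer's hold timer

        def _update(self, msg):
            prefix, attrs = msg
            self.rib[prefix] = attrs  # may make a queued withdraw unnecessary

        def _withdraw(self, prefix):
            self.rib.pop(prefix, None)

If keepalives and updates arrive faster than each cycle's budget, run_cycle never reaches the withdraw queue: the session looks healthy from the outside while withdraws sit unprocessed for hours, which is roughly the failure mode being debated in this thread.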
Re: Centurylink having a bad morning?
On 9/2/20 1:49 PM, Nick Hilliard wrote: Shawn L via NANOG wrote on 02/09/2020 12:15: We once moved a 3u server 30 miles between data centers this way. Plug redundant psu into a ups and 2 people carried it out and put them in a vehicle. hopefully none of these server moves that people have been talking about involved spinning disks. If they did, kit damage is one of the likely outcomes - you seriously do not want to bump active spindles: www.google.com/search?q=disk+platter+damage&tbm=isch SSDs are a different story. In that case it's just a bit odd as to why you wouldn't want to power down a system to physically move it - in the sense that if your service delivery model can't withstand periodic maintenance and loss of availability of individual components, rethinking the model might be productive. Nick If it's your server, moving beyond (very) local facilities, and time is not of the essence, then sure: power down. If you're law-enforcement mid-raid, or trying to preserve your Frogger high-score, well, ...
Re: [outages] Major Level3 (CenturyLink) Issues
I am not buying it. No normal implementation of BGP stays online, replying to heartbeats and accepting updates from eBGP peers, yet after 5 hours failed to process withdrawals from customers. On Wed, 2 Sep 2020 at 14:11, Saku Ytti wrote: > On Wed, 2 Sep 2020 at 14:40, Mike Hammett wrote: > > > Sure, but I don't care how busy your router is, it shouldn't take hours > to withdraw routes. > > Quite, discussion is less about how we feel about it and more about > why it happens and what could be done to it. > > -- > ++ytti >
Re: [outages] Major Level3 (CenturyLink) Issues
On Wed, 2 Sep 2020 at 14:40, Mike Hammett wrote: > Sure, but I don't care how busy your router is, it shouldn't take hours to > withdraw routes. Quite, discussion is less about how we feel about it and more about why it happens and what could be done to it. -- ++ytti
Re: Centurylink having a bad morning?
Shawn L via NANOG wrote on 02/09/2020 12:15: We once moved a 3u server 30 miles between data centers this way. Plug redundant psu into a ups and 2 people carried it out and put them in a vehicle. hopefully none of these server moves that people have been talking about involved spinning disks. If they did, kit damage is one of the likely outcomes - you seriously do not want to bump active spindles: www.google.com/search?q=disk+platter+damage&tbm=isch SSDs are a different story. In that case it's just a bit odd as to why you wouldn't want to power down a system to physically move it - in the sense that if your service delivery model can't withstand periodic maintenance and loss of availability of individual components, rethinking the model might be productive. Nick
Re: [outages] Major Level3 (CenturyLink) Issues
Sure, but I don't care how busy your router is, it shouldn't take hours to withdraw routes. - Mike Hammett Intelligent Computing Solutions http://www.ics-il.com Midwest-IX http://www.midwest-ix.com - Original Message - From: "Saku Ytti" To: "Martijn Schmidt" Cc: "Outages" , "North American Network Operators' Group" Sent: Wednesday, September 2, 2020 2:15:46 AM Subject: Re: [outages] Major Level3 (CenturyLink) Issues On Wed, 2 Sep 2020 at 10:00, Martijn Schmidt via NANOG wrote: > I suppose now would be a good time for everyone to re-open their Centurylink > ticket and ask why the RFO doesn't address the most important defect, e.g. > the inability to withdraw announcements even by shutting down the session? The more work the BGP process has the longer it takes to complete that work. You could try in your RFP/RFQ if some provider will commit on specific convergence time, which would improve your position contractually and might make you eligible for some compensations or termination of contract, but realistically every operator can run into a situation where you will see what most would agree pathologically long convergence times. The more BGP sessions, more RIB entries the higher the probability that these issues manifest. Perhaps protocol level work can be justified as well. BGP doesn't have concept of initial convergence, if you have lot of peers, your initial convergence contains massive amount of useless work, because you keep changing best route, while you keep receiving new best routes, the higher the scale the more useless work you do and the longer stability you require to eventually ~converge. Practical devices operators run may require hours during _normal operation_ to do initial converge. RFC7313 might show us way to reduce amount of useless work. You might want to add signal that initial convergence is done, you might want to add signal that no installation or best path algo happens until all route are loaded, this would massively improve scaled convergence as you wouldn't do that throwaway work, which ultimately inflates your work queue and pushes your useful work far to the future. The main thing as a customer I would ask, how can we fix it faster than 5h in future. Did we lose access to control-plane? Could we reasonably avoid losing it? -- ++ytti
Re: Centurylink having a bad morning?
We once moved a 3u server 30 miles between data centers this way. Plug redundant psu into a ups and 2 people carried it out and put them in a vehicle. Sent from my iPhone > On Sep 1, 2020, at 11:58 PM, Christopher Morrow > wrote: > > On Tue, Sep 1, 2020 at 11:53 PM Alain Hebert wrote: >> >>As a coincidence... I was *thinking* of moving a 90TB SAN (with >> mechanical's) to another rack that way... skateboard, long fibers and long >> power cords =D >> > > well, what you REALLY need is one of these: > https://www.cru-inc.com/products/wiebetech/hotplug_field_kit_product/ > > and 2-3 UPS... swap to the UPS, then just roll the stack over, plug to > utility and done. (minus network transfer)
Re: [outages] Major Level3 (CenturyLink) Issues
On Wed, 2 Sep 2020 at 12:50, Vincent Bernat wrote: > It seems BIRD contains an implementation for RFC7313. From the source > code, it delays removal of stale route until EoRR, but it doesn't seem > to delay the work on updating the kernel. Juniper doesn't seem to > implement it. Cisco seems to implement it, but only on refresh, not on > the initial connection. Is there some survey around this RFC? Correct, it doesn't do anything for the initial connection, but I took it as an example of how we might approach the problem of initial convergence cost in scaled environments. -- ++ytti
Re: [outages] Major Level3 (CenturyLink) Issues
❦ 2 September 2020 10:15 +03, Saku Ytti: > RFC7313 might show us way to reduce amount of useless work. You might > want to add signal that initial convergence is done, you might want to > add signal that no installation or best path algo happens until all > route are loaded, this would massively improve scaled convergence as > you wouldn't do that throwaway work, which ultimately inflates your > work queue and pushes your useful work far to the future. It seems BIRD contains an implementation of RFC7313. From the source code, it delays removal of stale routes until EoRR, but it doesn't seem to delay the work on updating the kernel. Juniper doesn't seem to implement it. Cisco seems to implement it, but only on refresh, not on the initial connection. Is there some survey around this RFC? -- Don't patch bad code - rewrite it. - The Elements of Programming Style (Kernighan & Plauger)
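As a rough paraphrase of the behaviour Vincent describes (this is not BIRD's actual code; the function and field names are invented for illustration), the RFC 7313 idea of holding a peer's stale routes until End-of-Route-Refresh looks something like this Python sketch:

    def enhanced_route_refresh(rib, peer, replayed_routes):
        """RFC 7313-style refresh, roughly: mark this peer's routes stale,
        accept the replayed routes, and only delete the leftovers once the
        End-of-Route-Refresh (EoRR) marker has arrived."""
        stale = {prefix for prefix, route in rib.items() if route["peer"] == peer}

        for prefix, attrs in replayed_routes:        # routes re-sent by the peer
            rib[prefix] = {"peer": peer, "attrs": attrs}
            stale.discard(prefix)

        # EoRR received: anything still marked stale was genuinely withdrawn.
        for prefix in stale:
            del rib[prefix]
        return rib

Deferring the deletions avoids transiently dropping routes mid-refresh, at the cost of holding possibly-withdrawn routes until the EoRR marker arrives.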
Re: [outages] Major Level3 (CenturyLink) Issues
On Wed, 2 Sep 2020 at 10:00, Martijn Schmidt via NANOG wrote: > I suppose now would be a good time for everyone to re-open their Centurylink > ticket and ask why the RFO doesn't address the most important defect, e.g. > the inability to withdraw announcements even by shutting down the session? The more work the BGP process has, the longer it takes to complete that work. You could try in your RFP/RFQ to get a provider to commit to a specific convergence time, which would improve your position contractually and might make you eligible for some compensation or for termination of the contract, but realistically every operator can run into a situation where you will see what most would agree are pathologically long convergence times. The more BGP sessions and the more RIB entries, the higher the probability that these issues manifest. Perhaps protocol-level work can be justified as well. BGP doesn't have a concept of initial convergence: if you have a lot of peers, your initial convergence contains a massive amount of useless work, because you keep changing the best route while you keep receiving new best routes; the higher the scale, the more useless work you do and the longer the stability you require to eventually ~converge. Practical devices operators run may require hours during _normal operation_ to complete initial convergence. RFC7313 might show us a way to reduce the amount of useless work. You might want to add a signal that initial convergence is done; you might want to add a signal so that no installation or best-path algorithm runs until all routes are loaded. This would massively improve scaled convergence, as you wouldn't do that throwaway work, which ultimately inflates your work queue and pushes your useful work far into the future. The main thing I would ask as a customer is: how can we fix this faster than 5h in the future? Did we lose access to the control plane? Could we reasonably have avoided losing it? -- ++ytti
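To put a number on the "throwaway work" argument above, here is a small, hedged comparison in Python; the route data and counters are invented for illustration, but the contrast between running best-path selection per update and batching until an end-of-initial-convergence signal is the point being made:

    def converge_per_update(announcements):
        """Naive initial convergence: rerun best-path selection and install
        after every single announcement (throwaway work at scale)."""
        best, installs = {}, 0
        for prefix, peer, pref in announcements:      # (prefix, peer, preference)
            if prefix not in best or pref > best[prefix][1]:
                best[prefix] = (peer, pref)
                installs += 1                         # each install may soon be superseded
        return best, installs

    def converge_after_signal(announcements):
        """Batch everything until an 'initial convergence done' signal, then
        pick best paths once and install each prefix exactly once."""
        candidates = {}
        for prefix, peer, pref in announcements:
            candidates.setdefault(prefix, []).append((peer, pref))
        best = {p: max(paths, key=lambda t: t[1]) for p, paths in candidates.items()}
        return best, len(best)

With a full table and many peers, the first approach can perform up to one install per prefix per peer, while the second does one per prefix, which is roughly why large iBGP turn-ups can take hours to settle.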