Re: [j-nsp] Segment Routing Real World Deployment (was: VPC mc-lag)

Alexandre Guimaraes Sun, 08 Jul 2018 15:17:31 -0700

Adam,

Important observation. I prefer to keep my PWs working even when many segments
of the network are affected by fiber cuts and so on...


Since I migrated my BGP VPLS services to l2circuits, my problems today are
almost zero.

No matter what happens, the business order for everyone is to keep everything
running 24/7/365 with zero downtime.... Planned maintenance doesn’t count,
since it is planned.

VPLS services, as I said before, caused two outages in one year due to L2 loops
caused by the operations team. After hours with no progress in finding the
origin of the loop, I was called (escalated) to solve the problem.

That is what I mean when I talk about my experience: uptime, availability,
quality of service and so on....

I was a Cisco CCxx for many years, with eyes for one vendor only.... even when
this vendor caused downtime, it was downtime “with a brand”! My Cisco env goes
down!!? Oh yes, it’s a Cisco!!! Am I OK with that? Not anymore! I want peace,
happy customers, and to sell more.

With the time that I have today, I can study new tech, run some lab tests, and
ask different vendors for this or that.

Today, I can sleep well without the fear that someone will loop something or
that some equipment will crash due to CPU/memory problems.

And yes, I am a Network Warrior! But now.... a tech warrior. Like Call of Duty:
Infinite Warfare!

:)

Regards,
Alexandre

On 8 Jul 2018, at 17:58, "adamv0...@netconsultings.com"
<adamv0...@netconsultings.com> wrote:

>> From: James Bensley [mailto:jwbens...@gmail.com]
>> Sent: Friday, July 06, 2018 2:04 PM
>> 
>> 
>> 
>> On 5 July 2018 09:56:40 BST, adamv0...@netconsultings.com wrote:
>>>> Of James Bensley
>>>> Sent: Thursday, July 05, 2018 9:15 AM
>>>> 
>>>> - 100% rLFA coverage: TI-LFA covers the "black spots" we currently
>>>> have.
>>>> 
>>> Yeah that's an interesting use case you mentioned, that I haven't
>>> considered, that is no TE need but FRR need.
>>> But I guess if it was business critical to get those blind spots
>>> FRR-protected then you would have done something about it already
>>> right?
>> 
>> Hi Adam,
>> 
>> Yeah correct, no mission critical services are affected by this for us, so
>> the business obviously hasn't allocated resource to do anything about it.
>> If it was a major issue, it should be as simple as adding an extra backhaul
>> link to a node or shifting existing ones around (to reshape the P space and
>> Q space to "please" the FRR algorithm).
>> 
>>> So I guess it's more like it would be nice to have, now is it enough
>>> to expose the business to additional risk?
>>> Like for instance yes you'd test the feature to death to make sure it
>>> works under any circumstances (it's the very heart of the network after
>>> all, if that breaks everything breaks), but the problem I see is then
>>> going to a next release a couple of years later - since SR is a new thing
>>> it would have a ton of new stuff added to it by then, resulting in
>>> higher potential for regression bugs in comparison to LDP or RSVP,
>>> which have been around forever and where every new release is basically
>>> just bug fixes.
>> 
>> Good point, I think it's worth breaking that down into two separate
>> points/concerns:
>> 
>> Initial deployment bugs:
>> We've done stuff like pay for a CPoC with Cisco, then deployed, then had it
>> all blow up, then paid Cisco AS to assess the situation only to be told it's
>> not a good design :D So we just assume a default/safe view now that no
>> amount of testing will protect us. We ensure we have backout plans if
>> something immediately blows up, and heightened reporting for issues that
>> take 72 hours to show up, and change freezes to cover issues that take a
>> week to show up etc. etc. So I think as far as an initial SR deployment
>> goes, all we can do is our best with regards to being cautious, just as we
>> would with any major core changes. So I don't see the initial deployment as
>> any more risky than other core projects we've undertaken like changing
>> vendors, entire chassis replacements, code upgrades between major versions
>> etc.
>> 
>> Regression bugs:
>> My opinion is that in the case of something like SR, which is being deployed
>> based on early drafts, regression bugs are potentially a bigger issue than an
>> initial deployment. I hadn't considered this. Again though, I think it's
>> something we can reasonably prepare for. Depending on the potential
>> impact to the business you could go as far as standing up a new chassis next
>> to an existing one, but on the newer code version, run them in parallel,
>> migrating services over slowly, and keep the old one up for a while before
>> you take it down. You could just do something as simple as physically
>> replacing the routing engine, keeping the old one on site for a bit so you
>> can quickly swap back. Or just drain the links in the IGP, downgrade the
>> code, and then un-drain the links, if you've got some single homed services
>> on there. If you have OOB access and plan all the rollback config in
>> advance, we can operationally support the risks, no differently to any
>> other major core change.
>> 
>> Probably the hardest part is assessing what the risk actually is, and knowing
>> what level of additional support, monitoring, and people you will need. If you
>> under-resource a rollback of a major failure, and fuck up the rollback too, you
>> might need some new pants :)
>> 
> Well yes, I suppose one could actually look at it as any other major
> project, like an upgrade to a new SW release, or a migration from LDP to
> RSVP-TE, or adding a second plane - or all 3 together.
> And apart from the tedious and rigorous testing (god there's got to be a
> better way of doing SW validation testing) you made me think about scoping
> the fallback and contingency options in case things don't work out.
> These huge projects are always carried out in a number of stages, each broken
> down into several individual steps; all this is to ease the deployment but
> also to scope the fallout in case things go south.
> Like in migrations from LDP to RSVP you go intra-pop first, then inter-pop
> between a pair of POPs and so on, using small incremental steps, and all this
> time the fallback option is the good old LDP, maybe even well after the
> project is done, until the operational confidence is high enough or till the
> next code upgrade. And I think a similar approach can be used to de-risk an
> SR rollout.
> 
> 
> adam   
> 
> netconsultings.com
> ::carrier-class solutions for the telecommunications industry::
> 
> 
> _______________________________________________
> juniper-nsp mailing list juniper-nsp@puck.nether.net
> https://puck.nether.net/mailman/listinfo/juniper-nsp
