On Apr 13, 2011, at 1:27 PM, Chris Evans wrote:

> Question to you all...
> 
> It seems like a lot of folks run bleeding-edge code with some of these major
> bugs popping up. I also get the impression that a lot of shops don't test
> code before they deploy.
> 
> I'm just curious how this works for you. In my company we would get
> seriously reprimanded for deploying software that is not tested, and any time
> we have outages we have to jump through big hoops to explain why, how to fix
> it, etc., so we do the best we can to deploy architectures/platforms/code that
> won't have issues.
> 
> I couldn't imagine being bleeding edge in a service provider environment;
> it's just a concept I can't fathom, being in the environment I'm in.
> 
> Looking for input...

This is something that requires a delicate balance.  While one could spend 
millions of dollars to test every possible thing, it's not practical to do so 
for each and every release.  The hardware required to replicate these 
environments gets quite expensive, as does the test gear and other equipment 
necessary to "pull it off".

I've always believed in a "system test" vs. a "unit test" (UUT/unit under test 
is what you will see the vendors call it).  You need to prove out the entire 
thing, rather than the isolated void each feature regularly operates in.  There 
are some bugs you just won't see unless you have a link flap on a 4x bundle 
while one side is doing LACP fast mode, etc.
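
To make that concrete, here's a rough sketch of the kind of scripted system 
check I mean, driving a lab box with netmiko.  The hostname, credentials, and 
interface/bundle names are placeholders for illustration, not anything from a 
real network:

#!/usr/bin/env python3
# Sketch: flap one member of a 4x LACP bundle and confirm the aggregate
# itself survives.  All device details below are placeholders.
import time
from netmiko import ConnectHandler

dev = ConnectHandler(
    device_type="juniper_junos",
    host="lab-router.example.net",   # placeholder lab router
    username="labuser",
    password="labpass",
)

# Take one member link down, wait, then bring it back up.
dev.send_config_set(["set interfaces xe-0/0/1 disable"])
dev.commit()
time.sleep(10)
dev.send_config_set(["delete interfaces xe-0/0/1 disable"])
dev.commit()
time.sleep(10)

# The system-level question: did the bundle stay up through the flap?
print(dev.send_command("show lacp interfaces ae0"))
dev.disconnect()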

Testing all these cases can be problematic, if not impossible.  What you want 
to do is get a nice baseline going that simulates your real network as closely 
as possible, then work the new code against it.  Simulate many iterations of 
what happens (e.g.: if rancid logs in once an hour, make it log in in a loop).  
We found a bug that only existed on one device due to rancid logging in each 
hour from a host with a specific latency and IP stack.  It impacted only *one* 
device, but caused a kernel core.
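
A dumb login loop like the sketch below is roughly what I mean; it approximates 
weeks of once-an-hour rancid polls compressed in time.  The host, credentials, 
and command here are placeholders, and paramiko is just one way to do it:

#!/usr/bin/env python3
# Sketch: hammer a router with repeated logins the way rancid would over
# weeks, compressed into a loop.  Host/credentials are placeholders.
import time
import paramiko

for i in range(1000):
    client = paramiko.SSHClient()
    client.set_missing_host_key_policy(paramiko.AutoAddPolicy())
    client.connect("lab-router.example.net", username="rancid",
                   password="labpass")
    stdin, stdout, stderr = client.exec_command("show version")
    stdout.read()      # drain the output, as a collector would
    client.close()
    print(f"login iteration {i} complete")
    time.sleep(5)      # deliberately much faster than once an hour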

This is not something you can easily simulate on top of all the other features 
you want to test.  Many of the things one can do start to create mutually 
exclusive test environments.  Imagine being a tester and you have to test 
"BGP".  What does that mean?  All route reflectors, or a full mesh?  What size? 
 How many clusters?  What about BGP confederations?  2-byte or 4-byte ASNs and 
as-paths?  IPv4 and IPv6 NLRI over the same transport session, or an IPv4 
session for those routes and a native IPv6 session for that NLRI?

I'm sure you can start to see how these create a mutually exclusive set of 
choices, just within the scope of BGP, even before you get to the routing 
policy you want to test on top of that.
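
Just to illustrate how fast the matrix grows, a toy enumeration over a few of 
those dimensions (the axes and values here are my own picks, not an exhaustive 
list) already produces a couple dozen distinct lab setups, most of which can't 
coexist on one test bed:

#!/usr/bin/env python3
# Sketch: count the distinct "test BGP" lab setups that fall out of just a
# handful of dimensions.  The axes/values are illustrative, not exhaustive.
from itertools import product

dimensions = {
    "topology":  ["route-reflectors", "full-mesh", "confederations"],
    "asn-width": ["2-byte", "4-byte"],
    "transport": ["v4+v6 NLRI over one v4 session",
                  "v4 NLRI over v4 session, v6 NLRI over v6 session"],
    "scale":     ["small", "production-sized"],
}

combos = list(product(*dimensions.values()))
print(f"{len(combos)} distinct lab setups before any routing policy:")
for combo in combos:
    print("  " + " / ".join(combo))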

The best test I've found is dogfooding: put your office LAN behind the test 
router, or at least your lab.  This way, if it breaks, you will be compelled to 
research it.  Some people will argue against this because their "business 
critical" stuff can't be impacted by testing, but your customers could hit the 
same problem as well.  (Also, try to load up your test router with as many of 
the real-world variations you have that can co-exist, be it l2vpn, ipv6, mpls, 
rsvp, pim, etc.  Don't deviate because you think something isn't related, 
unless you're deliberately in "isolation" mode.)

I would always load the code on a test device, then on a device that *my* 
connection was on.

I certainly don't want an outage any more than any customer does, so be 
understanding when they do happen.  We can push the vendors for fixes, but 
sometimes only so hard, and some cases, while we may hit them often, are very 
difficult to reproduce.  The developers have also become insulated from the 
"real world" in many cases, hard to reach through a TAC or otherwise.  This is 
meant to protect them, but it also makes things hard, as we don't open cases 
just to "cry wolf" either.

Either way, testing needs to be a true partnership between you and your vendor. 
 Don't take hardware from them if you are not going to participate.  Don't yell 
and scream when test code is broken, but try to understand and improve the 
process.  Sometimes yelling is necessary, but I've found it's rarely 
productive.  If the vendor refuses to understand the severity of your 
environment, escalate.  The head of JTAC is a good guy; he's trying to do the 
right thing.  The same is true for Cisco TAC, and whenever I've talked to the 
managers, etc. involved, they have always tried to help.  Explain your 
constraints and why their solutions are unacceptable, yet be reasonable at the 
same time.

I do wish that places like Cisco and Juniper had a more open beta/feedback 
process, as some other vendors do (e.g. UBNT).  Registering for their program 
is easy, and while they deliver far more "stillborn" code than C/J ever have, 
they are highly responsive to reports and engage with the feedback.

Hopefully this makes sense and helps you understand why it's hard on both sides.

- Jared