On Mon, 27 Jan 2020 at 00:18, Robert Raszuk <rob...@raszuk.net> wrote:
> The other one is actually of keeping your network running. Imagine router
> maintaining entire control plane perfectly fine, imagine BFD working fine to
> the box from peers but dropping between line cards via fabric from 20% to 80%
> traffic. Unfortunately this is not a theory but real world :(
>
> Without proper automation in place going way above basic IGP, BGP, LDP, BFD
> etc ... you need a bit of clever automation to detect it and either alarm noc
> or if they are really smart take such router out of the SPF network wide. If
> not you sit and wait till pissed customers call - which is already a failure.

Automation and monitoring are, to me, very different subjects.

Everyone has war stories about those long-tail problems, when something
utterly weird is happening in the network, and about how hard it was to
find. But this particular example is fairly easy: either you are polling a
drop counter that shows the drops, or the delta between packets in and
packets out plus drops is off.

But there will always be a massive number of long-tail risks which your
NMS won't know about; things break in very creative and complex ways. And
you can monitor these very carefully. You can screenscrape all NPU counters
and find that your network is behaving suboptimally _right now_: you see
NPU exceptions/trapstats increasing which should not be. You can then spend
months figuring out one issue out of the hundred you have, all of which are
real issues, but which might affect one packet in a billion. Is it worth
knowing about these?

We are screenscraping and graphing all NPU counters, as these are typically
not available in the GUI; in the case of JunOS they are not even modelled,
because they are PFE counters. We rarely tend to them proactively, because
fixing them causes more outages than letting them be. But often, when
strange issues that customers care about do happen at scale, these counters
reduce MTTR.

So if you think you don't have active issues, you're not monitoring well
enough.
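To make the "packets in versus packets out plus drops" check concrete,
here is a rough Python sketch. Everything in it is an assumption for
illustration: the CounterSample fields, the function names and the 1%
tolerance are hypothetical, the counters are taken as box-wide totals from
two consecutive polls, and counter wraps, resets, locally terminated or
originated traffic and multicast replication are all ignored.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CounterSample:
    """One poll of box-wide totals (hypothetical field names)."""
    packets_in: int
    packets_out: int
    drops: int


def unaccounted_ratio(prev: CounterSample, curr: CounterSample) -> float:
    """Fraction of packets received between two polls that were neither
    sent out nor counted as dropped.

    Assumes the counters did not wrap or reset between the two polls.
    """
    d_in = curr.packets_in - prev.packets_in
    d_out = curr.packets_out - prev.packets_out
    d_drop = curr.drops - prev.drops
    if d_in <= 0:
        # Nothing received in this interval, so there is nothing to compare.
        return 0.0
    return (d_in - (d_out + d_drop)) / d_in


def looks_like_silent_loss(prev: CounterSample, curr: CounterSample,
                           tolerance: float = 0.01) -> bool:
    """True when more than `tolerance` of the input is unaccounted for."""
    return unaccounted_ratio(prev, curr) > tolerance


# Example polls: the second interval loses packets without counting drops.
baseline = CounterSample(packets_in=1000, packets_out=990, drops=10)
healthy = CounterSample(packets_in=2000, packets_out=1980, drops=20)
lossy = CounterSample(packets_in=2000, packets_out=1500, drops=20)
```

With these example numbers, looks_like_silent_loss(baseline, healthy) is
False and looks_like_silent_loss(baseline, lossy) is True. A real check
would need a tolerance tuned per platform, since some mismatch between the
counters is normal.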
When you do monitor well enough you have to decide which issues to fix and
which to let be.

-- 
  ++ytti
_______________________________________________
juniper-nsp mailing list juniper-nsp@puck.nether.net
https://puck.nether.net/mailman/listinfo/juniper-nsp