Re: Do you care about "gray" failures? Can we (network academics) help? A 10-min survey
> One method is collecting lookup exceptions. We scrape these:
>
> npu_triton_trapstats.py:command = "start shell sh command \"for
> fpc in $(cli -c 'show chassis fpc' | grep Online | awk '{print $1;}');
> do echo FPC$fpc; vty -c 'show cda trapstats' fpc$fpc; done\""
> ptx1k_trapstats.py:command = "start shell sh command \"for fpc in
> $(cli -c 'show chassis fpc' | grep Online | awk '{print $1;}'); do
> echo FPC$fpc; vty -c 'show pechip trapstats' fpc$fpc; done\""
> asr9k_npu_counters.py:command = "show controllers np counters all"
> junos_trio_exceptions.py:command = "show pfe statistics exceptions"
>
> No need for ML or AI, as trivial algorithms like 'which counter is
> incrementing here but isn't incrementing elsewhere' or 'which counter
> is not incrementing here but is incrementing elsewhere' surface a lot
> of real problems, and capturing those exceptions and reviewing them
> confirms it.
>
> We do not use these to proactively find problems, as that would yield
> poorer overall availability. But we regularly use them to expedite
> time to resolution.

Thanks for sharing! I guess the fact that this process works means the
counters are "standard" / close enough across vendors to allow for
comparisons?

> Very recently we had Tomahawk (EZchip) reset the whole linecard, and
> looking at the counters to identify one which was incrementing but
> likely should not have been yielded the problem. A customer was
> sending us IP packets where the ethernet header and the IP header up
> to the total-length field were missing on the wire. This accidentally
> fuzzed the NPU ucode, periodically triggering an NPU bug which causes
> a total LC reload when it happens often enough.
>
>>> Networks also routinely mangle packets in-memory in ways that are
>>> not visible to the FCS check.
>>
>> Added to the list... Thanks!
>
> The only way I know to try to find these memory corruptions is to
> look at the egress PE device's backbone-facing interface and see if
> there are IP checksum errors.
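(Coming back to the trapstats heuristic above: for concreteness, here
is roughly how I picture the "incrementing here but not elsewhere"
check. This is just my sketch, not your tooling; the per-FPC snapshot
format is made up, and parsing the vty output is elided.)

    # Sketch of the counter-diff heuristic: flag counters whose deltas
    # deviate from the rest of the linecards. Assumed snapshot format:
    #   {"FPC0": {"counter_name": value, ...}, "FPC1": {...}, ...}

    prev_snap = {
        "FPC0": {"ucode_trap": 10, "ttl_expired": 500},
        "FPC1": {"ucode_trap": 10, "ttl_expired": 480},
        "FPC2": {"ucode_trap": 10, "ttl_expired": 510},
    }
    curr_snap = {
        "FPC0": {"ucode_trap": 10, "ttl_expired": 600},
        "FPC1": {"ucode_trap": 57, "ttl_expired": 590},  # odd one out
        "FPC2": {"ucode_trap": 10, "ttl_expired": 620},
    }

    def deltas(prev, curr):
        """Per-FPC counter increments between two scrape rounds."""
        return {fpc: {c: curr[fpc][c] - prev[fpc].get(c, 0)
                      for c in curr[fpc]}
                for fpc in curr}

    def outliers(delta):
        """Counters moving on exactly one FPC, or idle on exactly one."""
        names = {c for d in delta.values() for c in d}
        for c in sorted(names):
            movers = [fpc for fpc, d in delta.items() if d.get(c, 0) > 0]
            if len(movers) == 1:
                yield c, f"only incrementing on {movers[0]}"
            elif len(movers) == len(delta) - 1:
                idle = (set(delta) - set(movers)).pop()
                yield c, f"incrementing everywhere except {idle}"

    for counter, why in outliers(deltas(prev_snap, curr_snap)):
        print(f"suspect: {counter} ({why})")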
Re: Do you care about "gray" failures? Can we (network academics) help? A 10-min survey
Hi Jörg,

Thanks for sharing your gray failure! With a few years of lifespan, it
might well be the oldest gray failure ever monitored continuously :-)
I'm pretty sure you guys have exhausted all options already but... did
you check for micro-bursts that may cause sudden buffer overflows? Or
perhaps your probing traffic is already high priority?

Best,
Laurent

> On 8 Jul 2021, at 15:58, Jörg Kost wrote:
>
> We have a similar gray issue, where switches in a virtual chassis
> configuration with a layer-3 configuration seem to lose transit ICMP
> messages like echo or echo-reply randomly. Once we estimated it at
> around 0.00012% (let alone variances, or errors in measuring).
>
> We noticed this when we replaced Nagios with some more bursting,
> trigger-happy monitoring software a few years back. Since then, it's
> been reporting false positives from time to time, and this can become
> annoying.
>
> Besides spending a lot of time debugging this, we never had a
> breakthrough in finding the root cause; we're just looking to replace
> things in the next year.
>
> On 8 Jul 2021, at 15:28, Mark Tinka wrote:
>
>> On 7/8/21 15:22, Vanbever Laurent wrote:
>>
>>> Did you folks manage to understand what was causing the gray issue
>>> in the first place?
>>
>> Nope, still chasing it. We suspect a FIB issue on a transit device,
>> but are currently building a test to confirm.
>>
>> Mark.
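(On the "variances, or errors in measuring" point above: a loss rate
as low as 0.00012% is genuinely hard to pin down, since you need on
the order of millions of probes before the confidence interval
tightens. A quick sketch of how one might bound such an estimate, in
pure Python with a Wilson score interval and made-up probe counts:)

    from math import sqrt

    def wilson_interval(losses, probes, z=1.96):
        """95% Wilson score interval for a binomial loss-rate estimate."""
        p = losses / probes
        denom = 1 + z**2 / probes
        center = (p + z**2 / (2 * probes)) / denom
        half = z * sqrt(p * (1 - p) / probes + z**2 / (4 * probes**2)) / denom
        return center - half, center + half

    # Hypothetical: 6 echo-replies lost out of 5 million probes,
    # i.e. ~0.00012% observed loss.
    lo, hi = wilson_interval(losses=6, probes=5_000_000)
    print(f"observed 0.00012%, 95% CI: {lo:.8%} .. {hi:.8%}")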
Re: Do you care about "gray" failures? Can we (network academics) help? A 10-min survey
> On 8 Jul 2021, at 14:59, Mark Tinka wrote:
>
> On 7/8/21 14:29, Saku Ytti wrote:
>
>> Networks experience gray failures all the time, and I almost never
>> care, unless a customer does. If there is a network which does not
>> experience these, then it's likely due to lack of visibility rather
>> than issues not existing.
>>
>> Fixing these can take months of working with vendors, and attempts
>> to remedy will usually cause planned or unplanned outages. So it
>> rarely makes sense to try to fix them, as they usually impact a
>> trivial amount of traffic.
>>
>> Networks also routinely mangle packets in-memory in ways that are
>> not visible to the FCS check.
>
> I was going to say the exact same thing.
>
> +1.
>
> It's all par for the course, which is why we get up every day :-).

:-)

> I'm currently dealing with an issue that will forward a customer's
> traffic to/from one /24, but not the rest of their IPv4 space,
> including the larger allocation from which the /24 is born. It was a
> gray issue while the customer was partially activated, and then
> caused us to care when they tried to fully swing over.

Did you folks manage to understand what was causing the gray issue in
the first place?

> We've had an issue that lasted over a year but only manifested
> recently, where someone mistakenly wrote a static route pointing to
> an indirect next-hop. The router ended up resolving it and forwarding
> traffic, but in the process was spiking the CPU in a manner that was
> not immediately evident from the NMS. Fixing the next-hop resolved
> the issue, as would improving service provisioning and
> troubleshooting manuals :-).

Interesting. I can see how hard this one is to debug, as even a
relatively small amount of traffic pointing at the static route would
be enough to cause the CPU spikes.

> Like Saku says, there's always something, and attention to it will be
> granted depending on how much visible pain it causes.

Yep. Makes absolute sense.

Best,
Laurent
Re: Do you care about "gray" failures? Can we (network academics) help? A 10-min survey
> On 8 Jul 2021, at 14:29, Saku Ytti wrote:
>
> On Thu, 8 Jul 2021 at 15:00, Vanbever Laurent wrote:
>
>> Detecting whole-link and node failures is relatively easy nowadays
>> (e.g., using BFD). But what about detecting gray failures that only
>> affect a *subset* of the traffic, e.g. a router randomly dropping
>> 0.1% of the packets? Does your network often experience these gray
>> failures? Are they problematic? Do you care? And can we (network
>> researchers) do anything about it?
>
> Networks experience gray failures all the time, and I almost never
> care, unless a customer does. If there is a network which does not
> experience these, then it's likely due to lack of visibility rather
> than issues not existing.
>
> Fixing these can take months of working with vendors, and attempts
> to remedy will usually cause planned or unplanned outages. So it
> rarely makes sense to try to fix them, as they usually impact a
> trivial amount of traffic.

Thanks for chiming in. That's also my feeling: a *lot* of gray
failures routinely happen, a small percentage of which end up being
really damaging (the ones hitting customer traffic, as you pointed
out). For this small percentage though, I can imagine that being able
to detect and locate them rapidly (i.e. before the customer submits a
ticket) would be interesting? Even if fixing the root cause might take
months (since it is up to the vendors), one could still hope to
remediate the situation transiently by rerouting traffic, combined
with the traditional rebooting of the affected resources?

> Networks also routinely mangle packets in-memory in ways that are
> not visible to the FCS check.

Added to the list... Thanks!

Best,
Laurent
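PS: To quantify "rapidly": if a gray failure independently drops a
fraction p of the probes crossing it, the number of probes you need
before you can even expect to observe a loss follows directly from the
binomial distribution. A small back-of-the-envelope sketch (pure
Python, illustrative probe rates only):

    from math import ceil, log

    def probes_needed(loss_rate, detect_prob=0.99):
        """Probes needed to observe at least one loss with probability
        detect_prob, if each probe is dropped i.i.d. at loss_rate."""
        return ceil(log(1 - detect_prob) / log(1 - loss_rate))

    for p in (0.1, 0.001, 0.00001):  # 10%, 0.1%, 0.001% loss
        n = probes_needed(p)
        # At 10 probes/s per path, how long until we likely notice?
        print(f"loss {p:.5%}: ~{n} probes (~{n / 10:.0f}s at 10 probes/s)")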
Do you care about "gray" failures? Can we (network academics) help? A 10-min survey
Dear NANOG,

Detecting whole-link and node failures is relatively easy nowadays
(e.g., using BFD). But what about detecting gray failures that only
affect a *subset* of the traffic, e.g. a router randomly dropping 0.1%
of the packets? Does your network often experience these gray
failures? Are they problematic? Do you care? And can we (network
researchers) do anything about it?

Please help us find out by answering our short (<10 minutes) anonymous
survey.

Survey URL: https://forms.gle/v99mBNEPrLjcFCEu8

## Context:

When we think about network failures, we often think about a link or a
network device going down. These failures are "obvious" in that *all*
the traffic crossing the corresponding resource is dropped. But network
failures can also be more subtle and only affect a *subset* of the
traffic (e.g. 0.01% of the packets crossing a link/router). These
failures are commonly referred to as "gray" failures.

Because they don't drop *all* the traffic, gray failures are much
harder to detect. Many studies have revealed that cloud and datacenter
networks routinely suffer from gray failures, and accordingly many
techniques exist to track them down in these environments (see e.g.
this study from Microsoft Azure:
https://www.microsoft.com/en-us/research/wp-content/uploads/2017/06/paper-1.pdf).
What is less known, though, is how much gray failures affect *other*
types of networks such as Internet Service Providers (ISPs), Wide Area
Networks (WANs), or enterprise networks. While the bug reports
submitted to popular routing vendors (Cisco, Juniper, etc.) suggest
that gray failures are pervasive and hard to catch for all networks,
we would love to know more about first-hand experiences.

## About the survey:

The questionnaire is intended for network operators. It has a total of
15 questions and should take at most 10 minutes to complete. The survey
and the collected data are totally anonymous (so please do not include
information that may help identify you or your organization). All
questions are optional, so if you don't like a question or don't know
the answer, just skip it.

Thank you so much in advance; we look forward to reading your
responses!

Laurent Vanbever, ETH Zurich

PS: Of course, we would be extremely grateful if you could forward
this email to any operator you might know who may not read NANOG
(assuming those even exist? :-))!
Re: Your opinion on network analysis in the presence of uncertain events
Hi Adam/Mel,

Thanks for chiming in!

> My understanding was that the tool would combine historic data with
> the MTBF datapoints from all components involved in a given link in
> order to try and estimate the likelihood of a link failure.

Yep. This could be one way indeed. This likelihood could also take the
form of intervals in which you expect the true value to lie (again,
based on historical data). This could be done both for link/device
failures but also for external inputs such as BGP announcements (to
consider the likelihood that you receive a route for X in, say, NEWY).
The tool would then run the deterministic routing protocols (not
accounting for 'features' such as prefer-oldest-route for a sec.) on
these probabilistic inputs so as to infer the different possible
forwarding outcomes and their relative probabilities. For now, this is
roughly what we have in mind. One can of course make the model more
and more complex by e.g. also taking into account data-plane status
(to model gray failures). Intuitively though, the more complex the
model, the more complex the inference process.

> Heck, I imagine that if one would stream a heap load of data at an
> ML algorithm, it might draw some very interesting conclusions indeed,
> i.e. draw unforeseen patterns across huge datasets while trying to
> understand the overall system (network) behaviour. Such a tool might
> teach us something new about our networks. The next level would be
> recommendations on how to best address some of the potential pitfalls
> it found.

Yes. I believe some variants of this exist already. I'm not sure how
much they are used in practice though. AFAICT, false
positives/negatives are still a big problem. A non-trivial
recommendation system will require a model of the network behavior
that can somehow be inverted easily, which is probably something
academics should spend some time on :-)

> Maybe in closed systems like IP networks, with the use of streaming
> telemetry from SFPs/NPUs/LC-CPUs/Protocols/etc., we'll be able to
> feed the analytics tool with enough data to allow it to make fairly
> accurate predictions (i.e. unlike in weather or market prediction
> tools, where the dataset (or search space, as not all attributes are
> equally relevant) is virtually endless).

I'm with you. I also believe that better (even programmable) telemetry
will unlock powerful analysis tools.

Best,
Laurent

PS: Thanks a lot to those who have already answered our survey! For
those who haven't yet: https://goo.gl/forms/HdYNp3DkKkeEcexs2 (it only
takes a couple of minutes).
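PPS: To make the "deterministic protocols on probabilistic inputs"
idea concrete, here is a toy Monte Carlo sketch (my illustration, not
the actual tool): sample link failures from per-link probabilities,
run plain shortest-path routing on each sample, and tally the
forwarding outcomes. The topology and failure probabilities are made
up, and it assumes the networkx package is available.

    import random
    from collections import Counter
    import networkx as nx

    # Toy topology; per-link failure probabilities are made-up numbers.
    links = [("A", "B", 0.01), ("B", "D", 0.02),
             ("A", "C", 0.01), ("C", "D", 0.05)]

    def sample_outcome(rng):
        """One Monte Carlo sample: fail links, then route A -> D."""
        g = nx.Graph()
        g.add_nodes_from("ABCD")
        for u, v, p_fail in links:
            if rng.random() >= p_fail:  # link survives this sample
                g.add_edge(u, v)
        try:
            return tuple(nx.shortest_path(g, "A", "D"))
        except nx.NetworkXNoPath:
            return ("unreachable",)

    rng = random.Random(42)
    tally = Counter(sample_outcome(rng) for _ in range(100_000))
    for path, hits in tally.most_common():
        print(f"{' -> '.join(path)}: {hits / 100_000:.4%}")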
Re: Your opinion on network analysis in the presence of uncertain events
> I took the survey. It's short and sweet -- well done!

Thanks a lot, Mel! Highly appreciated!

> I do have a question. You ask "Are there any good?" Any good what?

I just meant to ask whether existing network analysis tools are any
good (or good enough) at reasoning about the probabilistic behaviors
that people care about (if any).

All the best,
Laurent
Your opinion on network analysis in the presence of uncertain events
Hi NANOG,

Networks evolve in uncertain environments. Links and devices randomly
fail; external BGP announcements unpredictably appear and disappear,
leading to unforeseen traffic shifts; traffic demands vary; etc.
Reasoning about network behavior under such uncertainty is hard, and
yet essential to ensure Service Level Agreements.

We're reaching out to the NANOG community as we (researchers) are
trying to better understand the practical requirements behind
"probabilistic" network reasoning. Some of our questions include: Are
uncertain behaviors problematic? Do you care about such things at all?
Are you already using tools to ensure the compliance of your network
design under uncertainty? Are there any good?

We designed a short anonymous survey to collect operators' answers. It
is composed of 14 optional questions, most of which (13/14) are
closed-ended. It should take less than 10 minutes to complete. We
expect the findings to help the research community design more
powerful network analysis tools. Among others, we intend to present
the aggregate results in a scientific article later this year.

It would be *terrific* if you could help us out!

Survey URL: https://goo.gl/forms/HdYNp3DkKkeEcexs2

Thanks much!

Laurent Vanbever, ETH Zürich

PS: It goes without saying that we would also be extremely grateful if
you could forward this email to any operator you know who may not read
NANOG.