Re: Do you care about "gray" failures? Can we (network academics) help? A 10-min survey

2021-07-08 Thread Vanbever Laurent
> One method is collecting lookup exceptions. We scrape these:
> 
> npu_triton_trapstats.py:command = "start shell sh command \"for
> fpc in $(cli -c 'show chassis fpc' | grep Online | awk '{print $1;}');
> do echo FPC$fpc; vty -c 'show cda trapstats' fpc$fpc; done\""
> ptx1k_trapstats.py:command = "start shell sh command \"for fpc in
> $(cli -c 'show chassis fpc' | grep Online | awk '{print $1;}'); do
> echo FPC$fpc; vty -c 'show pechip trapstats' fpc$fpc; done\""
> asr9k_npu_counters.py:command = "show controllers np counters all"
> junos_trio_exceptions.py:command = "show pfe statistics exceptions"
> 
> No need for ML or AI, as trivial algorithms like 'what counter is
> incrementing here which isn't incrementing elsewhere' or 'what counter
> is not incrementing here but is incrementing elsewhere' show a lot of
> real problems, and capturing those exceptions and reviewing them
> confirms it.
> 
> We do not use these to proactively find problems, as that would lead
> to poorer overall availability. But we regularly use them to expedite
> time to resolution.

Thanks for sharing! I guess this process working means the counters are 
"standard" / close enough across vendors to allow for comparisons?

> Very recently we had Tomahawk (EZchip) reset the whole linecard, and
> looking at the counters to identify one which was incrementing but
> likely should not be yielded the problem. A customer was sending us IP
> packets where the ethernet header and the IP header, up to the
> total-length field, were missing on the wire. This accidentally fuzzed
> the NPU ucode, periodically triggering an NPU bug which causes a total
> LC reload when it happens often enough.

> 
>>> Networks also routinely mangle packets in-memory, in ways that are
>>> not visible to the FCS check.
>> 
>> Added to the list... Thanks!
> 
> The only way I know of to try to find these memory corruptions is to
> look at the egress PE device's backbone-facing interface and see if
> there are IP checksum errors.
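For what it's worth, the reason this works: a packet corrupted in a router's 
memory is re-framed with a fresh, valid FCS on the next hop, but the IPv4 
header checksum (carried across hops, only incrementally updated for the TTL 
change) no longer matches. A minimal sketch of that verification, the 
standard RFC 791 ones'-complement sum, in Python purely for illustration:

import struct

def ipv4_header_ok(header: bytes) -> bool:
    # Ones'-complement sum over the header, including the checksum field;
    # a valid header folds to 0xFFFF. Corruption anywhere in the header
    # breaks this, even though the per-hop FCS looks clean.
    if len(header) < 20 or len(header) % 2:
        return False
    total = sum(struct.unpack(f"!{len(header) // 2}H", header))
    while total >> 16:                     # fold carries back in
        total = (total & 0xFFFF) + (total >> 16)
    return total == 0xFFFF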




Re: Do you care about "gray" failures? Can we (network academics) help? A 10-min survey

2021-07-08 Thread Vanbever Laurent
Hi Jörg,

Thanks for sharing your gray failure! With a few years of lifespan, it might 
well be the oldest gray failure ever monitored continuously :-) I'm pretty sure 
you guys exhausted all options already but... did you check for micro-bursts 
that may cause sudden buffer overflows? Or perhaps your probing traffic is 
already high priority?
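A quick back-of-the-envelope sketch (my own numbers, assuming independent 
drops at the rate you estimate below) also shows why a burstier prober would 
trip over this where a slow one never did:

p = 0.00012 / 100                  # 0.00012% ~= 1.2e-6 per-packet loss
for probes_per_day in (1_440, 1_000_000):   # 1/minute vs. a bursty prober
    expected_drops = probes_per_day * p
    print(f"{probes_per_day:>9} probes/day -> one drop every "
          f"{1 / expected_drops:.1f} days on average")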

Best,
Laurent

> On 8 Jul 2021, at 15:58, Jörg Kost  wrote:
> 
> We have a similar gray issue, where switches in a virtual chassis 
> configuration with a layer-3 configuration seem to randomly lose transit 
> ICMP messages like echo or echo-reply. We once estimated it at around 
> 0.00012% (let alone variances, or errors in measuring).
> 
> We noticed this when we replaced Nagios with some more bursting, 
> trigger-happy monitoring software a few years back. Since then, it's been 
> reporting false positives from time to time, and this can become annoying.
> 
> Besides spending a lot of time debugging this, we never had a breakthrough 
> in finding the root cause; we're just looking to replace things in the next 
> year.
> 
> On 8 Jul 2021, at 15:28, Mark Tinka wrote:
> 
>> On 7/8/21 15:22, Vanbever Laurent wrote:
>> 
>>> Did you folks manage to understand what was causing the gray issue in the 
>>> first place?
>> 
>> Nope, still chasing it. We suspect a FIB issue on a transit device, but 
>> currently building a test to confirm.
>> 
>> Mark.



Re: Do you care about "gray" failures? Can we (network academics) help? A 10-min survey

2021-07-08 Thread Vanbever Laurent


> On 8 Jul 2021, at 14:59, Mark Tinka  wrote:
> 
> On 7/8/21 14:29, Saku Ytti wrote:
> 
>> Networks experience gray failures all the time, and I almost never
>> care, unless a customer does. If there is a network which does not
>> experience these, then it's likely due to a lack of visibility rather
>> than the issues not existing.
>> 
>> Fixing these can take months of working with vendors, and attempts to
>> remedy will usually cause planned or unplanned outages. So it rarely
>> makes sense to try to fix them, as they usually impact a trivial
>> amount of traffic.
>> 
>> Networks also routinely mangle packets in-memory, in ways that are
>> not visible to the FCS check.
> 
> I was going to say the exact same thing.
> 
> +1.
> 
> It's all par for the course, which is why we get up every day :-).

:-)

> I'm currently dealing with an issue that will forward a customer's traffic 
> to/from one /24, but not the rest of their IPv4 space, including the larger 
> allocation from which the /24 is born. It was a gray issue while the customer 
> partially activated, and then caused us to care when they tried to fully 
> swing over.

Did you folks manage to understand what was causing the gray issue in the first 
place?

> We've had an issue that has lasted over a year but only manifested recently, 
> where someone mistakenly wrote a static route pointing to an indirect 
> next-hop. The router ended up resolving it and forwarding traffic, but in 
> the process was spiking the CPU in a manner that was not immediately evident 
> from the NMS. Fixing the next-hop resolved the issue, as would improving 
> service provisioning and troubleshooting manuals :-).

Interesting. I can see how hard this one is to debug, as even a relatively 
small amount of traffic pointing at the static route would be enough to cause 
the CPU spikes.
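For readers who haven't hit this: the extra work is the recursive resolution 
step. A toy sketch of what the router ends up doing (made-up addresses; 
whether this lands traffic on the slow path is platform-dependent, and real 
routers of course resolve in FIB structures rather than Python):

import ipaddress

# Toy RIB: prefix -> next-hop IP, with None marking a connected route.
RIB = {
    "10.0.0.0/24":     None,             # connected interface
    "192.0.2.0/24":    "10.0.0.1",       # static route, direct next-hop
    "198.51.100.0/24": "192.0.2.7",      # static route, *indirect* next-hop
}

def resolve(next_hop, depth=0):
    # Chase a next-hop through the RIB until it is directly connected.
    if depth > 8:
        raise RecursionError("next-hop resolution loop")
    matches = [p for p in RIB
               if ipaddress.ip_address(next_hop) in ipaddress.ip_network(p)]
    if not matches:
        return None                      # unresolvable: route stays inactive
    best = max(matches, key=lambda p: ipaddress.ip_network(p).prefixlen)
    via = RIB[best]
    return next_hop if via is None else resolve(via, depth + 1)

print(resolve("192.0.2.7"))   # -> 10.0.0.1, after one extra resolution step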

> Like Saku says, there's always something, and attention to it will be granted 
> depending on how much visible pain it causes.

Yep. Makes absolute sense.

Best,
Laurent

Re: Do you care about "gray" failures? Can we (network academics) help? A 10-min survey

2021-07-08 Thread Vanbever Laurent

> On 8 Jul 2021, at 14:29, Saku Ytti  wrote:
> 
> On Thu, 8 Jul 2021 at 15:00, Vanbever Laurent  wrote:
> 
>> Detecting whole-link and node failures is relatively easy nowadays (e.g., 
>> using BFD). But what about detecting gray failures that only affect a 
>> *subset* of the traffic, e.g. a router randomly dropping 0.1% of the 
>> packets? Does your network often experience these gray failures? Are they 
>> problematic? Do you care? And can we (network researchers) do anything about 
>> it?
> 
> Networks experience gray failures all the time, and I almost never
> care, unless a customer does. If there is a network which does not
> experience these, then it's likely due to a lack of visibility rather
> than the issues not existing.
> 
> Fixing these can take months of working with vendors, and attempts to
> remedy will usually cause planned or unplanned outages. So it rarely
> makes sense to try to fix them, as they usually impact a trivial
> amount of traffic.

Thanks for chiming in. That's also my feeling: a *lot* of gray failures 
routinely happen, a small percentage of which end up being really damaging 
(the ones hitting customer traffic, as you pointed out). For this small 
percentage though, I can imagine that being able to detect / locate them 
rapidly (i.e. before the customer submits a ticket) would be interesting? 
Even if fixing the root cause might take months (since that is up to the 
vendors), one could still hope to remedy the situation transiently by 
rerouting traffic, combined with the traditional rebooting of the affected 
resources?

> Networks also routinely mangle packets in-memory, in ways that are
> not visible to the FCS check.

Added to the list... Thanks!

Best,
Laurent

Do you care about "gray" failures? Can we (network academics) help? A 10-min survey

2021-07-08 Thread Vanbever Laurent
Dear NANOG,

Detecting whole-link and node failures is relatively easy nowadays (e.g., using 
BFD). But what about detecting gray failures that only affect a *subset* of the 
traffic, e.g. a router randomly dropping 0.1% of the packets? Does your network 
often experience these gray failures? Are they problematic? Do you care? And 
can we (network researchers) do anything about it?

Please help us find out by answering our short (<10 minutes) anonymous 
survey.

Survey URL: https://forms.gle/v99mBNEPrLjcFCEu8

## Context:

When we think about network failures, we often think about a link or a network 
device going down. These failures are "obvious" in that *all* the traffic 
crossing the corresponding resource is dropped. But network failures can also 
be more subtle and only affect a *subset* of the traffic (e.g. 0.01% of the 
packets crossing a link/router). These failures are commonly referred to as 
"gray" failures. Because they don't drop *all* the traffic, gray failures are 
much harder to detect.
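To put "much harder to detect" in numbers, here is a quick, purely 
illustrative calculation (our own, assuming independent per-packet losses) of 
how many probes must cross the faulty element before a drop is observed at 
all with high confidence:

import math

def probes_needed(loss_rate, confidence=0.99):
    # Probes needed so that P(at least one observed drop) >= confidence.
    return math.ceil(math.log(1 - confidence) / math.log(1 - loss_rate))

for rate in (0.001, 0.0001):       # the 0.1% and 0.01% examples above
    print(f"loss rate {rate:.2%}: {probes_needed(rate):,} probes")

# loss rate 0.10%: 4,603 probes
# loss rate 0.01%: 46,050 probes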

Many studies have revealed that cloud and datacenter networks routinely 
suffer from gray failures and, as such, many techniques exist to track them 
down in these environments (see e.g. this study from Microsoft Azure: 
https://www.microsoft.com/en-us/research/wp-content/uploads/2017/06/paper-1.pdf).
What is less known, though, is how much gray failures affect *other* types of 
networks such as Internet Service Providers (ISPs), Wide Area Networks 
(WANs), or Enterprise networks. While the bug reports submitted to popular 
routing vendors (Cisco, Juniper, etc.) suggest that gray failures are 
pervasive and hard to catch in all networks, we would love to know more about 
first-hand experiences.

## About the survey:

The questionnaire is intended for network operators. It has a total of 15 
questions and should take at most 10 minutes to complete. The survey and the 
collected data are totally anonymous (so please do not include information that 
may help to identify you or your organization). All questions are optional, so 
if you don't like a question or don't know the answer, just skip it.

Thank you so much in advance, and we look forward to reading your responses!

Laurent Vanbever, ETH Zurich

PS: Of course, we would be extremely grateful if you could forward this email 
to any operator you might know who may not read NANOG (assuming those even 
exist? :-))!


Re: Your opinion on network analysis in the presence of uncertain events

2019-01-17 Thread Vanbever Laurent
Hi Adam/Mel,

Thanks for chiming in!

My understanding was that the tool would combine historic data with the MTBF 
datapoints from all components involved in a given link in order to try and 
estimate the likelihood of a link failure.

Yep. This could indeed be one way. The likelihood could also take the form of 
intervals in which you expect the true value to lie (again, based on 
historical data). This could be done both for link/device failures and for 
external inputs such as BGP announcements (to consider the likelihood that 
you receive a route for X in, say, NEWY). The tool would then run the 
deterministic routing protocols (not accounting for 'features' such as 
prefer-oldest-route for a sec.) on these probabilistic inputs so as to infer 
the different possible forwarding outcomes and their relative probabilities. 
This is roughly what we have in mind for now.
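As a strawman of that inference loop, here is a toy sketch with a made-up 
three-node topology and failure probabilities (sampled rather than exact, and 
a BFS hop-count path standing in for the deterministic IGP):

import random

LINKS = {("A", "B"): 0.01, ("B", "C"): 0.01, ("A", "C"): 0.05}

def neighbors(up_links, node):
    for a, b in up_links:
        if a == node: yield b
        if b == node: yield a

def shortest_path(up_links, src, dst):
    # BFS hop-count path, standing in for the deterministic protocol.
    frontier, seen = [[src]], {src}
    while frontier:
        path = frontier.pop(0)
        if path[-1] == dst:
            return tuple(path)
        for nxt in neighbors(up_links, path[-1]):
            if nxt not in seen:
                seen.add(nxt)
                frontier.append(path + [nxt])
    return None                          # dst unreachable in this sample

def outcome_distribution(src, dst, samples=100_000):
    # Sample link states, run the deterministic protocol on each sample,
    # and tally how often each forwarding outcome occurs.
    counts = {}
    for _ in range(samples):
        up = [l for l, p in LINKS.items() if random.random() > p]
        outcome = shortest_path(up, src, dst)
        counts[outcome] = counts.get(outcome, 0) + 1
    return {o: n / samples for o, n in counts.items()}

print(outcome_distribution("A", "C"))
# e.g. {('A','C'): ~0.95, ('A','B','C'): ~0.049, None: ~0.001}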

One can of course make the model more and more complex by e.g. also taking into 
account data plane status (to model gray failures). Intuitively though, the 
more complex the model, the more complex the inference process is.

Heck, I imagine that if one were to stream a heap of data at an ML algorithm, 
it might indeed draw some very interesting conclusions, i.e. find unforeseen 
patterns across huge datasets while trying to understand the overall system 
(network) behaviour. Such a tool might teach us something new about our 
networks.
The next level would be recommendations on how to best address some of the 
potential pitfalls it found.

Yes. I believe some variants of this already exist. I'm not sure how much 
they are used in practice though. AFAICT, false positives/negatives are still 
a big problem. A non-trivial recommendation system will require a model of 
the network behavior that can somehow be inverted easily, which is probably 
something academics should spend some time on :-)

Maybe in closed systems like IP networks, with the use of streaming telemetry 
from SFPs/NPUs/LC-CPUs/protocols/etc., we'll be able to feed the analytics 
tool with enough data to allow it to make fairly accurate predictions (i.e. 
unlike in weather or market prediction tools, where the dataset (or search 
space, as not all attributes are equally relevant) is virtually endless).

I’m with you. I also believe that better (even programmable) telemetry will 
unlock powerful analysis tools.

Best,
Laurent


PS: Thanks a lot to those who have already answered our survey! For those who 
haven’t yet: https://goo.gl/forms/HdYNp3DkKkeEcexs2 (it only takes a couple of 
minutes).


Re: Your opinion on network analysis in the presence of uncertain events

2019-01-15 Thread Vanbever Laurent

> I took the survey. It’s short and sweet — well done!

Thanks a lot, Mel! Highly appreciated!

I do have a question. You ask "Are there any good?" Any good what?

I just meant whether existing network analysis tools were any good (or good 
enough) at reasoning about probabilistic behaviors that people care about (if 
any).

All the best,
Laurent



Your opinion on network analysis in the presence of uncertain events

2019-01-15 Thread Vanbever Laurent
Hi NANOG,

Networks evolve in uncertain environments. Links and devices randomly fail; 
external BGP announcements unpredictably appear/disappear leading to unforeseen 
traffic shifts; traffic demands vary, etc. Reasoning about network behaviors 
under such uncertainties is hard and yet essential to ensure Service Level 
Agreements.

We're reaching out to the NANOG community as we (researchers) are trying to 
better understand the practical requirements behind "probabilistic" network 
reasoning. Some of our questions include: Are uncertain behaviors problematic? 
Do you care about such things at all? Are you already using tools to ensure the 
compliance of your network design under uncertainty? Are there any good?

We designed a short anonymous survey to collect operators' answers. It is 
composed of 14 optional questions, most of which (13/14) are closed-ended. It 
should take less than 10 minutes to complete. We expect the findings to help 
the research community design more powerful network analysis tools. Among 
other things, we intend to present the aggregate results in a scientific 
article later this year.

It would be *terrific* if you could help us out!

Survey URL: https://goo.gl/forms/HdYNp3DkKkeEcexs2

Thanks much!

Laurent Vanbever, ETH Zürich


PS: It goes without saying that we would also be extremely grateful if you 
could forward this email to any operator you know who may not read NANOG.