Re: Do you care about "gray" failures? Can we (network academics) help? A 10-min survey

2021-07-09 Thread Yang Yu
On Thu, Jul 8, 2021 at 4:03 PM William Herrin wrote: > > On Thu, Jul 8, 2021 at 5:31 AM Saku Ytti wrote: > > Network experiences gray failures all the time, and I almost never > > care, unless a customer does. > > I would suggest that your customer does care, but as there is no > simple test to

Re: Do you care about "gray" failures? Can we (network academics) help? A 10-min survey

2021-07-09 Thread Warren Kumari
On Thu, Jul 8, 2021 at 5:04 PM William Herrin wrote: > > On Thu, Jul 8, 2021 at 5:31 AM Saku Ytti wrote: > > Network experiences gray failures all the time, and I almost never > > care, unless a customer does. > > Greetings, > > I would suggest that your customer does care, but as there is no >

Re: Do you care about "gray" failures? Can we (network academics) help? A 10-min survey

2021-07-09 Thread Chriztoffer Hansen
On Thu, 8 Jul 2021 at 22:10, Baldur Norddahl wrote: > We had a line card that would drop any IPv6 packet with bit #65 in the > destination address set to 1. Turns out that only a few hosts have this bit > set to 1 in the address, so nobody noticed until some Debian mirrors started > to become

Re: Do you care about "gray" failures? Can we (network academics) help? A 10-min survey

2021-07-08 Thread Saku Ytti
On Fri, 9 Jul 2021 at 00:01, William Herrin wrote: > I would suggest that your customer does care, but as there is no Most don't. Somewhat recently we were dropping a non-trivial amount of packets from a well-known book store due to DMAC failure. This was unexpected, considering it was an L3 to

Re: Do you care about "gray" failures? Can we (network academics) help? A 10-min survey

2021-07-08 Thread William Herrin
On Thu, Jul 8, 2021 at 5:31 AM Saku Ytti wrote: > Network experiences gray failures all the time, and I almost never > care, unless a customer does. Greetings, I would suggest that your customer does care, but as there is no simple test to demonstrate gray failures, your customer rarely makes

Re: Do you care about "gray" failures? Can we (network academics) help? A 10-min survey

2021-07-08 Thread Baldur Norddahl
We had a line card that would drop any IPv6 packet with bit #65 in the destination address set to 1. Turns out that only a few hosts have this bit set to 1 in the address, so nobody noticed until some Debian mirrors started to become unreachable. Also web browsers are very good at switching to IPv4
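
To check whether a given destination would have hit a failure like this one, a small helper can test a single bit of an IPv6 address. This is only a sketch: the function name `ipv6_bit_is_set` is hypothetical, and it assumes "bit #65" counts from 1 at the most significant bit (so bit 65 is the first bit of the lower 64 bits), which the message does not state.

```python
import ipaddress

def ipv6_bit_is_set(address: str, bit: int = 65) -> bool:
    """Return True if the given bit of an IPv6 address is 1.

    Assumption: bits are numbered 1..128 starting at the most significant bit.
    """
    if not 1 <= bit <= 128:
        raise ValueError("bit must be in 1..128")
    value = int(ipaddress.IPv6Address(address))
    # Shift the wanted bit down to position 0 and mask it out.
    return bool((value >> (128 - bit)) & 1)

# Documentation-prefix addresses (2001:db8::/32), used purely as examples.
for addr in ("2001:db8::1", "2001:db8::8000:0:0:1"):
    print(addr, ipv6_bit_is_set(addr))
```

Under that numbering, an address is affected only when the top bit of its lower 64 bits is 1, and an address such as `2001:db8::1` leaves that bit at 0.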

Re: Do you care about "gray" failures? Can we (network academics) help? A 10-min survey

2021-07-08 Thread Warren Kumari
On Thu, Jul 8, 2021 at 8:32 AM Saku Ytti wrote: > > On Thu, 8 Jul 2021 at 15:00, Vanbever Laurent wrote: > > > Detecting whole-link and node failures is relatively easy nowadays (e.g., > > using BFD). But what about detecting gray failures that only affect a > > *subset* of the traffic, e.g. a

Re: Do you care about "gray" failures? Can we (network academics) help? A 10-min survey

2021-07-08 Thread Saku Ytti
On Thu, 8 Jul 2021 at 19:25, Lukas Tribus wrote: > More generally speaking, single link overloads causing PL or even full > blackholing affecting single links (and therefore in a load-balanced > environment: specific tuples) is something that is very frustrating to > troubleshoot and it
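
The link between a single bad member link and "specific tuples" can be illustrated with a toy load-balancing hash. This is a sketch under stated assumptions, not how any particular router does it: real platforms use vendor-specific hash functions and inputs, and `pick_link` and `flows_on_link` are hypothetical names using SHA-256 only because it is deterministic and available in the standard library.

```python
import hashlib

def pick_link(src, dst, sport, dport, proto, n_links):
    """Map a 5-tuple to one of n_links member links (toy hash, not a vendor algorithm)."""
    key = f"{src}|{dst}|{sport}|{dport}|{proto}".encode()
    digest = hashlib.sha256(key).digest()
    return int.from_bytes(digest[:8], "big") % n_links

def flows_on_link(flows, link, n_links):
    """Return the flows whose hash selects the given member link."""
    return [f for f in flows if pick_link(*f, n_links) == link]

# Example flows between documentation addresses; only the source port varies.
flows = [("192.0.2.1", "198.51.100.7", 40000 + i, 443, 6) for i in range(16)]
broken = 2  # hypothetical index of the faulty member link
for flow in flows_on_link(flows, broken, 4):
    print("would be blackholed:", flow)
```

Because the mapping is deterministic, the same tuples keep landing on the faulty link while every other flow between the same two hosts works, which is part of what makes this kind of failure hard to reproduce from a single test connection.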

Re: Do you care about "gray" failures? Can we (network academics) help? A 10-min survey

2021-07-08 Thread Lukas Tribus
Hello, there is a large eyeball ASN in Southern Europe, single homed to a Tier1 running under the same corporate umbrella, which for about a decade suffered from periodic blackholing of specific src/dst tuples. The issue occurred every 6 - 18 months, completely breaking specific production

Re: Do you care about "gray" failures? Can we (network academics) help? A 10-min survey

2021-07-08 Thread Saku Ytti
On Thu, 8 Jul 2021 at 17:59, Vanbever Laurent wrote: > Thanks for sharing! I guess this process working means the counters are > "standard" / close enough across vendors to allow for comparisons? Not at all I'm afraid, and not intended for user consumption so generally not available via SNMP

Re: Do you care about "gray" failures? Can we (network academics) help? A 10-min survey

2021-07-08 Thread Vanbever Laurent
> One method is collecting lookup exceptions. We scrape these: > > npu_triton_trapstats.py:command = "start shell sh command \"for > fpc in $(cli -c 'show chassis fpc' | grep Online | awk '{print $1;}'); > do echo FPC$fpc; vty -c 'show cda trapstats' fpc$fpc; done\"" > ptx1k_trapstats.py:

Re: Do you care about "gray" failures? Can we (network academics) help? A 10-min survey

2021-07-08 Thread Vanbever Laurent
Hi Jörg, Thanks for sharing your gray failure! With a few years of lifespan, it might well be the oldest gray failure ever monitored continuously :-) I'm pretty sure you guys exhausted all options already but... did you check for micro-bursts that may cause sudden buffer overflow? Or perhaps

Re: Do you care about "gray" failures? Can we (network academics) help? A 10-min survey

2021-07-08 Thread Tom Beecher
> > If there is a network which does not > experience these, then it's likely due to lack of visibility rather > than issues not existing. > This. Full stop. I believe there are very few, if any, production networks in existence that have a 0% rate of drops or 'weird shit'. Monitoring for

Re: Do you care about "gray" failures? Can we (network academics) help? A 10-min survey

2021-07-08 Thread Jörg Kost
We have a similar gray issue, where switches in a virtual chassis configuration with a Layer 3 configuration seem to lose transit ICMP messages like echo or echo-reply randomly. Once we estimated it at around 0.00012% (let alone variances, or errors in measuring). We noticed this when we replaced
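
A loss rate that small takes a lot of probes to observe at all. As a rough sketch, assuming each probe is lost independently with probability p (an assumption the message does not make, and one that bursty loss would violate), the chance of seeing at least one loss in n probes is 1 - (1 - p)^n. The function name `probes_needed` is hypothetical.

```python
import math

def probes_needed(loss_rate: float, confidence: float = 0.95) -> int:
    """Probes required to see at least one loss with the given confidence.

    Assumes independent losses: P(at least one loss in n) = 1 - (1 - p)^n.
    """
    if not 0 < loss_rate < 1 or not 0 < confidence < 1:
        raise ValueError("loss_rate and confidence must be in (0, 1)")
    # log1p keeps precision when loss_rate is tiny.
    return math.ceil(math.log(1 - confidence) / math.log1p(-loss_rate))

p = 0.00012 / 100  # 0.00012 % expressed as a probability
print("expected probes per lost packet:", round(1 / p))
print("probes for 95% chance of seeing a loss:", probes_needed(p))
```

At this rate the expected spacing is on the order of one lost packet per several hundred thousand probes, so short ping tests will usually come back clean.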

Re: Do you care about "gray" failures? Can we (network academics) help? A 10-min survey

2021-07-08 Thread Saku Ytti
On Thu, 8 Jul 2021 at 16:13, Vanbever Laurent wrote: > Thanks for chiming in. That's also my feeling: a *lot* of gray failures > routinely happen, a small percentage of which end up being really damaging > (the ones hitting customer traffic, as you pointed out). For this small > percentage

Re: Do you care about "gray" failures? Can we (network academics) help? A 10-min survey

2021-07-08 Thread Mark Tinka
On 7/8/21 15:22, Vanbever Laurent wrote: > Did you folks manage to understand what was causing the gray issue in the first place? Nope, still chasing it. We suspect a FIB issue on a transit device, but are currently building a test to confirm. Mark.

Re: Do you care about "gray" failures? Can we (network academics) help? A 10-min survey

2021-07-08 Thread colin johnston
UUCP over TCP does work to overcome packet size problems. It sees limited usage, but it did work in the past. Col

Re: Do you care about "gray" failures? Can we (network academics) help? A 10-min survey

2021-07-08 Thread Vanbever Laurent
> On 8 Jul 2021, at 14:59, Mark Tinka wrote: > > On 7/8/21 14:29, Saku Ytti wrote: > >> Network experiences gray failures all the time, and I almost never >> care, unless a customer does. If there is a network which does not >> experience these, then it's likely due to lack of visibility

Re: Do you care about "gray" failures? Can we (network academics) help? A 10-min survey

2021-07-08 Thread Vanbever Laurent
> On 8 Jul 2021, at 14:29, Saku Ytti wrote: > > On Thu, 8 Jul 2021 at 15:00, Vanbever Laurent wrote: > >> Detecting whole-link and node failures is relatively easy nowadays (e.g., >> using BFD). But what about detecting gray failures that only affect a >> *subset* of the traffic, e.g. a

Re: Do you care about "gray" failures? Can we (network academics) help? A 10-min survey

2021-07-08 Thread Mark Tinka
On 7/8/21 14:29, Saku Ytti wrote: Network experiences gray failures all the time, and I almost never care, unless a customer does. If there is a network which does not experience these, then it's likely due to lack of visibility rather than issues not existing. Fixing these can take months

Re: Do you care about "gray" failures? Can we (network academics) help? A 10-min survey

2021-07-08 Thread Saku Ytti
On Thu, 8 Jul 2021 at 15:00, Vanbever Laurent wrote: > Detecting whole-link and node failures is relatively easy nowadays (e.g., > using BFD). But what about detecting gray failures that only affect a > *subset* of the traffic, e.g. a router randomly dropping 0.1% of the packets? > Does your