Re: Do you care about "gray" failures? Can we (network academics) help? A 10-min survey

2021-07-09 Thread Yang Yu
On Thu, Jul 8, 2021 at 4:03 PM William Herrin  wrote:
>
> On Thu, Jul 8, 2021 at 5:31 AM Saku Ytti  wrote:
> > Network experiences gray failures all the time, and I almost never
> > care, unless a customer does.
>
> I would suggest that your customer does care, but as there is no
> simple test to demonstrate gray failures, your customer rarely makes
> it past first tier support to bring the issue to your attention and
> gives up trying. Indeed, name the networks with the worst reputations
> around here and many of them have those reputations because of a
> routine, uncorrected state of gray failure.

Networks originating or receiving the traffic tend to have more
incentive to resolve these issues, which may not be so rare

If you have connection/application level health metrics (e.g. TLS
handshake failures, TCP retransmits), identifying that a problem exists is
not too difficult. Having health metrics associated with network paths
can greatly simplify repro. Then it's mostly troubleshooting datapath
issues on your favorite platform.

It takes quite some effort to figure out and collect the relevant
metrics and present them in a usable way. Something like "connections
from PoP A to destination ASN/prefix (via interface X) saw their TLS
handshake failure rate increase from 0.02% to 1%" is a good starting
point for troubleshooting (it may or may not be a network issue, but
the originator/receiver probably wants to fix it regardless).
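
As a rough sketch of the aggregation I have in mind (hypothetical
record schema, arbitrary thresholds):

from collections import defaultdict

# One record per connection attempt, e.g. exported from an edge proxy
# or client telemetry (hypothetical schema):
# {"pop": "A", "dst_prefix": "192.0.2.0/24", "egress_if": "et-0/0/1",
#  "tls_ok": False}

def failure_rates(records):
    total, failed = defaultdict(int), defaultdict(int)
    for r in records:
        path = (r["pop"], r["dst_prefix"], r["egress_if"])
        total[path] += 1
        if not r["tls_ok"]:
            failed[path] += 1
    return {p: failed[p] / total[p] for p in total}

def alert_on_regression(baseline, current, min_ratio=10, min_rate=0.005):
    # Flag paths whose failure rate jumped well above baseline,
    # e.g. 0.02% -> 1% as in the example above.
    for path, rate in current.items():
        base = baseline.get(path, 0.0)
        if rate >= min_rate and rate >= min_ratio * max(base, 1e-6):
            print(f"{path}: {base:.4%} -> {rate:.4%}")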

Things can get more complicated when traffic crosses network
boundaries into things you don't have visibility into (IX fabrics,
remote peering, another network's optical systems, complicated setups
like stateful firewalls / MC-LAG).


Re: Do you care about "gray" failures? Can we (network academics) help? A 10-min survey

2021-07-09 Thread Warren Kumari
On Thu, Jul 8, 2021 at 5:04 PM William Herrin  wrote:
>
> On Thu, Jul 8, 2021 at 5:31 AM Saku Ytti  wrote:
> > Network experiences gray failures all the time, and I almost never
> > care, unless a customer does.
>
> Greetings,
>
> I would suggest that your customer does care, but as there is no
> simple test to demonstrate gray failures, your customer rarely makes
> it past first tier support to bring the issue to your attention and
> gives up trying. Indeed, name the networks with the worst reputations
> around here and many of them have those reputations because of a
> routine, uncorrected state of gray failure.
>
> To answer Laurent's question:
>
> Yes, gray failures are a regular problem. Yes, most of us care. And
> for the most part we don't have particularly good ways to detect and
> isolate the problems, let alone fix them.

Depending on the actual failure mode, and the architecture of the
device itself, one technique is to run test traffic through the
box/path/whatever while twiddling the source and destination ports,
and sometimes the source IP as well.
This sometimes helps find the issue if there is a bad interface in a
LAG, or in a device which sprays packets/cells across an internal
fabric, etc. If you are really lucky you can convince the vendor to
share how they spray/hash (or, at least, you can demonstrate a
deterministic failure and hope they can run the hash themselves and
tell which of the N fabric cards is broken).
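
Hypothetically, a sweep could look something like this (assumes plain
TCP reachability to some port on the far side; the point is just to
vary the fields the hash is computed over):

import socket

def sweep(dst, dport=443, sports=range(32768, 32832), timeout=2):
    """Probe one destination from many source ports. With a per-tuple
    failure (bad LAG member, broken fabric path), the same source
    ports fail on every run; random loss moves around instead."""
    bad = []
    for sport in sports:
        s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
        s.settimeout(timeout)
        try:
            s.bind(("", sport))
            s.connect((dst, dport))
        except OSError:
            bad.append(sport)
        finally:
            s.close()
    return bad

print(sweep("192.0.2.1"))  # run several times and compare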

Hopefully you noticed the number of weasel words in there...

W



>  When it's not a clean
> failure we really are driven by: customer says blank is broken, often
> followed by grueling manual effort just to duplicate the problem
> within our view.
>
> Can network researchers do anything about it? Maybe. Because of the
> end to end principle, only the endpoints understand the state of the
> connection, and they can't tell the difference between capacity and error. They
> mostly process that information locally, sharing only limited
> information with the other endpoint. Which means there's not much
> passing over the wire for the middle to examine and learn that there's
> a problem... and when there is it often takes correlating multiple
> packets to understand that a problem exists which, in the stateless
> middle with asymmetric routing, is not usable. The middle can only
> look at its immediate link stats which, when there's a bug, are
> misleading.
>
> What would you change to dig us out of this hole?
>
> Regards,
> Bill Herrin
>
>
> --
> William Herrin
> b...@herrin.us
> https://bill.herrin.us/



-- 
The computing scientist’s main challenge is not to get confused by the
complexities of his own making.
  -- E. W. Dijkstra


Re: Do you care about "gray" failures? Can we (network academics) help? A 10-min survey

2021-07-09 Thread Chriztoffer Hansen
On Thu, 8 Jul 2021 at 22:10, Baldur Norddahl  wrote:
> We had a line card that would drop any IPv6 packet with bit #65 in the 
> destination address set to 1. Turns out that only a few hosts have this bit 
> set to 1 in the address, so nobody noticed until some Debian mirrors started 
> to become unreachable. Also, web browsers are very good at switching to IPv4 
> in case of IPv6 timeouts, so nobody would notice web hosts with the problem. 
> And then we had to live with the problem for a while because the device was 
> out of warranty and marked to be replaced, but you do not just replace a 
> router in a hurry unless you absolutely need to.

Grey failures, ugh. I heard from a colleague at a prior employer who
troubleshot an issue with an extended line, where packets to select
IPv4 destination addresses would be dropped by the extended line card.
It took time, plus inserting middle-boxes from Vendor Y (packet
capture for evidence), to confirm the issue and convince the vendor
that their code had problems.



Re: Do you care about "gray" failures? Can we (network academics) help? A 10-min survey

2021-07-08 Thread Saku Ytti
On Fri, 9 Jul 2021 at 00:01, William Herrin  wrote:

> I would suggest that your customer does care, but as there is no

Most don't. Somewhat recently we were dropping a non-trivial number
of packets from a well-known book store due to DMAC failure. This was
unexpected, considering it was an L3-to-L3 connection. It was a LACP
bundle with a large number of interfaces, and the issue affected just
one interface in the bundle. When we informed the customer about the
problem, while it was still occurring, they could not observe it:
they looked at their stats, and whatever we were dropping was drowned
in the noise; it was not an actionable signal to them. The customer
wasn't willing to remove the broken interface from the bundle, as
they could not observe the problem.

We did migrate that traffic to a working port, and after three months
we agreed with the vendor to stop troubleshooting it: the vendor
could see that they had misprogrammed their hardware, but they were
not able to figure out why, and therefore it is not fixed. A very
large number of cycles was spent by the vendor and the operator, and
a small amount of work (checking TCP retransmits, etc.) by the
customer, trying to solve it.

The reason we contacted the customer is that we were dropping quite a
large number of packets; I can easily find 100 real but smaller
problems in our network right now.

The customer was /not/ wrong; the customer did exactly the right
thing. There are a lot of problems, and you can go deep down the
rabbit hole trying to fix problems which are real but don't affect a
sufficient number of packets to have a meaningful impact on product
quality.



-- 
  ++ytti


Re: Do you care about "gray" failures? Can we (network academics) help? A 10-min survey

2021-07-08 Thread William Herrin
On Thu, Jul 8, 2021 at 5:31 AM Saku Ytti  wrote:
> Network experiences gray failures all the time, and I almost never
> care, unless a customer does.

Greetings,

I would suggest that your customer does care, but as there is no
simple test to demonstrate gray failures, your customer rarely makes
it past first tier support to bring the issue to your attention and
gives up trying. Indeed, name the networks with the worst reputations
around here and many of them have those reputations because of a
routine, uncorrected state of gray failure.

To answer Laurent's question:

Yes, gray failures are a regular problem. Yes, most of us care. And
for the most part we don't have particularly good ways to detect and
isolate the problems, let alone fix them. When it's not a clean
failure we really are driven by: customer says blank is broken, often
followed by grueling manual effort just to duplicate the problem
within our view.

Can network researchers do anything about it? Maybe. Because of the
end to end principle, only the endpoints understand the state of the
connection, and they can't tell the difference between capacity and error. They
mostly process that information locally, sharing only limited
information with the other endpoint. Which means there's not much
passing over the wire for the middle to examine and learn that there's
a problem... and when there is it often takes correlating multiple
packets to understand that a problem exists which, in the stateless
middle with asymmetric routing, is not usable. The middle can only
look at its immediate link stats which, when there's a bug, are
misleading.
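
To make the "correlating multiple packets" point concrete: even
something as basic as passively spotting TCP retransmissions from the
middle needs per-flow state, and it falls apart as soon as successive
packets, or the two directions, take different paths. A naive sketch
(assumes scapy and symmetric visibility on one interface; ignores
sequence wraparound and SACK):

from collections import defaultdict
from scapy.all import sniff, IP, TCP

seen = defaultdict(set)      # flow -> sequence numbers observed
retrans = defaultdict(int)   # flow -> suspected retransmissions

def on_pkt(pkt):
    if IP in pkt and TCP in pkt and len(pkt[TCP].payload) > 0:
        flow = (pkt[IP].src, pkt[TCP].sport, pkt[IP].dst, pkt[TCP].dport)
        if pkt[TCP].seq in seen[flow]:
            retrans[flow] += 1   # same data seen twice: loss downstream
        seen[flow].add(pkt[TCP].seq)

sniff(filter="tcp", prn=on_pkt, count=100000)
print(sorted(retrans.items(), key=lambda kv: -kv[1])[:10])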

What would you change to dig us out of this hole?

Regards,
Bill Herrin


-- 
William Herrin
b...@herrin.us
https://bill.herrin.us/


Re: Do you care about "gray" failures? Can we (network academics) help? A 10-min survey

2021-07-08 Thread Baldur Norddahl
We had a line card that would drop any IPv6 packet with bit #65 in the
destination address set to 1. Turns out that only a few hosts have this bit
set to 1 in the address, so nobody noticed until some Debian mirrors
started to become unreachable. Also, web browsers are very good at switching
to IPv4 in case of IPv6 timeouts, so nobody would notice web hosts with the
problem. And then we had to live with the problem for a while because the
device was out of warranty and marked to be replaced, but you do not just
replace a router in a hurry unless you absolutely need to.
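
For what it's worth, testing for this class of bug is trivial once you
suspect it; the hard part is suspecting it. A small helper (using the
documentation prefix as a stand-in) to pick ping targets that differ
only in that one bit:

import ipaddress

def bit65_set(addr):
    # True if bit #65 of the 128-bit address is set, i.e. the most
    # significant bit of the interface identifier.
    return bool(int(ipaddress.IPv6Address(addr)) & (1 << 63))

# Two addresses in the same /64 differing only in bit #65:
print(bit65_set("2001:db8::1"))           # False
print(bit65_set("2001:db8::8000:0:0:1"))  # True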

You do not expect this kind of issue and a lot of time was spent trying to
find an alternate explanation for the problem.

Regards,

Baldur


Re: Do you care about "gray" failures? Can we (network academics) help? A 10-min survey

2021-07-08 Thread Warren Kumari
On Thu, Jul 8, 2021 at 8:32 AM Saku Ytti  wrote:
>
> On Thu, 8 Jul 2021 at 15:00, Vanbever Laurent  wrote:
>
> > Detecting whole-link and node failures is relatively easy nowadays (e.g., 
> > using BFD). But what about detecting gray failures that only affect a 
> > *subset* of the traffic, e.g. a router randomly dropping 0.1% of the 
> > packets? Does your network often experience these gray failures? Are they 
> > problematic? Do you care? And can we (network researchers) do anything 
> > about it?”
>
> Network experiences gray failures all the time, and I almost never
> care, unless a customer does. If there is a network which does not
> experience these, then it's likely due to lack of visibility rather
> than issues not existing.
>

I think that some of it depends on the type of failure -- for example,
some devices hash packets across an internal switch fabric, and so the
failure manifests as persistent issues for a specific 5-tuple (or
between a pair of 5-tuples). If this affects one in a thousand flows
it is likely more annoying than one in a thousand random packets being
dropped.

But, yes, all networks drop some set of packets some percentage of the
time (cue the "SEU caused by cosmic rays" response :-))

W


> Fixing these can take months of working with vendors and attempts to
> remedy will usually cause planned or unplanned outages. So it rarely
> makes sense to try to fix as they usually impact a trivial amount of
> traffic.
>
> Networks also routinely mangle packets in-memory which are not visible
> to FCS check.
>
> --
>   ++ytti



-- 
The computing scientist’s main challenge is not to get confused by the
complexities of his own making.
  -- E. W. Dijkstra


Re: Do you care about "gray" failures? Can we (network academics) help? A 10-min survey

2021-07-08 Thread Saku Ytti
On Thu, 8 Jul 2021 at 19:25, Lukas Tribus  wrote:

> More generally speaking, single link overloads causing PL or even full 
> blackholing affecting single links (and therefore in a load-balanced 
> environment: specific tuples) is something that is very frustrating to 
> troubleshoot and it happens quite a lot in the DFZ. It

Ask your vendor to implement RFC5837, so that in addition to the
bundle interface having the L3 address, traceroute also returns the
actual physical interface that received the packet. This would
expedite troubleshooting issues where elephant flows congest specific
links.
Juniper and Nokia support adaptive load balancing, dynamically
adjusting the hash=>interface mapping table, to deal with elephant
flows without congesting a single link.
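
Absent RFC5837 support, you can at least fingerprint per-tuple paths
yourself. A sketch in the spirit of paris-traceroute (scapy, needs
root; addresses hypothetical) that varies the source port to see
whether different hash buckets take different, possibly broken, paths:

from scapy.all import IP, UDP, sr1

def hop_ips(dst, sport, max_ttl=8):
    # Classic UDP traceroute, but with a caller-controlled source port
    # so the 5-tuple (and thus the hashed path) stays fixed per run.
    path = []
    for ttl in range(1, max_ttl + 1):
        rep = sr1(IP(dst=dst, ttl=ttl) / UDP(sport=sport, dport=33434),
                  timeout=2, verbose=0)
        path.append(rep[IP].src if rep else "*")
        if rep and rep[IP].src == dst:
            break
    return path

for sport in range(33000, 33008):
    print(sport, " -> ".join(hop_ips("192.0.2.1", sport)))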

-- 
  ++ytti


Re: Do you care about "gray" failures? Can we (network academics) help? A 10-min survey

2021-07-08 Thread Lukas Tribus
Hello,

there is a large eyeball ASN in Southern Europe, single homed to a Tier1
running under the same corporate umbrella, which for about a decade
suffered from periodic blackholing of specific src/dst tuples. The issue
occurred every 6 - 18 months, completely breaking specific production
traffic *for multiple days* (think dead, mission-critical IPsec VPNs for
example). It was never acknowledged on the record; some say this was about
stalled 100G cards. I believe at this point the HW has been phased out, but
this was one of the rather infuriating experiences ...

More generally speaking, single link overloads causing PL or even full
blackholing affecting single links (and therefore in a load-balanced
environment: specific tuples) is something that is very frustrating to
troubleshoot and it happens quite a lot in the DFZ. It doesn't show on
monitoring systems, and it is difficult to get past the first level support
in bigger networks because load-balancing decisions and hashing are
difficult concepts for the uninitiated and they will generally refuse to
escalate issues they are unable to reproduce from their specific system
(WORKSFORME). At some point I had a router with an entire /24 configured on
a loopback, just to ping destinations from the same device with different
source IPs, to establish whether there was a load-balancing-induced issue
with packet loss, latency, or full blackholing towards a particular
destination.
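
The same idea from a Linux box, as a quick sketch (assumes the source
addresses are configured locally and iputils ping, which accepts an
address for -I; prefixes here are documentation ranges):

import subprocess

def loss_by_source(dst, sources, count=20):
    # Ping dst from each local source address; loss that follows
    # particular sources hints at a hash-correlated problem en route.
    results = {}
    for src in sources:
        out = subprocess.run(
            ["ping", "-c", str(count), "-I", src, dst],
            capture_output=True, text=True).stdout
        for line in out.splitlines():
            if "packet loss" in line:
                results[src] = line.strip()
    return results

srcs = [f"198.51.100.{i}" for i in range(1, 11)]
for src, summary in loss_by_source("192.0.2.1", srcs).items():
    print(src, "=>", summary)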

Tooling (for troubleshooting), monitoring, and education are
unfortunately lacking in this regard.


- lukas


Re: Do you care about "gray" failures? Can we (network academics) help? A 10-min survey

2021-07-08 Thread Saku Ytti
On Thu, 8 Jul 2021 at 17:59, Vanbever Laurent  wrote:

> Thanks for sharing! I guess this process working means the counters are 
> "standard" / close enough across vendors to allow for comparisons?

Not at all, I'm afraid. They are also not intended for user
consumption, so they are generally not available via SNMP or streaming
telemetry.

-- 
  ++ytti


Re: Do you care about "gray" failures? Can we (network academics) help? A 10-min survey

2021-07-08 Thread Vanbever Laurent
> One method is collecting lookup exceptions. We scrape these:
> 
> npu_triton_trapstats.py:command = "start shell sh command \"for
> fpc in $(cli -c 'show chassis fpc' | grep Online | awk '{print $1;}');
> do echo FPC$fpc; vty -c 'show cda trapstats' fpc$fpc; done\""
> ptx1k_trapstats.py:command = "start shell sh command \"for fpc in
> $(cli -c 'show chassis fpc' | grep Online | awk '{print $1;}'); do
> echo FPC$fpc; vty -c 'show pechip trapstats' fpc$fpc; done\""
> asr9k_npu_counters.py:command = "show controllers np counters all"
> junos_trio_exceptions.py:command = "show pfe statistics exceptions"
> 
> No need for ML or AI, as trivial algorithms like 'what counter is
> incrementing here but isn't incrementing elsewhere' or 'what counter
> is not incrementing here but is incrementing elsewhere' surface a lot
> of real problems, and capturing those exceptions and reviewing them
> confirms it.
> 
> We do not use these to proactively find problems, as it would lead to
> poorer overall availability. But we regularly use them to expedite
> time to resolution.

Thanks for sharing! I guess this process working means the counters are 
"standard" / close enough across vendors to allow for comparisons?

> Very recently we had Tomahawk (EZchip) reset the whole linecard, and
> looking at the counters and identifying one which was incrementing
> but likely should not have been yielded the problem. The customer was
> sending us IP packets where the Ethernet header and the IP header up
> to the Total Length field were missing on the wire; this accidentally
> fuzzed the NPU ucode, periodically triggering an NPU bug which causes
> a total LC reload when it happens often enough.

> 
>>> Networks also routinely mangle packets in-memory which are not visible
>>> to FCS check.
>> 
>> Added to the list... Thanks!
> 
> The only way I know of to try to find these memory corruptions is to
> look at the egress PE device's backbone-facing interface and see if
> there are IP checksum errors.




Re: Do you care about "gray" failures? Can we (network academics) help? A 10-min survey

2021-07-08 Thread Vanbever Laurent
Hi Jörg,

Thanks for sharing your gray failure! With a few years of lifespan, it might 
well be the oldest gray failure ever monitored continuously :-) I'm pretty sure 
you guys exhausted all options already but... did you check for micro-bursts 
that may cause sudden buffer overflows? Or perhaps your probing traffic is 
already high priority?

Best,
Laurent

> On 8 Jul 2021, at 15:58, Jörg Kost  wrote:
> 
> We have a similar gray issue, where switches in a virtual chassis 
> configuration with a layer-3 configuration seem to lose transit ICMP messages 
> like echo or echo-reply randomly. Once we estimated it at around 0.00012% 
> (allowing for variance and measurement error).
> 
> We noticed this when we replaced Nagios with some more bursting, 
> trigger-happy monitoring software a few years back. Since then, it's 
> reporting false positives from time to time, and this can become annoying.
> 
> Despite spending a lot of time debugging this, we never had a breakthrough in 
> finding the root cause; we are just looking to replace things in the next year.
> 
> On 8 Jul 2021, at 15:28, Mark Tinka wrote:
> 
>> On 7/8/21 15:22, Vanbever Laurent wrote:
>> 
>>> Did you folks manage to understand what was causing the gray issue in the 
>>> first place?
>> 
>> Nope, still chasing it. We suspect a FIB issue on a transit device, but 
>> currently building a test to confirm.
>> 
>> Mark.



Re: Do you care about "gray" failures? Can we (network academics) help? A 10-min survey

2021-07-08 Thread Tom Beecher
>
>  If there is a network which does not
> experience these, then it's likely due to lack of visibility rather
> than issues not existing.
>

This. Full stop.

I believe there are very few, if any, production networks in existence
which have a 0% rate of drops or 'weird shit'.

Monitoring for said drops and weird shit is important, and knowing your
traffic profiles is also important, so that when there is an intersection
of 'stuff' and 'stuff that noticeably impacts traffic', you can get to the
bottom of it quickly and figure out what to do.

On Thu, Jul 8, 2021 at 8:31 AM Saku Ytti  wrote:

> On Thu, 8 Jul 2021 at 15:00, Vanbever Laurent  wrote:
>
> > Detecting whole-link and node failures is relatively easy nowadays
> (e.g., using BFD). But what about detecting gray failures that only affect
> a *subset* of the traffic, e.g. a router randomly dropping 0.1% of the
> packets? Does your network often experience these gray failures? Are they
> problematic? Do you care? And can we (network researchers) do anything
> about it?”
>
> Network experiences gray failures all the time, and I almost never
> care, unless a customer does. If there is a network which does not
> experience these, then it's likely due to lack of visibility rather
> than issues not existing.
>
> Fixing these can take months of working with vendors and attempts to
> remedy will usually cause planned or unplanned outages. So it rarely
> makes sense to try to fix as they usually impact a trivial amount of
> traffic.
>
> Networks also routinely mangle packets in-memory which are not visible
> to FCS check.
>
> --
>   ++ytti
>


Re: Do you care about "gray" failures? Can we (network academics) help? A 10-min survey

2021-07-08 Thread Jörg Kost
We have a similar gray issue, where switches in a virtual chassis 
configuration with a layer-3 configuration seem to lose transit ICMP 
messages like echo or echo-reply randomly. Once we estimated it at around 
0.00012% (allowing for variance and measurement error).


We noticed this when we replaced Nagios with some more bursting, 
trigger-happy monitoring software a few years back. Since then, it's 
reporting false positives from time to time, and this can become 
annoying.


Despite spending a lot of time debugging this, we never had a 
breakthrough in finding the root cause; we are just looking to replace 
things in the next year.


On 8 Jul 2021, at 15:28, Mark Tinka wrote:


On 7/8/21 15:22, Vanbever Laurent wrote:

Did you folks manage to understand what was causing the gray issue in 
the first place?


Nope, still chasing it. We suspect a FIB issue on a transit device, 
but currently building a test to confirm.


Mark.


Re: Do you care about "gray" failures? Can we (network academics) help? A 10-min survey

2021-07-08 Thread Saku Ytti
On Thu, 8 Jul 2021 at 16:13, Vanbever Laurent  wrote:

> Thanks for chiming in. That's also my feeling: a *lot* of gray failures 
> routinely happen, a small percentage of which end up being really damaging 
> (the ones hitting customer traffic, as you pointed out). For this small 
> percentage though, I imagine being able to detect and locate them rapidly 
> (i.e. before the customer submits a ticket) would be interesting? Even if 
> fixing the root cause might take months (since it is up to the vendors), 
> one could still hope to remediate the situation transiently by rerouting 
> traffic, combined with the traditional rebooting of the affected resources?

One method is collecting lookup exceptions. We scrape these:

npu_triton_trapstats.py:command = "start shell sh command \"for
fpc in $(cli -c 'show chassis fpc' | grep Online | awk '{print $1;}');
do echo FPC$fpc; vty -c 'show cda trapstats' fpc$fpc; done\""
ptx1k_trapstats.py:command = "start shell sh command \"for fpc in
$(cli -c 'show chassis fpc' | grep Online | awk '{print $1;}'); do
echo FPC$fpc; vty -c 'show pechip trapstats' fpc$fpc; done\""
asr9k_npu_counters.py:command = "show controllers np counters all"
junos_trio_exceptions.py:command = "show pfe statistics exceptions"

No need for ML or AI, as trivial algorithms like 'what counter is
incrementing here but isn't incrementing elsewhere' or 'what counter
is not incrementing here but is incrementing elsewhere' surface a lot
of real problems, and capturing those exceptions and reviewing them
confirms it.

We do not use these to proactively find problems, as it would lead to
poorer overall availability. But we regularly use them to expedite
time to resolution.
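
A toy version of the 'incrementing here but not elsewhere' comparison,
assuming the trapstats have already been scraped into per-FPC dicts of
counter deltas (names hypothetical):

def counter_outliers(snapshots):
    # snapshots: {"FPC0": {"counter_name": delta, ...}, ...} where
    # delta is the increment since the previous scrape. Flag counters
    # moving on exactly one FPC while staying flat on all others.
    fpcs = list(snapshots)
    names = set().union(*snapshots.values())
    for name in sorted(names):
        moving = [f for f in fpcs if snapshots[f].get(name, 0) > 0]
        if len(fpcs) > 1 and len(moving) == 1:
            print(f"{name}: incrementing on {moving[0]} only "
                  f"(+{snapshots[moving[0]][name]})")

counter_outliers({
    "FPC0": {"ttl_expired": 120, "bad_ipv4_hdr": 0},
    "FPC1": {"ttl_expired": 115, "bad_ipv4_hdr": 37},  # suspicious
})
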
Very recently we had Tomahawk (EZchip) reset the whole linecard, and
looking at the counters and identifying one which was incrementing
but likely should not have been yielded the problem. The customer was
sending us IP packets where the Ethernet header and the IP header up
to the Total Length field were missing on the wire; this accidentally
fuzzed the NPU ucode, periodically triggering an NPU bug which causes
a total LC reload when it happens often enough.

> > Networks also routinely mangle packets in-memory which are not visible
> > to FCS check.
>
> Added to the list... Thanks!

The only way I know of to try to find these memory corruptions is to
look at the egress PE device's backbone-facing interface and see if
there are IP checksum errors.
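
As a rough sketch of that check: recompute the IPv4 header checksum
over packets captured on the backbone-facing interface. A packet
corrupted in a router's memory leaves the box with a freshly
generated, valid FCS but a stale IP checksum (assumes scapy and a
hypothetical capture file; only header corruption shows up this way):

import struct
from scapy.all import rdpcap, IP

def ipv4_hdr_ok(pkt):
    hdr = bytes(pkt[IP])[:pkt[IP].ihl * 4]
    s = sum(struct.unpack("!%dH" % (len(hdr) // 2), hdr))
    while s >> 16:
        s = (s & 0xffff) + (s >> 16)
    return s == 0xffff  # ones-complement sum of a valid header is all ones

pkts = rdpcap("egress.pcap")
bad = [p for p in pkts if IP in p and not ipv4_hdr_ok(p)]
print(f"{len(bad)} of {len(pkts)} packets have a bad IPv4 header checksum")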

--
  ++ytti


Re: Do you care about "gray" failures? Can we (network academics) help? A 10-min survey

2021-07-08 Thread Mark Tinka




On 7/8/21 15:22, Vanbever Laurent wrote:


Did you folks manage to understand what was causing the gray issue in the first 
place?


Nope, still chasing it. We suspect a FIB issue on a transit device, but 
currently building a test to confirm.


Mark.


Re: Do you care about "gray" failures? Can we (network academics) help? A 10-min survey

2021-07-08 Thread colin johnston


UUCP over TCP does work to overcome packet-size problems; it saw
limited usage, but it did work in the past.

Col

Re: Do you care about "gray" failures? Can we (network academics) help? A 10-min survey

2021-07-08 Thread Vanbever Laurent


> On 8 Jul 2021, at 14:59, Mark Tinka  wrote:
> 
> On 7/8/21 14:29, Saku Ytti wrote:
> 
>> Network experiences gray failures all the time, and I almost never
>> care, unless a customer does. If there is a network which does not
>> experience these, then it's likely due to lack of visibility rather
>> than issues not existing.
>> 
>> Fixing these can take months of working with vendors and attempts to
>> remedy will usually cause planned or unplanned outages. So it rarely
>> makes sense to try to fix as they usually impact a trivial amount of
>> traffic.
>> 
>> Networks also routinely mangle packets in-memory which are not visible
>> to FCS check.
> 
> I was going to say the exact same thing.
> 
> +1.
> 
> It's all par for the course, which is why we get up everyday :-).

:-)

> I'm currently dealing with an issue that will forward a customer's traffic 
> to/from one /24, but not the rest of their IPv4 space, including the larger 
> allocation from which the /24 is born. It was a gray issue while the customer 
> partially activated, and then caused us to care when they tried to fully 
> swing over.

Did you folks manage to understand what was causing the gray issue in the first 
place?

> We've had an issue that has lasted over a year but only manifested recently, 
> where someone wrote a static route pointing to an indirect next-hop, 
> mistakenly. The router ended up resolving it and forwarding traffic, but in 
> the process, was spiking CPU in a manner that was not immediately evident 
> from the NMS. Fixing the next-hop resolved the issue, as would improving 
> service provisioning and troubleshooting manuals :-).

Interesting. I can see how hard this one is to debug, as even a relatively 
small amount of traffic pointing at the static route would be enough to cause 
the CPU spikes.

> Like Saku says, there's always something, and attention to it will be granted 
> depending on how much visible pain it causes.

Yep. Makes absolute sense.

Best,
Laurent

Re: Do you care about "gray" failures? Can we (network academics) help? A 10-min survey

2021-07-08 Thread Vanbever Laurent

> On 8 Jul 2021, at 14:29, Saku Ytti  wrote:
> 
> On Thu, 8 Jul 2021 at 15:00, Vanbever Laurent  wrote:
> 
>> Detecting whole-link and node failures is relatively easy nowadays (e.g., 
>> using BFD). But what about detecting gray failures that only affect a 
>> *subset* of the traffic, e.g. a router randomly dropping 0.1% of the 
>> packets? Does your network often experience these gray failures? Are they 
>> problematic? Do you care? And can we (network researchers) do anything about 
>> it?”
> 
> Network experiences gray failures all the time, and I almost never
> care, unless a customer does. If there is a network which does not
> experience these, then it's likely due to lack of visibility rather
> than issues not existing.
> 
> Fixing these can take months of working with vendors and attempts to
> remedy will usually cause planned or unplanned outages. So it rarely
> makes sense to try to fix as they usually impact a trivial amount of
> traffic.

Thanks for chiming in. That's also my feeling: a *lot* of gray failures 
routinely happen, a small percentage of which end up being really damaging (the 
ones hitting customer traffic, as you pointed out). For this small percentage 
though, I imagine being able to detect and locate them rapidly (i.e. before 
the customer submits a ticket) would be interesting? Even if fixing the root 
cause might take months (since it is up to the vendors), one could still hope 
to remediate the situation transiently by rerouting traffic, combined with 
the traditional rebooting of the affected resources?

> Networks also routinely mangle packets in-memory which are not visible
> to FCS check.

Added to the list... Thanks!

Best,
Laurent

Re: Do you care about "gray" failures? Can we (network academics) help? A 10-min survey

2021-07-08 Thread Mark Tinka




On 7/8/21 14:29, Saku Ytti wrote:


Network experiences gray failures all the time, and I almost never
care, unless a customer does. If there is a network which does not
experience these, then it's likely due to lack of visibility rather
than issues not existing.

Fixing these can take months of working with vendors and attempts to
remedy will usually cause planned or unplanned outages. So it rarely
makes sense to try to fix as they usually impact a trivial amount of
traffic.

Networks also routinely mangle packets in-memory which are not visible
to FCS check.


I was going to say the exact same thing.

+1.

It's all par for the course, which is why we get up everyday :-).

I'm currently dealing with an issue that will forward a customer's 
traffic to/from one /24, but not the rest of their IPv4 space, including 
the larger allocation from which the /24 is born. It was a gray issue 
while the customer partially activated, and then caused us to care when 
they tried to fully swing over.


We've had an issue that has lasted over a year but only manifested 
recently, where someone wrote a static route pointing to an indirect 
next-hop, mistakenly. The router ended up resolving it and forwarding 
traffic, but in the process, was spiking CPU in a manner that was not 
immediately evident from the NMS. Fixing the next-hop resolved the 
issue, as would improving service provisioning and troubleshooting 
manuals :-).


Like Saku says, there's always something, and attention to it will be 
granted depending on how much visible pain it causes.


Mark.



Re: Do you care about "gray" failures? Can we (network academics) help? A 10-min survey

2021-07-08 Thread Mark Tinka




On 7/8/21 14:29, Saku Ytti wrote:


Network experiences gray failures all the time, and I almost never
care, unless a customer does. If there is a network which does not
experience these, then it's likely due to lack of visibility rather
than issues not existing.

Fixing these can take months of working with vendors and attempts to
remedy will usually cause planned or unplanned outages. So it rarely
makes sense to try to fix as they usually impact a trivial amount of
traffic.

Networks also routinely mangle packets in-memory which are not visible
to FCS check.


I was going to say the exact same thing.

+1.

It's all par for the course, which is why we get up everyday :-).

I'm currently dealing with an issue that will forward a customer's 
traffic to/from one /24, but not the rest of their IPv4 space, including 
the larger allocation from which the /24 is born. It was a gray issue 
while the customer partially activated, and then caused us to care when 
they tried to fully swing over.


Mark.


Re: Do you care about "gray" failures? Can we (network academics) help? A 10-min survey

2021-07-08 Thread Saku Ytti
On Thu, 8 Jul 2021 at 15:00, Vanbever Laurent  wrote:

> Detecting whole-link and node failures is relatively easy nowadays (e.g., 
> using BFD). But what about detecting gray failures that only affect a 
> *subset* of the traffic, e.g. a router randomly dropping 0.1% of the packets? 
> Does your network often experience these gray failures? Are they problematic? 
> Do you care? And can we (network researchers) do anything about it?”

Network experiences gray failures all the time, and I almost never
care, unless a customer does. If there is a network which does not
experience these, then it's likely due to lack of visibility rather
than issues not existing.

Fixing these can take months of working with vendors and attempts to
remedy will usually cause planned or unplanned outages. So it rarely
makes sense to try to fix as they usually impact a trivial amount of
traffic.

Networks also routinely mangle packets in-memory which are not visible
to FCS check.

-- 
  ++ytti