On Thu, Jul 8, 2021 at 5:04 PM William Herrin <b...@herrin.us> wrote:
>
> On Thu, Jul 8, 2021 at 5:31 AM Saku Ytti <s...@ytti.fi> wrote:
> > Network experiences gray failures all the time, and I almost never
> > care, unless a customer does.
>
> Greetings,
>
> I would suggest that your customer does care, but as there is no
> simple test to demonstrate gray failures, your customer rarely makes
> it past first tier support to bring the issue to your attention and
> gives up trying. Indeed, name the networks with the worst reputations
> around here and many of them have those reputations because of a
> routine, uncorrected state of gray failure.
>
> To answer Laurent's question:
>
> Yes, gray failures are a regular problem. Yes, most of us care. And
> for the most part we don't have particularly good ways to detect and
> isolate the problems, let alone fix them.

Depending on the actual failure mode, and the architecture of the
device itself, one technique is to run test traffic through the
box/path/whatever while twiddling the source and destination ports,
and sometimes the source IP as well. This sometimes helps find the
issue if there is a bad interface in a LAG, or in a device which
sprays packets/cells across an internal fabric, etc. If you are really
lucky you can convince the vendor to share how they spray/hash (or, at
least, demonstrate a deterministic failure, and hopefully they can run
the hash and tell which of the N fabric cards is broken).

Hopefully you noticed the number of weasel words in there...

W

> When it's not a clean
> failure we really are driven by: customer says blank is broken, often
> followed by grueling manual effort just to duplicate the problem
> within our view.
>
> Can network researchers do anything about it? Maybe. Because of the
> end to end principle, only the endpoints understand the state of the
> connection and they don't know the difference between capacity and
> error. They mostly process that information locally sharing only
> limited information with the other endpoint. Which means there's not
> much passing over the wire for the middle to examine and learn that
> there's a problem... and when there is it often takes correlating
> multiple packets to understand that a problem exists which, in the
> stateless middle with asymmetric routing, is not usable. The middle
> can only look at its immediate link stats which, when there's a bug,
> are misleading.
>
> What would you change to dig us out of this hole?
>
> Regards,
> Bill Herrin
>
>
> --
> William Herrin
> b...@herrin.us
> https://bill.herrin.us/

-- 
The computing scientist’s main challenge is not to get confused by the
complexities of his own making.
  -- E. W. Dijkstra
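
P.S. For anyone who wants to try the port-twiddling approach, here is a rough Python sketch, not a finished tool. It assumes you control a UDP responder on the far side that echoes payloads back, and that the devices in the path hash on the L4 ports, which may or may not be true of your gear. The names probe_flow, sweep and suspect_flows are made up for illustration. Each source port gives a different 5-tuple, so the probes should land on different LAG members or fabric paths. A source port that loses packets every time while its neighbours are clean is the deterministic failure worth showing the vendor.

```python
import socket


def probe_flow(dst_ip, dst_port, src_port, count=20, timeout=0.5):
    """Send `count` UDP probes from one fixed source port; return replies seen.

    Assumption: something at dst_ip:dst_port echoes UDP payloads back
    (a responder you run yourself). A late or mismatched reply is
    counted as a loss, which is crude but good enough for a sketch.
    """
    received = 0
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as s:
        s.bind(("", src_port))  # pin the source port so the 5-tuple is stable
        s.settimeout(timeout)
        for seq in range(count):
            payload = f"{src_port}:{seq}".encode()
            s.sendto(payload, (dst_ip, dst_port))
            try:
                data, _ = s.recvfrom(2048)
            except (socket.timeout, ConnectionError):
                continue
            if data == payload:
                received += 1
    return received


def sweep(dst_ip, dst_port, src_ports, count=20):
    """Probe once per source port; return {src_port: (sent, received)}."""
    return {
        sp: (count, probe_flow(dst_ip, dst_port, sp, count))
        for sp in src_ports
    }


def suspect_flows(results, max_loss=0.0):
    """Return [(flow_key, loss_fraction)] above max_loss, worst first."""
    bad = []
    for key, (sent, rcvd) in results.items():
        loss = 1 - rcvd / sent if sent else 0.0
        if loss > max_loss:
            bad.append((key, loss))
    return sorted(bad, key=lambda item: item[1], reverse=True)
```

Usage would be something like suspect_flows(sweep("192.0.2.10", 7777, range(40000, 40064))), where the address and port are placeholders. Run it a few times: the same source ports showing up as bad on every run points at one member or fabric path, whereas loss that moves around looks more like plain congestion.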