> One method is collecting lookup exceptions. We scrape these:
>
> npu_triton_trapstats.py: command = "start shell sh command \"for fpc in $(cli -c 'show chassis fpc' | grep Online | awk '{print $1;}'); do echo FPC$fpc; vty -c 'show cda trapstats' fpc$fpc; done\""
> ptx1k_trapstats.py: command = "start shell sh command \"for fpc in $(cli -c 'show chassis fpc' | grep Online | awk '{print $1;}'); do echo FPC$fpc; vty -c 'show pechip trapstats' fpc$fpc; done\""
> asr9k_npu_counters.py: command = "show controllers np counters all"
> junos_trio_exceptions.py: command = "show pfe statistics exceptions"
>
> No need for ML or AI, as trivial algorithms like 'what counter is
> incrementing which isn't incrementing elsewhere' or 'what counter is
> not incrementing is incrementing elsewhere' shows a lot of real
> problems, and capturing those exceptions and reviewing confirms them.
>
> We do not use these to proactively find problems, as it would yield to
> poorer overall availability. But we regularly use them to expedite
> time to resolution.
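
For illustration only, the "incrementing here but not elsewhere" comparison
quoted above could be sketched roughly as follows. This is a minimal Python
sketch, not the scraper scripts named in the quote: it assumes the counters
have already been parsed into per-device dicts, and the function names,
counter names and the min_peers threshold are all made up for the example.

```python
def counter_deltas(before, after):
    """Per-counter increase between two snapshots taken on one device."""
    # A counter missing from the earlier snapshot is treated as starting at 0.
    return {name: value - before.get(name, 0) for name, value in after.items()}


def find_odd_ones_out(deltas_by_device, min_peers=2):
    """Flag counters where exactly one device behaves unlike its peers.

    Returns (device, counter, kind) tuples, kind being "only_incrementing"
    or "only_idle". min_peers is a hypothetical guard: the other group
    must hold at least that many devices before anything is flagged.
    """
    counters = set()
    for deltas in deltas_by_device.values():
        counters.update(deltas)

    findings = []
    for counter in sorted(counters):
        # Negative deltas (counter reset or wrap) are counted as idle here.
        moving = sorted(dev for dev, d in deltas_by_device.items()
                        if d.get(counter, 0) > 0)
        idle = sorted(set(deltas_by_device) - set(moving))
        if len(moving) == 1 and len(idle) >= min_peers:
            findings.append((moving[0], counter, "only_incrementing"))
        elif len(idle) == 1 and len(moving) >= min_peers:
            findings.append((idle[0], counter, "only_idle"))
    return findings


# Hypothetical per-FPC deltas; the counter names are invented.
example = {
    "fpc0": {"bad_len": 50, "ttl_exp": 10},
    "fpc1": {"bad_len": 0, "ttl_exp": 4},
    "fpc2": {"bad_len": 0, "ttl_exp": 0},
}
print(find_odd_ones_out(example))
```

The output is only a list of candidates to review by hand, in line with
using the counters to shorten time to resolution rather than to alert
proactively.
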
Thanks for sharing! I guess this process working means the counters are
"standard" / close enough across vendors to allow for comparisons?

> Very recently we had Tomahawk (EZchip) reset the whole linecard and
> looking at counters identifying counter which is incrementing but
> likely should not yielded the problem. Customer was sending us IP
> packets, where ethernet header and IP header until total length was
> missing on the wire, this accidentally fuzzed the NPU ucode
> periodically triggering NPU bug, which causes total LC reload when it
> happens often enough.
>
>>> Networks also routinely mangle packets in-memory which are not visible
>>> to FCS check.
>>
>> Added to the list... Thanks!
>
> The only way I know how to try to find these memory corruptions is to
> look at egress PE device backbone facing interface and see if there
> are IP checksum errors.
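
To make the last point concrete, here is a minimal Python sketch of the
IPv4 header checksum test that such an "IP checksum errors" counter
reflects. It is an illustration under assumptions, not any vendor's
implementation: the function name is made up, and the sample header is an
arbitrary valid one. Note that the IPv4 header checksum covers only the IP
header, so corruption confined to the payload would not show up this way.

```python
import struct


def ipv4_header_checksum_ok(packet: bytes) -> bool:
    """Return True if the IPv4 header at the start of packet verifies."""
    if len(packet) < 20:
        return False
    version = packet[0] >> 4
    header_len = (packet[0] & 0x0F) * 4  # IHL is counted in 32-bit words
    if version != 4 or header_len < 20 or len(packet) < header_len:
        return False

    # One's-complement sum of all 16-bit header words, checksum included.
    total = sum(word for (word,) in struct.iter_unpack("!H", packet[:header_len]))
    while total >> 16:
        total = (total & 0xFFFF) + (total >> 16)  # fold carries back in
    return total == 0xFFFF  # an intact header sums to all ones


# Arbitrary valid 20-byte header, then the same header with one bit flipped
# in the source address, as an in-memory corruption might do.
good = bytes.fromhex("45000073000040004011b861c0a80001c0a800c7")
bad = bytearray(good)
bad[13] ^= 0x01
print(ipv4_header_checksum_ok(good), ipv4_header_checksum_ok(bytes(bad)))
```

Since the Ethernet FCS is regenerated on each hop's egress, a header
mangled in a router's memory leaves with a valid FCS, which is why the IP
checksum is the check that can still catch it downstream.
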