Hi!

On 18/02/2019 13:59, Will Deacon wrote:
> [+James, who knows how to decode these things]

Decode is a strong term! This stuff is printed by Cavium's secure-world
software. All I'm doing is spotting the bits that vary between the outputs
we've seen!

> On Mon, Feb 18, 2019 at 02:56:47PM +0100, Dmitry Vyukov wrote:
>> On Mon, Feb 18, 2019 at 2:27 PM Qian Cai <c...@lca.pw> wrote:
>>> On 2/17/19 2:30 AM, Dmitry Vyukov wrote:
>>>> On Sun, Feb 17, 2019 at 5:34 AM Qian Cai <c...@lca.pw> wrote:
>>>>>
>>>>> Enabling function tracer with CONFIG_KASAN_SW_TAGS=y (hwasan) tracer
>>>>> causes the whole system frozen on ThunderX2 systems with 256 CPUs,
>>>>> because there is a burst of too much pointer access, and then KASAN will
>>>>> dereference each byte of the shadow address for the tag checking which
>>>>> will kill all the CPUs.
>>>>
>>>> Could you please elaborate what exactly happens and who/why kills
>>>> CPUs? Number of memory accesses should not make any difference.
>>>> With hardware support (MTE) it won't be possible to disable
>>>> instrumentation (loads and stores check tags themselves), so it would
>>>> be useful to keep track of exact reasons we disable instrumentation to
>>>> know how to deal with them with hardware support.
>>>> It would be useful to keep this info in the comment in the Makefile.
>>>
>>> It turns out sometimes it will trigger a hardware error.
>>
>> Please add this to the comment that there is that error, reason is
>> unknown, happens from time to time.
>> "Too much pointer access" is confusing and does not seem to be the
>> root cause (there are lots of source files that cause lots of pointer
>> accesses).

> I don't think this is directly related to KASAN, as I'm sure we've seen this
> RAS error before.

Not quite like this. I've had one choke on some PCIe transaction[0].

This looks like corruption detected in a cache associated with a CPU.
'Write back' and 'Physical Address' suggest it's the data cache:

>>> Node 0 NBU 0 Error report :
>>> NBU BAR Error [..]
>>> Physical Address : 0x40011ff00
>>>
>>> NBU BAR Error : Decoded info :
>>> Agent info : CPU
>>> Core ID : 21
>>> Thread ID : 1
>>> Requ: type : 4 : Write Back
>>> Node 0 NBU 1 Error report :
>>> NBU BAR Error [..]
>>> Physical Address : 0x40011ff40
>>>
>>> NBU BAR Error : Decoded info :
>>> Agent info : CPU
>>> Core ID : 21
>>> Thread ID : 1
>>> Requ: type : 4 : Write Back
>>> Node 0 NBU 2 Error report :
>>> NBU BAR Error [..]
>>> Physical Address : 0x40011ff80

If you can reproduce it, and it always affects Core:21,Thread:1, I'd suggest
offlining all the threads/CPUs in that core. It may be that one cache is
close to some threshold, and you can offline the core that it's part of.

Thanks,

James

[0] For comparison, I've had one of these during kexec:
# NBU BAR Error : Decoded info :
# Agent info : IO
# : PCIE0
# Requ: type : 2 : Read
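
For the offlining suggestion above, here is a rough sketch of how it could be done through the standard sysfs CPU hotplug files. It assumes you already know one Linux CPU number that lives in the suspect core; nothing in this thread establishes how the firmware's "Core ID : 21" maps onto Linux CPU numbering, so the `84` in the usage note is purely hypothetical. The function names are made up for illustration.

```python
from pathlib import Path

# Standard sysfs location for per-CPU hotplug and topology files.
CPU_ROOT = Path("/sys/devices/system/cpu")


def parse_cpu_list(text):
    """Expand a sysfs CPU list such as "84-87" or "0,64,128" into sorted ints."""
    cpus = set()
    for chunk in text.strip().split(","):
        if not chunk:
            continue
        first, _, last = chunk.partition("-")
        cpus.update(range(int(first), int(last or first) + 1))
    return sorted(cpus)


def sibling_threads(cpu, root=CPU_ROOT):
    """Return every Linux CPU number that shares a core with `cpu`."""
    path = root / f"cpu{cpu}" / "topology" / "thread_siblings_list"
    return parse_cpu_list(path.read_text())


def offline_core(cpu, root=CPU_ROOT, dry_run=True):
    """Offline all threads of the core containing `cpu`; return the CPUs touched."""
    cpus = sibling_threads(cpu, root)
    for n in cpus:
        online = root / f"cpu{n}" / "online"
        if dry_run:
            print(f"would write 0 to {online}")
        else:
            # Needs root; fails if this CPU cannot be hot-unplugged.
            online.write_text("0")
    return cpus
```

Calling `offline_core(84)` only prints what it would do; pass `dry_run=False` (as root) to actually take the threads offline. If the error then stops, or moves to another core, that tells you something about whether it is tied to that core's cache.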