Apologies, I had meant to report back on this thread, as I found a workaround (for my setup at least).
I found the solution here:
https://wiki.archlinux.org/title/Intel_graphics#Crash/freeze_on_low_power_Intel_CPUs

After setting enable_dc=0 in my kernel boot parameters, I have no longer experienced the issue on the LTS 6.1.x kernel on my Intel NUC 8i3BEH running Arch Linux. I also tried upgrading to kernel 6.4.x, where the problem also seems to be resolved.

Regards,
James.

On Fri, 18 Aug 2023 20:08:09 +0100 James Hutchinson <jahutchinso...@googlemail.com> wrote:

> I am also affected by this, running Arch Linux on my Intel Nuc 8i3beh. I've
> seen these same random mce broadcast error kernel panics (only capturable
> via netconsole) ever since upgrading from the 5.15.x lts kernel series to
> the 6.1.x series - latest I've tried is 6.1.45 and currently back to the
> 5.15.x branch for stability.
>
> I update my Arch Linux installation on a rolling weekly basis so am right
> up to date for all packages including intel-microcode. As others have
> experienced, the problem seems more prominent (though not exclusively) when
> the machine is idle.
>
> >>Maybe lowering "check_interval" or "monarch_timeout" in machinecheck will
> cause the bug to strike more often, so a git bisect could be possible!? Or
> raising those values may workaround the problem!?
>
> I had similar thoughts and stumbled upon
>
> /sys/kernel/debug/mce/fake_panic
>
> Writing 1 to here will cause a fake panic such that the mce event will be
> logged to dmesg but panic+reboot will not occur.
>
> Interestingly we then get a couple more messages that possibly suggest that
> the core lockup is somehow related to i915 as others suspect
>
> [77775.848032] mce: CPUs not responding to MCE broadcast (may include false
> positives): 1,3
> [77775.848032] mce: CPUs not responding to MCE broadcast (may include false
> positives): 1,3
> [77775.848035] mce: [Hardware Error]: Fake kernel panic: Timeout: Not all
> CPUs entered broadcast exception handler
> [77775.848039] Disabling lock debugging due to kernel taint
> [77775.885355] mce: [Hardware Error]: Machine check events logged
> [77775.888283] mce: [Hardware Error]: CPU 2: Machine Check Exception: 5
> Bank 4: ba00000011000402
> [77775.892145] mce: [Hardware Error]: RIP !INEXACT! 10:<ffffffffc071678d>
> {fwtable_read32+0x7d/0x220 [i915]}
> [77775.897167] mce: [Hardware Error]: TSC d44e32bae41d
>
> Might be interesting to see if the
> RIP !INEXACT! 10:<ffffffffc071678d> {fwtable_read32+0x7d/0x220 [i915]}
> message occurs for others with fake_panic enabled.
>
> Unfortunately, fake_panic does not appear to be a workaround from my
> experience; since the cores reported in the mce event become locked up
> thereafter; such that any task scheduled onto those cores becomes locked-up
> - for example I ran the sensors command which hung and eventually.....
>
> [77798.629123] watchdog: BUG: soft lockup - CPU#2 stuck for 21s!
> [sensors:1229265]
> [77798.631037] Modules linked in: coretemp drivetemp netconsole
> xt_conntrack ipt_REJECT nf_reject_ipv4 xt_connmark xt_mark iptable_mangle
> xt_comment xt_addrtype iptable_raw wireguard curve25519_x86_64
> libchacha20poly1305 chacha_x86_64 poly1305_x86_64 libcurve25519_generic
> libchacha ip6_udp_tunnel udp_tunnel rfcomm uinput xt_nat xt_tcpudp
> iptable_nat xt_MASQUERADE nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4
> libcrc32c iptable_filter veth ts2020 snd_sof_pci_intel_cnl
> snd_sof_intel_hda_common soundwire_intel soundwire_generic_allocation
> soundwire_cadence snd_sof_intel_hda snd_sof_pci snd_sof_xtensa_dsp snd_sof
> snd_sof_utils soundwire_bus snd_soc_skl snd_soc_hdac_hda snd_hda_ext_core
> intel_rapl_msr intel_rapl_common snd_soc_sst_ipc intel_tcc_cooling