On Thu, 09 Aug 2018 00:56:00 +1000 Michael Ellerman <m...@ellerman.id.au> wrote:
> Mahesh J Salgaonkar <mah...@linux.vnet.ibm.com> writes:
> > From: Mahesh Salgaonkar <mah...@linux.vnet.ibm.com>
> >
> > Introduce recovery action for recovered memory errors (MCEs). There are
> > soft memory errors like SLB Multihit, which can be a result of a bad
> > hardware OR software BUG. Kernel can easily recover from these soft errors
> > by flushing SLB contents. After the recovery kernel can still continue to
> > function without any issue. But in some scenario's we may keep getting
> > these soft errors until the root cause is fixed. To be able to analyze and
> > find the root cause, best way is to gather enough data and system state at
> > the time of MCE. Hence this patch introduces a sysctl knob where user can
> > decide either to continue after recovery or panic the kernel to capture the
> > dump.
>
> I'm not convinced we want this.
>
> As we've discovered it's often not possible to reconstruct what happened
> based on a dump anyway.
>
> The key thing you need is the content of the SLB and that's not included
> in a dump.
>
> So I think we should dump the SLB content when we get the MCE (which
> this series does) and any other useful info, and then if we can recover
> we should.

Yeah it's a lot of knobs that administrators can hardly be expected to
tune. Hypervisor or firmware should really eventually make the MCE
unrecoverable if we aren't making progress.

That said, x86 has a bunch of options, and for debugging a rare crash
or specialised installations it might be useful. But we should follow
the normal format, /proc/sys/kernel/panic_on_mce.

Thanks,
Nick