Hi,

On 1/25/2024 1:19 PM, Luck, Tony wrote:
>>> The first patch adds PPIN (Protected Processor Inventory Number) field to
>>> the tracepoint.
>>>
>>> The second patch adds the microcode field (Microcode Revision) to the
>>> tracepoint.
>>
>> This is a lot of static information to add to *every* MCE.
>
> 8 bytes for PPIN, 4 more for microcode.
>
> Number of recoverable machine checks per system .... I hope the monthly rate should
> be countable on my fingers. If a system is getting more than that, then people should
> be looking at fixing the underlying problem.
>
> Corrected errors are much more common. Though Linux takes action to limit the
> rate when storms occur. So maybe hundreds or small numbers of thousands of
> error trace records? Increase in trace buffer consumption still measured in Kbytes
> not Mbytes. Server systems that do machine check reporting now start at tens of
> GBytes memory.
>
>> And where does it end? Stick full dmesg in the tracepoint too?
>
> Seems like overkill.
>
>> What is the real-life use case here?
>
> Systems using rasdaemon to track errors will be able to track both of these
> (I assume that Naik has plans to update rasdaemon to capture and save these
> new fields).
>
Yes, I do intend to submit a pull request to rasdaemon to parse and log these new fields.
> PPIN is useful when talking to the CPU vendor about patterns of similar errors
> seen across a cluster.
>
> MICROCODE - gives a fast path to root cause problems that have already
> been fixed in a microcode update.
>
> -Tony

--
Thanks,
Avadhut Naik