On Tue, 11 Nov 2025 20:55:17 +1000 Gavin Shan <[email protected]> wrote:
> Hi Jonathan, > > On 11/11/25 8:07 PM, Jonathan Cameron wrote: > > On Tue, 11 Nov 2025 14:08:13 +1000 > > Gavin Shan <[email protected]> wrote: > >> On 11/11/25 12:49 AM, Igor Mammedov wrote: > >>> On Thu, 6 Nov 2025 13:15:52 +1000 > >>> Gavin Shan <[email protected]> wrote: > >>>> On 11/6/25 12:14 AM, Jonathan Cameron wrote: > >>>>> On Wed, 5 Nov 2025 21:44:49 +1000 > >>>>> Gavin Shan <[email protected]> wrote: > >>>>> > >>>>>> In the situation where host and guest has 64KiB and 4KiB page sizes, > >>>>>> one problematic host page affects 16 guest pages. we need to send 16 > >>>>>> consective errors in this specific case. > >>>>>> > >>>>>> Extend acpi_ghes_memory_errors() to support multiple CPERs after the > >>>>>> hunk of code to generate the GHES error status is pulled out from > >>>>>> ghes_gen_err_data_uncorrectable_recoverable(). The status field of > >>>>>> generic error status block is also updated accordingly if multiple > >>>>>> error data entries are contained in the generic error status block. > >>>>>> > >>>>>> Signed-off-by: Gavin Shan <[email protected]> > >>>>> Hi Gavin, > >>>>> > >>>>> Mostly fine, but a few comments on the defines added and a > >>>>> question on what the multiple things are meant to mean? > >>>>> > >>>> > >>>> Thanks for your review and comments, replies as below. > >>>> > >>>>>> diff --git a/hw/acpi/ghes.c b/hw/acpi/ghes.c > >>>>>> index a9c08e73c0..527b85c8d8 100644 > >>>>>> --- a/hw/acpi/ghes.c > >>>>>> +++ b/hw/acpi/ghes.c > >>>>>> @@ -57,8 +57,12 @@ > >>>>>> /* The memory section CPER size, UEFI 2.6: N.2.5 Memory Error > >>>>>> Section */ > >>>>>> #define ACPI_GHES_MEM_CPER_LENGTH 80 > >>>>>> > >>>>>> -/* Masks for block_status flags */ > >>>>>> -#define ACPI_GEBS_UNCORRECTABLE 1 > >>>>>> +/* Bits for block_status flags */ > >>>>>> +#define ACPI_GEBS_UNCORRECTABLE 0 > >>>>>> +#define ACPI_GEBS_CORRECTABLE 1 > >>>>>> +#define ACPI_GEBS_MULTIPLE_UNCORRECTABLE 2 > >>>>>> +#define ACPI_GEBS_MULTIPLE_CORRECTABLE 3 > >>>>> > >>>>> So this maps to the bits in block status. > >>>>> > >>>>> I'm not actually sure what these multiple variants are meant to tell us. > >>>>> The multiple error blocks example referred to by the spec is a way to > >>>>> represent > >>>>> the same error applying to multiple places. So that's one error, many > >>>>> blocks. > >>>>> I have no idea if we set these bits in that case. > >>>>> > >>>>> Based on a quick look I don't think linux even takes any notice. THere > >>>>> are defines in actbl1.h but I'm not seeing any use made of them. > >>>>> > >>>> > >>>> I hope Igor can confirm since it was suggested by him. > >>>> > >>>> It's hard to understand how exactly these multiple variants are used > >>>> from the > >>>> spec. In ACPI 6.5 Table 18.11, it's explained as below. > >>>> > >>>> Bit [2] - Multiple Uncorrectable Errors: If set to one, indicates that > >>>> more > >>>> than one uncorrectable errors have been detected. > >>>> > >>>> I don't see those multiple variants have been used by Linux. So I think > >>>> it's > >>>> safe to drop them. > >>> > >>> even though example describes 'same' error at different components, > >>> the bit fields descriptions doesn't set any limits on what 'more than > >>> one' means. > >>> > >>> Also from guest POV it's multiple different pages that we are reporting > >>> here > >>> as multiple CPERs. > >>> It seems to me that setting *_MULTIPLE_* here is correct thing to do. > >>> > >> > >> I don't have strong opinions. Lets keep to set _MULTIPLE_ flag if Jonathan > >> is fine. Again, this field isn't used by Linux guest. > > I don't care strongly. Maybe we should ask for a spec clarification as I > > doubt > > implementations will be consistent on this given the vague description and > > that > > Linux ignores it today. > > > > Google Gemini has the following question. If it can be trusted, it should be > set when @num_of_addresses is larger than 1. > > Quota from Google Gemini: > > The system firmware sets this bit to indicate to the Operating System Power > Management (OSPM) > that more than one correctable error condition has been detected and logged > for the associated > hardware component since the last time the status was cleared by the > software. This is crucial > because a high frequency of correctable errors often indicates a potential > underlying hardware > issue that could lead to uncorrectable (and potentially fatal) errors if not > addressed (e.g., > in memory, where multiple correctable errors might trigger a spare memory > operation). > > >> > >>>>>> +#define ACPI_GEBS_ERROR_DATA_ENTRIES 4 > >>>>> > >>>>> This is bits 4-13 and the define isn't used. I'd drop it. > >>>>> > >>>> > >>>> The definition is used in acpi_ghes_memory_errors() of this patch. > >>>> However, > >>>> I don't see it has been used by Linux. This field isn't used by Linux to > >>>> determine > >>>> the total number of error entries. So I think I can drop it either if > >>>> Igor is ok. > >>>> > >> > >> Lets keep this field either in next revision if Jonathan is fine. > > > > I'm fine with the field, but not the value. As far as I can tell form the > > spec, it should > > be a mask, not a single bit. > > > > Agreed, lets keep ACPI_HEST_ERROR_ENTRY_COUNT as zero in next revision. I'm even more confused now. The GEBS Error Data entry count should be field from 13:4 and the value taken should be the number of entries in the record, so 1, 4, 16 depending on the page size. So that define of the value 4 is garbage. If it were DATA_ENTRIES_SHIFT then I'd be much happier. > > Thanks, > Gavin > >
