On Tue, 11 Nov 2025 22:19:18 +1000 Gavin Shan <[email protected]> wrote:
> Hi Jonathan, > > On 11/11/25 9:55 PM, Jonathan Cameron wrote: > > On Tue, 11 Nov 2025 20:55:17 +1000 > > Gavin Shan <[email protected]> wrote: > >> On 11/11/25 8:07 PM, Jonathan Cameron wrote: > >>> On Tue, 11 Nov 2025 14:08:13 +1000 > >>> Gavin Shan <[email protected]> wrote: > >>>> On 11/11/25 12:49 AM, Igor Mammedov wrote: > >>>>> On Thu, 6 Nov 2025 13:15:52 +1000 > >>>>> Gavin Shan <[email protected]> wrote: > >>>>>> On 11/6/25 12:14 AM, Jonathan Cameron wrote: > >>>>>>> On Wed, 5 Nov 2025 21:44:49 +1000 > >>>>>>> Gavin Shan <[email protected]> wrote: > >>>>>>> > >>>>>>>> In the situation where host and guest has 64KiB and 4KiB page sizes, > >>>>>>>> one problematic host page affects 16 guest pages. we need to send 16 > >>>>>>>> consective errors in this specific case. > >>>>>>>> > >>>>>>>> Extend acpi_ghes_memory_errors() to support multiple CPERs after the > >>>>>>>> hunk of code to generate the GHES error status is pulled out from > >>>>>>>> ghes_gen_err_data_uncorrectable_recoverable(). The status field of > >>>>>>>> generic error status block is also updated accordingly if multiple > >>>>>>>> error data entries are contained in the generic error status block. > >>>>>>>> > >>>>>>>> Signed-off-by: Gavin Shan <[email protected]> > >>>>>>> Hi Gavin, > >>>>>>> > >>>>>>> Mostly fine, but a few comments on the defines added and a > >>>>>>> question on what the multiple things are meant to mean? > >>>>>>> > >>>>>> > >>>>>> Thanks for your review and comments, replies as below. > >>>>>> > >>>>>>>> diff --git a/hw/acpi/ghes.c b/hw/acpi/ghes.c > >>>>>>>> index a9c08e73c0..527b85c8d8 100644 > >>>>>>>> --- a/hw/acpi/ghes.c > >>>>>>>> +++ b/hw/acpi/ghes.c > >>>>>>>> @@ -57,8 +57,12 @@ > >>>>>>>> /* The memory section CPER size, UEFI 2.6: N.2.5 Memory Error > >>>>>>>> Section */ > >>>>>>>> #define ACPI_GHES_MEM_CPER_LENGTH 80 > >>>>>>>> > >>>>>>>> -/* Masks for block_status flags */ > >>>>>>>> -#define ACPI_GEBS_UNCORRECTABLE 1 > >>>>>>>> +/* Bits for block_status flags */ > >>>>>>>> +#define ACPI_GEBS_UNCORRECTABLE 0 > >>>>>>>> +#define ACPI_GEBS_CORRECTABLE 1 > >>>>>>>> +#define ACPI_GEBS_MULTIPLE_UNCORRECTABLE 2 > >>>>>>>> +#define ACPI_GEBS_MULTIPLE_CORRECTABLE 3 > >>>>>>> > >>>>>>> So this maps to the bits in block status. > >>>>>>> > >>>>>>> I'm not actually sure what these multiple variants are meant to tell > >>>>>>> us. > >>>>>>> The multiple error blocks example referred to by the spec is a way to > >>>>>>> represent > >>>>>>> the same error applying to multiple places. So that's one error, > >>>>>>> many blocks. > >>>>>>> I have no idea if we set these bits in that case. > >>>>>>> > >>>>>>> Based on a quick look I don't think linux even takes any notice. > >>>>>>> THere > >>>>>>> are defines in actbl1.h but I'm not seeing any use made of them. > >>>>>>> > >>>>>> > >>>>>> I hope Igor can confirm since it was suggested by him. > >>>>>> > >>>>>> It's hard to understand how exactly these multiple variants are used > >>>>>> from the > >>>>>> spec. In ACPI 6.5 Table 18.11, it's explained as below. > >>>>>> > >>>>>> Bit [2] - Multiple Uncorrectable Errors: If set to one, indicates that > >>>>>> more > >>>>>> than one uncorrectable errors have been detected. > >>>>>> > >>>>>> I don't see those multiple variants have been used by Linux. So I > >>>>>> think it's > >>>>>> safe to drop them. > >>>>> > >>>>> even though example describes 'same' error at different components, > >>>>> the bit fields descriptions doesn't set any limits on what 'more than > >>>>> one' means. > >>>>> > >>>>> Also from guest POV it's multiple different pages that we are reporting > >>>>> here > >>>>> as multiple CPERs. > >>>>> It seems to me that setting *_MULTIPLE_* here is correct thing to do. > >>>>> > >>>> > >>>> I don't have strong opinions. Lets keep to set _MULTIPLE_ flag if > >>>> Jonathan > >>>> is fine. Again, this field isn't used by Linux guest. > >>> I don't care strongly. Maybe we should ask for a spec clarification as I > >>> doubt > >>> implementations will be consistent on this given the vague description > >>> and that > >>> Linux ignores it today. > >>> > >> > >> Google Gemini has the following question. If it can be trusted, it should > >> be > >> set when @num_of_addresses is larger than 1. > >> > >> Quota from Google Gemini: > >> > >> The system firmware sets this bit to indicate to the Operating System > >> Power Management (OSPM) > >> that more than one correctable error condition has been detected and > >> logged for the associated > >> hardware component since the last time the status was cleared by the > >> software. This is crucial > >> because a high frequency of correctable errors often indicates a potential > >> underlying hardware > >> issue that could lead to uncorrectable (and potentially fatal) errors if > >> not addressed (e.g., > >> in memory, where multiple correctable errors might trigger a spare memory > >> operation). > >> > >>>> > >>>>>>>> +#define ACPI_GEBS_ERROR_DATA_ENTRIES 4 > >>>>>>> > >>>>>>> This is bits 4-13 and the define isn't used. I'd drop it. > >>>>>>> > >>>>>> > >>>>>> The definition is used in acpi_ghes_memory_errors() of this patch. > >>>>>> However, > >>>>>> I don't see it has been used by Linux. This field isn't used by Linux > >>>>>> to determine > >>>>>> the total number of error entries. So I think I can drop it either if > >>>>>> Igor is ok. > >>>>>> > >>>> > >>>> Lets keep this field either in next revision if Jonathan is fine. > >>> > >>> I'm fine with the field, but not the value. As far as I can tell form > >>> the spec, it should > >>> be a mask, not a single bit. > >>> > >> > >> Agreed, lets keep ACPI_HEST_ERROR_ENTRY_COUNT as zero in next revision. > > > > I'm even more confused now. The GEBS Error Data entry count should be > > field from 13:4 > > and the value taken should be the number of entries in the record, so 1, 4, > > 16 depending > > on the page size. > > > > So that define of the value 4 is garbage. If it were DATA_ENTRIES_SHIFT > > then I'd be much happier. > > > > My bad. I misunderstood your point. It will be fixed by using APIs from > "hw/registerfields.h" as suggested by Philippe in another reply. > > ... > FIELD(ACPI_GEBS, MULTIPLE_CORRECTABLE, 3, 1) > FIELD(ACPI_GEBS, ERROR_DATA_ENTRIES, 4, 10) > > then use FIELD_DP32() to only set the correct bits. > Perfect. Thanks! J > Thanks, > Gavin > >
