Hi James,

On 2020/7/31 21:48, James Morse wrote:
> Hi Tan,
> 
> On 30/07/2020 08:32, Xiaofei Tan wrote:
>> After the following commit is applied, a user-mode SEA is preferentially
>> processed by APEI, which does memory failure handling to recover.
>>
>> But there are some problems:
>> 1) The fact that apei_claim_sea() has processed a CPER record does not
>> mean that memory failure handling has been done, because a firmware-first
>> RAS error is reported by both the producer and the consumer. An SEA is
>> mostly reported as a consumer, using the ARM processor error section.
>> (The producer could be the DDRC or a cache, reporting with the memory
>> error section or another error section.) But memory failure handling for
>> the ARM processor error section has not been supported yet. We should add it.
> 
> I can't follow what you are saying here.
> 
> APEI doesn't parse the Processor Error records. This has always been true,
> it's not a regression introduced by that commit.
> 

Before that commit, whether APEI parsed these records didn't affect the SEA
processing flow. After that commit, it does.

> 
>> 2) Some hardware platforms can't record the physical address every time,
>> but they always report a firmware-first RAS error using the ARM processor
>> error section. Such platforms should update their firmware so that it does
>> not report the RAS error when the physical address is not recorded.
> 
> Eh? If firmware fails to describe the error, we should carry on and pretend 
> nothing happened?
> 

I mean the firmware should not report a RAS error in the SEA flow if the
physical address is not recorded.
The producer RAS node can still report the error.


> I think if the APEI code gets CPER records that don't have the fields linux
> needs to handle the error (for memory: that's the physical address), it
> should return an error to the caller, as the work hasn't been done.
> 
> In the case of arm64's synchronous external abort, the response should be
> the apei_claim_sea() code not claiming the abort, as there is a problem with
> the records. Certainly the current behaviour can be improved.
> 

Agree.


> 
>> Fixes: 8fcc4ae6faf8 ("arm64: acpi: Make apei_claim_sea() synchronise with APEI's irq work")
> 
> I don't see how parsing this extra record fixes a bug in this commit.
> Presumably you were depending on the arch code killing the thread regardless
> of whether APEI found work to do ... which masked the fact it finds work,
> but doesn't know what to do with it.
> 

Hmm, it's a little far-fetched. But I don't know how to describe the
relationship with that commit. Any ideas?

> 
> I'm assuming your platform describes errors it detects in the cache as
> processor errors for the cache, instead of memory errors.
> 

Yes.

> 
>> diff --git a/drivers/acpi/apei/ghes.c b/drivers/acpi/apei/ghes.c
>> index 81bf71b..07bfa28 100644
>> --- a/drivers/acpi/apei/ghes.c
>> +++ b/drivers/acpi/apei/ghes.c
>> @@ -466,6 +466,44 @@ static bool ghes_handle_memory_failure(struct acpi_hest_generic_data *gdata,
>>      return false;
>>  }
>>  
>> +static bool ghes_handle_arm_hw_error(struct acpi_hest_generic_data *gdata, int sev)
>> +{
>> +    struct cper_sec_proc_arm *err = acpi_hest_get_payload(gdata);
>> +    struct cper_arm_err_info *err_info;
>> +    bool queued = false;
>> +    int sec_sev, i;
>> +
>> +    log_arm_hw_error(err);
>> +
>> +    if (!IS_ENABLED(CONFIG_ACPI_APEI_MEMORY_FAILURE))
>> +            return false;
> 
>> +    sec_sev = ghes_severity(gdata->error_severity);
>> +    if (sev != GHES_SEV_RECOVERABLE || sec_sev != GHES_SEV_RECOVERABLE)
>> +            return false;
> 
> This is to filter out corrected errors? 

Yes.

> What if this section is fatal?

Fatal errors won't come here.

> The panic on fatal code only looks at the severity in the Generic Error
> Status Block.
> 

Yes.

> I think the right thing to do is to explicitly test each "Cache error
> structure"'s bits for corrected/uncorrected instead.
> 

Do you mean skip TLB/Bus/Micro-architectural errors?

> These top-level severities describe a group of records. You may have a
> corrected error event that still has latent faults left in the system.
> 

Yes.

> 
> This thing has multiple variable length entries in it.
> Could we sanity test that 'err->err_info_num' doesn't take us outside 
> err->section_length?
> (we already do this sort of thing in the probe code)
> 

I think the firmware should ensure the data is valid.
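
If we do decide to add such a check, I guess it could look something like the
sketch below (untested, and the helper name is mine) to make sure the declared
err_info entries actually fit inside the section:

static bool ghes_arm_err_info_fits(const struct cper_sec_proc_arm *err)
{
        /*
         * All error info entries must fit inside the section. The context
         * info entries follow the error info entries, so this is a
         * conservative check; a stricter one could also walk each entry's
         * own 'length' field.
         */
        u32 need = sizeof(*err) +
                   err->err_info_num * sizeof(struct cper_arm_err_info);

        return need <= err->section_length;
}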

> 
>> +    err_info = (struct cper_arm_err_info *) (err + 1);
>> +    for (i = 0; i < err->err_info_num; i++, err_info++) {
>> +            unsigned long pfn;
> 
> Please check the type of this error, and only invoke memory_failure_queue() 
> for caches.
> (does your firmware generate the other types too?)
> 

Our firmware only generates two types: cache errors and TLB errors.
The TLB error type is only for the MMU, and it can't record a physical
address, only a VA.

> 
> For a bus error, why are we complaining that this isn't memory?

There are two types of errors from memory: bus errors and poison errors.
CPU core RAS nodes can't record bus errors. They can record poison errors
in some scenarios, but these are taken as cache errors with a "PN" flag.

> If this were a TLB error, what does the physical address mean? Is it part
> of the page tables or the final output address? (Who knows what the
> physical address means for a micro-architectural error!)

You are right, we can't record a physical address for error types other than
cache errors.

> 
> I think these other types should print a ratelimited warning that this type
> isn't understood. We shouldn't pretend they are memory and hope for the best.
> 

OK. I will add a ratelimited warning for other types here.
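
Roughly what I have in mind (untested sketch for inside the err_info loop;
I'm reading the "Type" field of the ARM Processor Error Information Structure
as 0x00 == cache error, please correct me if the spec revision we target
encodes it differently):

                /*
                 * Only cache errors carry a physical address we can hand to
                 * memory_failure_queue(). The 0x00 value below is my reading
                 * of the UEFI spec's Type field and needs double checking.
                 */
                if (err_info->type != 0x00) {
                        pr_warn_ratelimited(GHES_PFX
                                "Unhandled ARM processor error type: 0x%02x\n",
                                err_info->type);
                        continue;
                }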

> 
> Please check the corrected or uncorrected bit in the type-specific u64 for 
> caches.
> 

OK.
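
For reference, the following is what I understand you mean (untested sketch;
the macro names are mine, and the bit positions are my reading of the UEFI
Cache Error Structure, so please double check them against the spec):

/* Hypothetical local names; bits [15:0] of error_info are validation bits. */
#define GHES_ARM_CACHE_ERR_CORRECTED_VALID      BIT(4)
#define GHES_ARM_CACHE_ERR_CORRECTED            BIT(26)

static bool ghes_arm_cache_err_corrected(const struct cper_arm_err_info *err_info)
{
        u64 info = err_info->error_info;

        /* Callers should probably first check that error_info itself is valid. */
        return (info & GHES_ARM_CACHE_ERR_CORRECTED_VALID) &&
               (info & GHES_ARM_CACHE_ERR_CORRECTED);
}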

> 
>> +            if (!(err_info->validation_bits & CPER_ARM_INFO_VALID_PHYSICAL_ADDR))
>> +                    continue;
> 
> 
>> +            pfn = PHYS_PFN(err_info->physical_fault_addr);
>> +            if (!pfn_valid(pfn)) {
> 
>> +                    pr_warn(FW_WARN GHES_PFX
> 
> ratelimit!
> 

OK.

>> +                            "Invalid address in generic error data: 0x%#llx\n",
>> +                            err_info->physical_fault_addr);
>> +                    continue;
>> +            }
>> +
>> +            memory_failure_queue(pfn, 0);
>> +            queued = true;
> 
> This bit is almost the same as part of ghes_handle_memory_failure(), please
> pull that out to a common helper. I think you'll need to pass the flags for
> memory_failure_queue() in.
> 
> 

OK.
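
Something like this, then (a rough sketch of the common helper, untested and
with a name of my own choosing), which both ghes_handle_memory_failure() and
the new ARM handler could call:

static bool ghes_do_memory_failure(u64 physical_addr, int flags)
{
        unsigned long pfn;

        if (!IS_ENABLED(CONFIG_ACPI_APEI_MEMORY_FAILURE))
                return false;

        pfn = PHYS_PFN(physical_addr);
        if (!pfn_valid(pfn)) {
                pr_warn_ratelimited(FW_WARN GHES_PFX
                        "Invalid address in generic error data: %#llx\n",
                        physical_addr);
                return false;
        }

        memory_failure_queue(pfn, flags);
        return true;
}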

> 
> Thanks,
> 
> James
> 
>> +    }
>> +
>> +    return queued;
>> +}
>> +
>>  /*
>>   * PCIe AER errors need to be sent to the AER driver for reporting and
>>   * recovery. The GHES severities map to the following AER severities and
>> @@ -543,9 +581,7 @@ static bool ghes_do_proc(struct ghes *ghes,
>>                      ghes_handle_aer(gdata);
>>              }
>>              else if (guid_equal(sec_type, &CPER_SEC_PROC_ARM)) {
>> -                    struct cper_sec_proc_arm *err = acpi_hest_get_payload(gdata);
>> -
>> -                    log_arm_hw_error(err);
>> +                    queued = ghes_handle_arm_hw_error(gdata, sev);
>>              } else {
>>                      void *err = acpi_hest_get_payload(gdata);
>>  
>>
> 
> 
> .
> 

-- 
 thanks
tanxiaofei
