
RE: [PATCH 1/1] RAS: Add CPU Correctable Error Collector to isolate an erroneous CPU core

2020-10-02 Thread Shiju Jose

Hi Boris, Hi James,

>-----Original Message-----
>From: Borislav Petkov [mailto:b...@alien8.de]
>Sent: 01 October 2020 18:31
>To: James Morse 
>Cc: Shiju Jose ; linux-e...@vger.kernel.org; linux-
>a...@vger.kernel.org; linux-kernel@vger.kernel.org; tony.l...@intel.com;
>r...@rjwysocki.net; l...@kernel.org; Linuxarm 
>Subject: Re: [PATCH 1/1] RAS: Add CPU Correctable Error Collector to isolate
>an erroneous CPU core
>
>On Thu, Oct 01, 2020 at 06:16:03PM +0100, James Morse wrote:
>> If the corrected-count is available somewhere, can't this policy be
>> made in user-space?
>
>You mean rasdaemon goes and offlines CPUs when certain thresholds are
>reached? Sure. It would be much more flexible too.

I will send the kernel changes for existing CEC to support the CPU CE errors. 
Can you please have a look?

Thanks,
Shiju

>
>--
>Regards/Gruss,
>Boris.
>
>https://people.kernel.org/tglx/notes-about-netiquette

Re: [PATCH 1/1] RAS: Add CPU Correctable Error Collector to isolate an erroneous CPU core

2020-10-01 Thread Borislav Petkov

On Thu, Oct 01, 2020 at 06:16:03PM +0100, James Morse wrote:
> If the corrected-count is available somewhere, can't this policy be
> made in user-space?

You mean rasdaemon goes and offlines CPUs when certain thresholds are
reached? Sure. It would be much more flexible too.
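
A user-space policy of the kind discussed here could be sketched roughly as follows. This is only an illustration, not rasdaemon code: the helper names are hypothetical, and the threshold of 5000 simply mirrors the default documented in the posted patch. The one kernel interface it relies on is the standard CPU hotplug file /sys/devices/system/cpu/cpuN/online, where writing "0" asks the kernel to take that CPU offline (root is required).

```c
#include <stdio.h>

/* Hypothetical policy knob; 5000 mirrors the default in the posted patch. */
#define CE_THRESHOLD 5000u

/* Build the sysfs hotplug path for a logical CPU. Returns 0 on success. */
int cpu_online_path(unsigned int cpu, char *buf, size_t len)
{
	int n = snprintf(buf, len, "/sys/devices/system/cpu/cpu%u/online", cpu);

	return (n < 0 || (size_t)n >= len) ? -1 : 0;
}

/* Policy decision: offline once the corrected-error count hits the threshold. */
int should_offline(unsigned int ce_count)
{
	return ce_count >= CE_THRESHOLD;
}

/* Ask the kernel to offline a CPU by writing "0". Returns 0 on success. */
int offline_cpu(unsigned int cpu)
{
	char path[64];
	FILE *f;
	int ret;

	if (cpu_online_path(cpu, path, sizeof(path)))
		return -1;

	f = fopen(path, "w");
	if (!f)
		return -1;

	ret = (fputs("0", f) == EOF) ? -1 : 0;
	if (fclose(f) == EOF)
		ret = -1;

	return ret;
}
```

A daemon would call should_offline() with whatever corrected-error count it tracks per CPU and, when it returns non-zero, call offline_cpu(). Keeping the decision in a separate function is what makes the policy easy to change without touching the kernel.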

-- 
Regards/Gruss,
Boris.

https://people.kernel.org/tglx/notes-about-netiquette

Re: [PATCH 1/1] RAS: Add CPU Correctable Error Collector to isolate an erroneous CPU core

2020-10-01 Thread James Morse

Hi guys,

On 17/09/2020 09:40, Borislav Petkov wrote:
> On Thu, Sep 10, 2020 at 03:29:56PM +, Shiju Jose wrote:

> You can't know what exactly you wanna do if you don't have a use case
> you're trying to address.
> 
>> According to the ARM Processor CPER definition the error types
>> reported are Cache Error, TLB Error, Bus Error and micro-architectural
>> Error.
> 
> Bus error sounds like not even originating in the CPU but the CPU only
> reporting it. Imagine if that really were the case, and you go disable
> the CPU but the error source is still there. You've just disabled the
> reporting of the error only and now you don't even know anymore that
> you're getting errors.
> 
>> Few thoughts on this,
>> 1. Not sure will a CPU core would work/perform as normal after disabling
>> a functional unit?
> 
> You can disable parts of caches, etc, so that you can have a somewhat
> functioning CPU until the replacement maintenance can take place.

This is implementation-specific stuff that only firmware can do...


>> 2. Support in the HW to disable a function unit alone may not available.
> 
> Yes.
> 
>> 3. If it is require to store and retrieve the error count based on
>> functional unit, then CEC will become more complex?
> 
> Depends on how it is designed. That's why we're first talking about what
> needs to be done exactly before going off and doing something.
> 
>> This requirement is the part of the early fault prediction by taking
>> action when large number of corrected errors reported on a CPU core
>> before it causing serious faults.
> 
> And do you know of actual real-life examples where this is really the
> case? Do you have any users who report a large error count on ARM CPUs,
> originating from the caches and that something like that would really
> help?
> 
> Because from my x86 CPUs limited experience, the cache arrays are mostly
> fine and errors reported there are not something that happens very
> frequently so we don't even need to collect and count those.
> 
> So is this something which you need to have in order to check a box
> somewhere that there is some functionality or is there an actual
> real-life use case behind it which a customer has requested?

If the corrected-count is available somewhere, can't this policy be made in 
user-space?


Thanks,

James

Re: [PATCH 1/1] RAS: Add CPU Correctable Error Collector to isolate an erroneous CPU core

2020-09-17 Thread Borislav Petkov

On Thu, Sep 10, 2020 at 03:29:56PM +, Shiju Jose wrote:
> Ok. However the functions such as __find_elem() use
> memory specific PFN() and PAGE_SHIFT.

You can add your version find_elem_cpu() or so. You can do this with a
set of function pointers which belong to the different type of storage
the CEC needs, you can do all kinds of fun.
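
The function-pointer idea can be sketched in plain C as below. This is an assumption-laden illustration, not the layout of drivers/ras/cec.c: struct ce_ops, pfn_key(), cpu_key(), the shift values and this find_elem() are all hypothetical names chosen for the example. It only shows how one shared lookup routine can serve two storage formats.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical ops table: one instance per storage type. */
struct ce_ops {
	/* Extract the lookup key (PFN or CPU number) from a stored element. */
	uint64_t (*key)(uint64_t elem);
};

struct ce_array {
	uint64_t *array;          /* elements, sorted by key */
	size_t n;                 /* number of valid elements */
	const struct ce_ops *ops; /* how to interpret the elements */
};

#define DEMO_PAGE_SHIFT 12 /* assumption: 4K pages */
#define DEMO_CPU_SHIFT  32 /* hypothetical: CPU number in the upper half */

uint64_t pfn_key(uint64_t elem) { return elem >> DEMO_PAGE_SHIFT; }
uint64_t cpu_key(uint64_t elem) { return elem >> DEMO_CPU_SHIFT; }

const struct ce_ops pfn_ops = { .key = pfn_key };
const struct ce_ops cpu_ops = { .key = cpu_key };

/* Binary search shared by both storage types. Returns the index or -1. */
int find_elem(const struct ce_array *ca, uint64_t key)
{
	size_t lo = 0, hi = ca->n;

	while (lo < hi) {
		size_t mid = lo + (hi - lo) / 2;
		uint64_t k = ca->ops->key(ca->array[mid]);

		if (k == key)
			return (int)mid;
		if (k < key)
			lo = mid + 1;
		else
			hi = mid;
	}

	return -1;
}
```

The search never touches PFN- or CPU-specific macros directly; everything format-specific sits behind ops->key().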

> I will check this. For CPU, the corrected errors count for a short
> time period to be checked. Thus old errors outside this period would
> not be considered and would be cleared. It is not clear to me whether
> in the current CEC, the count for the old errors outside a time period
> would be excluded for the threshold check or removed?

Currently, the CEC decays the errors each time do_spring_cleaning()
runs, by decrementing DECAY_BITS in the PFN record. Those which get
DECAY_BITS of 0, get overwritten when the data structure is full.

You can do something similar by halving the error count or something
more complex like save the error timestamp and eliminate...
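
The "halve the error count" variant might look something like this. It is a sketch under stated assumptions: the flat per-CPU counter array, decay_counts() and add_ce() are hypothetical and do not come from the existing CEC, and the decay interval and threshold are left to the caller.

```c
#include <stddef.h>
#include <stdint.h>

/* Halve every per-CPU count; meant to run once per decay interval. */
void decay_counts(uint32_t *counts, size_t ncpus)
{
	for (size_t i = 0; i < ncpus; i++)
		counts[i] >>= 1;
}

/*
 * Record n corrected errors for a CPU.
 * Returns 1 if the threshold is reached, 0 if not, -1 on a bad CPU index.
 */
int add_ce(uint32_t *counts, size_t ncpus, size_t cpu, uint32_t n,
	   uint32_t threshold)
{
	if (cpu >= ncpus)
		return -1;

	counts[cpu] += n;

	return counts[cpu] >= threshold;
}
```

With halving, an old burst of errors fades geometrically instead of being dropped at a hard window edge, so no timestamps need to be stored.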

You can't know what exactly you wanna do if you don't have a use case
you're trying to address.

> According to the ARM Processor CPER definition the error types
> reported are Cache Error, TLB Error, Bus Error and micro-architectural
> Error.

Bus error sounds like not even originating in the CPU but the CPU only
reporting it. Imagine if that really were the case, and you go disable
the CPU but the error source is still there. You've just disabled the
reporting of the error only and now you don't even know anymore that
you're getting errors.

> Few thoughts on this,
> 1. Not sure will a CPU core would work/perform as normal after disabling
> a functional unit?

You can disable parts of caches, etc, so that you can have a somewhat
functioning CPU until the replacement maintenance can take place.

> 2. Support in the HW to disable a function unit alone may not available.

Yes.

> 3. If it is require to store and retrieve the error count based on
> functional unit, then CEC will become more complex?

Depends on how it is designed. That's why we're first talking about what
needs to be done exactly before going off and doing something.

> This requirement is the part of the early fault prediction by taking
> action when large number of corrected errors reported on a CPU core
> before it causing serious faults.

And do you know of actual real-life examples where this is really the
case? Do you have any users who report a large error count on ARM CPUs,
originating from the caches and that something like that would really
help?

Because from my x86 CPUs limited experience, the cache arrays are mostly
fine and errors reported there are not something that happens very
frequently so we don't even need to collect and count those.

So is this something which you need to have in order to check a box
somewhere that there is some functionality or is there an actual
real-life use case behind it which a customer has requested?

> We are mainly looking for disable CPU core on large number of L1/L2
> cache corrected errors reported on a CPU core. Can we add atleast
> removing CPU core for the CPU cache corrected errors filtering out
> other error types?

See above.

-- 
Regards/Gruss,
Boris.

https://people.kernel.org/tglx/notes-about-netiquette

RE: [PATCH 1/1] RAS: Add CPU Correctable Error Collector to isolate an erroneous CPU core

2020-09-10 Thread Shiju Jose

Hello Boris,

>-----Original Message-----
>From: Borislav Petkov [mailto:b...@alien8.de]
>Sent: 09 September 2020 13:02
>To: Shiju Jose 
>Cc: linux-e...@vger.kernel.org; linux-a...@vger.kernel.org; linux-
>ker...@vger.kernel.org; tony.l...@intel.com; r...@rjwysocki.net;
>james.mo...@arm.com; l...@kernel.org; Linuxarm
>
>Subject: Re: [PATCH 1/1] RAS: Add CPU Correctable Error Collector to isolate
>an erroneous CPU core
>
>On Tue, Sep 01, 2020 at 04:20:54PM +, Shiju Jose wrote:
>> CPU CEC derived the infrastructure of the CEC only and the logic used
>> in the CEC for CE count storage, CE count calculation and page
>> isolation is very unique for the memory pages, which seems cannot be
>> reusable for the CPU CEs.
>
>Oh, because it saves the reported error's PFN and you want to save
>
>[CPU num | error count]
>
>?
Yes. 

>
>Well, you can easily change that by extending the existing CEC to have a
>different storage format for CPU errors, i.e., use a different ce_array which
>gets passed to the functions anyway.
Ok. However the functions such as __find_elem() use
memory specific PFN() and PAGE_SHIFT.

>
>> Also the values set for the parameters such as threshold, time period
>> for the memory errors and CPU errors would be different.
>
>And your implementation with sliding windows is so totally different that it
>warrants the duplication of the code? I don't think so.
>
>You can use the current CEC to do exactly what you wanna do, with the
>decaying and so on.
I will check this.
For CPUs, the corrected error count needs to be checked over a short time
period. Thus old errors outside this period would not be considered and
would be cleared.
It is not clear to me whether, in the current CEC, the count for old errors
outside a time period is excluded from the threshold check or removed.

>
>Because all you wanna do is count the errors a CPU triggered.
>
>However, a CPU can trigger a *lot* of different types of errors.
>You're putting them all in the same basket by doing:
>
>else if (guid_equal(sec_type, &CPER_SEC_PROC_ARM))
>   /* add to CEC */
>
>and only for correctable.
>
>What type of errors get reported in CPER_SEC_PROC_ARM?
According to the ARM Processor CPER definition the error types reported are
Cache Error, TLB Error, Bus Error and micro-architectural Error.

>
>If they're all lumped together and if some functional unit generates a lot of
>errors, instead of disabling that unit only, you'll go and remove the whole
>CPU?
>
A few thoughts on this:
1. I am not sure whether a CPU core would work/perform as normal after
disabling a functional unit.
2. HW support for disabling a functional unit alone may not be available.
3. If it is required to store and retrieve the error count per functional
unit, will the CEC become more complex?

>Doesn't make a whole lot of sense to me.
>
>How about you define what exactly you're trying to solve, maybe give an
>example of a real issue someone is encountering and you're trying to
>address? Because there was never a necessity so far to disable CPUs on
>x86 due to correctable errors. Why is that needed on ARM?
>
This requirement is part of early fault prediction: taking action when a
large number of corrected errors is reported on a CPU core, before it causes
serious faults.
We are mainly looking to disable a CPU core when a large number of L1/L2 cache
corrected errors is reported on it. Can we at least add removing the CPU core
for CPU cache corrected errors, filtering out the other error types?
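
Counting only cache errors, as asked above, could be sketched like this. The enum and struct are hypothetical stand-ins for the four error types named in this thread (Cache, TLB, Bus, micro-architectural); the numeric values and field names are illustrative and are not taken from the kernel's CPER definitions.

```c
#include <stddef.h>
#include <stdint.h>

/* The four ARM processor error types named in the thread (values illustrative). */
enum arm_err_type {
	ARM_ERR_CACHE,
	ARM_ERR_TLB,
	ARM_ERR_BUS,
	ARM_ERR_UARCH,
};

/* Hypothetical, simplified view of one reported error-info entry. */
struct arm_err {
	enum arm_err_type type;
	uint32_t count;
};

/* Sum only the cache errors; every other type is filtered out. */
uint32_t count_cache_ces(const struct arm_err *errs, size_t n)
{
	uint32_t total = 0;

	for (size_t i = 0; i < n; i++) {
		if (errs[i].type == ARM_ERR_CACHE)
			total += errs[i].count;
	}

	return total;
}
```

Filtering at this point means a bus error, which may not originate in the CPU at all, never contributes to the decision to remove a core.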
 
[...]

Thanks,
Shiju

Re: [PATCH 1/1] RAS: Add CPU Correctable Error Collector to isolate an erroneous CPU core

2020-09-09 Thread Borislav Petkov

On Tue, Sep 01, 2020 at 04:20:54PM +, Shiju Jose wrote:
> CPU CEC derived the infrastructure of the CEC only and the logic
> used in the CEC for CE count storage, CE count calculation and page
> isolation is very unique for the memory pages, which seems cannot be
> reusable for the CPU CEs.

Oh, because it saves the reported error's PFN and you want to save

[CPU num | error count]

?

Well, you can easily change that by extending the existing CEC to have a
different storage format for CPU errors, i.e., use a different ce_array
which gets passed to the functions anyway.
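
One way to picture a [CPU num | error count] element is a single 64-bit word with the count in the low bits. The sketch below is purely illustrative: the 16-bit count width and all helper names are assumptions, not something the existing CEC defines.

```c
#include <stdint.h>

/* Hypothetical split: low 16 bits hold the count, the rest the CPU number. */
#define CPU_CNT_BITS 16u
#define CPU_CNT_MASK ((1ull << CPU_CNT_BITS) - 1)

uint64_t cpu_elem_pack(uint32_t cpu, uint32_t count)
{
	return ((uint64_t)cpu << CPU_CNT_BITS) | (count & CPU_CNT_MASK);
}

uint32_t cpu_elem_cpu(uint64_t e)
{
	return (uint32_t)(e >> CPU_CNT_BITS);
}

uint32_t cpu_elem_count(uint64_t e)
{
	return (uint32_t)(e & CPU_CNT_MASK);
}

/* Add n errors, saturating so the count never spills into the CPU field. */
uint64_t cpu_elem_add(uint64_t e, uint32_t n)
{
	uint64_t count = cpu_elem_count(e) + (uint64_t)n;

	if (count > CPU_CNT_MASK)
		count = CPU_CNT_MASK;

	return cpu_elem_pack(cpu_elem_cpu(e), (uint32_t)count);
}
```

Because the CPU number sits in the upper bits, an array of such elements sorts by CPU, much as the PFN-based elements sort by page frame.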

> Also the values set for the parameters such as threshold, time period
> for the memory errors and CPU errors would be different.

And your implementation with sliding windows is so totally different
that it warrants the duplication of the code? I don't think so.

You can use the current CEC to do exactly what you wanna do, with the
decaying and so on.

Because all you wanna do is count the errors a CPU triggered.

However, a CPU can trigger a *lot* of different types of errors.
You're putting them all in the same basket by doing:

else if (guid_equal(sec_type, &CPER_SEC_PROC_ARM))
/* add to CEC */

and only for correctable.

What type of errors get reported in CPER_SEC_PROC_ARM?

If they're all lumped together and if some functional unit generates a
lot of errors, instead of disabling that unit only, you'll go and remove
the whole CPU?

Doesn't make a whole lot of sense to me.

How about you define what exactly you're trying to solve, maybe give an
example of a real issue someone is encountering and you're trying to
address? Because there was never a necessity so far to disable CPUs on
x86 due to correctable errors. Why is that needed on ARM?

> Thus extending cec.c to support CPU CEs would include adding CPU CEC
> specific code for storing error count, isolation etc which I thought
> would result the code less tidy and less readable unless find more
> reusable logic.

Depends on how you design it.

But with what I'm seeing so far, I'm still sceptical this is needed at
all.

-- 
Regards/Gruss,
Boris.

https://people.kernel.org/tglx/notes-about-netiquette

Re: [PATCH 1/1] RAS: Add CPU Correctable Error Collector to isolate an erroneous CPU core

2020-09-01 Thread kernel test robot

Hi Shiju,

Thank you for the patch! Yet something to improve:

[auto build test ERROR on pm/linux-next]
[also build test ERROR on arm64/for-next/core linux/master linus/master v5.9-rc3 next-20200828]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch]

url:    https://github.com/0day-ci/linux/commits/Shiju-Jose/RAS-Add-CPU-Correctable-Error-Collector-to-isolate-an-erroneous-CPU-core/20200901-222704
base:   https://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm.git linux-next
config: x86_64-randconfig-a013-20200901 (attached as .config)
compiler: clang version 12.0.0 (https://github.com/llvm/llvm-project c10e63677f5d20f18010f8f68c631ddc97546f7d)
reproduce (this is a W=1 build):
        wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
        chmod +x ~/bin/make.cross
        # install x86_64 cross compiling tool for clang build
        # apt-get install binutils-x86-64-linux-gnu
        # save the attached .config to linux build tree
        COMPILER_INSTALL_PATH=$HOME/0day COMPILER=clang make.cross ARCH=x86_64

If you fix the issue, kindly add following tag as appropriate
Reported-by: kernel test robot 

All errors (new ones prefixed by >>):

>> drivers/acpi/apei/ghes.c:527:8: error: implicit declaration of function 'get_logical_index' [-Werror,-Wimplicit-function-declaration]
           cpu = get_logical_index(err->mpidr);
                 ^
   1 error generated.

# https://github.com/0day-ci/linux/commit/5d1b166196baa45a5e541b6c2524e28fddd8
git remote add linux-review https://github.com/0day-ci/linux
git fetch --no-tags linux-review Shiju-Jose/RAS-Add-CPU-Correctable-Error-Collector-to-isolate-an-erroneous-CPU-core/20200901-222704
git checkout 5d1b166196baa45a5e541b6c2524e28fddd8
vim +/get_logical_index +527 drivers/acpi/apei/ghes.c

   513  
   514  static void ghes_handle_arm_hw_error(struct acpi_hest_generic_data *gdata)
   515  {
   516          struct cper_sec_proc_arm *err = acpi_hest_get_payload(gdata);
   517          struct cper_arm_err_info *err_info;
   518          int sec_sev;
   519          int cpu, i, ret;
   520  
   521          log_arm_hw_error(err);
   522  
   523          sec_sev = ghes_severity(gdata->error_severity);
   524          if (sec_sev != GHES_SEV_CORRECTED)
   525                  return;
   526  
 > 527          cpu = get_logical_index(err->mpidr);
   528          if (cpu == -EINVAL)
   529                  return;
   530  
   531          err_info = (struct cper_arm_err_info *)(err + 1);
   532          for (i = 0; i < err->err_info_num; i++) {
   533                  ret = cpu_cec_add_ce(cpu, err_info->multiple_error + 1);
   534                  if (ret)
   535                          break;
   536                  err_info += 1;
   537          }
   538  }
   539  
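
The failure reported above shows up on builds (x86_64 and i386 in these two reports) where no declaration of get_logical_index() is visible to ghes.c. A common way to handle that kind of problem is a fallback stub behind a preprocessor guard. The sketch below is a user-space illustration only: DEMO_HAVE_LOGICAL_INDEX is a hypothetical macro standing in for whatever condition guarantees the real helper is declared, and cpu_for_error() is an invented wrapper that mimics the caller in the patch.

```c
#include <errno.h>
#include <stdint.h>

/*
 * DEMO_HAVE_LOGICAL_INDEX is a hypothetical stand-in for the condition
 * under which the real helper is declared (for example, an arm64 build).
 */
#ifdef DEMO_HAVE_LOGICAL_INDEX
int get_logical_index(uint64_t mpidr);
#else
/* Fallback: no MPIDR-to-CPU mapping is available, so report "not found". */
static inline int get_logical_index(uint64_t mpidr)
{
	(void)mpidr;
	return -EINVAL;
}
#endif

/* Caller pattern from the patch: give up when no logical CPU is found. */
int cpu_for_error(uint64_t mpidr)
{
	int cpu = get_logical_index(mpidr);

	return cpu == -EINVAL ? -1 : cpu;
}
```

With the stub in place the caller compiles everywhere and simply does nothing useful on configurations that lack the mapping. Whether a stub or a build-time guard around the whole handler is the better fit is a design choice this thread does not settle.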

---
0-DAY CI Kernel Test Service, Intel Corporation
https://lists.01.org/hyperkitty/list/kbuild-...@lists.01.org


.config.gz
Description: application/gzip

Re: [PATCH 1/1] RAS: Add CPU Correctable Error Collector to isolate an erroneous CPU core

2020-09-01 Thread kernel test robot

Hi Shiju,

Thank you for the patch! Yet something to improve:

[auto build test ERROR on pm/linux-next]
[also build test ERROR on arm64/for-next/core linux/master linus/master v5.9-rc3 next-20200828]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch]

url:    https://github.com/0day-ci/linux/commits/Shiju-Jose/RAS-Add-CPU-Correctable-Error-Collector-to-isolate-an-erroneous-CPU-core/20200901-222704
base:   https://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm.git linux-next
config: i386-allyesconfig (attached as .config)
compiler: gcc-9 (Debian 9.3.0-15) 9.3.0
reproduce (this is a W=1 build):
        # save the attached .config to linux build tree
        make W=1 ARCH=i386

If you fix the issue, kindly add following tag as appropriate
Reported-by: kernel test robot 

All errors (new ones prefixed by >>):

   drivers/acpi/apei/ghes.c: In function 'ghes_handle_arm_hw_error':
>> drivers/acpi/apei/ghes.c:527:8: error: implicit declaration of function 'get_logical_index' [-Werror=implicit-function-declaration]
     527 |  cpu = get_logical_index(err->mpidr);
         |        ^
   cc1: some warnings being treated as errors

# https://github.com/0day-ci/linux/commit/5d1b166196baa45a5e541b6c2524e28fddd8
git remote add linux-review https://github.com/0day-ci/linux
git fetch --no-tags linux-review Shiju-Jose/RAS-Add-CPU-Correctable-Error-Collector-to-isolate-an-erroneous-CPU-core/20200901-222704
git checkout 5d1b166196baa45a5e541b6c2524e28fddd8
vim +/get_logical_index +527 drivers/acpi/apei/ghes.c

   513  
   514  static void ghes_handle_arm_hw_error(struct acpi_hest_generic_data *gdata)
   515  {
   516          struct cper_sec_proc_arm *err = acpi_hest_get_payload(gdata);
   517          struct cper_arm_err_info *err_info;
   518          int sec_sev;
   519          int cpu, i, ret;
   520  
   521          log_arm_hw_error(err);
   522  
   523          sec_sev = ghes_severity(gdata->error_severity);
   524          if (sec_sev != GHES_SEV_CORRECTED)
   525                  return;
   526  
 > 527          cpu = get_logical_index(err->mpidr);
   528          if (cpu == -EINVAL)
   529                  return;
   530  
   531          err_info = (struct cper_arm_err_info *)(err + 1);
   532          for (i = 0; i < err->err_info_num; i++) {
   533                  ret = cpu_cec_add_ce(cpu, err_info->multiple_error + 1);
   534                  if (ret)
   535                          break;
   536                  err_info += 1;
   537          }
   538  }
   539  

---
0-DAY CI Kernel Test Service, Intel Corporation
https://lists.01.org/hyperkitty/list/kbuild-...@lists.01.org


.config.gz
Description: application/gzip

RE: [PATCH 1/1] RAS: Add CPU Correctable Error Collector to isolate an erroneous CPU core

2020-09-01 Thread Shiju Jose

Hi Boris,

>-----Original Message-----
>From: Borislav Petkov [mailto:b...@alien8.de]
>Sent: 01 September 2020 15:36
>To: Shiju Jose 
>Cc: linux-e...@vger.kernel.org; linux-a...@vger.kernel.org; linux-
>ker...@vger.kernel.org; tony.l...@intel.com; r...@rjwysocki.net;
>james.mo...@arm.com; l...@kernel.org; Linuxarm
>
>Subject: Re: [PATCH 1/1] RAS: Add CPU Correctable Error Collector to isolate
>an erroneous CPU core
>
>On Tue, Sep 01, 2020 at 03:01:40PM +0100, Shiju Jose wrote:
>> When the CPU correctable errors reported on an ARM64 CPU core too
>> often, it should be isolated. Add the CPU correctable error collector
>> to store the CPU correctable error count.
>>
>> When the correctable error count for a CPU exceed the threshold value
>> in a short time period, it will try to isolate the CPU core.
>> The threshold value, time period etc are configurable.
>>
>> Implementation details is added in the file.
>>
>> Signed-off-by: Shiju Jose 
>> ---
>>  Documentation/ABI/testing/debugfs-cpu-cec |  22 ++
>>  arch/arm64/ras/Kconfig                    |   8 +
>>  drivers/acpi/apei/ghes.c                  |  30 +-
>>  drivers/ras/Kconfig                       |   1 +
>>  drivers/ras/Makefile                      |   1 +
>>  drivers/ras/cpu_cec.c                     | 393 ++
>
>So instead of adding the ability to collect other error types to the CEC,
>you're duplicating the CEC itself?!
>
>Why?
The CPU CEC reuses only the infrastructure of the CEC. The logic the CEC uses
for CE count storage, CE count calculation and page isolation is specific to
memory pages and does not seem reusable for CPU CEs.
Also, the values set for parameters such as the threshold and time period
would differ between memory errors and CPU errors.
Thus extending cec.c to support CPU CEs would mean adding CPU CEC specific
code for storing the error count, isolation etc., which I thought would make
the code less tidy and less readable unless more reusable logic is found.

>
>--
>Regards/Gruss,
>Boris.
>
>https://people.kernel.org/tglx/notes-about-netiquette

Thanks,
Shiju

Re: [PATCH 1/1] RAS: Add CPU Correctable Error Collector to isolate an erroneous CPU core

2020-09-01 Thread Borislav Petkov

On Tue, Sep 01, 2020 at 03:01:40PM +0100, Shiju Jose wrote:
> When the CPU correctable errors reported on an ARM64 CPU core too often,
> it should be isolated. Add the CPU correctable error collector to
> store the CPU correctable error count.
> 
> When the correctable error count for a CPU exceed the threshold
> value in a short time period, it will try to isolate the CPU core.
> The threshold value, time period etc are configurable.
> 
> Implementation details is added in the file.
> 
> Signed-off-by: Shiju Jose 
> ---
>  Documentation/ABI/testing/debugfs-cpu-cec |  22 ++
>  arch/arm64/ras/Kconfig                    |   8 +
>  drivers/acpi/apei/ghes.c                  |  30 +-
>  drivers/ras/Kconfig                       |   1 +
>  drivers/ras/Makefile                      |   1 +
>  drivers/ras/cpu_cec.c                     | 393 ++

So instead of adding the ability to collect other error types to the
CEC, you're duplicating the CEC itself?!

Why?

-- 
Regards/Gruss,
Boris.

https://people.kernel.org/tglx/notes-about-netiquette

[PATCH 1/1] RAS: Add CPU Correctable Error Collector to isolate an erroneous CPU core

2020-09-01 Thread Shiju Jose

When CPU correctable errors are reported on an ARM64 CPU core too often,
the core should be isolated. Add the CPU correctable error collector to
store the CPU correctable error count.

When the correctable error count for a CPU exceeds the threshold
value in a short time period, it will try to isolate the CPU core.
The threshold value, time period etc. are configurable.

Implementation details are added in the file.

Signed-off-by: Shiju Jose 
---
 Documentation/ABI/testing/debugfs-cpu-cec |  22 ++
 arch/arm64/ras/Kconfig                    |   8 +
 drivers/acpi/apei/ghes.c                  |  30 +-
 drivers/ras/Kconfig                       |   1 +
 drivers/ras/Makefile                      |   1 +
 drivers/ras/cpu_cec.c                     | 393 ++
 drivers/ras/ras.c                         |   3 +
 include/linux/ras.h                       |  16 +
 8 files changed, 471 insertions(+), 3 deletions(-)
 create mode 100644 Documentation/ABI/testing/debugfs-cpu-cec
 create mode 100644 arch/arm64/ras/Kconfig
 create mode 100644 drivers/ras/cpu_cec.c

diff --git a/Documentation/ABI/testing/debugfs-cpu-cec b/Documentation/ABI/testing/debugfs-cpu-cec
new file mode 100644
index 000000000000..31f4e8c902e4
--- /dev/null
+++ b/Documentation/ABI/testing/debugfs-cpu-cec
@@ -0,0 +1,22 @@
+What:		/sys/kernel/debug/ras/cpu_cec/threshold
+Date:		Aug 2020
+Contact:	linux-e...@vger.kernel.org
+Description:	Threshold value for the CPU corrected errors to
+		offline a CPU core. Default value is 5000.
+
+What:		/sys/kernel/debug/ras/cpu_cec/disable
+Date:		Aug 2020
+Contact:	linux-e...@vger.kernel.org
+Description:	Disable the RAS CPU corrected errors collector.
+		1:disable, 0:enable. Enabled by default.
+
+What:		/sys/kernel/debug/ras/cpu_cec/stats
+Date:		Aug 2020
+Contact:	linux-e...@vger.kernel.org
+Description:	Dump the stats of the CPU correctable errors.
+
+What:		/sys/kernel/debug/ras/cpu_cec/time_period
+Date:		Aug 2020
+Contact:	linux-e...@vger.kernel.org
+Description:	Time period, in seconds, for the CPU CEs count
+		threshold check. Default value is 24hrs.
diff --git a/arch/arm64/ras/Kconfig b/arch/arm64/ras/Kconfig
new file mode 100644
index 000000000000..a892245193f0
--- /dev/null
+++ b/arch/arm64/ras/Kconfig
@@ -0,0 +1,8 @@
+# SPDX-License-Identifier: GPL-2.0
+config RAS_CPU_CEC
+	bool "RAS CPU Correctable Error Collector"
+	depends on ARM64 && HOTPLUG_CPU && DEBUG_FS
+	help
+	  Collects the CPU correctable errors. When the CEs count for
+	  a CPU exceeds the threshold, try to isolate the CPU core.
+
diff --git a/drivers/acpi/apei/ghes.c b/drivers/acpi/apei/ghes.c
index 81bf71b10d44..b6ff4866ca32 100644
--- a/drivers/acpi/apei/ghes.c
+++ b/drivers/acpi/apei/ghes.c
@@ -511,6 +511,32 @@ static void ghes_handle_aer(struct acpi_hest_generic_data *gdata)
 #endif
 }
 
+static void ghes_handle_arm_hw_error(struct acpi_hest_generic_data *gdata)
+{
+	struct cper_sec_proc_arm *err = acpi_hest_get_payload(gdata);
+	struct cper_arm_err_info *err_info;
+	int sec_sev;
+	int cpu, i, ret;
+
+	log_arm_hw_error(err);
+
+	sec_sev = ghes_severity(gdata->error_severity);
+	if (sec_sev != GHES_SEV_CORRECTED)
+		return;
+
+	cpu = get_logical_index(err->mpidr);
+	if (cpu == -EINVAL)
+		return;
+
+	err_info = (struct cper_arm_err_info *)(err + 1);
+	for (i = 0; i < err->err_info_num; i++) {
+		ret = cpu_cec_add_ce(cpu, err_info->multiple_error + 1);
+		if (ret)
+			break;
+		err_info += 1;
+	}
+}
+
 static bool ghes_do_proc(struct ghes *ghes,
 			 const struct acpi_hest_generic_status *estatus)
 {
@@ -543,9 +569,7 @@ static bool ghes_do_proc(struct ghes *ghes,
 			ghes_handle_aer(gdata);
 		}
 		else if (guid_equal(sec_type, &CPER_SEC_PROC_ARM)) {
-			struct cper_sec_proc_arm *err = acpi_hest_get_payload(gdata);
-
-			log_arm_hw_error(err);
+			ghes_handle_arm_hw_error(gdata);
 		} else {
 			void *err = acpi_hest_get_payload(gdata);
 
diff --git a/drivers/ras/Kconfig b/drivers/ras/Kconfig
index c2a236f2e846..d2f877e5f7ad 100644
--- a/drivers/ras/Kconfig
+++ b/drivers/ras/Kconfig
@@ -32,5 +32,6 @@ menuconfig RAS
 if RAS
 
 source "arch/x86/ras/Kconfig"
+source "arch/arm64/ras/Kconfig"
 
 endif
diff --git a/drivers/ras/Makefile b/drivers/ras/Makefile
index 6f0404f50107..d6e8c38be3cb 100644
--- a/drivers/ras/Makefile
+++ b/drivers/ras/Makefile
@@ -2,3 +2,4 @@
 obj-$(CONFIG_RAS)  += ras.o
 obj-$(CONFIG_DEBUG_FS) += debugfs.o
 obj-$(CONFIG_RAS_CEC)  += cec.o
+obj-$(CONFIG_RAS_CPU_CEC)  += cpu_cec.o
diff --git a/drivers/ras/cpu_cec.
