RE: [PATCH 1/1] RAS: Add CPU Correctable Error Collector to isolate an erroneous CPU core

2020-10-02 Thread Shiju Jose
>Subject: Re: [PATCH 1/1] RAS: Add CPU Correctable Error Collector to isolate
>an erroneous CPU core
>
>On Thu, Oct 01, 2020 at 06:16:03PM +0100, James Morse wrote:
>> If the corrected-count is available somewhere, can't this policy be

Re: [PATCH 1/1] RAS: Add CPU Correctable Error Collector to isolate an erroneous CPU core

2020-10-01 Thread Borislav Petkov
On Thu, Oct 01, 2020 at 06:16:03PM +0100, James Morse wrote:
> If the corrected-count is available somewhere, can't this policy be
> made in user-space?

You mean rasdaemon goes and offlines CPUs when certain thresholds are
reached?

Sure. It would be much more flexible too.

--
Regards/Gruss,
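A minimal sketch of what such a user-space policy could look like, assuming the daemon already has a per-CPU corrected-error count from somewhere. The sysfs file /sys/devices/system/cpu/cpuN/online is the standard CPU hotplug control; the function names and the threshold comparison are hypothetical and are not taken from rasdaemon or from the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

/* Build the sysfs hotplug path for a CPU. Returns 0 on success, -1 if
 * the buffer is too small. (Hypothetical helper for illustration.) */
static int cpu_online_path(char *buf, size_t len, unsigned int cpu)
{
    int n;

    assert(buf != NULL);
    n = snprintf(buf, len, "/sys/devices/system/cpu/cpu%u/online", cpu);
    return (n > 0 && (size_t)n < len) ? 0 : -1;
}

/* Hypothetical policy: offline once the count exceeds the threshold. */
static bool should_offline(unsigned int ce_count, unsigned int threshold)
{
    return ce_count > threshold;
}

/* Ask the kernel to offline a CPU by writing "0" to its online file.
 * Needs root; fails on CPUs that cannot be hot-unplugged. */
static int offline_cpu(unsigned int cpu)
{
    char path[64];
    FILE *f;
    int ret = 0;

    if (cpu_online_path(path, sizeof(path), cpu))
        return -1;
    f = fopen(path, "w");
    if (!f)
        return -1;
    if (fputs("0", f) == EOF)
        ret = -1;
    if (fclose(f) == EOF)
        ret = -1;
    return ret;
}
```

A daemon would call should_offline() each time it updates a CPU's count and call offline_cpu() when it returns true. Keeping the threshold in user space means it can be changed without touching the kernel.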

Re: [PATCH 1/1] RAS: Add CPU Correctable Error Collector to isolate an erroneous CPU core

2020-10-01 Thread James Morse
Hi guys,

On 17/09/2020 09:40, Borislav Petkov wrote:
> On Thu, Sep 10, 2020 at 03:29:56PM +, Shiju Jose wrote:
> You can't know what exactly you wanna do if you don't have a use case
> you're trying to address.
>
>> According to the ARM Processor CPER definition the error types
>> reported

Re: [PATCH 1/1] RAS: Add CPU Correctable Error Collector to isolate an erroneous CPU core

2020-09-17 Thread Borislav Petkov
On Thu, Sep 10, 2020 at 03:29:56PM +, Shiju Jose wrote:
> Ok. However the functions such as __find_elem() use
> memory specific PFN() and PAGE_SHIFT.

You can add your version find_elem_cpu() or so.

You can do this with a set of function pointers which belong to the
different type of storage
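The function-pointer idea can be sketched roughly as below. This is a simplified user-space illustration, not the code in drivers/ras/cec.c: struct ce_ops, the generic find_elem(), and the CPU element layout (CPU id above a 16-bit count) are all assumptions made for the example. Only the memory-side shift mirrors the PFN()/PAGE_SHIFT usage mentioned in the quote, here with a 4K page assumed.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define PAGE_SHIFT_4K 12   /* assumed 4K pages, memory-side key shift */
#define CPU_KEY_SHIFT 16   /* hypothetical: CPU id stored above bit 15 */

/* Per-storage-type operations: how to get the sort key of an element. */
struct ce_ops {
    uint64_t (*key)(uint64_t elem);
};

static uint64_t page_key(uint64_t elem) { return elem >> PAGE_SHIFT_4K; }
static uint64_t cpu_key(uint64_t elem)  { return elem >> CPU_KEY_SHIFT; }

static const struct ce_ops page_ops = { .key = page_key };
static const struct ce_ops cpu_ops  = { .key = cpu_key };

/* Sorted array of packed elements plus the ops that interpret them. */
struct ce_array {
    const uint64_t *elems;
    size_t n;
    const struct ce_ops *ops;
};

/* Binary search by key. Returns 0 and the index in *to if found,
 * otherwise -1 and the insertion position in *to. */
static int find_elem(const struct ce_array *ca, uint64_t key, size_t *to)
{
    size_t lo = 0, hi;

    assert(ca != NULL && ca->ops != NULL);
    hi = ca->n;
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        uint64_t k = ca->ops->key(ca->elems[mid]);

        if (k < key) {
            lo = mid + 1;
        } else if (k > key) {
            hi = mid;
        } else {
            if (to)
                *to = mid;
            return 0;
        }
    }
    if (to)
        *to = lo;
    return -1;
}
```

With this shape the search code is shared and only the key extraction differs per storage type, which is the point of routing it through an ops table.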

RE: [PATCH 1/1] RAS: Add CPU Correctable Error Collector to isolate an erroneous CPU core

2020-09-10 Thread Shiju Jose
>Subject: Re: [PATCH 1/1] RAS: Add CPU Correctable Error Collector to isolate
>an erroneous CPU core
>
>On Tue, Sep 01, 2020 at 04:20:54PM +, Shiju Jose wrote:
>> CPU CEC derived the infrastructure of the CEC only and the logic u

Re: [PATCH 1/1] RAS: Add CPU Correctable Error Collector to isolate an erroneous CPU core

2020-09-09 Thread Borislav Petkov
On Tue, Sep 01, 2020 at 04:20:54PM +, Shiju Jose wrote:
> CPU CEC derived the infrastructure of the CEC only and the logic
> used in the CEC for CE count storage, CE count calculation and page
> isolation is very unique for the memory pages, which seems cannot be
> reusable for the CPU CEs.

Re: [PATCH 1/1] RAS: Add CPU Correctable Error Collector to isolate an erroneous CPU core

2020-09-01 Thread kernel test robot
Hi Shiju,

Thank you for the patch! Yet something to improve:

[auto build test ERROR on pm/linux-next]
[also build test ERROR on arm64/for-next/core linux/master linus/master v5.9-rc3 next-20200828]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting

Re: [PATCH 1/1] RAS: Add CPU Correctable Error Collector to isolate an erroneous CPU core

2020-09-01 Thread kernel test robot
Hi Shiju,

Thank you for the patch! Yet something to improve:

[auto build test ERROR on pm/linux-next]
[also build test ERROR on arm64/for-next/core linux/master linus/master v5.9-rc3 next-20200828]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting

RE: [PATCH 1/1] RAS: Add CPU Correctable Error Collector to isolate an erroneous CPU core

2020-09-01 Thread Shiju Jose
>Subject: Re: [PATCH 1/1] RAS: Add CPU Correctable Error Collector to isolate
>an erroneous CPU core
>
>On Tue, Sep 01, 2020 at 03:01:40PM +0100, Shiju Jose wrote:
>> When the CPU correctable errors reported on an ARM64 CPU core too

Re: [PATCH 1/1] RAS: Add CPU Correctable Error Collector to isolate an erroneous CPU core

2020-09-01 Thread Borislav Petkov
On Tue, Sep 01, 2020 at 03:01:40PM +0100, Shiju Jose wrote:
> When the CPU correctable errors reported on an ARM64 CPU core too often,
> it should be isolated. Add the CPU correctable error collector to
> store the CPU correctable error count.
>
> When the correctable error count for a CPU exceed

[PATCH 1/1] RAS: Add CPU Correctable Error Collector to isolate an erroneous CPU core

2020-09-01 Thread Shiju Jose
When CPU correctable errors are reported on an ARM64 CPU core too often,
the core should be isolated. Add the CPU correctable error collector to
store the CPU correctable error count.

When the correctable error count for a CPU exceeds the threshold value
within a short time period, it will try to isolate
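The counting policy described in the patch text (a per-CPU count that triggers isolation when it exceeds a threshold within a time period) can be sketched as follows. This is a user-space illustration under stated assumptions, not the patch's implementation: the struct names, the fixed MAX_CPUS, and the simple "restart the window when it expires" rule are all hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define MAX_CPUS 8   /* hypothetical, kept small for illustration */

/* Hypothetical per-CPU record of correctable errors (CEs). */
struct cpu_ce_rec {
    uint64_t window_start;   /* time the current window began, in seconds */
    unsigned int count;      /* CEs seen in the current window */
};

struct cpu_cec {
    struct cpu_ce_rec rec[MAX_CPUS];
    unsigned int threshold;  /* isolate when count exceeds this */
    uint64_t period;         /* window length, in seconds */
};

static void cpu_cec_init(struct cpu_cec *c, unsigned int threshold,
                         uint64_t period)
{
    for (unsigned int i = 0; i < MAX_CPUS; i++) {
        c->rec[i].window_start = 0;
        c->rec[i].count = 0;
    }
    c->threshold = threshold;
    c->period = period;
}

/* Record one CE for a CPU at time 'now'. Returns true when the caller
 * should try to isolate that CPU; the count is reset in that case. */
static bool cpu_cec_add(struct cpu_cec *c, unsigned int cpu, uint64_t now)
{
    struct cpu_ce_rec *r;

    assert(cpu < MAX_CPUS);
    r = &c->rec[cpu];

    /* Start a fresh window on the first CE or once the old one expired. */
    if (r->count == 0 || now - r->window_start >= c->period) {
        r->window_start = now;
        r->count = 0;
    }

    r->count++;
    if (r->count > c->threshold) {
        r->count = 0;
        return true;
    }
    return false;
}
```

With a threshold of 2 and a period of 10 seconds, three CEs on the same CPU at t=0, 1, 2 trigger isolation on the third, while CEs spaced further apart than the period never accumulate.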