Re: [RFC PATCH 0/7] RAS/CEC: Extend CEC for errors count check on short time period

2020-10-07 Thread James Morse
Hi Shiju,

On 06/10/2020 17:13, Shiju Jose wrote:
[...]
> Please find following pseudo code we added for the kernel side to make sure
> we correctly understand your suggestions.
>
> 1. Create edac device and edac device sysfs entries for the online CPU caches.
> /drivers/edac/edac_device.c
>

RE: [RFC PATCH 0/7] RAS/CEC: Extend CEC for errors count check on short time period

2020-10-06 Thread Shiju Jose
>Hi Shiju,
>
>On 02/10/2020 16:38, Shiju Jose wrote:
>>> -Original Message-

Re: [RFC PATCH 0/7] RAS/CEC: Extend CEC for errors count check on short time period

2020-10-02 Thread Borislav Petkov
On Fri, Oct 02, 2020 at 06:33:17PM +0100, James Morse wrote:
> > I think adding the CPU error collection to the kernel
> > has the following advantages,
> > 1. The CPU error collection and isolation would not be active if the
> > rasdaemon stopped running or not running on a machine.

Re: [RFC PATCH 0/7] RAS/CEC: Extend CEC for errors count check on short time period

2020-10-02 Thread James Morse
>> On Fri, Oct 02, 2020 at 01:22:28PM +0100, Shiju

RE: [RFC PATCH 0/7] RAS/CEC: Extend CEC for errors count check on short time period

2020-10-02 Thread Luck, Tony
> Because from my x86 CPUs limited experience, the cache arrays are mostly
> fine and errors reported there are not something that happens very
> frequently so we don't even need to collect and count those.

On Intel X86 we leave the counting and threshold decisions about cache health to the

RE: [RFC PATCH 0/7] RAS/CEC: Extend CEC for errors count check on short time period

2020-10-02 Thread Shiju Jose
>On Fri, Oct 02, 2020 at 01:22:28PM +0100, Shiju Jose wrote:
>> Open Questions based on the feedback from Boris, 1. ARM process

Re: [RFC PATCH 0/7] RAS/CEC: Extend CEC for errors count check on short time period

2020-10-02 Thread Borislav Petkov
On Fri, Oct 02, 2020 at 01:22:28PM +0100, Shiju Jose wrote:
> Open Questions based on the feedback from Boris,
> 1. ARM processor error types are cache/TLB/bus errors.
>    [Reference N2.4.4.1 ARM Processor Error Information UEFI Spec v2.8]
> Any of the above error types should not be consider for

[RFC PATCH 0/7] RAS/CEC: Extend CEC for errors count check on short time period

2020-10-02 Thread Shiju Jose
On ARM64 hardware platforms, for example our Kunpeng platforms, CPU L1/L2 cache corrected errors (CEs) are reported in the ARM processor error section. It is not unlikely for CPU CEs to be reported too often, and that CPU core may then need to be isolated to prevent it from leading to more serious
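
To make the idea in the subject line concrete, the following is a minimal sketch of counting corrected errors per CPU over a short time period and flagging a core once a threshold is reached. It is not the code from the patch series, which is not shown in this listing. The class name `CeWindowCounter`, its parameters `threshold` and `window_secs`, and the `record` method are all hypothetical names chosen for illustration, and the sliding-window policy itself is an assumption.

```python
from collections import defaultdict, deque


class CeWindowCounter:
    """Hypothetical per-CPU corrected-error counter over a sliding time window."""

    def __init__(self, threshold, window_secs):
        # threshold: number of CEs inside the window that flags a CPU (assumed policy)
        # window_secs: length of the "short time period" in seconds (assumed unit)
        self.threshold = threshold
        self.window_secs = window_secs
        self._events = defaultdict(deque)  # cpu id -> timestamps of recent CEs

    def record(self, cpu, now):
        """Record one CE for `cpu` at time `now`; return True if the CPU
        has reached the threshold within the current window."""
        events = self._events[cpu]
        events.append(now)
        # Drop timestamps that have fallen out of the window.
        while events and now - events[0] > self.window_secs:
            events.popleft()
        return len(events) >= self.threshold


counter = CeWindowCounter(threshold=3, window_secs=10.0)
burst = [counter.record(0, t) for t in (0.0, 1.0, 2.0)]      # three CEs in 2 s
sparse = [counter.record(1, t) for t in (0.0, 20.0, 40.0)]   # three CEs, 20 s apart
```

In this sketch CPU 0 is flagged on its third error because all three fall inside the 10 second window, while CPU 1 is never flagged because older errors age out before the next one arrives. What to do with a flagged core (for example, taking it offline) is left out, since that decision is exactly what the thread is debating.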