RE: [PATCH v3 0/8] AMD64 EDAC fixes

2019-08-26 Thread Ghannam, Yazen
> -Original Message-
> From: linux-edac-ow...@vger.kernel.org  On 
> Behalf Of Borislav Petkov
> Sent: Monday, August 26, 2019 9:59 AM
> To: Ghannam, Yazen 
> Cc: Adam Borowski ; linux-e...@vger.kernel.org; 
> linux-kernel@vger.kernel.org
> Subject: Re: [PATCH v3 0/8] AMD64 EDAC fixes
> 
> On Mon, Aug 26, 2019 at 02:19:18PM +, Ghannam, Yazen wrote:
> > I was tracking down the failure with ECC disabled, and that seems to be it.
> >
> > So I think we should return 0 "if (!edac_has_mcs())", because we'd only get
> > there if ECC is disabled on all nodes and there wasn't some other 
> > initialization
> > error.
> >
> > I'll send a patch for this soon.
> >
> > Adam, would you mind testing this patch?
> 
> You can't return 0 when ECC is disabled on all nodes because then the
> driver remains loaded without driving anything. That silly userspace
> needs to understand that ENODEV means "stop trying to load this driver".
> 

Yes, you're right.

I'll try and track down the interaction here between userspace and the module.
Please let me know if you have any suggestions.

Thanks,
Yazen


Re: [PATCH v3 0/8] AMD64 EDAC fixes

2019-08-26 Thread Borislav Petkov
On Mon, Aug 26, 2019 at 02:19:18PM +, Ghannam, Yazen wrote:
> I was tracking down the failure with ECC disabled, and that seems to be it.
>
> So I think we should return 0 "if (!edac_has_mcs())", because we'd only get
> there if ECC is disabled on all nodes and there wasn't some other 
> initialization
> error.
>
> I'll send a patch for this soon.
>
> Adam, would you mind testing this patch?

You can't return 0 when ECC is disabled on all nodes because then the
driver remains loaded without driving anything. That silly userspace
needs to understand that ENODEV means "stop trying to load this driver".

-- 
Regards/Gruss,
Boris.

Good mailing practices for 400: avoid top-posting and trim the reply.


RE: [PATCH v3 0/8] AMD64 EDAC fixes

2019-08-26 Thread Ghannam, Yazen
> -Original Message-
> From: linux-edac-ow...@vger.kernel.org  On 
> Behalf Of Borislav Petkov
> Sent: Friday, August 23, 2019 10:38 AM
> To: Ghannam, Yazen 
> Cc: Adam Borowski ; linux-e...@vger.kernel.org; 
> linux-kernel@vger.kernel.org
> Subject: Re: [PATCH v3 0/8] AMD64 EDAC fixes
> 
> On Fri, Aug 23, 2019 at 03:28:59PM +, Ghannam, Yazen wrote:
> > Boris, Do you think it'd be appropriate to change the return values
> > for some cases?
> >
> > For example, ECC disabled is a hardware configuration. This doesn't
> > mean that the module failed any operations in this case.
> >
> > In other words, the module checks for a feature. If the feature is not
> > present, then return without failure (and maybe give a message).
> 
> That makes sense but AFAICT if probe_one_instance() sees that ECC is not
> enabled, it returns 0.
> 
> The "if (!edac_has_mcs())" check later is to verify that at least once
> instance was loaded successfully and, if not, then return an error.
> 
> So where does it return failure?
> 

I was tracking down the failure with ECC disabled, and that seems to be it.

So I think we should return 0 "if (!edac_has_mcs())", because we'd only get
there if ECC is disabled on all nodes and there wasn't some other initialization
error.

I'll send a patch for this soon.

Adam, would you mind testing this patch?

Thanks,
Yazen


Re: [PATCH v3 0/8] AMD64 EDAC fixes

2019-08-23 Thread Borislav Petkov
On Fri, Aug 23, 2019 at 03:28:59PM +, Ghannam, Yazen wrote:
> Boris, Do you think it'd be appropriate to change the return values
> for some cases?
>
> For example, ECC disabled is a hardware configuration. This doesn't
> mean that the module failed any operations in this case.
>
> In other words, the module checks for a feature. If the feature is not
> present, then return without failure (and maybe give a message).

That makes sense but AFAICT if probe_one_instance() sees that ECC is not
enabled, it returns 0.

The "if (!edac_has_mcs())" check later is to verify that at least once
instance was loaded successfully and, if not, then return an error.

So where does it return failure?

-- 
Regards/Gruss,
Boris.

Good mailing practices for 400: avoid top-posting and trim the reply.


RE: [PATCH v3 0/8] AMD64 EDAC fixes

2019-08-23 Thread Ghannam, Yazen
> -Original Message-
> From: linux-edac-ow...@vger.kernel.org  On 
> Behalf Of Ghannam, Yazen
> Sent: Thursday, August 22, 2019 1:54 PM
> To: Adam Borowski 
> Cc: linux-e...@vger.kernel.org; linux-kernel@vger.kernel.org; b...@alien8.de
> Subject: RE: [PATCH v3 0/8] AMD64 EDAC fixes
> 
...
> I wonder if the module is being loaded multiple times. I'll try to reproduce 
> this and track it down.
> 

I was able to reproduce a similar failure. I do see that the module is being 
loaded multiple times on failure.

Here's a call trace from one dump_stack() output:
[  +0.004964] CPU: 132 PID: 2680 Comm: systemd-udevd Not tainted 
4.20.0-edac-debug+ #36
[  +0.009802] Call Trace:
[  +0.002727]  dump_stack+0x63/0x85
[  +0.003696]  amd64_edac_init+0x2163/0x3000 [amd64_edac_mod]
[  +0.006216]  ? __wake_up+0x13/0x20
[  +0.003790]  ? 0xc120d000
[  +0.003694]  do_one_initcall+0x4a/0x1c9
[  +0.004277]  ? _cond_resched+0x19/0x40
[  +0.004178]  ? kmem_cache_alloc_trace+0x15c/0x1d0
[  +0.005244]  do_init_module+0x5f/0x216
[  +0.004180]  load_module+0x21d5/0x2ac0
[  +0.004179]  ? wait_woken+0x80/0x80
[  +0.003889]  __do_sys_finit_module+0xfc/0x120
[  +0.004858]  ? __do_sys_finit_module+0xfc/0x120
[  +0.005052]  __x64_sys_finit_module+0x1a/0x20
[  +0.004857]  do_syscall_64+0x5a/0x120
[  +0.004081]  entry_SYSCALL_64_after_hwframe+0x44/0xa9


So it seems that userspace (systemd-udevd) keeps trying to load the module. I'm 
not sure how to prevent this from within the module.

Boris,
Do you think it'd be appropriate to change the return values for some cases?

For example, ECC disabled is a hardware configuration. This doesn't mean that 
the module failed any operations in this case.

In other words, the module checks for a feature. If the feature is not present, 
then return without failure (and maybe give a message).

Thanks,
Yazen


RE: [PATCH v3 0/8] AMD64 EDAC fixes

2019-08-22 Thread Ghannam, Yazen
> -Original Message-
> From: linux-edac-ow...@vger.kernel.org  On 
> Behalf Of Adam Borowski
> Sent: Wednesday, August 21, 2019 7:50 PM
> To: Ghannam, Yazen 
> Cc: linux-e...@vger.kernel.org; linux-kernel@vger.kernel.org; b...@alien8.de
> Subject: Re: [PATCH v3 0/8] AMD64 EDAC fixes
> 
> On Wed, Aug 21, 2019 at 11:59:53PM +, Ghannam, Yazen wrote:
> > I've also added RFC patches to avoid the "ECC disabled" message for
> > nodes without memory. I haven't fully tested these, but I wanted to get
> > your thoughts. Here's an earlier discussion:
> > https://lkml.kernel.org/r/20180321191335.7832-1-yazen.ghan...@amd.com
> 
> While you're editing that code, could you please also cut the spam if ECC is
> actually disabled?  For example, a 2990WX with non-ECC RAM gets 1024 lines;
> 64 copies of:
> 
> [8.186164] EDAC amd64: Node 0: DRAM ECC disabled.
> [8.188364] EDAC amd64: ECC disabled in the BIOS or no ECC capability, 
> module will not load.
> Either enable ECC checking or force module loading by setting 
> 'ecc_enable_override'.
> (Note that use of the override may cause unknown side 
> effects.)
> [8.194762] EDAC amd64: Node 1: DRAM ECC disabled.
> [8.196307] EDAC amd64: ECC disabled in the BIOS or no ECC capability, 
> module will not load.
> Either enable ECC checking or force module loading by setting 
> 'ecc_enable_override'.
> (Note that use of the override may cause unknown side 
> effects.)
> [8.199840] EDAC amd64: Node 2: DRAM ECC disabled.
> [8.200963] EDAC amd64: ECC disabled in the BIOS or no ECC capability, 
> module will not load.
> Either enable ECC checking or force module loading by setting 
> 'ecc_enable_override'.
> (Note that use of the override may cause unknown side 
> effects.)
> [8.204326] EDAC amd64: Node 3: DRAM ECC disabled.
> [8.205436] EDAC amd64: ECC disabled in the BIOS or no ECC capability, 
> module will not load.
> Either enable ECC checking or force module loading by setting 
> 'ecc_enable_override'.
> (Note that use of the override may cause unknown side 
> effects.)
> 

I wonder if the module is being loaded multiple times. I'll try to reproduce 
this and track it down.

Thanks for testing and reporting this issue.

-Yazen


Re: [PATCH v3 0/8] AMD64 EDAC fixes

2019-08-22 Thread Borislav Petkov
On Thu, Aug 22, 2019 at 11:46:07AM +0200, Adam Borowski wrote:
> Yeah, some of messages are no longer emitted for memory-less nodes (NUMA 1
> and 3).  Your patch set also overhauls the messages.

Not my patchset - Yazen's.

> But, the amount of redundant messages I'm complaining about has actually
> increased:

He's working on that - that still needs some love.

Thanks for testing.

-- 
Regards/Gruss,
Boris.

Good mailing practices for 400: avoid top-posting and trim the reply.


Re: [PATCH v3 0/8] AMD64 EDAC fixes

2019-08-22 Thread Adam Borowski
On Thu, Aug 22, 2019 at 10:35:48AM +0200, Borislav Petkov wrote:
> On Thu, Aug 22, 2019 at 02:50:20AM +0200, Adam Borowski wrote:
> > While you're editing that code, could you please also cut the spam if ECC is
> > actually disabled?  For example, a 2990WX with non-ECC RAM gets 1024 lines;
> 
> Patch is in there. I'll give you extra points if you spot it.

Yeah, some of messages are no longer emitted for memory-less nodes (NUMA 1
and 3).  Your patch set also overhauls the messages.

But, the amount of redundant messages I'm complaining about has actually
increased:

dmesg|grep EDAC|cut -c 16-|sort|uniq -c
256 EDAC MC: UMC0 chip selects:
256 EDAC MC: UMC1 chip selects:
  1 EDAC MC: Ver: 3.0.0
128 EDAC amd64: ECC disabled in the BIOS or no ECC capability, module will 
not load.
^ three lines each
 64 EDAC amd64: F17h detected (node 0).
 64 EDAC amd64: F17h detected (node 1).
 64 EDAC amd64: F17h detected (node 2).
 64 EDAC amd64: F17h detected (node 3).
512 EDAC amd64: MC: 0: 0MB 1: 0MB
256 EDAC amd64: MC: 2: 0MB 3: 0MB
256 EDAC amd64: MC: 2:  8192MB 3: 0MB
 64 EDAC amd64: Node 0: DRAM ECC disabled.
 64 EDAC amd64: Node 2: DRAM ECC disabled.
256 EDAC amd64: using x4 syndromes.

(Full dmesg at http://ix.io/1T1o)

While on 5.3-rc5 without the patchset I get:

  1 EDAC MC: Ver: 3.0.0
256 EDAC amd64: ECC disabled in the BIOS or no ECC capability, module will 
not load.
^ three lines each
 64 EDAC amd64: Node 0: DRAM ECC disabled.
 64 EDAC amd64: Node 1: DRAM ECC disabled.
 64 EDAC amd64: Node 2: DRAM ECC disabled.
 64 EDAC amd64: Node 3: DRAM ECC disabled.

So I wonder if we could deduplicate those.


Meow!
-- 
⢀⣴⠾⠻⢶⣦⠀ 
⣾⠁⢠⠒⠀⣿⡁ A dumb species has no way to open a tuna can.
⢿⡄⠘⠷⠚⠋⠀ A smart species invents a can opener.
⠈⠳⣄ A master species delegates.


Re: [PATCH v3 0/8] AMD64 EDAC fixes

2019-08-22 Thread Borislav Petkov
On Thu, Aug 22, 2019 at 02:50:20AM +0200, Adam Borowski wrote:
> While you're editing that code, could you please also cut the spam if ECC is
> actually disabled?  For example, a 2990WX with non-ECC RAM gets 1024 lines;

Patch is in there. I'll give you extra points if you spot it.

> Meow!

Wuff!

-- 
Regards/Gruss,
Boris.

Good mailing practices for 400: avoid top-posting and trim the reply.


Re: [PATCH v3 0/8] AMD64 EDAC fixes

2019-08-21 Thread Adam Borowski
On Wed, Aug 21, 2019 at 11:59:53PM +, Ghannam, Yazen wrote:
> I've also added RFC patches to avoid the "ECC disabled" message for
> nodes without memory. I haven't fully tested these, but I wanted to get
> your thoughts. Here's an earlier discussion:
> https://lkml.kernel.org/r/20180321191335.7832-1-yazen.ghan...@amd.com

While you're editing that code, could you please also cut the spam if ECC is
actually disabled?  For example, a 2990WX with non-ECC RAM gets 1024 lines;
64 copies of:

[8.186164] EDAC amd64: Node 0: DRAM ECC disabled.
[8.188364] EDAC amd64: ECC disabled in the BIOS or no ECC capability, 
module will not load.
Either enable ECC checking or force module loading by setting 
'ecc_enable_override'.
(Note that use of the override may cause unknown side effects.)
[8.194762] EDAC amd64: Node 1: DRAM ECC disabled.
[8.196307] EDAC amd64: ECC disabled in the BIOS or no ECC capability, 
module will not load.
Either enable ECC checking or force module loading by setting 
'ecc_enable_override'.
(Note that use of the override may cause unknown side effects.)
[8.199840] EDAC amd64: Node 2: DRAM ECC disabled.
[8.200963] EDAC amd64: ECC disabled in the BIOS or no ECC capability, 
module will not load.
Either enable ECC checking or force module loading by setting 
'ecc_enable_override'.
(Note that use of the override may cause unknown side effects.)
[8.204326] EDAC amd64: Node 3: DRAM ECC disabled.
[8.205436] EDAC amd64: ECC disabled in the BIOS or no ECC capability, 
module will not load.
Either enable ECC checking or force module loading by setting 
'ecc_enable_override'.
(Note that use of the override may cause unknown side effects.)


Meow!
-- 
⢀⣴⠾⠻⢶⣦⠀
⣾⠁⢠⠒⠀⣿⡁
⢿⡄⠘⠷⠚⠋  The root of a real enemy is an imaginary friend.
⠈⠳⣄