Re: [PATCH] drm/amdgpu: Add kernel parameter to manage memory error handling.

Tom St Denis Thu, 13 Apr 2017 09:38:57 -0700

On 13/04/17 11:38 AM, Panariti, David wrote:

+ Vilas

-----Original Message-----
From: Deucher, Alexander
Sent: Wednesday, April 12, 2017 9:29 PM
To: 'Michel Dänzer' <mic...@daenzer.net>; Panariti, David
<david.panar...@amd.com>
Cc: amd-gfx@lists.freedesktop.org
Subject: RE: [PATCH] drm/amdgpu: Add kernel parameter to manage
memory error handling.

-----Original Message-----
From: Michel Dänzer [mailto:mic...@daenzer.net]
Sent: Wednesday, April 12, 2017 9:17 PM
To: Panariti, David
Cc: Deucher, Alexander; amd-gfx@lists.freedesktop.org
Subject: Re: [PATCH] drm/amdgpu: Add kernel parameter to manage

memory

error handling.

On 13/04/17 02:38 AM, Panariti, David wrote:

From: Michel Dänzer [mailto:mic...@daenzer.net]

@@ -212,6 +213,9 @@ module_param_named(cg_mask,

amdgpu_cg_mask, uint,

0444);  MODULE_PARM_DESC(pg_mask, "Powergating flags mask (0 =

disable

power gating)");  module_param_named(pg_mask,

amdgpu_pg_mask,

uint,

0444);

+MODULE_PARM_DESC(ecc_mask, "ECC/EDC flags mask (0 = disable
+ECC/EDC)");


"0 = disable ECC/EDC" implies that they're enabled by default? Was
that already the case before this patch?


[davep] Yes it was, and there was actually a problem in some cases
where the CZ would hang which is why I added the param. I was
wondering if it would be better to default to them being off, but I
wasn't sure how important maintaining original behavior is
considered. Actually, there are some bugs in the workaround function
as it is, so it really should default to off.


I agree. There have been some bug reports about Carrizo hangs, I
wonder if any of those might be related to this.


Only the embedded SKUs support EDC.  If they are embedded parts, it could
be related.


[davep]  Sorry for the length, but I wanted all of the details out there for 
the most informed decision.

Another thing is that they can go from not hanging to hanging for no 
discernable reason.
The KIQ changes, however, have seemed to have fixed it.  For one chip and a few 
tens of reboots.

There is also the issue of improperly initialized *gpr registers.
  From the doc:
"Due to a hardware condition whereby some shader instructions utilize uninitialized 
SGPRs and/or VGPRs, the S/VPGR memories must be initialized prior to EDC operation, as, 
not doing so will cause erroneous counts to show up in the EDC counters."
I seem to recall Vilas saying it is the poison that isn't reset properly.  But 
I'm not sure about the actual register contents.
Vilas?

I suggest, at a minimum,  checking for cz *and* the EDC fuse.  If so, 
explicitly disable EDC, run the shaders to zero the *gprs, leave EDC disabled, 
and merge in the existing new code to zero all of the counters.
All of the code exists, in one place or another.

The parameter to enable/disable probably won't be needed until EDC is fully 
implemented.
However, EDC can be enabled in a way that simply allows it to count errors.  
This has never caused a hang for me.
The counts are useful for reliability research.  This was one of the goals of 
the original EDC task.
A umr script could be written (by interested parties) to read the counters and 
in fact enable the EDC counters.
I think this should be done if anyone is interested in the numbers.
Vilas?  Any R&R work left in this area?  Do you think customers would be 
interested in doing this on their own?

For the special keeners we could add EDC counters to umr's --top andthen in theory it'll be included in the log output.

If you can send me info on how to enable/read the counters I can take alook at it.


Cheers,
Tom
_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

Re: [PATCH] drm/amdgpu: Add kernel parameter to manage memory error handling.

Reply via email to