Re: TDR and VRAM lost handling in KMD (v2)

Nicolai Hähnle Thu, 12 Oct 2017 02:11:52 -0700

On 12.10.2017 10:49, Christian König wrote:

However, !guilty && ctx->reset_counter != adev->reset_counter does notimply that the context was lost.
The way I understand it, we should return AMDGPU_CTX_INNOCENT_RESET if!guilty && ctx->vram_lost_counter != adev->vram_lost_counter.
As far as I understand it, the case of !guilty && ctx->reset_counter!= adev->reset_counter && ctx->vram_lost_counter ==adev->vram_lost_counter should return AMDGPU_CTX_NO_RESET, because aGPU reset occurred, but it didn't affect our context.
I disagree on that.
AMDGPU_CTX_INNOCENT_RESET just means what it does currently, there was areset but we haven't been causing it.
That the OpenGL extension is specified otherwise is unfortunate, but Ithink we shouldn't use that for the kernel interface here.

Two counterpoints:

1. Why should any application care that there was a reset while it wasidle? The OpenGL behavior is what makes sense.

2. AMDGPU_CTX_INNOCENT_RESET doesn't actually mean anything todaybecause we never return it :)

amdgpu_ctx_query only ever returns AMDGPU_CTX_UNKNOWN_RESET, which is inline with the OpenGL spec: we're conservatively returning that a resethappened because we don't know whether the context was affected, and wereturn UNKNOWN because we also don't know whether the context was guiltyor not.

Returning AMDGPU_CTX_NO_RESET in the case of !guilty &&ctx->vram_lost_counter == adev->vram_lost_counter is simply a refinementand improvement of the current, overly conservative behavior.


Cheers,
Nicolai

Regards,
Christian.

Am 12.10.2017 um 10:44 schrieb Nicolai Hähnle:
I think we should stick to the plan where kernel contexts stay "stuck"after a GPU reset. This is the most robust behavior for the kernel.
Even if the OpenGL spec says that an OpenGL context can be re-usedwithout destroying and re-creating it, the UMD can take care ofre-creating the kernel context.
This means amdgpu_ctx_query should *not* reset ctx->reset_counter.

Cheers,
Nicolai


On 12.10.2017 10:41, Nicolai Hähnle wrote:
Hi Monk,
Thanks for the summary. Most of it looks good to me, though I can'tspeak to all the kernel internals.
Just some comments:

On 12.10.2017 10:03, Liu, Monk wrote:
lFor cs_submit() IOCTL:
1.check if current ctx been marked “*guilty*”and return“*ECANCELED*” if so.
2.set job->*vram_lost_counter* with adev->*vram_lost_counter*, andreturn “*ECANCELED*” if ctx->*vram_lost_counter* !=job->*vram_lost_counter* (Christian already submitted this patch)
a)discussion: can we return “ENODEV” if vram_lost_counter mismatch ?that way UMD know this context is under “device lost”
My plan for UMD is to always query the VRAM lost counter when anykind of context lost situation is detected. So cs_submit() shouldreturn an error in this situation, but it could just be ECANCELED. Wedon't need to distinguish between different types of errors here.
lIntroduce a new IOCTL to let UMD query latestadev->*vram_lost_counter*:
Christian already sent a patch for this.
lFor amdgpu_ctx_query():
n*Don’t update ctx->reset_counter when querying this function,otherwise the query result is not consistent *
Hmm. I misremembered part of the spec, see below.
nSet out->state.reset_status to “AMDGPU_CTX_GUILTY_RESET” if the ctxis “*guilty*”, no need to check “ctx->reset_counter”
Agreed.
nSet out->state.reset_status to “AMDGPU_CTX_INNOCENT_RESET” *if thectx isn’t “guilty” && ctx->reset_counter != adev->reset_counter *
I disagree. The meaning of AMDGPU_CTX_*_RESET should reflect thecorresponding enums in user space APIs. I don't know how it works inVulkan, but in OpenGL, returning GL_INNOCENT_CONTEXT_RESET_ARB meansthat the context was lost.
However, !guilty && ctx->reset_counter != adev->reset_counter doesnot imply that the context was lost.
The way I understand it, we should return AMDGPU_CTX_INNOCENT_RESETif !guilty && ctx->vram_lost_counter != adev->vram_lost_counter.
As far as I understand it, the case of !guilty && ctx->reset_counter!= adev->reset_counter && ctx->vram_lost_counter ==adev->vram_lost_counter should return AMDGPU_CTX_NO_RESET, because aGPU reset occurred, but it didn't affect our context.
I unfortunately noticed another subtlety while re-reading the OpenGLspec. OpenGL says that the OpenGL context itself does *not* have tobe re-created in order to recover from the reset. Re-creating allobjects in the context is sufficient.
I believe this is the original motivation for why amdgpu_ctx_query()will reset the ctx->reset_counter.
For Mesa, it's still okay if the kernel keeps blocking submissions aswe can just recreate the kernel context. But OrcaGL is also affected.
Does anybody know off-hand where the relevant parts of the Vulkanspec are? I didn't actually find anything in a quick search.
[snip]
For UMD behavior we still have something need to consider:
If MESA creates a new context from an old context (share list?? I’mnot familiar with UMD , David Mao shall have some discuss on it withNicolai), the new created context’s vram_lost_counter
And reset_counter shall all be ported from that old context ,otherwise CS_SUBMIT will not block it which isn’t correct
The kernel doesn't have to do anything for this, it is entirely theUMD's responsibility. All UMD needs from KMD is the function forquerying the vram_lost_counter.
Cheers,
Nicolai
Need your feedback, thx
*From:*amd-gfx [mailto:amd-gfx-boun...@lists.freedesktop.org] *OnBehalf Of *Liu, Monk
*Sent:* 2017年10月11日13:34
*To:* Koenig, Christian <christian.koe...@amd.com>; Haehnle, Nicolai<nicolai.haeh...@amd.com>; Olsak, Marek <marek.ol...@amd.com>;Deucher, Alexander <alexander.deuc...@amd.com>*Cc:* Ramirez, Alejandro <alejandro.rami...@amd.com>;amd-gfx@lists.freedesktop.org; Filipas, Mario<mario.fili...@amd.com>; Ding, Pixel <pixel.d...@amd.com>; Li,Bingley <bingley...@amd.com>; Jiang, Jerry (SW) <jerry.ji...@amd.com>
*Subject:* TDR and VRAM lost handling in KMD:

Hi Christian & Nicolai,
We need to achieve some agreements on what should MESA/UMD do andwhat should KMD do, *please give your comments with **“okay”or“No”and your idea on below items,*
lWhen a job timed out (set from lockup_timeout kernel parameter),What KMD should do in TDR routine :
1.Update adev->*gpu_reset_counter*, and stop scheduler first,(*gpu_reset_counter* is used to force vm flush after GPU reset, outof this thread’s scope so no more discussion on it)
2.Set its fence error status to “*ETIME*”,

3.Find the entity/ctx behind this job, and set this ctx as “*guilty*”
4.Kick out this job from scheduler’s mirror list, so this job won’tget re-scheduled to ring anymore.
5.Kick out all jobs in this “guilty”ctx’s KFIFO queue, and set alltheir fence status to “*ECANCELED*”
*6.*Force signal all fences that get kicked out by above twosteps,*otherwise UMD will block forever if waiting on those fences*
7.Do gpu reset, which is can be some callbacks to let bare-metal andSR-IOV implement with their favor style
8.After reset, KMD need to aware if the VRAM lost happens or not,bare-metal can implement some function to judge, while for SR-IOV Iprefer to read it from GIM side (for initial version we considerit’s always VRAM lost, till GIM side change aligned)
9.If VRAM lost not hit, continue, otherwise:

a)Update adev->*vram_lost_counter*,
b)Iterate over all living ctx, and set all ctx as “*guilty*”sinceVRAM lost actually ruins all VRAM contents
c)Kick out all jobs in all ctx’s KFIFO queue, and set all theirfence status to “*ECANCELDED*”
10.Do GTT recovery and VRAM page tables/entries recovery (optional,do we need it ???)
11.Re-schedule all JOBs remains in mirror list to ring again andrestart scheduler (for VRAM lost case, no JOB will re-scheduled)
lFor cs_wait() IOCTL:
After it found fence signaled, it should check with*“dma_fence_get_status” *to see if there is error there,
And return the error status of fence

lFor cs_wait_fences() IOCTL:

Similar with above approach

lFor cs_submit() IOCTL:
It need to check if current ctx been marked as “*guilty*”and return“*ECANCELED*”if so
lIntroduce a new IOCTL to let UMD query *vram_lost_counter*:
This way, UMD can also block app from submitting, like @Nicolaimentioned, we can cache one copy of *vram_lost_counter* whenenumerate physical device, and deny all
gl-context from submitting if the counter queried bigger than thatone cached in physical device. (looks a little overkill to me, buteasy to implement )
UMD can also return error to APP when creating gl-context if foundcurrent queried*vram_lost_counter *bigger than that one cached inphysical device.
BTW: I realized that gl-context is a little different with kernel’scontext. Because for kernel. BO is not related with context but onlywith FD, while in UMD, BO have a backend
gl-context, so block submitting in UMD layer is also needed althoughKMD will do its job as bottom line
lBasically “vram_lost_counter”is exposure by kernel to let UMD takethe control of robust extension feature, it will be UMD’s call tomove, KMD only deny “guilty”context from submitting
Need your feedback, thx

We’d better make TDR feature landed ASAP

BR Monk


_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

Re: TDR and VRAM lost handling in KMD (v2)

Reply via email to