On 13.10.2017 14:04, Liu, Monk wrote:
That's what I suggested, good to know it's agreed

Yeah, that's fine with me as well.

Cheers,
Nicolai


BR Monk

-----Original Message-----
From: Koenig, Christian
Sent: October 13, 2017 20:01
To: Liu, Monk <monk....@amd.com>; Haehnle, Nicolai <nicolai.haeh...@amd.com>; Michel Dänzer 
<mic...@daenzer.net>; Olsak, Marek <marek.ol...@amd.com>; Deucher, Alexander 
<alexander.deuc...@amd.com>; Zhou, David(ChunMing) <david1.z...@amd.com>; Mao, David <david....@amd.com>
Cc: Ramirez, Alejandro <alejandro.rami...@amd.com>; amd-gfx@lists.freedesktop.org; Filipas, Mario 
<mario.fili...@amd.com>; Ding, Pixel <pixel.d...@amd.com>; Li, Bingley 
<bingley...@amd.com>; Jiang, Jerry (SW) <jerry.ji...@amd.com>
Subject: Re: TDR and VRAM lost handling in KMD (v2)

I think the best approach is to keep it as it is right now and not change a 
thing.

Then we add a new IOCTL with saner return values, e.g. guilty status and VRAM 
lost status as flags.
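
A minimal sketch of what such a flag-based status query could look like (the 
AMDGPU_CTX_QUERY2_FLAGS_* names and the helper below are illustrative 
assumptions, not an existing uapi):

#include <stdint.h>

/* Illustrative flag bits for the proposed query; names are placeholders. */
#define AMDGPU_CTX_QUERY2_FLAGS_VRAMLOST  (1u << 0)  /* VRAM content lost since ctx creation */
#define AMDGPU_CTX_QUERY2_FLAGS_RESET     (1u << 1)  /* a GPU reset happened since ctx creation */
#define AMDGPU_CTX_QUERY2_FLAGS_GUILTY    (1u << 2)  /* a job of this ctx caused a hang */

/* Stand-ins for the per-context and per-device bookkeeping discussed below. */
struct ctx_state {
        uint32_t vram_lost_counter;
        uint32_t reset_counter;
        int guilty;
};

struct dev_state {
        uint32_t vram_lost_counter;
        uint32_t reset_counter;
};

static uint32_t ctx_query2_flags(const struct ctx_state *ctx,
                                 const struct dev_state *adev)
{
        uint32_t flags = 0;

        if (ctx->vram_lost_counter != adev->vram_lost_counter)
                flags |= AMDGPU_CTX_QUERY2_FLAGS_VRAMLOST;
        if (ctx->reset_counter != adev->reset_counter)
                flags |= AMDGPU_CTX_QUERY2_FLAGS_RESET;
        if (ctx->guilty)
                flags |= AMDGPU_CTX_QUERY2_FLAGS_GUILTY;

        return flags;
}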

Regards,
Christian.

On 13.10.2017 at 13:51, Liu, Monk wrote:
Ping Christian & Nicolai

This ctx_query() is a little annoying to me

BR Monk

-----Original Message-----
From: amd-gfx [mailto:amd-gfx-boun...@lists.freedesktop.org] On Behalf
Of Liu, Monk
Sent: October 13, 2017 17:19
To: Haehnle, Nicolai <nicolai.haeh...@amd.com>; Koenig, Christian
<christian.koe...@amd.com>; Michel Dänzer <mic...@daenzer.net>; Olsak,
Marek <marek.ol...@amd.com>; Deucher, Alexander
<alexander.deuc...@amd.com>; Zhou, David(ChunMing)
<david1.z...@amd.com>; Mao, David <david....@amd.com>
Cc: Ramirez, Alejandro <alejandro.rami...@amd.com>;
amd-gfx@lists.freedesktop.org; Filipas, Mario <mario.fili...@amd.com>;
Ding, Pixel <pixel.d...@amd.com>; Li, Bingley <bingley...@amd.com>;
Jiang, Jerry (SW) <jerry.ji...@amd.com>
Subject: RE: TDR and VRAM lost handling in KMD (v2)

Alright, if Mesa can handle the cloned context's VRAM_LOST_COUNTER mismatch
issue, there is no need to introduce one more U/K interface in the KMD.

So we have only one unresolved issue, and it needs to be decided ASAP:

How do we modify amdgpu_ctx_query()?

The current design won't work with what we discussed for TDR (v2), right? 
Unless Mesa stops calling ctx_query() and relies entirely on a newly introduced 
interface that returns the latest vram-lost-counter to judge whether a context 
is healthy.

BR Monk

-----Original Message-----
From: Haehnle, Nicolai
Sent: October 13, 2017 15:43
To: Liu, Monk <monk....@amd.com>; Koenig, Christian
<christian.koe...@amd.com>; Michel Dänzer <mic...@daenzer.net>; Olsak,
Marek <marek.ol...@amd.com>; Deucher, Alexander
<alexander.deuc...@amd.com>; Zhou, David(ChunMing)
<david1.z...@amd.com>; Mao, David <david....@amd.com>
Cc: Ramirez, Alejandro <alejandro.rami...@amd.com>;
amd-gfx@lists.freedesktop.org; Filipas, Mario <mario.fili...@amd.com>;
Ding, Pixel <pixel.d...@amd.com>; Li, Bingley <bingley...@amd.com>;
Jiang, Jerry (SW) <jerry.ji...@amd.com>
Subject: Re: TDR and VRAM lost handling in KMD (v2)

On 13.10.2017 05:57, Liu, Monk wrote:
That sounds sane, but unfortunately might not be possible with the existing 
IOCTL. Keep in mind that we need to keep backward compatibility here.
Unfortunately the current scheme of amdgpu_ctx_query() won't work with the
TDR feature, which is meant to support Vulkan/Mesa/closed-source OGL/radv
…

Its enumeration is too limited for Mesa to implement against …

*Do you have a good idea that both keeps the compatibility here and gives us
flexibility?*

*Looks like we need to add a new amdgpu_ctx_query_2() interface ….*

· *A new IOCTL added for the context:*

void amdgpu_ctx_reinit(struct amdgpu_ctx *ctx, struct amdgpu_device *adev)
{
        ctx->vram_lost_counter = adev->vram_lost_counter;
        ctx->reset_counter = adev->reset_counter;
}


Mhm, is there any advantage to just creating a new context?

[ML] Sorry, this function is wrong; here is my original idea:

Mesa can create a new ctx based on an old one, like:

Create gl-ctx1,

Create BO-A under gl-ctx1 …

VRAM LOST

Create gl-ctx2 from gl-ctx1 (share list; I'm not familiar with the UMD side,
David Mao and Nicolai can give input)

Create BO-B under gl-ctx2 …

Submit a job on gl-ctx2, but it can still refer to BO-A.

With our design, the kernel won't block submissions from context2 (backing
gl-ctx2), since its vram_lost_counter equals the latest adev copy.

But gl-ctx2 is a clone of gl-ctx1; the only difference is that its
vram_lost/gpu_reset counters are updated to the latest values.

So logically we should also block submissions from gl-ctx2 (mapping to
kernel context2), and we fail to do so …

That's why I want to add a new "amdgpu_ctx_clone", which should work like:

int amdgpu_ctx_clone(struct amdgpu_ctx *original, struct amdgpu_ctx *clone)
{
        clone->vram_lost_counter = original->vram_lost_counter;
        clone->gpu_reset_counter = original->gpu_reset_counter;
        return 0;
}
   From the Mesa perspective, I don't think we would need to use that as long 
as we can query the device VRAM lost counter.

Personally, I think it's better for the UMD to build its policy on lower-level 
primitives such as the VRAM lost counter query.
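
A rough sketch of that UMD-side policy, assuming a device-level query for the
current VRAM lost counter gets exposed (query_device_vram_lost_counter() below
is a placeholder for whatever interface ends up being added, not an existing
libdrm call):

#include <stdbool.h>
#include <stdint.h>

/* Placeholder for the hypothetical device-wide query. */
extern uint64_t query_device_vram_lost_counter(void);

struct umd_context {
        /* Sampled from the device when the context (or a BO) was created. */
        uint64_t vram_lost_counter;
};

static bool umd_context_content_valid(const struct umd_context *ctx)
{
        /* If the device counter moved, every BO created before the loss may
         * hold stale or garbage data, and the UMD has to recreate/reupload
         * (or report the context as lost to the application). */
        return query_device_vram_lost_counter() == ctx->vram_lost_counter;
}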

Cheers,
Nicolai

BR Monk

*From:* Koenig, Christian
*Sent:* October 12, 2017 19:50
*To:* Liu, Monk <monk....@amd.com>; Haehnle, Nicolai
<nicolai.haeh...@amd.com>; Michel Dänzer <mic...@daenzer.net>; Olsak,
Marek <marek.ol...@amd.com>; Deucher, Alexander
<alexander.deuc...@amd.com>; Zhou, David(ChunMing)
<david1.z...@amd.com>; Mao, David <david....@amd.com>
*Cc:* Ramirez, Alejandro <alejandro.rami...@amd.com>;
amd-gfx@lists.freedesktop.org; Filipas, Mario
<mario.fili...@amd.com>; Ding, Pixel <pixel.d...@amd.com>; Li,
Bingley <bingley...@amd.com>; Jiang, Jerry (SW) <jerry.ji...@amd.com>
*Subject:* Re: TDR and VRAM lost handling in KMD (v2)

On 12.10.2017 at 13:37, Liu, Monk wrote:

      Hi team,

      Very good, many of the policies and implementation details are agreed on;
      it looks like we only have some arguments left about amdgpu_ctx_query().
      Well, I'm also confused by its current implementation, ☹

      First, I want to know if you guys agree that we *don't update
      ctx->reset_counter in amdgpu_ctx_query()*, because I want to make the
      query result always consistent for a given context.


That sounds like a good idea to me, but I'm not sure if it won't
break userspace (I don't think so). Nicolai or Marek need to comment.


      Second, I want to say that the kernel shouldn't use terms from Mesa,
      OGL, or Vulkan; e.g. the kernel shouldn't use AMDGPU_INNOCENT_RESET to
      map to GL_INNOCENT_RESET_ARB, etc...

      Because that way the kernel would be tied to certain UMDs, so I suggest
      we rename the context status values entirely, and each UMD has its own
      way to map the kernel context's result to its gl-context/vk-context/etc…


Yes, completely agree.


      The kernel should only provide the following **flag bits** on a given
      context:

      · Define AMDGPU_CTX_STATE_GUILTY 0x1 // as soon as TDR detects a job
      hang, the KMD marks the context behind that job as "guilty"

      · Define AMDGPU_CTX_STATE_VRAMLOST 0x2 // as soon as a VRAM lost event
      hits after this context was created, we mark this context "VRAM_LOST",
      so the UMD can assume that all BOs under this context may have lost
      their content; since the kernel tracks no relationship between contexts
      and BOs, it is the UMD's call to judge whether a BO should be considered
      "VRAM lost" or not

      · Define AMDGPU_CTX_STATE_RESET 0x4 // as soon as a GPU reset occurred
      after context creation, this flag shall be set (a distinct bit, so the
      flags can be OR'ed together)


That sounds sane, but unfortunately might not be possible with the
existing IOCTL. Keep in mind that we need to keep backward
compatibility here.


      Sample code:

      int amdgpu_ctx_query(struct amdgpu_ctx *ctx, struct amdgpu_device *adev,
                           struct amdgpu_ctx_query_parm *out)
      {
              out->status = 0;

              if (ctx->vram_lost_counter != adev->vram_lost_counter)
                      out->status |= AMDGPU_CTX_STATE_VRAMLOST;

              if (ctx->reset_counter != adev->reset_counter) {
                      out->status |= AMDGPU_CTX_STATE_RESET;
                      if (ctx->guilty)
                              out->status |= AMDGPU_CTX_STATE_GUILTY;
              }

              return 0;
      }

      For the UMD, "out.status == 0" means no GPU reset has occurred at all,
      and the context is completely regular.

      · *A new IOCTL added for the context:*

      void amdgpu_ctx_reinit(struct amdgpu_ctx *ctx, struct amdgpu_device *adev)
      {
              ctx->vram_lost_counter = adev->vram_lost_counter;
              ctx->reset_counter = adev->reset_counter;
      }


Mhm, is there any advantage to just creating a new context?

Regards,
Christian.


      *If the UMD decides *not* to release the "guilty" context but to keep
      using it after acknowledging the GPU hang on a certain job/context, I
      suggest the UMD call "amdgpu_ctx_reinit()":*

      That way, after you re-init() the context, you get an updated result
      from "amdgpu_ctx_query", which should give you "out.status == 0" as
      long as no GPU reset/VRAM loss hits after the re-init().

      BR Monk

      -----Original Message-----
      From: Koenig, Christian
      Sent: October 12, 2017 18:13
      To: Haehnle, Nicolai <nicolai.haeh...@amd.com>; Michel Dänzer
      <mic...@daenzer.net>; Liu, Monk <monk....@amd.com>; Olsak, Marek
      <marek.ol...@amd.com>; Deucher, Alexander <alexander.deuc...@amd.com>;
      Zhou, David(ChunMing) <david1.z...@amd.com>; Mao, David <david....@amd.com>
      Cc: Ramirez, Alejandro <alejandro.rami...@amd.com>;
      amd-gfx@lists.freedesktop.org; Filipas, Mario <mario.fili...@amd.com>;
      Ding, Pixel <pixel.d...@amd.com>; Li, Bingley <bingley...@amd.com>;
      Jiang, Jerry (SW) <jerry.ji...@amd.com>
      Subject: Re: TDR and VRAM lost handling in KMD (v2)

      On 12.10.2017 at 11:44, Nicolai Hähnle wrote:
      > On 12.10.2017 11:35, Michel Dänzer wrote:
      >> On 12/10/17 11:23 AM, Christian König wrote:
      >>> On 12.10.2017 at 11:10, Nicolai Hähnle wrote:
      >>>> On 12.10.2017 10:49, Christian König wrote:
      >>>>>> However, !guilty && ctx->reset_counter != adev->reset_counter
      >>>>>> does not imply that the context was lost.
      >>>>>>
      >>>>>> The way I understand it, we should return
      >>>>>> AMDGPU_CTX_INNOCENT_RESET if !guilty &&
      >>>>>> ctx->vram_lost_counter != adev->vram_lost_counter.
      >>>>>>
      >>>>>> As far as I understand it, the case of !guilty &&
      >>>>>> ctx->reset_counter != adev->reset_counter &&
      >>>>>> ctx->vram_lost_counter == adev->vram_lost_counter should return
      >>>>>> AMDGPU_CTX_NO_RESET, because a GPU reset occurred, but it didn't
      >>>>>> affect our context.
      >>>>> I disagree on that.
      >>>>>
      >>>>> AMDGPU_CTX_INNOCENT_RESET just means what it does currently: there
      >>>>> was a reset, but we haven't been causing it.
      >>>>>
      >>>>> That the OpenGL extension is specified otherwise is unfortunate,
      >>>>> but I think we shouldn't use that for the kernel interface here.
      >>>> Two counterpoints:
      >>>>
      >>>> 1. Why should any application care that there was a reset while it
      >>>> was idle? The OpenGL behavior is what makes sense.
      >>>
      >>> The application is certainly not interested in whether a reset
      >>> happened or not, but I thought that the driver stack might be.
      >>>
      >>>> 2. AMDGPU_CTX_INNOCENT_RESET doesn't actually mean anything today
      >>>> because we never return it :)
      >>>>
      >>> Good point.
      >>>
      >>>> amdgpu_ctx_query only ever returns AMDGPU_CTX_UNKNOWN_RESET, which
      >>>> is in line with the OpenGL spec: we're conservatively returning
      >>>> that a reset happened because we don't know whether the context was
      >>>> affected, and we return UNKNOWN because we also don't know whether
      >>>> the context was guilty or not.
      >>>>
      >>>> Returning AMDGPU_CTX_NO_RESET in the case of !guilty &&
      >>>> ctx->vram_lost_counter == adev->vram_lost_counter is simply a
      >>>> refinement and improvement of the current, overly conservative
      >>>> behavior.
      >>>
      >>> OK, let's re-enumerate what I think the different return values
      >>> should mean:
      >>>
      >>> * AMDGPU_CTX_GUILTY_RESET
      >>>
      >>> guilty is set to true for this context.
      >>>
      >>> * AMDGPU_CTX_INNOCENT_RESET
      >>>
      >>> guilty is not set and vram_lost_counter has changed.
      >>>
      >>> * AMDGPU_CTX_UNKNOWN_RESET
      >>>
      >>> guilty is not set and vram_lost_counter has not changed, but
      >>> gpu_reset_counter has changed.
      >>
      >> I don't understand the distinction you're proposing between
      >> AMDGPU_CTX_INNOCENT_RESET and AMDGPU_CTX_UNKNOWN_RESET. I think both
      >> cases you're describing should return either
      >> AMDGPU_CTX_INNOCENT_RESET, if the value of guilty is reliable, or
      >> AMDGPU_CTX_UNKNOWN_RESET if it's not.
      >
      > I think it'd make more sense if it was called
      > "AMDGPU_CTX_UNAFFECTED_RESET".
      >
      > So:
      > - AMDGPU_CTX_GUILTY_RESET --> the context was affected by a reset, and
      > we know that it's the context's fault
      > - AMDGPU_CTX_INNOCENT_RESET --> the context was affected by a reset,
      > and we know that it *wasn't* the context's fault (no context job
      > active)
      > - AMDGPU_CTX_UNKNOWN_RESET --> the context was affected by a reset,
      > and we don't know who's responsible (this could be returned in the
      > unlikely case where context A's gfx job has not yet finished, but
      > context B's gfx job has already started; it could be the fault of A,
      > it could be the fault of B -- which somehow manages to hang a part of
      > the hardware that then prevents A's job from finishing -- or it could
      > be both; but it's a bit academic)
      > - AMDGPU_CTX_UNAFFECTED_RESET --> there was a reset, but this context
      > wasn't affected
      >
      > This last value would currently just be discarded by Mesa (because we
      > should only disturb applications when we have to), but perhaps
      > somebody else could find it useful?

      Yes, that's what I had in mind as well.

      Because otherwise we would return AMDGPU_CTX_NO_RESET while there
      actually was a reset, and that certainly doesn't sound correct to me.

      Regards,
      Christian.

      > Cheers,
      > Nicolai

_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx


