sched: signal and free remaining fences in amd_sched_entity_fini

Nicolai Hähnle Mon, 09 Oct 2017 07:02:29 -0700

It depends on what you mean by "handle". If amdgpu_cs_submit_raw were to return ECANCELED, the correct error message would be printed.

We don't do any of the "trying to continue" business because back when we last discussed that we said that it wasn't such a great idea, and to be honest, it really isn't a great idea for normal applications. For the X server / compositor it could be valuable though.


Cheers,
Nicolai

On 09.10.2017 15:57, Olsak, Marek wrote:

Mesa does not handle -ECANCELED. It only returns -ECANCELED from the Mesa winsys layer if the CS ioctl wasn't called (because the context is already lost and so the winsys doesn't submit further CS ioctls).
When the CS ioctl fails for the first time, the kernel error is returned and the context is marked as "lost".
The next command submission is automatically dropped by the winsys and it returns -ECANCELED.
Marek
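
The winsys behaviour described above can be sketched roughly as follows. This is only a minimal illustration, not Mesa's actual winsys code: every name here (sketch_ctx, sketch_cs_submit, the fake ioctl callbacks) is hypothetical, and the failing ioctl returns -EINVAL purely as an example kernel error.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical stand-in for a winsys context; not Mesa's real struct. */
struct sketch_ctx {
    bool lost;        /* set once the first CS ioctl has failed */
    int ioctl_calls;  /* how often the (fake) CS ioctl was actually reached */
};

/* Hypothetical stand-in for the CS ioctl: returns 0 or a negative errno. */
typedef int (*sketch_cs_ioctl_fn)(void);

static int sketch_cs_submit(struct sketch_ctx *ctx, sketch_cs_ioctl_fn cs_ioctl)
{
    assert(ctx && cs_ioctl);

    /* Context already lost: drop the submission without calling the ioctl. */
    if (ctx->lost)
        return -ECANCELED;

    ctx->ioctl_calls++;
    int r = cs_ioctl();

    /* First failure: hand the kernel error back and mark the context lost. */
    if (r < 0)
        ctx->lost = true;
    return r;
}

/* Fake ioctls for illustration only. */
static int sketch_ioctl_ok(void)   { return 0; }
static int sketch_ioctl_fail(void) { return -EINVAL; }
```

With this shape, the caller sees the real kernel error exactly once, and every later submission on the same context yields -ECANCELED without reaching the kernel.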

------------------------------------------------------------------------
*From:* Haehnle, Nicolai
*Sent:* Monday, October 9, 2017 2:58:02 PM
*To:* Koenig, Christian; Liu, Monk; Nicolai Hähnle; amd-gfx@lists.freedesktop.org; Olsak, Marek
*Cc:* Li, Bingley
*Subject:* Re: [PATCH 5/5] drm/amd/sched: signal and free remaining fences in amd_sched_entity_fini
On 09.10.2017 14:33, Christian König wrote:
Am 09.10.2017 um 13:27 schrieb Nicolai Hähnle:
On 09.10.2017 13:12, Christian König wrote:
Nicolai, how hard would it be to handle ENODEV as failure for all currently existing contexts?
Impossible? "All currently existing contexts" is not a well-defined concept when multiple drivers co-exist in the same process.
Ok, let me refine the question: I assume there are resources "shared" between contexts like binary shader code for example which needs to be reuploaded when VRAM is lost.
How hard would it be to handle that correctly?
Okay, that makes more sense :)
With the current interface it's still pretty difficult, but if we could get a new per-device query ioctl which returns a "VRAM loss counter", it would be reasonably straight-forward.
The problem with the VRAM lost counter is that this isn't safe either. E.g. you could have an application which uploads shaders, a GPU reset happens and VRAM is lost and then the application creates a new context and makes a submission with broken shader binaries.
Hmm. Here's how I imagined we'd be using a VRAM lost counter:

int si_shader_binary_upload(...)
{
     ...
     shader->bo_vram_lost_counter = sscreen->vram_lost_counter;
     shader->bo = pipe_buffer_create(...);
     ptr = sscreen->b.ws->buffer_map(shader->bo->buf, ...);
     ... copies ...
     sscreen->b.ws->buffer_unmap(shader->bo->buf);
}

int si_shader_select(...)
{
     ...
     r = si_shader_select_with_key(ctx->sscreen, state, ...);
     if (r) return r;

     if (state->current->bo_vram_lost_counter !=
         ctx->sscreen->vram_lost_counter) {
        ... re-upload sequence ...
     }
}

(Not shown: logic that compares ctx->vram_lost_counter with
sscreen->vram_lost_counter and forces a re-validation of all state
including shaders.)

That should cover this scenario, shouldn't it?

Oh... I see one problem. But it should be easy to fix: when creating a
new amdgpu context, Mesa needs to query the vram lost counter. So then
the sequence of events would be either:

- VRAM lost counter starts at 0
- Mesa uploads a shader binary
- Unrelated GPU reset happens, kernel increments VRAM lost counter to 1
- Mesa creates a new amdgpu context, queries the VRAM lost counter --> 1
- si_screen::vram_lost_counter is updated to 1
- Draw happens on the new context --> si_shader_select will catch the
VRAM loss

Or:

- VRAM lost counter starts at 0
- Mesa uploads a shader binary
- Mesa creates a new amdgpu context, VRAM lost counter still 0
- Unrelated GPU reset happens, kernel increments VRAM lost counter to 1
- Draw happens on the new context and proceeds normally
...
- Mesa flushes the CS, and the kernel will return an error code because
the device VRAM lost counter is different from the amdgpu context VRAM
lost counter
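
Both sequences above can be modelled in a few lines of C. This is a rough sketch under stated assumptions, not kernel or Mesa code: all sk_* names are hypothetical, the per-context counter snapshot at context creation is the behaviour assumed in this mail, and -ECANCELED merely stands in for whichever error code ends up being chosen.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

/* All types below are hypothetical stand-ins for illustration. */
struct sk_device { uint32_t vram_lost_counter; };   /* kernel, per device */
struct sk_kctx {                                    /* kernel, per context */
    const struct sk_device *dev;
    uint32_t vram_lost_counter;                     /* snapshot at creation */
};
struct sk_screen { uint32_t vram_lost_counter; };   /* userspace screen */
struct sk_shader { uint32_t bo_vram_lost_counter; int uploads; };

/* A GPU reset that loses VRAM bumps the device counter. */
static void sk_gpu_reset(struct sk_device *dev) { dev->vram_lost_counter++; }

/* Creating a context snapshots the device counter; userspace queries it
 * at that point and updates the screen's copy. */
static struct sk_kctx sk_ctx_create(const struct sk_device *dev,
                                    struct sk_screen *screen)
{
    struct sk_kctx ctx = { dev, dev->vram_lost_counter };
    screen->vram_lost_counter = ctx.vram_lost_counter;
    return ctx;
}

/* Record the counter first, then (conceptually) copy the binary. */
static void sk_shader_upload(const struct sk_screen *screen,
                             struct sk_shader *sh)
{
    sh->bo_vram_lost_counter = screen->vram_lost_counter;
    sh->uploads++;
}

/* Returns true if the shader had to be re-uploaded. */
static bool sk_shader_select(const struct sk_screen *screen,
                             struct sk_shader *sh)
{
    if (sh->bo_vram_lost_counter == screen->vram_lost_counter)
        return false;
    sk_shader_upload(screen, sh);
    return true;
}

/* The kernel rejects submissions from contexts older than the last loss. */
static int sk_cs_submit(const struct sk_kctx *ctx)
{
    assert(ctx && ctx->dev);
    return ctx->vram_lost_counter == ctx->dev->vram_lost_counter
               ? 0 : -ECANCELED;
}
```

In the first sequence the reset precedes sk_ctx_create(), so sk_shader_select() notices the stale binary and re-uploads it. In the second the reset follows context creation, so the draw proceeds and sk_cs_submit() reports the loss.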
So I would still vote for a separate IOCTL to reset the VRAM lost state which is called *before* user space starts to reupload shader/descriptors etc...
The question is: is that separate IOCTL per-context or per-fd? If it's
per-fd, then it's not compatible with multiple drivers. If it's
per-context, then I don't see how it helps. Perhaps you could explain?


  > This way you also catch the case when another reset happens while you
  > re-upload things.

My assumption would be that the re-upload happens *after* the new amdgpu
context is created. Then the repeat reset should be caught by the kernel
when we try to submit a CS on the new context (this is assuming that the
kernel context's vram lost counter is initialized properly when the
context is created):

- Mesa prepares upload, sets shader->bo_vram_lost_counter to 0
- Mesa uploads a shader binary
- While doing this, a GPU reset happens[0], kernel increments device
VRAM lost counter to 1
- Draw happens with the new shader, Mesa proceeds normally
...
- On flush / CS submit, the kernel detects the VRAM lost state and
returns an error to Mesa

[0] Out of curiosity: What happens on the CPU side if the PCI / full
ASIC reset method is used? Is there a time window where we could get a SEGV?


[snip]
BTW, I still don't like ENODEV. It seems more like the kind of error code you'd return with hot-pluggable GPUs where the device can physically disappear...
Yeah, ECANCELED sounds like a better alternative. But I think we should still somehow note the fatality of losing VRAM to userspace.
How about ENODATA or EBADFD?
According to the manpage, EBADFD is "File descriptor in bad state.".
Sounds fitting :)

Cheers,
Nicolai
Regards,
Christian.
Cheers,
Nicolai
Regards,
Christian.

Am 09.10.2017 um 13:04 schrieb Nicolai Hähnle:
On 09.10.2017 12:59, Christian König wrote:
Nicolai, how hard would it be to handle ENODEV as failure for all currently existing contexts?
Impossible? "All currently existing contexts" is not a well-defined concept when multiple drivers co-exist in the same process.
And what would be the purpose of this? If it's to support VRAM loss, having a per-context VRAM loss counter would enable each context to signal ECANCELED separately.
Cheers,
Nicolai
Monk, would it be ok with you if we return ENODEV only for existing contexts when VRAM is lost and/or we have a strict mode GPU reset? E.g. newly created contexts would continue to work as they should.
Regards,
Christian.

Am 09.10.2017 um 12:49 schrieb Nicolai Hähnle:
Hi Monk,
Yes, you're right, we're only using ECANCELED internally. But as a consequence, Mesa would already handle a kernel error of ECANCELED on context loss correctly :)
Cheers,
Nicolai

On 09.10.2017 12:35, Liu, Monk wrote:
Hi Christian
You rejected some of my patches that return -ENODEV, on the grounds that MESA doesn't handle -ENODEV.
But if Nicolai can confirm that MESA does handle -ECANCELED, then we need to align our error codes overall. In detail, the IOCTLs below can return an error code:
amdgpu_cs_ioctl
amdgpu_cs_wait_ioctl
amdgpu_cs_wait_fences_ioctl
amdgpu_info_ioctl


My patches:
return -ENODEV on cs_ioctl if the context is detected guilty,
also return -ENODEV on cs_wait|cs_wait_fences if the fence is signaled but with error -ETIME,
also return -ENODEV on info_ioctl so UMD can query if a gpu reset happened after the process was created (because for strict mode we block the process instead of the context)
according to Nicolai:
amdgpu_cs_ioctl *can* return -ECANCELED, but frankly speaking, the kernel part doesn't have any place with "-ECANCELED", so this solution on the MESA side doesn't align with the *current* amdgpu driver, which only returns 0 on success or -EINVAL on other errors but definitely no "-ECANCELED" error code,
so if we are talking about community rules we shouldn't let MESA handle -ECANCELED, we should have a unified error code
+ Marek

BR Monk




-----Original Message-----
From: Haehnle, Nicolai
Sent: October 9, 2017 18:14
To: Koenig, Christian <christian.koe...@amd.com>; Liu, Monk <monk....@amd.com>; Nicolai Hähnle <nhaeh...@gmail.com>; amd-gfx@lists.freedesktop.org
Subject: Re: [PATCH 5/5] drm/amd/sched: signal and free remaining fences in amd_sched_entity_fini
On 09.10.2017 10:02, Christian König wrote:
For gpu reset patches (already submitted to pub) I would make the kernel return -ENODEV if the waiting fence (in cs_wait or wait_fences IOCTL) is found to have an error, that way UMD would run into the robust extension path
and consider that a GPU hang occurred,
Well that is only closed source behavior which is completely
irrelevant for upstream development.
As far as I know we haven't pushed the change to return -ENODEV upstream.
FWIW, radeonsi currently expects -ECANCELED on CS submissions and treats those as context lost. Perhaps we could use the same error on fences?
That makes more sense to me than -ENODEV.

Cheers,
Nicolai
Regards,
Christian.

Am 09.10.2017 um 08:42 schrieb Liu, Monk:
Christian
It would be really nice to have an error code set on
s_fence->finished before it is signaled, use dma_fence_set_error()
for this.
For gpu reset patches (already submitted to pub) I would make the kernel return -ENODEV if the waiting fence (in cs_wait or wait_fences IOCTL) is found to have an error, that way UMD would run into the robust extension path
and consider that a GPU hang occurred,
Don't know if this is expected for the case of a normal process being killed or crashed like Nicolai hit ... since there is no gpu hang hit
BR Monk




-----Original Message-----
From: amd-gfx [mailto:amd-gfx-boun...@lists.freedesktop.org] On
Behalf Of Christian König
Sent: September 28, 2017 23:01
To: Nicolai Hähnle <nhaeh...@gmail.com>;
amd-gfx@lists.freedesktop.org
Cc: Haehnle, Nicolai <nicolai.haeh...@amd.com>
Subject: Re: [PATCH 5/5] drm/amd/sched: signal and free remaining
fences in amd_sched_entity_fini

Am 28.09.2017 um 16:55 schrieb Nicolai Hähnle:
From: Nicolai Hähnle <nicolai.haeh...@amd.com>
Highly concurrent Piglit runs can trigger a race condition where a
pending SDMA job on a buffer object is never executed because the
corresponding process is killed (perhaps due to a crash). Since the
job's fences were never signaled, the buffer object was effectively
leaked. Worse, the buffer was stuck wherever it happened to be at
the time, possibly in VRAM.

The symptom was user space processes stuck in interruptible waits
with kernel stacks like:

       [<ffffffffbc5e6722>] dma_fence_default_wait+0x112/0x250
       [<ffffffffbc5e6399>] dma_fence_wait_timeout+0x39/0xf0
       [<ffffffffbc5e82d2>] reservation_object_wait_timeout_rcu+0x1c2/0x300
       [<ffffffffc03ce56f>] ttm_bo_cleanup_refs_and_unlock+0xff/0x1a0 [ttm]
       [<ffffffffc03cf1ea>] ttm_mem_evict_first+0xba/0x1a0 [ttm]
       [<ffffffffc03cf611>] ttm_bo_mem_space+0x341/0x4c0 [ttm]
       [<ffffffffc03cfc54>] ttm_bo_validate+0xd4/0x150 [ttm]
       [<ffffffffc03cffbd>] ttm_bo_init_reserved+0x2ed/0x420 [ttm]
       [<ffffffffc042f523>] amdgpu_bo_create_restricted+0x1f3/0x470 [amdgpu]
       [<ffffffffc042f9fa>] amdgpu_bo_create+0xda/0x220 [amdgpu]
       [<ffffffffc04349ea>] amdgpu_gem_object_create+0xaa/0x140 [amdgpu]
       [<ffffffffc0434f97>] amdgpu_gem_create_ioctl+0x97/0x120 [amdgpu]
       [<ffffffffc037ddba>] drm_ioctl+0x1fa/0x480 [drm]
       [<ffffffffc041904f>] amdgpu_drm_ioctl+0x4f/0x90 [amdgpu]
       [<ffffffffbc23db33>] do_vfs_ioctl+0xa3/0x5f0
       [<ffffffffbc23e0f9>] SyS_ioctl+0x79/0x90
       [<ffffffffbc864ffb>] entry_SYSCALL_64_fastpath+0x1e/0xad
       [<ffffffffffffffff>] 0xffffffffffffffff

Signed-off-by: Nicolai Hähnle <nicolai.haeh...@amd.com>
Acked-by: Christian König <christian.koe...@amd.com>
---
    drivers/gpu/drm/amd/scheduler/gpu_scheduler.c | 7 ++++++-
    1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/amd/scheduler/gpu_scheduler.c b/drivers/gpu/drm/amd/scheduler/gpu_scheduler.c
index 54eb77cffd9b..32a99e980d78 100644
--- a/drivers/gpu/drm/amd/scheduler/gpu_scheduler.c
+++ b/drivers/gpu/drm/amd/scheduler/gpu_scheduler.c
@@ -220,22 +220,27 @@ void amd_sched_entity_fini(struct amd_gpu_scheduler *sched,
amd_sched_entity_is_idle(entity));
        amd_sched_rq_remove_entity(rq, entity);
        if (r) {
            struct amd_sched_job *job;
            /* Park the kernel for a moment to make sure it isn't processing
             * our enity.
             */
            kthread_park(sched->thread);
            kthread_unpark(sched->thread);
-        while (kfifo_out(&entity->job_queue, &job, sizeof(job)))
+        while (kfifo_out(&entity->job_queue, &job, sizeof(job))) {
+            struct amd_sched_fence *s_fence = job->s_fence;
+            amd_sched_fence_scheduled(s_fence);
It would be really nice to have an error code set on
s_fence->finished before it is signaled, use dma_fence_set_error() for this.
Additional to that it would be nice to note in the subject line that
this is a rather important bug fix.

With that fixed the whole series is Reviewed-by: Christian König
<christian.koe...@amd.com>.

Regards,
Christian.
+            amd_sched_fence_finished(s_fence);
+            dma_fence_put(&s_fence->finished);
                sched->ops->free_job(job);
+        }
        }
        kfifo_free(&entity->job_queue);
    }
static void amd_sched_entity_wakeup(struct dma_fence *f, struct dma_fence_cb *cb)
    {
        struct amd_sched_entity *entity =
            container_of(cb, struct amd_sched_entity, cb);
        entity->dependency = NULL;
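
The review comment above asks for an error to be set on s_fence->finished before it is signaled. A userspace mock of that ordering might look like the sketch below. All sk_* names are hypothetical, and -ECANCELED is only a placeholder since the thread has not settled on an error code. Unlike the mock, the kernel's dma_fence_set_error() returns void and warns on misuse; the mock returns an error instead so the ordering rule can be checked.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical, simplified fence: just a signaled flag and an error. */
struct sk_fence { bool signaled; int error; };

/* Hypothetical job with the two scheduler fences and a "freed" marker. */
struct sk_job {
    struct sk_fence scheduled;
    struct sk_fence finished;
    bool freed;
};

/* The error must be a negative errno and must be set before signaling.
 * Mock-only behaviour: report misuse through the return value. */
static int sk_fence_set_error(struct sk_fence *f, int error)
{
    if (f->signaled || error >= 0)
        return -EINVAL;
    f->error = error;
    return 0;
}

static void sk_fence_signal(struct sk_fence *f) { f->signaled = true; }

/* Drain jobs that will never run: mark them scheduled, record why they
 * did not execute, signal completion so waiters wake up, then free. */
static int sk_entity_fini_drain(struct sk_job *jobs, int count)
{
    int drained = 0;

    for (int i = 0; i < count; i++) {
        struct sk_job *job = &jobs[i];

        sk_fence_signal(&job->scheduled);

        /* Placeholder error code; set strictly before signaling. */
        int r = sk_fence_set_error(&job->finished, -ECANCELED);
        assert(r == 0);
        (void)r;

        sk_fence_signal(&job->finished);
        job->freed = true;  /* stands in for sched->ops->free_job(job) */
        drained++;
    }
    return drained;
}
```

Anyone waiting on the finished fence is then woken up and can tell from the error that the job never executed.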
_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx