Ah, crap yeah that won't work since we don't free the ring.

The key point is that we need to distinguish between the ring temporarily not working because we are in a GPU reset and the ring not working at all because we are missing firmware or something like that.

And no, checking the gpu_reset flag is totally racy and can't be done either. How about checking accel_working instead?
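
To make the distinction above concrete, here is a minimal userspace sketch, not driver code. The mock_device and mock_ring structures and the mock_check_submit() helper are hypothetical stand-ins invented for illustration; only the field names accel_working and ready are taken from this thread. It assumes accel_working stays false when acceleration never came up (for example, missing firmware), while ready may drop only temporarily during a reset.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified stand-ins for the driver structures. */
struct mock_device {
    bool accel_working; /* assumed false if acceleration never came up */
};

struct mock_ring {
    struct mock_device *adev;
    bool ready;         /* may be false temporarily during a GPU reset */
};

/*
 * Decide whether a fill/copy request should be rejected up front.
 * Returns 0 to go ahead with the submission, -EINVAL to bail out.
 */
static int mock_check_submit(const struct mock_ring *ring)
{
    assert(ring != NULL && ring->adev != NULL);

    /* Permanent failure: acceleration was never brought up. */
    if (!ring->adev->accel_working)
        return -EINVAL;

    /*
     * ring->ready is deliberately not tested here: in this sketch it
     * may be false only because a reset is in progress, and the job
     * is expected to be resubmitted afterwards.
     */
    return 0;
}
```

With this shape, a ring that is merely not ready still accepts the request, while a device whose acceleration never worked is rejected regardless of the ring state.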

Christian.

On 01.03.2018 at 07:01, Liu, Monk wrote:
Please change the test to use ring->ring_obj instead, this way we still bail 
out if somebody tries to submit commands before the ring is even allocated.
I don't understand how fill_buffer() could run when ring->ring_obj is not
even allocated ... where does such a case occur?


/Monk

-----Original Message-----
From: Koenig, Christian
Sent: 28 February 2018 20:46
To: Liu, Monk <monk....@amd.com>; amd-gfx@lists.freedesktop.org
Subject: Re: [PATCH 3/4] drm/amdgpu: don't return when ring not ready for 
fill_buffer

Good point, but in this case we need some other handling.

Please change the test to use ring->ring_obj instead, this way we still bail 
out if somebody tries to submit commands before the ring is even allocated.

And you also need to fix a couple of other places in amdgpu_ttm.c.
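
As a rough illustration of the suggested test, the following userspace sketch checks for the ring buffer object instead of the ready flag. The sketch_ring structure and sketch_fill_precheck() are hypothetical names made up for this example; only ring_obj and ready come from the thread, and ring_obj is modeled as a plain pointer here.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified ring; ring_obj stands in for the ring's buffer object. */
struct sketch_ring {
    void *ring_obj; /* NULL until the ring has been allocated */
    bool ready;     /* not consulted by the precheck below */
};

/* Bail out only if the ring was never allocated. */
static int sketch_fill_precheck(const struct sketch_ring *ring)
{
    assert(ring != NULL);

    if (ring->ring_obj == NULL)
        return -EINVAL;

    return 0;
}
```

Note that such a check only separates "never allocated" from "allocated"; it says nothing about whether the ring is currently functional.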

Regards,
Christian.

On 28.02.2018 at 13:34, Liu, Monk wrote:
Because when SDMA has been hung by, say, process A, another process B may
already be running in the fill_buffer() code. So just let process B continue
and don't block it; otherwise process B would fail for a purely software reason.

Let it run: process B's job will eventually fail, and GPU recovery will
resubmit it (since it is a kernel job).

Without this solution, other processes are badly harmed by one
black sheep that triggers GPU recovery.

/Monk



-----Original Message-----
From: Christian König [mailto:ckoenig.leichtzumer...@gmail.com]
Sent: 28 February 2018 20:24
To: Liu, Monk <monk....@amd.com>; amd-gfx@lists.freedesktop.org
Subject: Re: [PATCH 3/4] drm/amdgpu: don't return when ring not ready
for fill_buffer

On 28.02.2018 at 08:21, Monk Liu wrote:
SDMA may be under GPU reset at this point, so its ring->ready may not be
true. Keep going; the GPU scheduler will reschedule this job if it
fails.

Give a warning in copy_buffer when going through direct_submit while
ring->ready is false.
NAK, that test has already saved us quite a bunch of trouble with the fb layer.

Why exactly are you running into issues with that?

Christian.

Change-Id: Ife6cd55e0e843d99900e5bed5418499e88633685
Signed-off-by: Monk Liu <monk....@amd.com>
---
    drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c | 6 +-----
    1 file changed, 1 insertion(+), 5 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
index e38e6db..7b75ac9 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
@@ -1656,6 +1656,7 @@ int amdgpu_copy_buffer(struct amdgpu_ring *ring, uint64_t src_offset,
        amdgpu_ring_pad_ib(ring, &job->ibs[0]);
        WARN_ON(job->ibs[0].length_dw > num_dw);
        if (direct_submit) {
+               WARN_ON(!ring->ready);
                r = amdgpu_ib_schedule(ring, job->num_ibs, job->ibs,
                                       NULL, fence);
                job->fence = dma_fence_get(*fence);
@@ -1692,11 +1693,6 @@ int amdgpu_fill_buffer(struct amdgpu_bo *bo,
        struct amdgpu_job *job;
        int r;
-       if (!ring->ready) {
-               DRM_ERROR("Trying to clear memory with ring turned off.\n");
-               return -EINVAL;
-       }
-
        if (bo->tbo.mem.mem_type == TTM_PL_TT) {
                r = amdgpu_ttm_alloc_gart(&bo->tbo);
                if (r)
_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx
