Oh yeah, I grepped for "ring->ready = false" and missed the others ...

-----Original Message-----
From: Koenig, Christian 
Sent: March 1, 2018 17:51
To: Liu, Monk <monk....@amd.com>; amd-gfx@lists.freedesktop.org
Subject: Re: [PATCH 3/4] drm/amdgpu: don't return when ring not ready for 
fill_buffer

Look for amdgpu_ttm_set_active_vram_size(); setting ring->ready to false is
directly beneath it.
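
For reference, the pattern is roughly this in si_dma_stop() (a paraphrased
sketch from memory, not the exact upstream code):

	/* Paraphrased si_dma_stop(): when the ring that backs buffer_funcs
	 * is torn down, the active VRAM size is shrunk to the CPU-visible
	 * part and ring->ready is cleared right below that call. */
	static void si_dma_stop(struct amdgpu_device *adev)
	{
		unsigned i;

		for (i = 0; i < adev->sdma.num_instances; i++) {
			struct amdgpu_ring *ring = &adev->sdma.instance[i].ring;

			/* ... disable the ring buffer via register writes ... */

			if (adev->mman.buffer_funcs_ring == ring)
				amdgpu_ttm_set_active_vram_size(adev,
						adev->gmc.visible_vram_size);
			ring->ready = false;
		}
	}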

Christian.

On 01.03.2018 at 10:41, Liu, Monk wrote:
> Oops, looks like only SI calls this dma_stop() during hw_fini; I
> checked sdma 4.0/3.0/2.4 and found no such logic.
>
>
>
> -----Original Message-----
> From: Christian König [mailto:ckoenig.leichtzumer...@gmail.com]
> Sent: March 1, 2018 17:39
> To: Liu, Monk <monk....@amd.com>; Koenig, Christian 
> <christian.koe...@amd.com>; amd-gfx@lists.freedesktop.org
> Subject: Re: [PATCH 3/4] drm/amdgpu: don't return when ring not ready 
> for fill_buffer
>
> Hi Monk,
>
> look at si_dma_stop() and/or other SDMA generation code.
>
> Regards,
> Christian.
>
> On 01.03.2018 at 10:37, Liu, Monk wrote:
>> Hi Christian
>>
>> During suspend/resume, which stage sets the SDMA's ring->ready to
>> false? I can only find it in amdgpu_ring_fini(), but that is only
>> invoked in unload_kms(); it looks like device_suspend() will not set
>> ring->ready to false.
>>
>> /Monk
>>
>> -----Original Message-----
>> From: amd-gfx [mailto:amd-gfx-boun...@lists.freedesktop.org] On 
>> Behalf Of Liu, Monk
>> Sent: March 1, 2018 16:52
>> To: Koenig, Christian <christian.koe...@amd.com>; 
>> amd-gfx@lists.freedesktop.org
>> Subject: RE: [PATCH 3/4] drm/amdgpu: don't return when ring not ready 
>> for fill_buffer
>>
>>> amdgpu_evict_flags() and amdgpu_bo_move() need to know if the SDMA ring is 
>>> available during suspend/resume.
>> As long as we are in GPU reset, we don't return an error when the ring is 
>> not ready, because eventually the job either succeeds or fails and gets 
>> rescheduled by the scheduler, since it is a kernel job. What's your concern?
>>
>> /Monk
>>
>>
>> -----Original Message-----
>> From: Koenig, Christian
>> Sent: March 1, 2018 16:41
>> To: Liu, Monk <monk....@amd.com>; amd-gfx@lists.freedesktop.org
>> Subject: Re: [PATCH 3/4] drm/amdgpu: don't return when ring not ready 
>> for fill_buffer
>>
>>> If we are in the GPU reset stage, we can let SDMA jobs live with the ring 
>>> not being ready; can you point out where the problem is?
>> amdgpu_evict_flags() and amdgpu_bo_move() need to know if the SDMA ring is 
>> available during suspend/resume.
>>
>> E.g. suspend works like this:
>> 1. Evict all normal BOs from VRAM using the SDMA.
>> 2. Disable SDMA, GART, PSP etc... e.g. disable the hardware.
>> 3. Unpin all BOs in VRAM used by hardware engines, e.g. the GART table, 
>> firmware etc...
>> 4. Evict all remaining BOs from VRAM using CPU copies.
>>
>> Resume works the other way around:
>> 1. Move all BOs for GART table and firmware back into VRAM using CPU copies.
>> 2. Pin those BOs in VRAM and enable the hardware.
>> 3. Use the SDMA to move back all other BOs into VRAM.
>>
>> To figure out whether we should use the CPU copy or the SDMA, the 
>> ring->ready flag is used. So removing this check is really a no-go because 
>> it would break suspend/resume.
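>>
>> To make that concrete, the move path does roughly the following (a
>> paraphrased sketch, not the exact upstream code):
>>
>>      /* If the SDMA ring backing buffer_funcs is not ready, fall back
>>       * to a CPU memcpy move instead of a hardware copy. */
>>      if (adev->mman.buffer_funcs_ring == NULL ||
>>          !adev->mman.buffer_funcs_ring->ready) {
>>              r = ttm_bo_move_memcpy(bo, ctx, new_mem);
>>              return r;
>>      }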
>>
>> The only other alternative I can see is to add a separate flag in 
>> amdgpu_mman indicating whether buffer_funcs should be used or not. See all 
>> the callers of amdgpu_ttm_set_active_vram_size() as well.
>>
>> Something like renaming amdgpu_ttm_set_active_vram_size() into
>> amdgpu_ttm_set_buffer_funcs_ready() and ignoring any change while gpu_reset 
>> is set.
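>>
>> A minimal sketch of that idea (the helper name and the flag/field names
>> are hypothetical, just to illustrate the proposal):
>>
>>      /* Hypothetical replacement for amdgpu_ttm_set_active_vram_size():
>>       * remember whether buffer_funcs may be used, but ignore any
>>       * change while a GPU reset is in progress. */
>>      void amdgpu_ttm_set_buffer_funcs_ready(struct amdgpu_device *adev,
>>                                             bool ready)
>>      {
>>              if (adev->in_gpu_reset)
>>                      return;
>>
>>              adev->mman.buffer_funcs_enabled = ready;
>>              adev->mman.active_vram_size = ready ?
>>                      adev->gmc.real_vram_size :
>>                      adev->gmc.visible_vram_size;
>>      }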
>>
>> BTW: Changing the active VRAM size in GPU reset is quite a bug you stumbled 
>> over here, so this really needs fixing anyway.
>>
>> Regards,
>> Christian.
>>
>> On 01.03.2018 at 09:23, Liu, Monk wrote:
>>> Accel_working is only for the GFX ring; I don't think testing it is 
>>> appropriate for SDMA jobs ....
>>>
>>> If we are in the GPU reset stage, we can let SDMA jobs live with the ring 
>>> not being ready; can you point out where the problem is?
>>>
>>> In fact we did stress TDR testing with this patch and it really fixes a 
>>> couple of over-kill issues.
>>>
>>>
>>> /Monk
>>>
>>> -----Original Message-----
>>> From: Christian König [mailto:ckoenig.leichtzumer...@gmail.com]
>>> Sent: March 1, 2018 16:16
>>> To: Liu, Monk <monk....@amd.com>; Koenig, Christian 
>>> <christian.koe...@amd.com>; amd-gfx@lists.freedesktop.org
>>> Subject: Re: [PATCH 3/4] drm/amdgpu: don't return when ring not 
>>> ready for fill_buffer
>>>
>>> Ah, crap yeah that won't work since we don't free the ring.
>>>
>>> The key point is that we need to distinguish between the ring not working 
>>> temporarily because we are in a GPU reset, and it not working at all 
>>> because we are missing firmware or something like that.
>>>
>>> And no, checking the gpu_reset flag is totally racy and can't be done 
>>> either. How about checking accel_working instead?
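>>>
>>> Something along these lines (just a sketch of the suggestion, not a
>>> tested change):
>>>
>>>      /* Suggested check in amdgpu_fill_buffer(): only refuse the copy
>>>       * when acceleration is gone for good, not merely during a reset. */
>>>      if (!adev->accel_working) {
>>>              DRM_ERROR("Trying to clear memory with acceleration disabled.\n");
>>>              return -EINVAL;
>>>      }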
>>>
>>> Christian.
>>>
>>> On 01.03.2018 at 07:01, Liu, Monk wrote:
>>>>> Please change the test to use ring->ring_obj instead; this way we still 
>>>>> bail out if somebody tries to submit commands before the ring is even 
>>>>> allocated.
>>>> I don't understand how fill_buffer() could get run in a case where 
>>>> ring->ring_obj is not even allocated ... where is such a case?
>>>>
>>>>
>>>> /Monk
>>>>
>>>> -----Original Message-----
>>>> From: Koenig, Christian
>>>> Sent: February 28, 2018 20:46
>>>> To: Liu, Monk <monk....@amd.com>; amd-gfx@lists.freedesktop.org
>>>> Subject: Re: [PATCH 3/4] drm/amdgpu: don't return when ring not 
>>>> ready for fill_buffer
>>>>
>>>> Good point, but in this case we need some other handling.
>>>>
>>>> Please change the test to use ring->ring_obj instead; this way we still 
>>>> bail out if somebody tries to submit commands before the ring is even 
>>>> allocated.
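>>>>
>>>> I.e. roughly (a sketch of the suggestion only):
>>>>
>>>>      /* Bail out only if the ring was never even allocated, instead of
>>>>       * keying off ring->ready, which also goes false during a reset. */
>>>>      if (!ring->ring_obj) {
>>>>              DRM_ERROR("Trying to clear memory with ring turned off.\n");
>>>>              return -EINVAL;
>>>>      }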
>>>>
>>>> And you also need to fix a couple of other places in amdgpu_ttm.c.
>>>>
>>>> Regards,
>>>> Christian.
>>>>
>>>> On 28.02.2018 at 13:34, Liu, Monk wrote:
>>>>> Because when the SDMA is hung by, say, process A, another process B may 
>>>>> already be running inside fill_buffer(). So just let process B continue; 
>>>>> don't block it, otherwise process B would fail for a purely software 
>>>>> reason.
>>>>>
>>>>> Let it run; eventually process B's job will fail and GPU recovery will 
>>>>> repeat it again (since it is a kernel job).
>>>>>
>>>>> Without this solution other processes will be greatly harmed by one 
>>>>> black sheep that triggers GPU recovery.
>>>>>
>>>>> /Monk
>>>>>
>>>>>
>>>>>
>>>>> -----Original Message-----
>>>>> From: Christian König [mailto:ckoenig.leichtzumer...@gmail.com]
>>>>> Sent: February 28, 2018 20:24
>>>>> To: Liu, Monk <monk....@amd.com>; amd-gfx@lists.freedesktop.org
>>>>> Subject: Re: [PATCH 3/4] drm/amdgpu: don't return when ring not 
>>>>> ready for fill_buffer
>>>>>
>>>>> On 28.02.2018 at 08:21, Monk Liu wrote:
>>>>>> because at this time the SDMA may be under GPU RESET, so its ring->ready 
>>>>>> may not be true; keep going and the GPU scheduler will reschedule this 
>>>>>> job if it fails.
>>>>>>
>>>>>> give a warning in copy_buffer when going through direct_submit while
>>>>>> ring->ready is false
>>>>> NAK, that test has already saved us quite a bunch of trouble with the fb 
>>>>> layer.
>>>>>
>>>>> Why exactly are you running into issues with that?
>>>>>
>>>>> Christian.
>>>>>
>>>>>> Change-Id: Ife6cd55e0e843d99900e5bed5418499e88633685
>>>>>> Signed-off-by: Monk Liu <monk....@amd.com>
>>>>>> ---
>>>>>>        drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c | 6 +-----
>>>>>>        1 file changed, 1 insertion(+), 5 deletions(-)
>>>>>>
>>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
>>>>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
>>>>>> index e38e6db..7b75ac9 100644
>>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
>>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
>>>>>> @@ -1656,6 +1656,7 @@ int amdgpu_copy_buffer(struct amdgpu_ring *ring, uint64_t src_offset,
>>>>>>          amdgpu_ring_pad_ib(ring, &job->ibs[0]);
>>>>>>          WARN_ON(job->ibs[0].length_dw > num_dw);
>>>>>>          if (direct_submit) {
>>>>>> +                WARN_ON(!ring->ready);
>>>>>>                  r = amdgpu_ib_schedule(ring, job->num_ibs, job->ibs,
>>>>>>                                         NULL, fence);
>>>>>>                  job->fence = dma_fence_get(*fence);
>>>>>> @@ -1692,11 +1693,6 @@ int amdgpu_fill_buffer(struct amdgpu_bo *bo,
>>>>>>          struct amdgpu_job *job;
>>>>>>          int r;
>>>>>>        
>>>>>> -        if (!ring->ready) {
>>>>>> -                DRM_ERROR("Trying to clear memory with ring turned off.\n");
>>>>>> -                return -EINVAL;
>>>>>> -        }
>>>>>> -
>>>>>>          if (bo->tbo.mem.mem_type == TTM_PL_TT) {
>>>>>>                  r = amdgpu_ttm_alloc_gart(&bo->tbo);
>>>>>>                  if (r)

_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx
