amdgpu: don't return when ring not ready for fill_buffer

Liu, Monk Thu, 01 Mar 2018 01:41:32 -0800

Oops, looks only si will call this dma_stop() during hw_fini, I checked for 
sdma4.0/3.0/2.4, no such logic




-----Original Message-----
From: Christian König [mailto:ckoenig.leichtzumer...@gmail.com] 
Sent: 2018年3月1日 17:39
To: Liu, Monk <monk....@amd.com>; Koenig, Christian <christian.koe...@amd.com>; 
amd-gfx@lists.freedesktop.org
Subject: Re: [PATCH 3/4] drm/amdgpu: don't return when ring not ready for 
fill_buffer

Hi Monk,

look at si_dma_stop() and/or other SDMA generation code.

Regards,
Christian.

Am 01.03.2018 um 10:37 schrieb Liu, Monk:
> Hi Christian
>
> During suspend/resume, which stage will set sdma's ring->ready to 
> false ?? I can only find it in amdgpu_ring_fini() But I saw that it is 
> only invoked in unload_kms(), looks device_suspnd() will not set 
> ring->ready to false
>
> /Monk
>
> -----Original Message-----
> From: amd-gfx [mailto:amd-gfx-boun...@lists.freedesktop.org] On Behalf 
> Of Liu, Monk
> Sent: 2018年3月1日 16:52
> To: Koenig, Christian <christian.koe...@amd.com>; 
> amd-gfx@lists.freedesktop.org
> Subject: RE: [PATCH 3/4] drm/amdgpu: don't return when ring not ready 
> for fill_buffer
>
>> amdgpu_evict_flags() and amdgpu_bo_move() needs to know if the SDMA ring is 
>> available during suspend/resume.
> As long as we are in gpu reset, we don't return error when ring not ready, 
> cuz eventually it either success or failed and rescheduled by scheduler since 
> it is kernel job What's your concern ?
>
> /Monk
>
>
> -----Original Message-----
> From: Koenig, Christian
> Sent: 2018年3月1日 16:41
> To: Liu, Monk <monk....@amd.com>; amd-gfx@lists.freedesktop.org
> Subject: Re: [PATCH 3/4] drm/amdgpu: don't return when ring not ready 
> for fill_buffer
>
>> If we are in gpu rest stage, we can let SDMA job live with the case of ring 
>> not ready,  can you point out where is the problem ?
> amdgpu_evict_flags() and amdgpu_bo_move() needs to know if the SDMA ring is 
> available during suspend/resume.
>
> E.g. suspend works as this:
> 1. Evict all normal BOs from VRAM using the SDMA.
> 2. Disable SDMA, GART, PSP etc... e.g. disable the hardware.
> 3. Unpin all BOs in VRAM used by hardware engines, e.g. the GART table, 
> firmware etc...
> 4. Evict all remaining BOs from VRAM using CPU copies.
>
> Resume works the other way around:
> 1. Move all BOs for GART table and firmware back into VRAM using CPU copies.
> 2. Pin those BOs in VRAM and enable the hardware.
> 3. Use the SDMA to move back all other BOs into VRAM.
>
> To figure out if we should use the CPU copy or the SDMA the ring->ready flag 
> is used. So removing this check is really a no-go because it would break 
> suspend/resume.
>
> The only other alternative I can see is to add a separate flag in amdgpu_mman 
> if buffer_funcs should be used or not. See all the callers of 
> amdgpu_ttm_set_active_vram_size() as well.
>
> Something like renaming amdgpu_ttm_set_active_vram_size() into
> amdgpu_ttm_set_buffer_funcs_read() and ignoring any change if gpu_reset is 
> set.
>
> BTW: Changing the active VRAM size in GPU reset is quite a bug you stumbled 
> over here, so this really needs fixing anyway.
>
> Regards,
> Christian.
>
> Am 01.03.2018 um 09:23 schrieb Liu, Monk:
>> Accel_working is only for GFX ring, I don't think test it is appropriate for 
>> SDMA jobs ....
>>
>> If we are in gpu rest stage, we can let SDMA job live with the case of ring 
>> not ready,  can you point out where is the problem ?
>>
>> In fact we did stress TDR test with this patch and it can really fix 
>> couple over kill issues
>>
>>
>> /Monk
>>
>> -----Original Message-----
>> From: Christian König [mailto:ckoenig.leichtzumer...@gmail.com]
>> Sent: 2018年3月1日 16:16
>> To: Liu, Monk <monk....@amd.com>; Koenig, Christian 
>> <christian.koe...@amd.com>; amd-gfx@lists.freedesktop.org
>> Subject: Re: [PATCH 3/4] drm/amdgpu: don't return when ring not ready 
>> for fill_buffer
>>
>> Ah, crap yeah that won't work since we don't free the ring.
>>
>> Key point is we need to distinct between the ring doesn't work temporary 
>> because we are in a GPU reset and it doesn't work at all because we are 
>> missing firmware or stuff like that.
>>
>> And no, checking the gpu_reset flag is totally racy and can't be done 
>> either. How about checking accel_working instead?
>>
>> Christian.
>>
>> Am 01.03.2018 um 07:01 schrieb Liu, Monk:
>>>> Please change the test to use ring->ring_obj instead, this way we still 
>>>> bail out if somebody tries to submit commands before the ring is even 
>>>> allocated.
>>> I don't understand how could fill_buffer() get run under the case that 
>>> ring->ring_obj is not even allocated ... where is such case ?
>>>
>>>
>>> /Monk
>>>
>>> -----Original Message-----
>>> From: Koenig, Christian
>>> Sent: 2018年2月28日 20:46
>>> To: Liu, Monk <monk....@amd.com>; amd-gfx@lists.freedesktop.org
>>> Subject: Re: [PATCH 3/4] drm/amdgpu: don't return when ring not 
>>> ready for fill_buffer
>>>
>>> Good point, but in this case we need some other handling.
>>>
>>> Please change the test to use ring->ring_obj instead, this way we still 
>>> bail out if somebody tries to submit commands before the ring is even 
>>> allocated.
>>>
>>> And you also need to fix a couple of other places in amdgpu_ttm.c.
>>>
>>> Regards,
>>> Christian.
>>>
>>> Am 28.02.2018 um 13:34 schrieb Liu, Monk:
>>>> Because when SDMA was hang by like process A, and meanwhile another 
>>>> process B is already running into the code of fill_buffer() So just let 
>>>> process B continue, don't block it otherwise process B would fail by 
>>>> software reason .
>>>>
>>>> Let it run and finally process B's job would fail and GPU recover 
>>>> will repeat it again (since it is a kernel job)
>>>>
>>>> Without this solution other process will be greatly harmed by one 
>>>> black sheep that triggering GPU recover
>>>>
>>>> /Monk
>>>>
>>>>
>>>>
>>>> -----Original Message-----
>>>> From: Christian König [mailto:ckoenig.leichtzumer...@gmail.com]
>>>> Sent: 2018年2月28日 20:24
>>>> To: Liu, Monk <monk....@amd.com>; amd-gfx@lists.freedesktop.org
>>>> Subject: Re: [PATCH 3/4] drm/amdgpu: don't return when ring not 
>>>> ready for fill_buffer
>>>>
>>>> Am 28.02.2018 um 08:21 schrieb Monk Liu:
>>>>> because this time SDMA may under GPU RESET so its ring->ready may 
>>>>> not true, keep going and GPU scheduler will reschedule this job if 
>>>>> it failed.
>>>>>
>>>>> give a warning on copy_buffer when go through direct_submit while
>>>>> ring->ready is false
>>>> NAK, that test has already saved us quite a bunch of trouble with the fb 
>>>> layer.
>>>>
>>>> Why exactly are you running into issues with that?
>>>>
>>>> Christian.
>>>>
>>>>> Change-Id: Ife6cd55e0e843d99900e5bed5418499e88633685
>>>>> Signed-off-by: Monk Liu <monk....@amd.com>
>>>>> ---
>>>>>       drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c | 6 +-----
>>>>>       1 file changed, 1 insertion(+), 5 deletions(-)
>>>>>
>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
>>>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
>>>>> index e38e6db..7b75ac9 100644
>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
>>>>> @@ -1656,6 +1656,7 @@ int amdgpu_copy_buffer(struct amdgpu_ring *ring, 
>>>>> uint64_t src_offset,
>>>>>           amdgpu_ring_pad_ib(ring, &job->ibs[0]);
>>>>>           WARN_ON(job->ibs[0].length_dw > num_dw);
>>>>>           if (direct_submit) {
>>>>> +         WARN_ON(!ring->ready);
>>>>>                   r = amdgpu_ib_schedule(ring, job->num_ibs, job->ibs,
>>>>>                                          NULL, fence);
>>>>>                   job->fence = dma_fence_get(*fence); @@ -1692,11 +1693,6 
>>>>> @@ 
>>>>> int amdgpu_fill_buffer(struct amdgpu_bo *bo,
>>>>>           struct amdgpu_job *job;
>>>>>           int r;
>>>>>       
>>>>> - if (!ring->ready) {
>>>>> -         DRM_ERROR("Trying to clear memory with ring turned off.\n");
>>>>> -         return -EINVAL;
>>>>> - }
>>>>> -
>>>>>           if (bo->tbo.mem.mem_type == TTM_PL_TT) {
>>>>>                   r = amdgpu_ttm_alloc_gart(&bo->tbo);
>>>>>                   if (r)
>>> _______________________________________________
>>> amd-gfx mailing list
>>> amd-gfx@lists.freedesktop.org
>>> https://lists.freedesktop.org/mailman/listinfo/amd-gfx
> _______________________________________________
> amd-gfx mailing list
> amd-gfx@lists.freedesktop.org
> https://lists.freedesktop.org/mailman/listinfo/amd-gfx
> _______________________________________________
> amd-gfx mailing list
> amd-gfx@lists.freedesktop.org
> https://lists.freedesktop.org/mailman/listinfo/amd-gfx

_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

RE: [PATCH 3/4] drm/amdgpu: don't return when ring not ready for fill_buffer

Reply via email to