Am 12.03.25 um 10:23 schrieb Zhang, Jesse(Jie):
>
> [AMD Official Use Only - AMD Internal Distribution Only]
>
>
>  
>
>  
>
> *From:*Koenig, Christian <[email protected]>
> *Sent:* Wednesday, March 12, 2025 4:39 PM
> *To:* Zhang, Jesse(Jie) <[email protected]>; [email protected]
> *Cc:* Deucher, Alexander <[email protected]>; Kim, Jonathan 
> <[email protected]>; Zhu, Jiadong <[email protected]>
> *Subject:* Re: [PATCH 1/7] drm/amd/amdgpu: Simplify SDMA reset mechanism by 
> removing dynamic callbacks
>
>  
>
> Am 12.03.25 um 09:15 schrieb Zhang, Jesse(Jie):
>
>     [SNIP9
>
>         -
>
>         +     gfx_ring->funcs->stop_queue(adev, instance_id);
>
>      
>
>     Yeah that starts to look good. Question here is who is calling 
> amdgpu_sdma_reset_engine()?
>
>      
>
>     If this call comes from engine specific code we might not need the 
> start/stop_queue callbacks all together.
>
>      
>
>         Kfd and sdma v4/v5/v5_2 will call amdgpu_sdma_reset_engine, and 
> start/stop_queue callbacks are only implemented in sdmav4/sdmav5/sdma5_2.
>
>
> Why would the KFD call this as well? Because it detects an issue with a SDMA 
> user queue  If yes I would rather suggest that the KFD calls the reset 
> function of the paging queue.
>
> Since this reset function is specific to the SDMA HW generation anyway you 
> don't need those extra functions to abstract starting and stopping of the 
> queue for each HW generation.
>
> kfd can't call reset function directly, unless we add a parameter src  to 
> distinguish kfd and kgd in reset function, like this:
>
> int (*reset)(struct amdgpu_ring *ring, unsigned int vmid, */int src/* );
>
> As Alex said in another thread,
>
> We need to distinguish  kfd and kgd  in reset.
>
> If kfd triggers a reset, kgd must save healthy jobs and recover jobs after 
> reset.
>
> If kgd triggers a reset, kgd must abandon bad jobs after reset.(and perhaps 
> kfd needs to save its healthy jobs for recovery).
>

I don't think the source of the reset should be relevant to the reset procedure.

The source is basically just the first one who runs into a timeout, that can be 
both KFD and KGD.

But the cause of the timeout is not necessary the one who signals that a 
timeout happens.

So as far as I can see you should not have that as parameter either.

>  
>
> If we can add a parameter, I am ok for that solution too.
>
>  
>
> Additionally:
>
> For sdma6/7, when a queue reset fails, we may need to fall back to an engine 
> reset for a attempt.
>

Yeah, but that should be trivial.

Regards,
Christian.

>  
>
> Thanks
>
> Jesse
>
>
> Regards,
> Christian.
>
>
>      
>
>      
>
>     Thanks
>
>     Jesse
>
>      
>
>     Regards,
>
>     Christian.
>
>      
>
>               /* Perform the SDMA reset for the specified instance */
>
>               ret = amdgpu_dpm_reset_sdma(adev, 1 << instance_id);
>
>               if (ret) {
>
>         @@ -591,18 +573,7 @@ int amdgpu_sdma_reset_engine(struct 
> amdgpu_device *adev, uint32_t instance_id, b
>
>                       goto exit;
>
>               }
>
>          
>
>         -     /* Invoke all registered post_reset callbacks */
>
>         -     list_for_each_entry(funcs, &adev->sdma.reset_callback_list, 
> list) {
>
>         -             if (funcs->post_reset) {
>
>         -                     ret = funcs->post_reset(adev, instance_id);
>
>         -                     if (ret) {
>
>         -                             dev_err(adev->dev,
>
>         -                             "afterReset callback failed for 
> instance %u: %d\n",
>
>         -                                     instance_id, ret);
>
>         -                             goto exit;
>
>         -                     }
>
>         -             }
>
>         -     }
>
>         +     gfx_ring->funcs->start_queue(adev, instance_id);
>
>          
>
>          exit:
>
>               /* Restart the scheduler's work queue for the GFX and page rings
>
>         diff --git a/drivers/gpu/drm/amd/amdgpu/sdma_v4_4_2.c
>
>         b/drivers/gpu/drm/amd/amdgpu/sdma_v4_4_2.c
>
>         index fd34dc138081..c1f7ccff9c4e 100644
>
>         --- a/drivers/gpu/drm/amd/amdgpu/sdma_v4_4_2.c
>
>         +++ b/drivers/gpu/drm/amd/amdgpu/sdma_v4_4_2.c
>
>         @@ -2132,6 +2132,8 @@ static const struct amdgpu_ring_funcs 
> sdma_v4_4_2_ring_funcs = {
>
>               .emit_reg_wait = sdma_v4_4_2_ring_emit_reg_wait,
>
>               .emit_reg_write_reg_wait = 
> amdgpu_ring_emit_reg_write_reg_wait_helper,
>
>               .reset = sdma_v4_4_2_reset_queue,
>
>         +     .stop_queue = sdma_v4_4_2_stop_queue,
>
>         +     .start_queue = sdma_v4_4_2_restore_queue,
>
>               .is_guilty = sdma_v4_4_2_ring_is_guilty,  };
>
>          
>
>      
>
>  
>

Reply via email to