Thanks for checking.

Alex

On Thu, Oct 9, 2025 at 9:55 AM Kim, Jonathan <[email protected]> wrote:
>
> [Public]
>
> > -----Original Message-----
> > From: Alex Deucher <[email protected]>
> > Sent: Wednesday, October 8, 2025 4:16 PM
> > To: Kim, Jonathan <[email protected]>
> > Cc: [email protected]; Liu, Shaoyun <[email protected]>
> > Subject: Re: [PATCH] drm/amdgpu: fix gfx12 mes packet submission status 
> > check
> >
> > On Wed, Oct 8, 2025 at 4:12 PM Kim, Jonathan <[email protected]>
> > wrote:
> > >
> > > [Public]
> > >
> > > > -----Original Message-----
> > > > From: Alex Deucher <[email protected]>
> > > > Sent: Wednesday, October 8, 2025 1:46 PM
> > > > To: Kim, Jonathan <[email protected]>
> > > > Cc: [email protected]; Liu, Shaoyun <[email protected]>
> > > > Subject: Re: [PATCH] drm/amdgpu: fix gfx12 mes packet submission status
> > check
> > > >
> > > > On Wed, Oct 8, 2025 at 1:27 PM Kim, Jonathan <[email protected]>
> > > > wrote:
> > > > >
> > > > > [Public]
> > > > >
> > > > > > -----Original Message-----
> > > > > > From: Alex Deucher <[email protected]>
> > > > > > Sent: Wednesday, October 8, 2025 1:12 PM
> > > > > > To: Kim, Jonathan <[email protected]>
> > > > > > Cc: [email protected]; Liu, Shaoyun
> > <[email protected]>
> > > > > > Subject: Re: [PATCH] drm/amdgpu: fix gfx12 mes packet submission 
> > > > > > status
> > > > check
> > > > > >
> > > > > > On Wed, Oct 8, 2025 at 12:51 PM Jonathan Kim
> > <[email protected]>
> > > > > > wrote:
> > > > > > >
> > > > > > > The driver currently only checks that the MES packet submission 
> > > > > > > fence
> > > > > > > did not timeout but does not actually check if the fence return 
> > > > > > > status
> > > > > > > matches the expected completion value it passed to MES prior to
> > > > > > > submission.
> > > > > > >
> > > > > > > For example, this can result in REMOVE_QUEUE requests returning
> > success
> > > > > > > to the driver when the queue actually failed to preempt.
> > > > > > >
> > > > > > > Fix this by having the driver actually compare the completion 
> > > > > > > status
> > > > > > > value to the expected success value.
> > > > > >
> > > > > > This should be correct as is:
> > > > > >
> > > > > > *status_ptr = 0;
> > > > >
> > > > > That's not what I'm observing at the moment.
> > > > > amdgpu_fence_wait_polling can still return fine where status_ptr != 0 
> > > > > nor 1.
> > > > > From what I've been told, only 0x1 == success (the completion fence 
> > > > > value
> > > > passed to MES).
> > > > > Shaoyun can probably elaborate or correct me if I'm wrong.
> > > > > It's pretty easy for me to put the HW in a bad state (dangling queues 
> > > > > post
> > > > remove) without the driver knowing in its current state.
> > > >
> > > > 1 is just what the driver puts as the fence value to write:
> > > > api_status->api_completion_fence_value = 1;
> > > > and the memory location is initialized to 0:
> > > > *status_ptr = 0;
> > > > The firmware should either write 1 or nothing, writing a random value
> > > > in there looks like memory corruption or a firmware bug.  If it
> > > > doesn't write the fence, the driver can assume the operation failed.
> > > > If it writes random values, then we have no idea what's going on.
> > >
> > > FW writes high 32 bits to status_ptr for debug codes if low bits is 0.
> > > So we should probably only check the 32 lower bits then.
> >
> > Thanks for clearing this up.  Please check if mes_v11 needs a similar fix.
>
> Looks like GFX11 is fine the way it is.
> The hi/lo bit split in the status return only affects GFX12.
>
> Jon
>
> >
> > Alex
> >
> > >
> > > Jon
> > >
> > > >
> > > > Alex
> > > >
> > > > >
> > > > > Jon
> > > > >
> > > > > > ...
> > > > > > api_status->api_completion_fence_value = 1;
> > > > > > ...
> > > > > > if (r < 1 || !*status_ptr) {
> > > > > >
> > > > > > Alex
> > > > > >
> > > > > > >
> > > > > > > Signed-off-by: Jonathan Kim <[email protected]>
> > > > > > > ---
> > > > > > >  drivers/gpu/drm/amd/amdgpu/mes_v12_0.c | 3 +--
> > > > > > >  1 file changed, 1 insertion(+), 2 deletions(-)
> > > > > > >
> > > > > > > diff --git a/drivers/gpu/drm/amd/amdgpu/mes_v12_0.c
> > > > > > b/drivers/gpu/drm/amd/amdgpu/mes_v12_0.c
> > > > > > > index aff06f06aeee..58f61170cf85 100644
> > > > > > > --- a/drivers/gpu/drm/amd/amdgpu/mes_v12_0.c
> > > > > > > +++ b/drivers/gpu/drm/amd/amdgpu/mes_v12_0.c
> > > > > > > @@ -228,8 +228,7 @@ static int
> > > > > > mes_v12_0_submit_pkt_and_poll_completion(struct amdgpu_mes *mes,
> > > > > > >                         pipe, x_pkt->header.opcode);
> > > > > > >
> > > > > > >         r = amdgpu_fence_wait_polling(ring, seq, timeout);
> > > > > > > -       if (r < 1 || !*status_ptr) {
> > > > > > > -
> > > > > > > +       if (r < 1 || *status_ptr != 
> > > > > > > api_status->api_completion_fence_value)
> > {
> > > > > > >                 if (misc_op_str)
> > > > > > >                         dev_err(adev->dev, "MES(%d) failed to 
> > > > > > > respond to msg=%s
> > > > > > (%s)\n",
> > > > > > >                                 pipe, op_str, misc_op_str);
> > > > > > > --
> > > > > > > 2.34.1
> > > > > > >

Reply via email to