Thanks for checking. Alex
On Thu, Oct 9, 2025 at 9:55 AM Kim, Jonathan <[email protected]> wrote: > > [Public] > > > -----Original Message----- > > From: Alex Deucher <[email protected]> > > Sent: Wednesday, October 8, 2025 4:16 PM > > To: Kim, Jonathan <[email protected]> > > Cc: [email protected]; Liu, Shaoyun <[email protected]> > > Subject: Re: [PATCH] drm/amdgpu: fix gfx12 mes packet submission status > > check > > > > On Wed, Oct 8, 2025 at 4:12 PM Kim, Jonathan <[email protected]> > > wrote: > > > > > > [Public] > > > > > > > -----Original Message----- > > > > From: Alex Deucher <[email protected]> > > > > Sent: Wednesday, October 8, 2025 1:46 PM > > > > To: Kim, Jonathan <[email protected]> > > > > Cc: [email protected]; Liu, Shaoyun <[email protected]> > > > > Subject: Re: [PATCH] drm/amdgpu: fix gfx12 mes packet submission status > > check > > > > > > > > On Wed, Oct 8, 2025 at 1:27 PM Kim, Jonathan <[email protected]> > > > > wrote: > > > > > > > > > > [Public] > > > > > > > > > > > -----Original Message----- > > > > > > From: Alex Deucher <[email protected]> > > > > > > Sent: Wednesday, October 8, 2025 1:12 PM > > > > > > To: Kim, Jonathan <[email protected]> > > > > > > Cc: [email protected]; Liu, Shaoyun > > <[email protected]> > > > > > > Subject: Re: [PATCH] drm/amdgpu: fix gfx12 mes packet submission > > > > > > status > > > > check > > > > > > > > > > > > On Wed, Oct 8, 2025 at 12:51 PM Jonathan Kim > > <[email protected]> > > > > > > wrote: > > > > > > > > > > > > > > The driver currently only checks that the MES packet submission > > > > > > > fence > > > > > > > did not timeout but does not actually check if the fence return > > > > > > > status > > > > > > > matches the expected completion value it passed to MES prior to > > > > > > > submission. > > > > > > > > > > > > > > For example, this can result in REMOVE_QUEUE requests returning > > success > > > > > > > to the driver when the queue actually failed to preempt. > > > > > > > > > > > > > > Fix this by having the driver actually compare the completion > > > > > > > status > > > > > > > value to the expected success value. > > > > > > > > > > > > This should be correct as is: > > > > > > > > > > > > *status_ptr = 0; > > > > > > > > > > That's not what I'm observing at the moment. > > > > > amdgpu_fence_wait_polling can still return fine where status_ptr != 0 > > > > > nor 1. > > > > > From what I've been told, only 0x1 == success (the completion fence > > > > > value > > > > passed to MES). > > > > > Shaoyun can probably elaborate or correct me if I'm wrong. > > > > > It's pretty easy for me to put the HW in a bad state (dangling queues > > > > > post > > > > remove) without the driver knowing in its current state. > > > > > > > > 1 is just what the driver puts as the fence value to write: > > > > api_status->api_completion_fence_value = 1; > > > > and the memory location is initialized to 0: > > > > *status_ptr = 0; > > > > The firmware should either write 1 or nothing, writing a random value > > > > in there looks like memory corruption or a firmware bug. If it > > > > doesn't write the fence, the driver can assume the operation failed. > > > > If it writes random values, then we have no idea what's going on. > > > > > > FW writes high 32 bits to status_ptr for debug codes if low bits is 0. > > > So we should probably only check the 32 lower bits then. > > > > Thanks for clearing this up. Please check if mes_v11 needs a similar fix. > > Looks like GFX11 is fine the way it is. > The hi/lo bit split in the status return only affects GFX12. > > Jon > > > > > Alex > > > > > > > > Jon > > > > > > > > > > > Alex > > > > > > > > > > > > > > Jon > > > > > > > > > > > ... > > > > > > api_status->api_completion_fence_value = 1; > > > > > > ... > > > > > > if (r < 1 || !*status_ptr) { > > > > > > > > > > > > Alex > > > > > > > > > > > > > > > > > > > > Signed-off-by: Jonathan Kim <[email protected]> > > > > > > > --- > > > > > > > drivers/gpu/drm/amd/amdgpu/mes_v12_0.c | 3 +-- > > > > > > > 1 file changed, 1 insertion(+), 2 deletions(-) > > > > > > > > > > > > > > diff --git a/drivers/gpu/drm/amd/amdgpu/mes_v12_0.c > > > > > > b/drivers/gpu/drm/amd/amdgpu/mes_v12_0.c > > > > > > > index aff06f06aeee..58f61170cf85 100644 > > > > > > > --- a/drivers/gpu/drm/amd/amdgpu/mes_v12_0.c > > > > > > > +++ b/drivers/gpu/drm/amd/amdgpu/mes_v12_0.c > > > > > > > @@ -228,8 +228,7 @@ static int > > > > > > mes_v12_0_submit_pkt_and_poll_completion(struct amdgpu_mes *mes, > > > > > > > pipe, x_pkt->header.opcode); > > > > > > > > > > > > > > r = amdgpu_fence_wait_polling(ring, seq, timeout); > > > > > > > - if (r < 1 || !*status_ptr) { > > > > > > > - > > > > > > > + if (r < 1 || *status_ptr != > > > > > > > api_status->api_completion_fence_value) > > { > > > > > > > if (misc_op_str) > > > > > > > dev_err(adev->dev, "MES(%d) failed to > > > > > > > respond to msg=%s > > > > > > (%s)\n", > > > > > > > pipe, op_str, misc_op_str); > > > > > > > -- > > > > > > > 2.34.1 > > > > > > >
