Re: [PATCH] Increase tlb flush timeout for sriov

Deucher, Alexander Tue, 09 Aug 2022 10:05:21 -0700

[AMD Official Use Only - General]

Please add these details to the patch description when you respin.
________________________________
From: amd-gfx <amd-gfx-boun...@lists.freedesktop.org> on behalf of Liu, Shaoyun 
<shaoyun....@amd.com>
Sent: Monday, August 8, 2022 6:30 PM
To: Alex Deucher <alexdeuc...@gmail.com>; Milinkovic, Dusica 
<dusica.milinko...@amd.com>
Cc: amd-gfx@lists.freedesktop.org <amd-gfx@lists.freedesktop.org>
Subject: Re: [PATCH] Increase tlb flush timeout for sriov


As I discussed with Alice ,this change is when multi-vf running compute 
benchmark (Luxmark) at the same time, which involves multiple vf  do the tlb 
invalidation at the same time. They observed kiq timeout after submit the tlb 
invalidate command. Although each vf has the invalidate register set, but from 
hw, the invalidate requests are queue to execute.

Alice, as we discussed, we can use maximum 12*100ms for the timeout , it 
shouldn't be 6000ms. Did you see issues with 1200 ms timeout?

Regards
Shaoyun.liu
________________________________
From: amd-gfx <amd-gfx-boun...@lists.freedesktop.org> on behalf of Alex Deucher 
<alexdeuc...@gmail.com>
Sent: August 8, 2022 4:49 PM
To: Milinkovic, Dusica <dusica.milinko...@amd.com>
Cc: amd-gfx@lists.freedesktop.org <amd-gfx@lists.freedesktop.org>
Subject: Re: [PATCH] Increase tlb flush timeout for sriov

On Wed, Aug 3, 2022 at 5:02 AM Dusica Milinkovic
<dusica.milinko...@amd.com> wrote:
>

Please include a patch description.  Why do you need a longer timeout?
 What problem does it fix?

> Signed-off-by: Dusica Milinkovic <dusica.milinko...@amd.com>
> ---
>  drivers/gpu/drm/amd/amdgpu/gmc_v10_0.c | 6 +++++-
>  drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c  | 6 +++++-
>  2 files changed, 10 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/gmc_v10_0.c 
> b/drivers/gpu/drm/amd/amdgpu/gmc_v10_0.c
> index 9ae8cdaa033e..6ab7d329916f 100644
> --- a/drivers/gpu/drm/amd/amdgpu/gmc_v10_0.c
> +++ b/drivers/gpu/drm/amd/amdgpu/gmc_v10_0.c
> @@ -419,6 +419,7 @@ static int gmc_v10_0_flush_gpu_tlb_pasid(struct 
> amdgpu_device *adev,
>         uint32_t seq;
>         uint16_t queried_pasid;
>         bool ret;
> +       uint32_t sriov_usec_timeout = 6000000;  /* wait for 12 * 500ms for 
> SRIOV */
>         struct amdgpu_ring *ring = &adev->gfx.kiq.ring;
>         struct amdgpu_kiq *kiq = &adev->gfx.kiq;
>
> @@ -437,7 +438,10 @@ static int gmc_v10_0_flush_gpu_tlb_pasid(struct 
> amdgpu_device *adev,
>
>                 amdgpu_ring_commit(ring);
>                 spin_unlock(&adev->gfx.kiq.ring_lock);
> -               r = amdgpu_fence_wait_polling(ring, seq, adev->usec_timeout);
> +               if (amdgpu_sriov_vf(adev))
> +                       r = amdgpu_fence_wait_polling(ring, seq, 
> sriov_usec_timeout);
> +               else
> +                       r = amdgpu_fence_wait_polling(ring, seq, 
> adev->usec_timeout);

What about something like this?
u32 usec_timeout = amdgpu_sriov_vf(adev) ? 6000000 :
adev->usec_timeout;  /* wait for 12 * 500ms for SRIOV */
...
r = amdgpu_fence_wait_polling(ring, seq, usec_timeout);


>                 if (r < 1) {
>                         dev_err(adev->dev, "wait for kiq fence error: 
> %ld.\n", r);
>                         return -ETIME;
> diff --git a/drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c 
> b/drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c
> index 22761a3bb818..941a6b52fa72 100644
> --- a/drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c
> +++ b/drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c
> @@ -896,6 +896,7 @@ static int gmc_v9_0_flush_gpu_tlb_pasid(struct 
> amdgpu_device *adev,
>         uint32_t seq;
>         uint16_t queried_pasid;
>         bool ret;
> +       uint32_t sriov_usec_timeout = 6000000;  /* wait for 12 * 500ms for 
> SRIOV */
>         struct amdgpu_ring *ring = &adev->gfx.kiq.ring;
>         struct amdgpu_kiq *kiq = &adev->gfx.kiq;
>
> @@ -935,7 +936,10 @@ static int gmc_v9_0_flush_gpu_tlb_pasid(struct 
> amdgpu_device *adev,
>
>                 amdgpu_ring_commit(ring);
>                 spin_unlock(&adev->gfx.kiq.ring_lock);
> -               r = amdgpu_fence_wait_polling(ring, seq, adev->usec_timeout);
> +               if (amdgpu_sriov_vf(adev))
> +                       r = amdgpu_fence_wait_polling(ring, seq, 
> sriov_usec_timeout);
> +               else
> +                       r = amdgpu_fence_wait_polling(ring, seq, 
> adev->usec_timeout);

Same comment here.

Alex

>                 if (r < 1) {
>                         dev_err(adev->dev, "wait for kiq fence error: 
> %ld.\n", r);
>                         up_read(&adev->reset_domain->sem);
> --
> 2.25.1
>

Re: [PATCH] Increase tlb flush timeout for sriov

Reply via email to