On 2025-05-13 17:01, Alex Deucher wrote:
On Tue, May 13, 2025 at 4:33 PM David Wu <david...@amd.com> wrote:

On 2025-05-13 15:29, Alex Deucher wrote:
On Tue, May 13, 2025 at 3:01 PM David Wu <david...@amd.com> wrote:
On 2025-05-13 14:40, Alex Deucher wrote:

On Tue, May 13, 2025 at 2:23 PM David (Ming Qiang) Wu <david....@amd.com> wrote:

V2: not to add extra read-back in vcn_v4_0_start and vcn_v5_0_0_start as
      there are read-back calls already. New comments for better understanding.

Similar to the previous changes made for VCN v4.0.5, the addition of
register read-back support in VCN v4.0.0 and v5.0.0 is intended to
prevent potential race conditions, even though such issues have not
been observed yet. This change ensures consistency across different
VCN variants and helps avoid similar issues on newer or closely
related GPUs. The overhead introduced by this read-back is negligible.

Signed-off-by: David (Ming Qiang) Wu <david....@amd.com>
Reviewed-by: Mario Limonciello <mario.limoncie...@amd.com>

Maybe split this into two patches, one for vcn 4 and one for vcn 5.
That will make it easier to backport to stable.  What about other
VCNs?

will split.

This applies to those VCNs where regVCN_RB1_DB_CRTL is used for setting 
doorbell index, which

means VCN 4 and up - all of them are covered (similar code is already there for 
those not in this patch).
Sure that prevents the doorbell from getting missed, but what about
other registers setup in the VCN start() functions?  What if some of
those are still pending when the doorbell is rung for other VCNs?
I think adding a read-back is needed if there is any concern about race
condition.
If we only concern about start() it should be easy to add. The question
is how we will know there is
a race condition. Adding read back everywhere when missing after write
is not a solution I think.
For any VCN functions we need to check carefully
(eg. vcn_v4_0_unified_ring_set_wptr()) in case it adds too much
overhead and actually not needed (at least haven't seen the issue).
Any suggestion as to where we should add or at the moment for _start()?
I can work on it for sure or leave it for
future improvement.
I think _start() makes the most sense because it will only be called
when we power on the VCN instance.  As long as work keeps coming in,
it will stay on.  The race is theoretically possible on any VCN
instance.  E.g., when the first VCN job comes in, VCN gets powered on,
and then we call _start() to program what we need.  After that, we
ring the doorbell to kick off the job.  The programming sequence in
_start() could still be in flight on the PCIe bus when the doorbell
gets rung.  Which is apparently exactly what happens when we hit this
issue on VCN 4.x and 5.x.  On VCN the doorbell gets ignored because it
races with the doorbell id register, but on other VCNs, the doorbell
getting missed may not happen, but it could be something else that
races, e.g., ring size.
well - right. For production we enable DPG which is now protected.
the other register write at the end of _start() is for non-DPG case.
I can add read-back there just in case.
I guess for other issues/potential race conditions we need to investigate
case by case. Sounds OK?

David


Alex

David

Alex

David

Alex

---
   drivers/gpu/drm/amd/amdgpu/vcn_v4_0.c   | 4 ++++
   drivers/gpu/drm/amd/amdgpu/vcn_v5_0_0.c | 4 ++++
   2 files changed, 8 insertions(+)

diff --git a/drivers/gpu/drm/amd/amdgpu/vcn_v4_0.c 
b/drivers/gpu/drm/amd/amdgpu/vcn_v4_0.c
index 8fff470bce873..070a2a8cdf6f4 100644
--- a/drivers/gpu/drm/amd/amdgpu/vcn_v4_0.c
+++ b/drivers/gpu/drm/amd/amdgpu/vcn_v4_0.c
@@ -1122,6 +1122,10 @@ static int vcn_v4_0_start_dpg_mode(struct 
amdgpu_vcn_inst *vinst, bool indirect)
                          ring->doorbell_index << 
VCN_RB1_DB_CTRL__OFFSET__SHIFT |
                          VCN_RB1_DB_CTRL__EN_MASK);

+       /* Keeping one read-back to ensure all register writes are done, 
otherwise
+        * it may introduce race conditions */
+       RREG32_SOC15(VCN, inst_idx, regVCN_RB1_DB_CTRL);
+
          return 0;
   }

diff --git a/drivers/gpu/drm/amd/amdgpu/vcn_v5_0_0.c 
b/drivers/gpu/drm/amd/amdgpu/vcn_v5_0_0.c
index 27dcc6f37a730..77c27a317e4c8 100644
--- a/drivers/gpu/drm/amd/amdgpu/vcn_v5_0_0.c
+++ b/drivers/gpu/drm/amd/amdgpu/vcn_v5_0_0.c
@@ -794,6 +794,10 @@ static int vcn_v5_0_0_start_dpg_mode(struct 
amdgpu_vcn_inst *vinst,
                  ring->doorbell_index << VCN_RB1_DB_CTRL__OFFSET__SHIFT |
                  VCN_RB1_DB_CTRL__EN_MASK);

+       /* Keeping one read-back to ensure all register writes are done, 
otherwise
+        * it may introduce race conditions */
+       RREG32_SOC15(VCN, inst_idx, regVCN_RB1_DB_CTRL);
+
          return 0;
   }

--
2.34.1

Reply via email to