Hi Andrey,

Most of the patches are OK, but the code will introduce an IB test failure on 
the disabled VCN of sienna_cichlid.

In the SRIOV use case we disable one VCN on sienna_cichlid. I have attached a 
patch to fix this issue; please check the attachment.

Best Regards,

Jingwen Chen


On 2/26/22 5:22 AM, Andrey Grodzovsky wrote:
> Hey, patches attached - I applied the patches and resolved the merge conflicts 
> but wasn't able to test, as my board's onboard network card doesn't work with 
> the 5.16 kernel (it does with 5.17; maybe it's a Kconfig issue and I need to check more).
> The patches are on top of commit 'cababde192b2 drm/amd/pm: fix mode2 reset 
> fail for smu 13.0.5' (Yifan Zhang, 2 days ago).
>
> Please test and let me know. Maybe by Monday I will be able to resolve the 
> connectivity issue on 5.16.
>
> Andrey
>
> On 2022-02-24 22:13, JingWen Chen wrote:
>> Hi Andrey,
>>
>> Sorry for the confusion, I meant the whole patch series. We are depending on 
>> this patch series to fix the concurrency issue within the SRIOV TDR sequence.
>>
>>
>>
>> On 2/25/22 1:26 AM, Andrey Grodzovsky wrote:
>>> No problem if so, but before I do,
>>>
>>>
>>> JingWen - why do you think this patch is needed as a standalone now? It has 
>>> no use without the entire feature together with it. Are there some changes 
>>> you want to do on top of that code?
>>>
>>>
>>> Andrey
>>>
>>>
>>> On 2022-02-24 12:12, Deucher, Alexander wrote:
>>>> [Public]
>>>>
>>>>
>>>> If it applies cleanly, feel free to drop it in.  I'll drop those patches 
>>>> for drm-next since they are already in drm-misc.
>>>>
>>>> Alex
>>>>
>>>> ------------------------------------------------------------------------
>>>> *From:* amd-gfx <amd-gfx-boun...@lists.freedesktop.org> on behalf of 
>>>> Andrey Grodzovsky <andrey.grodzov...@amd.com>
>>>> *Sent:* Thursday, February 24, 2022 11:24 AM
>>>> *To:* Chen, JingWen <jingwen.ch...@amd.com>; Christian König 
>>>> <ckoenig.leichtzumer...@gmail.com>; dri-de...@lists.freedesktop.org 
>>>> <dri-de...@lists.freedesktop.org>; amd-gfx@lists.freedesktop.org 
>>>> <amd-gfx@lists.freedesktop.org>
>>>> *Cc:* Liu, Monk <monk....@amd.com>; Chen, Horace <horace.c...@amd.com>; 
>>>> Lazar, Lijo <lijo.la...@amd.com>; Koenig, Christian 
>>>> <christian.koe...@amd.com>; dan...@ffwll.ch <dan...@ffwll.ch>
>>>> *Subject:* Re: [RFC v4 02/11] drm/amdgpu: Move scheduler init to after 
>>>> XGMI is ready
>>>> No, because the whole patch set, including this patch, has landed in
>>>> drm-misc-next and will reach amd-staging-drm-next on the next upstream
>>>> rebase, I guess.
>>>>
>>>> Andrey
>>>>
>>>> On 2022-02-24 01:47, JingWen Chen wrote:
>>>>> Hi Andrey,
>>>>>
>>>>> Will you port this patch into amd-staging-drm-next?
>>>>>
>>>>> On 2/10/22 2:06 AM, Andrey Grodzovsky wrote:
>>>>>> All comments are fixed and the code is pushed. Thanks to everyone
>>>>>> who helped with the review.
>>>>>>
>>>>>> Andrey
>>>>>>
>>>>>> On 2022-02-09 02:53, Christian König wrote:
>>>>>>> On 09.02.22 at 01:23, Andrey Grodzovsky wrote:
>>>>>>>> Before we initialize the schedulers we must know which reset
>>>>>>>> domain we are in - for a single device there is a single
>>>>>>>> domain per device and so a single wq per device. For XGMI
>>>>>>>> the reset domain spans the entire XGMI hive and so the
>>>>>>>> reset wq is per hive.
>>>>>>>>
>>>>>>>> Signed-off-by: Andrey Grodzovsky <andrey.grodzov...@amd.com>
>>>>>>> One more comment below; with that fixed, Reviewed-by: Christian König 
>>>>>>> <christian.koe...@amd.com>.
>>>>>>>
>>>>>>>> ---
>>>>>>>> drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 45 ++++++++++++++++++++++
>>>>>>>> drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c  | 34 ++--------------
>>>>>>>> drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h   |  2 +
>>>>>>>>      3 files changed, 51 insertions(+), 30 deletions(-)
>>>>>>>>
>>>>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c 
>>>>>>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>>>>>> index 9704b0e1fd82..00123b0013d3 100644
>>>>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>>>>>> @@ -2287,6 +2287,47 @@ static int amdgpu_device_fw_loading(struct 
>>>>>>>> amdgpu_device *adev)
>>>>>>>>          return r;
>>>>>>>>      }
>>>>>>>>      +static int amdgpu_device_init_schedulers(struct amdgpu_device 
>>>>>>>> *adev)
>>>>>>>> +{
>>>>>>>> +    long timeout;
>>>>>>>> +    int r, i;
>>>>>>>> +
>>>>>>>> +    for (i = 0; i < AMDGPU_MAX_RINGS; ++i) {
>>>>>>>> +        struct amdgpu_ring *ring = adev->rings[i];
>>>>>>>> +
>>>>>>>> +        /* No need to setup the GPU scheduler for rings that don't 
>>>>>>>> need it */
>>>>>>>> +        if (!ring || ring->no_scheduler)
>>>>>>>> +            continue;
>>>>>>>> +
>>>>>>>> +        switch (ring->funcs->type) {
>>>>>>>> +        case AMDGPU_RING_TYPE_GFX:
>>>>>>>> +            timeout = adev->gfx_timeout;
>>>>>>>> +            break;
>>>>>>>> +        case AMDGPU_RING_TYPE_COMPUTE:
>>>>>>>> +            timeout = adev->compute_timeout;
>>>>>>>> +            break;
>>>>>>>> +        case AMDGPU_RING_TYPE_SDMA:
>>>>>>>> +            timeout = adev->sdma_timeout;
>>>>>>>> +            break;
>>>>>>>> +        default:
>>>>>>>> +            timeout = adev->video_timeout;
>>>>>>>> +            break;
>>>>>>>> +        }
>>>>>>>> +
>>>>>>>> +        r = drm_sched_init(&ring->sched, &amdgpu_sched_ops,
>>>>>>>> + ring->num_hw_submission, amdgpu_job_hang_limit,
>>>>>>>> +                   timeout, adev->reset_domain.wq, ring->sched_score, 
>>>>>>>> ring->name);
>>>>>>>> +        if (r) {
>>>>>>>> +            DRM_ERROR("Failed to create scheduler on ring %s.\n",
>>>>>>>> +                  ring->name);
>>>>>>>> +            return r;
>>>>>>>> +        }
>>>>>>>> +    }
>>>>>>>> +
>>>>>>>> +    return 0;
>>>>>>>> +}
>>>>>>>> +
>>>>>>>> +
>>>>>>>>      /**
>>>>>>>>       * amdgpu_device_ip_init - run init for hardware IPs
>>>>>>>>       *
>>>>>>>> @@ -2419,6 +2460,10 @@ static int amdgpu_device_ip_init(struct 
>>>>>>>> amdgpu_device *adev)
>>>>>>>>              }
>>>>>>>>          }
>>>>>>>>      +    r = amdgpu_device_init_schedulers(adev);
>>>>>>>> +    if (r)
>>>>>>>> +        goto init_failed;
>>>>>>>> +
>>>>>>>>          /* Don't init kfd if whole hive need to be reset during init 
>>>>>>>> */
>>>>>>>>          if (!adev->gmc.xgmi.pending_reset)
>>>>>>>> amdgpu_amdkfd_device_init(adev);
>>>>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c 
>>>>>>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
>>>>>>>> index 45977a72b5dd..fa302540c69a 100644
>>>>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
>>>>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
>>>>>>>> @@ -457,8 +457,6 @@ int amdgpu_fence_driver_init_ring(struct 
>>>>>>>> amdgpu_ring *ring,
>>>>>>>>                        atomic_t *sched_score)
>>>>>>>>      {
>>>>>>>>          struct amdgpu_device *adev = ring->adev;
>>>>>>>> -    long timeout;
>>>>>>>> -    int r;
>>>>>>>>            if (!adev)
>>>>>>>>              return -EINVAL;
>>>>>>>> @@ -478,36 +476,12 @@ int amdgpu_fence_driver_init_ring(struct 
>>>>>>>> amdgpu_ring *ring,
>>>>>>>> spin_lock_init(&ring->fence_drv.lock);
>>>>>>>>          ring->fence_drv.fences = kcalloc(num_hw_submission * 2, 
>>>>>>>> sizeof(void *),
>>>>>>>>                           GFP_KERNEL);
>>>>>>>> -    if (!ring->fence_drv.fences)
>>>>>>>> -        return -ENOMEM;
>>>>>>>>      -    /* No need to setup the GPU scheduler for rings that don't 
>>>>>>>> need it */
>>>>>>>> -    if (ring->no_scheduler)
>>>>>>>> -        return 0;
>>>>>>>> +    ring->num_hw_submission = num_hw_submission;
>>>>>>>> +    ring->sched_score = sched_score;
>>>>>>> Let's move this into the caller and then use ring->num_hw_submission in 
>>>>>>> the fence code as well.
>>>>>>>
>>>>>>> The maximum number of jobs on the ring is not really fence specific.
>>>>>>>
>>>>>>> Regards,
>>>>>>> Christian.
>>>>>>>
>>>>>>>>      -    switch (ring->funcs->type) {
>>>>>>>> -    case AMDGPU_RING_TYPE_GFX:
>>>>>>>> -        timeout = adev->gfx_timeout;
>>>>>>>> -        break;
>>>>>>>> -    case AMDGPU_RING_TYPE_COMPUTE:
>>>>>>>> -        timeout = adev->compute_timeout;
>>>>>>>> -        break;
>>>>>>>> -    case AMDGPU_RING_TYPE_SDMA:
>>>>>>>> -        timeout = adev->sdma_timeout;
>>>>>>>> -        break;
>>>>>>>> -    default:
>>>>>>>> -        timeout = adev->video_timeout;
>>>>>>>> -        break;
>>>>>>>> -    }
>>>>>>>> -
>>>>>>>> -    r = drm_sched_init(&ring->sched, &amdgpu_sched_ops,
>>>>>>>> -               num_hw_submission, amdgpu_job_hang_limit,
>>>>>>>> -               timeout, NULL, sched_score, ring->name);
>>>>>>>> -    if (r) {
>>>>>>>> -        DRM_ERROR("Failed to create scheduler on ring %s.\n",
>>>>>>>> -              ring->name);
>>>>>>>> -        return r;
>>>>>>>> -    }
>>>>>>>> +    if (!ring->fence_drv.fences)
>>>>>>>> +        return -ENOMEM;
>>>>>>>>            return 0;
>>>>>>>>      }
>>>>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h 
>>>>>>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h
>>>>>>>> index fae7d185ad0d..7f20ce73a243 100644
>>>>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h
>>>>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h
>>>>>>>> @@ -251,6 +251,8 @@ struct amdgpu_ring {
>>>>>>>>          bool has_compute_vm_bug;
>>>>>>>>          bool            no_scheduler;
>>>>>>>>          int            hw_prio;
>>>>>>>> +    unsigned num_hw_submission;
>>>>>>>> +    atomic_t        *sched_score;
>>>>>>>>      };
>>>>>>>>        #define amdgpu_ring_parse_cs(r, p, ib) 
>>>>>>>> ((r)->funcs->parse_cs((p), (ib)))
From f8e53c74ce7d094ac14648e9da6bf43b3098323f Mon Sep 17 00:00:00 2001
From: Jingwen Chen <jingwen.ch...@amd.com>
Date: Wed, 2 Mar 2022 17:44:38 +0800
Subject: [PATCH] drm/amd/amdgpu: set disabled vcn to no_scheduler

[Why]
After the reset domain was introduced, sched.ready is initialized after
hw_init, which overwrites the setting made in the VCN hw_init and leads
to a VCN IB test failure.

[How]
Set the disabled VCN rings to no_scheduler.

Signed-off-by: Jingwen Chen <jingwen.ch...@amd.com>
Change-Id: I69cfd7b5fe0b9f86c263b293dc663a9368f055b0
---
 drivers/gpu/drm/amd/amdgpu/vcn_v3_0.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/drivers/gpu/drm/amd/amdgpu/vcn_v3_0.c b/drivers/gpu/drm/amd/amdgpu/vcn_v3_0.c
index da11ceba0698..5b515ca36748 100644
--- a/drivers/gpu/drm/amd/amdgpu/vcn_v3_0.c
+++ b/drivers/gpu/drm/amd/amdgpu/vcn_v3_0.c
@@ -295,6 +295,7 @@ static int vcn_v3_0_hw_init(void *handle)
                        ring = &adev->vcn.inst[i].ring_dec;
                        if (amdgpu_vcn_is_disabled_vcn(adev, VCN_DECODE_RING, i)) {
                                ring->sched.ready = false;
+                               ring->no_scheduler = true;
                                dev_info(adev->dev, "ring %s is disabled by 
hypervisor\n", ring->name);
                        } else {
                                ring->wptr = 0;
@@ -307,6 +308,7 @@ static int vcn_v3_0_hw_init(void *handle)
                                ring = &adev->vcn.inst[i].ring_enc[j];
                                if (amdgpu_vcn_is_disabled_vcn(adev, VCN_ENCODE_RING, i)) {
                                        ring->sched.ready = false;
+                                       ring->no_scheduler = true;
                                        dev_info(adev->dev, "ring %s is 
disabled by hypervisor\n", ring->name);
                                } else {
                                        ring->wptr = 0;
-- 
2.32.0 (Apple Git-132)
