Re: [RFC v4 02/11] drm/amdgpu: Move scheduler init to after XGMI is ready

2022-03-04 Thread Christian König

Hi Jingwen,

can you please always send patches using "git send-email" and not as an
attachment? This one, for example, went under my radar because of that.


Regarding the patch itself, can you please also remove the code which
touches "ring->sched.ready"?


That variable is internal to the scheduler and should never ever be 
touched by any hardware specific code.
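
To make the point concrete: state such as "ready" inside drm_gpu_scheduler belongs to 
the scheduler core, so IP/hardware code should track its own notion of ring readiness 
in driver-owned state instead of poking that flag. A minimal, purely illustrative 
sketch (the hw_ready field is an assumption for this example, not an existing amdgpu 
member):

/* Illustrative only: keep hardware readiness in driver-owned state
 * rather than writing the scheduler-internal ring->sched.ready flag. */
struct example_ring {
	struct drm_gpu_scheduler sched;	/* owned by the DRM scheduler core */
	bool hw_ready;			/* driver-owned, hypothetical field */
};

static void example_ring_test_done(struct example_ring *ring, int err)
{
	ring->hw_ready = (err == 0);	/* record the outcome without touching sched.ready */
}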


Thanks,
Christian.

Am 02.03.22 um 10:51 schrieb JingWen Chen:

Hi Andrey,

Most of the patches are OK, but the code will introduce an IB test failure on
the disabled VCN of sienna_cichlid.

In the SRIOV use case we disable one VCN on sienna_cichlid. I have attached a
patch to fix this issue; please check the attachment.

Best Regards,

Jingwen Chen


On 2/26/22 5:22 AM, Andrey Grodzovsky wrote:

Hey, patches attached - I applied the patches and resolved the merge conflicts
but wasn't able to test, as my board's onboard network card doesn't work with
the 5.16 kernel (it does with 5.17; maybe it's a Kconfig issue and I need to
check more).
The patches are on top of the 'cababde192b2 Yifan Zhang 2 days ago
drm/amd/pm: fix mode2 reset fail for smu 13.0.5' commit.

Please test and let me know. Maybe by Monday I will be able to resolve the 
connectivity issue on 5.16.

Andrey

On 2022-02-24 22:13, JingWen Chen wrote:

Hi Andrey,

Sorry for the confusion; I meant the whole patch series. We are depending on
this patch series to fix the concurrency issue within the SRIOV TDR sequence.



On 2/25/22 1:26 AM, Andrey Grodzovsky wrote:

No problem if so, but before I do:


JingWen - why do you think this patch is needed as a standalone now? It has no
use without the entire feature that goes with it. Are there some changes you
want to do on top of that code?


Andrey


On 2022-02-24 12:12, Deucher, Alexander wrote:

[Public]


If it applies cleanly, feel free to drop it in.  I'll drop those patches for 
drm-next since they are already in drm-misc.

Alex


*From:* amd-gfx  on behalf of Andrey 
Grodzovsky 
*Sent:* Thursday, February 24, 2022 11:24 AM
*To:* Chen, JingWen ; Christian König 
; dri-de...@lists.freedesktop.org 
; amd-gfx@lists.freedesktop.org 

*Cc:* Liu, Monk ; Chen, Horace ; Lazar, Lijo 
; Koenig, Christian ; dan...@ffwll.ch 

*Subject:* Re: [RFC v4 02/11] drm/amdgpu: Move scheduler init to after XGMI is 
ready
No, because the whole patch set including this patch has landed in
drm-misc-next and will reach amd-staging-drm-next on the next upstream
rebase, I guess.

Andrey

On 2022-02-24 01:47, JingWen Chen wrote:

Hi Andrey,

Will you port this patch into amd-staging-drm-next?

on 2/10/22 2:06 AM, Andrey Grodzovsky wrote:

All comments are fixed and code pushed. Thanks for everyone
who helped reviewing.

Andrey

On 2022-02-09 02:53, Christian König wrote:

Am 09.02.22 um 01:23 schrieb Andrey Grodzovsky:

Before we initialize the schedulers we must know which reset
domain we are in - for a single device there is a single
domain per device and so a single wq per device. For XGMI
the reset domain spans the entire XGMI hive and so the
reset wq is per hive.

Signed-off-by: Andrey Grodzovsky 

One more comment below, with that fixed Reviewed-by: Christian König 
.


---
drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 45 ++
drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c  | 34 ++--
drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h   |  2 +
      3 files changed, 51 insertions(+), 30 deletions(-)
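
In short, the diff below hands drm_sched_init() the ordered workqueue that
serializes resets for the domain the device belongs to - per device for a single
GPU, shared by all devices of a hive for XGMI (the hive wiring itself lives in
other patches of the series). A sketch of the idea only:

/*
 *   single GPU:  adev->reset_domain.wq is a per-device ordered wq
 *   XGMI hive:   every adev in the hive ends up sharing one wq instance
 * Either way, scheduler init can uniformly do:
 */
r = drm_sched_init(&ring->sched, &amdgpu_sched_ops,
		   ring->num_hw_submission, amdgpu_job_hang_limit,
		   timeout, adev->reset_domain.wq,
		   ring->sched_score, ring->name);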

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index 9704b0e1fd82..00123b0013d3 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -2287,6 +2287,47 @@ static int amdgpu_device_fw_loading(struct amdgpu_device 
*adev)
      return r;
      }
      +static int amdgpu_device_init_schedulers(struct amdgpu_device *adev)
+{
+    long timeout;
+    int r, i;
+
+    for (i = 0; i < AMDGPU_MAX_RINGS; ++i) {
+    struct amdgpu_ring *ring = adev->rings[i];
+
+    /* No need to setup the GPU scheduler for rings that don't need it */
+    if (!ring || ring->no_scheduler)
+    continue;
+
+    switch (ring->funcs->type) {
+    case AMDGPU_RING_TYPE_GFX:
+    timeout = adev->gfx_timeout;
+    break;
+    case AMDGPU_RING_TYPE_COMPUTE:
+    timeout = adev->compute_timeout;
+    break;
+    case AMDGPU_RING_TYPE_SDMA:
+    timeout = adev->sdma_timeout;
+    break;
+    default:
+    timeout = adev->video_timeout;
+    break;
+    }
+
+    r = drm_sched_init(&ring->sched, &amdgpu_sched_ops,
+ ring->num_hw_submission, amdgpu_job_hang_limit,
+   timeout, adev->reset_domain.wq, ring->sched_score, 
ring->name);
+    if (r) {
+    DRM_ERROR("Failed to create scheduler on ring %s.\n",
+

Re: [RFC v4 02/11] drm/amdgpu: Move scheduler init to after XGMI is ready

2022-03-03 Thread Chen, JingWen
Thanks a lot

Best Regards,
JingWen Chen



> On Mar 4, 2022, at 00:36, Grodzovsky, Andrey  
> wrote:
> 
> I pushed all the changes including your patch.
> 
> Andrey
> 
> On 2022-03-02 22:16, Andrey Grodzovsky wrote:
>> OK, I will do a quick smoke test tomorrow and push all of it then.
>> 
>> Andrey
>> 
>> On 2022-03-02 21:59, Chen, JingWen wrote:
>>> Hi Andrey,
>>> 
>>> I don't have a bare metal environment; I can only test the SRIOV cases.
>>> 
>>> Best Regards,
>>> JingWen Chen
>>> 
>>> 
>>> 
>>>> On Mar 3, 2022, at 01:55, Grodzovsky, Andrey  
>>>> wrote:
>>>> 
>>>> The patch is acked-by: Andrey Grodzovsky 
>>>> 
>>>> If you also smoked tested bare metal feel free to apply all the patches, 
>>>> if no let me know.
>>>> 
>>>> Andrey
>>>> 
>>>> On 2022-03-02 04:51, JingWen Chen wrote:
>>>>> Hi Andrey,
>>>>> 
>>>>> Most part of the patches are OK, but the code will introduce a ib test 
>>>>> fail on the disabled vcn of sienna_cichlid.
>>>>> 
>>>>> In SRIOV use case we will disable one vcn on sienna_cichlid, I have 
>>>>> attached a patch to fix this issue, please check the attachment.
>>>>> 
>>>>> Best Regards,
>>>>> 
>>>>> Jingwen Chen
>>>>> 
>>>>> 
>>>>> On 2/26/22 5:22 AM, Andrey Grodzovsky wrote:
>>>>>> Hey, patches attached - i applied the patches and resolved merge 
>>>>>> conflicts but weren't able to test as my on board's network card doesn't 
>>>>>> work with 5.16 kernel (it does with 5.17, maybe it's Kconfig issue and i 
>>>>>> need to check more).
>>>>>> The patches are on top of 'cababde192b2 Yifan Zhang 2 days ago   
>>>>>>   drm/amd/pm: fix mode2 reset fail for smu 13.0.5 ' commit.
>>>>>> 
>>>>>> Please test and let me know. Maybe by Monday I will be able to resolve 
>>>>>> the connectivity issue on 5.16.
>>>>>> 
>>>>>> Andrey
>>>>>> 
>>>>>> On 2022-02-24 22:13, JingWen Chen wrote:
>>>>>>> Hi Andrey,
>>>>>>> 
>>>>>>> Sorry for the misleading, I mean the whole patch series. We are 
>>>>>>> depending on this patch series to fix the concurrency issue within 
>>>>>>> SRIOV TDR sequence.
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> On 2/25/22 1:26 AM, Andrey Grodzovsky wrote:
>>>>>>>> No problem if so but before I do,
>>>>>>>> 
>>>>>>>> 
>>>>>>>> JingWen - why you think this patch is needed as a standalone now ? It 
>>>>>>>> has no use without the
>>>>>>>> entire feature together with it. Is it some changes you want to do on 
>>>>>>>> top of that code ?
>>>>>>>> 
>>>>>>>> 
>>>>>>>> Andrey
>>>>>>>> 
>>>>>>>> 
>>>>>>>> On 2022-02-24 12:12, Deucher, Alexander wrote:
>>>>>>>>> [Public]
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> If it applies cleanly, feel free to drop it in. I'll drop those 
>>>>>>>>> patches for drm-next since they are already in drm-misc.
>>>>>>>>> 
>>>>>>>>> Alex
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>>  
>>>>>>>>> *From:* amd-gfx  on behalf of 
>>>>>>>>> Andrey Grodzovsky 
>>>>>>>>> *Sent:* Thursday, February 24, 2022 11:24 AM
>>>>>>>>> *To:* Chen, JingWen ; Christian König 
>>>>>>>>> ; dri-de...@lists.freedesktop.org 
>>>>>>>>> ; amd-gfx@lists.freedesktop.org 
>>>>>>>>> 
>>>>>>>>> *Cc:* Liu, Monk ; Chen, Horace 
>>>>>>>>> ; Lazar, Lijo ; Koenig, 
>>>>>>>>> Christian ; dan...@ffwll.ch 
>>>>>

Re: [RFC v4 02/11] drm/amdgpu: Move scheduler init to after XGMI is ready

2022-03-03 Thread Andrey Grodzovsky

I pushed all the changes including your patch.

Andrey

On 2022-03-02 22:16, Andrey Grodzovsky wrote:

OK, I will do a quick smoke test tomorrow and push all of it then.

Andrey

On 2022-03-02 21:59, Chen, JingWen wrote:

Hi Andrey,

I don't have a bare metal environment; I can only test the SRIOV
cases.


Best Regards,
JingWen Chen



On Mar 3, 2022, at 01:55, Grodzovsky, Andrey 
 wrote:


The patch is acked-by: Andrey Grodzovsky 

If you also smoke tested bare metal, feel free to apply all the
patches; if not, let me know.


Andrey

On 2022-03-02 04:51, JingWen Chen wrote:

Hi Andrey,

Most of the patches are OK, but the code will introduce an IB test
failure on the disabled VCN of sienna_cichlid.


In the SRIOV use case we disable one VCN on sienna_cichlid. I have
attached a patch to fix this issue; please check the attachment.


Best Regards,

Jingwen Chen


On 2/26/22 5:22 AM, Andrey Grodzovsky wrote:
Hey, patches attached - I applied the patches and resolved the merge
conflicts but wasn't able to test, as my board's onboard network card
doesn't work with the 5.16 kernel (it does with 5.17, maybe it's a
Kconfig issue and I need to check more).
The patches are on top of the 'cababde192b2 Yifan Zhang 2 days
ago drm/amd/pm: fix mode2 reset fail for smu 13.0.5' commit.


Please test and let me know. Maybe by Monday I will be able to 
resolve the connectivity issue on 5.16.


Andrey

On 2022-02-24 22:13, JingWen Chen wrote:

Hi Andrey,

Sorry for the confusion; I meant the whole patch series. We are
depending on this patch series to fix the concurrency issue
within the SRIOV TDR sequence.




On 2/25/22 1:26 AM, Andrey Grodzovsky wrote:

No problem if so, but before I do:


JingWen - why do you think this patch is needed as a standalone now?
It has no use without the
entire feature that goes with it. Are there some changes you want to
do on top of that code?



Andrey


On 2022-02-24 12:12, Deucher, Alexander wrote:

[Public]


If it applies cleanly, feel free to drop it in. I'll drop those 
patches for drm-next since they are already in drm-misc.


Alex

 

*From:* amd-gfx  on 
behalf of Andrey Grodzovsky 

*Sent:* Thursday, February 24, 2022 11:24 AM
*To:* Chen, JingWen ; Christian König 
; 
dri-de...@lists.freedesktop.org 
; 
amd-gfx@lists.freedesktop.org 
*Cc:* Liu, Monk ; Chen, Horace 
; Lazar, Lijo ; 
Koenig, Christian ; dan...@ffwll.ch 

*Subject:* Re: [RFC v4 02/11] drm/amdgpu: Move scheduler init 
to after XGMI is ready

No, because the whole patch set including this patch has landed in
drm-misc-next and will reach amd-staging-drm-next on the next
upstream
rebase, I guess.

Andrey

On 2022-02-24 01:47, JingWen Chen wrote:

Hi Andrey,

Will you port this patch into amd-staging-drm-next?

on 2/10/22 2:06 AM, Andrey Grodzovsky wrote:

All comments are fixed and code pushed. Thanks for everyone
who helped reviewing.

Andrey

On 2022-02-09 02:53, Christian König wrote:

Am 09.02.22 um 01:23 schrieb Andrey Grodzovsky:

Before we initialize the schedulers we must know which reset
domain we are in - for a single device there is a single
domain per device and so a single wq per device. For XGMI
the reset domain spans the entire XGMI hive and so the
reset wq is per hive.

Signed-off-by: Andrey Grodzovsky 
One more comment below, with that fixed Reviewed-by: 
Christian König .



---
drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 45 
++
drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c  | 34 
++--

drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h |  2 +
  3 files changed, 51 insertions(+), 30 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c

index 9704b0e1fd82..00123b0013d3 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -2287,6 +2287,47 @@ static int 
amdgpu_device_fw_loading(struct amdgpu_device *adev)

  return r;
  }
  +static int amdgpu_device_init_schedulers(struct 
amdgpu_device *adev)

+{
+    long timeout;
+    int r, i;
+
+    for (i = 0; i < AMDGPU_MAX_RINGS; ++i) {
+    struct amdgpu_ring *ring = adev->rings[i];
+
+    /* No need to setup the GPU scheduler for rings 
that don't need it */

+    if (!ring || ring->no_scheduler)
+    continue;
+
+    switch (ring->funcs->type) {
+    case AMDGPU_RING_TYPE_GFX:
+    timeout = adev->gfx_timeout;
+    break;
+    case AMDGPU_RING_TYPE_COMPUTE:
+    timeout = adev->compute_timeout;
+    break;
+    case AMDGPU_RING_TYPE_SDMA:
+    timeout = adev->sdma_timeout;
+    break;
+    default:
+    timeout = adev->video_timeout;
+    break;
+    }
+
+    r = drm_sched_init(&ring->sched, &amdgpu_sched_ops,
+ ring->num_hw_submission, amdgpu_job_hang_limit,
+   timeout, adev->reset_domai

Re: [RFC v4 02/11] drm/amdgpu: Move scheduler init to after XGMI is ready

2022-03-02 Thread Andrey Grodzovsky

OK, I will do a quick smoke test tomorrow and push all of it then.

Andrey

On 2022-03-02 21:59, Chen, JingWen wrote:

Hi Andrey,

I don't have a bare metal environment; I can only test the SRIOV cases.

Best Regards,
JingWen Chen




On Mar 3, 2022, at 01:55, Grodzovsky, Andrey  wrote:

The patch is acked-by: Andrey Grodzovsky 

If you also smoke tested bare metal, feel free to apply all the patches;
if not, let me know.

Andrey

On 2022-03-02 04:51, JingWen Chen wrote:

Hi Andrey,

Most of the patches are OK, but the code will introduce an IB test failure on
the disabled VCN of sienna_cichlid.

In the SRIOV use case we disable one VCN on sienna_cichlid. I have attached a
patch to fix this issue; please check the attachment.

Best Regards,

Jingwen Chen


On 2/26/22 5:22 AM, Andrey Grodzovsky wrote:

Hey, patches attached - I applied the patches and resolved the merge conflicts
but wasn't able to test, as my board's onboard network card doesn't work with
the 5.16 kernel (it does with 5.17; maybe it's a Kconfig issue and I need to
check more).
The patches are on top of the 'cababde192b2 Yifan Zhang 2 days ago
drm/amd/pm: fix mode2 reset fail for smu 13.0.5' commit.

Please test and let me know. Maybe by Monday I will be able to resolve the 
connectivity issue on 5.16.

Andrey

On 2022-02-24 22:13, JingWen Chen wrote:

Hi Andrey,

Sorry for the confusion; I meant the whole patch series. We are depending on
this patch series to fix the concurrency issue within the SRIOV TDR sequence.



On 2/25/22 1:26 AM, Andrey Grodzovsky wrote:

No problem if so, but before I do:


JingWen - why do you think this patch is needed as a standalone now? It has no
use without the entire feature that goes with it. Are there some changes you
want to do on top of that code?


Andrey


On 2022-02-24 12:12, Deucher, Alexander wrote:

[Public]


If it applies cleanly, feel free to drop it in.  I'll drop those patches for 
drm-next since they are already in drm-misc.

Alex


*From:* amd-gfx  on behalf of Andrey 
Grodzovsky 
*Sent:* Thursday, February 24, 2022 11:24 AM
*To:* Chen, JingWen ; Christian König 
; dri-de...@lists.freedesktop.org 
; amd-gfx@lists.freedesktop.org 

*Cc:* Liu, Monk ; Chen, Horace ; Lazar, Lijo 
; Koenig, Christian ; dan...@ffwll.ch 

*Subject:* Re: [RFC v4 02/11] drm/amdgpu: Move scheduler init to after XGMI is 
ready
No, because the whole patch set including this patch has landed in
drm-misc-next and will reach amd-staging-drm-next on the next upstream
rebase, I guess.

Andrey

On 2022-02-24 01:47, JingWen Chen wrote:

Hi Andrey,

Will you port this patch into amd-staging-drm-next?

on 2/10/22 2:06 AM, Andrey Grodzovsky wrote:

All comments are fixed and code pushed. Thanks for everyone
who helped reviewing.

Andrey

On 2022-02-09 02:53, Christian König wrote:

Am 09.02.22 um 01:23 schrieb Andrey Grodzovsky:

Before we initialize the schedulers we must know which reset
domain we are in - for a single device there is a single
domain per device and so a single wq per device. For XGMI
the reset domain spans the entire XGMI hive and so the
reset wq is per hive.

Signed-off-by: Andrey Grodzovsky 

One more comment below, with that fixed Reviewed-by: Christian König 
.


---
drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 45 ++
drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c  | 34 ++--
drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h   |  2 +
  3 files changed, 51 insertions(+), 30 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index 9704b0e1fd82..00123b0013d3 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -2287,6 +2287,47 @@ static int amdgpu_device_fw_loading(struct amdgpu_device 
*adev)
  return r;
  }
  +static int amdgpu_device_init_schedulers(struct amdgpu_device *adev)
+{
+long timeout;
+int r, i;
+
+for (i = 0; i < AMDGPU_MAX_RINGS; ++i) {
+struct amdgpu_ring *ring = adev->rings[i];
+
+/* No need to setup the GPU scheduler for rings that don't need it */
+if (!ring || ring->no_scheduler)
+continue;
+
+switch (ring->funcs->type) {
+case AMDGPU_RING_TYPE_GFX:
+timeout = adev->gfx_timeout;
+break;
+case AMDGPU_RING_TYPE_COMPUTE:
+timeout = adev->compute_timeout;
+break;
+case AMDGPU_RING_TYPE_SDMA:
+timeout = adev->sdma_timeout;
+break;
+default:
+timeout = adev->video_timeout;
+break;
+}
+
+r = drm_sched_init(&ring->sched, &amdgpu_sched_ops,
+ ring->num_hw_submission, amdgpu_job_hang_limit,
+   timeout, adev->reset_domain.wq, ring->sched_score, 
ring->name);
+if (r) {
+DRM_ERROR("Failed to create schedule

Re: [RFC v4 02/11] drm/amdgpu: Move scheduler init to after XGMI is ready

2022-03-02 Thread Chen, JingWen
Hi Andrey,

I don't have a bare metal environment; I can only test the SRIOV cases.

Best Regards,
JingWen Chen



> On Mar 3, 2022, at 01:55, Grodzovsky, Andrey  
> wrote:
> 
> The patch is acked-by: Andrey Grodzovsky 
> 
> If you also smoke tested bare metal, feel free to apply all the patches; if
> not, let me know.
> 
> Andrey
> 
> On 2022-03-02 04:51, JingWen Chen wrote:
>> Hi Andrey,
>> 
>> Most of the patches are OK, but the code will introduce an IB test failure
>> on the disabled VCN of sienna_cichlid.
>> 
>> In the SRIOV use case we disable one VCN on sienna_cichlid. I have attached
>> a patch to fix this issue; please check the attachment.
>> 
>> Best Regards,
>> 
>> Jingwen Chen
>> 
>> 
>> On 2/26/22 5:22 AM, Andrey Grodzovsky wrote:
>>> Hey, patches attached - i applied the patches and resolved merge conflicts 
>>> but weren't able to test as my on board's network card doesn't work with 
>>> 5.16 kernel (it does with 5.17, maybe it's Kconfig issue and i need to 
>>> check more).
>>> The patches are on top of 'cababde192b2 Yifan Zhang 2 days ago 
>>> drm/amd/pm: fix mode2 reset fail for smu 13.0.5 ' commit.
>>> 
>>> Please test and let me know. Maybe by Monday I will be able to resolve the 
>>> connectivity issue on 5.16.
>>> 
>>> Andrey
>>> 
>>> On 2022-02-24 22:13, JingWen Chen wrote:
>>>> Hi Andrey,
>>>> 
>>>> Sorry for the misleading, I mean the whole patch series. We are depending 
>>>> on this patch series to fix the concurrency issue within SRIOV TDR 
>>>> sequence.
>>>> 
>>>> 
>>>> 
>>>> On 2/25/22 1:26 AM, Andrey Grodzovsky wrote:
>>>>> No problem if so but before I do,
>>>>> 
>>>>> 
>>>>> JingWen - why you think this patch is needed as a standalone now ? It has 
>>>>> no use without the
>>>>> entire feature together with it. Is it some changes you want to do on top 
>>>>> of that code ?
>>>>> 
>>>>> 
>>>>> Andrey
>>>>> 
>>>>> 
>>>>> On 2022-02-24 12:12, Deucher, Alexander wrote:
>>>>>> [Public]
>>>>>> 
>>>>>> 
>>>>>> If it applies cleanly, feel free to drop it in.  I'll drop those patches 
>>>>>> for drm-next since they are already in drm-misc.
>>>>>> 
>>>>>> Alex
>>>>>> 
>>>>>> 
>>>>>> *From:* amd-gfx  on behalf of 
>>>>>> Andrey Grodzovsky 
>>>>>> *Sent:* Thursday, February 24, 2022 11:24 AM
>>>>>> *To:* Chen, JingWen ; Christian König 
>>>>>> ; dri-de...@lists.freedesktop.org 
>>>>>> ; amd-gfx@lists.freedesktop.org 
>>>>>> 
>>>>>> *Cc:* Liu, Monk ; Chen, Horace ; 
>>>>>> Lazar, Lijo ; Koenig, Christian 
>>>>>> ; dan...@ffwll.ch 
>>>>>> *Subject:* Re: [RFC v4 02/11] drm/amdgpu: Move scheduler init to after 
>>>>>> XGMI is ready
>>>>>> No because all the patch-set including this patch was landed into
>>>>>> drm-misc-next and will reach amd-staging-drm-next on the next upstream
>>>>>> rebase i guess.
>>>>>> 
>>>>>> Andrey
>>>>>> 
>>>>>> On 2022-02-24 01:47, JingWen Chen wrote:
>>>>>>> Hi Andrey,
>>>>>>> 
>>>>>>> Will you port this patch into amd-staging-drm-next?
>>>>>>> 
>>>>>>> on 2/10/22 2:06 AM, Andrey Grodzovsky wrote:
>>>>>>>> All comments are fixed and code pushed. Thanks for everyone
>>>>>>>> who helped reviewing.
>>>>>>>> 
>>>>>>>> Andrey
>>>>>>>> 
>>>>>>>> On 2022-02-09 02:53, Christian König wrote:
>>>>>>>>> Am 09.02.22 um 01:23 schrieb Andrey Grodzovsky:
>>>>>>>>>> Before we initialize schedulers we must know which reset
>>>>>>>>>> domain are we in - for single device there iis a single
>>>>>>>>>> domain per device and so single wq per device. For XGMI
>>>>>>>>>>

Re: [RFC v4 02/11] drm/amdgpu: Move scheduler init to after XGMI is ready

2022-03-02 Thread Andrey Grodzovsky

The patch is acked-by: Andrey Grodzovsky 

If you also smoke tested bare metal, feel free to apply all the patches;
if not, let me know.


Andrey

On 2022-03-02 04:51, JingWen Chen wrote:

Hi Andrey,

Most of the patches are OK, but the code will introduce an IB test failure on
the disabled VCN of sienna_cichlid.

In the SRIOV use case we disable one VCN on sienna_cichlid. I have attached a
patch to fix this issue; please check the attachment.

Best Regards,

Jingwen Chen


On 2/26/22 5:22 AM, Andrey Grodzovsky wrote:

Hey, patches attached - I applied the patches and resolved the merge conflicts
but wasn't able to test, as my board's onboard network card doesn't work with
the 5.16 kernel (it does with 5.17; maybe it's a Kconfig issue and I need to
check more).
The patches are on top of the 'cababde192b2 Yifan Zhang 2 days ago
drm/amd/pm: fix mode2 reset fail for smu 13.0.5' commit.

Please test and let me know. Maybe by Monday I will be able to resolve the 
connectivity issue on 5.16.

Andrey

On 2022-02-24 22:13, JingWen Chen wrote:

Hi Andrey,

Sorry for the confusion; I meant the whole patch series. We are depending on
this patch series to fix the concurrency issue within the SRIOV TDR sequence.



On 2/25/22 1:26 AM, Andrey Grodzovsky wrote:

No problem if so, but before I do:


JingWen - why do you think this patch is needed as a standalone now? It has no
use without the entire feature that goes with it. Are there some changes you
want to do on top of that code?


Andrey


On 2022-02-24 12:12, Deucher, Alexander wrote:

[Public]


If it applies cleanly, feel free to drop it in.  I'll drop those patches for 
drm-next since they are already in drm-misc.

Alex


*From:* amd-gfx  on behalf of Andrey 
Grodzovsky 
*Sent:* Thursday, February 24, 2022 11:24 AM
*To:* Chen, JingWen ; Christian König 
; dri-de...@lists.freedesktop.org 
; amd-gfx@lists.freedesktop.org 

*Cc:* Liu, Monk ; Chen, Horace ; Lazar, Lijo 
; Koenig, Christian ; dan...@ffwll.ch 

*Subject:* Re: [RFC v4 02/11] drm/amdgpu: Move scheduler init to after XGMI is 
ready
No, because the whole patch set including this patch has landed in
drm-misc-next and will reach amd-staging-drm-next on the next upstream
rebase, I guess.

Andrey

On 2022-02-24 01:47, JingWen Chen wrote:

Hi Andrey,

Will you port this patch into amd-staging-drm-next?

on 2/10/22 2:06 AM, Andrey Grodzovsky wrote:

All comments are fixed and code pushed. Thanks for everyone
who helped reviewing.

Andrey

On 2022-02-09 02:53, Christian König wrote:

Am 09.02.22 um 01:23 schrieb Andrey Grodzovsky:

Before we initialize the schedulers we must know which reset
domain we are in - for a single device there is a single
domain per device and so a single wq per device. For XGMI
the reset domain spans the entire XGMI hive and so the
reset wq is per hive.

Signed-off-by: Andrey Grodzovsky 

One more comment below, with that fixed Reviewed-by: Christian König 
.


---
drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 45 ++
drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c  | 34 ++--
drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h   |  2 +
      3 files changed, 51 insertions(+), 30 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index 9704b0e1fd82..00123b0013d3 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -2287,6 +2287,47 @@ static int amdgpu_device_fw_loading(struct amdgpu_device 
*adev)
      return r;
      }
      +static int amdgpu_device_init_schedulers(struct amdgpu_device *adev)
+{
+    long timeout;
+    int r, i;
+
+    for (i = 0; i < AMDGPU_MAX_RINGS; ++i) {
+    struct amdgpu_ring *ring = adev->rings[i];
+
+    /* No need to setup the GPU scheduler for rings that don't need it */
+    if (!ring || ring->no_scheduler)
+    continue;
+
+    switch (ring->funcs->type) {
+    case AMDGPU_RING_TYPE_GFX:
+    timeout = adev->gfx_timeout;
+    break;
+    case AMDGPU_RING_TYPE_COMPUTE:
+    timeout = adev->compute_timeout;
+    break;
+    case AMDGPU_RING_TYPE_SDMA:
+    timeout = adev->sdma_timeout;
+    break;
+    default:
+    timeout = adev->video_timeout;
+    break;
+    }
+
+    r = drm_sched_init(&ring->sched, &amdgpu_sched_ops,
+ ring->num_hw_submission, amdgpu_job_hang_limit,
+   timeout, adev->reset_domain.wq, ring->sched_score, 
ring->name);
+    if (r) {
+    DRM_ERROR("Failed to create scheduler on ring %s.\n",
+  ring->name);
+    return r;
+    }
+    }
+
+    return 0;
+}
+
+
      /**
       * amdgpu_device_ip_init - run init for hardware IPs
       *
@@ -2419,6 +2460,10 @@ static int amdgpu_device_ip_init(struct amdgpu_device 
*adev)
      }
      

Re: [RFC v4 02/11] drm/amdgpu: Move scheduler init to after XGMI is ready

2022-03-02 Thread JingWen Chen
Hi Andrey,

Most of the patches are OK, but the code will introduce an IB test failure on
the disabled VCN of sienna_cichlid.

In the SRIOV use case we disable one VCN on sienna_cichlid. I have attached a
patch to fix this issue; please check the attachment.

Best Regards,

Jingwen Chen
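
(The attachment itself is not preserved in this archive, so the actual fix is not
shown here. Purely to illustrate the kind of problem described - rings of a VCN
instance that is disabled for the SR-IOV VF must be skipped before IB tests or
scheduler setup touch them - a hypothetical guard could look like the following;
it is NOT the attached patch:)

/* Hypothetical illustration only, not the attached patch. */
if (ring->funcs->type == AMDGPU_RING_TYPE_VCN_DEC ||
    ring->funcs->type == AMDGPU_RING_TYPE_VCN_ENC) {
	/* adev->vcn.harvest_config marks VCN instances that are not usable */
	if (adev->vcn.harvest_config & (1 << ring->me))
		continue;	/* skip rings of the disabled VCN instance */
}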


On 2/26/22 5:22 AM, Andrey Grodzovsky wrote:
> Hey, patches attached - I applied the patches and resolved the merge conflicts
> but wasn't able to test, as my board's onboard network card doesn't work with
> the 5.16 kernel (it does with 5.17; maybe it's a Kconfig issue and I need to check more).
> The patches are on top of the 'cababde192b2 Yifan Zhang 2 days ago
> drm/amd/pm: fix mode2 reset fail for smu 13.0.5' commit.
>
> Please test and let me know. Maybe by Monday I will be able to resolve the 
> connectivity issue on 5.16.
>
> Andrey
>
> On 2022-02-24 22:13, JingWen Chen wrote:
>> Hi Andrey,
>>
>> Sorry for the confusion; I meant the whole patch series. We are depending on
>> this patch series to fix the concurrency issue within the SRIOV TDR sequence.
>>
>>
>>
>> On 2/25/22 1:26 AM, Andrey Grodzovsky wrote:
>>> No problem if so but before I do,
>>>
>>>
>>> JingWen - why you think this patch is needed as a standalone now ? It has 
>>> no use without the
>>> entire feature together with it. Is it some changes you want to do on top 
>>> of that code ?
>>>
>>>
>>> Andrey
>>>
>>>
>>> On 2022-02-24 12:12, Deucher, Alexander wrote:
>>>> [Public]
>>>>
>>>>
>>>> If it applies cleanly, feel free to drop it in.  I'll drop those patches 
>>>> for drm-next since they are already in drm-misc.
>>>>
>>>> Alex
>>>>
>>>> 
>>>> *From:* amd-gfx  on behalf of 
>>>> Andrey Grodzovsky 
>>>> *Sent:* Thursday, February 24, 2022 11:24 AM
>>>> *To:* Chen, JingWen ; Christian König 
>>>> ; dri-de...@lists.freedesktop.org 
>>>> ; amd-gfx@lists.freedesktop.org 
>>>> 
>>>> *Cc:* Liu, Monk ; Chen, Horace ; 
>>>> Lazar, Lijo ; Koenig, Christian 
>>>> ; dan...@ffwll.ch 
>>>> *Subject:* Re: [RFC v4 02/11] drm/amdgpu: Move scheduler init to after 
>>>> XGMI is ready
>>>> No because all the patch-set including this patch was landed into
>>>> drm-misc-next and will reach amd-staging-drm-next on the next upstream
>>>> rebase i guess.
>>>>
>>>> Andrey
>>>>
>>>> On 2022-02-24 01:47, JingWen Chen wrote:
>>>>> Hi Andrey,
>>>>>
>>>>> Will you port this patch into amd-staging-drm-next?
>>>>>
>>>>> on 2/10/22 2:06 AM, Andrey Grodzovsky wrote:
>>>>>> All comments are fixed and code pushed. Thanks for everyone
>>>>>> who helped reviewing.
>>>>>>
>>>>>> Andrey
>>>>>>
>>>>>> On 2022-02-09 02:53, Christian König wrote:
>>>>>>> Am 09.02.22 um 01:23 schrieb Andrey Grodzovsky:
>>>>>>>> Before we initialize schedulers we must know which reset
>>>>>>>> domain are we in - for single device there iis a single
>>>>>>>> domain per device and so single wq per device. For XGMI
>>>>>>>> the reset domain spans the entire XGMI hive and so the
>>>>>>>> reset wq is per hive.
>>>>>>>>
>>>>>>>> Signed-off-by: Andrey Grodzovsky 
>>>>>>> One more comment below, with that fixed Reviewed-by: Christian König 
>>>>>>> .
>>>>>>>
>>>>>>>> ---
>>>>>>>> drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 45 ++
>>>>>>>> drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c  | 34 ++--
>>>>>>>> drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h   |  2 +
>>>>>>>>      3 files changed, 51 insertions(+), 30 deletions(-)
>>>>>>>>
>>>>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c 
>>>>>>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>>>>>> index 9704b0e1fd82..00123b0013d3 100644
>>>>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_de

Re: [RFC v4 02/11] drm/amdgpu: Move scheduler init to after XGMI is ready

2022-02-25 Thread Andrey Grodzovsky
Hey, patches attached - I applied the patches and resolved the merge
conflicts but wasn't able to test, as my board's onboard network card doesn't
work with the 5.16 kernel (it does with 5.17; maybe it's a Kconfig issue and I
need to check more).
The patches are on top of the 'cababde192b2 Yifan Zhang 2 days
ago drm/amd/pm: fix mode2 reset fail for smu 13.0.5' commit.


Please test and let me know. Maybe by Monday I will be able to resolve 
the connectivity issue on 5.16.


Andrey

On 2022-02-24 22:13, JingWen Chen wrote:

Hi Andrey,

Sorry for the confusion; I meant the whole patch series. We are depending on
this patch series to fix the concurrency issue within the SRIOV TDR sequence.



On 2/25/22 1:26 AM, Andrey Grodzovsky wrote:

No problem if so, but before I do:


JingWen - why do you think this patch is needed as a standalone now? It has no
use without the entire feature that goes with it. Are there some changes you
want to do on top of that code?


Andrey


On 2022-02-24 12:12, Deucher, Alexander wrote:

[Public]


If it applies cleanly, feel free to drop it in.  I'll drop those patches for 
drm-next since they are already in drm-misc.

Alex


*From:* amd-gfx  on behalf of Andrey 
Grodzovsky 
*Sent:* Thursday, February 24, 2022 11:24 AM
*To:* Chen, JingWen ; Christian König 
; dri-de...@lists.freedesktop.org 
; amd-gfx@lists.freedesktop.org 

*Cc:* Liu, Monk ; Chen, Horace ; Lazar, Lijo 
; Koenig, Christian ; dan...@ffwll.ch 

*Subject:* Re: [RFC v4 02/11] drm/amdgpu: Move scheduler init to after XGMI is 
ready
No, because the whole patch set including this patch has landed in
drm-misc-next and will reach amd-staging-drm-next on the next upstream
rebase, I guess.

Andrey

On 2022-02-24 01:47, JingWen Chen wrote:

Hi Andrey,

Will you port this patch into amd-staging-drm-next?

on 2/10/22 2:06 AM, Andrey Grodzovsky wrote:

All comments are fixed and code pushed. Thanks for everyone
who helped reviewing.

Andrey

On 2022-02-09 02:53, Christian König wrote:

Am 09.02.22 um 01:23 schrieb Andrey Grodzovsky:

Before we initialize the schedulers we must know which reset
domain we are in - for a single device there is a single
domain per device and so a single wq per device. For XGMI
the reset domain spans the entire XGMI hive and so the
reset wq is per hive.

Signed-off-by: Andrey Grodzovsky 

One more comment below, with that fixed Reviewed-by: Christian König 
.


---
drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 45 ++
drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c  | 34 ++--
drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h   |  2 +
     3 files changed, 51 insertions(+), 30 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index 9704b0e1fd82..00123b0013d3 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -2287,6 +2287,47 @@ static int amdgpu_device_fw_loading(struct amdgpu_device 
*adev)
     return r;
     }
     +static int amdgpu_device_init_schedulers(struct amdgpu_device *adev)
+{
+    long timeout;
+    int r, i;
+
+    for (i = 0; i < AMDGPU_MAX_RINGS; ++i) {
+    struct amdgpu_ring *ring = adev->rings[i];
+
+    /* No need to setup the GPU scheduler for rings that don't need it */
+    if (!ring || ring->no_scheduler)
+    continue;
+
+    switch (ring->funcs->type) {
+    case AMDGPU_RING_TYPE_GFX:
+    timeout = adev->gfx_timeout;
+    break;
+    case AMDGPU_RING_TYPE_COMPUTE:
+    timeout = adev->compute_timeout;
+    break;
+    case AMDGPU_RING_TYPE_SDMA:
+    timeout = adev->sdma_timeout;
+    break;
+    default:
+    timeout = adev->video_timeout;
+    break;
+    }
+
+    r = drm_sched_init(&ring->sched, &amdgpu_sched_ops,
+ ring->num_hw_submission, amdgpu_job_hang_limit,
+   timeout, adev->reset_domain.wq, ring->sched_score, 
ring->name);
+    if (r) {
+    DRM_ERROR("Failed to create scheduler on ring %s.\n",
+  ring->name);
+    return r;
+    }
+    }
+
+    return 0;
+}
+
+
     /**
      * amdgpu_device_ip_init - run init for hardware IPs
      *
@@ -2419,6 +2460,10 @@ static int amdgpu_device_ip_init(struct amdgpu_device 
*adev)
     }
     }
     +    r = amdgpu_device_init_schedulers(adev);
+    if (r)
+    goto init_failed;
+
     /* Don't init kfd if whole hive need to be reset during init */
     if (!adev->gmc.xgmi.pending_reset)
amdgpu_amdkfd_device_init(adev);
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
index 45977a72b5dd..fa302540c69a 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
@@ -457,8 +457,6 @@ 

Re: [RFC v4 02/11] drm/amdgpu: Move scheduler init to after XGMI is ready

2022-02-24 Thread JingWen Chen
Hi Andrey,

Sorry for the confusion; I meant the whole patch series. We are depending on
this patch series to fix the concurrency issue within the SRIOV TDR sequence.



On 2/25/22 1:26 AM, Andrey Grodzovsky wrote:
> No problem if so, but before I do:
>
>
> JingWen - why do you think this patch is needed as a standalone now? It has no
> use without the entire feature that goes with it. Are there some changes you
> want to do on top of that code?
>
>
> Andrey
>
>
> On 2022-02-24 12:12, Deucher, Alexander wrote:
>>
>> [Public]
>>
>>
>> If it applies cleanly, feel free to drop it in.  I'll drop those patches for 
>> drm-next since they are already in drm-misc.
>>
>> Alex
>>
>> 
>> *From:* amd-gfx  on behalf of Andrey 
>> Grodzovsky 
>> *Sent:* Thursday, February 24, 2022 11:24 AM
>> *To:* Chen, JingWen ; Christian König 
>> ; dri-de...@lists.freedesktop.org 
>> ; amd-gfx@lists.freedesktop.org 
>> 
>> *Cc:* Liu, Monk ; Chen, Horace ; 
>> Lazar, Lijo ; Koenig, Christian 
>> ; dan...@ffwll.ch 
>> *Subject:* Re: [RFC v4 02/11] drm/amdgpu: Move scheduler init to after XGMI 
>> is ready
>> No, because the whole patch set including this patch has landed in
>> drm-misc-next and will reach amd-staging-drm-next on the next upstream
>> rebase, I guess.
>>
>> Andrey
>>
>> On 2022-02-24 01:47, JingWen Chen wrote:
>> > Hi Andrey,
>> >
>> > Will you port this patch into amd-staging-drm-next?
>> >
>> > on 2/10/22 2:06 AM, Andrey Grodzovsky wrote:
>> >> All comments are fixed and code pushed. Thanks for everyone
>> >> who helped reviewing.
>> >>
>> >> Andrey
>> >>
>> >> On 2022-02-09 02:53, Christian König wrote:
>> >>> Am 09.02.22 um 01:23 schrieb Andrey Grodzovsky:
>> >>>> Before we initialize schedulers we must know which reset
>> >>>> domain are we in - for single device there iis a single
>> >>>> domain per device and so single wq per device. For XGMI
>> >>>> the reset domain spans the entire XGMI hive and so the
>> >>>> reset wq is per hive.
>> >>>>
>> >>>> Signed-off-by: Andrey Grodzovsky 
>> >>> One more comment below, with that fixed Reviewed-by: Christian König 
>> >>> .
>> >>>
>> >>>> ---
>> >>>> drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 45 ++
>> >>>> drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c  | 34 ++--
>> >>>> drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h   |  2 +
>> >>>>    3 files changed, 51 insertions(+), 30 deletions(-)
>> >>>>
>> >>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c 
>> >>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>> >>>> index 9704b0e1fd82..00123b0013d3 100644
>> >>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>> >>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>> >>>> @@ -2287,6 +2287,47 @@ static int amdgpu_device_fw_loading(struct 
>> >>>> amdgpu_device *adev)
>> >>>>    return r;
>> >>>>    }
>> >>>>    +static int amdgpu_device_init_schedulers(struct amdgpu_device *adev)
>> >>>> +{
>> >>>> +    long timeout;
>> >>>> +    int r, i;
>> >>>> +
>> >>>> +    for (i = 0; i < AMDGPU_MAX_RINGS; ++i) {
>> >>>> +    struct amdgpu_ring *ring = adev->rings[i];
>> >>>> +
>> >>>> +    /* No need to setup the GPU scheduler for rings that don't 
>> >>>> need it */
>> >>>> +    if (!ring || ring->no_scheduler)
>> >>>> +    continue;
>> >>>> +
>> >>>> +    switch (ring->funcs->type) {
>> >>>> +    case AMDGPU_RING_TYPE_GFX:
>> >>>> +    timeout = adev->gfx_timeout;
>> >>>> +    break;
>> >>>> +    case AMDGPU_RING_TYPE_COMPUTE:
>> >>>> +    timeout = adev->compute_timeout;
>> >>>> +    break;
>> >>>> +    case AMDGPU_RING_TYPE_SDMA:
>> >>>> +    timeout = adev->sdma_timeout;
>> >>>> +    break;
>&

Re: [RFC v4 02/11] drm/amdgpu: Move scheduler init to after XGMI is ready

2022-02-24 Thread Andrey Grodzovsky

No problem if so, but before I do:


JingWen - why do you think this patch is needed as a standalone now?
It has no use without the
entire feature that goes with it. Are there some changes you want to do on
top of that code?



Andrey


On 2022-02-24 12:12, Deucher, Alexander wrote:


[Public]


If it applies cleanly, feel free to drop it in.  I'll drop those 
patches for drm-next since they are already in drm-misc.


Alex


*From:* amd-gfx  on behalf of 
Andrey Grodzovsky 

*Sent:* Thursday, February 24, 2022 11:24 AM
*To:* Chen, JingWen ; Christian König 
; dri-de...@lists.freedesktop.org 
; amd-gfx@lists.freedesktop.org 

*Cc:* Liu, Monk ; Chen, Horace 
; Lazar, Lijo ; Koenig, 
Christian ; dan...@ffwll.ch 
*Subject:* Re: [RFC v4 02/11] drm/amdgpu: Move scheduler init to after 
XGMI is ready

No, because the whole patch set including this patch has landed in
drm-misc-next and will reach amd-staging-drm-next on the next upstream
rebase, I guess.

Andrey

On 2022-02-24 01:47, JingWen Chen wrote:
> Hi Andrey,
>
> Will you port this patch into amd-staging-drm-next?
>
> on 2/10/22 2:06 AM, Andrey Grodzovsky wrote:
>> All comments are fixed and code pushed. Thanks for everyone
>> who helped reviewing.
>>
>> Andrey
>>
>> On 2022-02-09 02:53, Christian König wrote:
>>> Am 09.02.22 um 01:23 schrieb Andrey Grodzovsky:
>>>> Before we initialize schedulers we must know which reset
>>>> domain are we in - for single device there iis a single
>>>> domain per device and so single wq per device. For XGMI
>>>> the reset domain spans the entire XGMI hive and so the
>>>> reset wq is per hive.
>>>>
>>>> Signed-off-by: Andrey Grodzovsky 
>>> One more comment below, with that fixed Reviewed-by: Christian 
König .

>>>
>>>> ---
>>>> drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 45 
++

>>>> drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c  | 34 ++--
>>>> drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h   |  2 +
>>>>    3 files changed, 51 insertions(+), 30 deletions(-)
>>>>
>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c

>>>> index 9704b0e1fd82..00123b0013d3 100644
>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>> @@ -2287,6 +2287,47 @@ static int amdgpu_device_fw_loading(struct 
amdgpu_device *adev)

>>>>    return r;
>>>>    }
>>>>    +static int amdgpu_device_init_schedulers(struct amdgpu_device 
*adev)

>>>> +{
>>>> +    long timeout;
>>>> +    int r, i;
>>>> +
>>>> +    for (i = 0; i < AMDGPU_MAX_RINGS; ++i) {
>>>> +    struct amdgpu_ring *ring = adev->rings[i];
>>>> +
>>>> +    /* No need to setup the GPU scheduler for rings that 
don't need it */

>>>> +    if (!ring || ring->no_scheduler)
>>>> +    continue;
>>>> +
>>>> +    switch (ring->funcs->type) {
>>>> +    case AMDGPU_RING_TYPE_GFX:
>>>> +    timeout = adev->gfx_timeout;
>>>> +    break;
>>>> +    case AMDGPU_RING_TYPE_COMPUTE:
>>>> +    timeout = adev->compute_timeout;
>>>> +    break;
>>>> +    case AMDGPU_RING_TYPE_SDMA:
>>>> +    timeout = adev->sdma_timeout;
>>>> +    break;
>>>> +    default:
>>>> +    timeout = adev->video_timeout;
>>>> +    break;
>>>> +    }
>>>> +
>>>> +    r = drm_sched_init(&ring->sched, &amdgpu_sched_ops,
>>>> + ring->num_hw_submission, amdgpu_job_hang_limit,
>>>> +   timeout, adev->reset_domain.wq, 
ring->sched_score, ring->name);

>>>> +    if (r) {
>>>> +    DRM_ERROR("Failed to create scheduler on ring %s.\n",
>>>> +  ring->name);
>>>> +    return r;
>>>> +    }
>>>> +    }
>>>> +
>>>> +    return 0;
>>>> +}
>>>> +
>>>> +
>>>>    /**
>>>>     * amdgpu_device_ip_init - run init for hardware IPs
>>>>     *
>>>> @@ -2419,6 +2460,10 @@ static int amdgpu_device_ip_init(struct 
amdgpu_device *adev)

>>>>    }
>>>>    }
>>>&g

Re: [RFC v4 02/11] drm/amdgpu: Move scheduler init to after XGMI is ready

2022-02-24 Thread Deucher, Alexander
[Public]

If it applies cleanly, feel free to drop it in.  I'll drop those patches for 
drm-next since they are already in drm-misc.

Alex


From: amd-gfx  on behalf of Andrey 
Grodzovsky 
Sent: Thursday, February 24, 2022 11:24 AM
To: Chen, JingWen ; Christian König 
; dri-de...@lists.freedesktop.org 
; amd-gfx@lists.freedesktop.org 

Cc: Liu, Monk ; Chen, Horace ; Lazar, 
Lijo ; Koenig, Christian ; 
dan...@ffwll.ch 
Subject: Re: [RFC v4 02/11] drm/amdgpu: Move scheduler init to after XGMI is 
ready

No, because the whole patch set including this patch has landed in
drm-misc-next and will reach amd-staging-drm-next on the next upstream
rebase, I guess.

Andrey

On 2022-02-24 01:47, JingWen Chen wrote:
> Hi Andrey,
>
> Will you port this patch into amd-staging-drm-next?
>
> on 2/10/22 2:06 AM, Andrey Grodzovsky wrote:
>> All comments are fixed and code pushed. Thanks for everyone
>> who helped reviewing.
>>
>> Andrey
>>
>> On 2022-02-09 02:53, Christian König wrote:
>>> Am 09.02.22 um 01:23 schrieb Andrey Grodzovsky:
>>>> Before we initialize schedulers we must know which reset
>>>> domain are we in - for single device there iis a single
>>>> domain per device and so single wq per device. For XGMI
>>>> the reset domain spans the entire XGMI hive and so the
>>>> reset wq is per hive.
>>>>
>>>> Signed-off-by: Andrey Grodzovsky 
>>> One more comment below, with that fixed Reviewed-by: Christian König 
>>> .
>>>
>>>> ---
>>>>drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 45 ++
>>>>drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c  | 34 ++--
>>>>drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h   |  2 +
>>>>3 files changed, 51 insertions(+), 30 deletions(-)
>>>>
>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c 
>>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>> index 9704b0e1fd82..00123b0013d3 100644
>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>> @@ -2287,6 +2287,47 @@ static int amdgpu_device_fw_loading(struct 
>>>> amdgpu_device *adev)
>>>>return r;
>>>>}
>>>>+static int amdgpu_device_init_schedulers(struct amdgpu_device *adev)
>>>> +{
>>>> +long timeout;
>>>> +int r, i;
>>>> +
>>>> +for (i = 0; i < AMDGPU_MAX_RINGS; ++i) {
>>>> +struct amdgpu_ring *ring = adev->rings[i];
>>>> +
>>>> +/* No need to setup the GPU scheduler for rings that don't need 
>>>> it */
>>>> +if (!ring || ring->no_scheduler)
>>>> +continue;
>>>> +
>>>> +switch (ring->funcs->type) {
>>>> +case AMDGPU_RING_TYPE_GFX:
>>>> +timeout = adev->gfx_timeout;
>>>> +break;
>>>> +case AMDGPU_RING_TYPE_COMPUTE:
>>>> +timeout = adev->compute_timeout;
>>>> +break;
>>>> +case AMDGPU_RING_TYPE_SDMA:
>>>> +timeout = adev->sdma_timeout;
>>>> +break;
>>>> +default:
>>>> +timeout = adev->video_timeout;
>>>> +break;
>>>> +}
>>>> +
>>>> +r = drm_sched_init(&ring->sched, &amdgpu_sched_ops,
>>>> +   ring->num_hw_submission, amdgpu_job_hang_limit,
>>>> +   timeout, adev->reset_domain.wq, ring->sched_score, 
>>>> ring->name);
>>>> +if (r) {
>>>> +DRM_ERROR("Failed to create scheduler on ring %s.\n",
>>>> +  ring->name);
>>>> +return r;
>>>> +}
>>>> +}
>>>> +
>>>> +return 0;
>>>> +}
>>>> +
>>>> +
>>>>/**
>>>> * amdgpu_device_ip_init - run init for hardware IPs
>>>> *
>>>> @@ -2419,6 +2460,10 @@ static int amdgpu_device_ip_init(struct 
>>>> amdgpu_device *adev)
>>>>}
>>>>}
>>>>+r = amdgpu_device_init_schedulers(adev);
>>>> +if (r)
>>>> +goto init_failed;
>>>> +
>>>>/* Don't init kfd if whole hive need to be reset during

Re: [RFC v4 02/11] drm/amdgpu: Move scheduler init to after XGMI is ready

2022-02-24 Thread Andrey Grodzovsky
No, because the whole patch set including this patch has landed in
drm-misc-next and will reach amd-staging-drm-next on the next upstream
rebase, I guess.


Andrey

On 2022-02-24 01:47, JingWen Chen wrote:

Hi Andrey,

Will you port this patch into amd-staging-drm-next?

on 2/10/22 2:06 AM, Andrey Grodzovsky wrote:

All comments are fixed and code pushed. Thanks for everyone
who helped reviewing.

Andrey

On 2022-02-09 02:53, Christian König wrote:

Am 09.02.22 um 01:23 schrieb Andrey Grodzovsky:

Before we initialize the schedulers we must know which reset
domain we are in - for a single device there is a single
domain per device and so a single wq per device. For XGMI
the reset domain spans the entire XGMI hive and so the
reset wq is per hive.

Signed-off-by: Andrey Grodzovsky 

One more comment below, with that fixed Reviewed-by: Christian König 
.


---
   drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 45 ++
   drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c  | 34 ++--
   drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h   |  2 +
   3 files changed, 51 insertions(+), 30 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index 9704b0e1fd82..00123b0013d3 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -2287,6 +2287,47 @@ static int amdgpu_device_fw_loading(struct amdgpu_device 
*adev)
   return r;
   }
   +static int amdgpu_device_init_schedulers(struct amdgpu_device *adev)
+{
+    long timeout;
+    int r, i;
+
+    for (i = 0; i < AMDGPU_MAX_RINGS; ++i) {
+    struct amdgpu_ring *ring = adev->rings[i];
+
+    /* No need to setup the GPU scheduler for rings that don't need it */
+    if (!ring || ring->no_scheduler)
+    continue;
+
+    switch (ring->funcs->type) {
+    case AMDGPU_RING_TYPE_GFX:
+    timeout = adev->gfx_timeout;
+    break;
+    case AMDGPU_RING_TYPE_COMPUTE:
+    timeout = adev->compute_timeout;
+    break;
+    case AMDGPU_RING_TYPE_SDMA:
+    timeout = adev->sdma_timeout;
+    break;
+    default:
+    timeout = adev->video_timeout;
+    break;
+    }
+
+    r = drm_sched_init(&ring->sched, &amdgpu_sched_ops,
+   ring->num_hw_submission, amdgpu_job_hang_limit,
+   timeout, adev->reset_domain.wq, ring->sched_score, 
ring->name);
+    if (r) {
+    DRM_ERROR("Failed to create scheduler on ring %s.\n",
+  ring->name);
+    return r;
+    }
+    }
+
+    return 0;
+}
+
+
   /**
    * amdgpu_device_ip_init - run init for hardware IPs
    *
@@ -2419,6 +2460,10 @@ static int amdgpu_device_ip_init(struct amdgpu_device 
*adev)
   }
   }
   +    r = amdgpu_device_init_schedulers(adev);
+    if (r)
+    goto init_failed;
+
   /* Don't init kfd if whole hive need to be reset during init */
   if (!adev->gmc.xgmi.pending_reset)
   amdgpu_amdkfd_device_init(adev);
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
index 45977a72b5dd..fa302540c69a 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
@@ -457,8 +457,6 @@ int amdgpu_fence_driver_init_ring(struct amdgpu_ring *ring,
     atomic_t *sched_score)
   {
   struct amdgpu_device *adev = ring->adev;
-    long timeout;
-    int r;
     if (!adev)
   return -EINVAL;
@@ -478,36 +476,12 @@ int amdgpu_fence_driver_init_ring(struct amdgpu_ring 
*ring,
   spin_lock_init(&ring->fence_drv.lock);
   ring->fence_drv.fences = kcalloc(num_hw_submission * 2, sizeof(void *),
    GFP_KERNEL);
-    if (!ring->fence_drv.fences)
-    return -ENOMEM;
   -    /* No need to setup the GPU scheduler for rings that don't need it */
-    if (ring->no_scheduler)
-    return 0;
+    ring->num_hw_submission = num_hw_submission;
+    ring->sched_score = sched_score;

Let's move this into the caller and then use ring->num_hw_submission in the 
fence code as well.

The maximum number of jobs on the ring is not really fence specific.

Regards,
Christian.
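
A rough sketch of the shape being suggested (not the final code): the caller stores
the ring-wide values, and both the fence code and the scheduler init then consume
them from the ring:

/* caller (e.g. ring init), before fence/scheduler setup: */
ring->num_hw_submission = num_hw_submission;
ring->sched_score = sched_score;

/* fence code then sizes its array from the ring, with no scheduler knowledge: */
ring->fence_drv.fences = kcalloc(ring->num_hw_submission * 2,
				 sizeof(void *), GFP_KERNEL);
if (!ring->fence_drv.fences)
	return -ENOMEM;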


   -    switch (ring->funcs->type) {
-    case AMDGPU_RING_TYPE_GFX:
-    timeout = adev->gfx_timeout;
-    break;
-    case AMDGPU_RING_TYPE_COMPUTE:
-    timeout = adev->compute_timeout;
-    break;
-    case AMDGPU_RING_TYPE_SDMA:
-    timeout = adev->sdma_timeout;
-    break;
-    default:
-    timeout = adev->video_timeout;
-    break;
-    }
-
-    r = drm_sched_init(&ring->sched, &amdgpu_sched_ops,
-   num_hw_submission, amdgpu_job_hang_limit,
-   timeout, NULL, sched_score, ring->name);
-    if (r) {
-    DRM_ERROR("Failed to create scheduler on ring %s.\n",
-  ring->name);
-    return r;
-    }
+    if (!ring->fence_drv.fences)
+    return -ENOMEM;
    

Re: [RFC v4 02/11] drm/amdgpu: Move scheduler init to after XGMI is ready

2022-02-23 Thread JingWen Chen
Hi Andrey,

Will you port this patch into amd-staging-drm-next?

on 2/10/22 2:06 AM, Andrey Grodzovsky wrote:
> All comments are fixed and code pushed. Thanks for everyone
> who helped reviewing.
>
> Andrey
>
> On 2022-02-09 02:53, Christian König wrote:
>> Am 09.02.22 um 01:23 schrieb Andrey Grodzovsky:
>>> Before we initialize schedulers we must know which reset
>>> domain are we in - for single device there iis a single
>>> domain per device and so single wq per device. For XGMI
>>> the reset domain spans the entire XGMI hive and so the
>>> reset wq is per hive.
>>>
>>> Signed-off-by: Andrey Grodzovsky 
>>
>> One more comment below, with that fixed Reviewed-by: Christian König 
>> .
>>
>>> ---
>>>   drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 45 ++
>>>   drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c  | 34 ++--
>>>   drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h   |  2 +
>>>   3 files changed, 51 insertions(+), 30 deletions(-)
>>>
>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c 
>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>> index 9704b0e1fd82..00123b0013d3 100644
>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>> @@ -2287,6 +2287,47 @@ static int amdgpu_device_fw_loading(struct 
>>> amdgpu_device *adev)
>>>   return r;
>>>   }
>>>   +static int amdgpu_device_init_schedulers(struct amdgpu_device *adev)
>>> +{
>>> +    long timeout;
>>> +    int r, i;
>>> +
>>> +    for (i = 0; i < AMDGPU_MAX_RINGS; ++i) {
>>> +    struct amdgpu_ring *ring = adev->rings[i];
>>> +
>>> +    /* No need to setup the GPU scheduler for rings that don't need it 
>>> */
>>> +    if (!ring || ring->no_scheduler)
>>> +    continue;
>>> +
>>> +    switch (ring->funcs->type) {
>>> +    case AMDGPU_RING_TYPE_GFX:
>>> +    timeout = adev->gfx_timeout;
>>> +    break;
>>> +    case AMDGPU_RING_TYPE_COMPUTE:
>>> +    timeout = adev->compute_timeout;
>>> +    break;
>>> +    case AMDGPU_RING_TYPE_SDMA:
>>> +    timeout = adev->sdma_timeout;
>>> +    break;
>>> +    default:
>>> +    timeout = adev->video_timeout;
>>> +    break;
>>> +    }
>>> +
>>> +    r = drm_sched_init(&ring->sched, &amdgpu_sched_ops,
>>> +   ring->num_hw_submission, amdgpu_job_hang_limit,
>>> +   timeout, adev->reset_domain.wq, ring->sched_score, 
>>> ring->name);
>>> +    if (r) {
>>> +    DRM_ERROR("Failed to create scheduler on ring %s.\n",
>>> +  ring->name);
>>> +    return r;
>>> +    }
>>> +    }
>>> +
>>> +    return 0;
>>> +}
>>> +
>>> +
>>>   /**
>>>    * amdgpu_device_ip_init - run init for hardware IPs
>>>    *
>>> @@ -2419,6 +2460,10 @@ static int amdgpu_device_ip_init(struct 
>>> amdgpu_device *adev)
>>>   }
>>>   }
>>>   +    r = amdgpu_device_init_schedulers(adev);
>>> +    if (r)
>>> +    goto init_failed;
>>> +
>>>   /* Don't init kfd if whole hive need to be reset during init */
>>>   if (!adev->gmc.xgmi.pending_reset)
>>>   amdgpu_amdkfd_device_init(adev);
>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c 
>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
>>> index 45977a72b5dd..fa302540c69a 100644
>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
>>> @@ -457,8 +457,6 @@ int amdgpu_fence_driver_init_ring(struct amdgpu_ring 
>>> *ring,
>>>     atomic_t *sched_score)
>>>   {
>>>   struct amdgpu_device *adev = ring->adev;
>>> -    long timeout;
>>> -    int r;
>>>     if (!adev)
>>>   return -EINVAL;
>>> @@ -478,36 +476,12 @@ int amdgpu_fence_driver_init_ring(struct amdgpu_ring 
>>> *ring,
>>>   spin_lock_init(&ring->fence_drv.lock);
>>>   ring->fence_drv.fences = kcalloc(num_hw_submission * 2, sizeof(void 
>>> *),
>>>    GFP_KERNEL);
>>> -    if (!ring->fence_drv.fences)
>>> -    return -ENOMEM;
>>>   -    /* No need to setup the GPU scheduler for rings that don't need it */
>>> -    if (ring->no_scheduler)
>>> -    return 0;
>>> +    ring->num_hw_submission = num_hw_submission;
>>> +    ring->sched_score = sched_score;
>>
>> Let's move this into the caller and then use ring->num_hw_submission in the 
>> fence code as well.
>>
>> The maximum number of jobs on the ring is not really fence specific.
>>
>> Regards,
>> Christian.
>>
>>>   -    switch (ring->funcs->type) {
>>> -    case AMDGPU_RING_TYPE_GFX:
>>> -    timeout = adev->gfx_timeout;
>>> -    break;
>>> -    case AMDGPU_RING_TYPE_COMPUTE:
>>> -    timeout = adev->compute_timeout;
>>> -    break;
>>> -    case AMDGPU_RING_TYPE_SDMA:
>>> -    timeout = adev->sdma_timeout;
>>> -    break;
>>> -    default:
>>> -    timeout = adev->video_timeout;
>>> -    break;
>>> -    }
>>> -
>>> -    r = drm_sched_init(&ring->sched, &amdgpu_sched_ops,
>>> -   

Re: [RFC v4 02/11] drm/amdgpu: Move scheduler init to after XGMI is ready

2022-02-09 Thread Andrey Grodzovsky

All comments are fixed and code pushed. Thanks for everyone
who helped reviewing.

Andrey

On 2022-02-09 02:53, Christian König wrote:

Am 09.02.22 um 01:23 schrieb Andrey Grodzovsky:

Before we initialize the schedulers we must know which reset
domain we are in - for a single device there is a single
domain per device and so a single wq per device. For XGMI
the reset domain spans the entire XGMI hive and so the
reset wq is per hive.

Signed-off-by: Andrey Grodzovsky 


One more comment below, with that fixed Reviewed-by: Christian König 
.



---
  drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 45 ++
  drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c  | 34 ++--
  drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h   |  2 +
  3 files changed, 51 insertions(+), 30 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c

index 9704b0e1fd82..00123b0013d3 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -2287,6 +2287,47 @@ static int amdgpu_device_fw_loading(struct 
amdgpu_device *adev)

  return r;
  }
  +static int amdgpu_device_init_schedulers(struct amdgpu_device *adev)
+{
+    long timeout;
+    int r, i;
+
+    for (i = 0; i < AMDGPU_MAX_RINGS; ++i) {
+    struct amdgpu_ring *ring = adev->rings[i];
+
+    /* No need to setup the GPU scheduler for rings that don't 
need it */

+    if (!ring || ring->no_scheduler)
+    continue;
+
+    switch (ring->funcs->type) {
+    case AMDGPU_RING_TYPE_GFX:
+    timeout = adev->gfx_timeout;
+    break;
+    case AMDGPU_RING_TYPE_COMPUTE:
+    timeout = adev->compute_timeout;
+    break;
+    case AMDGPU_RING_TYPE_SDMA:
+    timeout = adev->sdma_timeout;
+    break;
+    default:
+    timeout = adev->video_timeout;
+    break;
+    }
+
+    r = drm_sched_init(&ring->sched, &amdgpu_sched_ops,
+   ring->num_hw_submission, amdgpu_job_hang_limit,
+   timeout, adev->reset_domain.wq, 
ring->sched_score, ring->name);

+    if (r) {
+    DRM_ERROR("Failed to create scheduler on ring %s.\n",
+  ring->name);
+    return r;
+    }
+    }
+
+    return 0;
+}
+
+
  /**
   * amdgpu_device_ip_init - run init for hardware IPs
   *
@@ -2419,6 +2460,10 @@ static int amdgpu_device_ip_init(struct 
amdgpu_device *adev)

  }
  }
  +    r = amdgpu_device_init_schedulers(adev);
+    if (r)
+    goto init_failed;
+
  /* Don't init kfd if whole hive need to be reset during init */
  if (!adev->gmc.xgmi.pending_reset)
  amdgpu_amdkfd_device_init(adev);
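
For readers skimming the archive: the new drm_sched_init() call above hands each scheduler the reset domain's workqueue as its timeout/TDR queue. A minimal sketch of that side, assuming a hypothetical amdgpu_reset_domain layout and init helper - only adev->reset_domain.wq appears in this patch, everything else below is illustrative:

/* Sketch only -- not the actual series code. An ordered workqueue means at
 * most one timeout/reset handler runs at a time, so every scheduler sharing
 * the domain (one device, or a whole XGMI hive) serializes its reset work
 * through this one queue. */
struct amdgpu_reset_domain {
	struct workqueue_struct *wq;
};

static int amdgpu_reset_domain_init(struct amdgpu_reset_domain *domain)
{
	domain->wq = alloc_ordered_workqueue("amdgpu-reset-dom", 0);
	return domain->wq ? 0 : -ENOMEM;
}
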
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c

index 45977a72b5dd..fa302540c69a 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
@@ -457,8 +457,6 @@ int amdgpu_fence_driver_init_ring(struct 
amdgpu_ring *ring,

    atomic_t *sched_score)
  {
  struct amdgpu_device *adev = ring->adev;
-    long timeout;
-    int r;
    if (!adev)
  return -EINVAL;
@@ -478,36 +476,12 @@ int amdgpu_fence_driver_init_ring(struct 
amdgpu_ring *ring,

  spin_lock_init(&ring->fence_drv.lock);
  ring->fence_drv.fences = kcalloc(num_hw_submission * 2, sizeof(void *),
   GFP_KERNEL);
-    if (!ring->fence_drv.fences)
-    return -ENOMEM;
  -    /* No need to setup the GPU scheduler for rings that don't need it */
-    if (ring->no_scheduler)
-    return 0;
+    ring->num_hw_submission = num_hw_submission;
+    ring->sched_score = sched_score;


Let's move this into the caller and then use ring->num_hw_submission 
in the fence code as well.


The maximum number of jobs on the ring is not really fence specific.

Regards,
Christian.
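
A rough sketch of what Christian is asking for, with the assignment done by the caller (assumed here to be amdgpu_ring_init(); only the ring fields and the kcalloc come from the patch, the placement and surrounding code are assumptions):

/* Hypothetical caller side, before the fence driver is initialized: */
ring->num_hw_submission = num_hw_submission;
ring->sched_score = sched_score;

/* amdgpu_fence_driver_init_ring() then no longer needs the parameters and
 * sizes its fence array from the ring-wide limit directly: */
ring->fence_drv.fences = kcalloc(ring->num_hw_submission * 2,
				 sizeof(void *), GFP_KERNEL);
if (!ring->fence_drv.fences)
	return -ENOMEM;
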


  -    switch (ring->funcs->type) {
-    case AMDGPU_RING_TYPE_GFX:
-    timeout = adev->gfx_timeout;
-    break;
-    case AMDGPU_RING_TYPE_COMPUTE:
-    timeout = adev->compute_timeout;
-    break;
-    case AMDGPU_RING_TYPE_SDMA:
-    timeout = adev->sdma_timeout;
-    break;
-    default:
-    timeout = adev->video_timeout;
-    break;
-    }
-
-    r = drm_sched_init(&ring->sched, &amdgpu_sched_ops,
-   num_hw_submission, amdgpu_job_hang_limit,
-   timeout, NULL, sched_score, ring->name);
-    if (r) {
-    DRM_ERROR("Failed to create scheduler on ring %s.\n",
-  ring->name);
-    return r;
-    }
+    if (!ring->fence_drv.fences)
+    return -ENOMEM;
    return 0;
  }
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h

index fae7d185ad0d..7f20ce73a243 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h
@@ -251,6 +251,8 @@ struct amdgpu_ring {
  bool    

Re: [RFC v4 02/11] drm/amdgpu: Move scheduler init to after XGMI is ready

2022-02-08 Thread Christian König

On 09.02.22 at 01:23, Andrey Grodzovsky wrote:

Before we initialize the schedulers we must know which reset
domain we are in - for a single device there is a single
domain per device and so a single wq per device. For XGMI
the reset domain spans the entire XGMI hive and so the
reset wq is per hive.

Signed-off-by: Andrey Grodzovsky 


One more comment below, with that fixed Reviewed-by: Christian König 
.



---
  drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 45 ++
  drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c  | 34 ++--
  drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h   |  2 +
  3 files changed, 51 insertions(+), 30 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index 9704b0e1fd82..00123b0013d3 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -2287,6 +2287,47 @@ static int amdgpu_device_fw_loading(struct amdgpu_device 
*adev)
return r;
  }
  
+static int amdgpu_device_init_schedulers(struct amdgpu_device *adev)

+{
+   long timeout;
+   int r, i;
+
+   for (i = 0; i < AMDGPU_MAX_RINGS; ++i) {
+   struct amdgpu_ring *ring = adev->rings[i];
+
+   /* No need to setup the GPU scheduler for rings that don't need it */
+   if (!ring || ring->no_scheduler)
+   continue;
+
+   switch (ring->funcs->type) {
+   case AMDGPU_RING_TYPE_GFX:
+   timeout = adev->gfx_timeout;
+   break;
+   case AMDGPU_RING_TYPE_COMPUTE:
+   timeout = adev->compute_timeout;
+   break;
+   case AMDGPU_RING_TYPE_SDMA:
+   timeout = adev->sdma_timeout;
+   break;
+   default:
+   timeout = adev->video_timeout;
+   break;
+   }
+
+   r = drm_sched_init(&ring->sched, &amdgpu_sched_ops,
+  ring->num_hw_submission, amdgpu_job_hang_limit,
+  timeout, adev->reset_domain.wq, ring->sched_score, ring->name);
+   if (r) {
+   DRM_ERROR("Failed to create scheduler on ring %s.\n",
+ ring->name);
+   return r;
+   }
+   }
+
+   return 0;
+}
+
+
  /**
   * amdgpu_device_ip_init - run init for hardware IPs
   *
@@ -2419,6 +2460,10 @@ static int amdgpu_device_ip_init(struct amdgpu_device 
*adev)
}
}
  
+	r = amdgpu_device_init_schedulers(adev);

+   if (r)
+   goto init_failed;
+
/* Don't init kfd if whole hive need to be reset during init */
if (!adev->gmc.xgmi.pending_reset)
amdgpu_amdkfd_device_init(adev);
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
index 45977a72b5dd..fa302540c69a 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
@@ -457,8 +457,6 @@ int amdgpu_fence_driver_init_ring(struct amdgpu_ring *ring,
  atomic_t *sched_score)
  {
struct amdgpu_device *adev = ring->adev;
-   long timeout;
-   int r;
  
  	if (!adev)

return -EINVAL;
@@ -478,36 +476,12 @@ int amdgpu_fence_driver_init_ring(struct amdgpu_ring 
*ring,
	spin_lock_init(&ring->fence_drv.lock);
ring->fence_drv.fences = kcalloc(num_hw_submission * 2, sizeof(void *),
 GFP_KERNEL);
-   if (!ring->fence_drv.fences)
-   return -ENOMEM;
  
-	/* No need to setup the GPU scheduler for rings that don't need it */

-   if (ring->no_scheduler)
-   return 0;
+   ring->num_hw_submission = num_hw_submission;
+   ring->sched_score = sched_score;


Let's move this into the caller and then use ring->num_hw_submission in 
the fence code as well.


The maximum number of jobs on the ring is not really fence specific.

Regards,
Christian.

  
-	switch (ring->funcs->type) {

-   case AMDGPU_RING_TYPE_GFX:
-   timeout = adev->gfx_timeout;
-   break;
-   case AMDGPU_RING_TYPE_COMPUTE:
-   timeout = adev->compute_timeout;
-   break;
-   case AMDGPU_RING_TYPE_SDMA:
-   timeout = adev->sdma_timeout;
-   break;
-   default:
-   timeout = adev->video_timeout;
-   break;
-   }
-
-   r = drm_sched_init(&ring->sched, &amdgpu_sched_ops,
-  num_hw_submission, amdgpu_job_hang_limit,
-  timeout, NULL, sched_score, ring->name);
-   if (r) {
-   DRM_ERROR("Failed to create scheduler on ring %s.\n",
- ring->name);
-   return r;
-   }
+   if (!ring->fence_drv.fences)
+   

[RFC v4 02/11] drm/amdgpu: Move scheduler init to after XGMI is ready

2022-02-08 Thread Andrey Grodzovsky
Before we initialize the schedulers we must know which reset
domain we are in - for a single device there is a single
domain per device and so a single wq per device. For XGMI
the reset domain spans the entire XGMI hive and so the
reset wq is per hive.

Signed-off-by: Andrey Grodzovsky 
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 45 ++
 drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c  | 34 ++--
 drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h   |  2 +
 3 files changed, 51 insertions(+), 30 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index 9704b0e1fd82..00123b0013d3 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -2287,6 +2287,47 @@ static int amdgpu_device_fw_loading(struct amdgpu_device 
*adev)
return r;
 }
 
+static int amdgpu_device_init_schedulers(struct amdgpu_device *adev)
+{
+   long timeout;
+   int r, i;
+
+   for (i = 0; i < AMDGPU_MAX_RINGS; ++i) {
+   struct amdgpu_ring *ring = adev->rings[i];
+
+   /* No need to setup the GPU scheduler for rings that don't need it */
+   if (!ring || ring->no_scheduler)
+   continue;
+
+   switch (ring->funcs->type) {
+   case AMDGPU_RING_TYPE_GFX:
+   timeout = adev->gfx_timeout;
+   break;
+   case AMDGPU_RING_TYPE_COMPUTE:
+   timeout = adev->compute_timeout;
+   break;
+   case AMDGPU_RING_TYPE_SDMA:
+   timeout = adev->sdma_timeout;
+   break;
+   default:
+   timeout = adev->video_timeout;
+   break;
+   }
+
+   r = drm_sched_init(&ring->sched, &amdgpu_sched_ops,
+  ring->num_hw_submission, amdgpu_job_hang_limit,
+  timeout, adev->reset_domain.wq, ring->sched_score, ring->name);
+   if (r) {
+   DRM_ERROR("Failed to create scheduler on ring %s.\n",
+ ring->name);
+   return r;
+   }
+   }
+
+   return 0;
+}
+
+
 /**
  * amdgpu_device_ip_init - run init for hardware IPs
  *
@@ -2419,6 +2460,10 @@ static int amdgpu_device_ip_init(struct amdgpu_device 
*adev)
}
}
 
+   r = amdgpu_device_init_schedulers(adev);
+   if (r)
+   goto init_failed;
+
/* Don't init kfd if whole hive need to be reset during init */
if (!adev->gmc.xgmi.pending_reset)
amdgpu_amdkfd_device_init(adev);
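
To make the commit message concrete, the queue behind adev->reset_domain.wq is conceptually chosen like this (purely illustrative; hive_wq and device_wq are made-up names and the actual wiring lives in the rest of the series):

/* Sketch only: XGMI-connected devices share the hive's ordered workqueue,
 * so a reset anywhere in the hive is serialized hive-wide; a standalone
 * device gets its own queue. */
if (adev->gmc.xgmi.num_physical_nodes > 1)
	adev->reset_domain.wq = hive_wq;	/* hypothetical, one per hive */
else
	adev->reset_domain.wq = device_wq;	/* hypothetical, one per device */
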
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
index 45977a72b5dd..fa302540c69a 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
@@ -457,8 +457,6 @@ int amdgpu_fence_driver_init_ring(struct amdgpu_ring *ring,
  atomic_t *sched_score)
 {
struct amdgpu_device *adev = ring->adev;
-   long timeout;
-   int r;
 
if (!adev)
return -EINVAL;
@@ -478,36 +476,12 @@ int amdgpu_fence_driver_init_ring(struct amdgpu_ring 
*ring,
	spin_lock_init(&ring->fence_drv.lock);
ring->fence_drv.fences = kcalloc(num_hw_submission * 2, sizeof(void *),
 GFP_KERNEL);
-   if (!ring->fence_drv.fences)
-   return -ENOMEM;
 
-   /* No need to setup the GPU scheduler for rings that don't need it */
-   if (ring->no_scheduler)
-   return 0;
+   ring->num_hw_submission = num_hw_submission;
+   ring->sched_score = sched_score;
 
-   switch (ring->funcs->type) {
-   case AMDGPU_RING_TYPE_GFX:
-   timeout = adev->gfx_timeout;
-   break;
-   case AMDGPU_RING_TYPE_COMPUTE:
-   timeout = adev->compute_timeout;
-   break;
-   case AMDGPU_RING_TYPE_SDMA:
-   timeout = adev->sdma_timeout;
-   break;
-   default:
-   timeout = adev->video_timeout;
-   break;
-   }
-
-   r = drm_sched_init(&ring->sched, &amdgpu_sched_ops,
-  num_hw_submission, amdgpu_job_hang_limit,
-  timeout, NULL, sched_score, ring->name);
-   if (r) {
-   DRM_ERROR("Failed to create scheduler on ring %s.\n",
- ring->name);
-   return r;
-   }
+   if (!ring->fence_drv.fences)
+   return -ENOMEM;
 
return 0;
 }
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h
index fae7d185ad0d..7f20ce73a243 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h
@@ -251,6 +251,8 @@ struct
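
The amdgpu_ring.h hunk above is cut off by the archive; per the diffstat it adds two lines, presumably caching the values stored in amdgpu_fence_driver_init_ring() for later use by amdgpu_device_init_schedulers(). The field names come from the fence code above, the exact types and placement are assumed:

struct amdgpu_ring {
	/* ... existing members ... */
	unsigned int		num_hw_submission;	/* type assumed */
	atomic_t		*sched_score;
	/* ... */
};
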