RE: [PATCH] drm/amdgpu: avoid over-handle of fence driver fini in s3 test (v2)

2021-08-27 Thread Chen, Guchun
[Public]

Hi Andrey and Christian,

I just send out a new patch to address this, I am not sure if I understand your 
point correctly. Please review.

The patch is to stop scheduler in fence_hw_fini and start the scheduler in 
fence_hw_init.

Regards,
Guchun

-Original Message-
From: Grodzovsky, Andrey  
Sent: Monday, August 23, 2021 10:42 PM
To: Christian König ; Chen, Guchun 
; Alex Deucher ; Mike Lothian 
; Koenig, Christian 
Cc: amd-gfx list ; Gao, Likun 
; Zhang, Hawking ; Deucher, Alexander 

Subject: Re: [PATCH] drm/amdgpu: avoid over-handle of fence driver fini in s3 
test (v2)


On 2021-08-23 2:50 a.m., Christian König wrote:
> Good mornings guys,
>
> Andrey has a rather valid concern here, but I think we need to 
> approach this from a more high level view.
>
> When hw_fini is called we should make sure that the scheduler can't 
> submit any more work to the hardware, because the hw is finalized and 
> not expected to response any more.
>
> As far as I can see the cleanest approach would be to stop the 
> scheduler in hw_fini and fully clean it up in sw_fini. That would also 
> fit quite nicely with how GPU reset is supposed to work I think.
>
> Problem is that this is currently done outside of the fence code for 
> the at least the reset case, so before we restructure that we need to 
> stick with what we have.
>
> Andrey do you think it would be any problem if we stop the scheduler 
> manually in the hot plug case as well?


As long as it's 'parked' inside HW fini - meaning the thread submitting to HW 
is done I think it should cover hot unplug as well.

Andrey


>
> Thanks,
> Christian.
>
> Am 23.08.21 um 08:36 schrieb Chen, Guchun:
>> [Public]
>>
>> Hi Andrey,
>>
>> Thanks for your notice. The cause why moving drm_sched_fini to 
>> sw_fini is it's a SW behavior and part of SW shutdown, so hw_fini 
>> should not touch it. But if the race, that scheduler on the ring 
>> possibly keeps submitting jobs which causes un-empty ring is there, 
>> possibly we still need to call drm_sched_fini first in hw_fini to 
>> stop job submission first.
>>
>> @Koenig, Christian what's your opinion?
>>
>> Regards,
>> Guchun
>>
>> -Original Message-
>> From: Alex Deucher 
>> Sent: Friday, August 20, 2021 2:13 AM
>> To: Mike Lothian 
>> Cc: Grodzovsky, Andrey ; Chen, Guchun 
>> ; amd-gfx list ; 
>> Gao, Likun ; Koenig, Christian 
>> ; Zhang, Hawking ; 
>> Deucher, Alexander 
>> Subject: Re: [PATCH] drm/amdgpu: avoid over-handle of fence driver 
>> fini in s3 test (v2)
>>
>> Please go ahead.  Thanks!
>>
>> Alex
>>
>> On Thu, Aug 19, 2021 at 8:05 AM Mike Lothian 
>> wrote:
>>> Hi
>>>
>>> Do I need to open a new bug report for this?
>>>
>>> Cheers
>>>
>>> Mike
>>>
>>> On Wed, 18 Aug 2021 at 06:26, Andrey Grodzovsky 
>>>  wrote:
>>>>
>>>> On 2021-08-02 1:16 a.m., Guchun Chen wrote:
>>>>> In amdgpu_fence_driver_hw_fini, no need to call drm_sched_fini to 
>>>>> stop scheduler in s3 test, otherwise, fence related failure will 
>>>>> arrive after resume. To fix this and for a better clean up, move 
>>>>> drm_sched_fini from fence_hw_fini to fence_sw_fini, as it's part 
>>>>> of driver shutdown, and should never be called in hw_fini.
>>>>>
>>>>> v2: rename amdgpu_fence_driver_init to 
>>>>> amdgpu_fence_driver_sw_init, to keep sw_init and sw_fini paired.
>>>>>
>>>>> Fixes: cd87a6dcf6af drm/amdgpu: adjust fence driver enable 
>>>>> sequence
>>>>> Suggested-by: Christian König 
>>>>> Signed-off-by: Guchun Chen 
>>>>> ---
>>>>>    drivers/gpu/drm/amd/amdgpu/amdgpu_device.c |  5 ++---
>>>>>    drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c  | 12 +++-
>>>>>    drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h   |  4 ++--
>>>>>    3 files changed, 11 insertions(+), 10 deletions(-)
>>>>>
>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>>> index b1d2dc39e8be..9e53ff851496 100644
>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>>> @@ -3646,9 +3646,9 @@ int amdgpu_device_init(struct amdgpu_device 
>>>>> *adev,
>>>>>
>>>>>    fence_driver_init:
>>>>> 

Re: [PATCH] drm/amdgpu: avoid over-handle of fence driver fini in s3 test (v2)

2021-08-23 Thread Andrey Grodzovsky



On 2021-08-23 2:50 a.m., Christian König wrote:

Good mornings guys,

Andrey has a rather valid concern here, but I think we need to 
approach this from a more high level view.


When hw_fini is called we should make sure that the scheduler can't 
submit any more work to the hardware, because the hw is finalized and 
not expected to response any more.


As far as I can see the cleanest approach would be to stop the 
scheduler in hw_fini and fully clean it up in sw_fini. That would also 
fit quite nicely with how GPU reset is supposed to work I think.


Problem is that this is currently done outside of the fence code for 
the at least the reset case, so before we restructure that we need to 
stick with what we have.


Andrey do you think it would be any problem if we stop the scheduler 
manually in the hot plug case as well?



As long as it's 'parked' inside HW fini - meaning the thread submitting 
to HW is done I think it should cover hot unplug as well.


Andrey




Thanks,
Christian.

Am 23.08.21 um 08:36 schrieb Chen, Guchun:

[Public]

Hi Andrey,

Thanks for your notice. The cause why moving drm_sched_fini to 
sw_fini is it's a SW behavior and part of SW shutdown, so hw_fini 
should not touch it. But if the race, that scheduler on the ring 
possibly keeps submitting jobs which causes un-empty ring is there, 
possibly we still need to call drm_sched_fini first in hw_fini to 
stop job submission first.


@Koenig, Christian what's your opinion?

Regards,
Guchun

-Original Message-
From: Alex Deucher 
Sent: Friday, August 20, 2021 2:13 AM
To: Mike Lothian 
Cc: Grodzovsky, Andrey ; Chen, Guchun 
; amd-gfx list ; 
Gao, Likun ; Koenig, Christian 
; Zhang, Hawking ; 
Deucher, Alexander 
Subject: Re: [PATCH] drm/amdgpu: avoid over-handle of fence driver 
fini in s3 test (v2)


Please go ahead.  Thanks!

Alex

On Thu, Aug 19, 2021 at 8:05 AM Mike Lothian  
wrote:

Hi

Do I need to open a new bug report for this?

Cheers

Mike

On Wed, 18 Aug 2021 at 06:26, Andrey Grodzovsky 
 wrote:


On 2021-08-02 1:16 a.m., Guchun Chen wrote:

In amdgpu_fence_driver_hw_fini, no need to call drm_sched_fini to
stop scheduler in s3 test, otherwise, fence related failure will
arrive after resume. To fix this and for a better clean up, move
drm_sched_fini from fence_hw_fini to fence_sw_fini, as it's part of
driver shutdown, and should never be called in hw_fini.

v2: rename amdgpu_fence_driver_init to amdgpu_fence_driver_sw_init,
to keep sw_init and sw_fini paired.

Fixes: cd87a6dcf6af drm/amdgpu: adjust fence driver enable sequence
Suggested-by: Christian König 
Signed-off-by: Guchun Chen 
---
   drivers/gpu/drm/amd/amdgpu/amdgpu_device.c |  5 ++---
   drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c  | 12 +++-
   drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h   |  4 ++--
   3 files changed, 11 insertions(+), 10 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index b1d2dc39e8be..9e53ff851496 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -3646,9 +3646,9 @@ int amdgpu_device_init(struct amdgpu_device
*adev,

   fence_driver_init:
   /* Fence driver */
- r = amdgpu_fence_driver_init(adev);
+ r = amdgpu_fence_driver_sw_init(adev);
   if (r) {
- dev_err(adev->dev, "amdgpu_fence_driver_init 
failed\n");

+ dev_err(adev->dev, "amdgpu_fence_driver_sw_init
+ failed\n");
   amdgpu_vf_error_put(adev, 
AMDGIM_ERROR_VF_FENCE_INIT_FAIL, 0, 0);

   goto failed;
   }
@@ -3988,7 +3988,6 @@ int amdgpu_device_resume(struct drm_device 
*dev, bool fbcon)

   }
   amdgpu_fence_driver_hw_init(adev);

-
   r = amdgpu_device_ip_late_init(adev);
   if (r)
   return r;
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
index 49c5c7331c53..7495911516c2 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
@@ -498,7 +498,7 @@ int amdgpu_fence_driver_init_ring(struct 
amdgpu_ring *ring,

   }

   /**
- * amdgpu_fence_driver_init - init the fence driver
+ * amdgpu_fence_driver_sw_init - init the fence driver
    * for all possible rings.
    *
    * @adev: amdgpu device pointer
@@ -509,13 +509,13 @@ int amdgpu_fence_driver_init_ring(struct 
amdgpu_ring *ring,

    * amdgpu_fence_driver_start_ring().
    * Returns 0 for success.
    */
-int amdgpu_fence_driver_init(struct amdgpu_device *adev)
+int amdgpu_fence_driver_sw_init(struct amdgpu_device *adev)
   {
   return 0;
   }

   /**
- * amdgpu_fence_driver_fini - tear down the fence driver
+ * amdgpu_fence_driver_hw_fini - tear down the fence driver
    * for all possible rings.
    *
    * @adev: amdgpu device pointer
@@ -531,8 +531,7 @@ void amdgpu_fence_driver_hw_fini(struct
amdgpu_device *ad

Re: [PATCH] drm/amdgpu: avoid over-handle of fence driver fini in s3 test (v2)

2021-08-22 Thread Christian König

Good mornings guys,

Andrey has a rather valid concern here, but I think we need to approach 
this from a more high level view.


When hw_fini is called we should make sure that the scheduler can't 
submit any more work to the hardware, because the hw is finalized and 
not expected to response any more.


As far as I can see the cleanest approach would be to stop the scheduler 
in hw_fini and fully clean it up in sw_fini. That would also fit quite 
nicely with how GPU reset is supposed to work I think.


Problem is that this is currently done outside of the fence code for the 
at least the reset case, so before we restructure that we need to stick 
with what we have.


Andrey do you think it would be any problem if we stop the scheduler 
manually in the hot plug case as well?


Thanks,
Christian.

Am 23.08.21 um 08:36 schrieb Chen, Guchun:

[Public]

Hi Andrey,

Thanks for your notice. The cause why moving drm_sched_fini to sw_fini is it's 
a SW behavior and part of SW shutdown, so hw_fini should not touch it. But if 
the race, that scheduler on the ring possibly keeps submitting jobs which 
causes un-empty ring is there, possibly we still need to call drm_sched_fini 
first in hw_fini to stop job submission first.

@Koenig, Christian what's your opinion?

Regards,
Guchun

-Original Message-
From: Alex Deucher 
Sent: Friday, August 20, 2021 2:13 AM
To: Mike Lothian 
Cc: Grodzovsky, Andrey ; Chen, Guchun ; amd-gfx list 
; Gao, Likun ; Koenig, Christian 
; Zhang, Hawking ; Deucher, Alexander 

Subject: Re: [PATCH] drm/amdgpu: avoid over-handle of fence driver fini in s3 
test (v2)

Please go ahead.  Thanks!

Alex

On Thu, Aug 19, 2021 at 8:05 AM Mike Lothian  wrote:

Hi

Do I need to open a new bug report for this?

Cheers

Mike

On Wed, 18 Aug 2021 at 06:26, Andrey Grodzovsky  
wrote:


On 2021-08-02 1:16 a.m., Guchun Chen wrote:

In amdgpu_fence_driver_hw_fini, no need to call drm_sched_fini to
stop scheduler in s3 test, otherwise, fence related failure will
arrive after resume. To fix this and for a better clean up, move
drm_sched_fini from fence_hw_fini to fence_sw_fini, as it's part of
driver shutdown, and should never be called in hw_fini.

v2: rename amdgpu_fence_driver_init to amdgpu_fence_driver_sw_init,
to keep sw_init and sw_fini paired.

Fixes: cd87a6dcf6af drm/amdgpu: adjust fence driver enable sequence
Suggested-by: Christian König 
Signed-off-by: Guchun Chen 
---
   drivers/gpu/drm/amd/amdgpu/amdgpu_device.c |  5 ++---
   drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c  | 12 +++-
   drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h   |  4 ++--
   3 files changed, 11 insertions(+), 10 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index b1d2dc39e8be..9e53ff851496 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -3646,9 +3646,9 @@ int amdgpu_device_init(struct amdgpu_device
*adev,

   fence_driver_init:
   /* Fence driver */
- r = amdgpu_fence_driver_init(adev);
+ r = amdgpu_fence_driver_sw_init(adev);
   if (r) {
- dev_err(adev->dev, "amdgpu_fence_driver_init failed\n");
+ dev_err(adev->dev, "amdgpu_fence_driver_sw_init
+ failed\n");
   amdgpu_vf_error_put(adev, AMDGIM_ERROR_VF_FENCE_INIT_FAIL, 0, 0);
   goto failed;
   }
@@ -3988,7 +3988,6 @@ int amdgpu_device_resume(struct drm_device *dev, bool 
fbcon)
   }
   amdgpu_fence_driver_hw_init(adev);

-
   r = amdgpu_device_ip_late_init(adev);
   if (r)
   return r;
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
index 49c5c7331c53..7495911516c2 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
@@ -498,7 +498,7 @@ int amdgpu_fence_driver_init_ring(struct amdgpu_ring *ring,
   }

   /**
- * amdgpu_fence_driver_init - init the fence driver
+ * amdgpu_fence_driver_sw_init - init the fence driver
* for all possible rings.
*
* @adev: amdgpu device pointer
@@ -509,13 +509,13 @@ int amdgpu_fence_driver_init_ring(struct amdgpu_ring 
*ring,
* amdgpu_fence_driver_start_ring().
* Returns 0 for success.
*/
-int amdgpu_fence_driver_init(struct amdgpu_device *adev)
+int amdgpu_fence_driver_sw_init(struct amdgpu_device *adev)
   {
   return 0;
   }

   /**
- * amdgpu_fence_driver_fini - tear down the fence driver
+ * amdgpu_fence_driver_hw_fini - tear down the fence driver
* for all possible rings.
*
* @adev: amdgpu device pointer
@@ -531,8 +531,7 @@ void amdgpu_fence_driver_hw_fini(struct
amdgpu_device *adev)

   if (!ring || !ring->fence_drv.initialized)
   continue;
- if (!ring->no_scheduler)
- drm_sched_fini(&ring->sched);
+
   /* You can't wait for HW to

RE: [PATCH] drm/amdgpu: avoid over-handle of fence driver fini in s3 test (v2)

2021-08-22 Thread Chen, Guchun
[Public]

Hi Andrey,

Thanks for your notice. The cause why moving drm_sched_fini to sw_fini is it's 
a SW behavior and part of SW shutdown, so hw_fini should not touch it. But if 
the race, that scheduler on the ring possibly keeps submitting jobs which 
causes un-empty ring is there, possibly we still need to call drm_sched_fini 
first in hw_fini to stop job submission first.

@Koenig, Christian what's your opinion?

Regards,
Guchun

-Original Message-
From: Alex Deucher  
Sent: Friday, August 20, 2021 2:13 AM
To: Mike Lothian 
Cc: Grodzovsky, Andrey ; Chen, Guchun 
; amd-gfx list ; Gao, Likun 
; Koenig, Christian ; Zhang, 
Hawking ; Deucher, Alexander 
Subject: Re: [PATCH] drm/amdgpu: avoid over-handle of fence driver fini in s3 
test (v2)

Please go ahead.  Thanks!

Alex

On Thu, Aug 19, 2021 at 8:05 AM Mike Lothian  wrote:
>
> Hi
>
> Do I need to open a new bug report for this?
>
> Cheers
>
> Mike
>
> On Wed, 18 Aug 2021 at 06:26, Andrey Grodzovsky  
> wrote:
>>
>>
>> On 2021-08-02 1:16 a.m., Guchun Chen wrote:
>> > In amdgpu_fence_driver_hw_fini, no need to call drm_sched_fini to 
>> > stop scheduler in s3 test, otherwise, fence related failure will 
>> > arrive after resume. To fix this and for a better clean up, move 
>> > drm_sched_fini from fence_hw_fini to fence_sw_fini, as it's part of 
>> > driver shutdown, and should never be called in hw_fini.
>> >
>> > v2: rename amdgpu_fence_driver_init to amdgpu_fence_driver_sw_init, 
>> > to keep sw_init and sw_fini paired.
>> >
>> > Fixes: cd87a6dcf6af drm/amdgpu: adjust fence driver enable sequence
>> > Suggested-by: Christian König 
>> > Signed-off-by: Guchun Chen 
>> > ---
>> >   drivers/gpu/drm/amd/amdgpu/amdgpu_device.c |  5 ++---
>> >   drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c  | 12 +++-
>> >   drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h   |  4 ++--
>> >   3 files changed, 11 insertions(+), 10 deletions(-)
>> >
>> > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c 
>> > b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>> > index b1d2dc39e8be..9e53ff851496 100644
>> > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>> > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>> > @@ -3646,9 +3646,9 @@ int amdgpu_device_init(struct amdgpu_device 
>> > *adev,
>> >
>> >   fence_driver_init:
>> >   /* Fence driver */
>> > - r = amdgpu_fence_driver_init(adev);
>> > + r = amdgpu_fence_driver_sw_init(adev);
>> >   if (r) {
>> > - dev_err(adev->dev, "amdgpu_fence_driver_init failed\n");
>> > + dev_err(adev->dev, "amdgpu_fence_driver_sw_init 
>> > + failed\n");
>> >   amdgpu_vf_error_put(adev, AMDGIM_ERROR_VF_FENCE_INIT_FAIL, 
>> > 0, 0);
>> >   goto failed;
>> >   }
>> > @@ -3988,7 +3988,6 @@ int amdgpu_device_resume(struct drm_device *dev, 
>> > bool fbcon)
>> >   }
>> >   amdgpu_fence_driver_hw_init(adev);
>> >
>> > -
>> >   r = amdgpu_device_ip_late_init(adev);
>> >   if (r)
>> >   return r;
>> > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c 
>> > b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
>> > index 49c5c7331c53..7495911516c2 100644
>> > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
>> > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
>> > @@ -498,7 +498,7 @@ int amdgpu_fence_driver_init_ring(struct amdgpu_ring 
>> > *ring,
>> >   }
>> >
>> >   /**
>> > - * amdgpu_fence_driver_init - init the fence driver
>> > + * amdgpu_fence_driver_sw_init - init the fence driver
>> >* for all possible rings.
>> >*
>> >* @adev: amdgpu device pointer
>> > @@ -509,13 +509,13 @@ int amdgpu_fence_driver_init_ring(struct amdgpu_ring 
>> > *ring,
>> >* amdgpu_fence_driver_start_ring().
>> >* Returns 0 for success.
>> >*/
>> > -int amdgpu_fence_driver_init(struct amdgpu_device *adev)
>> > +int amdgpu_fence_driver_sw_init(struct amdgpu_device *adev)
>> >   {
>> >   return 0;
>> >   }
>> >
>> >   /**
>> > - * amdgpu_fence_driver_fini - tear down the fence driver
>> > + * amdgpu_fence_driver_hw_fini - tear down the fence driver
>> >* for all possible rings.
>> >*
>> >* @adev: amdgpu device pointer
>> > @@ -531,8 +531,7 @@ 

Re: [PATCH] drm/amdgpu: avoid over-handle of fence driver fini in s3 test (v2)

2021-08-19 Thread Alex Deucher
Please go ahead.  Thanks!

Alex

On Thu, Aug 19, 2021 at 8:05 AM Mike Lothian  wrote:
>
> Hi
>
> Do I need to open a new bug report for this?
>
> Cheers
>
> Mike
>
> On Wed, 18 Aug 2021 at 06:26, Andrey Grodzovsky  
> wrote:
>>
>>
>> On 2021-08-02 1:16 a.m., Guchun Chen wrote:
>> > In amdgpu_fence_driver_hw_fini, no need to call drm_sched_fini to stop
>> > scheduler in s3 test, otherwise, fence related failure will arrive
>> > after resume. To fix this and for a better clean up, move drm_sched_fini
>> > from fence_hw_fini to fence_sw_fini, as it's part of driver shutdown, and
>> > should never be called in hw_fini.
>> >
>> > v2: rename amdgpu_fence_driver_init to amdgpu_fence_driver_sw_init,
>> > to keep sw_init and sw_fini paired.
>> >
>> > Fixes: cd87a6dcf6af drm/amdgpu: adjust fence driver enable sequence
>> > Suggested-by: Christian König 
>> > Signed-off-by: Guchun Chen 
>> > ---
>> >   drivers/gpu/drm/amd/amdgpu/amdgpu_device.c |  5 ++---
>> >   drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c  | 12 +++-
>> >   drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h   |  4 ++--
>> >   3 files changed, 11 insertions(+), 10 deletions(-)
>> >
>> > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c 
>> > b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>> > index b1d2dc39e8be..9e53ff851496 100644
>> > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>> > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>> > @@ -3646,9 +3646,9 @@ int amdgpu_device_init(struct amdgpu_device *adev,
>> >
>> >   fence_driver_init:
>> >   /* Fence driver */
>> > - r = amdgpu_fence_driver_init(adev);
>> > + r = amdgpu_fence_driver_sw_init(adev);
>> >   if (r) {
>> > - dev_err(adev->dev, "amdgpu_fence_driver_init failed\n");
>> > + dev_err(adev->dev, "amdgpu_fence_driver_sw_init failed\n");
>> >   amdgpu_vf_error_put(adev, AMDGIM_ERROR_VF_FENCE_INIT_FAIL, 
>> > 0, 0);
>> >   goto failed;
>> >   }
>> > @@ -3988,7 +3988,6 @@ int amdgpu_device_resume(struct drm_device *dev, 
>> > bool fbcon)
>> >   }
>> >   amdgpu_fence_driver_hw_init(adev);
>> >
>> > -
>> >   r = amdgpu_device_ip_late_init(adev);
>> >   if (r)
>> >   return r;
>> > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c 
>> > b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
>> > index 49c5c7331c53..7495911516c2 100644
>> > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
>> > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
>> > @@ -498,7 +498,7 @@ int amdgpu_fence_driver_init_ring(struct amdgpu_ring 
>> > *ring,
>> >   }
>> >
>> >   /**
>> > - * amdgpu_fence_driver_init - init the fence driver
>> > + * amdgpu_fence_driver_sw_init - init the fence driver
>> >* for all possible rings.
>> >*
>> >* @adev: amdgpu device pointer
>> > @@ -509,13 +509,13 @@ int amdgpu_fence_driver_init_ring(struct amdgpu_ring 
>> > *ring,
>> >* amdgpu_fence_driver_start_ring().
>> >* Returns 0 for success.
>> >*/
>> > -int amdgpu_fence_driver_init(struct amdgpu_device *adev)
>> > +int amdgpu_fence_driver_sw_init(struct amdgpu_device *adev)
>> >   {
>> >   return 0;
>> >   }
>> >
>> >   /**
>> > - * amdgpu_fence_driver_fini - tear down the fence driver
>> > + * amdgpu_fence_driver_hw_fini - tear down the fence driver
>> >* for all possible rings.
>> >*
>> >* @adev: amdgpu device pointer
>> > @@ -531,8 +531,7 @@ void amdgpu_fence_driver_hw_fini(struct amdgpu_device 
>> > *adev)
>> >
>> >   if (!ring || !ring->fence_drv.initialized)
>> >   continue;
>> > - if (!ring->no_scheduler)
>> > - drm_sched_fini(&ring->sched);
>> > +
>> >   /* You can't wait for HW to signal if it's gone */
>> >   if (!drm_dev_is_unplugged(&adev->ddev))
>> >   r = amdgpu_fence_wait_empty(ring);
>>
>>
>> Sorry for late notice, missed this patch. By moving drm_sched_fini
>> past amdgpu_fence_wait_empty a race is created as even after you waited
>> for all fences on the ring to signal the sw scheduler will keep submitting
>> new jobs on the ring and so the ring won't stay empty.
>>
>> For hot device removal also we want to prevent any access to HW past PCI
>> removal
>> in order to not do any MMIO accesses inside the physical MMIO range that
>> no longer
>> belongs to this device after it's removal by the PCI core. Stopping all
>> the schedulers prevents any MMIO
>> accesses done during job submissions and that why drm_sched_fini was
>> done as part of amdgpu_fence_driver_hw_fini
>> and not amdgpu_fence_driver_sw_fini
>>
>> Andrey
>>
>> > @@ -560,6 +559,9 @@ void amdgpu_fence_driver_sw_fini(struct amdgpu_device 
>> > *adev)
>> >   if (!ring || !ring->fence_drv.initialized)
>> >   continue;
>> >
>> > + if (!ring->no_scheduler)
>> > + drm_sched_fini(&ring->sched);
>> > +
>> >   for (j = 0; j <= ring->fence_drv.num_fences_mask; +

Re: [PATCH] drm/amdgpu: avoid over-handle of fence driver fini in s3 test (v2)

2021-08-19 Thread Mike Lothian
Hi

Do I need to open a new bug report for this?

Cheers

Mike

On Wed, 18 Aug 2021 at 06:26, Andrey Grodzovsky 
wrote:

>
> On 2021-08-02 1:16 a.m., Guchun Chen wrote:
> > In amdgpu_fence_driver_hw_fini, no need to call drm_sched_fini to stop
> > scheduler in s3 test, otherwise, fence related failure will arrive
> > after resume. To fix this and for a better clean up, move drm_sched_fini
> > from fence_hw_fini to fence_sw_fini, as it's part of driver shutdown, and
> > should never be called in hw_fini.
> >
> > v2: rename amdgpu_fence_driver_init to amdgpu_fence_driver_sw_init,
> > to keep sw_init and sw_fini paired.
> >
> > Fixes: cd87a6dcf6af drm/amdgpu: adjust fence driver enable sequence
> > Suggested-by: Christian König 
> > Signed-off-by: Guchun Chen 
> > ---
> >   drivers/gpu/drm/amd/amdgpu/amdgpu_device.c |  5 ++---
> >   drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c  | 12 +++-
> >   drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h   |  4 ++--
> >   3 files changed, 11 insertions(+), 10 deletions(-)
> >
> > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > index b1d2dc39e8be..9e53ff851496 100644
> > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > @@ -3646,9 +3646,9 @@ int amdgpu_device_init(struct amdgpu_device *adev,
> >
> >   fence_driver_init:
> >   /* Fence driver */
> > - r = amdgpu_fence_driver_init(adev);
> > + r = amdgpu_fence_driver_sw_init(adev);
> >   if (r) {
> > - dev_err(adev->dev, "amdgpu_fence_driver_init failed\n");
> > + dev_err(adev->dev, "amdgpu_fence_driver_sw_init failed\n");
> >   amdgpu_vf_error_put(adev, AMDGIM_ERROR_VF_FENCE_INIT_FAIL,
> 0, 0);
> >   goto failed;
> >   }
> > @@ -3988,7 +3988,6 @@ int amdgpu_device_resume(struct drm_device *dev,
> bool fbcon)
> >   }
> >   amdgpu_fence_driver_hw_init(adev);
> >
> > -
> >   r = amdgpu_device_ip_late_init(adev);
> >   if (r)
> >   return r;
> > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
> > index 49c5c7331c53..7495911516c2 100644
> > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
> > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
> > @@ -498,7 +498,7 @@ int amdgpu_fence_driver_init_ring(struct amdgpu_ring
> *ring,
> >   }
> >
> >   /**
> > - * amdgpu_fence_driver_init - init the fence driver
> > + * amdgpu_fence_driver_sw_init - init the fence driver
> >* for all possible rings.
> >*
> >* @adev: amdgpu device pointer
> > @@ -509,13 +509,13 @@ int amdgpu_fence_driver_init_ring(struct
> amdgpu_ring *ring,
> >* amdgpu_fence_driver_start_ring().
> >* Returns 0 for success.
> >*/
> > -int amdgpu_fence_driver_init(struct amdgpu_device *adev)
> > +int amdgpu_fence_driver_sw_init(struct amdgpu_device *adev)
> >   {
> >   return 0;
> >   }
> >
> >   /**
> > - * amdgpu_fence_driver_fini - tear down the fence driver
> > + * amdgpu_fence_driver_hw_fini - tear down the fence driver
> >* for all possible rings.
> >*
> >* @adev: amdgpu device pointer
> > @@ -531,8 +531,7 @@ void amdgpu_fence_driver_hw_fini(struct
> amdgpu_device *adev)
> >
> >   if (!ring || !ring->fence_drv.initialized)
> >   continue;
> > - if (!ring->no_scheduler)
> > - drm_sched_fini(&ring->sched);
> > +
> >   /* You can't wait for HW to signal if it's gone */
> >   if (!drm_dev_is_unplugged(&adev->ddev))
> >   r = amdgpu_fence_wait_empty(ring);
>
>
> Sorry for late notice, missed this patch. By moving drm_sched_fini
> past amdgpu_fence_wait_empty a race is created as even after you waited
> for all fences on the ring to signal the sw scheduler will keep submitting
> new jobs on the ring and so the ring won't stay empty.
>
> For hot device removal also we want to prevent any access to HW past PCI
> removal
> in order to not do any MMIO accesses inside the physical MMIO range that
> no longer
> belongs to this device after it's removal by the PCI core. Stopping all
> the schedulers prevents any MMIO
> accesses done during job submissions and that why drm_sched_fini was
> done as part of amdgpu_fence_driver_hw_fini
> and not amdgpu_fence_driver_sw_fini
>
> Andrey
>
> > @@ -560,6 +559,9 @@ void amdgpu_fence_driver_sw_fini(struct
> amdgpu_device *adev)
> >   if (!ring || !ring->fence_drv.initialized)
> >   continue;
> >
> > + if (!ring->no_scheduler)
> > + drm_sched_fini(&ring->sched);
> > +
> >   for (j = 0; j <= ring->fence_drv.num_fences_mask; ++j)
> >   dma_fence_put(ring->fence_drv.fences[j]);
> >   kfree(ring->fence_drv.fences);
> > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h
> > index 27adffa7658d..9

Re: [PATCH] drm/amdgpu: avoid over-handle of fence driver fini in s3 test (v2)

2021-08-17 Thread Andrey Grodzovsky



On 2021-08-02 1:16 a.m., Guchun Chen wrote:

In amdgpu_fence_driver_hw_fini, no need to call drm_sched_fini to stop
scheduler in s3 test, otherwise, fence related failure will arrive
after resume. To fix this and for a better clean up, move drm_sched_fini
from fence_hw_fini to fence_sw_fini, as it's part of driver shutdown, and
should never be called in hw_fini.

v2: rename amdgpu_fence_driver_init to amdgpu_fence_driver_sw_init,
to keep sw_init and sw_fini paired.

Fixes: cd87a6dcf6af drm/amdgpu: adjust fence driver enable sequence
Suggested-by: Christian König 
Signed-off-by: Guchun Chen 
---
  drivers/gpu/drm/amd/amdgpu/amdgpu_device.c |  5 ++---
  drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c  | 12 +++-
  drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h   |  4 ++--
  3 files changed, 11 insertions(+), 10 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index b1d2dc39e8be..9e53ff851496 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -3646,9 +3646,9 @@ int amdgpu_device_init(struct amdgpu_device *adev,
  
  fence_driver_init:

/* Fence driver */
-   r = amdgpu_fence_driver_init(adev);
+   r = amdgpu_fence_driver_sw_init(adev);
if (r) {
-   dev_err(adev->dev, "amdgpu_fence_driver_init failed\n");
+   dev_err(adev->dev, "amdgpu_fence_driver_sw_init failed\n");
amdgpu_vf_error_put(adev, AMDGIM_ERROR_VF_FENCE_INIT_FAIL, 0, 
0);
goto failed;
}
@@ -3988,7 +3988,6 @@ int amdgpu_device_resume(struct drm_device *dev, bool 
fbcon)
}
amdgpu_fence_driver_hw_init(adev);
  
-

r = amdgpu_device_ip_late_init(adev);
if (r)
return r;
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
index 49c5c7331c53..7495911516c2 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
@@ -498,7 +498,7 @@ int amdgpu_fence_driver_init_ring(struct amdgpu_ring *ring,
  }
  
  /**

- * amdgpu_fence_driver_init - init the fence driver
+ * amdgpu_fence_driver_sw_init - init the fence driver
   * for all possible rings.
   *
   * @adev: amdgpu device pointer
@@ -509,13 +509,13 @@ int amdgpu_fence_driver_init_ring(struct amdgpu_ring 
*ring,
   * amdgpu_fence_driver_start_ring().
   * Returns 0 for success.
   */
-int amdgpu_fence_driver_init(struct amdgpu_device *adev)
+int amdgpu_fence_driver_sw_init(struct amdgpu_device *adev)
  {
return 0;
  }
  
  /**

- * amdgpu_fence_driver_fini - tear down the fence driver
+ * amdgpu_fence_driver_hw_fini - tear down the fence driver
   * for all possible rings.
   *
   * @adev: amdgpu device pointer
@@ -531,8 +531,7 @@ void amdgpu_fence_driver_hw_fini(struct amdgpu_device *adev)
  
  		if (!ring || !ring->fence_drv.initialized)

continue;
-   if (!ring->no_scheduler)
-   drm_sched_fini(&ring->sched);
+
/* You can't wait for HW to signal if it's gone */
if (!drm_dev_is_unplugged(&adev->ddev))
r = amdgpu_fence_wait_empty(ring);



Sorry for late notice, missed this patch. By moving drm_sched_fini
past amdgpu_fence_wait_empty a race is created as even after you waited
for all fences on the ring to signal the sw scheduler will keep submitting
new jobs on the ring and so the ring won't stay empty.

For hot device removal also we want to prevent any access to HW past PCI 
removal
in order to not do any MMIO accesses inside the physical MMIO range that 
no longer
belongs to this device after it's removal by the PCI core. Stopping all 
the schedulers prevents any MMIO
accesses done during job submissions and that why drm_sched_fini was 
done as part of amdgpu_fence_driver_hw_fini

and not amdgpu_fence_driver_sw_fini

Andrey


@@ -560,6 +559,9 @@ void amdgpu_fence_driver_sw_fini(struct amdgpu_device *adev)
if (!ring || !ring->fence_drv.initialized)
continue;
  
+		if (!ring->no_scheduler)

+   drm_sched_fini(&ring->sched);
+
for (j = 0; j <= ring->fence_drv.num_fences_mask; ++j)
dma_fence_put(ring->fence_drv.fences[j]);
kfree(ring->fence_drv.fences);
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h
index 27adffa7658d..9c11ced4312c 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h
@@ -106,7 +106,6 @@ struct amdgpu_fence_driver {
struct dma_fence**fences;
  };
  
-int amdgpu_fence_driver_init(struct amdgpu_device *adev);

  void amdgpu_fence_driver_force_completion(struct amdgpu_ring *ring);
  
  int amdgpu_fence_driver_init_ring(struct amdgpu_ring *ring,

@@ -115,9 +114,10 @@ int amdgpu_fence_driver_init_ring

Re: [PATCH] drm/amdgpu: avoid over-handle of fence driver fini in s3 test (v2)

2021-08-17 Thread Mike Lothian
Hi

I've just noticed something similar when starting weston, I still see it
with this patch, but not on linus's tree

I'll confirm for sure tomorrow and send the stack trace if I can save it

Cheers

Mike

On Tue, 3 Aug 2021 at 02:56, Chen, Guchun  wrote:

> [Public]
>
> Hi Alex,
>
> I submitted the patch before your message, I will take care of this next
> time.
>
> Regards,
> Guchun
>
> -Original Message-
> From: Alex Deucher 
> Sent: Monday, August 2, 2021 9:35 PM
> To: Chen, Guchun 
> Cc: Christian König ;
> amd-gfx@lists.freedesktop.org; Gao, Likun ; Koenig,
> Christian ; Zhang, Hawking <
> hawking.zh...@amd.com>; Deucher, Alexander 
> Subject: Re: [PATCH] drm/amdgpu: avoid over-handle of fence driver fini in
> s3 test (v2)
>
> On Mon, Aug 2, 2021 at 4:23 AM Chen, Guchun  wrote:
> >
> > [Public]
> >
> > Thank you, Christian.
> >
> > Regarding fence_drv.initialized, it looks to a bit redundant, anyway let
> me look into this more.
>
> Does this patch fix this bug?
>
> https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgitlab.freedesktop.org%2Fdrm%2Famd%2F-%2Fissues%2F1668&data=04%7C01%7CGuchun.Chen%40amd.com%7C2bf8bebf5b424751572408d955ba66e8%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C637635081353279181%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&sdata=FuAo44Ws5SnuCxt45A%2Fqmu%2B3OfEkat1G%2BixO8G9uDVc%3D&reserved=0
>
> If so, please add:
> Bug:
> https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgitlab.freedesktop.org%2Fdrm%2Famd%2F-%2Fissues%2F1668&data=04%7C01%7CGuchun.Chen%40amd.com%7C2bf8bebf5b424751572408d955ba66e8%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C637635081353279181%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&sdata=FuAo44Ws5SnuCxt45A%2Fqmu%2B3OfEkat1G%2BixO8G9uDVc%3D&reserved=0
> to the commit message.
>
> Alex
>
> >
> > Regards,
> > Guchun
> >
> > -Original Message-
> > From: Christian König 
> > Sent: Monday, August 2, 2021 2:56 PM
> > To: Chen, Guchun ; amd-gfx@lists.freedesktop.org;
> > Gao, Likun ; Koenig, Christian
> > ; Zhang, Hawking ;
> > Deucher, Alexander 
> > Subject: Re: [PATCH] drm/amdgpu: avoid over-handle of fence driver
> > fini in s3 test (v2)
> >
> > Am 02.08.21 um 07:16 schrieb Guchun Chen:
> > > In amdgpu_fence_driver_hw_fini, no need to call drm_sched_fini to
> > > stop scheduler in s3 test, otherwise, fence related failure will
> > > arrive after resume. To fix this and for a better clean up, move
> > > drm_sched_fini from fence_hw_fini to fence_sw_fini, as it's part of
> > > driver shutdown, and should never be called in hw_fini.
> > >
> > > v2: rename amdgpu_fence_driver_init to amdgpu_fence_driver_sw_init,
> > > to keep sw_init and sw_fini paired.
> > >
> > > Fixes: cd87a6dcf6af drm/amdgpu: adjust fence driver enable sequence
> > > Suggested-by: Christian König 
> > > Signed-off-by: Guchun Chen 
> >
> > It's a bit ambiguous now what fence_drv.initialized means, but I think
> we can live with that for now.
> >
> > Patch is Reviewed-by: Christian König .
> >
> > Regards,
> > Christian.
> >
> > > ---
> > >   drivers/gpu/drm/amd/amdgpu/amdgpu_device.c |  5 ++---
> > >   drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c  | 12 +++-
> > >   drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h   |  4 ++--
> > >   3 files changed, 11 insertions(+), 10 deletions(-)
> > >
> > > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > > b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > > index b1d2dc39e8be..9e53ff851496 100644
> > > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > > @@ -3646,9 +3646,9 @@ int amdgpu_device_init(struct amdgpu_device
> > > *adev,
> > >
> > >   fence_driver_init:
> > >   /* Fence driver */
> > > - r = amdgpu_fence_driver_init(adev);
> > > + r = amdgpu_fence_driver_sw_init(adev);
> > >   if (r) {
> > > - dev_err(adev->dev, "amdgpu_fence_driver_init failed\n");
> > > + dev_err(adev->dev, "amdgpu_fence_driver_sw_init
> > > + failed\n");
> > >   amdgpu_vf_error_put(adev,
> AMDGIM_ERROR_VF_FENCE_INIT_FAIL, 0, 0);
> > >   goto failed;
> > >   }
> > > @@ -3988,7 +3988,6 @@ int am

RE: [PATCH] drm/amdgpu: avoid over-handle of fence driver fini in s3 test (v2)

2021-08-02 Thread Chen, Guchun
[Public]

Hi Alex,

I submitted the patch before your message, I will take care of this next time.

Regards,
Guchun

-Original Message-
From: Alex Deucher  
Sent: Monday, August 2, 2021 9:35 PM
To: Chen, Guchun 
Cc: Christian König ; 
amd-gfx@lists.freedesktop.org; Gao, Likun ; Koenig, 
Christian ; Zhang, Hawking ; 
Deucher, Alexander 
Subject: Re: [PATCH] drm/amdgpu: avoid over-handle of fence driver fini in s3 
test (v2)

On Mon, Aug 2, 2021 at 4:23 AM Chen, Guchun  wrote:
>
> [Public]
>
> Thank you, Christian.
>
> Regarding fence_drv.initialized, it looks to a bit redundant, anyway let me 
> look into this more.

Does this patch fix this bug?
https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgitlab.freedesktop.org%2Fdrm%2Famd%2F-%2Fissues%2F1668&data=04%7C01%7CGuchun.Chen%40amd.com%7C2bf8bebf5b424751572408d955ba66e8%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C637635081353279181%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&sdata=FuAo44Ws5SnuCxt45A%2Fqmu%2B3OfEkat1G%2BixO8G9uDVc%3D&reserved=0

If so, please add:
Bug: 
https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgitlab.freedesktop.org%2Fdrm%2Famd%2F-%2Fissues%2F1668&data=04%7C01%7CGuchun.Chen%40amd.com%7C2bf8bebf5b424751572408d955ba66e8%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C637635081353279181%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&sdata=FuAo44Ws5SnuCxt45A%2Fqmu%2B3OfEkat1G%2BixO8G9uDVc%3D&reserved=0
to the commit message.

Alex

>
> Regards,
> Guchun
>
> -Original Message-
> From: Christian König 
> Sent: Monday, August 2, 2021 2:56 PM
> To: Chen, Guchun ; amd-gfx@lists.freedesktop.org; 
> Gao, Likun ; Koenig, Christian 
> ; Zhang, Hawking ; 
> Deucher, Alexander 
> Subject: Re: [PATCH] drm/amdgpu: avoid over-handle of fence driver 
> fini in s3 test (v2)
>
> Am 02.08.21 um 07:16 schrieb Guchun Chen:
> > In amdgpu_fence_driver_hw_fini, no need to call drm_sched_fini to 
> > stop scheduler in s3 test, otherwise, fence related failure will 
> > arrive after resume. To fix this and for a better clean up, move 
> > drm_sched_fini from fence_hw_fini to fence_sw_fini, as it's part of 
> > driver shutdown, and should never be called in hw_fini.
> >
> > v2: rename amdgpu_fence_driver_init to amdgpu_fence_driver_sw_init, 
> > to keep sw_init and sw_fini paired.
> >
> > Fixes: cd87a6dcf6af drm/amdgpu: adjust fence driver enable sequence
> > Suggested-by: Christian König 
> > Signed-off-by: Guchun Chen 
>
> It's a bit ambiguous now what fence_drv.initialized means, but I think we can 
> live with that for now.
>
> Patch is Reviewed-by: Christian König .
>
> Regards,
> Christian.
>
> > ---
> >   drivers/gpu/drm/amd/amdgpu/amdgpu_device.c |  5 ++---
> >   drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c  | 12 +++-
> >   drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h   |  4 ++--
> >   3 files changed, 11 insertions(+), 10 deletions(-)
> >
> > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > index b1d2dc39e8be..9e53ff851496 100644
> > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > @@ -3646,9 +3646,9 @@ int amdgpu_device_init(struct amdgpu_device 
> > *adev,
> >
> >   fence_driver_init:
> >   /* Fence driver */
> > - r = amdgpu_fence_driver_init(adev);
> > + r = amdgpu_fence_driver_sw_init(adev);
> >   if (r) {
> > - dev_err(adev->dev, "amdgpu_fence_driver_init failed\n");
> > + dev_err(adev->dev, "amdgpu_fence_driver_sw_init 
> > + failed\n");
> >   amdgpu_vf_error_put(adev, AMDGIM_ERROR_VF_FENCE_INIT_FAIL, 0, 
> > 0);
> >   goto failed;
> >   }
> > @@ -3988,7 +3988,6 @@ int amdgpu_device_resume(struct drm_device *dev, bool 
> > fbcon)
> >   }
> >   amdgpu_fence_driver_hw_init(adev);
> >
> > -
> >   r = amdgpu_device_ip_late_init(adev);
> >   if (r)
> >   return r;
> > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
> > b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
> > index 49c5c7331c53..7495911516c2 100644
> > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
> > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
> > @@ -498,7 +498,7 @@ int amdgpu_fence_driver_init_ring(struct amdgpu_ring 
> > *ring,
> >   }
> >
> >   /**
> > - * amdgpu_fence_driv

Re: [PATCH] drm/amdgpu: avoid over-handle of fence driver fini in s3 test (v2)

2021-08-02 Thread Mike Lothian
I've just tested it and it seem to fix my issue

Feel free to add my

Tested-by: Mike Lothian 

On Mon, 2 Aug 2021 at 14:35, Alex Deucher  wrote:

> On Mon, Aug 2, 2021 at 4:23 AM Chen, Guchun  wrote:
> >
> > [Public]
> >
> > Thank you, Christian.
> >
> > Regarding fence_drv.initialized, it looks to a bit redundant, anyway let
> me look into this more.
>
> Does this patch fix this bug?
> https://gitlab.freedesktop.org/drm/amd/-/issues/1668
>
> If so, please add:
> Bug: https://gitlab.freedesktop.org/drm/amd/-/issues/1668
> to the commit message.
>
> Alex
>
> >
> > Regards,
> > Guchun
> >
> > -Original Message-
> > From: Christian König 
> > Sent: Monday, August 2, 2021 2:56 PM
> > To: Chen, Guchun ; amd-gfx@lists.freedesktop.org;
> Gao, Likun ; Koenig, Christian <
> christian.koe...@amd.com>; Zhang, Hawking ;
> Deucher, Alexander 
> > Subject: Re: [PATCH] drm/amdgpu: avoid over-handle of fence driver fini
> in s3 test (v2)
> >
> > Am 02.08.21 um 07:16 schrieb Guchun Chen:
> > > In amdgpu_fence_driver_hw_fini, no need to call drm_sched_fini to stop
> > > scheduler in s3 test, otherwise, fence related failure will arrive
> > > after resume. To fix this and for a better clean up, move
> > > drm_sched_fini from fence_hw_fini to fence_sw_fini, as it's part of
> > > driver shutdown, and should never be called in hw_fini.
> > >
> > > v2: rename amdgpu_fence_driver_init to amdgpu_fence_driver_sw_init, to
> > > keep sw_init and sw_fini paired.
> > >
> > > Fixes: cd87a6dcf6af drm/amdgpu: adjust fence driver enable sequence
> > > Suggested-by: Christian König 
> > > Signed-off-by: Guchun Chen 
> >
> > It's a bit ambiguous now what fence_drv.initialized means, but I think
> we can live with that for now.
> >
> > Patch is Reviewed-by: Christian König .
> >
> > Regards,
> > Christian.
> >
> > > ---
> > >   drivers/gpu/drm/amd/amdgpu/amdgpu_device.c |  5 ++---
> > >   drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c  | 12 +++-
> > >   drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h   |  4 ++--
> > >   3 files changed, 11 insertions(+), 10 deletions(-)
> > >
> > > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > > b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > > index b1d2dc39e8be..9e53ff851496 100644
> > > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > > @@ -3646,9 +3646,9 @@ int amdgpu_device_init(struct amdgpu_device
> > > *adev,
> > >
> > >   fence_driver_init:
> > >   /* Fence driver */
> > > - r = amdgpu_fence_driver_init(adev);
> > > + r = amdgpu_fence_driver_sw_init(adev);
> > >   if (r) {
> > > - dev_err(adev->dev, "amdgpu_fence_driver_init failed\n");
> > > + dev_err(adev->dev, "amdgpu_fence_driver_sw_init
> failed\n");
> > >   amdgpu_vf_error_put(adev,
> AMDGIM_ERROR_VF_FENCE_INIT_FAIL, 0, 0);
> > >   goto failed;
> > >   }
> > > @@ -3988,7 +3988,6 @@ int amdgpu_device_resume(struct drm_device *dev,
> bool fbcon)
> > >   }
> > >   amdgpu_fence_driver_hw_init(adev);
> > >
> > > -
> > >   r = amdgpu_device_ip_late_init(adev);
> > >   if (r)
> > >   return r;
> > > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
> > > b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
> > > index 49c5c7331c53..7495911516c2 100644
> > > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
> > > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
> > > @@ -498,7 +498,7 @@ int amdgpu_fence_driver_init_ring(struct
> amdgpu_ring *ring,
> > >   }
> > >
> > >   /**
> > > - * amdgpu_fence_driver_init - init the fence driver
> > > + * amdgpu_fence_driver_sw_init - init the fence driver
> > >* for all possible rings.
> > >*
> > >* @adev: amdgpu device pointer
> > > @@ -509,13 +509,13 @@ int amdgpu_fence_driver_init_ring(struct
> amdgpu_ring *ring,
> > >* amdgpu_fence_driver_start_ring().
> > >* Returns 0 for success.
> > >*/
> > > -int amdgpu_fence_driver_init(struct amdgpu_device *adev)
> > > +int amdgpu_fence_driver_sw_init(struct amdgpu_device *adev)
> > >   {
> > >   return 0;
> > >   }
&

Re: [PATCH] drm/amdgpu: avoid over-handle of fence driver fini in s3 test (v2)

2021-08-02 Thread Alex Deucher
On Mon, Aug 2, 2021 at 4:23 AM Chen, Guchun  wrote:
>
> [Public]
>
> Thank you, Christian.
>
> Regarding fence_drv.initialized, it looks to a bit redundant, anyway let me 
> look into this more.

Does this patch fix this bug?
https://gitlab.freedesktop.org/drm/amd/-/issues/1668

If so, please add:
Bug: https://gitlab.freedesktop.org/drm/amd/-/issues/1668
to the commit message.

Alex

>
> Regards,
> Guchun
>
> -Original Message-
> From: Christian König 
> Sent: Monday, August 2, 2021 2:56 PM
> To: Chen, Guchun ; amd-gfx@lists.freedesktop.org; Gao, 
> Likun ; Koenig, Christian ; 
> Zhang, Hawking ; Deucher, Alexander 
> 
> Subject: Re: [PATCH] drm/amdgpu: avoid over-handle of fence driver fini in s3 
> test (v2)
>
> Am 02.08.21 um 07:16 schrieb Guchun Chen:
> > In amdgpu_fence_driver_hw_fini, no need to call drm_sched_fini to stop
> > scheduler in s3 test, otherwise, fence related failure will arrive
> > after resume. To fix this and for a better clean up, move
> > drm_sched_fini from fence_hw_fini to fence_sw_fini, as it's part of
> > driver shutdown, and should never be called in hw_fini.
> >
> > v2: rename amdgpu_fence_driver_init to amdgpu_fence_driver_sw_init, to
> > keep sw_init and sw_fini paired.
> >
> > Fixes: cd87a6dcf6af drm/amdgpu: adjust fence driver enable sequence
> > Suggested-by: Christian König 
> > Signed-off-by: Guchun Chen 
>
> It's a bit ambiguous now what fence_drv.initialized means, but I think we can 
> live with that for now.
>
> Patch is Reviewed-by: Christian König .
>
> Regards,
> Christian.
>
> > ---
> >   drivers/gpu/drm/amd/amdgpu/amdgpu_device.c |  5 ++---
> >   drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c  | 12 +++-
> >   drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h   |  4 ++--
> >   3 files changed, 11 insertions(+), 10 deletions(-)
> >
> > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > index b1d2dc39e8be..9e53ff851496 100644
> > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > @@ -3646,9 +3646,9 @@ int amdgpu_device_init(struct amdgpu_device
> > *adev,
> >
> >   fence_driver_init:
> >   /* Fence driver */
> > - r = amdgpu_fence_driver_init(adev);
> > + r = amdgpu_fence_driver_sw_init(adev);
> >   if (r) {
> > - dev_err(adev->dev, "amdgpu_fence_driver_init failed\n");
> > + dev_err(adev->dev, "amdgpu_fence_driver_sw_init failed\n");
> >   amdgpu_vf_error_put(adev, AMDGIM_ERROR_VF_FENCE_INIT_FAIL, 0, 
> > 0);
> >   goto failed;
> >   }
> > @@ -3988,7 +3988,6 @@ int amdgpu_device_resume(struct drm_device *dev, bool 
> > fbcon)
> >   }
> >   amdgpu_fence_driver_hw_init(adev);
> >
> > -
> >   r = amdgpu_device_ip_late_init(adev);
> >   if (r)
> >   return r;
> > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
> > b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
> > index 49c5c7331c53..7495911516c2 100644
> > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
> > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
> > @@ -498,7 +498,7 @@ int amdgpu_fence_driver_init_ring(struct amdgpu_ring 
> > *ring,
> >   }
> >
> >   /**
> > - * amdgpu_fence_driver_init - init the fence driver
> > + * amdgpu_fence_driver_sw_init - init the fence driver
> >* for all possible rings.
> >*
> >* @adev: amdgpu device pointer
> > @@ -509,13 +509,13 @@ int amdgpu_fence_driver_init_ring(struct amdgpu_ring 
> > *ring,
> >* amdgpu_fence_driver_start_ring().
> >* Returns 0 for success.
> >*/
> > -int amdgpu_fence_driver_init(struct amdgpu_device *adev)
> > +int amdgpu_fence_driver_sw_init(struct amdgpu_device *adev)
> >   {
> >   return 0;
> >   }
> >
> >   /**
> > - * amdgpu_fence_driver_fini - tear down the fence driver
> > + * amdgpu_fence_driver_hw_fini - tear down the fence driver
> >* for all possible rings.
> >*
> >* @adev: amdgpu device pointer
> > @@ -531,8 +531,7 @@ void amdgpu_fence_driver_hw_fini(struct
> > amdgpu_device *adev)
> >
> >   if (!ring || !ring->fence_drv.initialized)
> >   continue;
> > - if (!ring->no_scheduler)
> > - drm_sched_fini(&ring->sched);
> > +
> >   /* You can't wait for HW to si

RE: [PATCH] drm/amdgpu: avoid over-handle of fence driver fini in s3 test (v2)

2021-08-02 Thread Chen, Guchun
[Public]

Thank you, Christian.

Regarding fence_drv.initialized, it looks to a bit redundant, anyway let me 
look into this more.

Regards,
Guchun

-Original Message-
From: Christian König  
Sent: Monday, August 2, 2021 2:56 PM
To: Chen, Guchun ; amd-gfx@lists.freedesktop.org; Gao, 
Likun ; Koenig, Christian ; Zhang, 
Hawking ; Deucher, Alexander 
Subject: Re: [PATCH] drm/amdgpu: avoid over-handle of fence driver fini in s3 
test (v2)

Am 02.08.21 um 07:16 schrieb Guchun Chen:
> In amdgpu_fence_driver_hw_fini, no need to call drm_sched_fini to stop 
> scheduler in s3 test, otherwise, fence related failure will arrive 
> after resume. To fix this and for a better clean up, move 
> drm_sched_fini from fence_hw_fini to fence_sw_fini, as it's part of 
> driver shutdown, and should never be called in hw_fini.
>
> v2: rename amdgpu_fence_driver_init to amdgpu_fence_driver_sw_init, to 
> keep sw_init and sw_fini paired.
>
> Fixes: cd87a6dcf6af drm/amdgpu: adjust fence driver enable sequence
> Suggested-by: Christian König 
> Signed-off-by: Guchun Chen 

It's a bit ambiguous now what fence_drv.initialized means, but I think we can 
live with that for now.

Patch is Reviewed-by: Christian König .

Regards,
Christian.

> ---
>   drivers/gpu/drm/amd/amdgpu/amdgpu_device.c |  5 ++---
>   drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c  | 12 +++-
>   drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h   |  4 ++--
>   3 files changed, 11 insertions(+), 10 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c 
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> index b1d2dc39e8be..9e53ff851496 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> @@ -3646,9 +3646,9 @@ int amdgpu_device_init(struct amdgpu_device 
> *adev,
>   
>   fence_driver_init:
>   /* Fence driver */
> - r = amdgpu_fence_driver_init(adev);
> + r = amdgpu_fence_driver_sw_init(adev);
>   if (r) {
> - dev_err(adev->dev, "amdgpu_fence_driver_init failed\n");
> + dev_err(adev->dev, "amdgpu_fence_driver_sw_init failed\n");
>   amdgpu_vf_error_put(adev, AMDGIM_ERROR_VF_FENCE_INIT_FAIL, 0, 
> 0);
>   goto failed;
>   }
> @@ -3988,7 +3988,6 @@ int amdgpu_device_resume(struct drm_device *dev, bool 
> fbcon)
>   }
>   amdgpu_fence_driver_hw_init(adev);
>   
> -
>   r = amdgpu_device_ip_late_init(adev);
>   if (r)
>   return r;
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c 
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
> index 49c5c7331c53..7495911516c2 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
> @@ -498,7 +498,7 @@ int amdgpu_fence_driver_init_ring(struct amdgpu_ring 
> *ring,
>   }
>   
>   /**
> - * amdgpu_fence_driver_init - init the fence driver
> + * amdgpu_fence_driver_sw_init - init the fence driver
>* for all possible rings.
>*
>* @adev: amdgpu device pointer
> @@ -509,13 +509,13 @@ int amdgpu_fence_driver_init_ring(struct amdgpu_ring 
> *ring,
>* amdgpu_fence_driver_start_ring().
>* Returns 0 for success.
>*/
> -int amdgpu_fence_driver_init(struct amdgpu_device *adev)
> +int amdgpu_fence_driver_sw_init(struct amdgpu_device *adev)
>   {
>   return 0;
>   }
>   
>   /**
> - * amdgpu_fence_driver_fini - tear down the fence driver
> + * amdgpu_fence_driver_hw_fini - tear down the fence driver
>* for all possible rings.
>*
>* @adev: amdgpu device pointer
> @@ -531,8 +531,7 @@ void amdgpu_fence_driver_hw_fini(struct 
> amdgpu_device *adev)
>   
>   if (!ring || !ring->fence_drv.initialized)
>   continue;
> - if (!ring->no_scheduler)
> - drm_sched_fini(&ring->sched);
> +
>   /* You can't wait for HW to signal if it's gone */
>   if (!drm_dev_is_unplugged(&adev->ddev))
>   r = amdgpu_fence_wait_empty(ring); @@ -560,6 +559,9 @@ 
> void 
> amdgpu_fence_driver_sw_fini(struct amdgpu_device *adev)
>   if (!ring || !ring->fence_drv.initialized)
>   continue;
>   
> + if (!ring->no_scheduler)
> + drm_sched_fini(&ring->sched);
> +
>   for (j = 0; j <= ring->fence_drv.num_fences_mask; ++j)
>   dma_fence_put(ring->fence_drv.fences[j]);
>   kfree(ring->fence_drv.fences);
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h 
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h
&g

Re: [PATCH] drm/amdgpu: avoid over-handle of fence driver fini in s3 test (v2)

2021-08-01 Thread Christian König

Am 02.08.21 um 07:16 schrieb Guchun Chen:

In amdgpu_fence_driver_hw_fini, no need to call drm_sched_fini to stop
scheduler in s3 test, otherwise, fence related failure will arrive
after resume. To fix this and for a better clean up, move drm_sched_fini
from fence_hw_fini to fence_sw_fini, as it's part of driver shutdown, and
should never be called in hw_fini.

v2: rename amdgpu_fence_driver_init to amdgpu_fence_driver_sw_init,
to keep sw_init and sw_fini paired.

Fixes: cd87a6dcf6af drm/amdgpu: adjust fence driver enable sequence
Suggested-by: Christian König 
Signed-off-by: Guchun Chen 


It's a bit ambiguous now what fence_drv.initialized means, but I think 
we can live with that for now.


Patch is Reviewed-by: Christian König .

Regards,
Christian.


---
  drivers/gpu/drm/amd/amdgpu/amdgpu_device.c |  5 ++---
  drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c  | 12 +++-
  drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h   |  4 ++--
  3 files changed, 11 insertions(+), 10 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index b1d2dc39e8be..9e53ff851496 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -3646,9 +3646,9 @@ int amdgpu_device_init(struct amdgpu_device *adev,
  
  fence_driver_init:

/* Fence driver */
-   r = amdgpu_fence_driver_init(adev);
+   r = amdgpu_fence_driver_sw_init(adev);
if (r) {
-   dev_err(adev->dev, "amdgpu_fence_driver_init failed\n");
+   dev_err(adev->dev, "amdgpu_fence_driver_sw_init failed\n");
amdgpu_vf_error_put(adev, AMDGIM_ERROR_VF_FENCE_INIT_FAIL, 0, 
0);
goto failed;
}
@@ -3988,7 +3988,6 @@ int amdgpu_device_resume(struct drm_device *dev, bool 
fbcon)
}
amdgpu_fence_driver_hw_init(adev);
  
-

r = amdgpu_device_ip_late_init(adev);
if (r)
return r;
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
index 49c5c7331c53..7495911516c2 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
@@ -498,7 +498,7 @@ int amdgpu_fence_driver_init_ring(struct amdgpu_ring *ring,
  }
  
  /**

- * amdgpu_fence_driver_init - init the fence driver
+ * amdgpu_fence_driver_sw_init - init the fence driver
   * for all possible rings.
   *
   * @adev: amdgpu device pointer
@@ -509,13 +509,13 @@ int amdgpu_fence_driver_init_ring(struct amdgpu_ring 
*ring,
   * amdgpu_fence_driver_start_ring().
   * Returns 0 for success.
   */
-int amdgpu_fence_driver_init(struct amdgpu_device *adev)
+int amdgpu_fence_driver_sw_init(struct amdgpu_device *adev)
  {
return 0;
  }
  
  /**

- * amdgpu_fence_driver_fini - tear down the fence driver
+ * amdgpu_fence_driver_hw_fini - tear down the fence driver
   * for all possible rings.
   *
   * @adev: amdgpu device pointer
@@ -531,8 +531,7 @@ void amdgpu_fence_driver_hw_fini(struct amdgpu_device *adev)
  
  		if (!ring || !ring->fence_drv.initialized)

continue;
-   if (!ring->no_scheduler)
-   drm_sched_fini(&ring->sched);
+
/* You can't wait for HW to signal if it's gone */
if (!drm_dev_is_unplugged(&adev->ddev))
r = amdgpu_fence_wait_empty(ring);
@@ -560,6 +559,9 @@ void amdgpu_fence_driver_sw_fini(struct amdgpu_device *adev)
if (!ring || !ring->fence_drv.initialized)
continue;
  
+		if (!ring->no_scheduler)

+   drm_sched_fini(&ring->sched);
+
for (j = 0; j <= ring->fence_drv.num_fences_mask; ++j)
dma_fence_put(ring->fence_drv.fences[j]);
kfree(ring->fence_drv.fences);
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h
index 27adffa7658d..9c11ced4312c 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h
@@ -106,7 +106,6 @@ struct amdgpu_fence_driver {
struct dma_fence**fences;
  };
  
-int amdgpu_fence_driver_init(struct amdgpu_device *adev);

  void amdgpu_fence_driver_force_completion(struct amdgpu_ring *ring);
  
  int amdgpu_fence_driver_init_ring(struct amdgpu_ring *ring,

@@ -115,9 +114,10 @@ int amdgpu_fence_driver_init_ring(struct amdgpu_ring *ring,
  int amdgpu_fence_driver_start_ring(struct amdgpu_ring *ring,
   struct amdgpu_irq_src *irq_src,
   unsigned irq_type);
+void amdgpu_fence_driver_hw_init(struct amdgpu_device *adev);
  void amdgpu_fence_driver_hw_fini(struct amdgpu_device *adev);
+int amdgpu_fence_driver_sw_init(struct amdgpu_device *adev);
  void amdgpu_fence_driver_sw_fini(struct amdgpu_device *adev);
-void amdgpu_fence_driver_hw_init(struct amdgpu_device *adev);
  int amdgpu

RE: [PATCH] drm/amdgpu: avoid over-handle of fence driver fini in s3 test

2021-07-29 Thread Chen, Guchun
[Public]

Got you, so the target is to take this chance to make the code logic more clear 
instead of making it just workable on top of mixed functionality code.

I will post a more reasonable patch later on to address this. Thank you.

Regards,
Guchun

-Original Message-
From: Christian König  
Sent: Thursday, July 29, 2021 8:50 PM
To: Chen, Guchun ; Koenig, Christian 
; amd-gfx@lists.freedesktop.org; Gao, Likun 
; Zhang, Hawking ; Deucher, Alexander 

Subject: Re: [PATCH] drm/amdgpu: avoid over-handle of fence driver fini in s3 
test

Hi Guchun,

see below.

Am 29.07.21 um 14:39 schrieb Chen, Guchun:
> [Public]
>
> Hi Christian,
>
> Thanks for your feedback.
>
> Originally, drm_sched_fini is part of amdgpu_fence_driver_hw_fini, I did not 
> move it.

Yeah, not saying that this is your fault, you should just clean that up more 
thoughtfully.

> Former patch " cd87a6dcf6af drm/amdgpu: adjust fence driver enable sequence " 
> has dropped amdgpu_fence_driver_suspend, and called 
> amdgpu_fence_driver_hw_fini instead in function amdgpu_device_suspend. I 
> checked the code difference between  amdgpu_fence_driver_hw_fini and 
> amdgpu_device_suspend, they are almost the same except drm_sched_fini part, 
> so I think we can leave it as it is, while skipping the execution of 
> drm_sched_fini in suspend/resume case.

And exactly that's a bad idea and the reason why I already said during the 
review of patch "cd87a6dcf6af drm/amdgpu: adjust fence driver enable sequence" 
that the callers of those functions need to be adjusted instead.

1. If not already done please rename the functions as Hawking suggested as 
well, they should be amdgpu_fence_driver_hw_(init|fini) and 
amdgpu_fence_driver_sw_(init|fini).

2. Remove drm_sched_fini() from amdgpu_fence_driver_hw_fini() and add that to 
sw_fini instead.

3. Adjust the callers so that we have the same functionality as before. 
E.g. by calling both hw_fini and sw_fini at the same time.

Regards,
Christian.

>
> Regards,
> Guchun
>
> -Original Message-
> From: Koenig, Christian 
> Sent: Thursday, July 29, 2021 7:11 PM
> To: Chen, Guchun ; amd-gfx@lists.freedesktop.org; 
> Gao, Likun ; Zhang, Hawking 
> ; Deucher, Alexander 
> 
> Subject: Re: [PATCH] drm/amdgpu: avoid over-handle of fence driver 
> fini in s3 test
>
> Am 29.07.21 um 12:49 schrieb Guchun Chen:
>> In amdgpu_fence_driver_hw_fini, no need to call drm_sched_fini to 
>> stop scheduler in s3 test, otherwise, fence errors will occur after resume.
>> So introduce a new parameter to distingiush the case in this API.
> NAK, the problem is rather that drm_sched_fini() is part of the sw shutdown 
> and should never be called during hw_fini.
>
> Christian.
>
>> Fixes: cd87a6dcf6af drm/amdgpu: adjust fence driver enable sequence
>> Signed-off-by: Guchun Chen 
>> ---
>>drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 4 ++--
>>drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c  | 8 +---
>>drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h   | 2 +-
>>3 files changed, 8 insertions(+), 6 deletions(-)
>>
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>> index b1d2dc39e8be..aaff8ebbb7dc 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>> @@ -3844,7 +3844,7 @@ void amdgpu_device_fini_hw(struct amdgpu_device *adev)
>>  else
>>  drm_atomic_helper_shutdown(adev_to_drm(adev));
>>  }
>> -amdgpu_fence_driver_hw_fini(adev);
>> +amdgpu_fence_driver_hw_fini(adev, false);
>>
>>  if (adev->pm_sysfs_en)
>>  amdgpu_pm_sysfs_fini(adev);
>> @@ -3941,7 +3941,7 @@ int amdgpu_device_suspend(struct drm_device *dev, bool 
>> fbcon)
>>  /* evict vram memory */
>>  amdgpu_bo_evict_vram(adev);
>>
>> -amdgpu_fence_driver_hw_fini(adev);
>> +amdgpu_fence_driver_hw_fini(adev, adev->in_suspend);
>>
>>  amdgpu_device_ip_suspend_phase2(adev);
>>  /* evict remaining vram memory
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
>> index 49c5c7331c53..7e6a73c2919d 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
>> @@ -515,14 +515,15 @@ int amdgpu_fence_driver_init(struct amdgpu_device 
>> *adev)
>>}
>>
>>/**
>> - * amdgpu_fence_driver_fini - tear down the fence driver
>> + * amdgpu_fence_driver_hw_fini - tear down the fence driver
>> * for all possible rings.
>

RE: [PATCH] drm/amdgpu: avoid over-handle of fence driver fini in s3 test

2021-07-29 Thread Chen, Guchun
[Public]

Hi Christian,

Thanks for your feedback.

Originally, drm_sched_fini is part of amdgpu_fence_driver_hw_fini, I did not 
move it.
Former patch " cd87a6dcf6af drm/amdgpu: adjust fence driver enable sequence " 
has dropped amdgpu_fence_driver_suspend, and called amdgpu_fence_driver_hw_fini 
instead in function amdgpu_device_suspend. I checked the code difference 
between  amdgpu_fence_driver_hw_fini and amdgpu_device_suspend, they are almost 
the same except drm_sched_fini part, so I think we can leave it as it is, while 
skipping the execution of drm_sched_fini in suspend/resume case.

Regards,
Guchun

-Original Message-
From: Koenig, Christian  
Sent: Thursday, July 29, 2021 7:11 PM
To: Chen, Guchun ; amd-gfx@lists.freedesktop.org; Gao, 
Likun ; Zhang, Hawking ; Deucher, 
Alexander 
Subject: Re: [PATCH] drm/amdgpu: avoid over-handle of fence driver fini in s3 
test

Am 29.07.21 um 12:49 schrieb Guchun Chen:
> In amdgpu_fence_driver_hw_fini, no need to call drm_sched_fini to stop 
> scheduler in s3 test, otherwise, fence errors will occur after resume.
> So introduce a new parameter to distingiush the case in this API.

NAK, the problem is rather that drm_sched_fini() is part of the sw shutdown and 
should never be called during hw_fini.

Christian.

>
> Fixes: cd87a6dcf6af drm/amdgpu: adjust fence driver enable sequence
> Signed-off-by: Guchun Chen 
> ---
>   drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 4 ++--
>   drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c  | 8 +---
>   drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h   | 2 +-
>   3 files changed, 8 insertions(+), 6 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c 
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> index b1d2dc39e8be..aaff8ebbb7dc 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> @@ -3844,7 +3844,7 @@ void amdgpu_device_fini_hw(struct amdgpu_device *adev)
>   else
>   drm_atomic_helper_shutdown(adev_to_drm(adev));
>   }
> - amdgpu_fence_driver_hw_fini(adev);
> + amdgpu_fence_driver_hw_fini(adev, false);
>   
>   if (adev->pm_sysfs_en)
>   amdgpu_pm_sysfs_fini(adev);
> @@ -3941,7 +3941,7 @@ int amdgpu_device_suspend(struct drm_device *dev, bool 
> fbcon)
>   /* evict vram memory */
>   amdgpu_bo_evict_vram(adev);
>   
> - amdgpu_fence_driver_hw_fini(adev);
> + amdgpu_fence_driver_hw_fini(adev, adev->in_suspend);
>   
>   amdgpu_device_ip_suspend_phase2(adev);
>   /* evict remaining vram memory
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c 
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
> index 49c5c7331c53..7e6a73c2919d 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
> @@ -515,14 +515,15 @@ int amdgpu_fence_driver_init(struct amdgpu_device *adev)
>   }
>   
>   /**
> - * amdgpu_fence_driver_fini - tear down the fence driver
> + * amdgpu_fence_driver_hw_fini - tear down the fence driver
>* for all possible rings.
>*
>* @adev: amdgpu device pointer
> + * @in_reset: indicator to distingiush device removal case or s3 case
>*
>* Tear down the fence driver for all possible rings (all asics).
>*/
> -void amdgpu_fence_driver_hw_fini(struct amdgpu_device *adev)
> +void amdgpu_fence_driver_hw_fini(struct amdgpu_device *adev, bool 
> +in_reset)
>   {
>   int i, r;
>   
> @@ -531,8 +532,9 @@ void amdgpu_fence_driver_hw_fini(struct 
> amdgpu_device *adev)
>   
>   if (!ring || !ring->fence_drv.initialized)
>   continue;
> - if (!ring->no_scheduler)
> + if (!ring->no_scheduler && !in_reset)
>   drm_sched_fini(&ring->sched);
> +
>   /* You can't wait for HW to signal if it's gone */
>   if (!drm_dev_is_unplugged(&adev->ddev))
>   r = amdgpu_fence_wait_empty(ring); diff --git 
> a/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h 
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h
> index 27adffa7658d..42cbecfc26a3 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h
> @@ -115,7 +115,7 @@ int amdgpu_fence_driver_init_ring(struct amdgpu_ring 
> *ring,
>   int amdgpu_fence_driver_start_ring(struct amdgpu_ring *ring,
>  struct amdgpu_irq_src *irq_src,
>  unsigned irq_type);
> -void amdgpu_fence_driver_hw_fini(struct amdgpu_device *adev);
> +void amdgpu_fence_driver_hw_fini(struct amdgpu_device *adev, bool 
> +in_reset);
>   void amdgpu_fence_driver_sw_fi

Re: [PATCH] drm/amdgpu: avoid over-handle of fence driver fini in s3 test

2021-07-29 Thread Christian König

Am 29.07.21 um 12:49 schrieb Guchun Chen:

In amdgpu_fence_driver_hw_fini, no need to call drm_sched_fini to stop
scheduler in s3 test, otherwise, fence errors will occur after resume.
So introduce a new parameter to distingiush the case in this API.


NAK, the problem is rather that drm_sched_fini() is part of the sw 
shutdown and should never be called during hw_fini.


Christian.



Fixes: cd87a6dcf6af drm/amdgpu: adjust fence driver enable sequence
Signed-off-by: Guchun Chen 
---
  drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 4 ++--
  drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c  | 8 +---
  drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h   | 2 +-
  3 files changed, 8 insertions(+), 6 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index b1d2dc39e8be..aaff8ebbb7dc 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -3844,7 +3844,7 @@ void amdgpu_device_fini_hw(struct amdgpu_device *adev)
else
drm_atomic_helper_shutdown(adev_to_drm(adev));
}
-   amdgpu_fence_driver_hw_fini(adev);
+   amdgpu_fence_driver_hw_fini(adev, false);
  
  	if (adev->pm_sysfs_en)

amdgpu_pm_sysfs_fini(adev);
@@ -3941,7 +3941,7 @@ int amdgpu_device_suspend(struct drm_device *dev, bool 
fbcon)
/* evict vram memory */
amdgpu_bo_evict_vram(adev);
  
-	amdgpu_fence_driver_hw_fini(adev);

+   amdgpu_fence_driver_hw_fini(adev, adev->in_suspend);
  
  	amdgpu_device_ip_suspend_phase2(adev);

/* evict remaining vram memory
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
index 49c5c7331c53..7e6a73c2919d 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
@@ -515,14 +515,15 @@ int amdgpu_fence_driver_init(struct amdgpu_device *adev)
  }
  
  /**

- * amdgpu_fence_driver_fini - tear down the fence driver
+ * amdgpu_fence_driver_hw_fini - tear down the fence driver
   * for all possible rings.
   *
   * @adev: amdgpu device pointer
+ * @in_reset: indicator to distingiush device removal case or s3 case
   *
   * Tear down the fence driver for all possible rings (all asics).
   */
-void amdgpu_fence_driver_hw_fini(struct amdgpu_device *adev)
+void amdgpu_fence_driver_hw_fini(struct amdgpu_device *adev, bool in_reset)
  {
int i, r;
  
@@ -531,8 +532,9 @@ void amdgpu_fence_driver_hw_fini(struct amdgpu_device *adev)
  
  		if (!ring || !ring->fence_drv.initialized)

continue;
-   if (!ring->no_scheduler)
+   if (!ring->no_scheduler && !in_reset)
drm_sched_fini(&ring->sched);
+
/* You can't wait for HW to signal if it's gone */
if (!drm_dev_is_unplugged(&adev->ddev))
r = amdgpu_fence_wait_empty(ring);
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h
index 27adffa7658d..42cbecfc26a3 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h
@@ -115,7 +115,7 @@ int amdgpu_fence_driver_init_ring(struct amdgpu_ring *ring,
  int amdgpu_fence_driver_start_ring(struct amdgpu_ring *ring,
   struct amdgpu_irq_src *irq_src,
   unsigned irq_type);
-void amdgpu_fence_driver_hw_fini(struct amdgpu_device *adev);
+void amdgpu_fence_driver_hw_fini(struct amdgpu_device *adev, bool in_reset);
  void amdgpu_fence_driver_sw_fini(struct amdgpu_device *adev);
  void amdgpu_fence_driver_hw_init(struct amdgpu_device *adev);
  int amdgpu_fence_emit(struct amdgpu_ring *ring, struct dma_fence **fence,


___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx


Re: [PATCH] drm/amdgpu: avoid over-handle of fence driver fini in s3 test

2021-07-29 Thread Christian König

Exactly that yes.

Thanks,
Christian.

Am 29.07.21 um 14:56 schrieb Chen, Guchun:

[Public]

Got you, so the target is to take this chance to make the code logic more clear 
instead of making it just workable on top of mixed functionality code.

I will post a more reasonable patch later on to address this. Thank you.

Regards,
Guchun

-Original Message-
From: Christian König 
Sent: Thursday, July 29, 2021 8:50 PM
To: Chen, Guchun ; Koenig, Christian ; 
amd-gfx@lists.freedesktop.org; Gao, Likun ; Zhang, Hawking 
; Deucher, Alexander 
Subject: Re: [PATCH] drm/amdgpu: avoid over-handle of fence driver fini in s3 
test

Hi Guchun,

see below.

Am 29.07.21 um 14:39 schrieb Chen, Guchun:

[Public]

Hi Christian,

Thanks for your feedback.

Originally, drm_sched_fini is part of amdgpu_fence_driver_hw_fini, I did not 
move it.

Yeah, not saying that this is your fault, you should just clean that up more 
thoughtfully.


Former patch " cd87a6dcf6af drm/amdgpu: adjust fence driver enable sequence " 
has dropped amdgpu_fence_driver_suspend, and called amdgpu_fence_driver_hw_fini instead 
in function amdgpu_device_suspend. I checked the code difference between  
amdgpu_fence_driver_hw_fini and amdgpu_device_suspend, they are almost the same except 
drm_sched_fini part, so I think we can leave it as it is, while skipping the execution of 
drm_sched_fini in suspend/resume case.

And exactly that's a bad idea and the reason why I already said during the review of 
patch "cd87a6dcf6af drm/amdgpu: adjust fence driver enable sequence" that the 
callers of those functions need to be adjusted instead.

1. If not already done please rename the functions as Hawking suggested as 
well, they should be amdgpu_fence_driver_hw_(init|fini) and 
amdgpu_fence_driver_sw_(init|fini).

2. Remove drm_sched_fini() from amdgpu_fence_driver_hw_fini() and add that to 
sw_fini instead.

3. Adjust the callers so that we have the same functionality as before.
E.g. by calling both hw_fini and sw_fini at the same time.

Regards,
Christian.


Regards,
Guchun

-Original Message-
From: Koenig, Christian 
Sent: Thursday, July 29, 2021 7:11 PM
To: Chen, Guchun ; amd-gfx@lists.freedesktop.org;
Gao, Likun ; Zhang, Hawking
; Deucher, Alexander

Subject: Re: [PATCH] drm/amdgpu: avoid over-handle of fence driver
fini in s3 test

Am 29.07.21 um 12:49 schrieb Guchun Chen:

In amdgpu_fence_driver_hw_fini, no need to call drm_sched_fini to
stop scheduler in s3 test, otherwise, fence errors will occur after resume.
So introduce a new parameter to distingiush the case in this API.

NAK, the problem is rather that drm_sched_fini() is part of the sw shutdown and 
should never be called during hw_fini.

Christian.


Fixes: cd87a6dcf6af drm/amdgpu: adjust fence driver enable sequence
Signed-off-by: Guchun Chen 
---
drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 4 ++--
drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c  | 8 +---
drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h   | 2 +-
3 files changed, 8 insertions(+), 6 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index b1d2dc39e8be..aaff8ebbb7dc 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -3844,7 +3844,7 @@ void amdgpu_device_fini_hw(struct amdgpu_device *adev)
else
drm_atomic_helper_shutdown(adev_to_drm(adev));
}
-   amdgpu_fence_driver_hw_fini(adev);
+   amdgpu_fence_driver_hw_fini(adev, false);

	if (adev->pm_sysfs_en)

amdgpu_pm_sysfs_fini(adev);
@@ -3941,7 +3941,7 @@ int amdgpu_device_suspend(struct drm_device *dev, bool 
fbcon)
/* evict vram memory */
amdgpu_bo_evict_vram(adev);

-	amdgpu_fence_driver_hw_fini(adev);

+   amdgpu_fence_driver_hw_fini(adev, adev->in_suspend);

	amdgpu_device_ip_suspend_phase2(adev);

/* evict remaining vram memory
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
index 49c5c7331c53..7e6a73c2919d 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
@@ -515,14 +515,15 @@ int amdgpu_fence_driver_init(struct amdgpu_device *adev)
}

/**

- * amdgpu_fence_driver_fini - tear down the fence driver
+ * amdgpu_fence_driver_hw_fini - tear down the fence driver
 * for all possible rings.
 *
 * @adev: amdgpu device pointer
+ * @in_reset: indicator to distingiush device removal case or s3
+ case
 *
 * Tear down the fence driver for all possible rings (all asics).
 */
-void amdgpu_fence_driver_hw_fini(struct amdgpu_device *adev)
+void amdgpu_fence_driver_hw_fini(struct amdgpu_device *adev, bool
+in_reset)
{
int i, r;

@@ -531,8 +532,9 @@ void amdgpu_fence_driver_hw_fini(struct

amdgpu_device *adev)

		if (!ring 

Re: [PATCH] drm/amdgpu: avoid over-handle of fence driver fini in s3 test

2021-07-29 Thread Christian König

Hi Guchun,

see below.

Am 29.07.21 um 14:39 schrieb Chen, Guchun:

[Public]

Hi Christian,

Thanks for your feedback.

Originally, drm_sched_fini is part of amdgpu_fence_driver_hw_fini, I did not 
move it.


Yeah, not saying that this is your fault, you should just clean that up 
more thoughtfully.



Former patch " cd87a6dcf6af drm/amdgpu: adjust fence driver enable sequence " 
has dropped amdgpu_fence_driver_suspend, and called amdgpu_fence_driver_hw_fini instead 
in function amdgpu_device_suspend. I checked the code difference between  
amdgpu_fence_driver_hw_fini and amdgpu_device_suspend, they are almost the same except 
drm_sched_fini part, so I think we can leave it as it is, while skipping the execution of 
drm_sched_fini in suspend/resume case.


And exactly that's a bad idea and the reason why I already said during 
the review of patch "cd87a6dcf6af drm/amdgpu: adjust fence driver enable 
sequence" that the callers of those functions need to be adjusted instead.


1. If not already done please rename the functions as Hawking suggested 
as well, they should be amdgpu_fence_driver_hw_(init|fini) and 
amdgpu_fence_driver_sw_(init|fini).


2. Remove drm_sched_fini() from amdgpu_fence_driver_hw_fini() and add 
that to sw_fini instead.


3. Adjust the callers so that we have the same functionality as before. 
E.g. by calling both hw_fini and sw_fini at the same time.


Regards,
Christian.



Regards,
Guchun

-Original Message-
From: Koenig, Christian 
Sent: Thursday, July 29, 2021 7:11 PM
To: Chen, Guchun ; amd-gfx@lists.freedesktop.org; Gao, Likun 
; Zhang, Hawking ; Deucher, Alexander 

Subject: Re: [PATCH] drm/amdgpu: avoid over-handle of fence driver fini in s3 
test

Am 29.07.21 um 12:49 schrieb Guchun Chen:

In amdgpu_fence_driver_hw_fini, no need to call drm_sched_fini to stop
scheduler in s3 test, otherwise, fence errors will occur after resume.
So introduce a new parameter to distingiush the case in this API.

NAK, the problem is rather that drm_sched_fini() is part of the sw shutdown and 
should never be called during hw_fini.

Christian.


Fixes: cd87a6dcf6af drm/amdgpu: adjust fence driver enable sequence
Signed-off-by: Guchun Chen 
---
   drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 4 ++--
   drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c  | 8 +---
   drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h   | 2 +-
   3 files changed, 8 insertions(+), 6 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index b1d2dc39e8be..aaff8ebbb7dc 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -3844,7 +3844,7 @@ void amdgpu_device_fini_hw(struct amdgpu_device *adev)
else
drm_atomic_helper_shutdown(adev_to_drm(adev));
}
-   amdgpu_fence_driver_hw_fini(adev);
+   amdgpu_fence_driver_hw_fini(adev, false);
   
   	if (adev->pm_sysfs_en)

amdgpu_pm_sysfs_fini(adev);
@@ -3941,7 +3941,7 @@ int amdgpu_device_suspend(struct drm_device *dev, bool 
fbcon)
/* evict vram memory */
amdgpu_bo_evict_vram(adev);
   
-	amdgpu_fence_driver_hw_fini(adev);

+   amdgpu_fence_driver_hw_fini(adev, adev->in_suspend);
   
   	amdgpu_device_ip_suspend_phase2(adev);

/* evict remaining vram memory
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
index 49c5c7331c53..7e6a73c2919d 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
@@ -515,14 +515,15 @@ int amdgpu_fence_driver_init(struct amdgpu_device *adev)
   }
   
   /**

- * amdgpu_fence_driver_fini - tear down the fence driver
+ * amdgpu_fence_driver_hw_fini - tear down the fence driver
* for all possible rings.
*
* @adev: amdgpu device pointer
+ * @in_reset: indicator to distingiush device removal case or s3 case
*
* Tear down the fence driver for all possible rings (all asics).
*/
-void amdgpu_fence_driver_hw_fini(struct amdgpu_device *adev)
+void amdgpu_fence_driver_hw_fini(struct amdgpu_device *adev, bool
+in_reset)
   {
int i, r;
   
@@ -531,8 +532,9 @@ void amdgpu_fence_driver_hw_fini(struct

amdgpu_device *adev)
   
   		if (!ring || !ring->fence_drv.initialized)

continue;
-   if (!ring->no_scheduler)
+   if (!ring->no_scheduler && !in_reset)
drm_sched_fini(&ring->sched);
+
/* You can't wait for HW to signal if it's gone */
if (!drm_dev_is_unplugged(&adev->ddev))
r = amdgpu_fence_wait_empty(ring); diff --git
a/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h
b/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h
index 27adffa7658d..42cbecfc26a3 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ring.h