Re: [PATCH v2 8/8] drm/amdgpu: Prevent any job recoveries after device is unplugged.

2020-11-19 Thread Andrey Grodzovsky


On 11/19/20 10:29 AM, Daniel Vetter wrote:

On Thu, Nov 19, 2020 at 10:02:28AM -0500, Andrey Grodzovsky wrote:

On 11/19/20 2:55 AM, Christian König wrote:

Am 18.11.20 um 17:20 schrieb Andrey Grodzovsky:

On 11/18/20 7:01 AM, Christian König wrote:

Am 18.11.20 um 08:39 schrieb Daniel Vetter:

On Tue, Nov 17, 2020 at 9:07 PM Andrey Grodzovsky
 wrote:

On 11/17/20 2:49 PM, Daniel Vetter wrote:

On Tue, Nov 17, 2020 at 02:18:49PM -0500, Andrey Grodzovsky wrote:

On 11/17/20 1:52 PM, Daniel Vetter wrote:

On Tue, Nov 17, 2020 at 01:38:14PM -0500, Andrey Grodzovsky wrote:

On 6/22/20 5:53 AM, Daniel Vetter wrote:

On Sun, Jun 21, 2020 at 02:03:08AM -0400, Andrey Grodzovsky wrote:

No point to try recovery if device is gone, just messes up things.

Signed-off-by: Andrey Grodzovsky 
---
  drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c | 16 
  drivers/gpu/drm/amd/amdgpu/amdgpu_job.c | 8 
  2 files changed, 24 insertions(+)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
index 6932d75..5d6d3d9 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
@@ -1129,12 +1129,28 @@ static int amdgpu_pci_probe(struct pci_dev *pdev,
   return ret;
  }
+static void amdgpu_cancel_all_tdr(struct amdgpu_device *adev)
+{
+    int i;
+
+    for (i = 0; i < AMDGPU_MAX_RINGS; ++i) {
+        struct amdgpu_ring *ring = adev->rings[i];
+
+        if (!ring || !ring->sched.thread)
+            continue;
+
+        cancel_delayed_work_sync(&ring->sched.work_tdr);
+    }
+}

I think this is a function that's supposed to be in drm/scheduler, not here.
Might also just be your cleanup code being ordered wrongly, or your split in
one of the earlier patches not done quite right.
-Daniel

This function iterates across all the schedulers per amdgpu device and
accesses amdgpu-specific structures; drm/scheduler deals with a single
scheduler at most, so this looks to me like the right place for this function.

I guess we could keep track of all schedulers somewhere in a list in
struct drm_device and wrap this up. That was kinda the idea.

Minimally I think a tiny wrapper with docs for the
cancel_delayed_work_sync(&sched->work_tdr); which explains what you must
observe to make sure there's no race.

Will do



I'm not exactly sure there's no
guarantee here we won't get a new tdr work launched right afterwards at
least, so this looks a bit like a hack.

Note that for any TDR work happening after amdgpu_cancel_all_tdr,
amdgpu_job_timedout->drm_dev_is_unplugged will return true and so it will
return early. To make it watertight against the race I can switch from
drm_dev_is_unplugged to drm_dev_enter/exit.

Hm that's confusing. You do a work_cancel_sync, so that at least looks
like "tdr work must not run after this point"

If you only rely on drm_dev_enter/exit check with the tdr work, then
there's no need to cancel anything.

Agree, synchronize_srcu from drm_dev_unplug should play the role of
'flushing' any earlier (in-progress) tdr work which is using the
drm_dev_enter/exit pair. Any later arising tdr will terminate early when
drm_dev_enter returns false.

Nope, anything you put into the work itself cannot close this race.
It's the schedule_work that matters here. Or I'm missing something ...
I thought that the tdr work you're cancelling here is launched by
drm/scheduler code, not by the amd callback?


My bad, you are right, I am supposed to put a drm_dev_enter/exit pair
into drm_sched_job_timedout.



Yes that is correct. Canceling the work item is not the right
approach at all, nor is adding dev_enter/exit pair in the
recovery handler.


Without adding the dev_enter/exit guarding pair in the recovery handler you
end up with a GPU reset starting while the device is already unplugged, which
leads to multiple errors and general mess.



What we need to do here is to stop the scheduler thread and then
wait for any timeout handling to have finished.

Otherwise it can schedule a new timeout just after we have canceled this one.

Regards,
Christian.


Schedulers are stopped from amdgpu_driver_unload_kms, which indeed happens
after drm_dev_unplug, so yes, there is still a chance of new work being
scheduled and a timeout armed afterwards. But once I fix the code to place a
drm_dev_enter/exit pair into drm_sched_job_timedout, I don't see why that's
not a good solution?

Yeah that should work as well, but then you also don't need to cancel
the work item from the driver.


Indeed, as Daniel pointed out there is no need and I dropped it. One
correction - I previously said that w/o the dev_enter/exit guarding pair in
the scheduler's TO handler you will get a GPU reset starting while the device
is already gone - of course this does not fully prevent that, as the device
can be extracted at any moment just after we have already entered GPU
recovery. But it does save us processing a futile GPU recovery, which always
starts once you unplug the device if there are active jobs in progress at the
moment.
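
For readers following the thread, a minimal sketch of what placing a
drm_dev_enter/exit pair into drm_sched_job_timedout could look like.
Illustrative only: the drm_gpu_scheduler of this era has no back-pointer to
its drm_device, so the sched->dev field below is an assumption, and the
job-list locking of the real handler is omitted.

#include <drm/drm_drv.h>
#include <drm/gpu_scheduler.h>

/* Sketch only: sched->dev is a hypothetical back-pointer to the drm_device. */
static void drm_sched_job_timedout(struct work_struct *work)
{
    struct drm_gpu_scheduler *sched =
        container_of(work, struct drm_gpu_scheduler, work_tdr.work);
    struct drm_sched_job *job;
    int idx;

    /* Bail out early if the device is already unplugged; in particular,
     * do not re-arm the timeout timer, so no new tdr work is queued.
     */
    if (!drm_dev_enter(sched->dev, &idx))
        return;

    job = list_first_entry_or_null(&sched->ring_mirror_list,
                                   struct drm_sched_job, node);
    if (job)
        sched->ops->timedout_job(job);

    /* Re-arm the timeout only while the device is still present. */
    drm_sched_start_timeout(sched);

    drm_dev_exit(idx);
}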

Re: [PATCH v2 8/8] drm/amdgpu: Prevent any job recoveries after device is unplugged.

2020-11-19 Thread Daniel Vetter
On Thu, Nov 19, 2020 at 10:02:28AM -0500, Andrey Grodzovsky wrote:
> 
> On 11/19/20 2:55 AM, Christian König wrote:
> > Am 18.11.20 um 17:20 schrieb Andrey Grodzovsky:
> > > 
> > > On 11/18/20 7:01 AM, Christian König wrote:
> > > > Am 18.11.20 um 08:39 schrieb Daniel Vetter:
> > > > > On Tue, Nov 17, 2020 at 9:07 PM Andrey Grodzovsky
> > > > >  wrote:
> > > > > > 
> > > > > > On 11/17/20 2:49 PM, Daniel Vetter wrote:
> > > > > > > On Tue, Nov 17, 2020 at 02:18:49PM -0500, Andrey Grodzovsky wrote:
> > > > > > > > On 11/17/20 1:52 PM, Daniel Vetter wrote:
> > > > > > > > > On Tue, Nov 17, 2020 at 01:38:14PM -0500, Andrey Grodzovsky 
> > > > > > > > > wrote:
> > > > > > > > > > On 6/22/20 5:53 AM, Daniel Vetter wrote:
> > > > > > > > > > > On Sun, Jun 21, 2020 at 02:03:08AM -0400, Andrey 
> > > > > > > > > > > Grodzovsky wrote:
> > > > > > > > > > > > No point to try recovery if device is gone, just messes 
> > > > > > > > > > > > up things.
> > > > > > > > > > > > 
> > > > > > > > > > > > Signed-off-by: Andrey Grodzovsky 
> > > > > > > > > > > > 
> > > > > > > > > > > > ---
> > > > > > > > > > > >  drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c | 16 
> > > > > > > > > > > > 
> > > > > > > > > > > >  drivers/gpu/drm/amd/amdgpu/amdgpu_job.c | 8 
> > > > > > > > > > > > 
> > > > > > > > > > > >  2 files changed, 24 insertions(+)
> > > > > > > > > > > > 
> > > > > > > > > > > > diff --git
> > > > > > > > > > > > a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
> > > > > > > > > > > > b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
> > > > > > > > > > > > index 6932d75..5d6d3d9 100644
> > > > > > > > > > > > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
> > > > > > > > > > > > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
> > > > > > > > > > > > @@ -1129,12 +1129,28 @@ static
> > > > > > > > > > > > int amdgpu_pci_probe(struct
> > > > > > > > > > > > pci_dev *pdev,
> > > > > > > > > > > >   return ret;
> > > > > > > > > > > >  }
> > > > > > > > > > > > +static void amdgpu_cancel_all_tdr(struct amdgpu_device 
> > > > > > > > > > > > *adev)
> > > > > > > > > > > > +{
> > > > > > > > > > > > +    int i;
> > > > > > > > > > > > +
> > > > > > > > > > > > +    for (i = 0; i < AMDGPU_MAX_RINGS; ++i) {
> > > > > > > > > > > > +    struct amdgpu_ring *ring = 
> > > > > > > > > > > > adev->rings[i];
> > > > > > > > > > > > +
> > > > > > > > > > > > +    if (!ring || !ring->sched.thread)
> > > > > > > > > > > > +    continue;
> > > > > > > > > > > > +
> > > > > > > > > > > > + cancel_delayed_work_sync(&ring->sched.work_tdr);
> > > > > > > > > > > > +    }
> > > > > > > > > > > > +}
> > > > > > > > > > > I think this is a function that's supposed to be in 
> > > > > > > > > > > drm/scheduler, not
> > > > > > > > > > > here. Might also just be your
> > > > > > > > > > > cleanup code being ordered wrongly,
> > > > > > > > > > > or your
> > > > > > > > > > > split in one of the earlier patches not done quite right.
> > > > > > > > > > > -Daniel
> > > > > > > > > > This function iterates across all the
> > > > > > > > > > schedulers per amdgpu device and
> > > > > > > > > > accesses
> > > > > > > > > > amdgpu specific structures ,
> > > > > > > > > > drm/scheduler deals with single
> > > > > > > > > > scheduler at most
> > > > > > > > > > so looks to me like this is the right place for this 
> > > > > > > > > > function
> > > > > > > > > I guess we could keep track of all schedulers somewhere in a 
> > > > > > > > > list in
> > > > > > > > > struct drm_device and wrap this up. That was kinda the idea.
> > > > > > > > > 
> > > > > > > > > Minimally I think a tiny wrapper with docs for the
> > > > > > > > > cancel_delayed_work_sync(&sched->work_tdr); which explains 
> > > > > > > > > what you must
> > > > > > > > > observe to make sure there's no race.
> > > > > > > > Will do
> > > > > > > > 
> > > > > > > > 
> > > > > > > > > I'm not exactly sure there's no
> > > > > > > > > guarantee here we won't get a new tdr work launched right 
> > > > > > > > > afterwards at
> > > > > > > > > least, so this looks a bit like a hack.
> > > > > > > > Note that for any TDR work happening post amdgpu_cancel_all_tdr
> > > > > > > > amdgpu_job_timedout->drm_dev_is_unplugged
> > > > > > > > will return true and so it will return early. To make it water 
> > > > > > > > proof tight
> > > > > > > > against race
> > > > > > > > i can switch from drm_dev_is_unplugged to drm_dev_enter/exit
> > > > > > > Hm that's confusing. You do a work_cancel_sync, so that at least 
> > > > > > > looks
> > > > > > > like "tdr work must not run after this point"
> > > > > > > 
> > > > > > > If you only rely on drm_dev_enter/exit check with the tdr work, 
> > > > > > > then
> > > > > > > there's no need to cancel anything.
> > > > > > 
> > > > > > Agree, synchronize_srcu from drm_dev_unplug should play the role
> > > > > > of 'flushing' any earlier (in progress) tdr work which is
> > > 

Re: [PATCH v2 8/8] drm/amdgpu: Prevent any job recoveries after device is unplugged.

2020-11-19 Thread Andrey Grodzovsky


On 11/19/20 2:55 AM, Christian König wrote:

Am 18.11.20 um 17:20 schrieb Andrey Grodzovsky:


On 11/18/20 7:01 AM, Christian König wrote:

Am 18.11.20 um 08:39 schrieb Daniel Vetter:

On Tue, Nov 17, 2020 at 9:07 PM Andrey Grodzovsky
 wrote:


On 11/17/20 2:49 PM, Daniel Vetter wrote:

On Tue, Nov 17, 2020 at 02:18:49PM -0500, Andrey Grodzovsky wrote:

On 11/17/20 1:52 PM, Daniel Vetter wrote:

On Tue, Nov 17, 2020 at 01:38:14PM -0500, Andrey Grodzovsky wrote:

On 6/22/20 5:53 AM, Daniel Vetter wrote:

On Sun, Jun 21, 2020 at 02:03:08AM -0400, Andrey Grodzovsky wrote:

No point to try recovery if device is gone, just messes up things.

Signed-off-by: Andrey Grodzovsky 
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c | 16 
 drivers/gpu/drm/amd/amdgpu/amdgpu_job.c | 8 
 2 files changed, 24 insertions(+)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c

index 6932d75..5d6d3d9 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
@@ -1129,12 +1129,28 @@ static int amdgpu_pci_probe(struct pci_dev 
*pdev,

  return ret;
 }
+static void amdgpu_cancel_all_tdr(struct amdgpu_device *adev)
+{
+    int i;
+
+    for (i = 0; i < AMDGPU_MAX_RINGS; ++i) {
+    struct amdgpu_ring *ring = adev->rings[i];
+
+    if (!ring || !ring->sched.thread)
+    continue;
+
+ cancel_delayed_work_sync(&ring->sched.work_tdr);
+    }
+}

I think this is a function that's supposed to be in drm/scheduler, not
here. Might also just be your cleanup code being ordered wrongly, or 
your

split in one of the earlier patches not done quite right.
-Daniel
This function iterates across all the schedulers per amdgpu device and 
accesses
amdgpu specific structures , drm/scheduler deals with single scheduler 
at most

so looks to me like this is the right place for this function

I guess we could keep track of all schedulers somewhere in a list in
struct drm_device and wrap this up. That was kinda the idea.

Minimally I think a tiny wrapper with docs for the
cancel_delayed_work_sync(&sched->work_tdr); which explains what you must
observe to make sure there's no race.

Will do



I'm not exactly sure there's no
guarantee here we won't get a new tdr work launched right afterwards at
least, so this looks a bit like a hack.

Note that for any TDR work happening post amdgpu_cancel_all_tdr
amdgpu_job_timedout->drm_dev_is_unplugged
will return true and so it will return early. To make it water proof tight
against race
i can switch from drm_dev_is_unplugged to drm_dev_enter/exit

Hm that's confusing. You do a work_cancel_sync, so that at least looks
like "tdr work must not run after this point"

If you only rely on drm_dev_enter/exit check with the tdr work, then
there's no need to cancel anything.


Agree, synchronize_srcu from drm_dev_unplug should play the role
of 'flushing' any earlier (in progress) tdr work which is
using drm_dev_enter/exit pair. Any later arising tdr will terminate early 
when

drm_dev_enter
returns false.

Nope, anything you put into the work itself cannot close this race.
It's the schedule_work that matters here. Or I'm missing something ...
I thought that the tdr work you're cancelling here is launched by
drm/scheduler code, not by the amd callback?



My bad, you are right, I am supposed to put drm_dev_enter.exit pair into 
drm_sched_job_timedout





Yes that is correct. Canceling the work item is not the right approach at 
all, nor is adding dev_enter/exit pair in the recovery handler.



Without adding the dev_enter/exit guarding pair in the recovery handler you 
are ending up with GPU reset starting while

the device is already unplugged, this leads to multiple errors and general mess.




What we need to do here is to stop the scheduler thread and then wait for 
any timeout handling to have finished.


Otherwise it can scheduler a new timeout just after we have canceled this one.

Regards,
Christian.



Schedulers are stopped from amdgpu_driver_unload_kms which indeed happens 
after drm_dev_unplug
so yes, there is still a chance for new work being scheduler and timeout 
armed after but, once i fix the code
to place drm_dev_enter/exit pair into drm_sched_job_timeout I don't see why 
that not a good solution ?


Yeah that should work as well, but then you also don't need to cancel the work 
item from the driver.



Indeed, as Daniel pointed out there is no need and I dropped it. One
correction - I previously said that w/o the dev_enter/exit guarding pair in
the scheduler's TO handler you will get a GPU reset starting while the device
is already gone - of course this does not fully prevent that, as the device
can be extracted at any moment just after we have already entered GPU
recovery. But it does save us processing a futile GPU recovery, which always
starts once you unplug the device if there are active jobs in progress at the
moment, and so I think it's still

Re: [PATCH v2 8/8] drm/amdgpu: Prevent any job recoveries after device is unplugged.

2020-11-19 Thread Christian König

Am 18.11.20 um 17:20 schrieb Andrey Grodzovsky:


On 11/18/20 7:01 AM, Christian König wrote:

Am 18.11.20 um 08:39 schrieb Daniel Vetter:

On Tue, Nov 17, 2020 at 9:07 PM Andrey Grodzovsky
 wrote:


On 11/17/20 2:49 PM, Daniel Vetter wrote:

On Tue, Nov 17, 2020 at 02:18:49PM -0500, Andrey Grodzovsky wrote:

On 11/17/20 1:52 PM, Daniel Vetter wrote:

On Tue, Nov 17, 2020 at 01:38:14PM -0500, Andrey Grodzovsky wrote:

On 6/22/20 5:53 AM, Daniel Vetter wrote:
On Sun, Jun 21, 2020 at 02:03:08AM -0400, Andrey Grodzovsky 
wrote:
No point to try recovery if device is gone, just messes up 
things.


Signed-off-by: Andrey Grodzovsky 
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c | 16 


 drivers/gpu/drm/amd/amdgpu/amdgpu_job.c | 8 
 2 files changed, 24 insertions(+)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c

index 6932d75..5d6d3d9 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
@@ -1129,12 +1129,28 @@ static int amdgpu_pci_probe(struct 
pci_dev *pdev,

  return ret;
 }
+static void amdgpu_cancel_all_tdr(struct amdgpu_device *adev)
+{
+    int i;
+
+    for (i = 0; i < AMDGPU_MAX_RINGS; ++i) {
+    struct amdgpu_ring *ring = adev->rings[i];
+
+    if (!ring || !ring->sched.thread)
+    continue;
+
+ cancel_delayed_work_sync(&ring->sched.work_tdr);
+    }
+}
I think this is a function that's supposed to be in 
drm/scheduler, not
here. Might also just be your cleanup code being ordered 
wrongly, or your

split in one of the earlier patches not done quite right.
-Daniel
This function iterates across all the schedulers per amdgpu 
device and accesses
amdgpu specific structures , drm/scheduler deals with single 
scheduler at most

so looks to me like this is the right place for this function
I guess we could keep track of all schedulers somewhere in a 
list in

struct drm_device and wrap this up. That was kinda the idea.

Minimally I think a tiny wrapper with docs for the
cancel_delayed_work_sync(&sched->work_tdr); which explains what 
you must

observe to make sure there's no race.

Will do



I'm not exactly sure there's no
guarantee here we won't get a new tdr work launched right 
afterwards at

least, so this looks a bit like a hack.

Note that for any TDR work happening post amdgpu_cancel_all_tdr
amdgpu_job_timedout->drm_dev_is_unplugged
will return true and so it will return early. To make it water 
proof tight

against race
i can switch from drm_dev_is_unplugged to drm_dev_enter/exit
Hm that's confusing. You do a work_cancel_sync, so that at least 
looks

like "tdr work must not run after this point"

If you only rely on drm_dev_enter/exit check with the tdr work, then
there's no need to cancel anything.


Agree, synchronize_srcu from drm_dev_unplug should play the role
of 'flushing' any earlier (in progress) tdr work which is
using drm_dev_enter/exit pair. Any later arising tdr will terminate 
early when

drm_dev_enter
returns false.

Nope, anything you put into the work itself cannot close this race.
It's the schedule_work that matters here. Or I'm missing something ...
I thought that the tdr work you're cancelling here is launched by
drm/scheduler code, not by the amd callback?



My bad, you are right, I am supposed to put drm_dev_enter.exit pair 
into drm_sched_job_timedout





Yes that is correct. Canceling the work item is not the right 
approach at all, nor is adding dev_enter/exit pair in the recovery 
handler.



Without adding the dev_enter/exit guarding pair in the recovery 
handler you are ending up with GPU reset starting while
the device is already unplugged, this leads to multiple errors and 
general mess.





What we need to do here is to stop the scheduler thread and then wait 
for any timeout handling to have finished.


Otherwise it can scheduler a new timeout just after we have canceled 
this one.


Regards,
Christian.



Schedulers are stopped from amdgpu_driver_unload_kms which indeed 
happens after drm_dev_unplug
so yes, there is still a chance for new work being scheduler and 
timeout armed after but, once i fix the code
to place drm_dev_enter/exit pair into drm_sched_job_timeout I don't 
see why that not a good solution ?


Yeah that should work as well, but then you also don't need to cancel 
the work item from the driver.


Any tdr work started after drm_dev_unplug finished will simply abort 
on entry to drm_sched_job_timedout
because drm_dev_enter will be false and the function will return 
without rearming the timeout timer and

so will have no impact.

The only issue i see here now is of possible use after free if some 
late tdr work will try to execute after
drm device already gone, for this we probably should add 
cancel_delayed_work_sync(sched.work_tdr)

to drm_sched_fini after sched->thread is stopped there.


Good point, that is indeed missing as far as I can see.

Christian.
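
A rough sketch of the drm_sched_fini change being suggested above, i.e.
cancelling any late tdr work after the scheduler thread is stopped.
Illustrative only, based on the drm_gpu_scheduler fields of this era; it is
not the patch that eventually landed.

#include <linux/kthread.h>
#include <drm/gpu_scheduler.h>

void drm_sched_fini(struct drm_gpu_scheduler *sched)
{
    if (sched->thread)
        kthread_stop(sched->thread);

    /* Assumed addition: with the thread stopped and no new jobs being
     * pushed, nothing can re-arm work_tdr any more, which is exactly the
     * precondition cancel_delayed_work_sync() needs to be meaningful.
     */
    cancel_delayed_work_sync(&sched->work_tdr);

    sched->ready = false;
}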

Re: [PATCH v2 8/8] drm/amdgpu: Prevent any job recoveries after device is unplugged.

2020-11-18 Thread Andrey Grodzovsky


On 11/18/20 7:01 AM, Christian König wrote:

Am 18.11.20 um 08:39 schrieb Daniel Vetter:

On Tue, Nov 17, 2020 at 9:07 PM Andrey Grodzovsky
 wrote:


On 11/17/20 2:49 PM, Daniel Vetter wrote:

On Tue, Nov 17, 2020 at 02:18:49PM -0500, Andrey Grodzovsky wrote:

On 11/17/20 1:52 PM, Daniel Vetter wrote:

On Tue, Nov 17, 2020 at 01:38:14PM -0500, Andrey Grodzovsky wrote:

On 6/22/20 5:53 AM, Daniel Vetter wrote:

On Sun, Jun 21, 2020 at 02:03:08AM -0400, Andrey Grodzovsky wrote:

No point to try recovery if device is gone, just messes up things.

Signed-off-by: Andrey Grodzovsky 
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c | 16 
 drivers/gpu/drm/amd/amdgpu/amdgpu_job.c |  8 
 2 files changed, 24 insertions(+)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c

index 6932d75..5d6d3d9 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
@@ -1129,12 +1129,28 @@ static int amdgpu_pci_probe(struct pci_dev *pdev,
  return ret;
 }
+static void amdgpu_cancel_all_tdr(struct amdgpu_device *adev)
+{
+    int i;
+
+    for (i = 0; i < AMDGPU_MAX_RINGS; ++i) {
+    struct amdgpu_ring *ring = adev->rings[i];
+
+    if (!ring || !ring->sched.thread)
+    continue;
+
+ cancel_delayed_work_sync(&ring->sched.work_tdr);
+    }
+}

I think this is a function that's supposed to be in drm/scheduler, not
here. Might also just be your cleanup code being ordered wrongly, or your
split in one of the earlier patches not done quite right.
-Daniel
This function iterates across all the schedulers  per amdgpu device and 
accesses
amdgpu specific structures , drm/scheduler deals with single scheduler 
at most

so looks to me like this is the right place for this function

I guess we could keep track of all schedulers somewhere in a list in
struct drm_device and wrap this up. That was kinda the idea.

Minimally I think a tiny wrapper with docs for the
cancel_delayed_work_sync(&sched->work_tdr); which explains what you must
observe to make sure there's no race.

Will do



I'm not exactly sure there's no
guarantee here we won't get a new tdr work launched right afterwards at
least, so this looks a bit like a hack.

Note that for any TDR work happening post amdgpu_cancel_all_tdr
amdgpu_job_timedout->drm_dev_is_unplugged
will return true and so it will return early. To make it water proof tight
against race
i can switch from drm_dev_is_unplugged to drm_dev_enter/exit

Hm that's confusing. You do a work_cancel_sync, so that at least looks
like "tdr work must not run after this point"

If you only rely on drm_dev_enter/exit check with the tdr work, then
there's no need to cancel anything.


Agree, synchronize_srcu from drm_dev_unplug should play the role
of 'flushing' any earlier (in progress) tdr work which is
using drm_dev_enter/exit pair. Any later arising tdr will terminate early when
drm_dev_enter
returns false.

Nope, anything you put into the work itself cannot close this race.
It's the schedule_work that matters here. Or I'm missing something ...
I thought that the tdr work you're cancelling here is launched by
drm/scheduler code, not by the amd callback?



My bad, you are right, I am supposed to put drm_dev_enter.exit pair into 
drm_sched_job_timedout





Yes that is correct. Canceling the work item is not the right approach at all, 
nor is adding dev_enter/exit pair in the recovery handler.



Without adding the dev_enter/exit guarding pair in the recovery handler you are 
ending up with GPU reset starting while

the device is already unplugged, this leads to multiple errors and general mess.




What we need to do here is to stop the scheduler thread and then wait for any 
timeout handling to have finished.


Otherwise it can scheduler a new timeout just after we have canceled this one.

Regards,
Christian.



Schedulers are stopped from amdgpu_driver_unload_kms, which indeed happens
after drm_dev_unplug, so yes, there is still a chance of new work being
scheduled and a timeout armed afterwards. But once I fix the code to place a
drm_dev_enter/exit pair into drm_sched_job_timedout, I don't see why that's
not a good solution?
Any tdr work started after drm_dev_unplug has finished will simply abort on
entry to drm_sched_job_timedout, because drm_dev_enter will return false and
the function will return without re-arming the timeout timer, and so it will
have no impact.

The only issue I see here now is a possible use-after-free if some late tdr
work tries to execute after the drm device is already gone; for this we
probably should add cancel_delayed_work_sync(&sched->work_tdr) to
drm_sched_fini, after sched->thread is stopped there.

Andrey





-Daniel


Will update.

Andrey



For race free cancel_work_sync you need:
1. make sure whatever is calling schedule_work is guaranteed to no longer
call schedule_work.
2. call cancel_work_sync

Anything else is cargo-culted work cleanup:

- 1. without 2. means if a work got scheduled right before it'll still be
a problem.
- 2. without 1. means a schedule_work right after makes you calling
cancel_work_sync pointless.

So either both or nothing.
-Daniel
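
To make the two-step rule above concrete, a small self-contained sketch; the
my_dev structure, the 'stopping' flag and all function names are made up for
illustration and are not code from the patch set.

#include <linux/workqueue.h>
#include <linux/spinlock.h>
#include <linux/printk.h>

struct my_dev {
    struct delayed_work tdr_work;
    spinlock_t lock;
    bool stopping;
};

static void my_dev_tdr_handler(struct work_struct *work)
{
    struct my_dev *d = container_of(work, struct my_dev, tdr_work.work);

    pr_info("tdr work running for %p\n", d); /* stand-in for real recovery */
}

static void my_dev_init(struct my_dev *d)
{
    spin_lock_init(&d->lock);
    INIT_DELAYED_WORK(&d->tdr_work, my_dev_tdr_handler);
    d->stopping = false;
}

/* Every place that arms the work checks the flag under the lock. */
static void my_dev_arm_tdr(struct my_dev *d, unsigned long delay)
{
    spin_lock(&d->lock);
    if (!d->stopping)
        schedule_delayed_work(&d->tdr_work, delay);
    spin_unlock(&d->lock);
}

static void my_dev_teardown(struct my_dev *d)
{
    spin_lock(&d->lock);
    d->stopping = true;  /* 1. nothing may schedule the work after this */
    spin_unlock(&d->lock);

    cancel_delayed_work_sync(&d->tdr_work); /* 2. now the cancel is race free */
}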

Re: [PATCH v2 8/8] drm/amdgpu: Prevent any job recoveries after device is unplugged.

2020-11-18 Thread Luben Tuikov
On 2020-11-18 07:01, Christian König wrote:
> Am 18.11.20 um 08:39 schrieb Daniel Vetter:
>> On Tue, Nov 17, 2020 at 9:07 PM Andrey Grodzovsky
>>  wrote:
>>>
>>> On 11/17/20 2:49 PM, Daniel Vetter wrote:
 On Tue, Nov 17, 2020 at 02:18:49PM -0500, Andrey Grodzovsky wrote:
> On 11/17/20 1:52 PM, Daniel Vetter wrote:
>> On Tue, Nov 17, 2020 at 01:38:14PM -0500, Andrey Grodzovsky wrote:
>>> On 6/22/20 5:53 AM, Daniel Vetter wrote:
 On Sun, Jun 21, 2020 at 02:03:08AM -0400, Andrey Grodzovsky wrote:
> No point to try recovery if device is gone, just messes up things.
>
> Signed-off-by: Andrey Grodzovsky 
> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c | 16 
>  drivers/gpu/drm/amd/amdgpu/amdgpu_job.c |  8 
>  2 files changed, 24 insertions(+)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c 
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
> index 6932d75..5d6d3d9 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
> @@ -1129,12 +1129,28 @@ static int amdgpu_pci_probe(struct pci_dev 
> *pdev,
>   return ret;
>  }
> +static void amdgpu_cancel_all_tdr(struct amdgpu_device *adev)
> +{
> +int i;
> +
> +for (i = 0; i < AMDGPU_MAX_RINGS; ++i) {
> +struct amdgpu_ring *ring = adev->rings[i];
> +
> +if (!ring || !ring->sched.thread)
> +continue;
> +
> +cancel_delayed_work_sync(&ring->sched.work_tdr);
> +}
> +}
 I think this is a function that's supposed to be in drm/scheduler, not
 here. Might also just be your cleanup code being ordered wrongly, or 
 your
 split in one of the earlier patches not done quite right.
 -Daniel
>>> This function iterates across all the schedulers  per amdgpu device and 
>>> accesses
>>> amdgpu specific structures , drm/scheduler deals with single scheduler 
>>> at most
>>> so looks to me like this is the right place for this function
>> I guess we could keep track of all schedulers somewhere in a list in
>> struct drm_device and wrap this up. That was kinda the idea.
>>
>> Minimally I think a tiny wrapper with docs for the
>> cancel_delayed_work_sync(&sched->work_tdr); which explains what you must
>> observe to make sure there's no race.
> Will do
>
>
>> I'm not exactly sure there's no
>> guarantee here we won't get a new tdr work launched right afterwards at
>> least, so this looks a bit like a hack.
> Note that for any TDR work happening post amdgpu_cancel_all_tdr
> amdgpu_job_timedout->drm_dev_is_unplugged
> will return true and so it will return early. To make it water proof tight
> against race
> i can switch from drm_dev_is_unplugged to drm_dev_enter/exit
 Hm that's confusing. You do a work_cancel_sync, so that at least looks
 like "tdr work must not run after this point"

 If you only rely on drm_dev_enter/exit check with the tdr work, then
 there's no need to cancel anything.
>>>
>>> Agree, synchronize_srcu from drm_dev_unplug should play the role
>>> of 'flushing' any earlier (in progress) tdr work which is
>>> using drm_dev_enter/exit pair. Any later arising tdr will terminate early 
>>> when
>>> drm_dev_enter
>>> returns false.
>> Nope, anything you put into the work itself cannot close this race.
>> It's the schedule_work that matters here. Or I'm missing something ...
>> I thought that the tdr work you're cancelling here is launched by
>> drm/scheduler code, not by the amd callback?
> 
> Yes that is correct. Canceling the work item is not the right approach 
> at all, nor is adding dev_enter/exit pair in the recovery handler.
> 
> What we need to do here is to stop the scheduler thread and then wait 
> for any timeout handling to have finished.
> 
> Otherwise it can scheduler a new timeout just after we have canceled 
> this one.

Yep, that's exactly what I said in my email above.

Regards,
Luben

> 
> Regards,
> Christian.
> 
>> -Daniel
>>
>>> Will update.
>>>
>>> Andrey
>>>
>>>
 For race free cancel_work_sync you need:
 1. make sure whatever is calling schedule_work is guaranteed to no longer
 call schedule_work.
 2. call cancel_work_sync

 Anything else is cargo-culted work cleanup:

 - 1. without 2. means if a work got scheduled right before it'll still be
 a problem.
 - 2. without 1. means a schedule_work right after makes you calling
 cancel_work_sync pointless.

 So either both or nothing.
 -Daniel

> Andrey
>
>
>> -Daniel
>>
>>> Andrey
>>>
>>>
> +
>  stati

Re: [PATCH v2 8/8] drm/amdgpu: Prevent any job recoveries after device is unplugged.

2020-11-18 Thread Christian König

Am 18.11.20 um 08:39 schrieb Daniel Vetter:

On Tue, Nov 17, 2020 at 9:07 PM Andrey Grodzovsky
 wrote:


On 11/17/20 2:49 PM, Daniel Vetter wrote:

On Tue, Nov 17, 2020 at 02:18:49PM -0500, Andrey Grodzovsky wrote:

On 11/17/20 1:52 PM, Daniel Vetter wrote:

On Tue, Nov 17, 2020 at 01:38:14PM -0500, Andrey Grodzovsky wrote:

On 6/22/20 5:53 AM, Daniel Vetter wrote:

On Sun, Jun 21, 2020 at 02:03:08AM -0400, Andrey Grodzovsky wrote:

No point to try recovery if device is gone, just messes up things.

Signed-off-by: Andrey Grodzovsky 
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c | 16 
 drivers/gpu/drm/amd/amdgpu/amdgpu_job.c |  8 
 2 files changed, 24 insertions(+)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
index 6932d75..5d6d3d9 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
@@ -1129,12 +1129,28 @@ static int amdgpu_pci_probe(struct pci_dev *pdev,
  return ret;
 }
+static void amdgpu_cancel_all_tdr(struct amdgpu_device *adev)
+{
+int i;
+
+for (i = 0; i < AMDGPU_MAX_RINGS; ++i) {
+struct amdgpu_ring *ring = adev->rings[i];
+
+if (!ring || !ring->sched.thread)
+continue;
+
+cancel_delayed_work_sync(&ring->sched.work_tdr);
+}
+}

I think this is a function that's supposed to be in drm/scheduler, not
here. Might also just be your cleanup code being ordered wrongly, or your
split in one of the earlier patches not done quite right.
-Daniel

This function iterates across all the schedulers  per amdgpu device and accesses
amdgpu specific structures , drm/scheduler deals with single scheduler at most
so looks to me like this is the right place for this function

I guess we could keep track of all schedulers somewhere in a list in
struct drm_device and wrap this up. That was kinda the idea.

Minimally I think a tiny wrapper with docs for the
cancel_delayed_work_sync(&sched->work_tdr); which explains what you must
observe to make sure there's no race.

Will do



I'm not exactly sure there's no
guarantee here we won't get a new tdr work launched right afterwards at
least, so this looks a bit like a hack.

Note that for any TDR work happening post amdgpu_cancel_all_tdr
amdgpu_job_timedout->drm_dev_is_unplugged
will return true and so it will return early. To make it water proof tight
against race
i can switch from drm_dev_is_unplugged to drm_dev_enter/exit

Hm that's confusing. You do a work_cancel_sync, so that at least looks
like "tdr work must not run after this point"

If you only rely on drm_dev_enter/exit check with the tdr work, then
there's no need to cancel anything.


Agree, synchronize_srcu from drm_dev_unplug should play the role
of 'flushing' any earlier (in progress) tdr work which is
using drm_dev_enter/exit pair. Any later arising tdr will terminate early when
drm_dev_enter
returns false.

Nope, anything you put into the work itself cannot close this race.
It's the schedule_work that matters here. Or I'm missing something ...
I thought that the tdr work you're cancelling here is launched by
drm/scheduler code, not by the amd callback?


Yes that is correct. Canceling the work item is not the right approach 
at all, nor is adding dev_enter/exit pair in the recovery handler.


What we need to do here is to stop the scheduler thread and then wait 
for any timeout handling to have finished.


Otherwise it can schedule a new timeout just after we have canceled
this one.


Regards,
Christian.
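
As an illustration of that ordering, a sketch using the amdgpu ring fields
visible in the patch. The function name and the use of kthread_park() are
assumptions about how "stop the scheduler thread first" could be expressed;
this is not the fix that was actually merged.

#include <linux/kthread.h>

static void amdgpu_stop_and_drain_tdr(struct amdgpu_device *adev)
{
    int i;

    for (i = 0; i < AMDGPU_MAX_RINGS; ++i) {
        struct amdgpu_ring *ring = adev->rings[i];

        if (!ring || !ring->sched.thread)
            continue;

        /* 1. stop the scheduler thread, the thing that (re)arms the
         *    timeout, so no new tdr work can be scheduled
         */
        kthread_park(ring->sched.thread);

        /* 2. now wait for any timeout handling that is already running
         *    and make sure the work is no longer pending
         */
        cancel_delayed_work_sync(&ring->sched.work_tdr);
    }
}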


-Daniel


Will update.

Andrey



For race free cancel_work_sync you need:
1. make sure whatever is calling schedule_work is guaranteed to no longer
call schedule_work.
2. call cancel_work_sync

Anything else is cargo-culted work cleanup:

- 1. without 2. means if a work got scheduled right before it'll still be
a problem.
- 2. without 1. means a schedule_work right after makes you calling
cancel_work_sync pointless.

So either both or nothing.
-Daniel


Andrey



-Daniel


Andrey



+
 static void
 amdgpu_pci_remove(struct pci_dev *pdev)
 {
  struct drm_device *dev = pci_get_drvdata(pdev);
+struct amdgpu_device *adev = dev->dev_private;
  drm_dev_unplug(dev);
+amdgpu_cancel_all_tdr(adev);
  ttm_bo_unmap_virtual_address_space(&adev->mman.bdev);
  amdgpu_driver_unload_kms(dev);
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
index 4720718..87ff0c0 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
@@ -28,6 +28,8 @@
 #include "amdgpu.h"
 #include "amdgpu_trace.h"
+#include 
+
 static void amdgpu_job_timedout(struct drm_sched_job *s_job)
 {
  struct amdgpu_ring *ring = to_amdgpu_ring(s_job->sched);
@@ -37,6 +39,12 @@ static void amdgpu_job_timedout(struct drm_sched_job *s_job)
  m

Re: [PATCH v2 8/8] drm/amdgpu: Prevent any job recoveries after device is unplugged.

2020-11-17 Thread Daniel Vetter
On Tue, Nov 17, 2020 at 9:07 PM Andrey Grodzovsky
 wrote:
>
>
> On 11/17/20 2:49 PM, Daniel Vetter wrote:
> > On Tue, Nov 17, 2020 at 02:18:49PM -0500, Andrey Grodzovsky wrote:
> >> On 11/17/20 1:52 PM, Daniel Vetter wrote:
> >>> On Tue, Nov 17, 2020 at 01:38:14PM -0500, Andrey Grodzovsky wrote:
>  On 6/22/20 5:53 AM, Daniel Vetter wrote:
> > On Sun, Jun 21, 2020 at 02:03:08AM -0400, Andrey Grodzovsky wrote:
> >> No point to try recovery if device is gone, just messes up things.
> >>
> >> Signed-off-by: Andrey Grodzovsky 
> >> ---
> >> drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c | 16 
> >> drivers/gpu/drm/amd/amdgpu/amdgpu_job.c |  8 
> >> 2 files changed, 24 insertions(+)
> >>
> >> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c 
> >> b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
> >> index 6932d75..5d6d3d9 100644
> >> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
> >> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
> >> @@ -1129,12 +1129,28 @@ static int amdgpu_pci_probe(struct pci_dev 
> >> *pdev,
> >>  return ret;
> >> }
> >> +static void amdgpu_cancel_all_tdr(struct amdgpu_device *adev)
> >> +{
> >> +int i;
> >> +
> >> +for (i = 0; i < AMDGPU_MAX_RINGS; ++i) {
> >> +struct amdgpu_ring *ring = adev->rings[i];
> >> +
> >> +if (!ring || !ring->sched.thread)
> >> +continue;
> >> +
> >> +cancel_delayed_work_sync(&ring->sched.work_tdr);
> >> +}
> >> +}
> > I think this is a function that's supposed to be in drm/scheduler, not
> > here. Might also just be your cleanup code being ordered wrongly, or 
> > your
> > split in one of the earlier patches not done quite right.
> > -Daniel
>  This function iterates across all the schedulers  per amdgpu device and 
>  accesses
>  amdgpu specific structures , drm/scheduler deals with single scheduler 
>  at most
>  so looks to me like this is the right place for this function
> >>> I guess we could keep track of all schedulers somewhere in a list in
> >>> struct drm_device and wrap this up. That was kinda the idea.
> >>>
> >>> Minimally I think a tiny wrapper with docs for the
> >>> cancel_delayed_work_sync(&sched->work_tdr); which explains what you must
> >>> observe to make sure there's no race.
> >>
> >> Will do
> >>
> >>
> >>> I'm not exactly sure there's no
> >>> guarantee here we won't get a new tdr work launched right afterwards at
> >>> least, so this looks a bit like a hack.
> >>
> >> Note that for any TDR work happening post amdgpu_cancel_all_tdr
> >> amdgpu_job_timedout->drm_dev_is_unplugged
> >> will return true and so it will return early. To make it water proof tight
> >> against race
> >> i can switch from drm_dev_is_unplugged to drm_dev_enter/exit
> > Hm that's confusing. You do a work_cancel_sync, so that at least looks
> > like "tdr work must not run after this point"
> >
> > If you only rely on drm_dev_enter/exit check with the tdr work, then
> > there's no need to cancel anything.
>
>
> Agree, synchronize_srcu from drm_dev_unplug should play the role
> of 'flushing' any earlier (in progress) tdr work which is
> using drm_dev_enter/exit pair. Any later arising tdr will terminate early when
> drm_dev_enter
> returns false.

Nope, anything you put into the work itself cannot close this race.
It's the schedule_work that matters here. Or I'm missing something ...
I thought that the tdr work you're cancelling here is launched by
drm/scheduler code, not by the amd callback?
-Daniel

>
> Will update.
>
> Andrey
>
>
> >
> > For race free cancel_work_sync you need:
> > 1. make sure whatever is calling schedule_work is guaranteed to no longer
> > call schedule_work.
> > 2. call cancel_work_sync
> >
> > Anything else is cargo-culted work cleanup:
> >
> > - 1. without 2. means if a work got scheduled right before it'll still be
> >a problem.
> > - 2. without 1. means a schedule_work right after makes you calling
> >cancel_work_sync pointless.
> >
> > So either both or nothing.
> > -Daniel
> >
> >> Andrey
> >>
> >>
> >>> -Daniel
> >>>
>  Andrey
> 
> 
> >> +
> >> static void
> >> amdgpu_pci_remove(struct pci_dev *pdev)
> >> {
> >>  struct drm_device *dev = pci_get_drvdata(pdev);
> >> +struct amdgpu_device *adev = dev->dev_private;
> >>  drm_dev_unplug(dev);
> >> +amdgpu_cancel_all_tdr(adev);
> >>  ttm_bo_unmap_virtual_address_space(&adev->mman.bdev);
> >>  amdgpu_driver_unload_kms(dev);
> >> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c 
> >> b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
> >> index 4720718..87ff0c0 100644
> >> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
> >> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c

Re: [PATCH v2 8/8] drm/amdgpu: Prevent any job recoveries after device is unplugged.

2020-11-17 Thread Luben Tuikov
On 2020-11-17 2:49 p.m., Daniel Vetter wrote:
> On Tue, Nov 17, 2020 at 02:18:49PM -0500, Andrey Grodzovsky wrote:
>>
>> On 11/17/20 1:52 PM, Daniel Vetter wrote:
>>> On Tue, Nov 17, 2020 at 01:38:14PM -0500, Andrey Grodzovsky wrote:
 On 6/22/20 5:53 AM, Daniel Vetter wrote:
> On Sun, Jun 21, 2020 at 02:03:08AM -0400, Andrey Grodzovsky wrote:
>> No point to try recovery if device is gone, just messes up things.
>>
>> Signed-off-by: Andrey Grodzovsky 
>> ---
>>drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c | 16 
>>drivers/gpu/drm/amd/amdgpu/amdgpu_job.c |  8 
>>2 files changed, 24 insertions(+)
>>
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c 
>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
>> index 6932d75..5d6d3d9 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
>> @@ -1129,12 +1129,28 @@ static int amdgpu_pci_probe(struct pci_dev *pdev,
>>  return ret;
>>}
>> +static void amdgpu_cancel_all_tdr(struct amdgpu_device *adev)
>> +{
>> +int i;
>> +
>> +for (i = 0; i < AMDGPU_MAX_RINGS; ++i) {
>> +struct amdgpu_ring *ring = adev->rings[i];
>> +
>> +if (!ring || !ring->sched.thread)
>> +continue;
>> +
>> +cancel_delayed_work_sync(&ring->sched.work_tdr);
>> +}
>> +}
> I think this is a function that's supposed to be in drm/scheduler, not
> here. Might also just be your cleanup code being ordered wrongly, or your
> split in one of the earlier patches not done quite right.
> -Daniel

 This function iterates across all the schedulers  per amdgpu device and 
 accesses
 amdgpu specific structures , drm/scheduler deals with single scheduler at 
 most
 so looks to me like this is the right place for this function
>>> I guess we could keep track of all schedulers somewhere in a list in
>>> struct drm_device and wrap this up. That was kinda the idea.
>>>
>>> Minimally I think a tiny wrapper with docs for the
>>> cancel_delayed_work_sync(&sched->work_tdr); which explains what you must
>>> observe to make sure there's no race.
>>
>>
>> Will do
>>
>>
>>> I'm not exactly sure there's no
>>> guarantee here we won't get a new tdr work launched right afterwards at
>>> least, so this looks a bit like a hack.
>>
>>
>> Note that for any TDR work happening post amdgpu_cancel_all_tdr
>> amdgpu_job_timedout->drm_dev_is_unplugged
>> will return true and so it will return early. To make it water proof tight
>> against race
>> i can switch from drm_dev_is_unplugged to drm_dev_enter/exit
> 
> Hm that's confusing. You do a work_cancel_sync, so that at least looks
> like "tdr work must not run after this point"
> 
> If you only rely on drm_dev_enter/exit check with the tdr work, then
> there's no need to cancel anything.
> 
> For race free cancel_work_sync you need:
> 1. make sure whatever is calling schedule_work is guaranteed to no longer
> call schedule_work.
> 2. call cancel_work_sync
> 
> Anything else is cargo-culted work cleanup:
> 
> - 1. without 2. means if a work got scheduled right before it'll still be
>   a problem.
> - 2. without 1. means a schedule_work right after makes you calling
>   cancel_work_sync pointless.

This is sound advice and I did something similar for SAS over a decade
ago where an expander could be disconnected from the domain via which
many IOs are flying to end devices.

You need a small tiny DRM function which low-level drivers (such as amdgpu)
call in order to tell DRM that this device is not accepting commands
any more (sets a flag) and starts a thread to clean up commands
which are "done" or "incoming". At the same time, the low-level driver
returns commands which are pending in the hardware back out to
DRM (thus those commands become "done" from "pending"), and
DRM cleans them up.(*)

The point is that you're not bubbling up the error, but
directly notifying the highest level of upper layer to hold off,
while you're cleaning up all incoming and pending commands.

Depending on the situation, case 1 above has two sub-cases:

a) the device will not come back--then cancel any new work
   back out to the application client, or
b) the device may come back again, i.e. it is being reset,
   then you can queue up work, assuming the device will
   come back on successfully and you'd be able to send
   the incoming requests down to it. Or cancel everything
   and let the application client do the queueing and
   resubmission, like in a). The latter will not work when this
   resubmission (and error recovery) is done without
   the knowledge of the application client, for instance
   communication or parity errors, protocol retries, etc.

(*) I've some work coming in, in the scheduler, which could make
this handling easier, or at least set a mechanism by wh

Re: [PATCH v2 8/8] drm/amdgpu: Prevent any job recoveries after device is unplugged.

2020-11-17 Thread Andrey Grodzovsky


On 11/17/20 2:49 PM, Daniel Vetter wrote:

On Tue, Nov 17, 2020 at 02:18:49PM -0500, Andrey Grodzovsky wrote:

On 11/17/20 1:52 PM, Daniel Vetter wrote:

On Tue, Nov 17, 2020 at 01:38:14PM -0500, Andrey Grodzovsky wrote:

On 6/22/20 5:53 AM, Daniel Vetter wrote:

On Sun, Jun 21, 2020 at 02:03:08AM -0400, Andrey Grodzovsky wrote:

No point to try recovery if device is gone, just messes up things.

Signed-off-by: Andrey Grodzovsky 
---
drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c | 16 
drivers/gpu/drm/amd/amdgpu/amdgpu_job.c |  8 
2 files changed, 24 insertions(+)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
index 6932d75..5d6d3d9 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
@@ -1129,12 +1129,28 @@ static int amdgpu_pci_probe(struct pci_dev *pdev,
return ret;
}
+static void amdgpu_cancel_all_tdr(struct amdgpu_device *adev)
+{
+   int i;
+
+   for (i = 0; i < AMDGPU_MAX_RINGS; ++i) {
+   struct amdgpu_ring *ring = adev->rings[i];
+
+   if (!ring || !ring->sched.thread)
+   continue;
+
+   cancel_delayed_work_sync(&ring->sched.work_tdr);
+   }
+}

I think this is a function that's supposed to be in drm/scheduler, not
here. Might also just be your cleanup code being ordered wrongly, or your
split in one of the earlier patches not done quite right.
-Daniel

This function iterates across all the schedulers  per amdgpu device and accesses
amdgpu specific structures , drm/scheduler deals with single scheduler at most
so looks to me like this is the right place for this function

I guess we could keep track of all schedulers somewhere in a list in
struct drm_device and wrap this up. That was kinda the idea.

Minimally I think a tiny wrapper with docs for the
cancel_delayed_work_sync(&sched->work_tdr); which explains what you must
observe to make sure there's no race.


Will do



I'm not exactly sure there's no
guarantee here we won't get a new tdr work launched right afterwards at
least, so this looks a bit like a hack.


Note that for any TDR work happening post amdgpu_cancel_all_tdr
amdgpu_job_timedout->drm_dev_is_unplugged
will return true and so it will return early. To make it water proof tight
against race
i can switch from drm_dev_is_unplugged to drm_dev_enter/exit

Hm that's confusing. You do a work_cancel_sync, so that at least looks
like "tdr work must not run after this point"

If you only rely on drm_dev_enter/exit check with the tdr work, then
there's no need to cancel anything.



Agree, synchronize_srcu from drm_dev_unplug should play the role of
'flushing' any earlier (in-progress) tdr work which is using the
drm_dev_enter/exit pair. Any later arising tdr will terminate early when
drm_dev_enter returns false.

Will update.

Andrey
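
For context, the mechanism being relied on here, paraphrased as a simplified
sketch of the DRM core's drm_dev_enter()/drm_dev_exit()/drm_dev_unplug()
pattern. Identifiers are renamed for the sketch; see drivers/gpu/drm/drm_drv.c
for the real implementation.

#include <linux/srcu.h>

struct demo_device {
    bool unplugged;
};

static DEFINE_STATIC_SRCU(demo_unplug_srcu);

static bool demo_dev_enter(struct demo_device *dev, int *idx)
{
    *idx = srcu_read_lock(&demo_unplug_srcu);

    if (dev->unplugged) {
        srcu_read_unlock(&demo_unplug_srcu, *idx);
        return false;  /* caller bails out, e.g. the tdr handler */
    }
    return true;
}

static void demo_dev_exit(int idx)
{
    srcu_read_unlock(&demo_unplug_srcu, idx);
}

static void demo_dev_unplug(struct demo_device *dev)
{
    dev->unplugged = true;
    /* Waits for every reader that entered before the flag was set, i.e.
     * any in-progress work inside an enter/exit pair has finished; work
     * that enters afterwards sees unplugged == true and returns early.
     */
    synchronize_srcu(&demo_unplug_srcu);
}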




For race free cancel_work_sync you need:
1. make sure whatever is calling schedule_work is guaranteed to no longer
call schedule_work.
2. call cancel_work_sync

Anything else is cargo-culted work cleanup:

- 1. without 2. means if a work got scheduled right before it'll still be
   a problem.
- 2. without 1. means a schedule_work right after makes you calling
   cancel_work_sync pointless.

So either both or nothing.
-Daniel


Andrey



-Daniel


Andrey



+
static void
amdgpu_pci_remove(struct pci_dev *pdev)
{
struct drm_device *dev = pci_get_drvdata(pdev);
+   struct amdgpu_device *adev = dev->dev_private;
drm_dev_unplug(dev);
+   amdgpu_cancel_all_tdr(adev);
ttm_bo_unmap_virtual_address_space(&adev->mman.bdev);
amdgpu_driver_unload_kms(dev);
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
index 4720718..87ff0c0 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
@@ -28,6 +28,8 @@
#include "amdgpu.h"
#include "amdgpu_trace.h"
+#include 
+
static void amdgpu_job_timedout(struct drm_sched_job *s_job)
{
struct amdgpu_ring *ring = to_amdgpu_ring(s_job->sched);
@@ -37,6 +39,12 @@ static void amdgpu_job_timedout(struct drm_sched_job *s_job)
memset(&ti, 0, sizeof(struct amdgpu_task_info));
+   if (drm_dev_is_unplugged(adev->ddev)) {
+   DRM_INFO("ring %s timeout, but device unplugged, skipping.\n",
+ s_job->sched->name);
+   return;
+   }
+
if (amdgpu_ring_soft_recovery(ring, job->vmid, s_job->s_fence->parent)) 
{
DRM_ERROR("ring %s timeout, but soft recovered\n",
  s_job->sched->name);
--
2.7.4




Re: [PATCH v2 8/8] drm/amdgpu: Prevent any job recoveries after device is unplugged.

2020-11-17 Thread Daniel Vetter
On Tue, Nov 17, 2020 at 02:18:49PM -0500, Andrey Grodzovsky wrote:
> 
> On 11/17/20 1:52 PM, Daniel Vetter wrote:
> > On Tue, Nov 17, 2020 at 01:38:14PM -0500, Andrey Grodzovsky wrote:
> > > On 6/22/20 5:53 AM, Daniel Vetter wrote:
> > > > On Sun, Jun 21, 2020 at 02:03:08AM -0400, Andrey Grodzovsky wrote:
> > > > > No point to try recovery if device is gone, just messes up things.
> > > > > 
> > > > > Signed-off-by: Andrey Grodzovsky 
> > > > > ---
> > > > >drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c | 16 
> > > > >drivers/gpu/drm/amd/amdgpu/amdgpu_job.c |  8 
> > > > >2 files changed, 24 insertions(+)
> > > > > 
> > > > > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c 
> > > > > b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
> > > > > index 6932d75..5d6d3d9 100644
> > > > > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
> > > > > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
> > > > > @@ -1129,12 +1129,28 @@ static int amdgpu_pci_probe(struct pci_dev 
> > > > > *pdev,
> > > > >   return ret;
> > > > >}
> > > > > +static void amdgpu_cancel_all_tdr(struct amdgpu_device *adev)
> > > > > +{
> > > > > + int i;
> > > > > +
> > > > > + for (i = 0; i < AMDGPU_MAX_RINGS; ++i) {
> > > > > + struct amdgpu_ring *ring = adev->rings[i];
> > > > > +
> > > > > + if (!ring || !ring->sched.thread)
> > > > > + continue;
> > > > > +
> > > > > + cancel_delayed_work_sync(&ring->sched.work_tdr);
> > > > > + }
> > > > > +}
> > > > I think this is a function that's supposed to be in drm/scheduler, not
> > > > here. Might also just be your cleanup code being ordered wrongly, or 
> > > > your
> > > > split in one of the earlier patches not done quite right.
> > > > -Daniel
> > > 
> > > This function iterates across all the schedulers  per amdgpu device and 
> > > accesses
> > > amdgpu specific structures , drm/scheduler deals with single scheduler at 
> > > most
> > > so looks to me like this is the right place for this function
> > I guess we could keep track of all schedulers somewhere in a list in
> > struct drm_device and wrap this up. That was kinda the idea.
> > 
> > Minimally I think a tiny wrapper with docs for the
> > cancel_delayed_work_sync(&sched->work_tdr); which explains what you must
> > observe to make sure there's no race.
> 
> 
> Will do
> 
> 
> > I'm not exactly sure there's no
> > guarantee here we won't get a new tdr work launched right afterwards at
> > least, so this looks a bit like a hack.
> 
> 
> Note that for any TDR work happening post amdgpu_cancel_all_tdr
> amdgpu_job_timedout->drm_dev_is_unplugged
> will return true and so it will return early. To make it water proof tight
> against race
> i can switch from drm_dev_is_unplugged to drm_dev_enter/exit

Hm that's confusing. You do a work_cancel_sync, so that at least looks
like "tdr work must not run after this point"

If you only rely on drm_dev_enter/exit check with the tdr work, then
there's no need to cancel anything.

For race free cancel_work_sync you need:
1. make sure whatever is calling schedule_work is guaranteed to no longer
call schedule_work.
2. call cancel_work_sync

Anything else is cargo-culted work cleanup:

- 1. without 2. means if a work got scheduled right before it'll still be
  a problem.
- 2. without 1. means a schedule_work right after makes you calling
  cancel_work_sync pointless.

So either both or nothing.
-Daniel

> 
> Andrey
> 
> 
> > -Daniel
> > 
> > > Andrey
> > > 
> > > 
> > > > > +
> > > > >static void
> > > > >amdgpu_pci_remove(struct pci_dev *pdev)
> > > > >{
> > > > >   struct drm_device *dev = pci_get_drvdata(pdev);
> > > > > + struct amdgpu_device *adev = dev->dev_private;
> > > > >   drm_dev_unplug(dev);
> > > > > + amdgpu_cancel_all_tdr(adev);
> > > > >   ttm_bo_unmap_virtual_address_space(&adev->mman.bdev);
> > > > >   amdgpu_driver_unload_kms(dev);
> > > > > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c 
> > > > > b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
> > > > > index 4720718..87ff0c0 100644
> > > > > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
> > > > > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
> > > > > @@ -28,6 +28,8 @@
> > > > >#include "amdgpu.h"
> > > > >#include "amdgpu_trace.h"
> > > > > +#include 
> > > > > +
> > > > >static void amdgpu_job_timedout(struct drm_sched_job *s_job)
> > > > >{
> > > > >   struct amdgpu_ring *ring = to_amdgpu_ring(s_job->sched);
> > > > > @@ -37,6 +39,12 @@ static void amdgpu_job_timedout(struct 
> > > > > drm_sched_job *s_job)
> > > > >   memset(&ti, 0, sizeof(struct amdgpu_task_info));
> > > > > + if (drm_dev_is_unplugged(adev->ddev)) {
> > > > > + DRM_INFO("ring %s timeout, but device unplugged, 
> > > > > skipping.\n",
> > > > > +   s_job->sched->name);
> > > > > + return;
> > > > > + }
> > > > > +
> > > > >   if (amdgpu

Re: [PATCH v2 8/8] drm/amdgpu: Prevent any job recoveries after device is unplugged.

2020-11-17 Thread Andrey Grodzovsky


On 11/17/20 1:52 PM, Daniel Vetter wrote:

On Tue, Nov 17, 2020 at 01:38:14PM -0500, Andrey Grodzovsky wrote:

On 6/22/20 5:53 AM, Daniel Vetter wrote:

On Sun, Jun 21, 2020 at 02:03:08AM -0400, Andrey Grodzovsky wrote:

No point to try recovery if device is gone, just messes up things.

Signed-off-by: Andrey Grodzovsky 
---
   drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c | 16 
   drivers/gpu/drm/amd/amdgpu/amdgpu_job.c |  8 
   2 files changed, 24 insertions(+)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
index 6932d75..5d6d3d9 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
@@ -1129,12 +1129,28 @@ static int amdgpu_pci_probe(struct pci_dev *pdev,
return ret;
   }
+static void amdgpu_cancel_all_tdr(struct amdgpu_device *adev)
+{
+   int i;
+
+   for (i = 0; i < AMDGPU_MAX_RINGS; ++i) {
+   struct amdgpu_ring *ring = adev->rings[i];
+
+   if (!ring || !ring->sched.thread)
+   continue;
+
+   cancel_delayed_work_sync(&ring->sched.work_tdr);
+   }
+}

I think this is a function that's supposed to be in drm/scheduler, not
here. Might also just be your cleanup code being ordered wrongly, or your
split in one of the earlier patches not done quite right.
-Daniel


This function iterates across all the schedulers  per amdgpu device and accesses
amdgpu specific structures , drm/scheduler deals with single scheduler at most
so looks to me like this is the right place for this function

I guess we could keep track of all schedulers somewhere in a list in
struct drm_device and wrap this up. That was kinda the idea.

Minimally I think a tiny wrapper with docs for the
cancel_delayed_work_sync(&sched->work_tdr); which explains what you must
observe to make sure there's no race.



Will do



I'm not exactly sure there's no
guarantee here we won't get a new tdr work launched right afterwards at
least, so this looks a bit like a hack.



Note that for any TDR work happening after amdgpu_cancel_all_tdr,
amdgpu_job_timedout->drm_dev_is_unplugged will return true and so it will
return early. To make it watertight against the race I can switch from
drm_dev_is_unplugged to drm_dev_enter/exit.

Andrey



-Daniel


Andrey



+
   static void
   amdgpu_pci_remove(struct pci_dev *pdev)
   {
struct drm_device *dev = pci_get_drvdata(pdev);
+   struct amdgpu_device *adev = dev->dev_private;
drm_dev_unplug(dev);
+   amdgpu_cancel_all_tdr(adev);
ttm_bo_unmap_virtual_address_space(&adev->mman.bdev);
amdgpu_driver_unload_kms(dev);
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
index 4720718..87ff0c0 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
@@ -28,6 +28,8 @@
   #include "amdgpu.h"
   #include "amdgpu_trace.h"
+#include 
+
   static void amdgpu_job_timedout(struct drm_sched_job *s_job)
   {
struct amdgpu_ring *ring = to_amdgpu_ring(s_job->sched);
@@ -37,6 +39,12 @@ static void amdgpu_job_timedout(struct drm_sched_job *s_job)
memset(&ti, 0, sizeof(struct amdgpu_task_info));
+   if (drm_dev_is_unplugged(adev->ddev)) {
+   DRM_INFO("ring %s timeout, but device unplugged, skipping.\n",
+ s_job->sched->name);
+   return;
+   }
+
if (amdgpu_ring_soft_recovery(ring, job->vmid, s_job->s_fence->parent)) 
{
DRM_ERROR("ring %s timeout, but soft recovered\n",
  s_job->sched->name);
--
2.7.4


___
dri-devel mailing list
dri-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/dri-devel


Re: [PATCH v2 8/8] drm/amdgpu: Prevent any job recoveries after device is unplugged.

2020-11-17 Thread Daniel Vetter
On Tue, Nov 17, 2020 at 01:38:14PM -0500, Andrey Grodzovsky wrote:
> 
> On 6/22/20 5:53 AM, Daniel Vetter wrote:
> > On Sun, Jun 21, 2020 at 02:03:08AM -0400, Andrey Grodzovsky wrote:
> > > No point to try recovery if device is gone, just messes up things.
> > > 
> > > Signed-off-by: Andrey Grodzovsky 
> > > ---
> > >   drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c | 16 
> > >   drivers/gpu/drm/amd/amdgpu/amdgpu_job.c |  8 
> > >   2 files changed, 24 insertions(+)
> > > 
> > > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c 
> > > b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
> > > index 6932d75..5d6d3d9 100644
> > > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
> > > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
> > > @@ -1129,12 +1129,28 @@ static int amdgpu_pci_probe(struct pci_dev *pdev,
> > >   return ret;
> > >   }
> > > +static void amdgpu_cancel_all_tdr(struct amdgpu_device *adev)
> > > +{
> > > + int i;
> > > +
> > > + for (i = 0; i < AMDGPU_MAX_RINGS; ++i) {
> > > + struct amdgpu_ring *ring = adev->rings[i];
> > > +
> > > + if (!ring || !ring->sched.thread)
> > > + continue;
> > > +
> > > + cancel_delayed_work_sync(&ring->sched.work_tdr);
> > > + }
> > > +}
> > I think this is a function that's supposed to be in drm/scheduler, not
> > here. Might also just be your cleanup code being ordered wrongly, or your
> > split in one of the earlier patches not done quite right.
> > -Daniel
> 
> 
> This function iterates across all the schedulers per amdgpu device and
> accesses amdgpu-specific structures; drm/scheduler deals with a single
> scheduler at most, so this looks to me like the right place for this
> function.

I guess we could keep track of all schedulers somewhere in a list in
struct drm_device and wrap this up. That was kinda the idea.
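
Purely as an illustration of that direction (hypothetical fields and names
throughout; neither dev->sched_list nor sched->node exist in DRM core
today):

/* Hypothetical sketch: assumes every driver registers its schedulers on a
 * per-device list so the core can cancel all pending timeout work. */
void drm_dev_cancel_all_tdr(struct drm_device *dev)
{
	struct drm_gpu_scheduler *sched;

	list_for_each_entry(sched, &dev->sched_list, node)
		cancel_delayed_work_sync(&sched->work_tdr);
}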

Minimally I think a tiny wrapper with docs for the
cancel_delayed_work_sync(&sched->work_tdr); which explains what you must
observe to make sure there's no race. I'm not exactly sure there's no
guarantee here we won't get a new tdr work launched right afterwards at
least, so this looks a bit like a hack.
-Daniel

> 
> Andrey
> 
> 
> > 
> > > +
> > >   static void
> > >   amdgpu_pci_remove(struct pci_dev *pdev)
> > >   {
> > >   struct drm_device *dev = pci_get_drvdata(pdev);
> > > + struct amdgpu_device *adev = dev->dev_private;
> > >   drm_dev_unplug(dev);
> > > + amdgpu_cancel_all_tdr(adev);
> > >   ttm_bo_unmap_virtual_address_space(&adev->mman.bdev);
> > >   amdgpu_driver_unload_kms(dev);
> > > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c 
> > > b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
> > > index 4720718..87ff0c0 100644
> > > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
> > > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
> > > @@ -28,6 +28,8 @@
> > >   #include "amdgpu.h"
> > >   #include "amdgpu_trace.h"
> > > +#include <drm/drm_drv.h>
> > > +
> > >   static void amdgpu_job_timedout(struct drm_sched_job *s_job)
> > >   {
> > >   struct amdgpu_ring *ring = to_amdgpu_ring(s_job->sched);
> > > @@ -37,6 +39,12 @@ static void amdgpu_job_timedout(struct drm_sched_job 
> > > *s_job)
> > >   memset(&ti, 0, sizeof(struct amdgpu_task_info));
> > > + if (drm_dev_is_unplugged(adev->ddev)) {
> > > + DRM_INFO("ring %s timeout, but device unplugged, skipping.\n",
> > > +   s_job->sched->name);
> > > + return;
> > > + }
> > > +
> > >   if (amdgpu_ring_soft_recovery(ring, job->vmid, 
> > > s_job->s_fence->parent)) {
> > >   DRM_ERROR("ring %s timeout, but soft recovered\n",
> > > s_job->sched->name);
> > > -- 
> > > 2.7.4
> > > 

-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch
___
dri-devel mailing list
dri-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/dri-devel


Re: [PATCH v2 8/8] drm/amdgpu: Prevent any job recoveries after device is unplugged.

2020-11-17 Thread Andrey Grodzovsky


On 6/22/20 5:53 AM, Daniel Vetter wrote:

On Sun, Jun 21, 2020 at 02:03:08AM -0400, Andrey Grodzovsky wrote:

No point to try recovery if device is gone, just messes up things.

Signed-off-by: Andrey Grodzovsky 
---
  drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c | 16 
  drivers/gpu/drm/amd/amdgpu/amdgpu_job.c |  8 
  2 files changed, 24 insertions(+)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
index 6932d75..5d6d3d9 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
@@ -1129,12 +1129,28 @@ static int amdgpu_pci_probe(struct pci_dev *pdev,
return ret;
  }
  
+static void amdgpu_cancel_all_tdr(struct amdgpu_device *adev)

+{
+   int i;
+
+   for (i = 0; i < AMDGPU_MAX_RINGS; ++i) {
+   struct amdgpu_ring *ring = adev->rings[i];
+
+   if (!ring || !ring->sched.thread)
+   continue;
+
+   cancel_delayed_work_sync(&ring->sched.work_tdr);
+   }
+}

I think this is a function that's supposed to be in drm/scheduler, not
here. Might also just be your cleanup code being ordered wrongly, or your
split in one of the earlier patches not done quite right.
-Daniel



This function iterates across all the schedulers per amdgpu device and accesses
amdgpu-specific structures; drm/scheduler deals with a single scheduler at most,
so this looks to me like the right place for this function.

Andrey





+
  static void
  amdgpu_pci_remove(struct pci_dev *pdev)
  {
struct drm_device *dev = pci_get_drvdata(pdev);
+   struct amdgpu_device *adev = dev->dev_private;
  
  	drm_dev_unplug(dev);

+   amdgpu_cancel_all_tdr(adev);
ttm_bo_unmap_virtual_address_space(&adev->mman.bdev);
amdgpu_driver_unload_kms(dev);
  
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c

index 4720718..87ff0c0 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
@@ -28,6 +28,8 @@
  #include "amdgpu.h"
  #include "amdgpu_trace.h"
  
+#include <drm/drm_drv.h>

+
  static void amdgpu_job_timedout(struct drm_sched_job *s_job)
  {
struct amdgpu_ring *ring = to_amdgpu_ring(s_job->sched);
@@ -37,6 +39,12 @@ static void amdgpu_job_timedout(struct drm_sched_job *s_job)
  
  	memset(&ti, 0, sizeof(struct amdgpu_task_info));
  
+	if (drm_dev_is_unplugged(adev->ddev)) {

+   DRM_INFO("ring %s timeout, but device unplugged, skipping.\n",
+ s_job->sched->name);
+   return;
+   }
+
if (amdgpu_ring_soft_recovery(ring, job->vmid, s_job->s_fence->parent)) 
{
DRM_ERROR("ring %s timeout, but soft recovered\n",
  s_job->sched->name);
--
2.7.4


___
dri-devel mailing list
dri-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/dri-devel


Re: [PATCH v2 8/8] drm/amdgpu: Prevent any job recoveries after device is unplugged.

2020-06-22 Thread Daniel Vetter
On Sun, Jun 21, 2020 at 02:03:08AM -0400, Andrey Grodzovsky wrote:
> No point to try recovery if device is gone, just messes up things.
> 
> Signed-off-by: Andrey Grodzovsky 
> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c | 16 
>  drivers/gpu/drm/amd/amdgpu/amdgpu_job.c |  8 
>  2 files changed, 24 insertions(+)
> 
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c 
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
> index 6932d75..5d6d3d9 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
> @@ -1129,12 +1129,28 @@ static int amdgpu_pci_probe(struct pci_dev *pdev,
>   return ret;
>  }
>  
> +static void amdgpu_cancel_all_tdr(struct amdgpu_device *adev)
> +{
> + int i;
> +
> + for (i = 0; i < AMDGPU_MAX_RINGS; ++i) {
> + struct amdgpu_ring *ring = adev->rings[i];
> +
> + if (!ring || !ring->sched.thread)
> + continue;
> +
> + cancel_delayed_work_sync(&ring->sched.work_tdr);
> + }
> +}

I think this is a function that's supposed to be in drm/scheduler, not
here. Might also just be your cleanup code being ordered wrongly, or your
split in one of the earlier patches not done quite right.
-Daniel

> +
>  static void
>  amdgpu_pci_remove(struct pci_dev *pdev)
>  {
>   struct drm_device *dev = pci_get_drvdata(pdev);
> + struct amdgpu_device *adev = dev->dev_private;
>  
>   drm_dev_unplug(dev);
> + amdgpu_cancel_all_tdr(adev);
>   ttm_bo_unmap_virtual_address_space(&adev->mman.bdev);
>   amdgpu_driver_unload_kms(dev);
>  
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c 
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
> index 4720718..87ff0c0 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
> @@ -28,6 +28,8 @@
>  #include "amdgpu.h"
>  #include "amdgpu_trace.h"
>  
> +#include <drm/drm_drv.h>
> +
>  static void amdgpu_job_timedout(struct drm_sched_job *s_job)
>  {
>   struct amdgpu_ring *ring = to_amdgpu_ring(s_job->sched);
> @@ -37,6 +39,12 @@ static void amdgpu_job_timedout(struct drm_sched_job 
> *s_job)
>  
>   memset(&ti, 0, sizeof(struct amdgpu_task_info));
>  
> + if (drm_dev_is_unplugged(adev->ddev)) {
> + DRM_INFO("ring %s timeout, but device unplugged, skipping.\n",
> +   s_job->sched->name);
> + return;
> + }
> +
>   if (amdgpu_ring_soft_recovery(ring, job->vmid, s_job->s_fence->parent)) 
> {
>   DRM_ERROR("ring %s timeout, but soft recovered\n",
> s_job->sched->name);
> -- 
> 2.7.4
> 

-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch
___
dri-devel mailing list
dri-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/dri-devel


[PATCH v2 8/8] drm/amdgpu: Prevent any job recoveries after device is unplugged.

2020-06-20 Thread Andrey Grodzovsky
No point to try recovery if device is gone, just messes up things.

Signed-off-by: Andrey Grodzovsky 
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c | 16 
 drivers/gpu/drm/amd/amdgpu/amdgpu_job.c |  8 
 2 files changed, 24 insertions(+)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
index 6932d75..5d6d3d9 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
@@ -1129,12 +1129,28 @@ static int amdgpu_pci_probe(struct pci_dev *pdev,
return ret;
 }
 
+static void amdgpu_cancel_all_tdr(struct amdgpu_device *adev)
+{
+   int i;
+
+   for (i = 0; i < AMDGPU_MAX_RINGS; ++i) {
+   struct amdgpu_ring *ring = adev->rings[i];
+
+   if (!ring || !ring->sched.thread)
+   continue;
+
+   cancel_delayed_work_sync(&ring->sched.work_tdr);
+   }
+}
+
 static void
 amdgpu_pci_remove(struct pci_dev *pdev)
 {
struct drm_device *dev = pci_get_drvdata(pdev);
+   struct amdgpu_device *adev = dev->dev_private;
 
drm_dev_unplug(dev);
+   amdgpu_cancel_all_tdr(adev);
ttm_bo_unmap_virtual_address_space(&adev->mman.bdev);
amdgpu_driver_unload_kms(dev);
 
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
index 4720718..87ff0c0 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
@@ -28,6 +28,8 @@
 #include "amdgpu.h"
 #include "amdgpu_trace.h"
 
+#include <drm/drm_drv.h>
+
 static void amdgpu_job_timedout(struct drm_sched_job *s_job)
 {
struct amdgpu_ring *ring = to_amdgpu_ring(s_job->sched);
@@ -37,6 +39,12 @@ static void amdgpu_job_timedout(struct drm_sched_job *s_job)
 
memset(&ti, 0, sizeof(struct amdgpu_task_info));
 
+   if (drm_dev_is_unplugged(adev->ddev)) {
+   DRM_INFO("ring %s timeout, but device unplugged, skipping.\n",
+ s_job->sched->name);
+   return;
+   }
+
if (amdgpu_ring_soft_recovery(ring, job->vmid, s_job->s_fence->parent)) 
{
DRM_ERROR("ring %s timeout, but soft recovered\n",
  s_job->sched->name);
-- 
2.7.4

___
dri-devel mailing list
dri-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/dri-devel