Re: [PATCH 0/4] Refine GPU recovery sequence to enhance its stability

2021-04-15 Thread Christian König





[SNIP]


Maybe just empirically - let's try it and see under different 
test scenarios what actually happens  ?


Not a good idea in general; we have that approach way too often at 
AMD and are then surprised that everything works in QA but fails 
in production.


But Daniel already noted in his reply that waiting for a fence 
while holding the SRCU is expected to work.


So let's stick with the approach of high level locking for hotplug.



To my understanding this is true for the other devices, not the one 
being extracted; for that one you still need to do all the HW fence 
signalling dance because the HW is gone and we block any TDRs 
(which won't help anyway).


Andrey



Do you agree to the above ?


Yeah, I think that is correct.

But on the other hand what Daniel reminded me of is that the handling 
needs to be consistent across different devices. And since some devices 
already go with the approach of canceling everything we simply have 
to go down that route as well.


Christian.



What does it mean in our context ? What needs to be done which we are 
not doing now ?


I think we are fine, we just need to continue with the approach of 
forcefully signaling all fences on hotplug.


Christian.



Andrey






Andrey







Christian.



Andrey




Christian.



Andrey



Regards,
Christian.



Andrey









BTW: Could it be that the device SRCU protects more 
than one device and we deadlock because of this?



I haven't actually experienced any deadlock until now 
but, yes, drm_unplug_srcu is defined as static in 
drm_drv.c and so in the presence of multiple devices 
from same or different drivers we in fact are dependent 
on all their critical sections i guess.




Shit, yeah the devil is a squirrel. So for A+I laptops we 
actually need to sync that up with Daniel and the rest of 
the i915 guys.


IIRC we could actually have an amdgpu device in a docking 
station which needs hotplug and the driver might depend 
on waiting for the i915 driver as well.



Can't we propose a patch to make drm_unplug_srcu per 
drm_device ? I don't see why it has to be global and not 
per device thing.


I'm really wondering the same thing for quite a while now.

Adding Daniel as well, maybe he knows why the 
drm_unplug_srcu is global.


Regards,
Christian.



Andrey




Christian.


Andrey




Christian.


Andrey





Andrey




Christian.

    /* Past this point no more fences are submitted to the HW rings, and
     * hence we can safely force-signal all that are currently there.
     * Any subsequently created HW fences will be returned signaled with
     * an error code right away.
     */

    for_each_ring(adev)
        amdgpu_fence_process(ring)

    drm_dev_unplug(dev);
    Stop schedulers
    cancel_sync(all timers and queued works);
    hw_fini
    unmap_mmio

}
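For reference, a rough C sketch of what the pseudocode above could look like in the PCI remove path. amdgpu_fence_process(), drm_dev_unplug(), adev_to_drm() and the ring/fence_drv fields are the existing amdgpu/DRM interfaces; the *_placeholder() helpers only stand in for the "Stop schedulers", "cancel_sync(...)", hw_fini and unmap_mmio steps and are not real functions.

    /* Sketch only: force-signal what is already on the rings, then unplug. */
    static void amdgpu_pci_remove_sketch(struct amdgpu_device *adev)
    {
            struct drm_device *dev = adev_to_drm(adev);
            unsigned int i;

            /* No new HW fences are emitted past this point. */
            for (i = 0; i < AMDGPU_MAX_RINGS; i++) {
                    struct amdgpu_ring *ring = adev->rings[i];

                    if (!ring || !ring->fence_drv.initialized)
                            continue;

                    amdgpu_fence_process(ring);     /* signal everything pending */
            }

            drm_dev_unplug(dev);                    /* flushes in-flight IOCTLs */
            stop_all_schedulers_placeholder(adev);
            cancel_all_timers_and_work_placeholder(adev);
            hw_fini_placeholder(adev);
            unmap_mmio_placeholder(adev);
    }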


Andrey









Alternatively grabbing the reset write side 
and stopping and then restarting the 
scheduler could work as well.


Christian.



I didn't get the above and I don't see why I need to reuse the 
GPU reset rw_lock. I rely on the SRCU unplug flag for unplug. 
Also, it's not clear to me why we are focusing on the scheduler 
threads; any code path that generates HW fences should be 
covered, so any code leading to amdgpu_fence_emit needs to be 
taken into account, such as direct IB submissions, VM flushes etc.


You need to work together with the reset lock anyway, because a 
hotplug could run at the same time as a reset.



Going my way, I indeed now see that I have to take the reset 
write side lock during HW fence signalling in order to protect 
against scheduler/HW fence detachment and reattachment during 
scheduler stop/restart. But if we go with your approach then 
calling drm_dev_unplug and scoping amdgpu_job_timeout with 
drm_dev_enter/exit should be enough to prevent any concurrent 
GPU resets during unplug. In fact I already do it anyway - 
https://cgit.freedesktop.org/~agrodzov/linux/commit/?h=drm-misc-next&id=ef0ea4dd29ef44d2649c5eda16c8f4869acc36b1
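As a sketch of that second part, scoping the TDR handler (amdgpu_job_timedout in the amdgpu code) with drm_dev_enter/exit could look roughly like the following; the shape of the real commit linked above may differ, and the surrounding recovery code is elided. drm_dev_enter/drm_dev_exit and the drm_gpu_sched_stat codes are the real interfaces.

    static enum drm_gpu_sched_stat amdgpu_job_timedout(struct drm_sched_job *s_job)
    {
            struct amdgpu_ring *ring = to_amdgpu_ring(s_job->sched);
            int idx;

            /* Bail out if the device is already unplugged: no GPU reset can
             * race with hotplug once drm_dev_unplug() has run.
             */
            if (!drm_dev_enter(adev_to_drm(ring->adev), &idx))
                    return DRM_GPU_SCHED_STAT_ENODEV;

            /* ... usual timeout handling / amdgpu_device_gpu_recover() ... */

            drm_dev_exit(idx);
            return DRM_GPU_SCHED_STAT_NOMINAL;
    }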



Yes, good point as well.

Christian.



Andrey





Christian.



Andrey






Christian.



Andrey





Andrey

___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx


Re: [PATCH 0/4] Refine GPU recovery sequence to enhance its stability

2021-04-15 Thread Andrey Grodzovsky


On 2021-04-15 3:02 a.m., Christian König wrote:

Am 15.04.21 um 08:27 schrieb Andrey Grodzovsky:


On 2021-04-14 10:58 a.m., Christian König wrote:

Am 14.04.21 um 16:36 schrieb Andrey Grodzovsky:

 [SNIP]


We are racing here once more and need to handle that.



But why? I wrote above that we first stop all the schedulers and 
only then call drm_sched_entity_kill_jobs.


The schedulers consuming jobs is not the problem, we already 
handle that correctly.


The problem is that the entities might continue feeding stuff into 
the scheduler.



Missed that.  Ok, can I just use non sleeping RCU with a flag 
around drm_sched_entity_push_job at the amdgpu level (only 2 
functions call it - amdgpu_cs_submit and amdgpu_job_submit) as a 
preliminary step to flush and block in flight and future 
submissions to entity queue ?


Double checking the code I think we can use the notifier_lock for this.

E.g. in amdgpu_cs.c see where we have the goto error_abort.

That is the place where such a check could be added without any 
additional overhead.



Sure, I will just have to add this lock to amdgpu_job_submit too.


Not ideal, but I think that's fine with me. You might want to 
rename the lock for this, though.
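A minimal sketch of that check, assuming a made-up "job_submit_blocked" flag that hotplug sets under the (renamed) notifier_lock before flushing; adev->notifier_lock and the two-argument drm_sched_entity_push_job() of that time are the real interfaces, everything else here is illustrative.

    /* Called from amdgpu_cs_submit() / amdgpu_job_submit() right before the
     * job would be pushed to the entity (the error_abort spot mentioned above).
     */
    static int amdgpu_push_job_checked(struct amdgpu_device *adev,
                                       struct drm_sched_job *job,
                                       struct drm_sched_entity *entity)
    {
            int r = 0;

            mutex_lock(&adev->notifier_lock);
            if (adev->job_submit_blocked)      /* hypothetical flag set on unplug */
                    r = -ENODEV;               /* device is going away, drop the job */
            else
                    drm_sched_entity_push_job(job, entity);
            mutex_unlock(&adev->notifier_lock);

            return r;
    }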





[SNIP]


Maybe just empirically - let's try it and see under different 
test scenarios what actually happens  ?


Not a good idea in general; we have that approach way too often at 
AMD and are then surprised that everything works in QA but fails 
in production.


But Daniel already noted in his reply that waiting for a fence 
while holding the SRCU is expected to work.


So let's stick with the approach of high level locking for hotplug.



To my understanding this is true for the other devices, not the one 
being extracted; for that one you still need to do all the HW fence 
signalling dance because the HW is gone and we block any TDRs 
(which won't help anyway).


Andrey



Do you agree to the above ?


Yeah, I think that is correct.

But on the other hand what Daniel reminded me of is that the handling 
needs to be consistent across different devices. And since some devices 
already go with the approach of canceling everything we simply have to 
go down that route as well.


Christian.



What does it mean in our context ? What needs to be done which we are 
not doing now ?


Andrey






Andrey







Christian.



Andrey




Christian.



Andrey



Regards,
Christian.



Andrey









BTW: Could it be that the device SRCU protects more than 
one device and we deadlock because of this?



I haven't actually experienced any deadlock until now 
but, yes, drm_unplug_srcu is defined as static in 
drm_drv.c and so in the presence of multiple devices from 
same or different drivers we in fact are dependent on all 
their critical sections i guess.




Shit, yeah the devil is a squirrel. So for A+I laptops we 
actually need to sync that up with Daniel and the rest of 
the i915 guys.


IIRC we could actually have an amdgpu device in a docking 
station which needs hotplug and the driver might depend on 
waiting for the i915 driver as well.



Can't we propose a patch to make drm_unplug_srcu per 
drm_device ? I don't see why it has to be global and not 
per device thing.


I'm really wondering the same thing for quite a while now.

Adding Daniel as well, maybe he knows why the 
drm_unplug_srcu is global.


Regards,
Christian.



Andrey




Christian.


Andrey




Christian.


Andrey





Andrey




Christian.

    /* Past this point no more fence are submitted 
to HW ring and hence we can safely call force 
signal on all that are currently there.
 * Any subsequently created HW fences will be 
returned signaled with an error code right away

 */

    for_each_ring(adev)
amdgpu_fence_process(ring)

    drm_dev_unplug(dev);
    Stop schedulers
    cancel_sync(all timers and queued works);
    hw_fini
    unmap_mmio

}


Andrey









Alternatively grabbing the reset write side 
and stopping and then restarting the scheduler 
could work as well.


Christian.



I didn't get the above and I don't see why I need to reuse the 
GPU reset rw_lock. I rely on the SRCU unplug flag for unplug. 
Also, it's not clear to me why we are focusing on the scheduler 
threads; any code path that generates HW fences should be 
covered, so any code leading to amdgpu_fence_emit needs to be 
taken into account, such as direct IB submissions, VM flushes etc.


You need to work together with the reset lock anyway, because a 
hotplug could run at the same time as a reset.



Going my way, I indeed now see that I have to take the reset 
write side lock during HW fence signalling in order to protect 
against scheduler/HW fence detachment and reattachment during 
scheduler stop/restart. But if we go with your approach then 
calling drm_dev_unplug and scoping amdgpu_job_timeout with 
drm_dev_enter/exit should be enough to prevent any concurrent 
GPU resets during unplug. In fact I already do it anyway - 

Re: [PATCH 0/4] Refine GPU recovery sequence to enhance its stability

2021-04-15 Thread Christian König

Am 15.04.21 um 08:27 schrieb Andrey Grodzovsky:


On 2021-04-14 10:58 a.m., Christian König wrote:

Am 14.04.21 um 16:36 schrieb Andrey Grodzovsky:

 [SNIP]


We are racing here once more and need to handle that.



But why? I wrote above that we first stop all the schedulers and 
only then call drm_sched_entity_kill_jobs.


The schedulers consuming jobs is not the problem, we already 
handle that correctly.


The problem is that the entities might continue feeding stuff into 
the scheduler.



Missed that.  Ok, can I just use non sleeping RCU with a flag around 
drm_sched_entity_push_job at the amdgpu level (only 2 functions call 
it - amdgpu_cs_submit and amdgpu_job_submit) as a preliminary step 
to flush and block in flight and future submissions to entity queue ?


Double checking the code I think we can use the notifier_lock for this.

E.g. in amdgpu_cs.c see where we have the goto error_abort.

That is the place where such a check could be added without any 
additional overhead.



Sure, I will just have to add this lock to amdgpu_job_submit too.


Not ideal, but I think that's fine with me. You might want to 
rename the lock for this, though.





[SNIP]


Maybe just empirically - let's try it and see under different test 
scenarios what actually happens  ?


Not a good idea in general; we have that approach way too often at 
AMD and are then surprised that everything works in QA but fails 
in production.


But Daniel already noted in his reply that waiting for a fence 
while holding the SRCU is expected to work.


So let's stick with the approach of high level locking for hotplug.



To my understanding this is true for the other devices, not the one 
being extracted; for that one you still need to do all the HW fence 
signalling dance because the HW is gone and we block any TDRs (which 
won't help anyway).


Andrey



Do you agree to the above ?


Yeah, I think that is correct.

But on the other hand what Daniel reminded me of is that the handling 
needs to be consistent across different devices. And since some devices 
already go with the approach of canceling everything we simply have to 
go down that route as well.


Christian.



Andrey







Christian.



Andrey




Christian.



Andrey



Regards,
Christian.



Andrey









BTW: Could it be that the device SRCU protects more than 
one device and we deadlock because of this?



I haven't actually experienced any deadlock until now but, 
yes, drm_unplug_srcu is defined as static in drm_drv.c and 
so in the presence of multiple devices from same or 
different drivers we in fact are dependent on all their 
critical sections i guess.




Shit, yeah the devil is a squirrel. So for A+I laptops we 
actually need to sync that up with Daniel and the rest of 
the i915 guys.


IIRC we could actually have an amdgpu device in a docking 
station which needs hotplug and the driver might depend on 
waiting for the i915 driver as well.



Can't we propose a patch to make drm_unplug_srcu per 
drm_device ? I don't see why it has to be global and not per 
device thing.


I'm really wondering the same thing for quite a while now.

Adding Daniel as well, maybe he knows why the drm_unplug_srcu 
is global.


Regards,
Christian.



Andrey




Christian.


Andrey




Christian.


Andrey





Andrey




Christian.

    /* Past this point no more fence are submitted 
to HW ring and hence we can safely call force signal 
on all that are currently there.
 * Any subsequently created HW fences will be 
returned signaled with an error code right away

 */

    for_each_ring(adev)
amdgpu_fence_process(ring)

    drm_dev_unplug(dev);
    Stop schedulers
    cancel_sync(all timers and queued works);
    hw_fini
    unmap_mmio

}


Andrey









Alternatively grabbing the reset write side and 
stopping and then restarting the scheduler 
could work as well.


Christian.



I didn't get the above and I don't see why I need to reuse the 
GPU reset rw_lock. I rely on the SRCU unplug flag for unplug. 
Also, it's not clear to me why we are focusing on the scheduler 
threads; any code path that generates HW fences should be 
covered, so any code leading to amdgpu_fence_emit needs to be 
taken into account, such as direct IB submissions, VM flushes etc.


You need to work together with the reset lock anyway, because a 
hotplug could run at the same time as a reset.



Going my way, I indeed now see that I have to take the reset 
write side lock during HW fence signalling in order to protect 
against scheduler/HW fence detachment and reattachment during 
scheduler stop/restart. But if we go with your approach then 
calling drm_dev_unplug and scoping amdgpu_job_timeout with 
drm_dev_enter/exit should be enough to prevent any concurrent 
GPU resets during unplug. In fact I already do it anyway - 

Re: [PATCH 0/4] Refine GPU recovery sequence to enhance its stability

2021-04-15 Thread Andrey Grodzovsky


On 2021-04-14 10:58 a.m., Christian König wrote:

Am 14.04.21 um 16:36 schrieb Andrey Grodzovsky:

 [SNIP]


We are racing here once more and need to handle that.



But why? I wrote above that we first stop all the schedulers and 
only then call drm_sched_entity_kill_jobs.


The schedulers consuming jobs is not the problem, we already 
handle that correctly.


The problem is that the entities might continue feeding stuff into 
the scheduler.



Missed that.  Ok, can I just use non sleeping RCU with a flag around 
drm_sched_entity_push_job at the amdgpu level (only 2 functions call 
it - amdgpu_cs_submit and amdgpu_job_submit) as a preliminary step to 
flush and block in flight and future submissions to entity queue ?


Double checking the code I think we can use the notifier_lock for this.

E.g. in amdgpu_cs.c see where we have the goto error_abort.

That is the place where such a check could be added without any 
additional overhead.



Sure, I will just have to add this lock to amdgpu_job_submit too.




Christian.










For waiting for other device I have no idea if that couldn't 
deadlock somehow.



Yea, not sure for imported fences and dma_bufs, I would assume 
the other devices should not be impacted by our device removal 
but, who knows...


So I guess we are NOT going with finalizing HW fences before 
drm_dev_unplug and instead will just call drm_dev_enter/exit at 
the back-ends all over the place where there are MMIO accesses ?


Good question. As you said that is really the hard path.

Handling it all at once at IOCTL level certainly has some appeal 
as well, but I have no idea if we can guarantee that this is lock 
free.



Maybe just empirically - let's try it and see under different test 
scenarios what actually happens  ?


Not a good idea in general; we have that approach way too often at 
AMD and are then surprised that everything works in QA but fails in 
production.


But Daniel already noted in his reply that waiting for a fence while 
holding the SRCU is expected to work.


So let's stick with the approach of high level locking for hotplug.



To my understanding this is true for the other devices, not the one being 
extracted; for that one you still need to do all the HW fence signalling 
dance because the HW is gone and we block any TDRs (which won't help 
anyway).


Andrey



Do you agree to the above ?

Andrey







Christian.



Andrey




Christian.



Andrey



Regards,
Christian.



Andrey









BTW: Could it be that the device SRCU protects more than 
one device and we deadlock because of this?



I haven't actually experienced any deadlock until now but, 
yes, drm_unplug_srcu is defined as static in drm_drv.c and 
so in the presence of multiple devices from same or 
different drivers we in fact are dependent on all their 
critical sections i guess.




Shit, yeah the devil is a squirrel. So for A+I laptops we 
actually need to sync that up with Daniel and the rest of 
the i915 guys.


IIRC we could actually have an amdgpu device in a docking 
station which needs hotplug and the driver might depend on 
waiting for the i915 driver as well.



Can't we propose a patch to make drm_unplug_srcu per 
drm_device ? I don't see why it has to be global and not per 
device thing.


I'm really wondering the same thing for quite a while now.

Adding Daniel as well, maybe he knows why the drm_unplug_srcu 
is global.


Regards,
Christian.



Andrey




Christian.


Andrey




Christian.


Andrey





Andrey




Christian.

    /* Past this point no more fence are submitted to 
HW ring and hence we can safely call force signal on 
all that are currently there.
 * Any subsequently created HW fences will be 
returned signaled with an error code right away

 */

    for_each_ring(adev)
amdgpu_fence_process(ring)

    drm_dev_unplug(dev);
    Stop schedulers
    cancel_sync(all timers and queued works);
    hw_fini
    unmap_mmio

}


Andrey









Alternatively grabbing the reset write side and 
stopping and then restarting the scheduler could 
work as well.


Christian.



I didn't get the above and I don't see why I need to reuse the 
GPU reset rw_lock. I rely on the SRCU unplug flag for unplug. 
Also, it's not clear to me why we are focusing on the scheduler 
threads; any code path that generates HW fences should be 
covered, so any code leading to amdgpu_fence_emit needs to be 
taken into account, such as direct IB submissions, VM flushes etc.


You need to work together with the reset lock anyway, because a 
hotplug could run at the same time as a reset.



Going my way, I indeed now see that I have to take the reset 
write side lock during HW fence signalling in order to protect 
against scheduler/HW fence detachment and reattachment during 
scheduler stop/restart. But if we go with your approach then 
calling drm_dev_unplug and scoping amdgpu_job_timeout with 
drm_dev_enter/exit should be enough to prevent any concurrent 
GPU resets during unplug. In fact I already do it anyway 

Re: [PATCH 0/4] Refine GPU recovery sequence to enhance its stability

2021-04-14 Thread Christian König

Am 14.04.21 um 16:36 schrieb Andrey Grodzovsky:

 [SNIP]


We are racing here once more and need to handle that.



But why? I wrote above that we first stop all the schedulers and 
only then call drm_sched_entity_kill_jobs.


The schedulers consuming jobs is not the problem, we already 
handle that correctly.


The problem is that the entities might continue feeding stuff into 
the scheduler.



Missed that.  Ok, can I just use non sleeping RCU with a flag around 
drm_sched_entity_push_job at the amdgpu level (only 2 functions call 
it - amdgpu_cs_submit and amdgpu_job_submit) as a preliminary step to 
flush and block in flight and future submissions to entity queue ?


Double checking the code I think we can use the notifier_lock for this.

E.g. in amdgpu_cs.c see where we have the goto error_abort.

That is the place where such a check could be added without any 
additional overhead.


Christian.










For waiting for other device I have no idea if that couldn't 
deadlock somehow.



Yea, not sure for imported fences and dma_bufs, I would assume the 
other devices should not be impacted by our device removal but, 
who knows...


So I guess we are NOT going with finalizing HW fences before 
drm_dev_unplug and instead will just call drm_dev_enter/exit at 
the back-ends all over the place where there are MMIO accesses ?


Good question. As you said that is really the hard path.

Handling it all at once at IOCTL level certainly has some appeal as 
well, but I have no idea if we can guarantee that this is lock free.



Maybe just empirically - let's try it and see under different test 
scenarios what actually happens  ?


Not a good idea in general; we have that approach way too often at AMD 
and are then surprised that everything works in QA but fails in 
production.


But Daniel already noted in his reply that waiting for a fence while 
holding the SRCU is expected to work.


So let's stick with the approach of high level locking for hotplug.



To my understanding this is true for the other devices, not the one being 
extracted; for that one you still need to do all the HW fence signalling 
dance because the HW is gone and we block any TDRs (which won't help 
anyway).


Andrey




Christian.



Andrey




Christian.



Andrey



Regards,
Christian.



Andrey









BTW: Could it be that the device SRCU protects more than 
one device and we deadlock because of this?



I haven't actually experienced any deadlock until now but, 
yes, drm_unplug_srcu is defined as static in drm_drv.c and 
so in the presence of multiple devices from same or 
different drivers we in fact are dependent on all their 
critical sections i guess.




Shit, yeah the devil is a squirrel. So for A+I laptops we 
actually need to sync that up with Daniel and the rest of the 
i915 guys.


IIRC we could actually have an amdgpu device in a docking 
station which needs hotplug and the driver might depend on 
waiting for the i915 driver as well.



Can't we propose a patch to make drm_unplug_srcu per 
drm_device ? I don't see why it has to be global and not per 
device thing.


I'm really wondering the same thing for quite a while now.

Adding Daniel as well, maybe he knows why the drm_unplug_srcu 
is global.


Regards,
Christian.



Andrey




Christian.


Andrey




Christian.


Andrey





Andrey




Christian.

    /* Past this point no more fence are submitted to 
HW ring and hence we can safely call force signal on 
all that are currently there.
 * Any subsequently created HW fences will be 
returned signaled with an error code right away

 */

    for_each_ring(adev)
        amdgpu_fence_process(ring)

    drm_dev_unplug(dev);
    Stop schedulers
    cancel_sync(all timers and queued works);
    hw_fini
    unmap_mmio

}


Andrey









Alternatively grabbing the reset write side and 
stopping and then restarting the scheduler could 
work as well.


Christian.



I didn't get the above and I don't see why I need 
to reuse the GPU reset rw_lock. I rely on the SRCU 
unplug flag for unplug. Also, not clear to me why 
are we focusing on the scheduler threads, any code 
patch to generate HW fences should be covered, so 
any code leading to amdgpu_fence_emit needs to be 
taken into account such as, direct IB submissions, 
VM flushes e.t.c


You need to work together with the reset lock anyway, because a 
hotplug could run at the same time as a reset.



Going my way, I indeed now see that I have to take the reset 
write side lock during HW fence signalling in order to protect 
against scheduler/HW fence detachment and reattachment during 
scheduler stop/restart. But if we go with your approach then 
calling drm_dev_unplug and scoping amdgpu_job_timeout with 
drm_dev_enter/exit should be enough to prevent any concurrent 
GPU resets during unplug. In fact I already do it anyway - 

Re: [PATCH 0/4] Refine GPU recovery sequence to enhance its stability

2021-04-14 Thread Andrey Grodzovsky


On 2021-04-14 3:01 a.m., Christian König wrote:

Am 13.04.21 um 20:30 schrieb Andrey Grodzovsky:


On 2021-04-13 2:25 p.m., Christian König wrote:



Am 13.04.21 um 20:18 schrieb Andrey Grodzovsky:


On 2021-04-13 2:03 p.m., Christian König wrote:

Am 13.04.21 um 17:12 schrieb Andrey Grodzovsky:


On 2021-04-13 3:10 a.m., Christian König wrote:

Am 12.04.21 um 22:01 schrieb Andrey Grodzovsky:


On 2021-04-12 3:18 p.m., Christian König wrote:

Am 12.04.21 um 21:12 schrieb Andrey Grodzovsky:

[SNIP]


So what's the right approach ? How do we guarantee that when 
running amdgpu_fence_driver_force_completion we signal all 
the HW fences and don't race against more fences being 
inserted into that array ?




Well I would still say the best approach would be to insert 
this between the front end and the backend and not rely on 
signaling fences while holding the device srcu.



My question is, even now, when we run 
amdgpu_fence_driver_fini_hw->amdgpu_fence_wait_empty or 
amdgpu_fence_driver_fini_hw->amdgpu_fence_driver_force_completion, 
what prevents a race with another fence being emitted and 
inserted into the fence array at the same time ? Looks 
like nothing.




Each ring can only be used by one thread at the same time, 
this includes emitting fences as well as other stuff.


During GPU reset we make sure nobody writes to the rings by 
stopping the scheduler and taking the GPU reset lock (so that 
nobody else can start the scheduler again).



What about direct submissions not through scheduler - 
amdgpu_job_submit_direct, I don't see how this is protected.


Those only happen during startup and GPU reset.



Ok, but then it looks like I am missing something; see the following 
steps in amdgpu_pci_remove -


1) Use the disable_irq API to stop and flush all in-flight 
HW interrupt handlers


2) Grab the reset lock and stop all the schedulers

After above 2 steps the HW fences array is idle, no more 
insertions and no more extractions from the array


3) Run one time amdgpu_fence_process to signal all current HW fences

4) Set drm_dev_unplug (will 'flush' all in flight IOCTLs), 
release the GPU reset lock and go on with the rest of the 
sequence (cancel timers, work items e.t.c)


What's problematic in this sequence ?


drm_dev_unplug() will wait for the IOCTLs to finish.

The IOCTLs in turn can wait for fences. That can be both hardware 
fences, scheduler fences, as well as fences from other devices 
(and KIQ fences for register writes under SRIOV, but we can 
hopefully ignore them for now).


We have handled the hardware fences, but we have no idea when the 
scheduler fences or the fences from other devices will signal.


Scheduler fences won't signal until the scheduler threads are 
restarted or we somehow cancel the submissions. Doable, but tricky 
as well.



For scheduler fences I am not worried; the 
sched_fence->finished fences are by definition attached to HW 
fences which already signaled. For sched_fence->scheduled we should 
run drm_sched_entity_kill_jobs for each entity after stopping the 
scheduler threads and before setting drm_dev_unplug.


Well exactly that is what is tricky here. 
drm_sched_entity_kill_jobs() assumes that there are no more jobs 
pushed into the entity.


We are racing here once more and need to handle that.



But why? I wrote above that we first stop all the schedulers and 
only then call drm_sched_entity_kill_jobs.


The schedulers consuming jobs is not the problem, we already 
handle that correctly.


The problem is that the entities might continue feeding stuff into the 
scheduler.



Missed that.  Ok, can I just use non sleeping RCU with a flag around 
drm_sched_entity_push_job at the amdgpu level (only 2 functions call it 
- amdgpu_cs_submit and amdgpu_job_submit) as a preliminary step to flush 
and block in flight and future submissions to entity queue ?









For waiting for other device I have no idea if that couldn't 
deadlock somehow.



Yea, not sure for imported fences and dma_bufs, I would assume the 
other devices should not be impacted by our device removal but, who 
knows...


So I guess we are NOT going with finalizing HW fences before 
drm_dev_unplug and instead will just call drm_dev_enter/exit at the 
back-ends all over the place where there are MMIO accesses ?


Good question. As you said that is really the hard path.

Handling it all at once at IOCTL level certainly has some appeal as 
well, but I have no idea if we can guarantee that this is lock free.



Maybe just empirically - let's try it and see under different test 
scenarios what actually happens  ?


Not a good idea in general; we have that approach way too often at AMD 
and are then surprised that everything works in QA but fails in 
production.


But Daniel already noted in his reply that waiting for a fence while 
holding the SRCU is expected to work.


So let's stick with the approach of high level locking for hotplug.



To my understanding this is true for other 

Re: [PATCH 0/4] Refine GPU recovery sequence to enhance its stability

2021-04-14 Thread Christian König

Am 13.04.21 um 20:30 schrieb Andrey Grodzovsky:


On 2021-04-13 2:25 p.m., Christian König wrote:



Am 13.04.21 um 20:18 schrieb Andrey Grodzovsky:


On 2021-04-13 2:03 p.m., Christian König wrote:

Am 13.04.21 um 17:12 schrieb Andrey Grodzovsky:


On 2021-04-13 3:10 a.m., Christian König wrote:

Am 12.04.21 um 22:01 schrieb Andrey Grodzovsky:


On 2021-04-12 3:18 p.m., Christian König wrote:

Am 12.04.21 um 21:12 schrieb Andrey Grodzovsky:

[SNIP]


So what's the right approach ? How do we guarantee that when 
running amdgpu_fence_driver_force_completion we signal 
all the HW fences and don't race against more fences 
being inserted into that array ?




Well I would still say the best approach would be to insert 
this between the front end and the backend and not rely on 
signaling fences while holding the device srcu.



My question is, even now, when we run 
amdgpu_fence_driver_fini_hw->amdgpu_fence_wait_empty or 
amdgpu_fence_driver_fini_hw->amdgpu_fence_driver_force_completion, 
what prevents a race with another fence being emitted and 
inserted into the fence array at the same time ? Looks 
like nothing.




Each ring can only be used by one thread at the same time, this 
includes emitting fences as well as other stuff.


During GPU reset we make sure nobody writes to the rings by 
stopping the scheduler and taking the GPU reset lock (so that 
nobody else can start the scheduler again).



What about direct submissions not through scheduler - 
amdgpu_job_submit_direct, I don't see how this is protected.


Those only happen during startup and GPU reset.



Ok, but then looks like I am missing something, see the following 
steps in amdgpu_pci_remove -


1) Use disable_irq API function to stop and flush all in flight HW 
interrupts handlers


2) Grab the reset lock and stop all the schedulers

After above 2 steps the HW fences array is idle, no more 
insertions and no more extractions from the array


3) Run one time amdgpu_fence_process to signal all current HW fences

4) Set drm_dev_unplug (will 'flush' all in flight IOCTLs), release 
the GPU reset lock and go on with the rest of the sequence (cancel 
timers, work items e.t.c)


What's problematic in this sequence ?


drm_dev_unplug() will wait for the IOCTLs to finish.

The IOCTLs in turn can wait for fences. That can be both hardware 
fences, scheduler fences, as well as fences from other devices (and 
KIQ fences for register writes under SRIOV, but we can hopefully 
ignore them for now).


We have handled the hardware fences, but we have no idea when the 
scheduler fences or the fences from other devices will signal.


Scheduler fences won't signal until the scheduler threads are 
restarted or we somehow cancel the submissions. Doable, but tricky 
as well.



For scheduler fences I am not worried; the sched_fence->finished 
fences are by definition attached to HW fences which already 
signaled. For sched_fence->scheduled we should run 
drm_sched_entity_kill_jobs for each entity after stopping the 
scheduler threads and before setting drm_dev_unplug.


Well exactly that is what is tricky here. 
drm_sched_entity_kill_jobs() assumes that there are no more jobs 
pushed into the entity.


We are racing here once more and need to handle that.



But why? I wrote above that we first stop all the schedulers and 
only then call drm_sched_entity_kill_jobs.


The schedulers consuming jobs is not the problem, we already handle that 
correctly.


The problem is that the entities might continue feeding stuff into the 
scheduler.






For waiting for other device I have no idea if that couldn't 
deadlock somehow.



Yea, not sure for imported fences and dma_bufs, I would assume the 
other devices should not be impacted by our device removal but, who 
knows...


So I guess we are NOT going with finalizing HW fences before 
drm_dev_unplug and instead will just call drm_dev_enter/exit at the 
back-ends all over the place where there are MMIO accesses ?


Good question. As you said that is really the hard path.

Handling it all at once at IOCTL level certainly has some appeal as 
well, but I have no idea if we can guarantee that this is lock free.



Maybe just empirically - let's try it and see under different test 
scenarios what actually happens  ?


Not a good idea in general; we have that approach way too often at AMD 
and are then surprised that everything works in QA but fails in production.


But Daniel already noted in his reply that waiting for a fence while 
holding the SRCU is expected to work.


So let's stick with the approach of high level locking for hotplug.

Christian.



Andrey




Christian.



Andrey



Regards,
Christian.



Andrey









BTW: Could it be that the device SRCU protects more than one 
device and we deadlock because of this?



I haven't actually experienced any deadlock until now but, 
yes, drm_unplug_srcu is defined as static in drm_drv.c and so 
in the presence  of multiple devices from same or different 
drivers 

Re: [PATCH 0/4] Refine GPU recovery sequence to enhance its stability

2021-04-13 Thread Daniel Vetter
On Tue, Apr 13, 2021 at 11:13 AM Li, Dennis  wrote:
>
> [AMD Official Use Only - Internal Distribution Only]
>
> Hi, Christian and Andrey,
>   Maybe we could try to implement the "wait" callback function of dma_fence_ops; 
> when GPU reset or unplug happens, make this callback return -ENODEV to 
> notify the caller that the device is lost.
>
>  * Must return -ERESTARTSYS if the wait is intr = true and the wait 
> was
>  * interrupted, and remaining jiffies if fence has signaled, or 0 if 
> wait
>  * timed out. Can also return other error values on custom 
> implementations,
>  * which should be treated as if the fence is signaled. For example a 
> hardware
>  * lockup could be reported like that.
>  *
>  * This callback is optional.
>  */
> signed long (*wait)(struct dma_fence *fence,
> bool intr, signed long timeout);

Uh, this callback is for old horrors like unreliable irq delivery on
radeon. Please don't use it for anything; if we need to make fences
bail out on error then we need something that works for all fences.
Also TDR should recover you here already and make sure the
dma_fence_wait() is bound in time.
-Daniel
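The usual alternative to a custom ->wait, and roughly what "forcefully signaling all fences" means in practice, is to attach an error and then signal the fence, so every generic dma_fence_wait() caller wakes up and can read the error back via dma_fence_get_status(). A minimal sketch with the standard dma-fence API (the helper name is made up):

    #include <linux/dma-fence.h>

    /* Sketch: mark the fence as failed and complete it.  dma_fence_set_error()
     * must be called before dma_fence_signal(); waiters then see the fence as
     * signaled and dma_fence_get_status() returns -ENODEV.
     */
    static void force_complete_fence(struct dma_fence *fence)
    {
            dma_fence_set_error(fence, -ENODEV);
            dma_fence_signal(fence);
    }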

>
> Best Regards
> Dennis Li
> -Original Message-
> From: Christian König 
> Sent: Tuesday, April 13, 2021 3:10 PM
> To: Grodzovsky, Andrey ; Koenig, Christian 
> ; Li, Dennis ; 
> amd-gfx@lists.freedesktop.org; Deucher, Alexander 
> ; Kuehling, Felix ; Zhang, 
> Hawking ; Daniel Vetter 
> Subject: Re: [PATCH 0/4] Refine GPU recovery sequence to enhance its stability
>
> Am 12.04.21 um 22:01 schrieb Andrey Grodzovsky:
> >
> > On 2021-04-12 3:18 p.m., Christian König wrote:
> >> Am 12.04.21 um 21:12 schrieb Andrey Grodzovsky:
> >>> [SNIP]
> >>>>>
> >>>>> So what's the right approach ? How we guarantee that when running
> >>>>> amdgpu_fence_driver_force_completion we will signal all the HW
> >>>>> fences and not racing against some more fences insertion into that
> >>>>> array ?
> >>>>>
> >>>>
> >>>> Well I would still say the best approach would be to insert this
> >>>> between the front end and the backend and not rely on signaling
> >>>> fences while holding the device srcu.
> >>>
> >>>
> >>> My question is, even now, when we run
> >>> amdgpu_fence_driver_fini_hw->amdgpu_fence_wait_empty or
> >>> amdgpu_fence_driver_fini_hw->amdgpu_fence_driver_force_completion,
> >>> what there prevents a race with another fence being at the same time
> >>> emitted and inserted into the fence array ? Looks like nothing.
> >>>
> >>
> >> Each ring can only be used by one thread at the same time, this
> >> includes emitting fences as well as other stuff.
> >>
> >> During GPU reset we make sure nobody writes to the rings by stopping
> >> the scheduler and taking the GPU reset lock (so that nobody else can
> >> start the scheduler again).
> >
> >
> > What about direct submissions not through scheduler -
> > amdgpu_job_submit_direct, I don't see how this is protected.
>
> Those only happen during startup and GPU reset.
>
> >>
> >>>>
> >>>> BTW: Could it be that the device SRCU protects more than one device
> >>>> and we deadlock because of this?
> >>>
> >>>
> >>> I haven't actually experienced any deadlock until now but, yes,
> >>> drm_unplug_srcu is defined as static in drm_drv.c and so in the
> >>> presence  of multiple devices from same or different drivers we in
> >>> fact are dependent on all their critical sections i guess.
> >>>
> >>
> >> Shit, yeah the devil is a squirrel. So for A+I laptops we actually
> >> need to sync that up with Daniel and the rest of the i915 guys.
> >>
> >> IIRC we could actually have an amdgpu device in a docking station
> >> which needs hotplug and the driver might depend on waiting for the
> >> i915 driver as well.
> >
> >
> > Can't we propose a patch to make drm_unplug_srcu per drm_device ? I
> > don't see why it has to be global and not per device thing.
>
> I'm really wondering the same thing for quite a while now.
>
> Adding Daniel as well, maybe he knows why the drm_unplug_srcu is global.
>
> Regards,
> Christian.
>
> >
> > Andrey
> >
> >
> >>

Re: [PATCH 0/4] Refine GPU recovery sequence to enhance its stability

2021-04-13 Thread Daniel Vetter
On Tue, Apr 13, 2021 at 9:10 AM Christian König
 wrote:
>
> Am 12.04.21 um 22:01 schrieb Andrey Grodzovsky:
> >
> > On 2021-04-12 3:18 p.m., Christian König wrote:
> >> Am 12.04.21 um 21:12 schrieb Andrey Grodzovsky:
> >>> [SNIP]
> >
> > So what's the right approach ? How we guarantee that when running
> > amdgpu_fence_driver_force_completion we will signal all the HW
> > fences and not racing against some more fences insertion into that
> > array ?
> >
> 
>  Well I would still say the best approach would be to insert this
>  between the front end and the backend and not rely on signaling
>  fences while holding the device srcu.
> >>>
> >>>
> >>> My question is, even now, when we run
> >>> amdgpu_fence_driver_fini_hw->amdgpu_fence_wait_empty or
> >>> amdgpu_fence_driver_fini_hw->amdgpu_fence_driver_force_completion,
> >>> what there prevents a race with another fence being at the same time
> >>> emitted and inserted into the fence array ? Looks like nothing.
> >>>
> >>
> >> Each ring can only be used by one thread at the same time, this
> >> includes emitting fences as well as other stuff.
> >>
> >> During GPU reset we make sure nobody writes to the rings by stopping
> >> the scheduler and taking the GPU reset lock (so that nobody else can
> >> start the scheduler again).
> >
> >
> > What about direct submissions not through scheduler -
> > amdgpu_job_submit_direct, I don't see how this is protected.
>
> Those only happen during startup and GPU reset.
>
> >>
> 
>  BTW: Could it be that the device SRCU protects more than one device
>  and we deadlock because of this?
> >>>
> >>>
> >>> I haven't actually experienced any deadlock until now but, yes,
> >>> drm_unplug_srcu is defined as static in drm_drv.c and so in the
> >>> presence  of multiple devices from same or different drivers we in
> >>> fact are dependent on all their critical sections i guess.
> >>>
> >>
> >> Shit, yeah the devil is a squirrel. So for A+I laptops we actually
> >> need to sync that up with Daniel and the rest of the i915 guys.
> >>
> >> IIRC we could actually have an amdgpu device in a docking station
> >> which needs hotplug and the driver might depend on waiting for the
> >> i915 driver as well.
> >
> >
> > Can't we propose a patch to make drm_unplug_srcu per drm_device ? I
> > don't see why it has to be global and not per device thing.
>
> I'm really wondering the same thing for quite a while now.
>
> Adding Daniel as well, maybe he knows why the drm_unplug_srcu is global.

SRCU isn't exactly the cheapest thing, but aside from that we could
make it per-device. I'm not seeing the point much since if you do end
up being stuck on an ioctl this might happen with anything really.

Also note that dma_fence_waits are supposed to be time bound, so you
shouldn't end up waiting on them forever. It should all get sorted out
in due time with TDR I hope (e.g. if i915 is stuck on a fence because
you're unlucky).
-Daniel
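For context, the SRCU being discussed here is the one behind drm_dev_enter()/drm_dev_exit(): read-side sections bracket hardware access, and drm_dev_unplug() marks the device gone and then waits for all such sections to drain. A minimal sketch of the read-side pattern (the register-write helper is made up for illustration):

    #include <drm/drm_drv.h>
    #include <linux/io.h>

    /* Skip the MMIO access entirely once the device has been unplugged. */
    static void example_reg_write(struct drm_device *dev,
                                  void __iomem *reg, u32 value)
    {
            int idx;

            if (!drm_dev_enter(dev, &idx))
                    return;         /* drm_dev_unplug() already ran */

            writel(value, reg);

            drm_dev_exit(idx);
    }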

>
> Regards,
> Christian.
>
> >
> > Andrey
> >
> >
> >>
> >> Christian.
> >>
> >>> Andrey
> >>>
> >>>
> 
>  Christian.
> 
> > Andrey
> >
> >
> >>
> >>> Andrey
> >>>
> >>>
> 
>  Christian.
> 
> > /* Past this point no more fence are submitted to HW ring
> > and hence we can safely call force signal on all that are
> > currently there.
> >  * Any subsequently created  HW fences will be returned
> > signaled with an error code right away
> >  */
> >
> > for_each_ring(adev)
> > amdgpu_fence_process(ring)
> >
> > drm_dev_unplug(dev);
> > Stop schedulers
> > cancel_sync(all timers and queued works);
> > hw_fini
> > unmap_mmio
> >
> > }
> >
> >
> > Andrey
> >
> >
> >>
> >>
> >>>
> >>
> >> Alternatively grabbing the reset write side and stopping
> >> and then restarting the scheduler could work as well.
> >>
> >> Christian.
> >
> >
> > I didn't get the above and I don't see why I need to reuse
> > the GPU reset rw_lock. I rely on the SRCU unplug flag for
> > unplug. Also, not clear to me why are we focusing on the
> > scheduler threads, any code patch to generate HW fences
> > should be covered, so any code leading to
> > amdgpu_fence_emit needs to be taken into account such as,
> > direct IB submissions, VM flushes e.t.c
> 
>  You need to work together with the reset lock anyway, cause
>  a hotplug could run at the same time as a reset.
> >>>
> >>>
> >>> For going my way indeed now I see now that I have to take
> >>> reset write side lock during HW 

Re: [PATCH 0/4] Refine GPU recovery sequence to enhance its stability

2021-04-13 Thread Andrey Grodzovsky


On 2021-04-13 2:25 p.m., Christian König wrote:



Am 13.04.21 um 20:18 schrieb Andrey Grodzovsky:


On 2021-04-13 2:03 p.m., Christian König wrote:

Am 13.04.21 um 17:12 schrieb Andrey Grodzovsky:


On 2021-04-13 3:10 a.m., Christian König wrote:

Am 12.04.21 um 22:01 schrieb Andrey Grodzovsky:


On 2021-04-12 3:18 p.m., Christian König wrote:

Am 12.04.21 um 21:12 schrieb Andrey Grodzovsky:

[SNIP]


So what's the right approach ? How do we guarantee that when 
running amdgpu_fence_driver_force_completion we signal 
all the HW fences and don't race against more fences 
being inserted into that array ?




Well I would still say the best approach would be to insert 
this between the front end and the backend and not rely on 
signaling fences while holding the device srcu.



My question is, even now, when we run 
amdgpu_fence_driver_fini_hw->amdgpu_fence_wait_empty or 
amdgpu_fence_driver_fini_hw->amdgpu_fence_driver_force_completion, 
what prevents a race with another fence being emitted and 
inserted into the fence array at the same time ? Looks like 
nothing.




Each ring can only be used by one thread at the same time, this 
includes emitting fences as well as other stuff.


During GPU reset we make sure nobody writes to the rings by 
stopping the scheduler and taking the GPU reset lock (so that 
nobody else can start the scheduler again).



What about direct submissions not through scheduler - 
amdgpu_job_submit_direct, I don't see how this is protected.


Those only happen during startup and GPU reset.



Ok, but then looks like I am missing something, see the following 
steps in amdgpu_pci_remove -


1) Use disable_irq API function to stop and flush all in flight HW 
interrupts handlers


2) Grab the reset lock and stop all the schedulers

After above 2 steps the HW fences array is idle, no more insertions 
and no more extractions from the array


3) Run one time amdgpu_fence_process to signal all current HW fences

4) Set drm_dev_unplug (will 'flush' all in flight IOCTLs), release 
the GPU reset lock and go on with the rest of the sequence (cancel 
timers, work items e.t.c)


What's problematic in this sequence ?


drm_dev_unplug() will wait for the IOCTLs to finish.

The IOCTLs in turn can wait for fences. That can be both hardware 
fences, scheduler fences, as well as fences from other devices (and 
KIQ fences for register writes under SRIOV, but we can hopefully 
ignore them for now).


We have handled the hardware fences, but we have no idea when the 
scheduler fences or the fences from other devices will signal.


Scheduler fences won't signal until the scheduler threads are 
restarted or we somehow cancel the submissions. Doable, but tricky 
as well.



For scheduler fences I am not worried; the sched_fence->finished 
fences are by definition attached to HW fences which already 
signaled. For sched_fence->scheduled we should run 
drm_sched_entity_kill_jobs for each entity after stopping the 
scheduler threads and before setting drm_dev_unplug.


Well exactly that is what is tricky here. drm_sched_entity_kill_jobs() 
assumes that there are no more jobs pushed into the entity.


We are racing here once more and need to handle that.



But why? I wrote above that we first stop all the schedulers and only 
then call drm_sched_entity_kill_jobs.







For waiting for other device I have no idea if that couldn't 
deadlock somehow.



Yea, not sure for imported fences and dma_bufs, I would assume the 
other devices should not be impacted by our device removal but, who 
knows...


So I guess we are NOT going with finalizing HW fences before 
drm_dev_unplug and instead will just call drm_dev_enter/exit at the 
back-ends all over the place where there are MMIO accesses ?


Good question. As you said that is really the hard path.

Handling it all at once at IOCTL level certainly has some appeal as 
well, but I have no idea if we can guarantee that this is lock free.



Maybe just empirically - let's try it and see under different test 
scenarios what actually happens  ?


Andrey




Christian.



Andrey



Regards,
Christian.



Andrey









BTW: Could it be that the device SRCU protects more than one 
device and we deadlock because of this?



I haven't actually experienced any deadlock until now but, yes, 
drm_unplug_srcu is defined as static in drm_drv.c and so in the 
presence  of multiple devices from same or different drivers we 
in fact are dependent on all their critical sections i guess.




Shit, yeah the devil is a squirrel. So for A+I laptops we 
actually need to sync that up with Daniel and the rest of the 
i915 guys.


IIRC we could actually have an amdgpu device in a docking 
station which needs hotplug and the driver might depend on 
waiting for the i915 driver as well.



Can't we propose a patch to make drm_unplug_srcu per drm_device ? 
I don't see why it has to be global and not per device thing.


I'm really wondering the same thing for quite a while now.

Adding 

Re: [PATCH 0/4] Refine GPU recovery sequence to enhance its stability

2021-04-13 Thread Christian König



Am 13.04.21 um 20:18 schrieb Andrey Grodzovsky:


On 2021-04-13 2:03 p.m., Christian König wrote:

Am 13.04.21 um 17:12 schrieb Andrey Grodzovsky:


On 2021-04-13 3:10 a.m., Christian König wrote:

Am 12.04.21 um 22:01 schrieb Andrey Grodzovsky:


On 2021-04-12 3:18 p.m., Christian König wrote:

Am 12.04.21 um 21:12 schrieb Andrey Grodzovsky:

[SNIP]


So what's the right approach ? How do we guarantee that when 
running amdgpu_fence_driver_force_completion we signal 
all the HW fences and don't race against more fences 
being inserted into that array ?




Well I would still say the best approach would be to insert 
this between the front end and the backend and not rely on 
signaling fences while holding the device srcu.



My question is, even now, when we run 
amdgpu_fence_driver_fini_hw->amdgpu_fence_wait_empty or 
amdgpu_fence_driver_fini_hw->amdgpu_fence_driver_force_completion, 
what prevents a race with another fence being emitted and 
inserted into the fence array at the same time ? Looks like 
nothing.




Each ring can only be used by one thread at the same time, this 
includes emitting fences as well as other stuff.


During GPU reset we make sure nobody writes to the rings by 
stopping the scheduler and taking the GPU reset lock (so that 
nobody else can start the scheduler again).



What about direct submissions not through scheduler - 
amdgpu_job_submit_direct, I don't see how this is protected.


Those only happen during startup and GPU reset.



Ok, but then looks like I am missing something, see the following 
steps in amdgpu_pci_remove -


1) Use disable_irq API function to stop and flush all in flight HW 
interrupts handlers


2) Grab the reset lock and stop all the schedulers

After above 2 steps the HW fences array is idle, no more insertions 
and no more extractions from the array


3) Run one time amdgpu_fence_process to signal all current HW fences

4) Set drm_dev_unplug (will 'flush' all in flight IOCTLs), release 
the GPU reset lock and go on with the rest of the sequence (cancel 
timers, work items e.t.c)


What's problematic in this sequence ?


drm_dev_unplug() will wait for the IOCTLs to finish.

The IOCTLs in turn can wait for fences. That can be both hardware 
fences, scheduler fences, as well as fences from other devices (and 
KIQ fences for register writes under SRIOV, but we can hopefully 
ignore them for now).


We have handled the hardware fences, but we have no idea when the 
scheduler fences or the fences from other devices will signal.


Scheduler fences won't signal until the scheduler threads are 
restarted or we somehow cancel the submissions. Doable, but tricky as 
well.



For scheduler fences I am not worried; the sched_fence->finished 
fences are by definition attached to HW fences which already 
signaled. For sched_fence->scheduled we should run 
drm_sched_entity_kill_jobs for each entity after stopping the 
scheduler threads and before setting drm_dev_unplug.


Well exactly that is what is tricky here. drm_sched_entity_kill_jobs() 
assumes that there are no more jobs pushed into the entity.


We are racing here once more and need to handle that.



For waiting for other device I have no idea if that couldn't deadlock 
somehow.



Yea, not sure for imported fences and dma_bufs, I would assume the 
other devices should not be impacted by our device removal but, who 
knows...


So I guess we are NOT going with finalizing HW fences before 
drm_dev_unplug and instead will just call drm_dev_enter/exit at the 
back-ends all over the place where there are MMIO accesses ?


Good question. As you said that is really the hard path.

Handling it all at once at IOCTL level certainly has some appeal as 
well, but I have no idea if we can guarantee that this is lock free.


Christian.



Andrey



Regards,
Christian.



Andrey









BTW: Could it be that the device SRCU protects more than one 
device and we deadlock because of this?



I haven't actually experienced any deadlock until now but, yes, 
drm_unplug_srcu is defined as static in drm_drv.c and so in the 
presence  of multiple devices from same or different drivers we 
in fact are dependent on all their critical sections i guess.




Shit, yeah the devil is a squirrel. So for A+I laptops we 
actually need to sync that up with Daniel and the rest of the 
i915 guys.


IIRC we could actually have an amdgpu device in a docking station 
which needs hotplug and the driver might depend on waiting for 
the i915 driver as well.



Can't we propose a patch to make drm_unplug_srcu per drm_device ? 
I don't see why it has to be global and not per device thing.


I'm really wondering the same thing for quite a while now.

Adding Daniel as well, maybe he knows why the drm_unplug_srcu is 
global.


Regards,
Christian.



Andrey




Christian.


Andrey




Christian.


Andrey





Andrey




Christian.

    /* Past this point no more fence are submitted to HW 
ring and hence we can safely call force signal 

Re: [PATCH 0/4] Refine GPU recovery sequence to enhance its stability

2021-04-13 Thread Andrey Grodzovsky


On 2021-04-13 2:03 p.m., Christian König wrote:

Am 13.04.21 um 17:12 schrieb Andrey Grodzovsky:


On 2021-04-13 3:10 a.m., Christian König wrote:

Am 12.04.21 um 22:01 schrieb Andrey Grodzovsky:


On 2021-04-12 3:18 p.m., Christian König wrote:

Am 12.04.21 um 21:12 schrieb Andrey Grodzovsky:

[SNIP]


So what's the right approach ? How do we guarantee that when 
running amdgpu_fence_driver_force_completion we signal all 
the HW fences and don't race against more fences being 
inserted into that array ?




Well I would still say the best approach would be to insert this 
between the front end and the backend and not rely on signaling 
fences while holding the device srcu.



My question is, even now, when we run 
amdgpu_fence_driver_fini_hw->amdgpu_fence_wait_empty or 
amdgpu_fence_driver_fini_hw->amdgpu_fence_driver_force_completion, 
what prevents a race with another fence being emitted and 
inserted into the fence array at the same time ? Looks like nothing.




Each ring can only be used by one thread at the same time, this 
includes emitting fences as well as other stuff.


During GPU reset we make sure nobody writes to the rings by 
stopping the scheduler and taking the GPU reset lock (so that 
nobody else can start the scheduler again).



What about direct submissions not through scheduler - 
amdgpu_job_submit_direct, I don't see how this is protected.


Those only happen during startup and GPU reset.



Ok, but then looks like I am missing something, see the following 
steps in amdgpu_pci_remove -


1) Use disable_irq API function to stop and flush all in flight HW 
interrupts handlers


2) Grab the reset lock and stop all the schedulers

After above 2 steps the HW fences array is idle, no more insertions 
and no more extractions from the array


3) Run one time amdgpu_fence_process to signal all current HW fences

4) Set drm_dev_unplug (will 'flush' all in flight IOCTLs), release 
the GPU reset lock and go on with the rest of the sequence (cancel 
timers, work items e.t.c)


What's problematic in this sequence ?


drm_dev_unplug() will wait for the IOCTLs to finish.

The IOCTLs in turn can wait for fences. That can be both hardware 
fences, scheduler fences, as well as fences from other devices (and 
KIQ fences for register writes under SRIOV, but we can hopefully 
ignore them for now).


We have handled the hardware fences, but we have no idea when the 
scheduler fences or the fences from other devices will signal.


Scheduler fences won't signal until the scheduler threads are 
restarted or we somehow cancel the submissions. Doable, but tricky as 
well.



For scheduler fences I am not worried; the sched_fence->finished 
fences are by definition attached to HW fences which already 
signaled. For sched_fence->scheduled we should run 
drm_sched_entity_kill_jobs for each entity after stopping the scheduler 
threads and before setting drm_dev_unplug.





For waiting for other device I have no idea if that couldn't deadlock 
somehow.



Yeah, not sure about imported fences and dma-bufs; I would assume the other 
devices should not be impacted by our device removal but, who knows...


So I guess we are NOT going with finalizing HW fences before 
drm_dev_unplug and instead will just call drm_dev_enter/exit at the 
back-ends all over the place where there are MMIO accesses ?
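As a reminder of what that scoping looks like, a minimal sketch; 
my_mmio_write() is a made-up helper, only the drm_dev_enter()/drm_dev_exit() 
pattern and the usual amdgpu register layout are assumed here:

#include <drm/drm_drv.h>

static void my_mmio_write(struct amdgpu_device *adev, u32 reg, u32 val)
{
    int idx;

    /* bail out if the device is already unplugged */
    if (!drm_dev_enter(adev_to_drm(adev), &idx))
        return;

    writel(val, ((void __iomem *)adev->rmmio) + (reg * 4));

    drm_dev_exit(idx);
}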


Andrey



Regards,
Christian.



Andrey









BTW: Could it be that the device SRCU protects more than one 
device and we deadlock because of this?



I haven't actually experienced any deadlock until now but, yes, 
drm_unplug_srcu is defined as static in drm_drv.c and so in the 
presence  of multiple devices from same or different drivers we 
in fact are dependent on all their critical sections i guess.




Shit, yeah the devil is a squirrel. So for A+I laptops we actually 
need to sync that up with Daniel and the rest of the i915 guys.


IIRC we could actually have an amdgpu device in a docking station 
which needs hotplug and the driver might depend on waiting for the 
i915 driver as well.



Can't we propose a patch to make drm_unplug_srcu per drm_device ? I 
don't see why it has to be a global rather than a per-device thing.


I'm really wondering the same thing for quite a while now.

Adding Daniel as well, maybe he knows why the drm_unplug_srcu is 
global.
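Just to illustrate the idea (purely hypothetical, not an actual patch 
against drm_drv.c): struct drm_device would grow its own unplug_srcu that 
drm_dev_init() initializes, and the helpers would use it, with 
drm_dev_exit() gaining a device parameter:

bool drm_dev_enter(struct drm_device *dev, int *idx)
{
    *idx = srcu_read_lock(&dev->unplug_srcu);

    if (dev->unplugged) {
        srcu_read_unlock(&dev->unplug_srcu, *idx);
        return false;
    }

    return true;
}

void drm_dev_exit(struct drm_device *dev, int idx)
{
    srcu_read_unlock(&dev->unplug_srcu, idx);
}

void drm_dev_unplug(struct drm_device *dev)
{
    dev->unplugged = true;
    /* only waits for *this* device's critical sections now */
    synchronize_srcu(&dev->unplug_srcu);
    drm_dev_put(dev);
}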


Regards,
Christian.



Andrey




Christian.


Andrey




Christian.


Andrey





Andrey




Christian.

    /* Past this point no more fences are submitted to the HW 
     * ring and hence we can safely force-signal all that 
     * are currently there.
     * Any subsequently created HW fences will be returned 
     * signaled with an error code right away
     */

    for_each_ring(adev)
        amdgpu_fence_process(ring)

    drm_dev_unplug(dev);
    Stop schedulers
    cancel_sync(all timers and queued works);
    hw_fini
    unmap_mmio

}


Andrey









Alternatively grabbing the reset write side and 
stopping and then restarting the scheduler could work 
as well.

Re: [PATCH 0/4] Refine GPU recovery sequence to enhance its stability

2021-04-13 Thread Christian König

Am 13.04.21 um 17:12 schrieb Andrey Grodzovsky:


On 2021-04-13 3:10 a.m., Christian König wrote:

Am 12.04.21 um 22:01 schrieb Andrey Grodzovsky:


On 2021-04-12 3:18 p.m., Christian König wrote:

Am 12.04.21 um 21:12 schrieb Andrey Grodzovsky:

[SNIP]


So what's the right approach ? How we guarantee that when 
running amdgpu_fence_driver_force_completion we will signal all 
the HW fences and not racing against some more fences insertion 
into that array ?




Well I would still say the best approach would be to insert this 
between the front end and the backend and not rely on signaling 
fences while holding the device srcu.



My question is, even now, when we run 
amdgpu_fence_driver_fini_hw->amdgpu_fence_wait_empty or 
amdgpu_fence_driver_fini_hw->amdgpu_fence_driver_force_completion, 
what there prevents a race with another fence being at the same 
time emitted and inserted into the fence array ? Looks like nothing.




Each ring can only be used by one thread at the same time, this 
includes emitting fences as well as other stuff.


During GPU reset we make sure nobody writes to the rings by 
stopping the scheduler and taking the GPU reset lock (so that 
nobody else can start the scheduler again).



What about direct submissions not through scheduler - 
amdgpu_job_submit_direct, I don't see how this is protected.


Those only happen during startup and GPU reset.



Ok, but then looks like I am missing something, see the following 
steps in amdgpu_pci_remove -


1) Use disable_irq API function to stop and flush all in flight HW 
interrupts handlers


2) Grab the reset lock and stop all the schedulers

After above 2 steps the HW fences array is idle, no more insertions 
and no more extractions from the array


3) Run one time amdgpu_fence_process to signal all current HW fences

4) Set drm_dev_unplug (will 'flush' all in flight IOCTLs), release the 
GPU reset lock and go on with the rest of the sequence (cancel timers, 
work items e.t.c)


What's problematic in this sequence ?


drm_dev_unplug() will wait for the IOCTLs to finish.

The IOCTLs in turn can wait for fences. That can be both hardware 
fences, scheduler fences, as well as fences from other devices (and KIQ 
fences for register writes under SRIOV, but we can hopefully ignore them 
for now).


We have handled the hardware fences, but we have no idea when the 
scheduler fences or the fences from other devices will signal.


Scheduler fences won't signal until the scheduler threads are restarted 
or we somehow cancel the submissions. Doable, but tricky as well.


For waiting for other device I have no idea if that couldn't deadlock 
somehow.


Regards,
Christian.



Andrey









BTW: Could it be that the device SRCU protects more than one 
device and we deadlock because of this?



I haven't actually experienced any deadlock until now but, yes, 
drm_unplug_srcu is defined as static in drm_drv.c and so in the 
presence  of multiple devices from same or different drivers we in 
fact are dependent on all their critical sections i guess.




Shit, yeah the devil is a squirrel. So for A+I laptops we actually 
need to sync that up with Daniel and the rest of the i915 guys.


IIRC we could actually have an amdgpu device in a docking station 
which needs hotplug and the driver might depend on waiting for the 
i915 driver as well.



Can't we propose a patch to make drm_unplug_srcu per drm_device ? I 
don't see why it has to be global and not per device thing.


I'm really wondering the same thing for quite a while now.

Adding Daniel as well, maybe he knows why the drm_unplug_srcu is global.

Regards,
Christian.



Andrey




Christian.


Andrey




Christian.


Andrey





Andrey




Christian.

    /* Past this point no more fence are submitted to HW 
ring and hence we can safely call force signal on all that 
are currently there.
 * Any subsequently created  HW fences will be returned 
signaled with an error code right away

 */

    for_each_ring(adev)
        amdgpu_fence_process(ring)

    drm_dev_unplug(dev);
    Stop schedulers
    cancel_sync(all timers and queued works);
    hw_fini
    unmap_mmio

}


Andrey









Alternatively grabbing the reset write side and 
stopping and then restarting the scheduler could work 
as well.


Christian.



I didn't get the above and I don't see why I need to 
reuse the GPU reset rw_lock. I rely on the SRCU unplug 
flag for unplug. Also, it's not clear to me why we are 
focusing on the scheduler threads; any code path that can 
generate HW fences should be covered, so any code 
leading to amdgpu_fence_emit needs to be taken into 
account, such as direct IB submissions, VM flushes, etc.


You need to work together with the reset lock anyway, 
because a hotplug could run at the same time as a reset.



For going my way, indeed I now see that I have to take the 
reset write side lock during HW fence signalling in order 
to protect against scheduler/HW fence detachment and 
reattachment during scheduler stop/restart.

Re: [PATCH 0/4] Refine GPU recovery sequence to enhance its stability

2021-04-13 Thread Andrey Grodzovsky


On 2021-04-13 3:10 a.m., Christian König wrote:

Am 12.04.21 um 22:01 schrieb Andrey Grodzovsky:


On 2021-04-12 3:18 p.m., Christian König wrote:

Am 12.04.21 um 21:12 schrieb Andrey Grodzovsky:

[SNIP]


So what's the right approach ? How we guarantee that when running 
amdgpu_fence_driver_force_completion we will signal all the HW 
fences and not racing against some more fences insertion into 
that array ?




Well I would still say the best approach would be to insert this 
between the front end and the backend and not rely on signaling 
fences while holding the device srcu.



My question is, even now, when we run 
amdgpu_fence_driver_fini_hw->amdgpu_fence_wait_empty or 
amdgpu_fence_driver_fini_hw->amdgpu_fence_driver_force_completion, 
what there prevents a race with another fence being at the same 
time emitted and inserted into the fence array ? Looks like nothing.




Each ring can only be used by one thread at the same time, this 
includes emitting fences as well as other stuff.


During GPU reset we make sure nobody writes to the rings by stopping 
the scheduler and taking the GPU reset lock (so that nobody else can 
start the scheduler again).



What about direct submissions not through scheduler - 
amdgpu_job_submit_direct, I don't see how this is protected.


Those only happen during startup and GPU reset.



Ok, but then looks like I am missing something, see the following steps 
in amdgpu_pci_remove -


1) Use disable_irq API function to stop and flush all in flight HW 
interrupts handlers


2) Grab the reset lock and stop all the schedulers

After above 2 steps the HW fences array is idle, no more insertions and 
no more extractions from the array


3) Run one time amdgpu_fence_process to signal all current HW fences

4) Set drm_dev_unplug (will 'flush' all in flight IOCTLs), release the 
GPU reset lock and go on with the rest of the sequence (cancel timers, 
work items e.t.c)


What's problematic in this sequence ?

Andrey









BTW: Could it be that the device SRCU protects more than one 
device and we deadlock because of this?



I haven't actually experienced any deadlock until now but, yes, 
drm_unplug_srcu is defined as static in drm_drv.c and so in the 
presence  of multiple devices from same or different drivers we in 
fact are dependent on all their critical sections i guess.




Shit, yeah the devil is a squirrel. So for A+I laptops we actually 
need to sync that up with Daniel and the rest of the i915 guys.


IIRC we could actually have an amdgpu device in a docking station 
which needs hotplug and the driver might depend on waiting for the 
i915 driver as well.



Can't we propose a patch to make drm_unplug_srcu per drm_device ? I 
don't see why it has to be global and not per device thing.


I'm really wondering the same thing for quite a while now.

Adding Daniel as well, maybe he knows why the drm_unplug_srcu is global.

Regards,
Christian.



Andrey




Christian.


Andrey




Christian.


Andrey





Andrey




Christian.

    /* Past this point no more fence are submitted to HW ring 
and hence we can safely call force signal on all that are 
currently there.
 * Any subsequently created  HW fences will be returned 
signaled with an error code right away

 */

    for_each_ring(adev)
        amdgpu_fence_process(ring)

    drm_dev_unplug(dev);
    Stop schedulers
    cancel_sync(all timers and queued works);
    hw_fini
    unmap_mmio

}


Andrey









Alternatively grabbing the reset write side and stopping 
and then restarting the scheduler could work as well.


Christian.



I didn't get the above and I don't see why I need to 
reuse the GPU reset rw_lock. I rely on the SRCU unplug 
flag for unplug. Also, not clear to me why are we 
focusing on the scheduler threads, any code patch to 
generate HW fences should be covered, so any code leading 
to amdgpu_fence_emit needs to be taken into account such 
as, direct IB submissions, VM flushes e.t.c


You need to work together with the reset lock anyway, 
cause a hotplug could run at the same time as a reset.



For going my way indeed now I see now that I have to take 
reset write side lock during HW fences signalling in order 
to protect against scheduler/HW fences detachment and 
reattachment during schedulers stop/restart. But if we go 
with your approach  then calling drm_dev_unplug and scoping 
amdgpu_job_timeout with drm_dev_enter/exit should be enough 
to prevent any concurrent GPU resets during unplug. In fact 
I already do it anyway - 

Re: [PATCH 0/4] Refine GPU recovery sequence to enhance its stability

2021-04-13 Thread Christian König

Hi Dennis,

yeah, that just has the same downside of a lot of additional overhead 
as the is_signaled callback.


Bouncing cache lines on the CPU isn't funny at all.

Christian.

Am 13.04.21 um 11:13 schrieb Li, Dennis:

[AMD Official Use Only - Internal Distribution Only]

Hi, Christian and Andrey,
   Maybe we could try to implement the "wait" callback of dma_fence_ops: 
when a GPU reset or unplug happens, make this callback return -ENODEV to 
notify the caller that the device is lost.

 * Must return -ERESTARTSYS if the wait is intr = true and the wait was
 * interrupted, and remaining jiffies if fence has signaled, or 0 if wait
 * timed out. Can also return other error values on custom implementations,
 * which should be treated as if the fence is signaled. For example a
 * hardware lockup could be reported like that.
 *
 * This callback is optional.
 */
signed long (*wait)(struct dma_fence *fence,
                    bool intr, signed long timeout);
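A minimal sketch of what is being suggested, assuming the driver can get 
from the fence to some per-device 'unplugged' flag (my_fence_to_dev() and 
struct my_device are placeholders, not real amdgpu names):

#include <linux/dma-fence.h>

static signed long my_fence_wait(struct dma_fence *fence,
                                 bool intr, signed long timeout)
{
    struct my_device *mydev = my_fence_to_dev(fence);   /* placeholder */

    /* device gone: per the documentation above, a custom error return
     * is treated as if the fence had signaled, so the caller unblocks */
    if (READ_ONCE(mydev->unplugged))
        return -ENODEV;

    return dma_fence_default_wait(fence, intr, timeout);
}

static const struct dma_fence_ops my_fence_ops = {
    /* .get_driver_name / .get_timeline_name etc. omitted */
    .wait = my_fence_wait,
};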

Best Regards
Dennis Li
-Original Message-
From: Christian König 
Sent: Tuesday, April 13, 2021 3:10 PM
To: Grodzovsky, Andrey ; Koenig, Christian ; Li, Dennis 
; amd-gfx@lists.freedesktop.org; Deucher, Alexander ; Kuehling, 
Felix ; Zhang, Hawking ; Daniel Vetter 
Subject: Re: [PATCH 0/4] Refine GPU recovery sequence to enhance its stability

Am 12.04.21 um 22:01 schrieb Andrey Grodzovsky:

On 2021-04-12 3:18 p.m., Christian König wrote:

Am 12.04.21 um 21:12 schrieb Andrey Grodzovsky:

[SNIP]

So what's the right approach ? How we guarantee that when running
amdgpu_fence_driver_force_completion we will signal all the HW
fences and not racing against some more fences insertion into that
array ?


Well I would still say the best approach would be to insert this
between the front end and the backend and not rely on signaling
fences while holding the device srcu.


My question is, even now, when we run
amdgpu_fence_driver_fini_hw->amdgpu_fence_wait_empty or
amdgpu_fence_driver_fini_hw->amdgpu_fence_driver_force_completion,
what there prevents a race with another fence being at the same time
emitted and inserted into the fence array ? Looks like nothing.


Each ring can only be used by one thread at the same time, this
includes emitting fences as well as other stuff.

During GPU reset we make sure nobody writes to the rings by stopping
the scheduler and taking the GPU reset lock (so that nobody else can
start the scheduler again).


What about direct submissions not through scheduler -
amdgpu_job_submit_direct, I don't see how this is protected.

Those only happen during startup and GPU reset.


BTW: Could it be that the device SRCU protects more than one device
and we deadlock because of this?


I haven't actually experienced any deadlock until now but, yes,
drm_unplug_srcu is defined as static in drm_drv.c and so in the
presence  of multiple devices from same or different drivers we in
fact are dependent on all their critical sections i guess.


Shit, yeah the devil is a squirrel. So for A+I laptops we actually
need to sync that up with Daniel and the rest of the i915 guys.

IIRC we could actually have an amdgpu device in a docking station
which needs hotplug and the driver might depend on waiting for the
i915 driver as well.


Can't we propose a patch to make drm_unplug_srcu per drm_device ? I
don't see why it has to be global and not per device thing.

I'm really wondering the same thing for quite a while now.

Adding Daniel as well, maybe he knows why the drm_unplug_srcu is global.

Regards,
Christian.


Andrey



Christian.


Andrey



Christian.


Andrey



Andrey



Christian.


     /* Past this point no more fence are submitted to HW ring
and hence we can safely call force signal on all that are
currently there.
  * Any subsequently created  HW fences will be returned
signaled with an error code right away
  */

     for_each_ring(adev)
         amdgpu_fence_process(ring)

     drm_dev_unplug(dev);
     Stop schedulers
     cancel_sync(all timers and queued works);
     hw_fini
     unmap_mmio

}


Andrey





Alternatively grabbing the reset write side and stopping
and then restarting the scheduler could work as well.

Christian.


I didn't get the above and I don't see why I need to reuse
the GPU reset rw_lock. I rely on the SRCU unplug flag for
unplug. Also, not clear to me why are we focusing on the
scheduler threads, any code patch to generate HW fences
should be covered, so any code leading to
amdgpu_fence_emit needs to be taken into account such as,
direct IB submissions, VM flushes e.t.c

You need to work together with the reset lock anyway, cause
a hotplug could run at the same time as a reset.


For going my way indeed now I see now that I have to take
reset write side lock during HW fences signalling in order
to protect against scheduler/HW fences detachment and
reattachment during schedulers stop/restart. But if we go
wi

RE: [PATCH 0/4] Refine GPU recovery sequence to enhance its stability

2021-04-13 Thread Li, Dennis
[AMD Official Use Only - Internal Distribution Only]

Hi, Christian and Andrey,
  Maybe we could try to implement the "wait" callback of dma_fence_ops: when a 
GPU reset or unplug happens, make this callback return -ENODEV to notify the 
caller that the device is lost.

 * Must return -ERESTARTSYS if the wait is intr = true and the wait was
 * interrupted, and remaining jiffies if fence has signaled, or 0 if wait
 * timed out. Can also return other error values on custom implementations,
 * which should be treated as if the fence is signaled. For example a
 * hardware lockup could be reported like that.
 *
 * This callback is optional.
 */
signed long (*wait)(struct dma_fence *fence,
                    bool intr, signed long timeout);

Best Regards
Dennis Li
-Original Message-
From: Christian König  
Sent: Tuesday, April 13, 2021 3:10 PM
To: Grodzovsky, Andrey ; Koenig, Christian 
; Li, Dennis ; 
amd-gfx@lists.freedesktop.org; Deucher, Alexander ; 
Kuehling, Felix ; Zhang, Hawking 
; Daniel Vetter 
Subject: Re: [PATCH 0/4] Refine GPU recovery sequence to enhance its stability

Am 12.04.21 um 22:01 schrieb Andrey Grodzovsky:
>
> On 2021-04-12 3:18 p.m., Christian König wrote:
>> Am 12.04.21 um 21:12 schrieb Andrey Grodzovsky:
>>> [SNIP]
>>>>>
>>>>> So what's the right approach ? How we guarantee that when running 
>>>>> amdgpu_fence_driver_force_completion we will signal all the HW 
>>>>> fences and not racing against some more fences insertion into that 
>>>>> array ?
>>>>>
>>>>
>>>> Well I would still say the best approach would be to insert this 
>>>> between the front end and the backend and not rely on signaling 
>>>> fences while holding the device srcu.
>>>
>>>
>>> My question is, even now, when we run 
>>> amdgpu_fence_driver_fini_hw->amdgpu_fence_wait_empty or 
>>> amdgpu_fence_driver_fini_hw->amdgpu_fence_driver_force_completion,
>>> what there prevents a race with another fence being at the same time 
>>> emitted and inserted into the fence array ? Looks like nothing.
>>>
>>
>> Each ring can only be used by one thread at the same time, this 
>> includes emitting fences as well as other stuff.
>>
>> During GPU reset we make sure nobody writes to the rings by stopping 
>> the scheduler and taking the GPU reset lock (so that nobody else can 
>> start the scheduler again).
>
>
> What about direct submissions not through scheduler - 
> amdgpu_job_submit_direct, I don't see how this is protected.

Those only happen during startup and GPU reset.

>>
>>>>
>>>> BTW: Could it be that the device SRCU protects more than one device 
>>>> and we deadlock because of this?
>>>
>>>
>>> I haven't actually experienced any deadlock until now but, yes, 
>>> drm_unplug_srcu is defined as static in drm_drv.c and so in the 
>>> presence  of multiple devices from same or different drivers we in 
>>> fact are dependent on all their critical sections i guess.
>>>
>>
>> Shit, yeah the devil is a squirrel. So for A+I laptops we actually 
>> need to sync that up with Daniel and the rest of the i915 guys.
>>
>> IIRC we could actually have an amdgpu device in a docking station 
>> which needs hotplug and the driver might depend on waiting for the
>> i915 driver as well.
>
>
> Can't we propose a patch to make drm_unplug_srcu per drm_device ? I 
> don't see why it has to be global and not per device thing.

I'm really wondering the same thing for quite a while now.

Adding Daniel as well, maybe he knows why the drm_unplug_srcu is global.

Regards,
Christian.

>
> Andrey
>
>
>>
>> Christian.
>>
>>> Andrey
>>>
>>>
>>>>
>>>> Christian.
>>>>
>>>>> Andrey
>>>>>
>>>>>
>>>>>>
>>>>>>> Andrey
>>>>>>>
>>>>>>>
>>>>>>>>
>>>>>>>> Christian.
>>>>>>>>
>>>>>>>>>     /* Past this point no more fence are submitted to HW ring 
>>>>>>>>> and hence we can safely call force signal on all that are 
>>>>>>>>> currently there.
>>>>>>>>>  * Any subsequently created  HW fences will be returned 
>>>>>>>>> signaled with an error code right away
>>>>>>>>>  */

Re: [PATCH 0/4] Refine GPU recovery sequence to enhance its stability

2021-04-13 Thread Christian König

Am 12.04.21 um 22:01 schrieb Andrey Grodzovsky:


On 2021-04-12 3:18 p.m., Christian König wrote:

Am 12.04.21 um 21:12 schrieb Andrey Grodzovsky:

[SNIP]


So what's the right approach ? How we guarantee that when running 
amdgpu_fence_driver_force_completion we will signal all the HW 
fences and not racing against some more fences insertion into that 
array ?




Well I would still say the best approach would be to insert this 
between the front end and the backend and not rely on signaling 
fences while holding the device srcu.



My question is, even now, when we run 
amdgpu_fence_driver_fini_hw->amdgpu_fence_wait_empty or 
amdgpu_fence_driver_fini_hw->amdgpu_fence_driver_force_completion, 
what there prevents a race with another fence being at the same time 
emitted and inserted into the fence array ? Looks like nothing.




Each ring can only be used by one thread at the same time, this 
includes emitting fences as well as other stuff.


During GPU reset we make sure nobody writes to the rings by stopping 
the scheduler and taking the GPU reset lock (so that nobody else can 
start the scheduler again).



What about direct submissions not through scheduler - 
amdgpu_job_submit_direct, I don't see how this is protected.


Those only happen during startup and GPU reset.





BTW: Could it be that the device SRCU protects more than one device 
and we deadlock because of this?



I haven't actually experienced any deadlock until now but, yes, 
drm_unplug_srcu is defined as static in drm_drv.c and so in the 
presence  of multiple devices from same or different drivers we in 
fact are dependent on all their critical sections i guess.




Shit, yeah the devil is a squirrel. So for A+I laptops we actually 
need to sync that up with Daniel and the rest of the i915 guys.


IIRC we could actually have an amdgpu device in a docking station 
which needs hotplug and the driver might depend on waiting for the 
i915 driver as well.



Can't we propose a patch to make drm_unplug_srcu per drm_device ? I 
don't see why it has to be global and not per device thing.


I'm really wondering the same thing for quite a while now.

Adding Daniel as well, maybe he knows why the drm_unplug_srcu is global.

Regards,
Christian.



Andrey




Christian.


Andrey




Christian.


Andrey





Andrey




Christian.

    /* Past this point no more fence are submitted to HW ring 
and hence we can safely call force signal on all that are 
currently there.
 * Any subsequently created  HW fences will be returned 
signaled with an error code right away

 */

    for_each_ring(adev)
        amdgpu_fence_process(ring)

    drm_dev_unplug(dev);
    Stop schedulers
    cancel_sync(all timers and queued works);
    hw_fini
    unmap_mmio

}


Andrey









Alternatively grabbing the reset write side and stopping 
and then restarting the scheduler could work as well.


Christian.



I didn't get the above and I don't see why I need to reuse 
the GPU reset rw_lock. I rely on the SRCU unplug flag for 
unplug. Also, it's not clear to me why we are focusing on the 
scheduler threads; any code path that can generate HW fences 
should be covered, so any code leading to 
amdgpu_fence_emit needs to be taken into account, such as 
direct IB submissions, VM flushes, etc.


You need to work together with the reset lock anyway, 
because a hotplug could run at the same time as a reset.



For going my way, indeed I now see that I have to take the 
reset write side lock during HW fence signalling in order 
to protect against scheduler/HW fence detachment and 
reattachment during scheduler stop/restart. But if we go 
with your approach then calling drm_dev_unplug and scoping 
amdgpu_job_timeout with drm_dev_enter/exit should be enough 
to prevent any concurrent GPU resets during unplug. In fact 
I already do it anyway - 
https://cgit.freedesktop.org/~agrodzov/linux/commit/?h=drm-misc-next&id=ef0ea4dd29ef44d2649c5eda16c8f4869acc36b1


Yes, good point as well.

Christian.



Andrey





Christian.



Andrey






Christian.



Andrey





Andrey
























Re: [PATCH 0/4] Refine GPU recovery sequence to enhance its stability

2021-04-13 Thread Christian König

Am 13.04.21 um 07:36 schrieb Andrey Grodzovsky:

[SNIP]



emit_fence(fence);


       /* We can't wait forever as the HW might be gone at any point */
       dma_fence_wait_timeout(old_fence, false, 5 * HZ);



You can pretty much ignore this wait here. It is only as a last 
resort so that we never overwrite the ring buffers.



If device is present how can I ignore this ?



I think you missed my question here



Sorry I thought I answered that below.

See this is just the last resort so that we don't need to worry about 
ring buffer overflows during testing.


We should not get here in practice and if we get here generating a 
deadlock might actually be the best handling.


The alternative would be to call BUG().



BTW, I am not sure it's so improbable to get here in the case of a sudden 
device remove: if you are in the middle of rapid command submission to the 
ring at that time, you could easily get a ring buffer overrun, because the 
EOP interrupts are gone and fences are not removed anymore, while new 
ones keep arriving from new submissions which haven't stopped yet.




During normal operation hardware fences are only created by two code paths:
1. The scheduler when it pushes jobs to the hardware.
2. The KIQ when it does register access on SRIOV.

Both are limited in how many submissions could be made.

The only case where this here becomes necessary is during GPU reset when 
we do direct submission bypassing the scheduler for IB and other tests.


Christian.


Andrey





Re: [PATCH 0/4] Refine GPU recovery sequence to enhance its stability

2021-04-12 Thread Andrey Grodzovsky


On 2021-04-12 2:23 p.m., Christian König wrote:

Am 12.04.21 um 20:18 schrieb Andrey Grodzovsky:


On 2021-04-12 2:05 p.m., Christian König wrote:


Am 12.04.21 um 20:01 schrieb Andrey Grodzovsky:


On 2021-04-12 1:44 p.m., Christian König wrote:



Am 12.04.21 um 19:27 schrieb Andrey Grodzovsky:

On 2021-04-10 1:34 p.m., Christian König wrote:

Hi Andrey,

Am 09.04.21 um 20:18 schrieb Andrey Grodzovsky:

[SNIP]


If we use a list and a flag called 'emit_allowed' under a 
lock such that in amdgpu_fence_emit we lock the list, check 
the flag and if true add the new HW fence to the list and proceed 
to HW emission as normal, otherwise return with -ENODEV. In 
amdgpu_pci_remove we take the lock, set the flag to false, 
and then iterate the list and force-signal it. Will this not 
prevent any new HW fence creation from now on from any place 
trying to do so ?


Way too much overhead. The fence processing is intentionally 
lock-free to avoid cache-line bouncing because the IRQ can 
move from CPU to CPU.


We need something which at least the processing of fences in 
the interrupt handler doesn't affect at all.



As far as I see in the code, amdgpu_fence_emit is only called 
from task context. Also, we can skip this list I proposed and 
just use amdgpu_fence_driver_force_completion for each ring to 
signal all created HW fences.


Ah, wait a second this gave me another idea.

See amdgpu_fence_driver_force_completion():

amdgpu_fence_write(ring, ring->fence_drv.sync_seq);

If we change that to something like:

amdgpu_fence_write(ring, ring->fence_drv.sync_seq + 0x3FFF);

Not only the currently submitted, but also the next 0x3FFF 
fences will be considered signaled.


This basically solves our problem of making sure that new fences 
are also signaled, without any additional overhead whatsoever.
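A toy userspace model of why bumping the written value works, using 
0x3FFFFFFF purely as an example of a large offset (this is not the driver 
code, just the sequence-number arithmetic):

#include <stdio.h>
#include <stdint.h>
#include <stdbool.h>

static uint32_t last_written;  /* models the value fence_write puts in memory */
static uint32_t sync_seq;      /* last sequence number actually emitted */

static bool fence_signaled(uint32_t seq)
{
    /* wrap-safe "has the written counter passed seq?" check */
    return (int32_t)(last_written - seq) >= 0;
}

int main(void)
{
    sync_seq = 100;        /* pretend 100 fences were emitted */
    last_written = 97;     /* ...and the HW completed 97 of them */

    printf("fence 99 signaled?  %d\n", fence_signaled(99));   /* 0 */

    /* force completion: write far ahead of everything emitted so far */
    last_written = sync_seq + 0x3FFFFFFF;

    printf("fence 99 signaled?  %d\n", fence_signaled(99));   /* 1 */
    printf("fence 120 signaled? %d\n", fence_signaled(120));  /* 1, even a
                                          fence emitted after the bump */
    return 0;
}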



Problem with this is that the act of setting the sync_seq to some 
MAX value alone is not enough, you actually have to call 
amdgpu_fence_process to iterate and signal the fences currently 
stored in ring->fence_drv.fences array and to guarantee that once 
you done your signalling no more HW fences will be added to that 
array anymore. I was thinking to do something like bellow:




Well we could implement the is_signaled callback once more, but 
I'm not sure if that is a good idea.



This indeed could save the explicit signaling I am doing bellow but 
I also set an error code there which might be helpful to propagate 
to users






amdgpu_fence_emit()

{

    dma_fence_init(fence);

    srcu_read_lock(amdgpu_unplug_srcu)

    if (!adev->unplug) {

        seq = ++ring->fence_drv.sync_seq;
        emit_fence(fence);

       /* We can't wait forever as the HW might be gone at any point */
       dma_fence_wait_timeout(old_fence, false, 5 * HZ);



You can pretty much ignore this wait here. It is only as a last 
resort so that we never overwrite the ring buffers.



If device is present how can I ignore this ?



I think you missed my question here



Sorry I thought I answered that below.

See this is just the last resort so that we don't need to worry about 
ring buffer overflows during testing.


We should not get here in practice and if we get here generating a 
deadlock might actually be the best handling.


The alternative would be to call BUG().



BTW, I am not sure it's so improbable to get here in the case of a sudden 
device remove: if you are in the middle of rapid command submission to the 
ring at that time, you could easily get a ring buffer overrun, because the 
EOP interrupts are gone and fences are not removed anymore, while new ones 
keep arriving from new submissions which haven't stopped yet.
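To see why that last-resort wait bounds ring-buffer reuse, here is a toy 
userspace model of the fence slots (not driver code): slot (seq & mask) is 
recycled every num_fences emissions, so emitting fence seq first has to 
wait on the fence that previously occupied that slot.

#include <stdio.h>
#include <stdint.h>

#define NUM_FENCES 8u                      /* power of two, like the driver */
#define MASK       (NUM_FENCES - 1u)

int main(void)
{
    uint32_t slot_owner[NUM_FENCES] = {0};

    for (uint32_t seq = 1; seq <= 20; seq++) {
        uint32_t slot = seq & MASK;
        uint32_t old  = slot_owner[slot];

        if (old)
            printf("emit %2u: slot %u, must wait on old fence %u\n",
                   seq, slot, old);
        else
            printf("emit %2u: slot %u, empty\n", seq, slot);

        slot_owner[slot] = seq;
    }
    return 0;
}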


Andrey








But it should not have a timeout as far as I can see.



Without a timeout on the wait the whole approach falls apart, as I can't call 
synchronize_srcu on this scope because once the device is physically 
gone the wait here will last forever.




Yeah, but this is intentional. The only alternative to avoid 
corruption is to wait with a timeout and call BUG() if that 
triggers. That isn't much better.






        ring->fence_drv.fences[seq & 
ring->fence_drv.num_fences_mask] = fence;


    } else {

        dma_fence_set_error(fence, -ENODEV);
        dma_fence_signal(fence);

    }

    srcu_read_unlock(amdgpu_unplug_srcu)
    return fence;

}

amdgpu_pci_remove

{

    adev->unplug = true;
    synchronize_srcu(amdgpu_unplug_srcu)



Well that is just duplicating what drm_dev_unplug() should be 
doing on a different level.



drm_dev_unplug is on a much wider scope, for everything in the 
device including 'flushing' in flight IOCTLs, this deals 
specifically with the issue of force signalling HW fences




Yeah, but it adds the same overhead as the device srcu.

Christian.



So what's the right approach ? How we guarantee that when running 
amdgpu_fence_driver_force_completion we will signal all the HW fences 
and not racing against some more fences insertion into that array ?




Well I would still say the best approach would be to insert 

Re: [PATCH 0/4] Refine GPU recovery sequence to enhance its stability

2021-04-12 Thread Andrey Grodzovsky


On 2021-04-12 3:18 p.m., Christian König wrote:

Am 12.04.21 um 21:12 schrieb Andrey Grodzovsky:

[SNIP]


So what's the right approach ? How we guarantee that when running 
amdgpu_fence_driver_force_completion we will signal all the HW 
fences and not racing against some more fences insertion into that 
array ?




Well I would still say the best approach would be to insert this 
between the front end and the backend and not rely on signaling 
fences while holding the device srcu.



My question is, even now, when we run 
amdgpu_fence_driver_fini_hw->amdgpu_fence_wait_empty or 
amdgpu_fence_driver_fini_hw->amdgpu_fence_driver_force_completion, 
what there prevents a race with another fence being at the same time 
emitted and inserted into the fence array ? Looks like nothing.




Each ring can only be used by one thread at the same time, this 
includes emitting fences as well as other stuff.


During GPU reset we make sure nobody writes to the rings by stopping 
the scheduler and taking the GPU reset lock (so that nobody else can 
start the scheduler again).



What about direct submissions not through scheduler - 
amdgpu_job_submit_direct, I don't see how this is protected.







BTW: Could it be that the device SRCU protects more than one device 
and we deadlock because of this?



I haven't actually experienced any deadlock until now but, yes, 
drm_unplug_srcu is defined as static in drm_drv.c and so in the 
presence  of multiple devices from same or different drivers we in 
fact are dependent on all their critical sections i guess.




Shit, yeah the devil is a squirrel. So for A+I laptops we actually 
need to sync that up with Daniel and the rest of the i915 guys.


IIRC we could actually have an amdgpu device in a docking station 
which needs hotplug and the driver might depend on waiting for the 
i915 driver as well.



Can't we propose a patch to make drm_unplug_srcu per drm_device ? I 
don't see why it has to be global and not per device thing.


Andrey




Christian.


Andrey




Christian.


Andrey





Andrey




Christian.

    /* Past this point no more fence are submitted to HW ring 
and hence we can safely call force signal on all that are 
currently there.
 * Any subsequently created  HW fences will be returned 
signaled with an error code right away

 */

    for_each_ring(adev)
        amdgpu_fence_process(ring)

    drm_dev_unplug(dev);
    Stop schedulers
    cancel_sync(all timers and queued works);
    hw_fini
    unmap_mmio

}


Andrey









Alternatively grabbing the reset write side and stopping 
and then restarting the scheduler could work as well.


Christian.



I didn't get the above and I don't see why I need to reuse 
the GPU reset rw_lock. I rely on the SRCU unplug flag for 
unplug. Also, not clear to me why are we focusing on the 
scheduler threads, any code patch to generate HW fences 
should be covered, so any code leading to amdgpu_fence_emit 
needs to be taken into account such as, direct IB 
submissions, VM flushes e.t.c


You need to work together with the reset lock anyway, cause 
a hotplug could run at the same time as a reset.



For going my way indeed now I see now that I have to take 
reset write side lock during HW fences signalling in order to 
protect against scheduler/HW fences detachment and 
reattachment during schedulers stop/restart. But if we go 
with your approach  then calling drm_dev_unplug and scoping 
amdgpu_job_timeout with drm_dev_enter/exit should be enough 
to prevent any concurrent GPU resets during unplug. In fact I 
already do it anyway - 
https://cgit.freedesktop.org/~agrodzov/linux/commit/?h=drm-misc-next&id=ef0ea4dd29ef44d2649c5eda16c8f4869acc36b1


Yes, good point as well.

Christian.



Andrey





Christian.



Andrey






Christian.



Andrey





Andrey























Re: [PATCH 0/4] Refine GPU recovery sequence to enhance its stability

2021-04-12 Thread Christian König

Am 12.04.21 um 21:12 schrieb Andrey Grodzovsky:

[SNIP]


So what's the right approach ? How we guarantee that when running 
amdgpu_fence_driver_force_completion we will signal all the HW 
fences and not racing against some more fences insertion into that 
array ?




Well I would still say the best approach would be to insert this 
between the front end and the backend and not rely on signaling 
fences while holding the device srcu.



My question is, even now, when we run 
amdgpu_fence_driver_fini_hw->amdgpu_fence_wait_empty or 
amdgpu_fence_driver_fini_hw->amdgpu_fence_driver_force_completion, 
what there prevents a race with another fence being at the same time 
emitted and inserted into the fence array ? Looks like nothing.




Each ring can only be used by one thread at the same time, this includes 
emitting fences as well as other stuff.


During GPU reset we make sure nobody writes to the rings by stopping the 
scheduler and taking the GPU reset lock (so that nobody else can start 
the scheduler again).




BTW: Could it be that the device SRCU protects more than one device 
and we deadlock because of this?



I haven't actually experienced any deadlock until now but, yes, 
drm_unplug_srcu is defined as static in drm_drv.c and so in the 
presence  of multiple devices from same or different drivers we in 
fact are dependent on all their critical sections i guess.




Shit, yeah the devil is a squirrel. So for A+I laptops we actually need 
to sync that up with Daniel and the rest of the i915 guys.


IIRC we could actually have an amdgpu device in a docking station which 
needs hotplug and the driver might depend on waiting for the i915 driver 
as well.


Christian.


Andrey




Christian.


Andrey





Andrey




Christian.

    /* Past this point no more fence are submitted to HW ring 
and hence we can safely call force signal on all that are 
currently there.
 * Any subsequently created  HW fences will be returned 
signaled with an error code right away

 */

    for_each_ring(adev)
        amdgpu_fence_process(ring)

    drm_dev_unplug(dev);
    Stop schedulers
    cancel_sync(all timers and queued works);
    hw_fini
    unmap_mmio

}


Andrey









Alternatively grabbing the reset write side and stopping 
and then restarting the scheduler could work as well.


Christian.



I didn't get the above and I don't see why I need to reuse 
the GPU reset rw_lock. I rely on the SRCU unplug flag for 
unplug. Also, not clear to me why are we focusing on the 
scheduler threads, any code patch to generate HW fences 
should be covered, so any code leading to amdgpu_fence_emit 
needs to be taken into account such as, direct IB 
submissions, VM flushes e.t.c


You need to work together with the reset lock anyway, cause a 
hotplug could run at the same time as a reset.



For going my way indeed now I see now that I have to take 
reset write side lock during HW fences signalling in order to 
protect against scheduler/HW fences detachment and 
reattachment during schedulers stop/restart. But if we go with 
your approach  then calling drm_dev_unplug and scoping 
amdgpu_job_timeout with drm_dev_enter/exit should be enough to 
prevent any concurrent GPU resets during unplug. In fact I 
already do it anyway - 
https://cgit.freedesktop.org/~agrodzov/linux/commit/?h=drm-misc-next&id=ef0ea4dd29ef44d2649c5eda16c8f4869acc36b1


Yes, good point as well.

Christian.



Andrey





Christian.



Andrey






Christian.



Andrey





Andrey






















Re: [PATCH 0/4] Refine GPU recovery sequence to enhance its stability

2021-04-12 Thread Andrey Grodzovsky


On 2021-04-12 2:23 p.m., Christian König wrote:

Am 12.04.21 um 20:18 schrieb Andrey Grodzovsky:


On 2021-04-12 2:05 p.m., Christian König wrote:


Am 12.04.21 um 20:01 schrieb Andrey Grodzovsky:


On 2021-04-12 1:44 p.m., Christian König wrote:



Am 12.04.21 um 19:27 schrieb Andrey Grodzovsky:

On 2021-04-10 1:34 p.m., Christian König wrote:

Hi Andrey,

Am 09.04.21 um 20:18 schrieb Andrey Grodzovsky:

[SNIP]


If we use a list and a flag called 'emit_allowed' under a 
lock such that in amdgpu_fence_emit we lock the list, check 
the flag and if true add the new HW fence to list and proceed 
to HW emition as normal, otherwise return with -ENODEV. In 
amdgpu_pci_remove we take the lock, set the flag to false, 
and then iterate the list and force signal it. Will this not 
prevent any new HW fence creation from now on from any place 
trying to do so ?


Way to much overhead. The fence processing is intentionally 
lock free to avoid cache line bouncing because the IRQ can 
move from CPU to CPU.


We need something which at least the processing of fences in 
the interrupt handler doesn't affect at all.



As far as I see in the code, amdgpu_fence_emit is only called 
from task context. Also, we can skip this list I proposed and 
just use amdgpu_fence_driver_force_completion for each ring to 
signal all created HW fences.


Ah, wait a second this gave me another idea.

See amdgpu_fence_driver_force_completion():

amdgpu_fence_write(ring, ring->fence_drv.sync_seq);

If we change that to something like:

amdgpu_fence_write(ring, ring->fence_drv.sync_seq + 0x3FFF);

Not only the currently submitted, but also the next 0x3FFF 
fences will be considered signaled.


This basically solves out problem of making sure that new fences 
are also signaled without any additional overhead whatsoever.



Problem with this is that the act of setting the sync_seq to some 
MAX value alone is not enough, you actually have to call 
amdgpu_fence_process to iterate and signal the fences currently 
stored in ring->fence_drv.fences array and to guarantee that once 
you done your signalling no more HW fences will be added to that 
array anymore. I was thinking to do something like bellow:




Well we could implement the is_signaled callback once more, but 
I'm not sure if that is a good idea.



This indeed could save the explicit signaling I am doing bellow but 
I also set an error code there which might be helpful to propagate 
to users






amdgpu_fence_emit()

{

    dma_fence_init(fence);

    srcu_read_lock(amdgpu_unplug_srcu)

    if (!adev->unplug) {

        seq = ++ring->fence_drv.sync_seq;
        emit_fence(fence);

       /* We can't wait forever as the HW might be gone at any point */
       dma_fence_wait_timeout(old_fence, false, 5 * HZ);



You can pretty much ignore this wait here. It is only as a last 
resort so that we never overwrite the ring buffers.



If device is present how can I ignore this ?



I think you missed my question here



Sorry I thought I answered that below.

See this is just the last resort so that we don't need to worry about 
ring buffer overflows during testing.


We should not get here in practice and if we get here generating a 
deadlock might actually be the best handling.


The alternative would be to call BUG().





But it should not have a timeout as far as I can see.



Without a timeout on the wait the whole approach falls apart, as I can't call 
synchronize_srcu on this scope because once the device is physically gone 
the wait here will last forever.




Yeah, but this is intentional. The only alternative to avoid 
corruption is to wait with a timeout and call BUG() if that 
triggers. That isn't much better.






        ring->fence_drv.fences[seq & 
ring->fence_drv.num_fences_mask] = fence;


    } else {

        dma_fence_set_error(fence, -ENODEV);
        dma_fence_signal(fence);

    }

    srcu_read_unlock(amdgpu_unplug_srcu)
    return fence;

}

amdgpu_pci_remove

{

    adev->unplug = true;
    synchronize_srcu(amdgpu_unplug_srcu)



Well that is just duplicating what drm_dev_unplug() should be 
doing on a different level.



drm_dev_unplug is on a much wider scope, for everything in the 
device including 'flushing' in flight IOCTLs, this deals 
specifically with the issue of force signalling HW fences




Yeah, but it adds the same overhead as the device srcu.

Christian.



So what's the right approach ? How we guarantee that when running 
amdgpu_fence_driver_force_completion we will signal all the HW fences 
and not racing against some more fences insertion into that array ?




Well I would still say the best approach would be to insert this 
between the front end and the backend and not rely on signaling fences 
while holding the device srcu.



My question is, even now, when we run 
amdgpu_fence_driver_fini_hw->amdgpu_fence_wait_empty or 
amdgpu_fence_driver_fini_hw->amdgpu_fence_driver_force_completion, what 
prevents a race there with another fence being emitted and inserted into 
the fence array at the same time ?

Re: [PATCH 0/4] Refine GPU recovery sequence to enhance its stability

2021-04-12 Thread Christian König

Am 12.04.21 um 20:18 schrieb Andrey Grodzovsky:


On 2021-04-12 2:05 p.m., Christian König wrote:


Am 12.04.21 um 20:01 schrieb Andrey Grodzovsky:


On 2021-04-12 1:44 p.m., Christian König wrote:



Am 12.04.21 um 19:27 schrieb Andrey Grodzovsky:

On 2021-04-10 1:34 p.m., Christian König wrote:

Hi Andrey,

Am 09.04.21 um 20:18 schrieb Andrey Grodzovsky:

[SNIP]


If we use a list and a flag called 'emit_allowed' under a lock 
such that in amdgpu_fence_emit we lock the list, check the 
flag and if true add the new HW fence to list and proceed to 
HW emition as normal, otherwise return with -ENODEV. In 
amdgpu_pci_remove we take the lock, set the flag to false, and 
then iterate the list and force signal it. Will this not 
prevent any new HW fence creation from now on from any place 
trying to do so ?


Way to much overhead. The fence processing is intentionally 
lock free to avoid cache line bouncing because the IRQ can move 
from CPU to CPU.


We need something which at least the processing of fences in 
the interrupt handler doesn't affect at all.



As far as I see in the code, amdgpu_fence_emit is only called 
from task context. Also, we can skip this list I proposed and 
just use amdgpu_fence_driver_force_completion for each ring to 
signal all created HW fences.


Ah, wait a second this gave me another idea.

See amdgpu_fence_driver_force_completion():

amdgpu_fence_write(ring, ring->fence_drv.sync_seq);

If we change that to something like:

amdgpu_fence_write(ring, ring->fence_drv.sync_seq + 0x3FFF);

Not only the currently submitted, but also the next 0x3FFF 
fences will be considered signaled.


This basically solves out problem of making sure that new fences 
are also signaled without any additional overhead whatsoever.



Problem with this is that the act of setting the sync_seq to some 
MAX value alone is not enough, you actually have to call 
amdgpu_fence_process to iterate and signal the fences currently 
stored in ring->fence_drv.fences array and to guarantee that once 
you done your signalling no more HW fences will be added to that 
array anymore. I was thinking to do something like bellow:




Well we could implement the is_signaled callback once more, but I'm 
not sure if that is a good idea.



This indeed could save the explicit signaling I am doing bellow but 
I also set an error code there which might be helpful to propagate 
to users






amdgpu_fence_emit()

{

    dma_fence_init(fence);

    srcu_read_lock(amdgpu_unplug_srcu)

    if (!adev->unplug) {

        seq = ++ring->fence_drv.sync_seq;
        emit_fence(fence);

       /* We can't wait forever as the HW might be gone at any point */
       dma_fence_wait_timeout(old_fence, false, 5 * HZ);



You can pretty much ignore this wait here. It is only as a last 
resort so that we never overwrite the ring buffers.



If device is present how can I ignore this ?



I think you missed my question here



Sorry I thought I answered that below.

See this is just the last resort so that we don't need to worry about 
ring buffer overflows during testing.


We should not get here in practice and if we get here generating a 
deadlock might actually be the best handling.


The alternative would be to call BUG().





But it should not have a timeout as far as I can see.



Without timeout wait the who approach falls apart as I can't call 
srcu_synchronize on this scope because once device is physically 
gone the wait here will be forever




Yeah, but this is intentional. The only alternative to avoid 
corruption is to wait with a timeout and call BUG() if that triggers. 
That isn't much better.






        ring->fence_drv.fences[seq & 
ring->fence_drv.num_fences_mask] = fence;


    } else {

        dma_fence_set_error(fence, -ENODEV);
        dma_fence_signal(fence);

    }

    srcu_read_unlock(amdgpu_unplug_srcu)
    return fence;

}

amdgpu_pci_remove

{

    adev->unplug = true;
    synchronize_srcu(amdgpu_unplug_srcu)



Well that is just duplicating what drm_dev_unplug() should be doing 
on a different level.



drm_dev_unplug is on a much wider scope, for everything in the 
device including 'flushing' in flight IOCTLs, this deals 
specifically with the issue of force signalling HW fences




Yeah, but it adds the same overhead as the device srcu.

Christian.



So what's the right approach ? How we guarantee that when running 
amdgpu_fence_driver_force_completion we will signal all the HW fences 
and not racing against some more fences insertion into that array ?




Well I would still say the best approach would be to insert this between 
the front end and the backend and not rely on signaling fences while 
holding the device srcu.


BTW: Could it be that the device SRCU protects more than one device and 
we deadlock because of this?


Christian.


Andrey





Andrey




Christian.

    /* Past this point no more fences are submitted to the HW ring and 
hence we can safely force-signal all that are currently 
there.
     * Any subsequently created HW fences will be returned signaled 
with an error code right away */

Re: [PATCH 0/4] Refine GPU recovery sequence to enhance its stability

2021-04-12 Thread Andrey Grodzovsky


On 2021-04-12 2:05 p.m., Christian König wrote:

Am 12.04.21 um 20:01 schrieb Andrey Grodzovsky:


On 2021-04-12 1:44 p.m., Christian König wrote:



Am 12.04.21 um 19:27 schrieb Andrey Grodzovsky:

On 2021-04-10 1:34 p.m., Christian König wrote:

Hi Andrey,

Am 09.04.21 um 20:18 schrieb Andrey Grodzovsky:

[SNIP]


If we use a list and a flag called 'emit_allowed' under a lock 
such that in amdgpu_fence_emit we lock the list, check the flag 
and if true add the new HW fence to list and proceed to HW 
emition as normal, otherwise return with -ENODEV. In 
amdgpu_pci_remove we take the lock, set the flag to false, and 
then iterate the list and force signal it. Will this not 
prevent any new HW fence creation from now on from any place 
trying to do so ?


Way to much overhead. The fence processing is intentionally lock 
free to avoid cache line bouncing because the IRQ can move from 
CPU to CPU.


We need something which at least the processing of fences in the 
interrupt handler doesn't affect at all.



As far as I see in the code, amdgpu_fence_emit is only called 
from task context. Also, we can skip this list I proposed and 
just use amdgpu_fence_driver_force_completion for each ring to 
signal all created HW fences.


Ah, wait a second this gave me another idea.

See amdgpu_fence_driver_force_completion():

amdgpu_fence_write(ring, ring->fence_drv.sync_seq);

If we change that to something like:

amdgpu_fence_write(ring, ring->fence_drv.sync_seq + 0x3FFF);

Not only the currently submitted, but also the next 0x3FFF 
fences will be considered signaled.


This basically solves out problem of making sure that new fences 
are also signaled without any additional overhead whatsoever.



Problem with this is that the act of setting the sync_seq to some 
MAX value alone is not enough, you actually have to call 
amdgpu_fence_process to iterate and signal the fences currently 
stored in ring->fence_drv.fences array and to guarantee that once 
you done your signalling no more HW fences will be added to that 
array anymore. I was thinking to do something like bellow:




Well we could implement the is_signaled callback once more, but I'm 
not sure if that is a good idea.



This indeed could save the explicit signaling I am doing bellow but I 
also set an error code there which might be helpful to propagate to users






amdgpu_fence_emit()

{

    dma_fence_init(fence);

    srcu_read_lock(amdgpu_unplug_srcu)

    if (!adev->unplug) {

        seq = ++ring->fence_drv.sync_seq;
        emit_fence(fence);

       /* We can't wait forever as the HW might be gone at any point */
       dma_fence_wait_timeout(old_fence, false, 5 * HZ);



You can pretty much ignore this wait here. It is only as a last 
resort so that we never overwrite the ring buffers.



If device is present how can I ignore this ?



I think you missed my question here






But it should not have a timeout as far as I can see.



Without timeout wait the who approach falls apart as I can't call 
srcu_synchronize on this scope because once device is physically gone 
the wait here will be forever




Yeah, but this is intentional. The only alternative to avoid 
corruption is to wait with a timeout and call BUG() if that triggers. 
That isn't much better.






        ring->fence_drv.fences[seq & 
ring->fence_drv.num_fences_mask] = fence;


    } else {

        dma_fence_set_error(fence, -ENODEV);
        dma_fence_signal(fence);

    }

    srcu_read_unlock(amdgpu_unplug_srcu)
    return fence;

}

amdgpu_pci_remove

{

    adev->unplug = true;
    synchronize_srcu(amdgpu_unplug_srcu)



Well that is just duplicating what drm_dev_unplug() should be doing 
on a different level.



drm_dev_unplug is on a much wider scope, for everything in the device 
including 'flushing' in flight IOCTLs, this deals specifically with 
the issue of force signalling HW fences




Yeah, but it adds the same overhead as the device srcu.

Christian.



So what's the right approach ? How we guarantee that when running 
amdgpu_fence_driver_force_completion we will signal all the HW fences 
and not racing against some more fences insertion into that array ?


Andrey





Andrey




Christian.

    /* Past this point no more fence are submitted to HW ring and 
hence we can safely call force signal on all that are currently there.
 * Any subsequently created  HW fences will be returned 
signaled with an error code right away

 */

    for_each_ring(adev)
        amdgpu_fence_process(ring)

    drm_dev_unplug(dev);
    Stop schedulers
    cancel_sync(all timers and queued works);
    hw_fini
    unmap_mmio

}


Andrey









Alternatively grabbing the reset write side and stopping and 
then restarting the scheduler could work as well.


Christian.



I didn't get the above and I don't see why I need to reuse the 
GPU reset rw_lock. I rely on the SRCU unplug flag for unplug. 
Also, not clear to me why are we focusing on the scheduler 
threads, any code patch to 

Re: [PATCH 0/4] Refine GPU recovery sequence to enhance its stability

2021-04-12 Thread Christian König

Am 12.04.21 um 20:01 schrieb Andrey Grodzovsky:


On 2021-04-12 1:44 p.m., Christian König wrote:



Am 12.04.21 um 19:27 schrieb Andrey Grodzovsky:

On 2021-04-10 1:34 p.m., Christian König wrote:

Hi Andrey,

Am 09.04.21 um 20:18 schrieb Andrey Grodzovsky:

[SNIP]


If we use a list and a flag called 'emit_allowed' under a lock 
such that in amdgpu_fence_emit we lock the list, check the flag 
and if true add the new HW fence to list and proceed to HW 
emition as normal, otherwise return with -ENODEV. In 
amdgpu_pci_remove we take the lock, set the flag to false, and 
then iterate the list and force signal it. Will this not prevent 
any new HW fence creation from now on from any place trying to 
do so ?


Way to much overhead. The fence processing is intentionally lock 
free to avoid cache line bouncing because the IRQ can move from 
CPU to CPU.


We need something which at least doesn't affect the processing of 
fences in the interrupt handler at all.



As far as I see in the code, amdgpu_fence_emit is only called from 
task context. Also, we can skip this list I proposed and just use 
amdgpu_fence_driver_force_completion for each ring to signal all 
created HW fences.


Ah, wait a second this gave me another idea.

See amdgpu_fence_driver_force_completion():

amdgpu_fence_write(ring, ring->fence_drv.sync_seq);

If we change that to something like:

amdgpu_fence_write(ring, ring->fence_drv.sync_seq + 0x3FFF);

Not only the currently submitted, but also the next 0x3FFF 
fences will be considered signaled.


This basically solves our problem of making sure that new fences 
are also signaled without any additional overhead whatsoever.
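
A hedged sketch of the idea (the follow-up below notes that 
amdgpu_fence_process() also has to run to actually signal what is already 
in the array, so it is included here; not the actual patch):

static void force_completion_sketch(struct amdgpu_ring *ring)
{
        /* Report sync_seq plus a large offset as the last completed seqno,
         * so the seqno check treats current and future fences as done.
         */
        amdgpu_fence_write(ring, ring->fence_drv.sync_seq + 0x3FFF);
        amdgpu_fence_process(ring);     /* signal everything now "completed" */
}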



Problem with this is that the act of setting the sync_seq to some 
MAX value alone is not enough, you actually have to call 
amdgpu_fence_process to iterate and signal the fences currently 
stored in the ring->fence_drv.fences array, and to guarantee that once 
you are done signalling no more HW fences will be added to that 
array anymore. I was thinking to do something like below:




Well we could implement the is_signaled callback once more, but I'm 
not sure if that is a good idea.



This indeed could save the explicit signaling I am doing below, but I 
also set an error code there which might be helpful to propagate to users.
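
For reference, a hedged sketch of what the ->signaled() route could look 
like (the helpers amdgpu_fence_to_adev() and hw_seq_signaled() are invented); 
as noted, it cannot carry the -ENODEV error that explicit signaling sets:

static bool unplug_aware_signaled(struct dma_fence *f)
{
        struct amdgpu_device *adev = amdgpu_fence_to_adev(f);

        /* once the unplug flag is set, never block on the dead HW */
        return adev->unplug || hw_seq_signaled(f);
}

static const struct dma_fence_ops unplug_aware_fence_ops = {
        .signaled = unplug_aware_signaled,
        /* other callbacks elided */
};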






amdgpu_fence_emit()

{

    dma_fence_init(fence);

    srcu_read_lock(amdgpu_unplug_srcu)

    if (!adev->unplug) {

        seq = ++ring->fence_drv.sync_seq;
        emit_fence(fence);

        /* We can't wait forever as the HW might be gone at any point */
        dma_fence_wait_timeout(old_fence, 5S);



You can pretty much ignore this wait here. It is only as a last 
resort so that we never overwrite the ring buffers.



If the device is present, how can I ignore this?




But it should not have a timeout as far as I can see.



Without a timeout on the wait the whole approach falls apart, as I can't 
call synchronize_srcu on this scope, because once the device is physically 
gone the wait here will be forever.




Yeah, but this is intentional. The only alternative to avoid corruption 
is to wait with a timeout and call BUG() if that triggers. That isn't 
much better.






        ring->fence_drv.fences[seq & 
ring->fence_drv.num_fences_mask] = fence;


    } else {

        dma_fence_set_error(fence, -ENODEV);
        dma_fence_signal(fence);

    }

    srcu_read_unlock(amdgpu_unplug_srcu)
    return fence;

}

amdgpu_pci_remove

{

    adev->unplug = true;
    synchronize_srcu(amdgpu_unplug_srcu)



Well that is just duplicating what drm_dev_unplug() should be doing 
on a different level.



drm_dev_unplug has a much wider scope, covering everything in the 
device including 'flushing' in-flight IOCTLs; this deals specifically 
with the issue of force signalling HW fences.




Yeah, but it adds the same overhead as the device srcu.

Christian.


Andrey




Christian.

    /* Past this point no more fences are submitted to the HW ring and
     * hence we can safely force signal all that are currently there.
     * Any subsequently created HW fences will be returned signaled
     * with an error code right away.
     */

    for_each_ring(adev)
        amdgpu_fence_process(ring)

    drm_dev_unplug(dev);
    Stop schedulers
    cancel_sync(all timers and queued works);
    hw_fini
    unmap_mmio

}


Andrey









Alternatively grabbing the reset write side and stopping and 
then restarting the scheduler could work as well.


Christian.



I didn't get the above and I don't see why I need to reuse the 
GPU reset rw_lock. I rely on the SRCU unplug flag for unplug. 
Also, it's not clear to me why we are focusing on the scheduler 
threads; any code path to generate HW fences should be covered, 
so any code leading to amdgpu_fence_emit needs to be taken into 
account, such as direct IB submissions, VM flushes, etc.


You need to work together with the reset lock anyway, cause a 
hotplug could run at the same time as a reset.



For going my way, indeed I now see that 

Re: [PATCH 0/4] Refine GPU recovery sequence to enhance its stability

2021-04-12 Thread Andrey Grodzovsky


On 2021-04-12 1:44 p.m., Christian König wrote:


Am 12.04.21 um 19:27 schrieb Andrey Grodzovsky:

On 2021-04-10 1:34 p.m., Christian König wrote:

Hi Andrey,

Am 09.04.21 um 20:18 schrieb Andrey Grodzovsky:

[SNIP]


If we use a list and a flag called 'emit_allowed' under a lock 
such that in amdgpu_fence_emit we lock the list, check the flag 
and if true add the new HW fence to list and proceed to HW 
emission as normal, otherwise return with -ENODEV. In 
amdgpu_pci_remove we take the lock, set the flag to false, and 
then iterate the list and force signal it. Will this not prevent 
any new HW fence creation from now on from any place trying to do 
so ?


Way too much overhead. The fence processing is intentionally lock 
free to avoid cache line bouncing because the IRQ can move from 
CPU to CPU.


We need something which at least doesn't affect the processing of 
fences in the interrupt handler at all.



As far as I see in the code, amdgpu_fence_emit is only called from 
task context. Also, we can skip this list I proposed and just use 
amdgpu_fence_driver_force_completion for each ring to signal all 
created HW fences.


Ah, wait a second this gave me another idea.

See amdgpu_fence_driver_force_completion():

amdgpu_fence_write(ring, ring->fence_drv.sync_seq);

If we change that to something like:

amdgpu_fence_write(ring, ring->fence_drv.sync_seq + 0x3FFF);

Not only the currently submitted, but also the next 0x3FFF 
fences will be considered signaled.


This basically solves our problem of making sure that new fences are 
also signaled without any additional overhead whatsoever.



Problem with this is that the act of setting the sync_seq to some MAX 
value alone is not enough, you actually have to call 
amdgpu_fence_process to iterate and signal the fences currently 
stored in the ring->fence_drv.fences array, and to guarantee that once you 
are done signalling no more HW fences will be added to that array 
anymore. I was thinking to do something like below:




Well we could implement the is_signaled callback once more, but I'm 
not sure if that is a good idea.



This indeed could save the explicit signaling I am doing below, but I 
also set an error code there which might be helpful to propagate to users.






amdgpu_fence_emit()

{

    dma_fence_init(fence);

    srcu_read_lock(amdgpu_unplug_srcu)

    if (!adev->unplug) {

        seq = ++ring->fence_drv.sync_seq;
        emit_fence(fence);

        /* We can't wait forever as the HW might be gone at any point */
        dma_fence_wait_timeout(old_fence, 5S);



You can pretty much ignore this wait here. It is only as a last resort 
so that we never overwrite the ring buffers.



If the device is present, how can I ignore this?




But it should not have a timeout as far as I can see.



Without a timeout on the wait the whole approach falls apart, as I can't 
call synchronize_srcu on this scope, because once the device is physically 
gone the wait here will be forever.





        ring->fence_drv.fences[seq & ring->fence_drv.num_fences_mask] 
= fence;


    } else {

        dma_fence_set_error(fence, -ENODEV);
        dma_fence_signal(fence);

    }

    srcu_read_unlock(amdgpu_unplug_srcu)
    return fence;

}

amdgpu_pci_remove

{

    adev->unplug = true;
    synchronize_srcu(amdgpu_unplug_srcu)



Well that is just duplicating what drm_dev_unplug() should be doing on 
a different level.



drm_dev_unplug has a much wider scope, covering everything in the 
device including 'flushing' in-flight IOCTLs; this deals specifically 
with the issue of force signalling HW fences.


Andrey




Christian.

    /* Past this point no more fences are submitted to the HW ring and
     * hence we can safely force signal all that are currently there.
     * Any subsequently created HW fences will be returned signaled
     * with an error code right away.
     */

    for_each_ring(adev)
        amdgpu_fence_process(ring)

    drm_dev_unplug(dev);
    Stop schedulers
    cancel_sync(all timers and queued works);
    hw_fini
    unmap_mmio

}


Andrey









Alternatively grabbing the reset write side and stopping and 
then restarting the scheduler could work as well.


Christian.



I didn't get the above and I don't see why I need to reuse the 
GPU reset rw_lock. I rely on the SRCU unplug flag for unplug. 
Also, it's not clear to me why we are focusing on the scheduler 
threads; any code path to generate HW fences should be covered, 
so any code leading to amdgpu_fence_emit needs to be taken into 
account, such as direct IB submissions, VM flushes, etc.


You need to work together with the reset lock anyway, cause a 
hotplug could run at the same time as a reset.



For going my way, indeed I now see that I have to take the reset 
write side lock during HW fences signalling in order to protect 
against scheduler/HW fences detachment and reattachment during 
schedulers stop/restart. But if we go with your approach  then 
calling drm_dev_unplug and scoping amdgpu_job_timeout with 

Re: [PATCH 0/4] Refine GPU recovery sequence to enhance its stability

2021-04-12 Thread Christian König


Am 12.04.21 um 19:27 schrieb Andrey Grodzovsky:

On 2021-04-10 1:34 p.m., Christian König wrote:

Hi Andrey,

Am 09.04.21 um 20:18 schrieb Andrey Grodzovsky:

[SNIP]


If we use a list and a flag called 'emit_allowed' under a lock 
such that in amdgpu_fence_emit we lock the list, check the flag 
and if true add the new HW fence to the list and proceed to HW emission 
as normal, otherwise return with -ENODEV. In amdgpu_pci_remove we 
take the lock, set the flag to false, and then iterate the list 
and force signal it. Will this not prevent any new HW fence 
creation from now on from any place trying to do so ?


Way too much overhead. The fence processing is intentionally lock 
free to avoid cache line bouncing because the IRQ can move from CPU 
to CPU.


We need something which at least doesn't affect the processing of 
fences in the interrupt handler at all.



As far as I see in the code, amdgpu_fence_emit is only called from 
task context. Also, we can skip this list I proposed and just use 
amdgpu_fence_driver_force_completion for each ring to signal all 
created HW fences.


Ah, wait a second this gave me another idea.

See amdgpu_fence_driver_force_completion():

amdgpu_fence_write(ring, ring->fence_drv.sync_seq);

If we change that to something like:

amdgpu_fence_write(ring, ring->fence_drv.sync_seq + 0x3FFF);

Not only the currently submitted, but also the next 0x3FFF fences 
will be considered signaled.


This basically solves our problem of making sure that new fences are 
also signaled without any additional overhead whatsoever.



Problem with this is that the act of setting the sync_seq to some MAX 
value alone is not enough, you actually have to call 
amdgpu_fence_process to iterate and signal the fences currently stored 
in the ring->fence_drv.fences array, and to guarantee that once you are 
done signalling no more HW fences will be added to that array anymore. 
I was thinking to do something like below:




Well we could implement the is_signaled callback once more, but I'm not 
sure if that is a good idea.



amdgpu_fence_emit()

{

    dma_fence_init(fence);

    srcu_read_lock(amdgpu_unplug_srcu)

    if (!adev->unplug) {

        seq = ++ring->fence_drv.sync_seq;
        emit_fence(fence);

        /* We can't wait forever as the HW might be gone at any point */
        dma_fence_wait_timeout(old_fence, 5S);



You can pretty much ignore this wait here. It is only as a last resort 
so that we never overwrite the ring buffers.


But it should not have a timeout as far as I can see.

        ring->fence_drv.fences[seq & ring->fence_drv.num_fences_mask] 
= fence;


    } else {

        dma_fence_set_error(fence, -ENODEV);
        dma_fence_signal(fence);

    }

    srcu_read_unlock(amdgpu_unplug_srcu)
    return fence;

}

amdgpu_pci_remove

{

    adev->unplug = true;
    synchronize_srcu(amdgpu_unplug_srcu)



Well that is just duplicating what drm_dev_unplug() should be doing on a 
different level.


Christian.

    /* Past this point no more fences are submitted to the HW ring and
     * hence we can safely force signal all that are currently there.
     * Any subsequently created HW fences will be returned signaled
     * with an error code right away.
     */

    for_each_ring(adev)
        amdgpu_fence_process(ring)

    drm_dev_unplug(dev);
    Stop schedulers
    cancel_sync(all timers and queued works);
    hw_fini
    unmap_mmio

}


Andrey









Alternatively grabbing the reset write side and stopping and then 
restarting the scheduler could work as well.


Christian.



I didn't get the above and I don't see why I need to reuse the GPU 
reset rw_lock. I rely on the SRCU unplug flag for unplug. Also, 
it's not clear to me why we are focusing on the scheduler threads; any 
code path to generate HW fences should be covered, so any code 
leading to amdgpu_fence_emit needs to be taken into account, such 
as direct IB submissions, VM flushes, etc.


You need to work together with the reset lock anyway, cause a 
hotplug could run at the same time as a reset.



For going my way, indeed I now see that I have to take the reset 
write side lock during HW fences signalling in order to protect 
against scheduler/HW fences detachment and reattachment during 
schedulers stop/restart. But if we go with your approach  then 
calling drm_dev_unplug and scoping amdgpu_job_timeout with 
drm_dev_enter/exit should be enough to prevent any concurrent GPU 
resets during unplug. In fact I already do it anyway - 
https://nam11.safelinks.protection.outlook.com/?url=https:%2F%2Fcgit.freedesktop.org%2F~agrodzov%2Flinux%2Fcommit%2F%3Fh%3Ddrm-misc-next%26id%3Def0ea4dd29ef44d2649c5eda16c8f4869acc36b1data=04%7C01%7Candrey.grodzovsky%40amd.com%7Ceefa9c90ed8c405ec3b708d8fc46daaa%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C637536728550884740%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000sdata=UiNaJE%2BH45iYmbwSDnMSKZS5z0iak0fNlbbfYqKS2Jo%3Dreserved=0


Yes, good point 

Re: [PATCH 0/4] Refine GPU recovery sequence to enhance its stability

2021-04-12 Thread Andrey Grodzovsky


On 2021-04-10 1:34 p.m., Christian König wrote:

Hi Andrey,

Am 09.04.21 um 20:18 schrieb Andrey Grodzovsky:

[SNIP]


If we use a list and a flag called 'emit_allowed' under a lock such 
that in amdgpu_fence_emit we lock the list, check the flag and if 
true add the new HW fence to the list and proceed to HW emission as 
normal, otherwise return with -ENODEV. In amdgpu_pci_remove we take 
the lock, set the flag to false, and then iterate the list and 
force signal it. Will this not prevent any new HW fence creation 
from now on from any place trying to do so ?


Way too much overhead. The fence processing is intentionally lock 
free to avoid cache line bouncing because the IRQ can move from CPU 
to CPU.


We need something which at least doesn't affect the processing of 
fences in the interrupt handler at all.



As far as I see in the code, amdgpu_fence_emit is only called from 
task context. Also, we can skip this list I proposed and just use 
amdgpu_fence_driver_force_completion for each ring to signal all 
created HW fences.


Ah, wait a second this gave me another idea.

See amdgpu_fence_driver_force_completion():

amdgpu_fence_write(ring, ring->fence_drv.sync_seq);

If we change that to something like:

amdgpu_fence_write(ring, ring->fence_drv.sync_seq + 0x3FFF);

Not only the currently submitted, but also the next 0x3FFF fences 
will be considered signaled.


This basically solves our problem of making sure that new fences are 
also signaled without any additional overhead whatsoever.



Problem with this is that the act of setting the sync_seq to some MAX 
value alone is not enough, you actually have to call 
amdgpu_fence_process to iterate and signal the fences currently stored 
in the ring->fence_drv.fences array, and to guarantee that once you are 
done signalling no more HW fences will be added to that array anymore. I was 
thinking to do something like below:


amdgpu_fence_emit()

{

    dma_fence_init(fence);

    srcu_read_lock(amdgpu_unplug_srcu)

    if (!adev->unplug) {

        seq = ++ring->fence_drv.sync_seq;
        emit_fence(fence);

        /* We can't wait forever as the HW might be gone at any point */
        dma_fence_wait_timeout(old_fence, 5S);
        ring->fence_drv.fences[seq & ring->fence_drv.num_fences_mask] = 
fence;


    } else {

        dma_fence_set_error(fence, -ENODEV);
        dma_fence_signal(fence);

    }

    srcu_read_unlock(amdgpu_unplug_srcu)
    return fence;

}

amdgpu_pci_remove

{

    adev->unplug = true;
    synchronize_srcu(amdgpu_unplug_srcu)

    /* Past this point no more fences are submitted to the HW ring and
     * hence we can safely force signal all that are currently there.
     * Any subsequently created HW fences will be returned signaled
     * with an error code right away.
     */

    for_each_ring(adev)
        amdgpu_fence_process(ring)

    drm_dev_unplug(dev);
    Stop schedulers
    cancel_sync(all timers and queued works);
    hw_fini
    unmap_mmio

}


Andrey









Alternatively grabbing the reset write side and stopping and then 
restarting the scheduler could work as well.


Christian.



I didn't get the above and I don't see why I need to reuse the GPU 
reset rw_lock. I rely on the SRCU unplug flag for unplug. Also, not 
clear to me why we are focusing on the scheduler threads; any code 
path to generate HW fences should be covered, so any code leading 
to amdgpu_fence_emit needs to be taken into account, such as direct 
IB submissions, VM flushes, etc.


You need to work together with the reset lock anyway, cause a 
hotplug could run at the same time as a reset.



For going my way, indeed I now see that I have to take the reset write 
side lock during HW fences signalling in order to protect against 
scheduler/HW fences detachment and reattachment during schedulers 
stop/restart. But if we go with your approach  then calling 
drm_dev_unplug and scoping amdgpu_job_timeout with drm_dev_enter/exit 
should be enough to prevent any concurrent GPU resets during unplug. 
In fact I already do it anyway - 
https://nam11.safelinks.protection.outlook.com/?url=https:%2F%2Fcgit.freedesktop.org%2F~agrodzov%2Flinux%2Fcommit%2F%3Fh%3Ddrm-misc-next%26id%3Def0ea4dd29ef44d2649c5eda16c8f4869acc36b1data=04%7C01%7Candrey.grodzovsky%40amd.com%7Ceefa9c90ed8c405ec3b708d8fc46daaa%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C637536728550884740%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000sdata=UiNaJE%2BH45iYmbwSDnMSKZS5z0iak0fNlbbfYqKS2Jo%3Dreserved=0


Yes, good point as well.

Christian.



Andrey





Christian.



Andrey






Christian.



Andrey





Andrey














Re: [PATCH 0/4] Refine GPU recovery sequence to enhance its stability

2021-04-10 Thread Christian König

Hi Andrey,

Am 09.04.21 um 20:18 schrieb Andrey Grodzovsky:

[SNIP]


If we use a list and a flag called 'emit_allowed' under a lock such 
that in amdgpu_fence_emit we lock the list, check the flag and if 
true add the new HW fence to the list and proceed to HW emission as 
normal, otherwise return with -ENODEV. In amdgpu_pci_remove we take 
the lock, set the flag to false, and then iterate the list and force 
signal it. Will this not prevent any new HW fence creation from now 
on from any place trying to do so ?


Way too much overhead. The fence processing is intentionally lock free 
to avoid cache line bouncing because the IRQ can move from CPU to CPU.


We need something which at least doesn't affect the processing of 
fences in the interrupt handler at all.



As far as I see in the code, amdgpu_fence_emit is only called from 
task context. Also, we can skip this list I proposed and just use 
amdgpu_fence_driver_force_completion for each ring to signal all 
created HW fences.


Ah, wait a second this gave me another idea.

See amdgpu_fence_driver_force_completion():

amdgpu_fence_write(ring, ring->fence_drv.sync_seq);

If we change that to something like:

amdgpu_fence_write(ring, ring->fence_drv.sync_seq + 0x3FFF);

Not only the currently submitted, but also the next 0x3FFF fences 
will be considered signaled.


This basically solves our problem of making sure that new fences are 
also signaled without any additional overhead whatsoever.






Alternatively grabbing the reset write side and stopping and then 
restarting the scheduler could work as well.


Christian.



I didn't get the above and I don't see why I need to reuse the GPU 
reset rw_lock. I rely on the SRCU unplug flag for unplug. Also, not 
clear to me why we are focusing on the scheduler threads; any code 
path to generate HW fences should be covered, so any code leading 
to amdgpu_fence_emit needs to be taken into account, such as direct 
IB submissions, VM flushes, etc.


You need to work together with the reset lock anyway, cause a hotplug 
could run at the same time as a reset.



For going my way, indeed I now see that I have to take the reset write 
side lock during HW fences signalling in order to protect against 
scheduler/HW fences detachment and reattachment during schedulers 
stop/restart. But if we go with your approach  then calling 
drm_dev_unplug and scoping amdgpu_job_timeout with drm_dev_enter/exit 
should be enough to prevent any concurrent GPU resets during unplug. 
In fact I already do it anyway - 
https://cgit.freedesktop.org/~agrodzov/linux/commit/?h=drm-misc-next=ef0ea4dd29ef44d2649c5eda16c8f4869acc36b1
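
For illustration, a hedged sketch of that scoping (the body is a 
placeholder, not the real amdgpu_job_timedout()):

static void job_timeout_sketch(struct drm_device *ddev)
{
        int idx;

        if (!drm_dev_enter(ddev, &idx))
                return;         /* device already unplugged, skip the reset */

        /* ... detect the hang and call amdgpu_device_gpu_recover() ... */

        drm_dev_exit(idx);
}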


Yes, good point as well.

Christian.



Andrey





Christian.



Andrey






Christian.



Andrey





Andrey














Re: [PATCH 0/4] Refine GPU recovery sequence to enhance its stability

2021-04-09 Thread Andrey Grodzovsky


On 2021-04-09 12:39 p.m., Christian König wrote:

Am 09.04.21 um 17:42 schrieb Andrey Grodzovsky:


On 2021-04-09 3:01 a.m., Christian König wrote:

Am 09.04.21 um 08:53 schrieb Christian König:

Am 08.04.21 um 22:39 schrieb Andrey Grodzovsky:

[SNIP]
But inserting drm_dev_enter/exit at the highest level in drm_ioctl 
is much less effort and leaves less room for error than going through 
each IOCTL and trying to identify at what point (possibly multiple 
points) they are about to access HW; some of this is hidden deep 
in HAL layers such as the DC layer in the display driver or the 
multiple layers of the powerplay/SMU libraries. Also, we can't limit 
ourselves to the back-end if by this you mean ASIC specific functions 
which access registers. We also need to take care of any MMIO 
kernel BO (VRAM BOs) where we may access MMIO space directly by 
pointer from the front end of the driver (HW agnostic) and the TTM/DRM 
layers.


Exactly, yes. The key point is we need to identify such places 
anyway for GPU reset to work properly. So we could just piggyback 
hotplug on top of that work and be done.



I see most of this was done by Denis in this patch 
https://cgit.freedesktop.org/~agrodzov/linux/commit/?h=drm-misc-next=df9c8d1aa278c435c30a69b8f2418b4a52fcb929, 
indeed this doesn't cover the direct by-pointer accesses of MMIO and 
will introduce many more of those; as people write new code, new 
places to cover will pop up, leading to regressions and extra work to 
fix. It would be really much better if we could blanket cover it at 
the very top, such as the root of all IOCTLs or, for any queued 
work/timer, at the very top function, to handle it once and for all.


And exactly that's what is not possible. At least for the reset case 
you need to look into each hardware access and handle that bit by bit 
and I think that for the hotplug case we should go down that route as 
well.






Our problem here is how to signal all the existing  fences on one 
hand and on the other prevent any new dma_fence waits after we 
finished signaling existing fences. Once we solved this then there 
is no problem using drm_dev_unplug in conjunction with 
drm_dev_enter/exit at the highest level of drm_ioctl to flush any 
IOCTLs in flight and block any new ones.


IMHO when we speak about signalling all fences we don't mean ALL 
the currently existing dma_fence structs (they are spread all over 
the place) but rather signal all the HW fences because HW is 
what's gone and we can't expect for those fences to be ever 
signaled. All the rest such as: scheduler fences, user fences, 
drm_gem reservation objects e.t.c. are either dependent on those 
HW fences and hence signaling the HW fences will in turn signal 
them or, are not impacted by the HW being gone and hence can still 
be waited on and will complete. If this assumption is correct then 
I think that we should use some flag to prevent any new submission 
to HW which creates HW fences (somewhere around 
amdgpu_fence_emit), then traverse all existing HW fences 
(currently they are spread in a few places so maybe we need to 
track them in a list) and signal them. After that it's safe to call 
drm_dev_unplug and be sure synchronize_srcu won't stall because 
of dma_fence_wait. After that we can proceed to canceling work 
items, stopping schedulers, etc.


That is problematic as well since you need to make sure that the 
scheduler is not creating a new hardware fence in the moment you 
try to signal all of them. It would require another SRCU or lock 
for this.



If we use a list and a flag called 'emit_allowed' under a lock such 
that in amdgpu_fence_emit we lock the list, check the flag and if 
true add the new HW fence to the list and proceed to HW emission as 
normal, otherwise return with -ENODEV. In amdgpu_pci_remove we take 
the lock, set the flag to false, and then iterate the list and force 
signal it. Will this not prevent any new HW fence creation from now 
on from any place trying to do so ?


Way too much overhead. The fence processing is intentionally lock free 
to avoid cache line bouncing because the IRQ can move from CPU to CPU.


We need something which at least doesn't affect the processing of 
fences in the interrupt handler at all.



As far as I see in the code, amdgpu_fence_emit is only called from task 
context. Also, we can skip this list I proposed and just use 
amdgpu_fence_driver_force_completion for each ring to signal all created 
HW fences.







Alternatively grabbing the reset write side and stopping and then 
restarting the scheduler could work as well.


Christian.



I didn't get the above and I don't see why I need to reuse the GPU 
reset rw_lock. I rely on the SRCU unplug flag for unplug. Also, not 
clear to me why we are focusing on the scheduler threads; any code 
path to generate HW fences should be covered, so any code leading to 
amdgpu_fence_emit needs to be taken into account, such as direct IB 
submissions, VM flushes, etc.


You need to work together with the 

Re: [PATCH 0/4] Refine GPU recovery sequence to enhance its stability

2021-04-09 Thread Christian König

Am 09.04.21 um 17:42 schrieb Andrey Grodzovsky:


On 2021-04-09 3:01 a.m., Christian König wrote:

Am 09.04.21 um 08:53 schrieb Christian König:

Am 08.04.21 um 22:39 schrieb Andrey Grodzovsky:

[SNIP]
But inserting drm_dev_enter/exit at the highest level in drm_ioctl 
is much less effort and leaves less room for error than going through 
each IOCTL and trying to identify at what point (possibly multiple 
points) they are about to access HW; some of this is hidden deep in 
HAL layers such as the DC layer in the display driver or the multiple 
layers of the powerplay/SMU libraries. Also, we can't limit ourselves 
to the back-end if by this you mean ASIC specific functions which 
access registers. We also need to take care of any MMIO kernel BO 
(VRAM BOs) where we may access MMIO space directly by pointer from the 
front end of the driver (HW agnostic) and the TTM/DRM layers.


Exactly, yes. The key point is we need to identify such places 
anyway for GPU reset to work properly. So we could just piggyback 
hotplug on top of that work and be done.



I see most of this was done by Denis in this patch 
https://cgit.freedesktop.org/~agrodzov/linux/commit/?h=drm-misc-next=df9c8d1aa278c435c30a69b8f2418b4a52fcb929, 
indeed this doesn't cover the direct by-pointer accesses of MMIO and 
will introduce many more of those; as people write new code, new 
places to cover will pop up, leading to regressions and extra work to 
fix. It would be really much better if we could blanket cover it at 
the very top, such as the root of all IOCTLs or, for any queued 
work/timer, at the very top function, to handle it once and for all.


And exactly that's what is not possible. At least for the reset case you 
need to look into each hardware access and handle that bit by bit and I 
think that for the hotplug case we should go down that route as well.






Our problem here is how to signal all the existing  fences on one 
hand and on the other prevent any new dma_fence waits after we 
finished signaling existing fences. Once we solved this then there 
is no problem using drm_dev_unplug in conjunction with 
drm_dev_enter/exit at the highest level of drm_ioctl to flush any 
IOCTLs in flight and block any new ones.


IMHO when we speak about signalling all fences we don't mean ALL 
the currently existing dma_fence structs (they are spread all over 
the place) but rather signal all the HW fences because HW is what's 
gone and we can't expect for those fences to be ever signaled. All 
the rest such as: scheduler fences, user fences, drm_gem 
reservation objects e.t.c. are either dependent on those HW fences 
and hence signaling the HW fences will in turn signal them or, are 
not impacted by the HW being gone and hence can still be waited on 
and will complete. If this assumption is correct then I think that 
we should use some flag to prevent any new submission to HW which 
creates HW fences (somewhere around amdgpu_fence_emit), then 
traverse all existing HW fences (currently they are spread in a few 
places so maybe we need to track them in a list) and signal them. 
After that it's safe to call drm_dev_unplug and be sure 
synchronize_srcu won't stall because of dma_fence_wait. After 
that we can proceed to canceling work items, stopping schedulers, 
etc.


That is problematic as well since you need to make sure that the 
scheduler is not creating a new hardware fence in the moment you try 
to signal all of them. It would require another SRCU or lock for this.



If we use a list and a flag called 'emit_allowed' under a lock such 
that in amdgpu_fence_emit we lock the list, check the flag and if true 
add the new HW fence to the list and proceed to HW emission as normal, 
otherwise return with -ENODEV. In amdgpu_pci_remove we take the lock, 
set the flag to false, and then iterate the list and force signal it. 
Will this not prevent any new HW fence creation from now on from any 
place trying to do so ?


Way too much overhead. The fence processing is intentionally lock free to 
avoid cache line bouncing because the IRQ can move from CPU to CPU.


We need something which at least doesn't affect the processing of 
fences in the interrupt handler at all.




Alternatively grabbing the reset write side and stopping and then 
restarting the scheduler could work as well.


Christian.



I didn't get the above and I don't see why I need to reuse the GPU 
reset rw_lock. I rely on the SRCU unplug flag for unplug. Also, not 
clear to me why we are focusing on the scheduler threads; any code 
path to generate HW fences should be covered, so any code leading to 
amdgpu_fence_emit needs to be taken into account, such as direct IB 
submissions, VM flushes, etc.


You need to work together with the reset lock anyway, cause a hotplug 
could run at the same time as a reset.



Christian.



Andrey






Christian.



Andrey





Andrey











Re: [PATCH 0/4] Refine GPU recovery sequence to enhance its stability

2021-04-09 Thread Andrey Grodzovsky


On 2021-04-09 3:01 a.m., Christian König wrote:

Am 09.04.21 um 08:53 schrieb Christian König:

Am 08.04.21 um 22:39 schrieb Andrey Grodzovsky:


On 2021-04-08 2:58 p.m., Christian König wrote:

Am 08.04.21 um 18:08 schrieb Andrey Grodzovsky:

On 2021-04-08 4:32 a.m., Christian König wrote:

Am 08.04.21 um 10:22 schrieb Christian König:

[SNIP]



Beyond blocking all delayed works and scheduler threads we 
also need to guarantee no  IOCTL can access MMIO post device 
unplug OR in flight IOCTLs are done before we finish 
pci_remove (amdgpu_pci_remove for us).
For this I suggest we do something like what we worked on 
with Takashi Iwai the ALSA maintainer recently when he helped 
implementing PCI BARs move support for snd_hda_intel. Take a 
look at
https://nam11.safelinks.protection.outlook.com/?url=https:%2F%2Fcgit.freedesktop.org%2F~agrodzov%2Flinux%2Fcommit%2F%3Fh%3Dyadro%2Fpcie_hotplug%2Fmovable_bars_v9.1%26id%3Dcbaa324799718e2b828a8c7b5b001dd896748497data=04%7C01%7Candrey.grodzovsky%40amd.com%7C1c5e440d332f46b7f86208d8fb25422c%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C637535484734581904%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000sdata=n%2FG3bLYUKdl9mitR9f1a8qLpkToLdKM3Iz4y23GFg60%3Dreserved=0 
and
https://nam11.safelinks.protection.outlook.com/?url=https:%2F%2Fcgit.freedesktop.org%2F~agrodzov%2Flinux%2Fcommit%2F%3Fh%3Dyadro%2Fpcie_hotplug%2Fmovable_bars_v9.1%26id%3De36365d9ab5bbc30bdc221ab4b3437de34492440data=04%7C01%7Candrey.grodzovsky%40amd.com%7C1c5e440d332f46b7f86208d8fb25422c%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C637535484734581904%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000sdata=xI88SgbdAK%2FUmCC3JOvAknFTdbDbfu4AIPL%2Bf8ol4ZI%3Dreserved=0 

We also had same issue there, how to prevent MMIO accesses 
while the BARs are migrating. What was done there is a 
refcount was added to count all IOCTLs in flight, for any in 
flight  IOCTL the BAR migration handler would
block for the refcount to drop to 0 before it would proceed, 
for any later IOCTL it stops and wait if device is in 
migration state. We even don't need the wait part, nothing to 
wait for, we just return with -ENODEV for this case.




This is essentially what the DRM SRCU is doing as well.

For the hotplug case we could do this in the toplevel since we 
can signal the fence and don't need to block memory management.



To make SRCU 'wait for' all IOCTLs in flight we would need to 
wrap every IOCTL ( practically - just drm_ioctl function) with 
drm_dev_enter/drm_dev_exit - can we do it ?




Sorry totally missed this question.

Yes, exactly that. As discussed for the hotplug case we can do this.
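
A hedged sketch of what that wrapping could look like at the dispatch 
level (illustrative only, not a proposal for the actual drm_ioctl() body):

static long ioctl_wrapped_sketch(struct file *filp, unsigned int cmd,
                                 unsigned long arg)
{
        struct drm_file *file_priv = filp->private_data;
        struct drm_device *dev = file_priv->minor->dev;
        long ret;
        int idx;

        if (!drm_dev_enter(dev, &idx))
                return -ENODEV; /* unplugged: reject new ioctls */

        ret = drm_ioctl(filp, cmd, arg);

        drm_dev_exit(idx);      /* synchronize_srcu() waits for this scope */
        return ret;
}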



Thinking more about it - assuming we are treating synchronize_srcu 
as a 'wait for completion' of any in-flight {drm_dev_enter, 
drm_dev_exit} scope, some of those scopes might do dma_fence_wait 
inside. Since we haven't force signaled the fences yet we will end 
up in a deadlock. We have to signal all the various fences before 
doing the 'wait for'. But we can't signal the fences before setting 
'dev->unplugged = true' to reject further CS and other stuff which 
might create more fences we were supposed to force signal and now 
missed them. Effectively, setting 'dev->unplugged = true' and doing 
synchronize_srcu in one call, like drm_dev_unplug does, without 
signalling all the fences in the device in between these two steps 
looks like a possible deadlock to me - what do you think?




Indeed, that is a really good argument to handle it the same way as 
the reset lock.


E.g. not taking it at the high level IOCTL, but rather when the 
frontend of the driver has acquired all the necessary locks (BO 
resv, VM lock etc...) before calling into the backend to actually 
do things with the hardware.


Christian.


From what you said I understand that you want to solve this problem 
by using drm_dev_enter/exit brackets low enough in the code such 
that it will not include and fence wait.


But inserting drm_dev_enter/exit at the highest level in drm_ioctl 
is much less effort and leaves less room for error than going through 
each IOCTL and trying to identify at what point (possibly multiple 
points) they are about to access HW; some of this is hidden deep in 
HAL layers such as the DC layer in the display driver or the multiple 
layers of the powerplay/SMU libraries. Also, we can't limit ourselves 
to the back-end if by this you mean ASIC specific functions which 
access registers. We also need to take care of any MMIO kernel BO 
(VRAM BOs) where we may access MMIO space directly by pointer from the 
front end of the driver (HW agnostic) and the TTM/DRM layers.


Exactly, yes. The key point is we need to identify such places anyway 
for GPU reset to work properly. So we could just piggyback hotplug 
on top of that work and be done.



I see most of this was done by Denis in this patch 

Re: [PATCH 0/4] Refine GPU recovery sequence to enhance its stability

2021-04-09 Thread Christian König

Am 09.04.21 um 08:53 schrieb Christian König:

Am 08.04.21 um 22:39 schrieb Andrey Grodzovsky:


On 2021-04-08 2:58 p.m., Christian König wrote:

Am 08.04.21 um 18:08 schrieb Andrey Grodzovsky:

On 2021-04-08 4:32 a.m., Christian König wrote:

Am 08.04.21 um 10:22 schrieb Christian König:

[SNIP]



Beyond blocking all delayed works and scheduler threads we 
also need to guarantee no  IOCTL can access MMIO post device 
unplug OR in flight IOCTLs are done before we finish 
pci_remove (amdgpu_pci_remove for us).
For this I suggest we do something like what we worked on with 
Takashi Iwai the ALSA maintainer recently when he helped 
implementing PCI BARs move support for snd_hda_intel. Take a 
look at
https://cgit.freedesktop.org/~agrodzov/linux/commit/?h=yadro/pcie_hotplug/movable_bars_v9.1=cbaa324799718e2b828a8c7b5b001dd896748497 
and
https://cgit.freedesktop.org/~agrodzov/linux/commit/?h=yadro/pcie_hotplug/movable_bars_v9.1=e36365d9ab5bbc30bdc221ab4b3437de34492440 

We also had same issue there, how to prevent MMIO accesses 
while the BARs are migrating. What was done there is a 
refcount was added to count all IOCTLs in flight, for any in 
flight  IOCTL the BAR migration handler would
block for the refcount to drop to 0 before it would proceed, 
for any later IOCTL it stops and wait if device is in 
migration state. We even don't need the wait part, nothing to 
wait for, we just return with -ENODEV for this case.




This is essentially what the DRM SRCU is doing as well.

For the hotplug case we could do this in the toplevel since we 
can signal the fence and don't need to block memory management.



To make SRCU 'wait for' all IOCTLs in flight we would need to 
wrap every IOCTL ( practically - just drm_ioctl function) with 
drm_dev_enter/drm_dev_exit - can we do it ?




Sorry totally missed this question.

Yes, exactly that. As discussed for the hotplug case we can do this.



Thinking more about it - assuming we are treating synchronize_srcu 
as a 'wait for completion' of any in-flight {drm_dev_enter, 
drm_dev_exit} scope, some of those scopes might do dma_fence_wait 
inside. Since we haven't force signaled the fences yet we will end 
up in a deadlock. We have to signal all the various fences before 
doing the 'wait for'. But we can't signal the fences before setting 
'dev->unplugged = true' to reject further CS and other stuff which 
might create more fences we were supposed to force signal and now 
missed them. Effectively, setting 'dev->unplugged = true' and doing 
synchronize_srcu in one call, like drm_dev_unplug does, without 
signalling all the fences in the device in between these two steps 
looks like a possible deadlock to me - what do you think?




Indeed, that is a really good argument to handle it the same way as 
the reset lock.


E.g. not taking it at the high level IOCTL, but rather when the 
frontend of the driver has acquired all the necessary locks (BO 
resv, VM lock etc...) before calling into the backend to actually do 
things with the hardware.


Christian.


From what you said I understand that you want to solve this problem 
by using drm_dev_enter/exit brackets low enough in the code such that 
it will not include and fence wait.


But inserting drm_dev_enter/exit at the highest level in drm_ioctl is 
much less effort and leaves less room for error than going through 
each IOCTL and trying to identify at what point (possibly multiple 
points) they are about to access HW; some of this is hidden deep in 
HAL layers such as the DC layer in the display driver or the multiple 
layers of the powerplay/SMU libraries. Also, we can't limit ourselves 
to the back-end if by this you mean ASIC specific functions which 
access registers. We also need to take care of any MMIO kernel BO 
(VRAM BOs) where we may access MMIO space directly by pointer from the 
front end of the driver (HW agnostic) and the TTM/DRM layers.


Exactly, yes. The key point is we need to identify such places anyway 
for GPU reset to work properly. So we could just piggyback hotplug on 
top of that work and be done.




Our problem here is how to signal all the existing  fences on one 
hand and on the other prevent any new dma_fence waits after we 
finished signaling existing fences. Once we solved this then there is 
no problem using drm_dev_unplug in conjunction with 
drm_dev_enter/exit at the highest level of drm_ioctl to flush any 
IOCTLs in flight and block any new ones.


IMHO when we speak about signalling all fences we don't mean ALL the 
currently existing dma_fence structs (they are spread all over the 
place) but rather signal all the HW fences because HW is what's gone 
and we can't expect for those fences to be ever signaled. All the 
rest such as: scheduler fences,  user fences, drm_gem reservation 
objects e.t.c. are either dependent on those HW fences and hence 
signaling the HW fences will in turn signal them or, are not impacted 
by the HW being gone and hence can still be waited on and will 
complete. If this assumption is 

Re: [PATCH 0/4] Refine GPU recovery sequence to enhance its stability

2021-04-09 Thread Christian König

Am 08.04.21 um 22:39 schrieb Andrey Grodzovsky:


On 2021-04-08 2:58 p.m., Christian König wrote:

Am 08.04.21 um 18:08 schrieb Andrey Grodzovsky:

On 2021-04-08 4:32 a.m., Christian König wrote:

Am 08.04.21 um 10:22 schrieb Christian König:

[SNIP]



Beyond blocking all delayed works and scheduler threads we also 
need to guarantee no  IOCTL can access MMIO post device unplug 
OR in flight IOCTLs are done before we finish pci_remove 
(amdgpu_pci_remove for us).
For this I suggest we do something like what we worked on with 
Takashi Iwai the ALSA maintainer recently when he helped 
implementing PCI BARs move support for snd_hda_intel. Take a 
look at
https://cgit.freedesktop.org/~agrodzov/linux/commit/?h=yadro/pcie_hotplug/movable_bars_v9.1=cbaa324799718e2b828a8c7b5b001dd896748497 
and
https://cgit.freedesktop.org/~agrodzov/linux/commit/?h=yadro/pcie_hotplug/movable_bars_v9.1=e36365d9ab5bbc30bdc221ab4b3437de34492440 

We also had same issue there, how to prevent MMIO accesses 
while the BARs are migrating. What was done there is a refcount 
was added to count all IOCTLs in flight, for any in flight  
IOCTL the BAR migration handler would
block for the refcount to drop to 0 before it would proceed, 
for any later IOCTL it stops and wait if device is in migration 
state. We even don't need the wait part, nothing to wait for, 
we just return with -ENODEV for this case.




This is essentially what the DRM SRCU is doing as well.

For the hotplug case we could do this in the toplevel since we 
can signal the fence and don't need to block memory management.



To make SRCU 'wait for' all IOCTLs in flight we would need to 
wrap every IOCTL ( practically - just drm_ioctl function) with 
drm_dev_enter/drm_dev_exit - can we do it ?




Sorry totally missed this question.

Yes, exactly that. As discussed for the hotplug case we can do this.



Thinking more about it - assuming we are treating synchronize_srcu 
as a 'wait for completion' of any in-flight {drm_dev_enter, 
drm_dev_exit} scope, some of those scopes might do dma_fence_wait 
inside. Since we haven't force signaled the fences yet we will end 
up in a deadlock. We have to signal all the various fences before 
doing the 'wait for'. But we can't signal the fences before setting 
'dev->unplugged = true' to reject further CS and other stuff which 
might create more fences we were supposed to force signal and now 
missed them. Effectively, setting 'dev->unplugged = true' and doing 
synchronize_srcu in one call, like drm_dev_unplug does, without 
signalling all the fences in the device in between these two steps 
looks like a possible deadlock to me - what do you think?




Indeed, that is a really good argument to handle it the same way as 
the reset lock.


E.g. not taking it at the high level IOCTL, but rather when the 
frontend of the driver has acquired all the necessary locks (BO resv, 
VM lock etc...) before calling into the backend to actually do things 
with the hardware.


Christian.


From what you said I understand that you want to solve this problem by 
using drm_dev_enter/exit brackets low enough in the code such that it 
will not include and fence wait.


But inserting drm_dev_enter/exit at the highest level in drm_ioctl is 
much less effort and leaves less room for error than going through 
each IOCTL and trying to identify at what point (possibly multiple 
points) they are about to access HW; some of this is hidden deep in 
HAL layers such as the DC layer in the display driver or the multiple 
layers of the powerplay/SMU libraries. Also, we can't limit ourselves 
to the back-end if by this you mean ASIC specific functions which 
access registers. We also need to take care of any MMIO kernel BO 
(VRAM BOs) where we may access MMIO space directly by pointer from the 
front end of the driver (HW agnostic) and the TTM/DRM layers.


Exactly, yes. The key point is we need to identify such places anyway 
for GPU reset to work properly. So we could just piggyback hotplug on 
top of that work and be done.




Our problem here is how to signal all the existing  fences on one hand 
and on the other prevent any new dma_fence waits after we finished 
signaling existing fences. Once we solved this then there is no 
problem using drm_dev_unplug in conjunction with drm_dev_enter/exit at 
the highest level of drm_ioctl to flush any IOCTLs in flight and block 
any new ones.


IMHO when we speak about signalling all fences we don't mean ALL the 
currently existing dma_fence structs (they are spread all over the 
place) but rather signal all the HW fences because HW is what's gone 
and we can't expect for those fences to be ever signaled. All the rest 
such as: scheduler fences,  user fences, drm_gem reservation objects 
e.t.c. are either dependent on those HW fences and hence signaling the 
HW fences will in turn signal them or, are not impacted by the HW 
being gone and hence can still be waited on and will complete. If this 
assumption is correct then I think that we should use some flag 

Re: [PATCH 0/4] Refine GPU recovery sequence to enhance its stability

2021-04-08 Thread Andrey Grodzovsky


On 2021-04-08 2:58 p.m., Christian König wrote:

Am 08.04.21 um 18:08 schrieb Andrey Grodzovsky:

On 2021-04-08 4:32 a.m., Christian König wrote:

Am 08.04.21 um 10:22 schrieb Christian König:

[SNIP]



Beyond blocking all delayed works and scheduler threads we also 
need to guarantee no  IOCTL can access MMIO post device unplug 
OR in flight IOCTLs are done before we finish pci_remove 
(amdgpu_pci_remove for us).
For this I suggest we do something like what we worked on with 
Takashi Iwai the ALSA maintainer recently when he helped 
implementing PCI BARs move support for snd_hda_intel. Take a 
look at
https://cgit.freedesktop.org/~agrodzov/linux/commit/?h=yadro/pcie_hotplug/movable_bars_v9.1=cbaa324799718e2b828a8c7b5b001dd896748497 
and
https://cgit.freedesktop.org/~agrodzov/linux/commit/?h=yadro/pcie_hotplug/movable_bars_v9.1=e36365d9ab5bbc30bdc221ab4b3437de34492440 

We also had same issue there, how to prevent MMIO accesses while 
the BARs are migrating. What was done there is a refcount was 
added to count all IOCTLs in flight, for any in flight  IOCTL 
the BAR migration handler would
block for the refcount to drop to 0 before it would proceed, for 
any later IOCTL it stops and wait if device is in migration 
state. We even don't need the wait part, nothing to wait for, we 
just return with -ENODEV for this case.




This is essentially what the DRM SRCU is doing as well.

For the hotplug case we could do this in the toplevel since we 
can signal the fence and don't need to block memory management.



To make SRCU 'wait for' all IOCTLs in flight we would need to wrap 
every IOCTL ( practically - just drm_ioctl function) with 
drm_dev_enter/drm_dev_exit - can we do it ?




Sorry totally missed this question.

Yes, exactly that. As discussed for the hotplug case we can do this.



Thinking more about it - assuming we are treating synchronize_srcu 
as a 'wait for completion' of any in-flight {drm_dev_enter, 
drm_dev_exit} scope, some of those scopes might do dma_fence_wait 
inside. Since we haven't force signaled the fences yet we will end 
up in a deadlock. We have to signal all the various fences before 
doing the 'wait for'. But we can't signal the fences before setting 
'dev->unplugged = true' to reject further CS and other stuff which 
might create more fences we were supposed to force signal and now 
missed them. Effectively, setting 'dev->unplugged = true' and doing 
synchronize_srcu in one call, like drm_dev_unplug does, without 
signalling all the fences in the device in between these two steps 
looks like a possible deadlock to me - what do you think?




Indeed, that is a really good argument to handle it the same way as 
the reset lock.


E.g. not taking it at the high level IOCTL, but rather when the 
frontend of the driver has acquired all the necessary locks (BO resv, 
VM lock etc...) before calling into the backend to actually do things 
with the hardware.


Christian.


From what you said I understand that you want to solve this problem by 
using drm_dev_enter/exit brackets low enough in the code such that it 
will not include and fence wait.


But inserting drm_dev_enter/exit at the highest level in drm_ioctl is 
much less effort and leaves less room for error than going through 
each IOCTL and trying to identify at what point (possibly multiple 
points) they are about to access HW; some of this is hidden deep in 
HAL layers such as the DC layer in the display driver or the multiple 
layers of the powerplay/SMU libraries. Also, we can't limit ourselves 
to the back-end if by this you mean ASIC specific functions which 
access registers. We also need to take care of any MMIO kernel BO 
(VRAM BOs) where we may access MMIO space directly by pointer from the 
front end of the driver (HW agnostic) and the TTM/DRM layers.


Our problem here is how to signal all the existing  fences on one hand 
and on the other prevent any new dma_fence waits after we finished 
signaling existing fences. Once we solved this then there is no problem 
using drm_dev_unplug in conjunction with drm_dev_enter/exit at the 
highest level of drm_ioctl to flush any IOCTLs in flight and block any 
new ones.


IMHO when we speak about signalling all fences we don't mean ALL the 
currently existing dma_fence structs (they are spread all over the 
place) but rather signal all the HW fences because HW is what's gone and 
we can't expect for those fences to be ever signaled. All the rest such 
as: scheduler fences,  user fences, drm_gem reservation objects e.t.c. 
are either dependent on those HW fences and hence signaling the HW 
fences will in turn signal them or, are not impacted by the HW being 
gone and hence can still be waited on and will complete. If this 
assumption is correct then I think that we should use some flag to 
prevent any new submission to HW which creates HW fences (somewhere 
around amdgpu_fence_emit), then traverse all existing HW fences 
(currently they are spread in a few places so maybe we need to track 
them in a list) and 

Re: [PATCH 0/4] Refine GPU recovery sequence to enhance its stability

2021-04-08 Thread Christian König

Am 08.04.21 um 18:08 schrieb Andrey Grodzovsky:

On 2021-04-08 4:32 a.m., Christian König wrote:

Am 08.04.21 um 10:22 schrieb Christian König:

[SNIP]



Beyond blocking all delayed works and scheduler threads we also 
need to guarantee no  IOCTL can access MMIO post device unplug OR 
in flight IOCTLs are done before we finish pci_remove 
(amdgpu_pci_remove for us).
For this I suggest we do something like what we worked on with 
Takashi Iwai the ALSA maintainer recently when he helped 
implementing PCI BARs move support for snd_hda_intel. Take a look at
https://cgit.freedesktop.org/~agrodzov/linux/commit/?h=yadro/pcie_hotplug/movable_bars_v9.1=cbaa324799718e2b828a8c7b5b001dd896748497 
and

https://cgit.freedesktop.org/~agrodzov/linux/commit/?h=yadro/pcie_hotplug/movable_bars_v9.1=e36365d9ab5bbc30bdc221ab4b3437de34492440
We also had same issue there, how to prevent MMIO accesses while 
the BARs are migrating. What was done there is a refcount was 
added to count all IOCTLs in flight, for any in flight  IOCTL the 
BAR migration handler would
block for the refcount to drop to 0 before it would proceed, for 
any later IOCTL it stops and wait if device is in migration 
state. We even don't need the wait part, nothing to wait for, we 
just return with -ENODEV for this case.
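
A hedged sketch of that refcount scheme, with all names invented for 
illustration:

struct inflight_gate {
        atomic_t          count;
        bool              removing;
        wait_queue_head_t drained;
};

static int gate_enter(struct inflight_gate *g)          /* ioctl entry */
{
        atomic_inc(&g->count);
        smp_mb__after_atomic();
        if (READ_ONCE(g->removing)) {
                if (atomic_dec_and_test(&g->count))
                        wake_up(&g->drained);
                return -ENODEV;
        }
        return 0;
}

static void gate_exit(struct inflight_gate *g)          /* ioctl exit */
{
        if (atomic_dec_and_test(&g->count))
                wake_up(&g->drained);
}

static void gate_drain(struct inflight_gate *g)         /* pci_remove side */
{
        WRITE_ONCE(g->removing, true);
        smp_mb();
        wait_event(g->drained, atomic_read(&g->count) == 0);
}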




This is essentially what the DRM SRCU is doing as well.

For the hotplug case we could do this in the toplevel since we can 
signal the fence and don't need to block memory management.



To make SRCU 'wait for' all IOCTLs in flight we would need to wrap 
every IOCTL ( practically - just drm_ioctl function) with 
drm_dev_enter/drm_dev_exit - can we do it ?




Sorry totally missed this question.

Yes, exactly that. As discussed for the hotplug case we can do this.



Thinking more about it - assuming we are treating synchronize_srcu as 
a 'wait for completion' of any in-flight {drm_dev_enter, drm_dev_exit} 
scope, some of those scopes might do dma_fence_wait inside. Since we 
haven't force signaled the fences yet we will end up in a deadlock. We 
have to signal all the various fences before doing the 'wait for'. But 
we can't signal the fences before setting 'dev->unplugged = true' to 
reject further CS and other stuff which might create more fences we 
were supposed to force signal and now missed them. Effectively, setting 
'dev->unplugged = true' and doing synchronize_srcu in one call, like 
drm_dev_unplug does, without signalling all the fences in the device in 
between these two steps, looks like a possible deadlock to me - what do 
you think?




Indeed, that is a really good argument to handle it the same way as the 
reset lock.


E.g. not taking it at the high level IOCTL, but rather when the frontend 
of the driver has acquired all the necessary locks (BO resv, VM lock 
etc...) before calling into the backend to actually do things with the 
hardware.


Christian.
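
As a hedged illustration of that backend-level bracketing (the helper is 
invented; WREG32 is amdgpu's existing register write macro):

static void guarded_wreg_sketch(struct amdgpu_device *adev, u32 reg, u32 v)
{
        int idx;

        /* frontend already holds BO resv / VM locks at this point */
        if (!drm_dev_enter(adev_to_drm(adev), &idx))
                return;         /* device gone, drop the MMIO access */

        WREG32(reg, v);

        drm_dev_exit(idx);
}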


Andrey






Re: [PATCH 0/4] Refine GPU recovery sequence to enhance its stability

2021-04-08 Thread Andrey Grodzovsky
only 
completes the hardware fences as part of the reset.


So waiting for a fence while holding the reset lock is illegal 
and needs to be avoided.


Lockdep also complains about this when it is used correctly. 
The only reason it doesn't complain here is because you use an 
atomic+wait_event instead of a locking primitive.


Regards,

Christian.



From: Li, Dennis <dennis...@amd.com>
Sent: Thursday, March 18, 2021 09:28
To: Koenig, Christian <christian.koe...@amd.com>; 
amd-gfx@lists.freedesktop.org; Deucher, Alexander 
<alexander.deuc...@amd.com>; Kuehling, Felix <felix.kuehl...@amd.com>; 
Zhang, Hawking <hawking.zh...@amd.com>
Subject: RE: [PATCH 0/4] Refine GPU recovery sequence to 
enhance its stability


>>> Those two steps need to be exchanged or otherwise it is 
possible that new delayed work items etc are started before 
the lock is taken.
What about adding a check for adev->in_gpu_reset in the work item? 
If we exchange the two steps, it may introduce a deadlock. 
For example, if a user thread holds the read lock and waits 
for a fence while the recovery thread tries to take the write 
lock and then complete the fences, the recovery thread will 
always be blocked.
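
As an illustration, a work item could bail out early like this (sketch 
only; 'some_work' is a hypothetical delayed_work field):

static void amdgpu_some_delayed_work_handler(struct work_struct *work)
{
        struct amdgpu_device *adev =
                container_of(work, struct amdgpu_device, some_work.work);

        /* bail out early instead of racing with the recovery thread */
        if (atomic_read(&adev->in_gpu_reset))
                return;

        /* ... touch registers, requeue the work, etc. ... */
}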



Best Regards
Dennis Li
-Original Message-
From: Koenig, Christian <christian.koe...@amd.com>
Sent: Thursday, March 18, 2021 3:54 PM
To: Li, Dennis <dennis...@amd.com>; 
amd-gfx@lists.freedesktop.org; Deucher, Alexander 
<alexander.deuc...@amd.com>; Kuehling, Felix 
<felix.kuehl...@amd.com>; Zhang, Hawking <hawking.zh...@amd.com>
Subject: Re: [PATCH 0/4] Refine GPU recovery sequence to 
enhance its stability


Am 18.03.21 um 08:23 schrieb Dennis Li:
> We have defined two variables in_gpu_reset and reset_sem in 
adev object. The atomic type variable in_gpu_reset is used to 
avoid recovery thread reenter and make lower functions return 
more earlier when recovery start, but couldn't block recovery 
thread when it access hardware. The r/w semaphore reset_sem is 
used to solve these synchronization issues between recovery 
thread and other threads.

>
> The original solution locked registers' access in lower 
functions, which will introduce following issues:

>
> 1) many lower functions are used in both recovery thread and 
others. Firstly we must harvest these functions, it is easy to 
miss someones. Secondly these functions need select which lock 
(read lock or write lock) will be used, according to the 
thread it is running in. If the thread context isn't 
considered, the added lock will easily introduce deadlock. 
Besides that, in most time, developer easily forget to add 
locks for new functions.

>
> 2) performance drop. More lower functions are more 
frequently called.

>
> 3) easily introduce false positive lockdep complaint, 
because write lock has big range in recovery thread, but low 
level functions will hold read lock may be protected by other 
locks in other threads.

>
> Therefore the new solution will try to add lock protection 
for ioctls of kfd. Its goal is that there are no threads 
except for recovery thread or its children (for xgmi) to 
access hardware when doing GPU reset and resume. So refine 
recovery thread as the following:

>
> Step 0: atomic_cmpxchg(&adev->in_gpu_reset, 0, 1)
> 1). if failed, it means system had a recovery thread 
running, current thread exit directly;

> 2). if success, enter recovery thread;
>
> Step 1: cancel all delay works, stop drm schedule, complete 
all unreceived fences and so on. It try to stop or pause other 
threads.

>
> Step 2: call down_write(&adev->reset_sem) to hold write 
lock, which will block recovery thread until other threads 
release read locks.


Those two steps need to be exchanged or otherwise it is 
possible that new delayed work items etc are started before 
the lock is taken.


Just to make it clear until this is fixed the whole patch set 
is a NAK.


Regards,
Christian.

>
> Step 3: normally, there is only recovery threads running to 
access hardware, it is safe to do gpu reset now.

>
> Step 4: do post gpu reset, such as call all ips' resume 
functions;

>
> Step 5: atomic set adev->in_gpu_reset as 0, wake up other 
threads and release write lock. Recovery thread exit normally.

>
> Other threads call the amdgpu_read_lock to synchronize with 
recovery thread. If it finds that in_gpu_reset is 1, it should 
release read lock if it has holden one, and then blocks itself 
to wait for recovery finished event. If thread successfully 
hold read lock and in_gpu_reset is 0, it continues. It will 
exit nor

Re: [PATCH 0/4] Refine GPU recovery sequence to enhance its stability

2021-04-08 Thread Christian König
h...@amd.com>>
*Betreff:* RE: [PATCH 0/4] Refine GPU recovery sequence to 
enhance its stability


>>> Those two steps need to be exchanged or otherwise it is 
possible that new delayed work items etc are started before the 
lock is taken.
What about adding check for adev->in_gpu_reset in work item? If 
exchange the two steps, it maybe introduce the deadlock.  For 
example, the user thread hold the read lock and waiting for the 
fence, if recovery thread try to hold write lock and then 
complete fences, in this case, recovery thread will always be 
blocked.



Best Regards
Dennis Li
-Original Message-
From: Koenig, Christian <mailto:christian.koe...@amd.com>>

Sent: Thursday, March 18, 2021 3:54 PM
To: Li, Dennis mailto:dennis...@amd.com>>; 
amd-gfx@lists.freedesktop.org 
<mailto:amd-gfx@lists.freedesktop.org>; Deucher, Alexander 
mailto:alexander.deuc...@amd.com>>; 
Kuehling, Felix <mailto:felix.kuehl...@amd.com>>; Zhang, Hawking 
mailto:hawking.zh...@amd.com>>
Subject: Re: [PATCH 0/4] Refine GPU recovery sequence to 
enhance its stability


Am 18.03.21 um 08:23 schrieb Dennis Li:
> We have defined two variables in_gpu_reset and reset_sem in 
adev object. The atomic type variable in_gpu_reset is used to 
avoid recovery thread reenter and make lower functions return 
more earlier when recovery start, but couldn't block recovery 
thread when it access hardware. The r/w semaphore reset_sem is 
used to solve these synchronization issues between recovery 
thread and other threads.

>
> The original solution locked registers' access in lower 
functions, which will introduce following issues:

>
> 1) many lower functions are used in both recovery thread and 
others. Firstly we must harvest these functions, it is easy to 
miss someones. Secondly these functions need select which lock 
(read lock or write lock) will be used, according to the thread 
it is running in. If the thread context isn't considered, the 
added lock will easily introduce deadlock. Besides that, in 
most time, developer easily forget to add locks for new functions.

>
> 2) performance drop. More lower functions are more frequently 
called.

>
> 3) easily introduce false positive lockdep complaint, because 
write lock has big range in recovery thread, but low level 
functions will hold read lock may be protected by other locks 
in other threads.

>
> Therefore the new solution will try to add lock protection 
for ioctls of kfd. Its goal is that there are no threads except 
for recovery thread or its children (for xgmi) to access 
hardware when doing GPU reset and resume. So refine recovery 
thread as the following:

>
> Step 0: atomic_cmpxchg(&adev->in_gpu_reset, 0, 1)
> 1). if failed, it means system had a recovery thread 
running, current thread exit directly;

> 2). if success, enter recovery thread;
>
> Step 1: cancel all delay works, stop drm schedule, complete 
all unreceived fences and so on. It try to stop or pause other 
threads.

>
> Step 2: call down_write(&adev->reset_sem) to hold write lock, 
which will block recovery thread until other threads release 
read locks.


Those two steps need to be exchanged or otherwise it is 
possible that new delayed work items etc are started before the 
lock is taken.


Just to make it clear until this is fixed the whole patch set 
is a NAK.


Regards,
Christian.

>
> Step 3: normally, there is only recovery threads running to 
access hardware, it is safe to do gpu reset now.

>
> Step 4: do post gpu reset, such as call all ips' resume 
functions;

>
> Step 5: atomic set adev->in_gpu_reset as 0, wake up other 
threads and release write lock. Recovery thread exit normally.

>
> Other threads call the amdgpu_read_lock to synchronize with 
recovery thread. If it finds that in_gpu_reset is 1, it should 
release read lock if it has holden one, and then blocks itself 
to wait for recovery finished event. If thread successfully 
hold read lock and in_gpu_reset is 0, it continues. It will 
exit normally or be stopped by recovery thread in step 1.

>
> Dennis Li (4):
>    drm/amdgpu: remove reset lock from low level functions
>    drm/amdgpu: refine the GPU recovery sequence
>    drm/amdgpu: instead of using down/up_read directly
>    drm/amdkfd: add reset lock protection for kfd entry functions
>
> drivers/gpu/drm/amd/amdgpu/amdgpu.h |   6 +
> drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c |  14 +-
> drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 173 
+-

> .../gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c |   8 -
> drivers/gpu/drm/amd/amdgpu/gmc_v10_0.c |   4 +-
> drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c |   9 +-
> drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c |   5 +-
> drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c |   5 +-
> drivers/gpu/drm/amd/amdkfd/kfd_chardev.c | 172 -
> drivers/

Re: [PATCH 0/4] Refine GPU recovery sequence to enhance its stability

2021-04-08 Thread Christian König
, maybe we should handle it the same way as reset or 
maybe we should have it at the top level.



If by top level you mean checking for device unplugged and bailing out 
at the entry to IOCTL or right at start of any work_item/timer 
function we have then seems to me it's better and more clear. Once we 
flushed all of them in flight there is no reason for them to execute 
any more when device is unplugged.


Andrey




Regards,
Christian.

The above approach should allow us to wait for all the IOCTLs in 
flight, together with stopping scheduler threads and cancelling and 
flushing all in flight work items and timers i think It should give 
as full solution for the hot-unplug case

of preventing any MMIO accesses post device pci_remove.

Let me know what you think guys.

Andrey




And then testing, testing, testing to see if we have missed something.

Christian.

Am 05.04.21 um 19:58 schrieb Andrey Grodzovsky:


Denis, Christian, are there any updates in the plan on how to move 
on with this ? As you know I need very similar code for my 
up-streaming of device hot-unplug. My latest solution 
(https://lists.freedesktop.org/archives/amd-gfx/2021-January/058606.html) 
was not acceptable because of low level guards on the register 
accessors level which was hurting performance. Basically I need a 
way to prevent any MMIO write accesses from kernel driver after 
device is removed (UMD accesses are taken care of by page faulting 
dummy page). We are using now hot-unplug code for Freemont program 
and so up-streaming became more of a priority then before. This 
MMIO access issue is currently my main blocker from up-streaming. 
Is there any way I can assist in pushing this on ?


Andrey

On 2021-03-18 5:51 a.m., Christian König wrote:

Am 18.03.21 um 10:30 schrieb Li, Dennis:


>>> The GPU reset doesn't complete the fences we wait for. It 
only completes the hardware fences as part of the reset.


>>> So waiting for a fence while holding the reset lock is 
illegal and needs to be avoided.


I understood your concern. It is more complex for DRM GFX, 
therefore I abandon adding lock protection for DRM ioctls now. 
Maybe we can try to add all kernel  dma_fence waiting in a list, 
and signal all in recovery threads. Do you have same concern for 
compute cases?




Yes, compute (KFD) is even harder to handle.

See you can't signal the dma_fence waiting. Waiting for a 
dma_fence also means you wait for the GPU reset to finish.


When we would signal the dma_fence during the GPU reset then we 
would run into memory corruption because the hardware jobs 
running after the GPU reset would access memory which is already 
freed.


>>> Lockdep also complains about this when it is used correctly. 
The only reason it doesn't complain here is because you use an 
atomic+wait_event instead of a locking primitive.


Agree. This approach will escape the monitor of lockdep.  Its 
goal is to block other threads when GPU recovery thread start. 
But I couldn’t find a better method to solve this problem. Do 
you have some suggestion?




Well, completely abandon those change here.

What we need to do is to identify where hardware access happens 
and then insert taking the read side of the GPU reset lock so 
that we don't wait for a dma_fence or allocate memory, but still 
protect the hardware from concurrent access and reset.
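
As an illustration of that pattern (sketch only; SOME_REG_LO/SOME_REG_HI 
are made-up register names):

static void amdgpu_backend_program_hw(struct amdgpu_device *adev, u64 addr)
{
        /* any dma_fence_wait() or memory allocation has to happen before
         * this point, outside of the reset lock */

        down_read(&adev->reset_sem);            /* exclude a concurrent reset */
        WREG32(SOME_REG_LO, lower_32_bits(addr));
        WREG32(SOME_REG_HI, upper_32_bits(addr));
        up_read(&adev->reset_sem);
}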


Regards,
Christian.


Best Regards

Dennis Li

*From:* Koenig, Christian 
*Sent:* Thursday, March 18, 2021 4:59 PM
*To:* Li, Dennis ; 
amd-gfx@lists.freedesktop.org; Deucher, Alexander 
; Kuehling, Felix 
; Zhang, Hawking 
*Subject:* AW: [PATCH 0/4] Refine GPU recovery sequence to 
enhance its stability


Exactly that's what you don't seem to understand.

The GPU reset doesn't complete the fences we wait for. It only 
completes the hardware fences as part of the reset.


So waiting for a fence while holding the reset lock is illegal 
and needs to be avoided.


Lockdep also complains about this when it is used correctly. The 
only reason it doesn't complain here is because you use an 
atomic+wait_event instead of a locking primitive.


Regards,

Christian.



*Von:*Li, Dennis mailto:dennis...@amd.com>>
*Gesendet:* Donnerstag, 18. März 2021 09:28
*An:* Koenig, Christian <mailto:christian.koe...@amd.com>>; 
amd-gfx@lists.freedesktop.org 
<mailto:amd-gfx@lists.freedesktop.org> 
<mailto:amd-gfx@lists.freedesktop.org>>; Deucher, Alexander 
mailto:alexander.deuc...@amd.com>>; 
Kuehling, Felix <mailto:felix.kuehl...@amd.com>>; Zhang, Hawking 
mailto:hawking.zh...@amd.com>>
*Betreff:* RE: [PATCH 0/4] Refine GPU recovery sequence to 
enhance its stability


>>> Those two steps need to be exchanged or otherwise it is 
possible that new delayed work items etc are started before the 
lock is taken.
What about adding check for adev->in_gpu_reset in work item? If 
exchan

Re: [PATCH 0/4] Refine GPU recovery sequence to enhance its stability

2021-04-07 Thread Andrey Grodzovsky
 in flight there is no reason for them to execute any more when 
device is unplugged.


Andrey




Regards,
Christian.

The above approach should allow us to wait for all the IOCTLs in 
flight, together with stopping scheduler threads and cancelling and 
flushing all in flight work items and timers i think It should give 
as full solution for the hot-unplug case

of preventing any MMIO accesses post device pci_remove.

Let me know what you think guys.

Andrey




And then testing, testing, testing to see if we have missed something.

Christian.

Am 05.04.21 um 19:58 schrieb Andrey Grodzovsky:


Denis, Christian, are there any updates in the plan on how to move 
on with this ? As you know I need very similar code for my 
up-streaming of device hot-unplug. My latest solution 
(https://lists.freedesktop.org/archives/amd-gfx/2021-January/058606.html) 
was not acceptable because of low level guards on the register 
accessors level which was hurting performance. Basically I need a 
way to prevent any MMIO write accesses from kernel driver after 
device is removed (UMD accesses are taken care of by page faulting 
dummy page). We are using now hot-unplug code for Freemont program 
and so up-streaming became more of a priority then before. This 
MMIO access issue is currently my main blocker from up-streaming. 
Is there any way I can assist in pushing this on ?


Andrey

On 2021-03-18 5:51 a.m., Christian König wrote:

Am 18.03.21 um 10:30 schrieb Li, Dennis:


>>> The GPU reset doesn't complete the fences we wait for. It 
only completes the hardware fences as part of the reset.


>>> So waiting for a fence while holding the reset lock is 
illegal and needs to be avoided.


I understood your concern. It is more complex for DRM GFX, 
therefore I abandon adding lock protection for DRM ioctls now. 
Maybe we can try to add all kernel  dma_fence waiting in a list, 
and signal all in recovery threads. Do you have same concern for 
compute cases?




Yes, compute (KFD) is even harder to handle.

See you can't signal the dma_fence waiting. Waiting for a 
dma_fence also means you wait for the GPU reset to finish.


When we would signal the dma_fence during the GPU reset then we 
would run into memory corruption because the hardware jobs running 
after the GPU reset would access memory which is already freed.


>>> Lockdep also complains about this when it is used correctly. 
The only reason it doesn't complain here is because you use an 
atomic+wait_event instead of a locking primitive.


Agree. This approach will escape the monitor of lockdep.  Its 
goal is to block other threads when GPU recovery thread start. 
But I couldn’t find a better method to solve this problem. Do you 
have some suggestion?




Well, completely abandon those change here.

What we need to do is to identify where hardware access happens 
and then insert taking the read side of the GPU reset lock so that 
we don't wait for a dma_fence or allocate memory, but still 
protect the hardware from concurrent access and reset.


Regards,
Christian.


Best Regards

Dennis Li

*From:* Koenig, Christian 
*Sent:* Thursday, March 18, 2021 4:59 PM
*To:* Li, Dennis ; 
amd-gfx@lists.freedesktop.org; Deucher, Alexander 
; Kuehling, Felix 
; Zhang, Hawking 
*Subject:* AW: [PATCH 0/4] Refine GPU recovery sequence to 
enhance its stability


Exactly that's what you don't seem to understand.

The GPU reset doesn't complete the fences we wait for. It only 
completes the hardware fences as part of the reset.


So waiting for a fence while holding the reset lock is illegal 
and needs to be avoided.


Lockdep also complains about this when it is used correctly. The 
only reason it doesn't complain here is because you use an 
atomic+wait_event instead of a locking primitive.


Regards,

Christian.



*Von:*Li, Dennis mailto:dennis...@amd.com>>
*Gesendet:* Donnerstag, 18. März 2021 09:28
*An:* Koenig, Christian <mailto:christian.koe...@amd.com>>; amd-gfx@lists.freedesktop.org 
<mailto:amd-gfx@lists.freedesktop.org> 
<mailto:amd-gfx@lists.freedesktop.org>>; Deucher, Alexander 
mailto:alexander.deuc...@amd.com>>; 
Kuehling, Felix <mailto:felix.kuehl...@amd.com>>; Zhang, Hawking 
mailto:hawking.zh...@amd.com>>
*Betreff:* RE: [PATCH 0/4] Refine GPU recovery sequence to 
enhance its stability


>>> Those two steps need to be exchanged or otherwise it is 
possible that new delayed work items etc are started before the 
lock is taken.
What about adding check for adev->in_gpu_reset in work item? If 
exchange the two steps, it maybe introduce the deadlock. For 
example, the user thread hold the read lock and waiting for the 
fence, if recovery thread try to hold write lock and then 
complete fences, in this case, recovery thread will always be 
blocked.



Best Regards
Dennis Li
-Original Message-
From: Koenig, Christian <

Re: [PATCH 0/4] Refine GPU recovery sequence to enhance its stability

2021-04-07 Thread Christian König
 Christian, are there any updates in the plan on how to move 
on with this ? As you know I need very similar code for my 
up-streaming of device hot-unplug. My latest solution 
(https://lists.freedesktop.org/archives/amd-gfx/2021-January/058606.html) 
was not acceptable because of low level guards on the register 
accessors level which was hurting performance. Basically I need a 
way to prevent any MMIO write accesses from kernel driver after 
device is removed (UMD accesses are taken care of by page faulting 
dummy page). We are using now hot-unplug code for Freemont program 
and so up-streaming became more of a priority then before. This MMIO 
access issue is currently my main blocker from up-streaming. Is 
there any way I can assist in pushing this on ?


Andrey

On 2021-03-18 5:51 a.m., Christian König wrote:

Am 18.03.21 um 10:30 schrieb Li, Dennis:


>>> The GPU reset doesn't complete the fences we wait for. It only 
completes the hardware fences as part of the reset.


>>> So waiting for a fence while holding the reset lock is illegal 
and needs to be avoided.


I understood your concern. It is more complex for DRM GFX, 
therefore I abandon adding lock protection for DRM ioctls now. 
Maybe we can try to add all kernel  dma_fence waiting in a list, 
and signal all in recovery threads. Do you have same concern for 
compute cases?




Yes, compute (KFD) is even harder to handle.

See you can't signal the dma_fence waiting. Waiting for a dma_fence 
also means you wait for the GPU reset to finish.


When we would signal the dma_fence during the GPU reset then we 
would run into memory corruption because the hardware jobs running 
after the GPU reset would access memory which is already freed.


>>> Lockdep also complains about this when it is used correctly. 
The only reason it doesn't complain here is because you use an 
atomic+wait_event instead of a locking primitive.


Agree. This approach will escape the monitor of lockdep.  Its goal 
is to block other threads when GPU recovery thread start. But I 
couldn’t find a better method to solve this problem. Do you have 
some suggestion?




Well, completely abandon those change here.

What we need to do is to identify where hardware access happens and 
then insert taking the read side of the GPU reset lock so that we 
don't wait for a dma_fence or allocate memory, but still protect 
the hardware from concurrent access and reset.


Regards,
Christian.


Best Regards

Dennis Li

*From:* Koenig, Christian 
*Sent:* Thursday, March 18, 2021 4:59 PM
*To:* Li, Dennis ; 
amd-gfx@lists.freedesktop.org; Deucher, Alexander 
; Kuehling, Felix 
; Zhang, Hawking 
*Subject:* AW: [PATCH 0/4] Refine GPU recovery sequence to enhance 
its stability


Exactly that's what you don't seem to understand.

The GPU reset doesn't complete the fences we wait for. It only 
completes the hardware fences as part of the reset.


So waiting for a fence while holding the reset lock is illegal and 
needs to be avoided.


Lockdep also complains about this when it is used correctly. The 
only reason it doesn't complain here is because you use an 
atomic+wait_event instead of a locking primitive.


Regards,

Christian.



*Von:*Li, Dennis mailto:dennis...@amd.com>>
*Gesendet:* Donnerstag, 18. März 2021 09:28
*An:* Koenig, Christian <mailto:christian.koe...@amd.com>>; amd-gfx@lists.freedesktop.org 
<mailto:amd-gfx@lists.freedesktop.org> 
<mailto:amd-gfx@lists.freedesktop.org>>; Deucher, Alexander 
mailto:alexander.deuc...@amd.com>>; 
Kuehling, Felix <mailto:felix.kuehl...@amd.com>>; Zhang, Hawking 
mailto:hawking.zh...@amd.com>>
*Betreff:* RE: [PATCH 0/4] Refine GPU recovery sequence to enhance 
its stability


>>> Those two steps need to be exchanged or otherwise it is 
possible that new delayed work items etc are started before the 
lock is taken.
What about adding check for adev->in_gpu_reset in work item? If 
exchange the two steps, it maybe introduce the deadlock.  For 
example, the user thread hold the read lock and waiting for the 
fence, if recovery thread try to hold write lock and then complete 
fences, in this case, recovery thread will always be blocked.



Best Regards
Dennis Li
-Original Message-
From: Koenig, Christian <mailto:christian.koe...@amd.com>>

Sent: Thursday, March 18, 2021 3:54 PM
To: Li, Dennis mailto:dennis...@amd.com>>; 
amd-gfx@lists.freedesktop.org 
<mailto:amd-gfx@lists.freedesktop.org>; Deucher, Alexander 
mailto:alexander.deuc...@amd.com>>; 
Kuehling, Felix <mailto:felix.kuehl...@amd.com>>; Zhang, Hawking 
mailto:hawking.zh...@amd.com>>
Subject: Re: [PATCH 0/4] Refine GPU recovery sequence to enhance 
its stability


Am 18.03.21 um 08:23 schrieb Dennis Li:
> We have defined two variables in_gpu_reset and reset_sem in adev 
object. The atomic type variable in_g

Re: [PATCH 0/4] Refine GPU recovery sequence to enhance its stability

2021-04-06 Thread Andrey Grodzovsky
e wait for. It only 
completes the hardware fences as part of the reset.


>>> So waiting for a fence while holding the reset lock is illegal 
and needs to be avoided.


I understood your concern. It is more complex for DRM GFX, 
therefore I abandon adding lock protection for DRM ioctls now. 
Maybe we can try to add all kernel  dma_fence waiting in a list, 
and signal all in recovery threads. Do you have same concern for 
compute cases?




Yes, compute (KFD) is even harder to handle.

See you can't signal the dma_fence waiting. Waiting for a dma_fence 
also means you wait for the GPU reset to finish.


When we would signal the dma_fence during the GPU reset then we 
would run into memory corruption because the hardware jobs running 
after the GPU reset would access memory which is already freed.


>>> Lockdep also complains about this when it is used correctly. 
The only reason it doesn't complain here is because you use an 
atomic+wait_event instead of a locking primitive.


Agree. This approach will escape the monitor of lockdep.  Its goal 
is to block other threads when GPU recovery thread start. But I 
couldn’t find a better method to solve this problem. Do you have 
some suggestion?




Well, completely abandon those change here.

What we need to do is to identify where hardware access happens and 
then insert taking the read side of the GPU reset lock so that we 
don't wait for a dma_fence or allocate memory, but still protect the 
hardware from concurrent access and reset.


Regards,
Christian.


Best Regards

Dennis Li

*From:* Koenig, Christian 
*Sent:* Thursday, March 18, 2021 4:59 PM
*To:* Li, Dennis ; 
amd-gfx@lists.freedesktop.org; Deucher, Alexander 
; Kuehling, Felix 
; Zhang, Hawking 
*Subject:* AW: [PATCH 0/4] Refine GPU recovery sequence to enhance 
its stability


Exactly that's what you don't seem to understand.

The GPU reset doesn't complete the fences we wait for. It only 
completes the hardware fences as part of the reset.


So waiting for a fence while holding the reset lock is illegal and 
needs to be avoided.


Lockdep also complains about this when it is used correctly. The 
only reason it doesn't complain here is because you use an 
atomic+wait_event instead of a locking primitive.


Regards,

Christian.



*Von:*Li, Dennis mailto:dennis...@amd.com>>
*Gesendet:* Donnerstag, 18. März 2021 09:28
*An:* Koenig, Christian <mailto:christian.koe...@amd.com>>; amd-gfx@lists.freedesktop.org 
<mailto:amd-gfx@lists.freedesktop.org> 
<mailto:amd-gfx@lists.freedesktop.org>>; Deucher, Alexander 
mailto:alexander.deuc...@amd.com>>; 
Kuehling, Felix <mailto:felix.kuehl...@amd.com>>; Zhang, Hawking 
mailto:hawking.zh...@amd.com>>
*Betreff:* RE: [PATCH 0/4] Refine GPU recovery sequence to enhance 
its stability


>>> Those two steps need to be exchanged or otherwise it is 
possible that new delayed work items etc are started before the 
lock is taken.
What about adding check for adev->in_gpu_reset in work item? If 
exchange the two steps, it maybe introduce the deadlock.  For 
example, the user thread hold the read lock and waiting for the 
fence, if recovery thread try to hold write lock and then complete 
fences, in this case, recovery thread will always be blocked.



Best Regards
Dennis Li
-Original Message-
From: Koenig, Christian <mailto:christian.koe...@amd.com>>

Sent: Thursday, March 18, 2021 3:54 PM
To: Li, Dennis mailto:dennis...@amd.com>>; 
amd-gfx@lists.freedesktop.org 
<mailto:amd-gfx@lists.freedesktop.org>; Deucher, Alexander 
mailto:alexander.deuc...@amd.com>>; 
Kuehling, Felix <mailto:felix.kuehl...@amd.com>>; Zhang, Hawking 
mailto:hawking.zh...@amd.com>>
Subject: Re: [PATCH 0/4] Refine GPU recovery sequence to enhance 
its stability


Am 18.03.21 um 08:23 schrieb Dennis Li:
> We have defined two variables in_gpu_reset and reset_sem in adev 
object. The atomic type variable in_gpu_reset is used to avoid 
recovery thread reenter and make lower functions return more 
earlier when recovery start, but couldn't block recovery thread 
when it access hardware. The r/w semaphore reset_sem is used to 
solve these synchronization issues between recovery thread and 
other threads.

>
> The original solution locked registers' access in lower 
functions, which will introduce following issues:

>
> 1) many lower functions are used in both recovery thread and 
others. Firstly we must harvest these functions, it is easy to miss 
someones. Secondly these functions need select which lock (read 
lock or write lock) will be used, according to the thread it is 
running in. If the thread context isn't considered, the added lock 
will easily introduce deadlock. Besides that, in most time, 
developer easily forget to add locks for new functions.

>
> 2) performance drop. More lower 

Re: [PATCH 0/4] Refine GPU recovery sequence to enhance its stability

2021-04-06 Thread Christian König



Am 06.04.21 um 12:34 schrieb Christian König:

Hi Andrey,

well good question. My job is to watch over the implementation and 
design and while I always help I can adjust anybodies schedule.


That should read "I can't adjust anybodies schedule".

Christian.



Is the patch to print a warning when the hardware is accessed without 
holding the locks merged yet? If not then that would probably be a 
good starting point.


Then we would need to unify this with the SRCU to make sure that we 
have both the reset lock as well as block the hotplug code from 
reusing the MMIO space.


And then testing, testing, testing to see if we have missed something.

Christian.

Am 05.04.21 um 19:58 schrieb Andrey Grodzovsky:


Denis, Christian, are there any updates in the plan on how to move on 
with this ? As you know I need very similar code for my up-streaming 
of device hot-unplug. My latest solution 
(https://lists.freedesktop.org/archives/amd-gfx/2021-January/058606.html) 
was not acceptable because of low level guards on the register 
accessors level which was hurting performance. Basically I need a way 
to prevent any MMIO write accesses from kernel driver after device is 
removed (UMD accesses are taken care of by page faulting dummy page). 
We are using now hot-unplug code for Freemont program and so 
up-streaming became more of a priority then before. This MMIO access 
issue is currently my main blocker from up-streaming. Is there any 
way I can assist in pushing this on ?


Andrey

On 2021-03-18 5:51 a.m., Christian König wrote:

Am 18.03.21 um 10:30 schrieb Li, Dennis:


>>> The GPU reset doesn't complete the fences we wait for. It only 
completes the hardware fences as part of the reset.


>>> So waiting for a fence while holding the reset lock is illegal 
and needs to be avoided.


I understood your concern. It is more complex for DRM GFX, 
therefore I abandon adding lock protection for DRM ioctls now. 
Maybe we can try to add all kernel  dma_fence waiting in a list, 
and signal all in recovery threads. Do you have same concern for 
compute cases?




Yes, compute (KFD) is even harder to handle.

See you can't signal the dma_fence waiting. Waiting for a dma_fence 
also means you wait for the GPU reset to finish.


When we would signal the dma_fence during the GPU reset then we 
would run into memory corruption because the hardware jobs running 
after the GPU reset would access memory which is already freed.


>>> Lockdep also complains about this when it is used correctly. 
The only reason it doesn't complain here is because you use an 
atomic+wait_event instead of a locking primitive.


Agree. This approach will escape the monitor of lockdep.  Its goal 
is to block other threads when GPU recovery thread start. But I 
couldn’t find a better method to solve this problem. Do you have 
some suggestion?




Well, completely abandon those change here.

What we need to do is to identify where hardware access happens and 
then insert taking the read side of the GPU reset lock so that we 
don't wait for a dma_fence or allocate memory, but still protect the 
hardware from concurrent access and reset.


Regards,
Christian.


Best Regards

Dennis Li

*From:* Koenig, Christian 
*Sent:* Thursday, March 18, 2021 4:59 PM
*To:* Li, Dennis ; 
amd-gfx@lists.freedesktop.org; Deucher, Alexander 
; Kuehling, Felix 
; Zhang, Hawking 
*Subject:* AW: [PATCH 0/4] Refine GPU recovery sequence to enhance 
its stability


Exactly that's what you don't seem to understand.

The GPU reset doesn't complete the fences we wait for. It only 
completes the hardware fences as part of the reset.


So waiting for a fence while holding the reset lock is illegal and 
needs to be avoided.


Lockdep also complains about this when it is used correctly. The 
only reason it doesn't complain here is because you use an 
atomic+wait_event instead of a locking primitive.


Regards,

Christian.



*Von:*Li, Dennis mailto:dennis...@amd.com>>
*Gesendet:* Donnerstag, 18. März 2021 09:28
*An:* Koenig, Christian <mailto:christian.koe...@amd.com>>; amd-gfx@lists.freedesktop.org 
<mailto:amd-gfx@lists.freedesktop.org> 
<mailto:amd-gfx@lists.freedesktop.org>>; Deucher, Alexander 
mailto:alexander.deuc...@amd.com>>; 
Kuehling, Felix <mailto:felix.kuehl...@amd.com>>; Zhang, Hawking 
mailto:hawking.zh...@amd.com>>
*Betreff:* RE: [PATCH 0/4] Refine GPU recovery sequence to enhance 
its stability


>>> Those two steps need to be exchanged or otherwise it is 
possible that new delayed work items etc are started before the 
lock is taken.
What about adding check for adev->in_gpu_reset in work item? If 
exchange the two steps, it maybe introduce the deadlock.  For 
example, the user thread hold the read lock and waiting for the 
fence, if recovery thread try to hold write lock and then complete 
f

Re: [PATCH 0/4] Refine GPU recovery sequence to enhance its stability

2021-04-06 Thread Christian König

Hi Andrey,

well good question. My job is to watch over the implementation and 
design and while I always help I can adjust anybodies schedule.


Is the patch to print a warning when the hardware is accessed without 
holding the locks merged yet? If not then that would probably be a good 
starting point.


Then we would need to unify this with the SRCU to make sure that we have 
both the reset lock as well as block the hotplug code from reusing the 
MMIO space.


And then testing, testing, testing to see if we have missed something.

Christian.

Am 05.04.21 um 19:58 schrieb Andrey Grodzovsky:


Denis, Christian, are there any updates in the plan on how to move on 
with this ? As you know I need very similar code for my up-streaming 
of device hot-unplug. My latest solution 
(https://lists.freedesktop.org/archives/amd-gfx/2021-January/058606.html) 
was not acceptable because of low level guards on the register 
accessors level which was hurting performance. Basically I need a way 
to prevent any MMIO write accesses from kernel driver after device is 
removed (UMD accesses are taken care of by page faulting dummy page). 
We are using now hot-unplug code for Freemont program and so 
up-streaming became more of a priority then before. This MMIO access 
issue is currently my main blocker from up-streaming. Is there any way 
I can assist in pushing this on ?


Andrey

On 2021-03-18 5:51 a.m., Christian König wrote:

Am 18.03.21 um 10:30 schrieb Li, Dennis:


>>> The GPU reset doesn't complete the fences we wait for. It only 
completes the hardware fences as part of the reset.


>>> So waiting for a fence while holding the reset lock is illegal 
and needs to be avoided.


I understood your concern. It is more complex for DRM GFX, therefore 
I abandon adding lock protection for DRM ioctls now. Maybe we can 
try to add all kernel  dma_fence waiting in a list, and signal all 
in recovery threads. Do you have same concern for compute cases?




Yes, compute (KFD) is even harder to handle.

See you can't signal the dma_fence waiting. Waiting for a dma_fence 
also means you wait for the GPU reset to finish.


When we would signal the dma_fence during the GPU reset then we would 
run into memory corruption because the hardware jobs running after 
the GPU reset would access memory which is already freed.


>>> Lockdep also complains about this when it is used correctly. The 
only reason it doesn't complain here is because you use an 
atomic+wait_event instead of a locking primitive.


Agree. This approach will escape the monitor of lockdep.  Its goal 
is to block other threads when GPU recovery thread start. But I 
couldn’t find a better method to solve this problem. Do you have 
some suggestion?




Well, completely abandon those change here.

What we need to do is to identify where hardware access happens and 
then insert taking the read side of the GPU reset lock so that we 
don't wait for a dma_fence or allocate memory, but still protect the 
hardware from concurrent access and reset.


Regards,
Christian.


Best Regards

Dennis Li

*From:* Koenig, Christian 
*Sent:* Thursday, March 18, 2021 4:59 PM
*To:* Li, Dennis ; amd-gfx@lists.freedesktop.org; 
Deucher, Alexander ; Kuehling, Felix 
; Zhang, Hawking 
*Subject:* AW: [PATCH 0/4] Refine GPU recovery sequence to enhance 
its stability


Exactly that's what you don't seem to understand.

The GPU reset doesn't complete the fences we wait for. It only 
completes the hardware fences as part of the reset.


So waiting for a fence while holding the reset lock is illegal and 
needs to be avoided.


Lockdep also complains about this when it is used correctly. The 
only reason it doesn't complain here is because you use an 
atomic+wait_event instead of a locking primitive.


Regards,

Christian.



*Von:*Li, Dennis mailto:dennis...@amd.com>>
*Gesendet:* Donnerstag, 18. März 2021 09:28
*An:* Koenig, Christian <mailto:christian.koe...@amd.com>>; amd-gfx@lists.freedesktop.org 
<mailto:amd-gfx@lists.freedesktop.org> 
<mailto:amd-gfx@lists.freedesktop.org>>; Deucher, Alexander 
mailto:alexander.deuc...@amd.com>>; 
Kuehling, Felix <mailto:felix.kuehl...@amd.com>>; Zhang, Hawking 
mailto:hawking.zh...@amd.com>>
*Betreff:* RE: [PATCH 0/4] Refine GPU recovery sequence to enhance 
its stability


>>> Those two steps need to be exchanged or otherwise it is possible 
that new delayed work items etc are started before the lock is taken.
What about adding check for adev->in_gpu_reset in work item? If 
exchange the two steps, it maybe introduce the deadlock.  For 
example, the user thread hold the read lock and waiting for the 
fence, if recovery thread try to hold write lock and then complete 
fences, in this case, recovery thread will always be blocked.



Best Regards
Dennis Li
-Original Message-
From: Koenig, C

Re: [PATCH 0/4] Refine GPU recovery sequence to enhance its stability

2021-04-05 Thread Andrey Grodzovsky
Denis, Christian, are there any updates in the plan on how to move on 
with this ? As you know I need very similar code for my up-streaming of 
device hot-unplug. My latest solution 
(https://lists.freedesktop.org/archives/amd-gfx/2021-January/058606.html) 
was not acceptable because of low level guards on the register accessors 
level which was hurting performance. Basically I need a way to prevent 
any MMIO write accesses from kernel driver after device is removed (UMD 
accesses are taken care of by page faulting dummy page). We are using 
now hot-unplug code for Freemont program and so up-streaming became more 
of a priority than before. This MMIO access issue is currently my main 
blocker from up-streaming. Is there any way I can assist in pushing this 
on ?


Andrey

On 2021-03-18 5:51 a.m., Christian König wrote:

Am 18.03.21 um 10:30 schrieb Li, Dennis:


>>> The GPU reset doesn't complete the fences we wait for. It only 
completes the hardware fences as part of the reset.


>>> So waiting for a fence while holding the reset lock is illegal 
and needs to be avoided.


I understood your concern. It is more complex for DRM GFX, therefore 
I abandon adding lock protection for DRM ioctls now. Maybe we can try 
to add all kernel  dma_fence waiting in a list, and signal all in 
recovery threads. Do you have same concern for compute cases?




Yes, compute (KFD) is even harder to handle.

See you can't signal the dma_fence waiting. Waiting for a dma_fence 
also means you wait for the GPU reset to finish.


When we would signal the dma_fence during the GPU reset then we would 
run into memory corruption because the hardware jobs running after the 
GPU reset would access memory which is already freed.


>>> Lockdep also complains about this when it is used correctly. The 
only reason it doesn't complain here is because you use an 
atomic+wait_event instead of a locking primitive.


Agree. This approach will escape the monitor of lockdep.  Its goal is 
to block other threads when GPU recovery thread start. But I couldn’t 
find a better method to solve this problem. Do you have some suggestion?




Well, completely abandon those change here.

What we need to do is to identify where hardware access happens and 
then insert taking the read side of the GPU reset lock so that we 
don't wait for a dma_fence or allocate memory, but still protect the 
hardware from concurrent access and reset.


Regards,
Christian.


Best Regards

Dennis Li

*From:* Koenig, Christian 
*Sent:* Thursday, March 18, 2021 4:59 PM
*To:* Li, Dennis ; amd-gfx@lists.freedesktop.org; 
Deucher, Alexander ; Kuehling, Felix 
; Zhang, Hawking 
*Subject:* AW: [PATCH 0/4] Refine GPU recovery sequence to enhance 
its stability


Exactly that's what you don't seem to understand.

The GPU reset doesn't complete the fences we wait for. It only 
completes the hardware fences as part of the reset.


So waiting for a fence while holding the reset lock is illegal and 
needs to be avoided.


Lockdep also complains about this when it is used correctly. The only 
reason it doesn't complain here is because you use an 
atomic+wait_event instead of a locking primitive.


Regards,

Christian.



*Von:*Li, Dennis mailto:dennis...@amd.com>>
*Gesendet:* Donnerstag, 18. März 2021 09:28
*An:* Koenig, Christian <mailto:christian.koe...@amd.com>>; amd-gfx@lists.freedesktop.org 
<mailto:amd-gfx@lists.freedesktop.org> <mailto:amd-gfx@lists.freedesktop.org>>; Deucher, Alexander 
mailto:alexander.deuc...@amd.com>>; 
Kuehling, Felix <mailto:felix.kuehl...@amd.com>>; Zhang, Hawking 
mailto:hawking.zh...@amd.com>>
*Betreff:* RE: [PATCH 0/4] Refine GPU recovery sequence to enhance 
its stability


>>> Those two steps need to be exchanged or otherwise it is possible 
that new delayed work items etc are started before the lock is taken.
What about adding check for adev->in_gpu_reset in work item? If 
exchange the two steps, it maybe introduce the deadlock.  For 
example, the user thread hold the read lock and waiting for the 
fence, if recovery thread try to hold write lock and then complete 
fences, in this case, recovery thread will always be blocked.



Best Regards
Dennis Li
-Original Message-
From: Koenig, Christian <mailto:christian.koe...@amd.com>>

Sent: Thursday, March 18, 2021 3:54 PM
To: Li, Dennis mailto:dennis...@amd.com>>; 
amd-gfx@lists.freedesktop.org <mailto:amd-gfx@lists.freedesktop.org>; 
Deucher, Alexander <mailto:alexander.deuc...@amd.com>>; Kuehling, Felix 
mailto:felix.kuehl...@amd.com>>; Zhang, 
Hawking mailto:hawking.zh...@amd.com>>
Subject: Re: [PATCH 0/4] Refine GPU recovery sequence to enhance its 
stability


Am 18.03.21 um 08:23 schrieb Dennis Li:
> We have defined two variables in_gpu_reset and reset_sem in adev 
object. The atomic

Re: [PATCH 0/4] Refine GPU recovery sequence to enhance its stability

2021-03-18 Thread Christian König

Am 18.03.21 um 10:30 schrieb Li, Dennis:


>>> The GPU reset doesn't complete the fences we wait for. It only 
completes the hardware fences as part of the reset.


>>> So waiting for a fence while holding the reset lock is illegal and 
needs to be avoided.


I understood your concern. It is more complex for DRM GFX, therefore I 
abandon adding lock protection for DRM ioctls now. Maybe we can try to 
add all kernel  dma_fence waiting in a list, and signal all in 
recovery threads. Do you have same concern for compute cases?




Yes, compute (KFD) is even harder to handle.

See you can't signal the dma_fence waiting. Waiting for a dma_fence also 
means you wait for the GPU reset to finish.


When we would signal the dma_fence during the GPU reset then we would 
run into memory corruption because the hardware jobs running after the 
GPU reset would access memory which is already freed.


>>> Lockdep also complains about this when it is used correctly. The 
only reason it doesn't complain here is because you use an 
atomic+wait_event instead of a locking primitive.


Agree. This approach will escape the monitor of lockdep.  Its goal is 
to block other threads when GPU recovery thread start. But I couldn’t 
find a better method to solve this problem. Do you have some suggestion?




Well, completely abandon those change here.

What we need to do is to identify where hardware access happens and then 
insert taking the read side of the GPU reset lock so that we don't wait 
for a dma_fence or allocate memory, but still protect the hardware from 
concurrent access and reset.


Regards,
Christian.


Best Regards

Dennis Li

*From:* Koenig, Christian 
*Sent:* Thursday, March 18, 2021 4:59 PM
*To:* Li, Dennis ; amd-gfx@lists.freedesktop.org; 
Deucher, Alexander ; Kuehling, Felix 
; Zhang, Hawking 
*Subject:* AW: [PATCH 0/4] Refine GPU recovery sequence to enhance its 
stability


Exactly that's what you don't seem to understand.

The GPU reset doesn't complete the fences we wait for. It only 
completes the hardware fences as part of the reset.


So waiting for a fence while holding the reset lock is illegal and 
needs to be avoided.


Lockdep also complains about this when it is used correctly. The only 
reason it doesn't complain here is because you use an 
atomic+wait_event instead of a locking primitive.


Regards,

Christian.



*Von:*Li, Dennis mailto:dennis...@amd.com>>
*Gesendet:* Donnerstag, 18. März 2021 09:28
*An:* Koenig, Christian <mailto:christian.koe...@amd.com>>; amd-gfx@lists.freedesktop.org 
<mailto:amd-gfx@lists.freedesktop.org> <mailto:amd-gfx@lists.freedesktop.org>>; Deucher, Alexander 
mailto:alexander.deuc...@amd.com>>; 
Kuehling, Felix <mailto:felix.kuehl...@amd.com>>; Zhang, Hawking 
mailto:hawking.zh...@amd.com>>
*Betreff:* RE: [PATCH 0/4] Refine GPU recovery sequence to enhance its 
stability


>>> Those two steps need to be exchanged or otherwise it is possible 
that new delayed work items etc are started before the lock is taken.
What about adding check for adev->in_gpu_reset in work item? If 
exchange the two steps, it maybe introduce the deadlock.  For example, 
the user thread hold the read lock and waiting for the fence, if 
recovery thread try to hold write lock and then complete fences, in 
this case, recovery thread will always be blocked.



Best Regards
Dennis Li
-Original Message-
From: Koenig, Christian <mailto:christian.koe...@amd.com>>

Sent: Thursday, March 18, 2021 3:54 PM
To: Li, Dennis mailto:dennis...@amd.com>>; 
amd-gfx@lists.freedesktop.org <mailto:amd-gfx@lists.freedesktop.org>; 
Deucher, Alexander <mailto:alexander.deuc...@amd.com>>; Kuehling, Felix 
mailto:felix.kuehl...@amd.com>>; Zhang, 
Hawking mailto:hawking.zh...@amd.com>>
Subject: Re: [PATCH 0/4] Refine GPU recovery sequence to enhance its 
stability


Am 18.03.21 um 08:23 schrieb Dennis Li:
> We have defined two variables in_gpu_reset and reset_sem in adev 
object. The atomic type variable in_gpu_reset is used to avoid 
recovery thread reenter and make lower functions return more earlier 
when recovery start, but couldn't block recovery thread when it access 
hardware. The r/w semaphore reset_sem is used to solve these 
synchronization issues between recovery thread and other threads.

>
> The original solution locked registers' access in lower functions, 
which will introduce following issues:

>
> 1) many lower functions are used in both recovery thread and others. 
Firstly we must harvest these functions, it is easy to miss someones. 
Secondly these functions need select which lock (read lock or write 
lock) will be used, according to the thread it is running in. If the 
thread context isn't considered, the added lock will easily introduce 
deadlock. Besides that, in most time

RE: [PATCH 0/4] Refine GPU recovery sequence to enhance its stability

2021-03-18 Thread Li, Dennis
>>> The GPU reset doesn't complete the fences we wait for. It only completes 
>>> the hardware fences as part of the reset.
>>> So waiting for a fence while holding the reset lock is illegal and needs to 
>>> be avoided.
I understood your concern. It is more complex for DRM GFX, therefore I will 
abandon adding lock protection for the DRM ioctls for now. Maybe we can try to 
add all kernel dma_fence waits to a list and signal them all in the recovery 
thread. Do you have the same concern for the compute cases?

>>> Lockdep also complains about this when it is used correctly. The only 
>>> reason it doesn't complain here is because you use an atomic+wait_event 
>>> instead of a locking primitive.
Agreed. This approach escapes lockdep monitoring. Its goal is to block 
other threads when the GPU recovery thread starts, but I couldn't find a better 
method to solve this problem. Do you have a suggestion?

Best Regards
Dennis Li

From: Koenig, Christian 
Sent: Thursday, March 18, 2021 4:59 PM
To: Li, Dennis ; amd-gfx@lists.freedesktop.org; Deucher, 
Alexander ; Kuehling, Felix 
; Zhang, Hawking 
Subject: AW: [PATCH 0/4] Refine GPU recovery sequence to enhance its stability

Exactly that's what you don't seem to understand.

The GPU reset doesn't complete the fences we wait for. It only completes the 
hardware fences as part of the reset.

So waiting for a fence while holding the reset lock is illegal and needs to be 
avoided.

Lockdep also complains about this when it is used correctly. The only reason it 
doesn't complain here is because you use an atomic+wait_event instead of a 
locking primitive.

Regards,
Christian.


Von: Li, Dennis mailto:dennis...@amd.com>>
Gesendet: Donnerstag, 18. März 2021 09:28
An: Koenig, Christian 
mailto:christian.koe...@amd.com>>; 
amd-gfx@lists.freedesktop.org<mailto:amd-gfx@lists.freedesktop.org> 
mailto:amd-gfx@lists.freedesktop.org>>; Deucher, 
Alexander mailto:alexander.deuc...@amd.com>>; 
Kuehling, Felix mailto:felix.kuehl...@amd.com>>; Zhang, 
Hawking mailto:hawking.zh...@amd.com>>
Betreff: RE: [PATCH 0/4] Refine GPU recovery sequence to enhance its stability

>>> Those two steps need to be exchanged or otherwise it is possible that new 
>>> delayed work items etc are started before the lock is taken.
What about adding check for adev->in_gpu_reset in work item? If exchange the 
two steps, it maybe introduce the deadlock.  For example, the user thread hold 
the read lock and waiting for the fence, if recovery thread try to hold write 
lock and then complete fences, in this case, recovery thread will always be 
blocked.

Best Regards
Dennis Li
-Original Message-
From: Koenig, Christian 
mailto:christian.koe...@amd.com>>
Sent: Thursday, March 18, 2021 3:54 PM
To: Li, Dennis mailto:dennis...@amd.com>>; 
amd-gfx@lists.freedesktop.org<mailto:amd-gfx@lists.freedesktop.org>; Deucher, 
Alexander mailto:alexander.deuc...@amd.com>>; 
Kuehling, Felix mailto:felix.kuehl...@amd.com>>; Zhang, 
Hawking mailto:hawking.zh...@amd.com>>
Subject: Re: [PATCH 0/4] Refine GPU recovery sequence to enhance its stability

Am 18.03.21 um 08:23 schrieb Dennis Li:
> We have defined two variables in_gpu_reset and reset_sem in adev object. The 
> atomic type variable in_gpu_reset is used to avoid recovery thread reenter 
> and make lower functions return more earlier when recovery start, but 
> couldn't block recovery thread when it access hardware. The r/w semaphore 
> reset_sem is used to solve these synchronization issues between recovery 
> thread and other threads.
>
> The original solution locked registers' access in lower functions, which will 
> introduce following issues:
>
> 1) many lower functions are used in both recovery thread and others. Firstly 
> we must harvest these functions, it is easy to miss someones. Secondly these 
> functions need select which lock (read lock or write lock) will be used, 
> according to the thread it is running in. If the thread context isn't 
> considered, the added lock will easily introduce deadlock. Besides that, in 
> most time, developer easily forget to add locks for new functions.
>
> 2) performance drop. More lower functions are more frequently called.
>
> 3) easily introduce false positive lockdep complaint, because write lock has 
> big range in recovery thread, but low level functions will hold read lock may 
> be protected by other locks in other threads.
>
> Therefore the new solution will try to add lock protection for ioctls of kfd. 
> Its goal is that there are no threads except for recovery thread or its 
> children (for xgmi) to access hardware when doing GPU reset and resume. So 
> refine recovery thread as the following:
>
> Step 0: atomic_cmpxchg(&adev->in_gpu_reset, 0, 1)

RE: [PATCH 0/4] Refine GPU recovery sequence to enhance its stability

2021-03-18 Thread Li, Dennis
>>> Those two steps need to be exchanged or otherwise it is possible that new 
>>> delayed work items etc are started before the lock is taken.
What about adding a check for adev->in_gpu_reset in the work item? If we exchange 
the two steps, it may introduce a deadlock. For example, if a user thread holds 
the read lock and waits for a fence while the recovery thread tries to take the 
write lock and then complete the fences, the recovery thread will always be 
blocked. 

Best Regards
Dennis Li
-Original Message-
From: Koenig, Christian  
Sent: Thursday, March 18, 2021 3:54 PM
To: Li, Dennis ; amd-gfx@lists.freedesktop.org; Deucher, 
Alexander ; Kuehling, Felix 
; Zhang, Hawking 
Subject: Re: [PATCH 0/4] Refine GPU recovery sequence to enhance its stability

Am 18.03.21 um 08:23 schrieb Dennis Li:
> We have defined two variables in_gpu_reset and reset_sem in adev object. The 
> atomic type variable in_gpu_reset is used to avoid recovery thread reenter 
> and make lower functions return more earlier when recovery start, but 
> couldn't block recovery thread when it access hardware. The r/w semaphore 
> reset_sem is used to solve these synchronization issues between recovery 
> thread and other threads.
>
> The original solution locked registers' access in lower functions, which will 
> introduce following issues:
>
> 1) many lower functions are used in both recovery thread and others. Firstly 
> we must harvest these functions, it is easy to miss someones. Secondly these 
> functions need select which lock (read lock or write lock) will be used, 
> according to the thread it is running in. If the thread context isn't 
> considered, the added lock will easily introduce deadlock. Besides that, in 
> most time, developer easily forget to add locks for new functions.
>
> 2) performance drop. More lower functions are more frequently called.
>
> 3) easily introduce false positive lockdep complaint, because write lock has 
> big range in recovery thread, but low level functions will hold read lock may 
> be protected by other locks in other threads.
>
> Therefore the new solution will try to add lock protection for ioctls of kfd. 
> Its goal is that there are no threads except for recovery thread or its 
> children (for xgmi) to access hardware when doing GPU reset and resume. So 
> refine recovery thread as the following:
>
> Step 0: atomic_cmpxchg(&adev->in_gpu_reset, 0, 1)
> 1). if failed, it means system had a recovery thread running, current 
> thread exit directly;
> 2). if success, enter recovery thread;
>
> Step 1: cancel all delay works, stop drm schedule, complete all unreceived 
> fences and so on. It try to stop or pause other threads.
>
> Step 2: call down_write(&adev->reset_sem) to hold write lock, which will 
> block recovery thread until other threads release read locks.

Those two steps need to be exchanged or otherwise it is possible that new 
delayed work items etc are started before the lock is taken.

Just to make it clear until this is fixed the whole patch set is a NAK.

Regards,
Christian.

>
> Step 3: normally, there is only recovery threads running to access hardware, 
> it is safe to do gpu reset now.
>
> Step 4: do post gpu reset, such as call all ips' resume functions;
>
> Step 5: atomic set adev->in_gpu_reset as 0, wake up other threads and release 
> write lock. Recovery thread exit normally.
>
> Other threads call the amdgpu_read_lock to synchronize with recovery thread. 
> If it finds that in_gpu_reset is 1, it should release read lock if it has 
> holden one, and then blocks itself to wait for recovery finished event. If 
> thread successfully hold read lock and in_gpu_reset is 0, it continues. It 
> will exit normally or be stopped by recovery thread in step 1.
>
> Dennis Li (4):
>drm/amdgpu: remove reset lock from low level functions
>drm/amdgpu: refine the GPU recovery sequence
>drm/amdgpu: instead of using down/up_read directly
>drm/amdkfd: add reset lock protection for kfd entry functions
>
>   drivers/gpu/drm/amd/amdgpu/amdgpu.h   |   6 +
>   drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c   |  14 +-
>   drivers/gpu/drm/amd/amdgpu/amdgpu_device.c| 173 +-
>   .../gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c|   8 -
>   drivers/gpu/drm/amd/amdgpu/gmc_v10_0.c|   4 +-
>   drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c |   9 +-
>   drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c |   5 +-
>   drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c |   5 +-
>   drivers/gpu/drm/amd/amdkfd/kfd_chardev.c  | 172 -
>   drivers/gpu/drm/amd/amdkfd/kfd_priv.h |   3 +-
>   drivers/gpu/drm/amd/amdkfd/kfd_process.c  |   4 +
>   .../amd/amdkfd/kfd_process_queue_manager.c|  17 ++
>   12 files changed, 345 insertions(+), 75 deletions(-)
>
