Re: [PATCH 0/4] Refine GPU recovery sequence to enhance its stability

2021-04-15 Thread Christian König





[SNIP]


Maybe just empirically - let's try it and see under different 
test scenarios what actually happens  ?


Not a good idea in general; we have that approach way too often at 
AMD and are then surprised that everything works in QA but fails 
in production.


But Daniel already noted in his reply that waiting for a fence 
while holding the SRCU is expected to work.


So let's stick with the approach of high level locking for hotplug.



To my understanding this is true for the other devices, not the one 
being extracted; for that one you still need to do all the HW fence 
signalling dance because the HW is gone and we block any TDRs 
(which won't help anyway).


Andrey



Do you agree to the above ?


Yeah, I think that is correct.

But on the other hand what Daniel reminded me of is that the handling 
needs to be consistent across different devices. And since some devices 
already go with the approach of canceling everything we simply have 
to go down that route as well.


Christian.



What does it mean in our context ? What needs to be done which we are 
not doing now ?


I think we are fine, we just need to continue with the approach of 
forcefully signaling all fences on hotplug.


Christian.



Andrey






Andrey







Christian.



Andrey




Christian.



Andrey



Regards,
Christian.



Andrey









BTW: Could it be that the device SRCU protects more 
than one device and we deadlock because of this?



I haven't actually experienced any deadlock until now 
but, yes, drm_unplug_srcu is defined as static in 
drm_drv.c and so in the presence of multiple devices 
from same or different drivers we in fact are dependent 
on all their critical sections i guess.




Shit, yeah the devil is a squirrel. So for A+I laptops we 
actually need to sync that up with Daniel and the rest of 
the i915 guys.


IIRC we could actually have an amdgpu device in a docking 
station which needs hotplug and the driver might depend 
on waiting for the i915 driver as well.



Can't we propose a patch to make drm_unplug_srcu per 
drm_device ? I don't see why it has to be global and not 
per device thing.


I'm really wondering the same thing for quite a while now.

Adding Daniel as well, maybe he knows why the 
drm_unplug_srcu is global.


Regards,
Christian.



Andrey




Christian.


Andrey




Christian.


Andrey





Andrey




Christian.

    /* Past this point no more fences are submitted to the HW rings, and
     * hence we can safely force-signal all that are currently there.
     * Any subsequently created HW fences will be returned signaled with
     * an error code right away.
     */

    for_each_ring(adev)
        amdgpu_fence_process(ring)

    drm_dev_unplug(dev);
    Stop schedulers
    cancel_sync(all timers and queued works);
    hw_fini
    unmap_mmio

}
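For reference, a rough C sketch of what the pseudocode above could look like in the PCI remove path. amdgpu_fence_process(), drm_dev_unplug(), adev_to_drm() and the ring/fence_drv fields are the existing amdgpu/DRM interfaces; the *_placeholder() helpers only stand in for the "Stop schedulers", "cancel_sync(...)", hw_fini and unmap_mmio steps and are not real functions.

    /* Sketch only: force-signal what is already on the rings, then unplug. */
    static void amdgpu_pci_remove_sketch(struct amdgpu_device *adev)
    {
            struct drm_device *dev = adev_to_drm(adev);
            unsigned int i;

            /* No new HW fences are emitted past this point. */
            for (i = 0; i < AMDGPU_MAX_RINGS; i++) {
                    struct amdgpu_ring *ring = adev->rings[i];

                    if (!ring || !ring->fence_drv.initialized)
                            continue;

                    amdgpu_fence_process(ring);     /* signal everything pending */
            }

            drm_dev_unplug(dev);                    /* flushes in-flight IOCTLs */
            stop_all_schedulers_placeholder(adev);
            cancel_all_timers_and_work_placeholder(adev);
            hw_fini_placeholder(adev);
            unmap_mmio_placeholder(adev);
    }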


Andrey









Alternatively grabbing the reset write side 
and stopping and then restarting the 
scheduler could work as well.


Christian.



I didn't get the above and I don't see why I need to reuse the 
GPU reset rw_lock. I rely on the SRCU unplug flag for unplug. 
Also, it's not clear to me why we are focusing on the scheduler 
threads; any code path that generates HW fences should be 
covered, so any code leading to amdgpu_fence_emit needs to be 
taken into account, such as direct IB submissions, VM flushes etc.


You need to work together with the reset lock anyway, because a 
hotplug could run at the same time as a reset.



Going my way, I indeed now see that I have to take the reset 
write side lock during HW fence signalling in order to protect 
against scheduler/HW fence detachment and reattachment during 
scheduler stop/restart. But if we go with your approach then 
calling drm_dev_unplug and scoping amdgpu_job_timeout with 
drm_dev_enter/exit should be enough to prevent any concurrent 
GPU resets during unplug. In fact I already do it anyway - 
https://cgit.freedesktop.org/~agrodzov/linux/commit/?h=drm-misc-next&id=ef0ea4dd29ef44d2649c5eda16c8f4869acc36b1
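As a sketch of that second part, scoping the TDR handler (amdgpu_job_timedout in the amdgpu code) with drm_dev_enter/exit could look roughly like the following; the shape of the real commit linked above may differ, and the surrounding recovery code is elided. drm_dev_enter/drm_dev_exit and the drm_gpu_sched_stat codes are the real interfaces.

    static enum drm_gpu_sched_stat amdgpu_job_timedout(struct drm_sched_job *s_job)
    {
            struct amdgpu_ring *ring = to_amdgpu_ring(s_job->sched);
            int idx;

            /* Bail out if the device is already unplugged: no GPU reset can
             * race with hotplug once drm_dev_unplug() has run.
             */
            if (!drm_dev_enter(adev_to_drm(ring->adev), &idx))
                    return DRM_GPU_SCHED_STAT_ENODEV;

            /* ... usual timeout handling / amdgpu_device_gpu_recover() ... */

            drm_dev_exit(idx);
            return DRM_GPU_SCHED_STAT_NOMINAL;
    }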



Yes, good point as well.

Christian.



Andrey





Christian.



Andrey






Christian.



Andrey





Andrey

___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx


Re: [PATCH 0/4] Refine GPU recovery sequence to enhance its stability

2021-04-15 Thread Andrey Grodzovsky


On 2021-04-15 3:02 a.m., Christian König wrote:

Am 15.04.21 um 08:27 schrieb Andrey Grodzovsky:


On 2021-04-14 10:58 a.m., Christian König wrote:

Am 14.04.21 um 16:36 schrieb Andrey Grodzovsky:

 [SNIP]


We are racing here once more and need to handle that.



But why? I wrote above that we first stop all the schedulers and 
only then call drm_sched_entity_kill_jobs.


The schedulers consuming jobs is not the problem, we already 
handle that correctly.


The problem is that the entities might continue feeding stuff into 
the scheduler.



Missed that.  Ok, can I just use non sleeping RCU with a flag 
around drm_sched_entity_push_job at the amdgpu level (only 2 
functions call it - amdgpu_cs_submit and amdgpu_job_submit) as a 
preliminary step to flush and block in flight and future 
submissions to entity queue ?


Double checking the code I think we can use the notifier_lock for this.

E.g. in amdgpu_cs.c see where we have the goto error_abort.

That is the place where such a check could be added without any 
additional overhead.



Sure, I will just have to add this lock to amdgpu_job_submit too.


Not ideal, but I think that's fine with me. You might want to 
rename the lock for this, though.
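A minimal sketch of that check, assuming a made-up "job_submit_blocked" flag that hotplug sets under the (renamed) notifier_lock before flushing; adev->notifier_lock and the two-argument drm_sched_entity_push_job() of that time are the real interfaces, everything else here is illustrative.

    /* Called from amdgpu_cs_submit() / amdgpu_job_submit() right before the
     * job would be pushed to the entity (the error_abort spot mentioned above).
     */
    static int amdgpu_push_job_checked(struct amdgpu_device *adev,
                                       struct drm_sched_job *job,
                                       struct drm_sched_entity *entity)
    {
            int r = 0;

            mutex_lock(&adev->notifier_lock);
            if (adev->job_submit_blocked)      /* hypothetical flag set on unplug */
                    r = -ENODEV;               /* device is going away, drop the job */
            else
                    drm_sched_entity_push_job(job, entity);
            mutex_unlock(&adev->notifier_lock);

            return r;
    }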





[SNIP]


Maybe just empirically - let's try it and see under different 
test scenarios what actually happens  ?


Not a good idea in general; we have that approach way too often at 
AMD and are then surprised that everything works in QA but fails 
in production.


But Daniel already noted in his reply that waiting for a fence 
while holding the SRCU is expected to work.


So let's stick with the approach of high level locking for hotplug.



To my understanding this is true for the other devices, not the one 
being extracted; for that one you still need to do all the HW fence 
signalling dance because the HW is gone and we block any TDRs 
(which won't help anyway).


Andrey



Do you agree to the above ?


Yeah, I think that is correct.

But on the other hand what Daniel reminded me of is that the handling 
needs to be consistent across different devices. And since some devices 
already go with the approach of canceling everything we simply have to 
go down that route as well.


Christian.



What does it mean in our context ? What needs to be done which we are 
not doing now ?


Andrey






Andrey







Christian.



Andrey




Christian.



Andrey



Regards,
Christian.



Andrey









BTW: Could it be that the device SRCU protects more than 
one device and we deadlock because of this?



I haven't actually experienced any deadlock until now 
but, yes, drm_unplug_srcu is defined as static in 
drm_drv.c and so in the presence of multiple devices from 
same or different drivers we in fact are dependent on all 
their critical sections i guess.




Shit, yeah the devil is a squirrel. So for A+I laptops we 
actually need to sync that up with Daniel and the rest of 
the i915 guys.


IIRC we could actually have an amdgpu device in a docking 
station which needs hotplug and the driver might depend on 
waiting for the i915 driver as well.



Can't we propose a patch to make drm_unplug_srcu per 
drm_device ? I don't see why it has to be global and not 
per device thing.


I'm really wondering the same thing for quite a while now.

Adding Daniel as well, maybe he knows why the 
drm_unplug_srcu is global.


Regards,
Christian.



Andrey




Christian.


Andrey




Christian.


Andrey





Andrey




Christian.

    /* Past this point no more fence are submitted 
to HW ring and hence we can safely call force 
signal on all that are currently there.
 * Any subsequently created HW fences will be 
returned signaled with an error code right away

 */

    for_each_ring(adev)
amdgpu_fence_process(ring)

    drm_dev_unplug(dev);
    Stop schedulers
    cancel_sync(all timers and queued works);
    hw_fini
    unmap_mmio

}


Andrey









Alternatively grabbing the reset write side 
and stopping and then restarting the scheduler 
could work as well.


Christian.



I didn't get the above and I don't see why I need to reuse the 
GPU reset rw_lock. I rely on the SRCU unplug flag for unplug. 
Also, it's not clear to me why we are focusing on the scheduler 
threads; any code path that generates HW fences should be 
covered, so any code leading to amdgpu_fence_emit needs to be 
taken into account, such as direct IB submissions, VM flushes etc.


You need to work together with the reset lock anyway, because a 
hotplug could run at the same time as a reset.



Going my way, I indeed now see that I have to take the reset 
write side lock during HW fence signalling in order to protect 
against scheduler/HW fence detachment and reattachment during 
scheduler stop/restart. But if we go with your approach then 
calling drm_dev_unplug and scoping amdgpu_job_timeout with 
drm_dev_enter/exit should be enough to prevent any concurrent 
GPU resets during unplug. In fact I already do it anyway - 

Re: [PATCH 0/4] Refine GPU recovery sequence to enhance its stability

2021-04-15 Thread Christian König

Am 15.04.21 um 08:27 schrieb Andrey Grodzovsky:


On 2021-04-14 10:58 a.m., Christian König wrote:

Am 14.04.21 um 16:36 schrieb Andrey Grodzovsky:

 [SNIP]


We are racing here once more and need to handle that.



But why? I wrote above that we first stop all the schedulers and 
only then call drm_sched_entity_kill_jobs.


The schedulers consuming jobs is not the problem, we already 
handle that correctly.


The problem is that the entities might continue feeding stuff into 
the scheduler.



Missed that.  Ok, can I just use non sleeping RCU with a flag around 
drm_sched_entity_push_job at the amdgpu level (only 2 functions call 
it - amdgpu_cs_submit and amdgpu_job_submit) as a preliminary step 
to flush and block in flight and future submissions to entity queue ?


Double checking the code I think we can use the notifier_lock for this.

E.g. in amdgpu_cs.c see where we have the goto error_abort.

That is the place where such a check could be added without any 
additional overhead.



Sure, I will just have to add this lock to amdgpu_job_submit too.


Not ideal, but I think that's fine with me. You might want to 
rename the lock for this, though.





[SNIP]


Maybe just empirically - let's try it and see under different test 
scenarios what actually happens  ?


Not a good idea in general; we have that approach way too often at 
AMD and are then surprised that everything works in QA but fails 
in production.


But Daniel already noted in his reply that waiting for a fence 
while holding the SRCU is expected to work.


So let's stick with the approach of high level locking for hotplug.



To my understanding this is true for the other devices, not the one 
being extracted; for that one you still need to do all the HW fence 
signalling dance because the HW is gone and we block any TDRs (which 
won't help anyway).


Andrey



Do you agree to the above ?


Yeah, I think that is correct.

But on the other hand what Daniel reminded me of is that the handling 
needs to be consistent across different devices. And since some devices 
already go with the approach of canceling everything we simply have to 
go down that route as well.


Christian.



Andrey







Christian.



Andrey




Christian.



Andrey



Regards,
Christian.



Andrey









BTW: Could it be that the device SRCU protects more than 
one device and we deadlock because of this?



I haven't actually experienced any deadlock until now but, 
yes, drm_unplug_srcu is defined as static in drm_drv.c and 
so in the presence of multiple devices from same or 
different drivers we in fact are dependent on all their 
critical sections i guess.




Shit, yeah the devil is a squirrel. So for A+I laptops we 
actually need to sync that up with Daniel and the rest of 
the i915 guys.


IIRC we could actually have an amdgpu device in a docking 
station which needs hotplug and the driver might depend on 
waiting for the i915 driver as well.



Can't we propose a patch to make drm_unplug_srcu per 
drm_device ? I don't see why it has to be global and not per 
device thing.


I'm really wondering the same thing for quite a while now.

Adding Daniel as well, maybe he knows why the drm_unplug_srcu 
is global.


Regards,
Christian.



Andrey




Christian.


Andrey




Christian.


Andrey





Andrey




Christian.

    /* Past this point no more fence are submitted 
to HW ring and hence we can safely call force signal 
on all that are currently there.
 * Any subsequently created HW fences will be 
returned signaled with an error code right away

 */

    for_each_ring(adev)
amdgpu_fence_process(ring)

    drm_dev_unplug(dev);
    Stop schedulers
    cancel_sync(all timers and queued works);
    hw_fini
    unmap_mmio

}


Andrey









Alternatively grabbing the reset write side and 
stopping and then restarting the scheduler 
could work as well.


Christian.



I didn't get the above and I don't see why I need to reuse the 
GPU reset rw_lock. I rely on the SRCU unplug flag for unplug. 
Also, it's not clear to me why we are focusing on the scheduler 
threads; any code path that generates HW fences should be 
covered, so any code leading to amdgpu_fence_emit needs to be 
taken into account, such as direct IB submissions, VM flushes etc.


You need to work together with the reset lock anyway, because a 
hotplug could run at the same time as a reset.



Going my way, I indeed now see that I have to take the reset 
write side lock during HW fence signalling in order to protect 
against scheduler/HW fence detachment and reattachment during 
scheduler stop/restart. But if we go with your approach then 
calling drm_dev_unplug and scoping amdgpu_job_timeout with 
drm_dev_enter/exit should be enough to prevent any concurrent 
GPU resets during unplug. In fact I already do it anyway - 

Re: [PATCH 0/4] Refine GPU recovery sequence to enhance its stability

2021-04-15 Thread Andrey Grodzovsky


On 2021-04-14 10:58 a.m., Christian König wrote:

Am 14.04.21 um 16:36 schrieb Andrey Grodzovsky:

 [SNIP]


We are racing here once more and need to handle that.



But why? I wrote above that we first stop all the schedulers and 
only then call drm_sched_entity_kill_jobs.


The schedulers consuming jobs is not the problem, we already 
handle that correctly.


The problem is that the entities might continue feeding stuff into 
the scheduler.



Missed that.  Ok, can I just use non sleeping RCU with a flag around 
drm_sched_entity_push_job at the amdgpu level (only 2 functions call 
it - amdgpu_cs_submit and amdgpu_job_submit) as a preliminary step to 
flush and block in flight and future submissions to entity queue ?


Double checking the code I think we can use the notifier_lock for this.

E.g. in amdgpu_cs.c see where we have the goto error_abort.

That is the place where such a check could be added without any 
additional overhead.



Sure, I will just have to add this lock to amdgpu_job_submit too.




Christian.










For waiting for other device I have no idea if that couldn't 
deadlock somehow.



Yea, not sure for imported fences and dma_bufs, I would assume 
the other devices should not be impacted by our device removal 
but, who knows...


So I guess we are NOT going with finalizing HW fences before 
drm_dev_unplug and instead will just call drm_dev_enter/exit at 
the back-ends all over the place where there are MMIO accesses ?


Good question. As you said that is really the hard path.

Handling it all at once at IOCTL level certainly has some appeal 
as well, but I have no idea if we can guarantee that this is lock 
free.



Maybe just empirically - let's try it and see under different test 
scenarios what actually happens  ?


Not a good idea in general; we have that approach way too often at 
AMD and are then surprised that everything works in QA but fails in 
production.


But Daniel already noted in his reply that waiting for a fence while 
holding the SRCU is expected to work.


So let's stick with the approach of high level locking for hotplug.



To my understanding this is true for the other devices, not the one being 
extracted; for that one you still need to do all the HW fence signalling 
dance because the HW is gone and we block any TDRs (which won't help 
anyway).


Andrey



Do you agree to the above ?

Andrey







Christian.



Andrey




Christian.



Andrey



Regards,
Christian.



Andrey









BTW: Could it be that the device SRCU protects more than 
one device and we deadlock because of this?



I haven't actually experienced any deadlock until now but, 
yes, drm_unplug_srcu is defined as static in drm_drv.c and 
so in the presence of multiple devices from same or 
different drivers we in fact are dependent on all their 
critical sections i guess.




Shit, yeah the devil is a squirrel. So for A+I laptops we 
actually need to sync that up with Daniel and the rest of 
the i915 guys.


IIRC we could actually have an amdgpu device in a docking 
station which needs hotplug and the driver might depend on 
waiting for the i915 driver as well.



Can't we propose a patch to make drm_unplug_srcu per 
drm_device ? I don't see why it has to be global and not per 
device thing.


I'm really wondering the same thing for quite a while now.

Adding Daniel as well, maybe he knows why the drm_unplug_srcu 
is global.


Regards,
Christian.



Andrey




Christian.


Andrey




Christian.


Andrey





Andrey




Christian.

    /* Past this point no more fence are submitted to 
HW ring and hence we can safely call force signal on 
all that are currently there.
 * Any subsequently created HW fences will be 
returned signaled with an error code right away

 */

    for_each_ring(adev)
amdgpu_fence_process(ring)

    drm_dev_unplug(dev);
    Stop schedulers
    cancel_sync(all timers and queued works);
    hw_fini
    unmap_mmio

}


Andrey









Alternatively grabbing the reset write side and 
stopping and then restarting the scheduler could 
work as well.


Christian.



I didn't get the above and I don't see why I need to reuse the 
GPU reset rw_lock. I rely on the SRCU unplug flag for unplug. 
Also, it's not clear to me why we are focusing on the scheduler 
threads; any code path that generates HW fences should be 
covered, so any code leading to amdgpu_fence_emit needs to be 
taken into account, such as direct IB submissions, VM flushes etc.


You need to work together with the reset lock anyway, because a 
hotplug could run at the same time as a reset.



Going my way, I indeed now see that I have to take the reset 
write side lock during HW fence signalling in order to protect 
against scheduler/HW fence detachment and reattachment during 
scheduler stop/restart. But if we go with your approach then 
calling drm_dev_unplug and scoping amdgpu_job_timeout with 
drm_dev_enter/exit should be enough to prevent any concurrent 
GPU resets during unplug. In fact I already do it anyway 

Re: [PATCH 0/4] Refine GPU recovery sequence to enhance its stability

2021-04-14 Thread Christian König

Am 14.04.21 um 16:36 schrieb Andrey Grodzovsky:

 [SNIP]


We are racing here once more and need to handle that.



But why? I wrote above that we first stop all the schedulers and 
only then call drm_sched_entity_kill_jobs.


The schedulers consuming jobs is not the problem, we already 
handle that correctly.


The problem is that the entities might continue feeding stuff into 
the scheduler.



Missed that.  Ok, can I just use non sleeping RCU with a flag around 
drm_sched_entity_push_job at the amdgpu level (only 2 functions call 
it - amdgpu_cs_submit and amdgpu_job_submit) as a preliminary step to 
flush and block in flight and future submissions to entity queue ?


Double checking the code I think we can use the notifier_lock for this.

E.g. in amdgpu_cs.c see where we have the goto error_abort.

That is the place where such a check could be added without any 
additional overhead.


Christian.










For waiting for other device I have no idea if that couldn't 
deadlock somehow.



Yea, not sure for imported fences and dma_bufs, I would assume the 
other devices should not be impacted by our device removal but, 
who knows...


So I guess we are NOT going with finalizing HW fences before 
drm_dev_unplug and instead will just call drm_dev_enter/exit at 
the back-ends all over the place where there are MMIO accesses ?


Good question. As you said that is really the hard path.

Handling it all at once at IOCTL level certainly has some appeal as 
well, but I have no idea if we can guarantee that this is lock free.



Maybe just empirically - let's try it and see under different test 
scenarios what actually happens  ?


Not a good idea in general; we have that approach way too often at AMD 
and are then surprised that everything works in QA but fails in 
production.


But Daniel already noted in his reply that waiting for a fence while 
holding the SRCU is expected to work.


So let's stick with the approach of high level locking for hotplug.



To my understanding this is true for the other devices, not the one being 
extracted; for that one you still need to do all the HW fence signalling 
dance because the HW is gone and we block any TDRs (which won't help 
anyway).


Andrey




Christian.



Andrey




Christian.



Andrey



Regards,
Christian.



Andrey









BTW: Could it be that the device SRCU protects more than 
one device and we deadlock because of this?



I haven't actually experienced any deadlock until now but, 
yes, drm_unplug_srcu is defined as static in drm_drv.c and 
so in the presence of multiple devices from same or 
different drivers we in fact are dependent on all their 
critical sections i guess.




Shit, yeah the devil is a squirrel. So for A+I laptops we 
actually need to sync that up with Daniel and the rest of the 
i915 guys.


IIRC we could actually have an amdgpu device in a docking 
station which needs hotplug and the driver might depend on 
waiting for the i915 driver as well.



Can't we propose a patch to make drm_unplug_srcu per 
drm_device ? I don't see why it has to be global and not per 
device thing.


I'm really wondering the same thing for quite a while now.

Adding Daniel as well, maybe he knows why the drm_unplug_srcu 
is global.


Regards,
Christian.



Andrey




Christian.


Andrey




Christian.


Andrey





Andrey




Christian.

    /* Past this point no more fence are submitted to 
HW ring and hence we can safely call force signal on 
all that are currently there.
 * Any subsequently created HW fences will be 
returned signaled with an error code right away

 */

    for_each_ring(adev)
        amdgpu_fence_process(ring)

    drm_dev_unplug(dev);
    Stop schedulers
    cancel_sync(all timers and queued works);
    hw_fini
    unmap_mmio

}


Andrey









Alternatively grabbing the reset write side and 
stopping and then restarting the scheduler could 
work as well.


Christian.



I didn't get the above and I don't see why I need 
to reuse the GPU reset rw_lock. I rely on the SRCU 
unplug flag for unplug. Also, not clear to me why 
are we focusing on the scheduler threads, any code 
patch to generate HW fences should be covered, so 
any code leading to amdgpu_fence_emit needs to be 
taken into account such as, direct IB submissions, 
VM flushes e.t.c


You need to work together with the reset lock anyway, because a 
hotplug could run at the same time as a reset.



Going my way, I indeed now see that I have to take the reset 
write side lock during HW fence signalling in order to protect 
against scheduler/HW fence detachment and reattachment during 
scheduler stop/restart. But if we go with your approach then 
calling drm_dev_unplug and scoping amdgpu_job_timeout with 
drm_dev_enter/exit should be enough to prevent any concurrent 
GPU resets during unplug. In fact I already do it anyway - 

Re: [PATCH 0/4] Refine GPU recovery sequence to enhance its stability

2021-04-14 Thread Andrey Grodzovsky


On 2021-04-14 3:01 a.m., Christian König wrote:

Am 13.04.21 um 20:30 schrieb Andrey Grodzovsky:


On 2021-04-13 2:25 p.m., Christian König wrote:



Am 13.04.21 um 20:18 schrieb Andrey Grodzovsky:


On 2021-04-13 2:03 p.m., Christian König wrote:

Am 13.04.21 um 17:12 schrieb Andrey Grodzovsky:


On 2021-04-13 3:10 a.m., Christian König wrote:

Am 12.04.21 um 22:01 schrieb Andrey Grodzovsky:


On 2021-04-12 3:18 p.m., Christian König wrote:

Am 12.04.21 um 21:12 schrieb Andrey Grodzovsky:

[SNIP]


So what's the right approach ? How do we guarantee that when 
running amdgpu_fence_driver_force_completion we signal all 
the HW fences and don't race against more fences being 
inserted into that array ?




Well I would still say the best approach would be to insert 
this between the front end and the backend and not rely on 
signaling fences while holding the device srcu.



My question is, even now, when we run 
amdgpu_fence_driver_fini_hw->amdgpu_fence_wait_empty or 
amdgpu_fence_driver_fini_hw->amdgpu_fence_driver_force_completion, 
what prevents a race with another fence being emitted and 
inserted into the fence array at the same time ? Looks 
like nothing.




Each ring can only be used by one thread at the same time, 
this includes emitting fences as well as other stuff.


During GPU reset we make sure nobody writes to the rings by 
stopping the scheduler and taking the GPU reset lock (so that 
nobody else can start the scheduler again).



What about direct submissions not through scheduler - 
amdgpu_job_submit_direct, I don't see how this is protected.


Those only happen during startup and GPU reset.



Ok, but then it looks like I am missing something; see the following 
steps in amdgpu_pci_remove -


1) Use the disable_irq API to stop and flush all in-flight 
HW interrupt handlers


2) Grab the reset lock and stop all the schedulers

After above 2 steps the HW fences array is idle, no more 
insertions and no more extractions from the array


3) Run one time amdgpu_fence_process to signal all current HW fences

4) Set drm_dev_unplug (will 'flush' all in flight IOCTLs), 
release the GPU reset lock and go on with the rest of the 
sequence (cancel timers, work items e.t.c)


What's problematic in this sequence ?


drm_dev_unplug() will wait for the IOCTLs to finish.

The IOCTLs in turn can wait for fences. That can be both hardware 
fences, scheduler fences, as well as fences from other devices 
(and KIQ fences for register writes under SRIOV, but we can 
hopefully ignore them for now).


We have handled the hardware fences, but we have no idea when the 
scheduler fences or the fences from other devices will signal.


Scheduler fences won't signal until the scheduler threads are 
restarted or we somehow cancel the submissions. Doable, but tricky 
as well.



For scheduler fences I am not worried; the 
sched_fence->finished fences are by definition attached to HW 
fences which already signaled. For sched_fence->scheduled we should 
run drm_sched_entity_kill_jobs for each entity after stopping the 
scheduler threads and before setting drm_dev_unplug.


Well exactly that is what is tricky here. 
drm_sched_entity_kill_jobs() assumes that there are no more jobs 
pushed into the entity.


We are racing here once more and need to handle that.



But why? I wrote above that we first stop all the schedulers and 
only then call drm_sched_entity_kill_jobs.


The schedulers consuming jobs is not the problem, we already 
handle that correctly.


The problem is that the entities might continue feeding stuff into the 
scheduler.



Missed that.  Ok, can I just use non sleeping RCU with a flag around 
drm_sched_entity_push_job at the amdgpu level (only 2 functions call it 
- amdgpu_cs_submit and amdgpu_job_submit) as a preliminary step to flush 
and block in flight and future submissions to entity queue ?









For waiting for other device I have no idea if that couldn't 
deadlock somehow.



Yea, not sure for imported fences and dma_bufs, I would assume the 
other devices should not be impacted by our device removal but, who 
knows...


So I guess we are NOT going with finalizing HW fences before 
drm_dev_unplug and instead will just call drm_dev_enter/exit at the 
back-ends all over the place where there are MMIO accesses ?


Good question. As you said that is really the hard path.

Handling it all at once at IOCTL level certainly has some appeal as 
well, but I have no idea if we can guarantee that this is lock free.



Maybe just empirically - let's try it and see under different test 
scenarios what actually happens  ?


Not a good idea in general; we have that approach way too often at AMD 
and are then surprised that everything works in QA but fails in 
production.


But Daniel already noted in his reply that waiting for a fence while 
holding the SRCU is expected to work.


So let's stick with the approach of high level locking for hotplug.



To my understanding this is true for other 

Re: [PATCH 0/4] Refine GPU recovery sequence to enhance its stability

2021-04-14 Thread Christian König

Am 13.04.21 um 20:30 schrieb Andrey Grodzovsky:


On 2021-04-13 2:25 p.m., Christian König wrote:



Am 13.04.21 um 20:18 schrieb Andrey Grodzovsky:


On 2021-04-13 2:03 p.m., Christian König wrote:

Am 13.04.21 um 17:12 schrieb Andrey Grodzovsky:


On 2021-04-13 3:10 a.m., Christian König wrote:

Am 12.04.21 um 22:01 schrieb Andrey Grodzovsky:


On 2021-04-12 3:18 p.m., Christian König wrote:

Am 12.04.21 um 21:12 schrieb Andrey Grodzovsky:

[SNIP]


So what's the right approach ? How do we guarantee that when 
running amdgpu_fence_driver_force_completion we signal 
all the HW fences and don't race against more fences 
being inserted into that array ?




Well I would still say the best approach would be to insert 
this between the front end and the backend and not rely on 
signaling fences while holding the device srcu.



My question is, even now, when we run 
amdgpu_fence_driver_fini_hw->amdgpu_fence_wait_empty or 
amdgpu_fence_driver_fini_hw->amdgpu_fence_driver_force_completion, 
what prevents a race with another fence being emitted and 
inserted into the fence array at the same time ? Looks 
like nothing.




Each ring can only be used by one thread at the same time, this 
includes emitting fences as well as other stuff.


During GPU reset we make sure nobody writes to the rings by 
stopping the scheduler and taking the GPU reset lock (so that 
nobody else can start the scheduler again).



What about direct submissions not through scheduler - 
amdgpu_job_submit_direct, I don't see how this is protected.


Those only happen during startup and GPU reset.



Ok, but then looks like I am missing something, see the following 
steps in amdgpu_pci_remove -


1) Use disable_irq API function to stop and flush all in flight HW 
interrupts handlers


2) Grab the reset lock and stop all the schedulers

After above 2 steps the HW fences array is idle, no more 
insertions and no more extractions from the array


3) Run one time amdgpu_fence_process to signal all current HW fences

4) Set drm_dev_unplug (will 'flush' all in flight IOCTLs), release 
the GPU reset lock and go on with the rest of the sequence (cancel 
timers, work items e.t.c)


What's problematic in this sequence ?


drm_dev_unplug() will wait for the IOCTLs to finish.

The IOCTLs in turn can wait for fences. That can be both hardware 
fences, scheduler fences, as well as fences from other devices (and 
KIQ fences for register writes under SRIOV, but we can hopefully 
ignore them for now).


We have handled the hardware fences, but we have no idea when the 
scheduler fences or the fences from other devices will signal.


Scheduler fences won't signal until the scheduler threads are 
restarted or we somehow cancel the submissions. Doable, but tricky 
as well.



For scheduler fences I am not worried; the sched_fence->finished 
fences are by definition attached to HW fences which already 
signaled. For sched_fence->scheduled we should run 
drm_sched_entity_kill_jobs for each entity after stopping the 
scheduler threads and before setting drm_dev_unplug.


Well exactly that is what is tricky here. 
drm_sched_entity_kill_jobs() assumes that there are no more jobs 
pushed into the entity.


We are racing here once more and need to handle that.



But why? I wrote above that we first stop all the schedulers and 
only then call drm_sched_entity_kill_jobs.


The schedulers consuming jobs is not the problem, we already handle that 
correctly.


The problem is that the entities might continue feeding stuff into the 
scheduler.






For waiting for other device I have no idea if that couldn't 
deadlock somehow.



Yea, not sure for imported fences and dma_bufs, I would assume the 
other devices should not be impacted by our device removal but, who 
knows...


So I guess we are NOT going with finalizing HW fences before 
drm_dev_unplug and instead will just call drm_dev_enter/exit at the 
back-ends all over the place where there are MMIO accesses ?


Good question. As you said that is really the hard path.

Handling it all at once at IOCTL level certainly has some appeal as 
well, but I have no idea if we can guarantee that this is lock free.



Maybe just empirically - let's try it and see under different test 
scenarios what actually happens  ?


Not a good idea in general; we have that approach way too often at AMD 
and are then surprised that everything works in QA but fails in production.


But Daniel already noted in his reply that waiting for a fence while 
holding the SRCU is expected to work.


So let's stick with the approach of high level locking for hotplug.

Christian.



Andrey




Christian.



Andrey



Regards,
Christian.



Andrey









BTW: Could it be that the device SRCU protects more than one 
device and we deadlock because of this?



I haven't actually experienced any deadlock until now but, 
yes, drm_unplug_srcu is defined as static in drm_drv.c and so 
in the presence  of multiple devices from same or different 
drivers 

Re: [PATCH 0/4] Refine GPU recovery sequence to enhance its stability

2021-04-13 Thread Daniel Vetter
On Tue, Apr 13, 2021 at 11:13 AM Li, Dennis  wrote:
>
> [AMD Official Use Only - Internal Distribution Only]
>
> Hi, Christian and Andrey,
>   Maybe we could try to implement the "wait" callback function of dma_fence_ops; 
> when GPU reset or unplug happens, make this callback return -ENODEV to 
> notify the caller that the device is lost.
>
>  * Must return -ERESTARTSYS if the wait is intr = true and the wait 
> was
>  * interrupted, and remaining jiffies if fence has signaled, or 0 if 
> wait
>  * timed out. Can also return other error values on custom 
> implementations,
>  * which should be treated as if the fence is signaled. For example a 
> hardware
>  * lockup could be reported like that.
>  *
>  * This callback is optional.
>  */
> signed long (*wait)(struct dma_fence *fence,
> bool intr, signed long timeout);

Uh, this callback is for old horrors like unreliable irq delivery on
radeon. Please don't use it for anything; if we need to make fences
bail out on error then we need something that works for all fences.
Also TDR should recover you here already and make sure the
dma_fence_wait() is bound in time.
-Daniel
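The usual alternative to a custom ->wait, and roughly what "forcefully signaling all fences" means in practice, is to attach an error and then signal the fence, so every generic dma_fence_wait() caller wakes up and can read the error back via dma_fence_get_status(). A minimal sketch with the standard dma-fence API (the helper name is made up):

    #include <linux/dma-fence.h>

    /* Sketch: mark the fence as failed and complete it.  dma_fence_set_error()
     * must be called before dma_fence_signal(); waiters then see the fence as
     * signaled and dma_fence_get_status() returns -ENODEV.
     */
    static void force_complete_fence(struct dma_fence *fence)
    {
            dma_fence_set_error(fence, -ENODEV);
            dma_fence_signal(fence);
    }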

>
> Best Regards
> Dennis Li
> -Original Message-
> From: Christian König 
> Sent: Tuesday, April 13, 2021 3:10 PM
> To: Grodzovsky, Andrey ; Koenig, Christian 
> ; Li, Dennis ; 
> amd-gfx@lists.freedesktop.org; Deucher, Alexander 
> ; Kuehling, Felix ; Zhang, 
> Hawking ; Daniel Vetter 
> Subject: Re: [PATCH 0/4] Refine GPU recovery sequence to enhance its stability
>
> Am 12.04.21 um 22:01 schrieb Andrey Grodzovsky:
> >
> > On 2021-04-12 3:18 p.m., Christian König wrote:
> >> Am 12.04.21 um 21:12 schrieb Andrey Grodzovsky:
> >>> [SNIP]
> >>>>>
> >>>>> So what's the right approach ? How we guarantee that when running
> >>>>> amdgpu_fence_driver_force_completion we will signal all the HW
> >>>>> fences and not racing against some more fences insertion into that
> >>>>> array ?
> >>>>>
> >>>>
> >>>> Well I would still say the best approach would be to insert this
> >>>> between the front end and the backend and not rely on signaling
> >>>> fences while holding the device srcu.
> >>>
> >>>
> >>> My question is, even now, when we run
> >>> amdgpu_fence_driver_fini_hw->amdgpu_fence_wait_empty or
> >>> amdgpu_fence_driver_fini_hw->amdgpu_fence_driver_force_completion,
> >>> what there prevents a race with another fence being at the same time
> >>> emitted and inserted into the fence array ? Looks like nothing.
> >>>
> >>
> >> Each ring can only be used by one thread at the same time, this
> >> includes emitting fences as well as other stuff.
> >>
> >> During GPU reset we make sure nobody writes to the rings by stopping
> >> the scheduler and taking the GPU reset lock (so that nobody else can
> >> start the scheduler again).
> >
> >
> > What about direct submissions not through scheduler -
> > amdgpu_job_submit_direct, I don't see how this is protected.
>
> Those only happen during startup and GPU reset.
>
> >>
> >>>>
> >>>> BTW: Could it be that the device SRCU protects more than one device
> >>>> and we deadlock because of this?
> >>>
> >>>
> >>> I haven't actually experienced any deadlock until now but, yes,
> >>> drm_unplug_srcu is defined as static in drm_drv.c and so in the
> >>> presence  of multiple devices from same or different drivers we in
> >>> fact are dependent on all their critical sections i guess.
> >>>
> >>
> >> Shit, yeah the devil is a squirrel. So for A+I laptops we actually
> >> need to sync that up with Daniel and the rest of the i915 guys.
> >>
> >> IIRC we could actually have an amdgpu device in a docking station
> >> which needs hotplug and the driver might depend on waiting for the
> >> i915 driver as well.
> >
> >
> > Can't we propose a patch to make drm_unplug_srcu per drm_device ? I
> > don't see why it has to be global and not per device thing.
>
> I'm really wondering the same thing for quite a while now.
>
> Adding Daniel as well, maybe he knows why the drm_unplug_srcu is global.
>
> Regards,
> Christian.
>
> >
> > Andrey
> >
> >
> >>

Re: [PATCH 0/4] Refine GPU recovery sequence to enhance its stability

2021-04-13 Thread Daniel Vetter
On Tue, Apr 13, 2021 at 9:10 AM Christian König
 wrote:
>
> Am 12.04.21 um 22:01 schrieb Andrey Grodzovsky:
> >
> > On 2021-04-12 3:18 p.m., Christian König wrote:
> >> Am 12.04.21 um 21:12 schrieb Andrey Grodzovsky:
> >>> [SNIP]
> >
> > So what's the right approach ? How we guarantee that when running
> > amdgpu_fence_driver_force_completion we will signal all the HW
> > fences and not racing against some more fences insertion into that
> > array ?
> >
> 
>  Well I would still say the best approach would be to insert this
>  between the front end and the backend and not rely on signaling
>  fences while holding the device srcu.
> >>>
> >>>
> >>> My question is, even now, when we run
> >>> amdgpu_fence_driver_fini_hw->amdgpu_fence_wait_empty or
> >>> amdgpu_fence_driver_fini_hw->amdgpu_fence_driver_force_completion,
> >>> what there prevents a race with another fence being at the same time
> >>> emitted and inserted into the fence array ? Looks like nothing.
> >>>
> >>
> >> Each ring can only be used by one thread at the same time, this
> >> includes emitting fences as well as other stuff.
> >>
> >> During GPU reset we make sure nobody writes to the rings by stopping
> >> the scheduler and taking the GPU reset lock (so that nobody else can
> >> start the scheduler again).
> >
> >
> > What about direct submissions not through scheduler -
> > amdgpu_job_submit_direct, I don't see how this is protected.
>
> Those only happen during startup and GPU reset.
>
> >>
> 
>  BTW: Could it be that the device SRCU protects more than one device
>  and we deadlock because of this?
> >>>
> >>>
> >>> I haven't actually experienced any deadlock until now but, yes,
> >>> drm_unplug_srcu is defined as static in drm_drv.c and so in the
> >>> presence  of multiple devices from same or different drivers we in
> >>> fact are dependent on all their critical sections i guess.
> >>>
> >>
> >> Shit, yeah the devil is a squirrel. So for A+I laptops we actually
> >> need to sync that up with Daniel and the rest of the i915 guys.
> >>
> >> IIRC we could actually have an amdgpu device in a docking station
> >> which needs hotplug and the driver might depend on waiting for the
> >> i915 driver as well.
> >
> >
> > Can't we propose a patch to make drm_unplug_srcu per drm_device ? I
> > don't see why it has to be global and not per device thing.
>
> I'm really wondering the same thing for quite a while now.
>
> Adding Daniel as well, maybe he knows why the drm_unplug_srcu is global.

SRCU isn't exactly the cheapest thing, but aside from that we could
make it per-device. I'm not seeing the point much since if you do end
up being stuck on an ioctl this might happen with anything really.

Also note that dma_fence_waits are supposed to be time bound, so you
shouldn't end up waiting on them forever. It should all get sorted out
in due time with TDR I hope (e.g. if i915 is stuck on a fence because
you're unlucky).
-Daniel
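For context, the SRCU being discussed here is the one behind drm_dev_enter()/drm_dev_exit(): read-side sections bracket hardware access, and drm_dev_unplug() marks the device gone and then waits for all such sections to drain. A minimal sketch of the read-side pattern (the register-write helper is made up for illustration):

    #include <drm/drm_drv.h>
    #include <linux/io.h>

    /* Skip the MMIO access entirely once the device has been unplugged. */
    static void example_reg_write(struct drm_device *dev,
                                  void __iomem *reg, u32 value)
    {
            int idx;

            if (!drm_dev_enter(dev, &idx))
                    return;         /* drm_dev_unplug() already ran */

            writel(value, reg);

            drm_dev_exit(idx);
    }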

>
> Regards,
> Christian.
>
> >
> > Andrey
> >
> >
> >>
> >> Christian.
> >>
> >>> Andrey
> >>>
> >>>
> 
>  Christian.
> 
> > Andrey
> >
> >
> >>
> >>> Andrey
> >>>
> >>>
> 
>  Christian.
> 
> > /* Past this point no more fence are submitted to HW ring
> > and hence we can safely call force signal on all that are
> > currently there.
> >  * Any subsequently created  HW fences will be returned
> > signaled with an error code right away
> >  */
> >
> > for_each_ring(adev)
> > amdgpu_fence_process(ring)
> >
> > drm_dev_unplug(dev);
> > Stop schedulers
> > cancel_sync(all timers and queued works);
> > hw_fini
> > unmap_mmio
> >
> > }
> >
> >
> > Andrey
> >
> >
> >>
> >>
> >>>
> >>
> >> Alternatively grabbing the reset write side and stopping
> >> and then restarting the scheduler could work as well.
> >>
> >> Christian.
> >
> >
> > I didn't get the above and I don't see why I need to reuse
> > the GPU reset rw_lock. I rely on the SRCU unplug flag for
> > unplug. Also, not clear to me why are we focusing on the
> > scheduler threads, any code patch to generate HW fences
> > should be covered, so any code leading to
> > amdgpu_fence_emit needs to be taken into account such as,
> > direct IB submissions, VM flushes e.t.c
> 
>  You need to work together with the reset lock anyway, cause
>  a hotplug could run at the same time as a reset.
> >>>
> >>>
> >>> For going my way indeed now I see now that I have to take
> >>> reset write side lock during HW 

Re: [PATCH 0/4] Refine GPU recovery sequence to enhance its stability

2021-04-13 Thread Andrey Grodzovsky


On 2021-04-13 2:25 p.m., Christian König wrote:



Am 13.04.21 um 20:18 schrieb Andrey Grodzovsky:


On 2021-04-13 2:03 p.m., Christian König wrote:

Am 13.04.21 um 17:12 schrieb Andrey Grodzovsky:


On 2021-04-13 3:10 a.m., Christian König wrote:

Am 12.04.21 um 22:01 schrieb Andrey Grodzovsky:


On 2021-04-12 3:18 p.m., Christian König wrote:

Am 12.04.21 um 21:12 schrieb Andrey Grodzovsky:

[SNIP]


So what's the right approach ? How do we guarantee that when 
running amdgpu_fence_driver_force_completion we signal 
all the HW fences and don't race against more fences 
being inserted into that array ?




Well I would still say the best approach would be to insert 
this between the front end and the backend and not rely on 
signaling fences while holding the device srcu.



My question is, even now, when we run 
amdgpu_fence_driver_fini_hw->amdgpu_fence_wait_empty or 
amdgpu_fence_driver_fini_hw->amdgpu_fence_driver_force_completion, 
what prevents a race with another fence being emitted and 
inserted into the fence array at the same time ? Looks like 
nothing.




Each ring can only be used by one thread at the same time, this 
includes emitting fences as well as other stuff.


During GPU reset we make sure nobody writes to the rings by 
stopping the scheduler and taking the GPU reset lock (so that 
nobody else can start the scheduler again).



What about direct submissions not through scheduler - 
amdgpu_job_submit_direct, I don't see how this is protected.


Those only happen during startup and GPU reset.



Ok, but then looks like I am missing something, see the following 
steps in amdgpu_pci_remove -


1) Use disable_irq API function to stop and flush all in flight HW 
interrupts handlers


2) Grab the reset lock and stop all the schedulers

After above 2 steps the HW fences array is idle, no more insertions 
and no more extractions from the array


3) Run one time amdgpu_fence_process to signal all current HW fences

4) Set drm_dev_unplug (will 'flush' all in flight IOCTLs), release 
the GPU reset lock and go on with the rest of the sequence (cancel 
timers, work items e.t.c)


What's problematic in this sequence ?


drm_dev_unplug() will wait for the IOCTLs to finish.

The IOCTLs in turn can wait for fences. That can be both hardware 
fences, scheduler fences, as well as fences from other devices (and 
KIQ fences for register writes under SRIOV, but we can hopefully 
ignore them for now).


We have handled the hardware fences, but we have no idea when the 
scheduler fences or the fences from other devices will signal.


Scheduler fences won't signal until the scheduler threads are 
restarted or we somehow cancel the submissions. Doable, but tricky 
as well.



For scheduler fences I am not worried; the sched_fence->finished 
fences are by definition attached to HW fences which already 
signaled. For sched_fence->scheduled we should run 
drm_sched_entity_kill_jobs for each entity after stopping the 
scheduler threads and before setting drm_dev_unplug.


Well exactly that is what is tricky here. drm_sched_entity_kill_jobs() 
assumes that there are no more jobs pushed into the entity.


We are racing here once more and need to handle that.



But why? I wrote above that we first stop all the schedulers and only 
then call drm_sched_entity_kill_jobs.







For waiting for other device I have no idea if that couldn't 
deadlock somehow.



Yea, not sure for imported fences and dma_bufs, I would assume the 
other devices should not be impacted by our device removal but, who 
knows...


So I guess we are NOT going with finalizing HW fences before 
drm_dev_unplug and instead will just call drm_dev_enter/exit at the 
back-ends all over the place where there are MMIO accesses ?


Good question. As you said that is really the hard path.

Handling it all at once at IOCTL level certainly has some appeal as 
well, but I have no idea if we can guarantee that this is lock free.



Maybe just empirically - let's try it and see under different test 
scenarios what actually happens  ?


Andrey




Christian.



Andrey



Regards,
Christian.



Andrey









BTW: Could it be that the device SRCU protects more than one 
device and we deadlock because of this?



I haven't actually experienced any deadlock until now but, yes, 
drm_unplug_srcu is defined as static in drm_drv.c and so in the 
presence  of multiple devices from same or different drivers we 
in fact are dependent on all their critical sections i guess.




Shit, yeah the devil is a squirrel. So for A+I laptops we 
actually need to sync that up with Daniel and the rest of the 
i915 guys.


IIRC we could actually have an amdgpu device in a docking 
station which needs hotplug and the driver might depend on 
waiting for the i915 driver as well.



Can't we propose a patch to make drm_unplug_srcu per drm_device ? 
I don't see why it has to be global and not per device thing.


I'm really wondering the same thing for quite a while now.

Adding 

Re: [PATCH 0/4] Refine GPU recovery sequence to enhance its stability

2021-04-13 Thread Christian König



Am 13.04.21 um 20:18 schrieb Andrey Grodzovsky:


On 2021-04-13 2:03 p.m., Christian König wrote:

Am 13.04.21 um 17:12 schrieb Andrey Grodzovsky:


On 2021-04-13 3:10 a.m., Christian König wrote:

Am 12.04.21 um 22:01 schrieb Andrey Grodzovsky:


On 2021-04-12 3:18 p.m., Christian König wrote:

Am 12.04.21 um 21:12 schrieb Andrey Grodzovsky:

[SNIP]


So what's the right approach ? How do we guarantee that when 
running amdgpu_fence_driver_force_completion we signal 
all the HW fences and don't race against more fences 
being inserted into that array ?




Well I would still say the best approach would be to insert 
this between the front end and the backend and not rely on 
signaling fences while holding the device srcu.



My question is, even now, when we run 
amdgpu_fence_driver_fini_hw->amdgpu_fence_wait_empty or 
amdgpu_fence_driver_fini_hw->amdgpu_fence_driver_force_completion, 
what prevents a race with another fence being emitted and 
inserted into the fence array at the same time ? Looks like 
nothing.




Each ring can only be used by one thread at the same time, this 
includes emitting fences as well as other stuff.


During GPU reset we make sure nobody writes to the rings by 
stopping the scheduler and taking the GPU reset lock (so that 
nobody else can start the scheduler again).



What about direct submissions not through scheduler - 
amdgpu_job_submit_direct, I don't see how this is protected.


Those only happen during startup and GPU reset.



Ok, but then looks like I am missing something, see the following 
steps in amdgpu_pci_remove -


1) Use disable_irq API function to stop and flush all in flight HW 
interrupts handlers


2) Grab the reset lock and stop all the schedulers

After above 2 steps the HW fences array is idle, no more insertions 
and no more extractions from the array


3) Run one time amdgpu_fence_process to signal all current HW fences

4) Set drm_dev_unplug (will 'flush' all in flight IOCTLs), release 
the GPU reset lock and go on with the rest of the sequence (cancel 
timers, work items e.t.c)


What's problematic in this sequence ?


drm_dev_unplug() will wait for the IOCTLs to finish.

The IOCTLs in turn can wait for fences. That can be both hardware 
fences, scheduler fences, as well as fences from other devices (and 
KIQ fences for register writes under SRIOV, but we can hopefully 
ignore them for now).


We have handled the hardware fences, but we have no idea when the 
scheduler fences or the fences from other devices will signal.


Scheduler fences won't signal until the scheduler threads are 
restarted or we somehow cancel the submissions. Doable, but tricky as 
well.



For scheduler fences I am not worried; the sched_fence->finished 
fences are by definition attached to HW fences which already 
signaled. For sched_fence->scheduled we should run 
drm_sched_entity_kill_jobs for each entity after stopping the 
scheduler threads and before setting drm_dev_unplug.


Well exactly that is what is tricky here. drm_sched_entity_kill_jobs() 
assumes that there are no more jobs pushed into the entity.


We are racing here once more and need to handle that.



For waiting for other device I have no idea if that couldn't deadlock 
somehow.



Yea, not sure for imported fences and dma_bufs, I would assume the 
other devices should not be impacted by our device removal but, who 
knows...


So I guess we are NOT going with finalizing HW fences before 
drm_dev_unplug and instead will just call drm_dev_enter/exit at the 
back-ends all over the place where there are MMIO accesses ?


Good question. As you said that is really the hard path.

Handling it all at once at IOCTL level certainly has some appeal as 
well, but I have no idea if we can guarantee that this is lock free.


Christian.



Andrey



Regards,
Christian.



Andrey









BTW: Could it be that the device SRCU protects more than one 
device and we deadlock because of this?



I haven't actually experienced any deadlock until now but, yes, 
drm_unplug_srcu is defined as static in drm_drv.c and so in the 
presence  of multiple devices from same or different drivers we 
in fact are dependent on all their critical sections i guess.




Shit, yeah the devil is a squirrel. So for A+I laptops we 
actually need to sync that up with Daniel and the rest of the 
i915 guys.


IIRC we could actually have an amdgpu device in a docking station 
which needs hotplug and the driver might depend on waiting for 
the i915 driver as well.



Can't we propose a patch to make drm_unplug_srcu per drm_device ? 
I don't see why it has to be global and not per device thing.


I'm really wondering the same thing for quite a while now.

Adding Daniel as well, maybe he knows why the drm_unplug_srcu is 
global.


Regards,
Christian.



Andrey




Christian.


Andrey




Christian.


Andrey





Andrey




Christian.

    /* Past this point no more fence are submitted to HW 
ring and hence we can safely call force signal 

Re: [PATCH 0/4] Refine GPU recovery sequence to enhance its stability

2021-04-13 Thread Andrey Grodzovsky


On 2021-04-13 2:03 p.m., Christian König wrote:

Am 13.04.21 um 17:12 schrieb Andrey Grodzovsky:


On 2021-04-13 3:10 a.m., Christian König wrote:

Am 12.04.21 um 22:01 schrieb Andrey Grodzovsky:


On 2021-04-12 3:18 p.m., Christian König wrote:

Am 12.04.21 um 21:12 schrieb Andrey Grodzovsky:

[SNIP]


So what's the right approach ? How do we guarantee that when 
running amdgpu_fence_driver_force_completion we signal all 
the HW fences and don't race against more fences being 
inserted into that array ?




Well I would still say the best approach would be to insert this 
between the front end and the backend and not rely on signaling 
fences while holding the device srcu.



My question is, even now, when we run 
amdgpu_fence_driver_fini_hw->amdgpu_fence_wait_empty or 
amdgpu_fence_driver_fini_hw->amdgpu_fence_driver_force_completion, 
what prevents a race with another fence being emitted and 
inserted into the fence array at the same time ? Looks like nothing.




Each ring can only be used by one thread at the same time, this 
includes emitting fences as well as other stuff.


During GPU reset we make sure nobody writes to the rings by 
stopping the scheduler and taking the GPU reset lock (so that 
nobody else can start the scheduler again).



What about direct submissions not through scheduler - 
amdgpu_job_submit_direct, I don't see how this is protected.


Those only happen during startup and GPU reset.



Ok, but then looks like I am missing something, see the following 
steps in amdgpu_pci_remove -


1) Use disable_irq API function to stop and flush all in flight HW 
interrupts handlers


2) Grab the reset lock and stop all the schedulers

After above 2 steps the HW fences array is idle, no more insertions 
and no more extractions from the array


3) Run one time amdgpu_fence_process to signal all current HW fences

4) Set drm_dev_unplug (will 'flush' all in flight IOCTLs), release 
the GPU reset lock and go on with the rest of the sequence (cancel 
timers, work items e.t.c)


What's problematic in this sequence ?


drm_dev_unplug() will wait for the IOCTLs to finish.

The IOCTLs in turn can wait for fences. That can be both hardware 
fences, scheduler fences, as well as fences from other devices (and 
KIQ fences for register writes under SRIOV, but we can hopefully 
ignore them for now).


We have handled the hardware fences, but we have no idea when the 
scheduler fences or the fences from other devices will signal.


Scheduler fences won't signal until the scheduler threads are 
restarted or we somehow cancel the submissions. Doable, but tricky as 
well.



For scheduler fences I am not worried; the sched_fence->finished 
fences are by definition attached to HW fences which already 
signaled. For sched_fence->scheduled we should run 
drm_sched_entity_kill_jobs for each entity after stopping the scheduler 
threads and before setting drm_dev_unplug.





For waiting for other device I have no idea if that couldn't deadlock 
somehow.



Yeah, not sure about imported fences and dma-bufs; I would assume the other 
devices should not be impacted by our device removal but, who knows...


So I guess we are NOT going with finalizing HW fences before 
drm_dev_unplug and instead will just call drm_dev_enter/exit at the 
back-ends all over the place where there are MMIO accesses ?
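As a reminder of what that scoping looks like, a minimal sketch; 
my_mmio_write() is a made-up helper, only the drm_dev_enter()/drm_dev_exit() 
pattern and the usual amdgpu register layout are assumed here:

#include <drm/drm_drv.h>

static void my_mmio_write(struct amdgpu_device *adev, u32 reg, u32 val)
{
    int idx;

    /* bail out if the device is already unplugged */
    if (!drm_dev_enter(adev_to_drm(adev), &idx))
        return;

    writel(val, ((void __iomem *)adev->rmmio) + (reg * 4));

    drm_dev_exit(idx);
}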


Andrey



Regards,
Christian.



Andrey









BTW: Could it be that the device SRCU protects more than one 
device and we deadlock because of this?



I haven't actually experienced any deadlock until now but, yes, 
drm_unplug_srcu is defined as static in drm_drv.c and so in the 
presence  of multiple devices from same or different drivers we 
in fact are dependent on all their critical sections i guess.




Shit, yeah the devil is a squirrel. So for A+I laptops we actually 
need to sync that up with Daniel and the rest of the i915 guys.


IIRC we could actually have an amdgpu device in a docking station 
which needs hotplug and the driver might depend on waiting for the 
i915 driver as well.



Can't we propose a patch to make drm_unplug_srcu per drm_device ? I 
don't see why it has to be a global rather than a per-device thing.


I'm really wondering the same thing for quite a while now.

Adding Daniel as well, maybe he knows why the drm_unplug_srcu is 
global.
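Just to illustrate the idea (purely hypothetical, not an actual patch 
against drm_drv.c): struct drm_device would grow its own unplug_srcu that 
drm_dev_init() initializes, and the helpers would use it, with 
drm_dev_exit() gaining a device parameter:

bool drm_dev_enter(struct drm_device *dev, int *idx)
{
    *idx = srcu_read_lock(&dev->unplug_srcu);

    if (dev->unplugged) {
        srcu_read_unlock(&dev->unplug_srcu, *idx);
        return false;
    }

    return true;
}

void drm_dev_exit(struct drm_device *dev, int idx)
{
    srcu_read_unlock(&dev->unplug_srcu, idx);
}

void drm_dev_unplug(struct drm_device *dev)
{
    dev->unplugged = true;
    /* only waits for *this* device's critical sections now */
    synchronize_srcu(&dev->unplug_srcu);
    drm_dev_put(dev);
}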


Regards,
Christian.



Andrey




Christian.


Andrey




Christian.


Andrey





Andrey




Christian.

    /* Past this point no more fences are submitted to the HW 
     * ring and hence we can safely force-signal all that 
     * are currently there.
     * Any subsequently created HW fences will be returned 
     * signaled with an error code right away
     */

    for_each_ring(adev)
        amdgpu_fence_process(ring)

    drm_dev_unplug(dev);
    Stop schedulers
    cancel_sync(all timers and queued works);
    hw_fini
    unmap_mmio

}


Andrey









Alternatively grabbing the reset write side and 
stopping and then restarting the scheduler could work 
as well.

Re: [PATCH 0/4] Refine GPU recovery sequence to enhance its stability

2021-04-13 Thread Christian König

Am 13.04.21 um 17:12 schrieb Andrey Grodzovsky:


On 2021-04-13 3:10 a.m., Christian König wrote:

Am 12.04.21 um 22:01 schrieb Andrey Grodzovsky:


On 2021-04-12 3:18 p.m., Christian König wrote:

Am 12.04.21 um 21:12 schrieb Andrey Grodzovsky:

[SNIP]


So what's the right approach ? How we guarantee that when 
running amdgpu_fence_driver_force_completion we will signal all 
the HW fences and not racing against some more fences insertion 
into that array ?




Well I would still say the best approach would be to insert this 
between the front end and the backend and not rely on signaling 
fences while holding the device srcu.



My question is, even now, when we run 
amdgpu_fence_driver_fini_hw->amdgpu_fence_wait_empty or 
amdgpu_fence_driver_fini_hw->amdgpu_fence_driver_force_completion, 
what there prevents a race with another fence being at the same 
time emitted and inserted into the fence array ? Looks like nothing.




Each ring can only be used by one thread at the same time, this 
includes emitting fences as well as other stuff.


During GPU reset we make sure nobody writes to the rings by 
stopping the scheduler and taking the GPU reset lock (so that 
nobody else can start the scheduler again).



What about direct submissions not through scheduler - 
amdgpu_job_submit_direct, I don't see how this is protected.


Those only happen during startup and GPU reset.



Ok, but then looks like I am missing something, see the following 
steps in amdgpu_pci_remove -


1) Use disable_irq API function to stop and flush all in flight HW 
interrupts handlers


2) Grab the reset lock and stop all the schedulers

After above 2 steps the HW fences array is idle, no more insertions 
and no more extractions from the array


3) Run one time amdgpu_fence_process to signal all current HW fences

4) Set drm_dev_unplug (will 'flush' all in flight IOCTLs), release the 
GPU reset lock and go on with the rest of the sequence (cancel timers, 
work items e.t.c)


What's problematic in this sequence ?


drm_dev_unplug() will wait for the IOCTLs to finish.

The IOCTLs in turn can wait for fences. That can be both hardware 
fences, scheduler fences, as well as fences from other devices (and KIQ 
fences for register writes under SRIOV, but we can hopefully ignore them 
for now).


We have handled the hardware fences, but we have no idea when the 
scheduler fences or the fences from other devices will signal.


Scheduler fences won't signal until the scheduler threads are restarted 
or we somehow cancel the submissions. Doable, but tricky as well.


For waiting for other device I have no idea if that couldn't deadlock 
somehow.


Regards,
Christian.



Andrey









BTW: Could it be that the device SRCU protects more than one 
device and we deadlock because of this?



I haven't actually experienced any deadlock until now but, yes, 
drm_unplug_srcu is defined as static in drm_drv.c and so in the 
presence  of multiple devices from same or different drivers we in 
fact are dependent on all their critical sections i guess.




Shit, yeah the devil is a squirrel. So for A+I laptops we actually 
need to sync that up with Daniel and the rest of the i915 guys.


IIRC we could actually have an amdgpu device in a docking station 
which needs hotplug and the driver might depend on waiting for the 
i915 driver as well.



Can't we propose a patch to make drm_unplug_srcu per drm_device ? I 
don't see why it has to be global and not per device thing.


I'm really wondering the same thing for quite a while now.

Adding Daniel as well, maybe he knows why the drm_unplug_srcu is global.

Regards,
Christian.



Andrey




Christian.


Andrey




Christian.


Andrey





Andrey




Christian.

    /* Past this point no more fence are submitted to HW 
ring and hence we can safely call force signal on all that 
are currently there.
 * Any subsequently created  HW fences will be returned 
signaled with an error code right away

 */

    for_each_ring(adev)
        amdgpu_fence_process(ring)

    drm_dev_unplug(dev);
    Stop schedulers
    cancel_sync(all timers and queued works);
    hw_fini
    unmap_mmio

}


Andrey









Alternatively grabbing the reset write side and 
stopping and then restarting the scheduler could work 
as well.


Christian.



I didn't get the above and I don't see why I need to 
reuse the GPU reset rw_lock. I rely on the SRCU unplug 
flag for unplug. Also, it's not clear to me why we are 
focusing on the scheduler threads; any code path that can 
generate HW fences should be covered, so any code 
leading to amdgpu_fence_emit needs to be taken into 
account, such as direct IB submissions, VM flushes, etc.


You need to work together with the reset lock anyway, 
because a hotplug could run at the same time as a reset.



For going my way, indeed I now see that I have to take the 
reset write side lock during HW fence signalling in order 
to protect against scheduler/HW fence detachment and 
reattachment during scheduler stop/restart.

Re: [PATCH 0/4] Refine GPU recovery sequence to enhance its stability

2021-04-13 Thread Andrey Grodzovsky


On 2021-04-13 3:10 a.m., Christian König wrote:

Am 12.04.21 um 22:01 schrieb Andrey Grodzovsky:


On 2021-04-12 3:18 p.m., Christian König wrote:

Am 12.04.21 um 21:12 schrieb Andrey Grodzovsky:

[SNIP]


So what's the right approach ? How we guarantee that when running 
amdgpu_fence_driver_force_completion we will signal all the HW 
fences and not racing against some more fences insertion into 
that array ?




Well I would still say the best approach would be to insert this 
between the front end and the backend and not rely on signaling 
fences while holding the device srcu.



My question is, even now, when we run 
amdgpu_fence_driver_fini_hw->amdgpu_fence_wait_empty or 
amdgpu_fence_driver_fini_hw->amdgpu_fence_driver_force_completion, 
what there prevents a race with another fence being at the same 
time emitted and inserted into the fence array ? Looks like nothing.




Each ring can only be used by one thread at the same time, this 
includes emitting fences as well as other stuff.


During GPU reset we make sure nobody writes to the rings by stopping 
the scheduler and taking the GPU reset lock (so that nobody else can 
start the scheduler again).



What about direct submissions not through scheduler - 
amdgpu_job_submit_direct, I don't see how this is protected.


Those only happen during startup and GPU reset.



Ok, but then looks like I am missing something, see the following steps 
in amdgpu_pci_remove -


1) Use disable_irq API function to stop and flush all in flight HW 
interrupts handlers


2) Grab the reset lock and stop all the schedulers

After above 2 steps the HW fences array is idle, no more insertions and 
no more extractions from the array


3) Run one time amdgpu_fence_process to signal all current HW fences

4) Set drm_dev_unplug (will 'flush' all in flight IOCTLs), release the 
GPU reset lock and go on with the rest of the sequence (cancel timers, 
work items e.t.c)


What's problematic in this sequence ?

Andrey









BTW: Could it be that the device SRCU protects more than one 
device and we deadlock because of this?



I haven't actually experienced any deadlock until now but, yes, 
drm_unplug_srcu is defined as static in drm_drv.c and so in the 
presence  of multiple devices from same or different drivers we in 
fact are dependent on all their critical sections i guess.




Shit, yeah the devil is a squirrel. So for A+I laptops we actually 
need to sync that up with Daniel and the rest of the i915 guys.


IIRC we could actually have an amdgpu device in a docking station 
which needs hotplug and the driver might depend on waiting for the 
i915 driver as well.



Can't we propose a patch to make drm_unplug_srcu per drm_device ? I 
don't see why it has to be global and not per device thing.


I'm really wondering the same thing for quite a while now.

Adding Daniel as well, maybe he knows why the drm_unplug_srcu is global.

Regards,
Christian.



Andrey




Christian.


Andrey




Christian.


Andrey





Andrey




Christian.

    /* Past this point no more fence are submitted to HW ring 
and hence we can safely call force signal on all that are 
currently there.
 * Any subsequently created  HW fences will be returned 
signaled with an error code right away

 */

    for_each_ring(adev)
        amdgpu_fence_process(ring)

    drm_dev_unplug(dev);
    Stop schedulers
    cancel_sync(all timers and queued works);
    hw_fini
    unmap_mmio

}


Andrey









Alternatively grabbing the reset write side and stopping 
and then restarting the scheduler could work as well.


Christian.



I didn't get the above and I don't see why I need to 
reuse the GPU reset rw_lock. I rely on the SRCU unplug 
flag for unplug. Also, not clear to me why are we 
focusing on the scheduler threads, any code patch to 
generate HW fences should be covered, so any code leading 
to amdgpu_fence_emit needs to be taken into account such 
as, direct IB submissions, VM flushes e.t.c


You need to work together with the reset lock anyway, 
cause a hotplug could run at the same time as a reset.



For going my way indeed now I see now that I have to take 
reset write side lock during HW fences signalling in order 
to protect against scheduler/HW fences detachment and 
reattachment during schedulers stop/restart. But if we go 
with your approach  then calling drm_dev_unplug and scoping 
amdgpu_job_timeout with drm_dev_enter/exit should be enough 
to prevent any concurrent GPU resets during unplug. In fact 
I already do it anyway - 

Re: [PATCH 0/4] Refine GPU recovery sequence to enhance its stability

2021-04-13 Thread Christian König

Hi Dennis,

yeah, that just has the same downside of a lot of additional overhead 
as the is_signaled callback.


Bouncing cache lines on the CPU isn't funny at all.

Christian.

Am 13.04.21 um 11:13 schrieb Li, Dennis:

[AMD Official Use Only - Internal Distribution Only]

Hi, Christian and Andrey,
   Maybe we could try to implement the "wait" callback of dma_fence_ops: 
when a GPU reset or unplug happens, make this callback return -ENODEV to 
notify the caller that the device is lost.

 * Must return -ERESTARTSYS if the wait is intr = true and the wait was
 * interrupted, and remaining jiffies if fence has signaled, or 0 if wait
 * timed out. Can also return other error values on custom implementations,
 * which should be treated as if the fence is signaled. For example a
 * hardware lockup could be reported like that.
 *
 * This callback is optional.
 */
signed long (*wait)(struct dma_fence *fence,
                    bool intr, signed long timeout);
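A minimal sketch of what is being suggested, assuming the driver can get 
from the fence to some per-device 'unplugged' flag (my_fence_to_dev() and 
struct my_device are placeholders, not real amdgpu names):

#include <linux/dma-fence.h>

static signed long my_fence_wait(struct dma_fence *fence,
                                 bool intr, signed long timeout)
{
    struct my_device *mydev = my_fence_to_dev(fence);   /* placeholder */

    /* device gone: per the documentation above, a custom error return
     * is treated as if the fence had signaled, so the caller unblocks */
    if (READ_ONCE(mydev->unplugged))
        return -ENODEV;

    return dma_fence_default_wait(fence, intr, timeout);
}

static const struct dma_fence_ops my_fence_ops = {
    /* .get_driver_name / .get_timeline_name etc. omitted */
    .wait = my_fence_wait,
};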

Best Regards
Dennis Li
-Original Message-
From: Christian König 
Sent: Tuesday, April 13, 2021 3:10 PM
To: Grodzovsky, Andrey ; Koenig, Christian ; Li, Dennis 
; amd-gfx@lists.freedesktop.org; Deucher, Alexander ; Kuehling, 
Felix ; Zhang, Hawking ; Daniel Vetter 
Subject: Re: [PATCH 0/4] Refine GPU recovery sequence to enhance its stability

Am 12.04.21 um 22:01 schrieb Andrey Grodzovsky:

On 2021-04-12 3:18 p.m., Christian König wrote:

Am 12.04.21 um 21:12 schrieb Andrey Grodzovsky:

[SNIP]

So what's the right approach ? How we guarantee that when running
amdgpu_fence_driver_force_completion we will signal all the HW
fences and not racing against some more fences insertion into that
array ?


Well I would still say the best approach would be to insert this
between the front end and the backend and not rely on signaling
fences while holding the device srcu.


My question is, even now, when we run
amdgpu_fence_driver_fini_hw->amdgpu_fence_wait_empty or
amdgpu_fence_driver_fini_hw->amdgpu_fence_driver_force_completion,
what there prevents a race with another fence being at the same time
emitted and inserted into the fence array ? Looks like nothing.


Each ring can only be used by one thread at the same time, this
includes emitting fences as well as other stuff.

During GPU reset we make sure nobody writes to the rings by stopping
the scheduler and taking the GPU reset lock (so that nobody else can
start the scheduler again).


What about direct submissions not through scheduler -
amdgpu_job_submit_direct, I don't see how this is protected.

Those only happen during startup and GPU reset.


BTW: Could it be that the device SRCU protects more than one device
and we deadlock because of this?


I haven't actually experienced any deadlock until now but, yes,
drm_unplug_srcu is defined as static in drm_drv.c and so in the
presence  of multiple devices from same or different drivers we in
fact are dependent on all their critical sections i guess.


Shit, yeah the devil is a squirrel. So for A+I laptops we actually
need to sync that up with Daniel and the rest of the i915 guys.

IIRC we could actually have an amdgpu device in a docking station
which needs hotplug and the driver might depend on waiting for the
i915 driver as well.


Can't we propose a patch to make drm_unplug_srcu per drm_device ? I
don't see why it has to be global and not per device thing.

I'm really wondering the same thing for quite a while now.

Adding Daniel as well, maybe he knows why the drm_unplug_srcu is global.

Regards,
Christian.


Andrey



Christian.


Andrey



Christian.


Andrey



Andrey



Christian.


     /* Past this point no more fence are submitted to HW ring
and hence we can safely call force signal on all that are
currently there.
  * Any subsequently created  HW fences will be returned
signaled with an error code right away
  */

     for_each_ring(adev)
         amdgpu_fence_process(ring)

     drm_dev_unplug(dev);
     Stop schedulers
     cancel_sync(all timers and queued works);
     hw_fini
     unmap_mmio

}


Andrey





Alternatively grabbing the reset write side and stopping
and then restarting the scheduler could work as well.

Christian.


I didn't get the above and I don't see why I need to reuse
the GPU reset rw_lock. I rely on the SRCU unplug flag for
unplug. Also, not clear to me why are we focusing on the
scheduler threads, any code patch to generate HW fences
should be covered, so any code leading to
amdgpu_fence_emit needs to be taken into account such as,
direct IB submissions, VM flushes e.t.c

You need to work together with the reset lock anyway, cause
a hotplug could run at the same time as a reset.


For going my way indeed now I see now that I have to take
reset write side lock during HW fences signalling in order
to protect against scheduler/HW fences detachment and
reattachment during schedulers stop/restart. But if we go
wi

RE: [PATCH 0/4] Refine GPU recovery sequence to enhance its stability

2021-04-13 Thread Li, Dennis
[AMD Official Use Only - Internal Distribution Only]

Hi, Christian and Andrey,
  Maybe we could try to implement the "wait" callback of dma_fence_ops: when a 
GPU reset or unplug happens, make this callback return -ENODEV to notify the 
caller that the device is lost.

 * Must return -ERESTARTSYS if the wait is intr = true and the wait was
 * interrupted, and remaining jiffies if fence has signaled, or 0 if wait
 * timed out. Can also return other error values on custom implementations,
 * which should be treated as if the fence is signaled. For example a
 * hardware lockup could be reported like that.
 *
 * This callback is optional.
 */
signed long (*wait)(struct dma_fence *fence,
                    bool intr, signed long timeout);

Best Regards
Dennis Li
-Original Message-
From: Christian König  
Sent: Tuesday, April 13, 2021 3:10 PM
To: Grodzovsky, Andrey ; Koenig, Christian 
; Li, Dennis ; 
amd-gfx@lists.freedesktop.org; Deucher, Alexander ; 
Kuehling, Felix ; Zhang, Hawking 
; Daniel Vetter 
Subject: Re: [PATCH 0/4] Refine GPU recovery sequence to enhance its stability

Am 12.04.21 um 22:01 schrieb Andrey Grodzovsky:
>
> On 2021-04-12 3:18 p.m., Christian König wrote:
>> Am 12.04.21 um 21:12 schrieb Andrey Grodzovsky:
>>> [SNIP]
>>>>>
>>>>> So what's the right approach ? How we guarantee that when running 
>>>>> amdgpu_fence_driver_force_completion we will signal all the HW 
>>>>> fences and not racing against some more fences insertion into that 
>>>>> array ?
>>>>>
>>>>
>>>> Well I would still say the best approach would be to insert this 
>>>> between the front end and the backend and not rely on signaling 
>>>> fences while holding the device srcu.
>>>
>>>
>>> My question is, even now, when we run 
>>> amdgpu_fence_driver_fini_hw->amdgpu_fence_wait_empty or 
>>> amdgpu_fence_driver_fini_hw->amdgpu_fence_driver_force_completion,
>>> what there prevents a race with another fence being at the same time 
>>> emitted and inserted into the fence array ? Looks like nothing.
>>>
>>
>> Each ring can only be used by one thread at the same time, this 
>> includes emitting fences as well as other stuff.
>>
>> During GPU reset we make sure nobody writes to the rings by stopping 
>> the scheduler and taking the GPU reset lock (so that nobody else can 
>> start the scheduler again).
>
>
> What about direct submissions not through scheduler - 
> amdgpu_job_submit_direct, I don't see how this is protected.

Those only happen during startup and GPU reset.

>>
>>>>
>>>> BTW: Could it be that the device SRCU protects more than one device 
>>>> and we deadlock because of this?
>>>
>>>
>>> I haven't actually experienced any deadlock until now but, yes, 
>>> drm_unplug_srcu is defined as static in drm_drv.c and so in the 
>>> presence  of multiple devices from same or different drivers we in 
>>> fact are dependent on all their critical sections i guess.
>>>
>>
>> Shit, yeah the devil is a squirrel. So for A+I laptops we actually 
>> need to sync that up with Daniel and the rest of the i915 guys.
>>
>> IIRC we could actually have an amdgpu device in a docking station 
>> which needs hotplug and the driver might depend on waiting for the
>> i915 driver as well.
>
>
> Can't we propose a patch to make drm_unplug_srcu per drm_device ? I 
> don't see why it has to be global and not per device thing.

I'm really wondering the same thing for quite a while now.

Adding Daniel as well, maybe he knows why the drm_unplug_srcu is global.

Regards,
Christian.

>
> Andrey
>
>
>>
>> Christian.
>>
>>> Andrey
>>>
>>>
>>>>
>>>> Christian.
>>>>
>>>>> Andrey
>>>>>
>>>>>
>>>>>>
>>>>>>> Andrey
>>>>>>>
>>>>>>>
>>>>>>>>
>>>>>>>> Christian.
>>>>>>>>
>>>>>>>>>     /* Past this point no more fence are submitted to HW ring 
>>>>>>>>> and hence we can safely call force signal on all that are 
>>>>>>>>> currently there.
>>>>>>>>>  * Any subsequently created  HW fences will be returned 
>>>>>>>>> signaled with an error code right away
>>>>>>>>>  */

Re: [PATCH 0/4] Refine GPU recovery sequence to enhance its stability

2021-04-13 Thread Christian König

Am 12.04.21 um 22:01 schrieb Andrey Grodzovsky:


On 2021-04-12 3:18 p.m., Christian König wrote:

Am 12.04.21 um 21:12 schrieb Andrey Grodzovsky:

[SNIP]


So what's the right approach ? How we guarantee that when running 
amdgpu_fence_driver_force_completion we will signal all the HW 
fences and not racing against some more fences insertion into that 
array ?




Well I would still say the best approach would be to insert this 
between the front end and the backend and not rely on signaling 
fences while holding the device srcu.



My question is, even now, when we run 
amdgpu_fence_driver_fini_hw->amdgpu_fence_wait_empty or 
amdgpu_fence_driver_fini_hw->amdgpu_fence_driver_force_completion, 
what there prevents a race with another fence being at the same time 
emitted and inserted into the fence array ? Looks like nothing.




Each ring can only be used by one thread at the same time, this 
includes emitting fences as well as other stuff.


During GPU reset we make sure nobody writes to the rings by stopping 
the scheduler and taking the GPU reset lock (so that nobody else can 
start the scheduler again).



What about direct submissions not through scheduler - 
amdgpu_job_submit_direct, I don't see how this is protected.


Those only happen during startup and GPU reset.





BTW: Could it be that the device SRCU protects more than one device 
and we deadlock because of this?



I haven't actually experienced any deadlock until now but, yes, 
drm_unplug_srcu is defined as static in drm_drv.c and so in the 
presence  of multiple devices from same or different drivers we in 
fact are dependent on all their critical sections i guess.




Shit, yeah the devil is a squirrel. So for A+I laptops we actually 
need to sync that up with Daniel and the rest of the i915 guys.


IIRC we could actually have an amdgpu device in a docking station 
which needs hotplug and the driver might depend on waiting for the 
i915 driver as well.



Can't we propose a patch to make drm_unplug_srcu per drm_device ? I 
don't see why it has to be global and not per device thing.


I'm really wondering the same thing for quite a while now.

Adding Daniel as well, maybe he knows why the drm_unplug_srcu is global.

Regards,
Christian.



Andrey




Christian.


Andrey




Christian.


Andrey





Andrey




Christian.

    /* Past this point no more fence are submitted to HW ring 
and hence we can safely call force signal on all that are 
currently there.
 * Any subsequently created  HW fences will be returned 
signaled with an error code right away

 */

    for_each_ring(adev)
        amdgpu_fence_process(ring)

    drm_dev_unplug(dev);
    Stop schedulers
    cancel_sync(all timers and queued works);
    hw_fini
    unmap_mmio

}


Andrey









Alternatively grabbing the reset write side and stopping 
and then restarting the scheduler could work as well.


Christian.



I didn't get the above and I don't see why I need to reuse 
the GPU reset rw_lock. I rely on the SRCU unplug flag for 
unplug. Also, it's not clear to me why we are focusing on the 
scheduler threads; any code path that can generate HW fences 
should be covered, so any code leading to 
amdgpu_fence_emit needs to be taken into account, such as 
direct IB submissions, VM flushes, etc.


You need to work together with the reset lock anyway, 
because a hotplug could run at the same time as a reset.



For going my way, indeed I now see that I have to take the 
reset write side lock during HW fence signalling in order 
to protect against scheduler/HW fence detachment and 
reattachment during scheduler stop/restart. But if we go 
with your approach then calling drm_dev_unplug and scoping 
amdgpu_job_timeout with drm_dev_enter/exit should be enough 
to prevent any concurrent GPU resets during unplug. In fact 
I already do it anyway - 
https://cgit.freedesktop.org/~agrodzov/linux/commit/?h=drm-misc-next&id=ef0ea4dd29ef44d2649c5eda16c8f4869acc36b1


Yes, good point as well.

Christian.



Andrey





Christian.



Andrey






Christian.



Andrey





Andrey
























Re: [PATCH 0/4] Refine GPU recovery sequence to enhance its stability

2021-04-13 Thread Christian König

Am 13.04.21 um 07:36 schrieb Andrey Grodzovsky:

[SNIP]



emit_fence(fence);


       /* We can't wait forever as the HW might be gone at any point */
       dma_fence_wait_timeout(old_fence, false, 5 * HZ);



You can pretty much ignore this wait here. It is only as a last 
resort so that we never overwrite the ring buffers.



If device is present how can I ignore this ?



I think you missed my question here



Sorry I thought I answered that below.

See this is just the last resort so that we don't need to worry about 
ring buffer overflows during testing.


We should not get here in practice and if we get here generating a 
deadlock might actually be the best handling.


The alternative would be to call BUG().



BTW, I am not sure it's so improbable to get here in the case of a sudden 
device remove: if you are in the middle of rapid command submission to the 
ring at that time, you could easily get a ring buffer overrun, because the 
EOP interrupts are gone and fences are not removed anymore, while new 
ones keep arriving from new submissions which haven't stopped yet.




During normal operation hardware fences are only created by two code paths:
1. The scheduler when it pushes jobs to the hardware.
2. The KIQ when it does register access on SRIOV.

Both are limited in how many submissions could be made.

The only case where this here becomes necessary is during GPU reset when 
we do direct submission bypassing the scheduler for IB and other tests.


Christian.


Andrey





Re: [PATCH 0/4] Refine GPU recovery sequence to enhance its stability

2021-04-12 Thread Andrey Grodzovsky


On 2021-04-12 2:23 p.m., Christian König wrote:

Am 12.04.21 um 20:18 schrieb Andrey Grodzovsky:


On 2021-04-12 2:05 p.m., Christian König wrote:


Am 12.04.21 um 20:01 schrieb Andrey Grodzovsky:


On 2021-04-12 1:44 p.m., Christian König wrote:



Am 12.04.21 um 19:27 schrieb Andrey Grodzovsky:

On 2021-04-10 1:34 p.m., Christian König wrote:

Hi Andrey,

Am 09.04.21 um 20:18 schrieb Andrey Grodzovsky:

[SNIP]


If we use a list and a flag called 'emit_allowed' under a 
lock such that in amdgpu_fence_emit we lock the list, check 
the flag and if true add the new HW fence to the list and proceed 
to HW emission as normal, otherwise return with -ENODEV. In 
amdgpu_pci_remove we take the lock, set the flag to false, 
and then iterate the list and force-signal it. Will this not 
prevent any new HW fence creation from now on from any place 
trying to do so ?


Way too much overhead. The fence processing is intentionally 
lock-free to avoid cache-line bouncing because the IRQ can 
move from CPU to CPU.


We need something which at least the processing of fences in 
the interrupt handler doesn't affect at all.



As far as I see in the code, amdgpu_fence_emit is only called 
from task context. Also, we can skip this list I proposed and 
just use amdgpu_fence_driver_force_completion for each ring to 
signal all created HW fences.


Ah, wait a second this gave me another idea.

See amdgpu_fence_driver_force_completion():

amdgpu_fence_write(ring, ring->fence_drv.sync_seq);

If we change that to something like:

amdgpu_fence_write(ring, ring->fence_drv.sync_seq + 0x3FFF);

Not only the currently submitted, but also the next 0x3FFF 
fences will be considered signaled.


This basically solves our problem of making sure that new fences 
are also signaled, without any additional overhead whatsoever.
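A toy userspace model of why bumping the written value works, using 
0x3FFFFFFF purely as an example of a large offset (this is not the driver 
code, just the sequence-number arithmetic):

#include <stdio.h>
#include <stdint.h>
#include <stdbool.h>

static uint32_t last_written;  /* models the value fence_write puts in memory */
static uint32_t sync_seq;      /* last sequence number actually emitted */

static bool fence_signaled(uint32_t seq)
{
    /* wrap-safe "has the written counter passed seq?" check */
    return (int32_t)(last_written - seq) >= 0;
}

int main(void)
{
    sync_seq = 100;        /* pretend 100 fences were emitted */
    last_written = 97;     /* ...and the HW completed 97 of them */

    printf("fence 99 signaled?  %d\n", fence_signaled(99));   /* 0 */

    /* force completion: write far ahead of everything emitted so far */
    last_written = sync_seq + 0x3FFFFFFF;

    printf("fence 99 signaled?  %d\n", fence_signaled(99));   /* 1 */
    printf("fence 120 signaled? %d\n", fence_signaled(120));  /* 1, even a
                                          fence emitted after the bump */
    return 0;
}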



Problem with this is that the act of setting the sync_seq to some 
MAX value alone is not enough, you actually have to call 
amdgpu_fence_process to iterate and signal the fences currently 
stored in ring->fence_drv.fences array and to guarantee that once 
you done your signalling no more HW fences will be added to that 
array anymore. I was thinking to do something like bellow:




Well we could implement the is_signaled callback once more, but 
I'm not sure if that is a good idea.



This indeed could save the explicit signaling I am doing bellow but 
I also set an error code there which might be helpful to propagate 
to users






amdgpu_fence_emit()

{

    dma_fence_init(fence);

    srcu_read_lock(amdgpu_unplug_srcu)

    if (!adev->unplug) {

        seq = ++ring->fence_drv.sync_seq;
        emit_fence(fence);

       /* We can't wait forever as the HW might be gone at any point */
       dma_fence_wait_timeout(old_fence, false, 5 * HZ);



You can pretty much ignore this wait here. It is only as a last 
resort so that we never overwrite the ring buffers.



If device is present how can I ignore this ?



I think you missed my question here



Sorry I thought I answered that below.

See this is just the last resort so that we don't need to worry about 
ring buffer overflows during testing.


We should not get here in practice and if we get here generating a 
deadlock might actually be the best handling.


The alternative would be to call BUG().



BTW, I am not sure it's so improbable to get here in the case of a sudden 
device remove: if you are in the middle of rapid command submission to the 
ring at that time, you could easily get a ring buffer overrun, because the 
EOP interrupts are gone and fences are not removed anymore, while new ones 
keep arriving from new submissions which haven't stopped yet.
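To see why that last-resort wait bounds ring-buffer reuse, here is a toy 
userspace model of the fence slots (not driver code): slot (seq & mask) is 
recycled every num_fences emissions, so emitting fence seq first has to 
wait on the fence that previously occupied that slot.

#include <stdio.h>
#include <stdint.h>

#define NUM_FENCES 8u                      /* power of two, like the driver */
#define MASK       (NUM_FENCES - 1u)

int main(void)
{
    uint32_t slot_owner[NUM_FENCES] = {0};

    for (uint32_t seq = 1; seq <= 20; seq++) {
        uint32_t slot = seq & MASK;
        uint32_t old  = slot_owner[slot];

        if (old)
            printf("emit %2u: slot %u, must wait on old fence %u\n",
                   seq, slot, old);
        else
            printf("emit %2u: slot %u, empty\n", seq, slot);

        slot_owner[slot] = seq;
    }
    return 0;
}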


Andrey








But it should not have a timeout as far as I can see.



Without a timeout on the wait the whole approach falls apart, as I can't call 
synchronize_srcu on this scope because once the device is physically 
gone the wait here will last forever.




Yeah, but this is intentional. The only alternative to avoid 
corruption is to wait with a timeout and call BUG() if that 
triggers. That isn't much better.






        ring->fence_drv.fences[seq & 
ring->fence_drv.num_fences_mask] = fence;


    } else {

        dma_fence_set_error(fence, -ENODEV);
        dma_fence_signal(fence);

    }

    srcu_read_unlock(amdgpu_unplug_srcu)
    return fence;

}

amdgpu_pci_remove

{

    adev->unplug = true;
    synchronize_srcu(amdgpu_unplug_srcu)



Well that is just duplicating what drm_dev_unplug() should be 
doing on a different level.



drm_dev_unplug is on a much wider scope, for everything in the 
device including 'flushing' in flight IOCTLs, this deals 
specifically with the issue of force signalling HW fences




Yeah, but it adds the same overhead as the device srcu.

Christian.



So what's the right approach ? How we guarantee that when running 
amdgpu_fence_driver_force_completion we will signal all the HW fences 
and not racing against some more fences insertion into that array ?




Well I would still say the best approach would be to insert 

Re: [PATCH 0/4] Refine GPU recovery sequence to enhance its stability

2021-04-12 Thread Andrey Grodzovsky


On 2021-04-12 3:18 p.m., Christian König wrote:

Am 12.04.21 um 21:12 schrieb Andrey Grodzovsky:

[SNIP]


So what's the right approach ? How we guarantee that when running 
amdgpu_fence_driver_force_completion we will signal all the HW 
fences and not racing against some more fences insertion into that 
array ?




Well I would still say the best approach would be to insert this 
between the front end and the backend and not rely on signaling 
fences while holding the device srcu.



My question is, even now, when we run 
amdgpu_fence_driver_fini_hw->amdgpu_fence_wait_empty or 
amdgpu_fence_driver_fini_hw->amdgpu_fence_driver_force_completion, 
what there prevents a race with another fence being at the same time 
emitted and inserted into the fence array ? Looks like nothing.




Each ring can only be used by one thread at the same time, this 
includes emitting fences as well as other stuff.


During GPU reset we make sure nobody writes to the rings by stopping 
the scheduler and taking the GPU reset lock (so that nobody else can 
start the scheduler again).



What about direct submissions not through scheduler - 
amdgpu_job_submit_direct, I don't see how this is protected.







BTW: Could it be that the device SRCU protects more than one device 
and we deadlock because of this?



I haven't actually experienced any deadlock until now but, yes, 
drm_unplug_srcu is defined as static in drm_drv.c and so in the 
presence  of multiple devices from same or different drivers we in 
fact are dependent on all their critical sections i guess.




Shit, yeah the devil is a squirrel. So for A+I laptops we actually 
need to sync that up with Daniel and the rest of the i915 guys.


IIRC we could actually have an amdgpu device in a docking station 
which needs hotplug and the driver might depend on waiting for the 
i915 driver as well.



Can't we propose a patch to make drm_unplug_srcu per drm_device ? I 
don't see why it has to be global and not per device thing.


Andrey




Christian.


Andrey




Christian.


Andrey





Andrey




Christian.

    /* Past this point no more fence are submitted to HW ring 
and hence we can safely call force signal on all that are 
currently there.
 * Any subsequently created  HW fences will be returned 
signaled with an error code right away

 */

    for_each_ring(adev)
        amdgpu_fence_process(ring)

    drm_dev_unplug(dev);
    Stop schedulers
    cancel_sync(all timers and queued works);
    hw_fini
    unmap_mmio

}


Andrey









Alternatively grabbing the reset write side and stopping 
and then restarting the scheduler could work as well.


Christian.



I didn't get the above and I don't see why I need to reuse 
the GPU reset rw_lock. I rely on the SRCU unplug flag for 
unplug. Also, not clear to me why are we focusing on the 
scheduler threads, any code patch to generate HW fences 
should be covered, so any code leading to amdgpu_fence_emit 
needs to be taken into account such as, direct IB 
submissions, VM flushes e.t.c


You need to work together with the reset lock anyway, cause 
a hotplug could run at the same time as a reset.



For going my way indeed now I see now that I have to take 
reset write side lock during HW fences signalling in order to 
protect against scheduler/HW fences detachment and 
reattachment during schedulers stop/restart. But if we go 
with your approach  then calling drm_dev_unplug and scoping 
amdgpu_job_timeout with drm_dev_enter/exit should be enough 
to prevent any concurrent GPU resets during unplug. In fact I 
already do it anyway - 
https://cgit.freedesktop.org/~agrodzov/linux/commit/?h=drm-misc-next&id=ef0ea4dd29ef44d2649c5eda16c8f4869acc36b1


Yes, good point as well.

Christian.



Andrey





Christian.



Andrey






Christian.



Andrey





Andrey























Re: [PATCH 0/4] Refine GPU recovery sequence to enhance its stability

2021-04-12 Thread Christian König

Am 12.04.21 um 21:12 schrieb Andrey Grodzovsky:

[SNIP]


So what's the right approach ? How we guarantee that when running 
amdgpu_fence_driver_force_completion we will signal all the HW 
fences and not racing against some more fences insertion into that 
array ?




Well I would still say the best approach would be to insert this 
between the front end and the backend and not rely on signaling 
fences while holding the device srcu.



My question is, even now, when we run 
amdgpu_fence_driver_fini_hw->amdgpu_fence_wait_empty or 
amdgpu_fence_driver_fini_hw->amdgpu_fence_driver_force_completion, 
what there prevents a race with another fence being at the same time 
emitted and inserted into the fence array ? Looks like nothing.




Each ring can only be used by one thread at the same time, this includes 
emitting fences as well as other stuff.


During GPU reset we make sure nobody writes to the rings by stopping the 
scheduler and taking the GPU reset lock (so that nobody else can start 
the scheduler again).




BTW: Could it be that the device SRCU protects more than one device 
and we deadlock because of this?



I haven't actually experienced any deadlock until now but, yes, 
drm_unplug_srcu is defined as static in drm_drv.c and so in the 
presence  of multiple devices from same or different drivers we in 
fact are dependent on all their critical sections i guess.




Shit, yeah the devil is a squirrel. So for A+I laptops we actually need 
to sync that up with Daniel and the rest of the i915 guys.


IIRC we could actually have an amdgpu device in a docking station which 
needs hotplug and the driver might depend on waiting for the i915 driver 
as well.


Christian.


Andrey




Christian.


Andrey





Andrey




Christian.

    /* Past this point no more fence are submitted to HW ring 
and hence we can safely call force signal on all that are 
currently there.
 * Any subsequently created  HW fences will be returned 
signaled with an error code right away

 */

    for_each_ring(adev)
        amdgpu_fence_process(ring)

    drm_dev_unplug(dev);
    Stop schedulers
    cancel_sync(all timers and queued works);
    hw_fini
    unmap_mmio

}


Andrey









Alternatively grabbing the reset write side and stopping 
and then restarting the scheduler could work as well.


Christian.



I didn't get the above and I don't see why I need to reuse 
the GPU reset rw_lock. I rely on the SRCU unplug flag for 
unplug. Also, not clear to me why are we focusing on the 
scheduler threads, any code patch to generate HW fences 
should be covered, so any code leading to amdgpu_fence_emit 
needs to be taken into account such as, direct IB 
submissions, VM flushes e.t.c


You need to work together with the reset lock anyway, cause a 
hotplug could run at the same time as a reset.



For going my way indeed now I see now that I have to take 
reset write side lock during HW fences signalling in order to 
protect against scheduler/HW fences detachment and 
reattachment during schedulers stop/restart. But if we go with 
your approach  then calling drm_dev_unplug and scoping 
amdgpu_job_timeout with drm_dev_enter/exit should be enough to 
prevent any concurrent GPU resets during unplug. In fact I 
already do it anyway - 
https://cgit.freedesktop.org/~agrodzov/linux/commit/?h=drm-misc-next&id=ef0ea4dd29ef44d2649c5eda16c8f4869acc36b1


Yes, good point as well.

Christian.



Andrey





Christian.



Andrey






Christian.



Andrey





Andrey






















Re: [PATCH 0/4] Refine GPU recovery sequence to enhance its stability

2021-04-12 Thread Andrey Grodzovsky


On 2021-04-12 2:23 p.m., Christian König wrote:

Am 12.04.21 um 20:18 schrieb Andrey Grodzovsky:


On 2021-04-12 2:05 p.m., Christian König wrote:


Am 12.04.21 um 20:01 schrieb Andrey Grodzovsky:


On 2021-04-12 1:44 p.m., Christian König wrote:



Am 12.04.21 um 19:27 schrieb Andrey Grodzovsky:

On 2021-04-10 1:34 p.m., Christian König wrote:

Hi Andrey,

Am 09.04.21 um 20:18 schrieb Andrey Grodzovsky:

[SNIP]


If we use a list and a flag called 'emit_allowed' under a 
lock such that in amdgpu_fence_emit we lock the list, check 
the flag and if true add the new HW fence to list and proceed 
to HW emition as normal, otherwise return with -ENODEV. In 
amdgpu_pci_remove we take the lock, set the flag to false, 
and then iterate the list and force signal it. Will this not 
prevent any new HW fence creation from now on from any place 
trying to do so ?


Way to much overhead. The fence processing is intentionally 
lock free to avoid cache line bouncing because the IRQ can 
move from CPU to CPU.


We need something which at least the processing of fences in 
the interrupt handler doesn't affect at all.



As far as I see in the code, amdgpu_fence_emit is only called 
from task context. Also, we can skip this list I proposed and 
just use amdgpu_fence_driver_force_completion for each ring to 
signal all created HW fences.


Ah, wait a second this gave me another idea.

See amdgpu_fence_driver_force_completion():

amdgpu_fence_write(ring, ring->fence_drv.sync_seq);

If we change that to something like:

amdgpu_fence_write(ring, ring->fence_drv.sync_seq + 0x3FFF);

Not only the currently submitted, but also the next 0x3FFF 
fences will be considered signaled.


This basically solves out problem of making sure that new fences 
are also signaled without any additional overhead whatsoever.



Problem with this is that the act of setting the sync_seq to some 
MAX value alone is not enough, you actually have to call 
amdgpu_fence_process to iterate and signal the fences currently 
stored in ring->fence_drv.fences array and to guarantee that once 
you done your signalling no more HW fences will be added to that 
array anymore. I was thinking to do something like bellow:




Well we could implement the is_signaled callback once more, but 
I'm not sure if that is a good idea.



This indeed could save the explicit signaling I am doing bellow but 
I also set an error code there which might be helpful to propagate 
to users






amdgpu_fence_emit()

{

    dma_fence_init(fence);

    srcu_read_lock(amdgpu_unplug_srcu)

    if (!adev->unplug) {

        seq = ++ring->fence_drv.sync_seq;
        emit_fence(fence);

       /* We can't wait forever as the HW might be gone at any point */
       dma_fence_wait_timeout(old_fence, false, 5 * HZ);



You can pretty much ignore this wait here. It is only as a last 
resort so that we never overwrite the ring buffers.



If device is present how can I ignore this ?



I think you missed my question here



Sorry I thought I answered that below.

See this is just the last resort so that we don't need to worry about 
ring buffer overflows during testing.


We should not get here in practice and if we get here generating a 
deadlock might actually be the best handling.


The alternative would be to call BUG().





But it should not have a timeout as far as I can see.



Without a timeout on the wait the whole approach falls apart, as I can't call 
synchronize_srcu on this scope because once the device is physically gone 
the wait here will last forever.




Yeah, but this is intentional. The only alternative to avoid 
corruption is to wait with a timeout and call BUG() if that 
triggers. That isn't much better.






        ring->fence_drv.fences[seq & 
ring->fence_drv.num_fences_mask] = fence;


    } else {

        dma_fence_set_error(fence, -ENODEV);
        dma_fence_signal(fence);

    }

    srcu_read_unlock(amdgpu_unplug_srcu)
    return fence;

}

amdgpu_pci_remove

{

    adev->unplug = true;
    synchronize_srcu(amdgpu_unplug_srcu)



Well that is just duplicating what drm_dev_unplug() should be 
doing on a different level.



drm_dev_unplug is on a much wider scope, for everything in the 
device including 'flushing' in flight IOCTLs, this deals 
specifically with the issue of force signalling HW fences




Yeah, but it adds the same overhead as the device srcu.

Christian.



So what's the right approach ? How we guarantee that when running 
amdgpu_fence_driver_force_completion we will signal all the HW fences 
and not racing against some more fences insertion into that array ?




Well I would still say the best approach would be to insert this 
between the front end and the backend and not rely on signaling fences 
while holding the device srcu.



My question is, even now, when we run 
amdgpu_fence_driver_fini_hw->amdgpu_fence_wait_empty or 
amdgpu_fence_driver_fini_hw->amdgpu_fence_driver_force_completion, what 
prevents a race there with another fence being emitted and inserted into 
the fence array at the same time ?

Re: [PATCH 0/4] Refine GPU recovery sequence to enhance its stability

2021-04-12 Thread Christian König

Am 12.04.21 um 20:18 schrieb Andrey Grodzovsky:


On 2021-04-12 2:05 p.m., Christian König wrote:


Am 12.04.21 um 20:01 schrieb Andrey Grodzovsky:


On 2021-04-12 1:44 p.m., Christian König wrote:



Am 12.04.21 um 19:27 schrieb Andrey Grodzovsky:

On 2021-04-10 1:34 p.m., Christian König wrote:

Hi Andrey,

Am 09.04.21 um 20:18 schrieb Andrey Grodzovsky:

[SNIP]


If we use a list and a flag called 'emit_allowed' under a lock 
such that in amdgpu_fence_emit we lock the list, check the 
flag and if true add the new HW fence to list and proceed to 
HW emition as normal, otherwise return with -ENODEV. In 
amdgpu_pci_remove we take the lock, set the flag to false, and 
then iterate the list and force signal it. Will this not 
prevent any new HW fence creation from now on from any place 
trying to do so ?


Way to much overhead. The fence processing is intentionally 
lock free to avoid cache line bouncing because the IRQ can move 
from CPU to CPU.


We need something which at least the processing of fences in 
the interrupt handler doesn't affect at all.



As far as I see in the code, amdgpu_fence_emit is only called 
from task context. Also, we can skip this list I proposed and 
just use amdgpu_fence_driver_force_completion for each ring to 
signal all created HW fences.


Ah, wait a second this gave me another idea.

See amdgpu_fence_driver_force_completion():

amdgpu_fence_write(ring, ring->fence_drv.sync_seq);

If we change that to something like:

amdgpu_fence_write(ring, ring->fence_drv.sync_seq + 0x3FFF);

Not only the currently submitted, but also the next 0x3FFF 
fences will be considered signaled.


This basically solves out problem of making sure that new fences 
are also signaled without any additional overhead whatsoever.



Problem with this is that the act of setting the sync_seq to some 
MAX value alone is not enough, you actually have to call 
amdgpu_fence_process to iterate and signal the fences currently 
stored in ring->fence_drv.fences array and to guarantee that once 
you done your signalling no more HW fences will be added to that 
array anymore. I was thinking to do something like bellow:




Well we could implement the is_signaled callback once more, but I'm 
not sure if that is a good idea.



This indeed could save the explicit signaling I am doing bellow but 
I also set an error code there which might be helpful to propagate 
to users






amdgpu_fence_emit()

{

    dma_fence_init(fence);

    srcu_read_lock(amdgpu_unplug_srcu)

    if (!adev->unplug) {

        seq = ++ring->fence_drv.sync_seq;
        emit_fence(fence);

       /* We can't wait forever as the HW might be gone at any point */
       dma_fence_wait_timeout(old_fence, false, 5 * HZ);



You can pretty much ignore this wait here. It is only as a last 
resort so that we never overwrite the ring buffers.



If device is present how can I ignore this ?



I think you missed my question here



Sorry I thought I answered that below.

See this is just the last resort so that we don't need to worry about 
ring buffer overflows during testing.


We should not get here in practice and if we get here generating a 
deadlock might actually be the best handling.


The alternative would be to call BUG().





But it should not have a timeout as far as I can see.



Without timeout wait the who approach falls apart as I can't call 
srcu_synchronize on this scope because once device is physically 
gone the wait here will be forever




Yeah, but this is intentional. The only alternative to avoid 
corruption is to wait with a timeout and call BUG() if that triggers. 
That isn't much better.






        ring->fence_drv.fences[seq & 
ring->fence_drv.num_fences_mask] = fence;


    } else {

        dma_fence_set_error(fence, -ENODEV);
        dma_fence_signal(fence);

    }

    srcu_read_unlock(amdgpu_unplug_srcu)
    return fence;

}

amdgpu_pci_remove

{

    adev->unplug = true;
    synchronize_srcu(amdgpu_unplug_srcu)



Well that is just duplicating what drm_dev_unplug() should be doing 
on a different level.



drm_dev_unplug is on a much wider scope, for everything in the 
device including 'flushing' in flight IOCTLs, this deals 
specifically with the issue of force signalling HW fences




Yeah, but it adds the same overhead as the device srcu.

Christian.



So what's the right approach ? How we guarantee that when running 
amdgpu_fence_driver_force_completion we will signal all the HW fences 
and not racing against some more fences insertion into that array ?




Well I would still say the best approach would be to insert this between 
the front end and the backend and not rely on signaling fences while 
holding the device srcu.


BTW: Could it be that the device SRCU protects more than one device and 
we deadlock because of this?


Christian.


Andrey





Andrey




Christian.

    /* Past this point no more fences are submitted to the HW ring and 
hence we can safely force-signal all that are currently 
there.
     * Any subsequently created HW fences will be returned signaled 
with an error code right away */

Re: [PATCH 0/4] Refine GPU recovery sequence to enhance its stability

2021-04-12 Thread Andrey Grodzovsky


On 2021-04-12 2:05 p.m., Christian König wrote:

Am 12.04.21 um 20:01 schrieb Andrey Grodzovsky:


On 2021-04-12 1:44 p.m., Christian König wrote:



Am 12.04.21 um 19:27 schrieb Andrey Grodzovsky:

On 2021-04-10 1:34 p.m., Christian König wrote:

Hi Andrey,

Am 09.04.21 um 20:18 schrieb Andrey Grodzovsky:

[SNIP]


If we use a list and a flag called 'emit_allowed' under a lock 
such that in amdgpu_fence_emit we lock the list, check the flag 
and if true add the new HW fence to list and proceed to HW 
emition as normal, otherwise return with -ENODEV. In 
amdgpu_pci_remove we take the lock, set the flag to false, and 
then iterate the list and force signal it. Will this not 
prevent any new HW fence creation from now on from any place 
trying to do so ?


Way to much overhead. The fence processing is intentionally lock 
free to avoid cache line bouncing because the IRQ can move from 
CPU to CPU.


We need something which at least the processing of fences in the 
interrupt handler doesn't affect at all.



As far as I see in the code, amdgpu_fence_emit is only called 
from task context. Also, we can skip this list I proposed and 
just use amdgpu_fence_driver_force_completion for each ring to 
signal all created HW fences.


Ah, wait a second this gave me another idea.

See amdgpu_fence_driver_force_completion():

amdgpu_fence_write(ring, ring->fence_drv.sync_seq);

If we change that to something like:

amdgpu_fence_write(ring, ring->fence_drv.sync_seq + 0x3FFF);

Not only the currently submitted, but also the next 0x3FFF 
fences will be considered signaled.


This basically solves out problem of making sure that new fences 
are also signaled without any additional overhead whatsoever.



Problem with this is that the act of setting the sync_seq to some 
MAX value alone is not enough, you actually have to call 
amdgpu_fence_process to iterate and signal the fences currently 
stored in ring->fence_drv.fences array and to guarantee that once 
you done your signalling no more HW fences will be added to that 
array anymore. I was thinking to do something like bellow:




Well we could implement the is_signaled callback once more, but I'm 
not sure if that is a good idea.



This indeed could save the explicit signaling I am doing bellow but I 
also set an error code there which might be helpful to propagate to users






amdgpu_fence_emit()

{

    dma_fence_init(fence);

    srcu_read_lock(amdgpu_unplug_srcu)

    if (!adev->unplug) {

        seq = ++ring->fence_drv.sync_seq;
        emit_fence(fence);

       /* We can't wait forever as the HW might be gone at any point */
       dma_fence_wait_timeout(old_fence, false, 5 * HZ);



You can pretty much ignore this wait here. It is only as a last 
resort so that we never overwrite the ring buffers.



If device is present how can I ignore this ?



I think you missed my question here






But it should not have a timeout as far as I can see.



Without timeout wait the who approach falls apart as I can't call 
srcu_synchronize on this scope because once device is physically gone 
the wait here will be forever




Yeah, but this is intentional. The only alternative to avoid 
corruption is to wait with a timeout and call BUG() if that triggers. 
That isn't much better.






        ring->fence_drv.fences[seq & 
ring->fence_drv.num_fences_mask] = fence;


    } else {

        dma_fence_set_error(fence, -ENODEV);
        dma_fence_signal(fence);

    }

    srcu_read_unlock(amdgpu_unplug_srcu)
    return fence;

}

amdgpu_pci_remove

{

    adev->unplug = true;
    synchronize_srcu(amdgpu_unplug_srcu)



Well that is just duplicating what drm_dev_unplug() should be doing 
on a different level.



drm_dev_unplug is on a much wider scope, for everything in the device 
including 'flushing' in flight IOCTLs, this deals specifically with 
the issue of force signalling HW fences




Yeah, but it adds the same overhead as the device srcu.

Christian.



So what's the right approach ? How we guarantee that when running 
amdgpu_fence_driver_force_completion we will signal all the HW fences 
and not racing against some more fences insertion into that array ?


Andrey





Andrey




Christian.

    /* Past this point no more fence are submitted to HW ring and 
hence we can safely call force signal on all that are currently there.
 * Any subsequently created  HW fences will be returned 
signaled with an error code right away

 */

    for_each_ring(adev)
        amdgpu_fence_process(ring)

    drm_dev_unplug(dev);
    Stop schedulers
    cancel_sync(all timers and queued works);
    hw_fini
    unmap_mmio

}


Andrey









Alternatively grabbing the reset write side and stopping and 
then restarting the scheduler could work as well.


Christian.



I didn't get the above and I don't see why I need to reuse the 
GPU reset rw_lock. I rely on the SRCU unplug flag for unplug. 
Also, not clear to me why are we focusing on the scheduler 
threads, any code patch to 

Re: [PATCH 0/4] Refine GPU recovery sequence to enhance its stability

2021-04-12 Thread Christian König

Am 12.04.21 um 20:01 schrieb Andrey Grodzovsky:


On 2021-04-12 1:44 p.m., Christian König wrote:



Am 12.04.21 um 19:27 schrieb Andrey Grodzovsky:

On 2021-04-10 1:34 p.m., Christian König wrote:

Hi Andrey,

Am 09.04.21 um 20:18 schrieb Andrey Grodzovsky:

[SNIP]


If we use a list and a flag called 'emit_allowed' under a lock 
such that in amdgpu_fence_emit we lock the list, check the flag 
and if true add the new HW fence to list and proceed to HW 
emition as normal, otherwise return with -ENODEV. In 
amdgpu_pci_remove we take the lock, set the flag to false, and 
then iterate the list and force signal it. Will this not prevent 
any new HW fence creation from now on from any place trying to 
do so ?


Way to much overhead. The fence processing is intentionally lock 
free to avoid cache line bouncing because the IRQ can move from 
CPU to CPU.


We need something which at least doesn't affect the processing of 
fences in the interrupt handler at all.



As far as I see in the code, amdgpu_fence_emit is only called from 
task context. Also, we can skip this list I proposed and just use 
amdgpu_fence_driver_force_completion for each ring to signal all 
created HW fences.


Ah, wait a second this gave me another idea.

See amdgpu_fence_driver_force_completion():

amdgpu_fence_write(ring, ring->fence_drv.sync_seq);

If we change that to something like:

amdgpu_fence_write(ring, ring->fence_drv.sync_seq + 0x3FFF);

Not only the currently submitted, but also the next 0x3FFF 
fences will be considered signaled.


This basically solves our problem of making sure that new fences 
are also signaled without any additional overhead whatsoever.
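
A hedged sketch of the idea (the follow-up below notes that 
amdgpu_fence_process() also has to run to actually signal what is already 
in the array, so it is included here; not the actual patch):

static void force_completion_sketch(struct amdgpu_ring *ring)
{
        /* Report sync_seq plus a large offset as the last completed seqno,
         * so the seqno check treats current and future fences as done.
         */
        amdgpu_fence_write(ring, ring->fence_drv.sync_seq + 0x3FFF);
        amdgpu_fence_process(ring);     /* signal everything now "completed" */
}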



Problem with this is that the act of setting the sync_seq to some 
MAX value alone is not enough, you actually have to call 
amdgpu_fence_process to iterate and signal the fences currently 
stored in the ring->fence_drv.fences array, and to guarantee that once 
you are done signalling no more HW fences will be added to that 
array anymore. I was thinking to do something like below:




Well we could implement the is_signaled callback once more, but I'm 
not sure if that is a good idea.



This indeed could save the explicit signaling I am doing below, but I 
also set an error code there which might be helpful to propagate to users.
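
For reference, a hedged sketch of what the ->signaled() route could look 
like (the helpers amdgpu_fence_to_adev() and hw_seq_signaled() are invented); 
as noted, it cannot carry the -ENODEV error that explicit signaling sets:

static bool unplug_aware_signaled(struct dma_fence *f)
{
        struct amdgpu_device *adev = amdgpu_fence_to_adev(f);

        /* once the unplug flag is set, never block on the dead HW */
        return adev->unplug || hw_seq_signaled(f);
}

static const struct dma_fence_ops unplug_aware_fence_ops = {
        .signaled = unplug_aware_signaled,
        /* other callbacks elided */
};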






amdgpu_fence_emit()

{

    dma_fence_init(fence);

    srcu_read_lock(amdgpu_unplug_srcu)

    if (!adev->unplug) {

        seq = ++ring->fence_drv.sync_seq;
        emit_fence(fence);

        /* We can't wait forever as the HW might be gone at any point */
        dma_fence_wait_timeout(old_fence, 5S);



You can pretty much ignore this wait here. It is only as a last 
resort so that we never overwrite the ring buffers.



If the device is present, how can I ignore this?




But it should not have a timeout as far as I can see.



Without a timeout on the wait the whole approach falls apart, as I can't 
call synchronize_srcu on this scope, because once the device is physically 
gone the wait here will be forever.




Yeah, but this is intentional. The only alternative to avoid corruption 
is to wait with a timeout and call BUG() if that triggers. That isn't 
much better.






        ring->fence_drv.fences[seq & 
ring->fence_drv.num_fences_mask] = fence;


    } else {

        dma_fence_set_error(fence, -ENODEV);
        dma_fence_signal(fence);

    }

    srcu_read_unlock(amdgpu_unplug_srcu)
    return fence;

}

amdgpu_pci_remove

{

    adev->unplug = true;
    synchronize_srcu(amdgpu_unplug_srcu)



Well that is just duplicating what drm_dev_unplug() should be doing 
on a different level.



drm_dev_unplug has a much wider scope, covering everything in the 
device including 'flushing' in-flight IOCTLs; this deals specifically 
with the issue of force signalling HW fences.




Yeah, but it adds the same overhead as the device srcu.

Christian.


Andrey




Christian.

    /* Past this point no more fences are submitted to the HW ring and
     * hence we can safely force signal all that are currently there.
     * Any subsequently created HW fences will be returned signaled
     * with an error code right away.
     */

    for_each_ring(adev)
        amdgpu_fence_process(ring)

    drm_dev_unplug(dev);
    Stop schedulers
    cancel_sync(all timers and queued works);
    hw_fini
    unmap_mmio

}


Andrey









Alternatively grabbing the reset write side and stopping and 
then restarting the scheduler could work as well.


Christian.



I didn't get the above and I don't see why I need to reuse the 
GPU reset rw_lock. I rely on the SRCU unplug flag for unplug. 
Also, it's not clear to me why we are focusing on the scheduler 
threads; any code path to generate HW fences should be covered, 
so any code leading to amdgpu_fence_emit needs to be taken into 
account, such as direct IB submissions, VM flushes, etc.


You need to work together with the reset lock anyway, cause a 
hotplug could run at the same time as a reset.



For going my way, indeed I now see that 

Re: [PATCH 0/4] Refine GPU recovery sequence to enhance its stability

2021-04-12 Thread Andrey Grodzovsky


On 2021-04-12 1:44 p.m., Christian König wrote:


Am 12.04.21 um 19:27 schrieb Andrey Grodzovsky:

On 2021-04-10 1:34 p.m., Christian König wrote:

Hi Andrey,

Am 09.04.21 um 20:18 schrieb Andrey Grodzovsky:

[SNIP]


If we use a list and a flag called 'emit_allowed' under a lock 
such that in amdgpu_fence_emit we lock the list, check the flag 
and if true add the new HW fence to list and proceed to HW 
emission as normal, otherwise return with -ENODEV. In 
amdgpu_pci_remove we take the lock, set the flag to false, and 
then iterate the list and force signal it. Will this not prevent 
any new HW fence creation from now on from any place trying to do 
so ?


Way too much overhead. The fence processing is intentionally lock 
free to avoid cache line bouncing because the IRQ can move from 
CPU to CPU.


We need something which at least doesn't affect the processing of 
fences in the interrupt handler at all.



As far as I see in the code, amdgpu_fence_emit is only called from 
task context. Also, we can skip this list I proposed and just use 
amdgpu_fence_driver_force_completion for each ring to signal all 
created HW fences.


Ah, wait a second this gave me another idea.

See amdgpu_fence_driver_force_completion():

amdgpu_fence_write(ring, ring->fence_drv.sync_seq);

If we change that to something like:

amdgpu_fence_write(ring, ring->fence_drv.sync_seq + 0x3FFF);

Not only the currently submitted, but also the next 0x3FFF 
fences will be considered signaled.


This basically solves our problem of making sure that new fences are 
also signaled without any additional overhead whatsoever.



Problem with this is that the act of setting the sync_seq to some MAX 
value alone is not enough, you actually have to call 
amdgpu_fence_process to iterate and signal the fences currently 
stored in the ring->fence_drv.fences array, and to guarantee that once you 
are done signalling no more HW fences will be added to that array 
anymore. I was thinking to do something like below:




Well we could implement the is_signaled callback once more, but I'm 
not sure if that is a good idea.



This indeed could save the explicit signaling I am doing below, but I 
also set an error code there which might be helpful to propagate to users.






amdgpu_fence_emit()

{

    dma_fence_init(fence);

    srcu_read_lock(amdgpu_unplug_srcu)

    if (!adev->unplug) {

        seq = ++ring->fence_drv.sync_seq;
        emit_fence(fence);

        /* We can't wait forever as the HW might be gone at any point */
        dma_fence_wait_timeout(old_fence, 5S);



You can pretty much ignore this wait here. It is only as a last resort 
so that we never overwrite the ring buffers.



If the device is present, how can I ignore this?




But it should not have a timeout as far as I can see.



Without a timeout on the wait the whole approach falls apart, as I can't 
call synchronize_srcu on this scope, because once the device is physically 
gone the wait here will be forever.





        ring->fence_drv.fences[seq & ring->fence_drv.num_fences_mask] 
= fence;


    } else {

        dma_fence_set_error(fence, -ENODEV);
        dma_fence_signal(fence);

    }

    srcu_read_unlock(amdgpu_unplug_srcu)
    return fence;

}

amdgpu_pci_remove

{

    adev->unplug = true;
    synchronize_srcu(amdgpu_unplug_srcu)



Well that is just duplicating what drm_dev_unplug() should be doing on 
a different level.



drm_dev_unplug has a much wider scope, covering everything in the 
device including 'flushing' in-flight IOCTLs; this deals specifically 
with the issue of force signalling HW fences.


Andrey




Christian.

    /* Past this point no more fences are submitted to the HW ring and
     * hence we can safely force signal all that are currently there.
     * Any subsequently created HW fences will be returned signaled
     * with an error code right away.
     */

    for_each_ring(adev)
        amdgpu_fence_process(ring)

    drm_dev_unplug(dev);
    Stop schedulers
    cancel_sync(all timers and queued works);
    hw_fini
    unmap_mmio

}


Andrey









Alternatively grabbing the reset write side and stopping and 
then restarting the scheduler could work as well.


Christian.



I didn't get the above and I don't see why I need to reuse the 
GPU reset rw_lock. I rely on the SRCU unplug flag for unplug. 
Also, it's not clear to me why we are focusing on the scheduler 
threads; any code path to generate HW fences should be covered, 
so any code leading to amdgpu_fence_emit needs to be taken into 
account, such as direct IB submissions, VM flushes, etc.


You need to work together with the reset lock anyway, cause a 
hotplug could run at the same time as a reset.



For going my way, indeed I now see that I have to take the reset 
write side lock during HW fences signalling in order to protect 
against scheduler/HW fences detachment and reattachment during 
schedulers stop/restart. But if we go with your approach  then 
calling drm_dev_unplug and scoping amdgpu_job_timeout with 

Re: [PATCH 0/4] Refine GPU recovery sequence to enhance its stability

2021-04-12 Thread Christian König


Am 12.04.21 um 19:27 schrieb Andrey Grodzovsky:

On 2021-04-10 1:34 p.m., Christian König wrote:

Hi Andrey,

Am 09.04.21 um 20:18 schrieb Andrey Grodzovsky:

[SNIP]


If we use a list and a flag called 'emit_allowed' under a lock 
such that in amdgpu_fence_emit we lock the list, check the flag 
and if true add the new HW fence to the list and proceed to HW emission 
as normal, otherwise return with -ENODEV. In amdgpu_pci_remove we 
take the lock, set the flag to false, and then iterate the list 
and force signal it. Will this not prevent any new HW fence 
creation from now on from any place trying to do so ?


Way too much overhead. The fence processing is intentionally lock 
free to avoid cache line bouncing because the IRQ can move from CPU 
to CPU.


We need something which at least doesn't affect the processing of 
fences in the interrupt handler at all.



As far as I see in the code, amdgpu_fence_emit is only called from 
task context. Also, we can skip this list I proposed and just use 
amdgpu_fence_driver_force_completion for each ring to signal all 
created HW fences.


Ah, wait a second this gave me another idea.

See amdgpu_fence_driver_force_completion():

amdgpu_fence_write(ring, ring->fence_drv.sync_seq);

If we change that to something like:

amdgpu_fence_write(ring, ring->fence_drv.sync_seq + 0x3FFF);

Not only the currently submitted, but also the next 0x3FFF fences 
will be considered signaled.


This basically solves our problem of making sure that new fences are 
also signaled without any additional overhead whatsoever.



Problem with this is that the act of setting the sync_seq to some MAX 
value alone is not enough, you actually have to call 
amdgpu_fence_process to iterate and signal the fences currently stored 
in the ring->fence_drv.fences array, and to guarantee that once you are 
done signalling no more HW fences will be added to that array anymore. 
I was thinking to do something like below:




Well we could implement the is_signaled callback once more, but I'm not 
sure if that is a good idea.



amdgpu_fence_emit()

{

    dma_fence_init(fence);

    srcu_read_lock(amdgpu_unplug_srcu)

    if (!adev->unplug) {

        seq = ++ring->fence_drv.sync_seq;
        emit_fence(fence);

        /* We can't wait forever as the HW might be gone at any point */
        dma_fence_wait_timeout(old_fence, 5S);



You can pretty much ignore this wait here. It is only as a last resort 
so that we never overwrite the ring buffers.


But it should not have a timeout as far as I can see.

        ring->fence_drv.fences[seq & ring->fence_drv.num_fences_mask] 
= fence;


    } else {

        dma_fence_set_error(fence, -ENODEV);
        dma_fence_signal(fence);

    }

    srcu_read_unlock(amdgpu_unplug_srcu)
    return fence;

}

amdgpu_pci_remove

{

    adev->unplug = true;
    synchronize_srcu(amdgpu_unplug_srcu)



Well that is just duplicating what drm_dev_unplug() should be doing on a 
different level.


Christian.

    /* Past this point no more fences are submitted to the HW ring and
     * hence we can safely force signal all that are currently there.
     * Any subsequently created HW fences will be returned signaled
     * with an error code right away.
     */

    for_each_ring(adev)
        amdgpu_fence_process(ring)

    drm_dev_unplug(dev);
    Stop schedulers
    cancel_sync(all timers and queued works);
    hw_fini
    unmap_mmio

}


Andrey









Alternatively grabbing the reset write side and stopping and then 
restarting the scheduler could work as well.


Christian.



I didn't get the above and I don't see why I need to reuse the GPU 
reset rw_lock. I rely on the SRCU unplug flag for unplug. Also, 
it's not clear to me why we are focusing on the scheduler threads; any 
code path to generate HW fences should be covered, so any code 
leading to amdgpu_fence_emit needs to be taken into account, such 
as direct IB submissions, VM flushes, etc.


You need to work together with the reset lock anyway, cause a 
hotplug could run at the same time as a reset.



For going my way, indeed I now see that I have to take the reset 
write side lock during HW fences signalling in order to protect 
against scheduler/HW fences detachment and reattachment during 
schedulers stop/restart. But if we go with your approach  then 
calling drm_dev_unplug and scoping amdgpu_job_timeout with 
drm_dev_enter/exit should be enough to prevent any concurrent GPU 
resets during unplug. In fact I already do it anyway - 
https://nam11.safelinks.protection.outlook.com/?url=https:%2F%2Fcgit.freedesktop.org%2F~agrodzov%2Flinux%2Fcommit%2F%3Fh%3Ddrm-misc-next%26id%3Def0ea4dd29ef44d2649c5eda16c8f4869acc36b1data=04%7C01%7Candrey.grodzovsky%40amd.com%7Ceefa9c90ed8c405ec3b708d8fc46daaa%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C637536728550884740%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000sdata=UiNaJE%2BH45iYmbwSDnMSKZS5z0iak0fNlbbfYqKS2Jo%3Dreserved=0


Yes, good point 

Re: [PATCH 0/4] Refine GPU recovery sequence to enhance its stability

2021-04-12 Thread Andrey Grodzovsky


On 2021-04-10 1:34 p.m., Christian König wrote:

Hi Andrey,

Am 09.04.21 um 20:18 schrieb Andrey Grodzovsky:

[SNIP]


If we use a list and a flag called 'emit_allowed' under a lock such 
that in amdgpu_fence_emit we lock the list, check the flag and if 
true add the new HW fence to the list and proceed to HW emission as 
normal, otherwise return with -ENODEV. In amdgpu_pci_remove we take 
the lock, set the flag to false, and then iterate the list and 
force signal it. Will this not prevent any new HW fence creation 
from now on from any place trying to do so ?


Way too much overhead. The fence processing is intentionally lock 
free to avoid cache line bouncing because the IRQ can move from CPU 
to CPU.


We need something which at least doesn't affect the processing of 
fences in the interrupt handler at all.



As far as I see in the code, amdgpu_fence_emit is only called from 
task context. Also, we can skip this list I proposed and just use 
amdgpu_fence_driver_force_completion for each ring to signal all 
created HW fences.


Ah, wait a second this gave me another idea.

See amdgpu_fence_driver_force_completion():

amdgpu_fence_write(ring, ring->fence_drv.sync_seq);

If we change that to something like:

amdgpu_fence_write(ring, ring->fence_drv.sync_seq + 0x3FFF);

Not only the currently submitted, but also the next 0x3FFF fences 
will be considered signaled.


This basically solves our problem of making sure that new fences are 
also signaled without any additional overhead whatsoever.



Problem with this is that the act of setting the sync_seq to some MAX 
value alone is not enough, you actually have to call 
amdgpu_fence_process to iterate and signal the fences currently stored 
in the ring->fence_drv.fences array, and to guarantee that once you are 
done signalling no more HW fences will be added to that array anymore. I was 
thinking to do something like below:


amdgpu_fence_emit()

{

    dma_fence_init(fence);

    srcu_read_lock(amdgpu_unplug_srcu)

    if (!adev->unplug) {

        seq = ++ring->fence_drv.sync_seq;
        emit_fence(fence);

        /* We can't wait forever as the HW might be gone at any point */
        dma_fence_wait_timeout(old_fence, 5S);
        ring->fence_drv.fences[seq & ring->fence_drv.num_fences_mask] = 
fence;


    } else {

        dma_fence_set_error(fence, -ENODEV);
        dma_fence_signal(fence);

    }

    srcu_read_unlock(amdgpu_unplug_srcu)
    return fence;

}

amdgpu_pci_remove

{

    adev->unplug = true;
    synchronize_srcu(amdgpu_unplug_srcu)

    /* Past this point no more fences are submitted to the HW ring and
     * hence we can safely force signal all that are currently there.
     * Any subsequently created HW fences will be returned signaled
     * with an error code right away.
     */

    for_each_ring(adev)
        amdgpu_fence_process(ring)

    drm_dev_unplug(dev);
    Stop schedulers
    cancel_sync(all timers and queued works);
    hw_fini
    unmap_mmio

}


Andrey









Alternatively grabbing the reset write side and stopping and then 
restarting the scheduler could work as well.


Christian.



I didn't get the above and I don't see why I need to reuse the GPU 
reset rw_lock. I rely on the SRCU unplug flag for unplug. Also, not 
clear to me why we are focusing on the scheduler threads; any code 
path to generate HW fences should be covered, so any code leading 
to amdgpu_fence_emit needs to be taken into account, such as direct 
IB submissions, VM flushes, etc.


You need to work together with the reset lock anyway, cause a 
hotplug could run at the same time as a reset.



For going my way, indeed I now see that I have to take the reset write 
side lock during HW fences signalling in order to protect against 
scheduler/HW fences detachment and reattachment during schedulers 
stop/restart. But if we go with your approach  then calling 
drm_dev_unplug and scoping amdgpu_job_timeout with drm_dev_enter/exit 
should be enough to prevent any concurrent GPU resets during unplug. 
In fact I already do it anyway - 
https://nam11.safelinks.protection.outlook.com/?url=https:%2F%2Fcgit.freedesktop.org%2F~agrodzov%2Flinux%2Fcommit%2F%3Fh%3Ddrm-misc-next%26id%3Def0ea4dd29ef44d2649c5eda16c8f4869acc36b1data=04%7C01%7Candrey.grodzovsky%40amd.com%7Ceefa9c90ed8c405ec3b708d8fc46daaa%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C637536728550884740%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000sdata=UiNaJE%2BH45iYmbwSDnMSKZS5z0iak0fNlbbfYqKS2Jo%3Dreserved=0


Yes, good point as well.

Christian.



Andrey





Christian.



Andrey






Christian.



Andrey





Andrey














Re: [PATCH 0/4] Refine GPU recovery sequence to enhance its stability

2021-04-10 Thread Christian König

Hi Andrey,

Am 09.04.21 um 20:18 schrieb Andrey Grodzovsky:

[SNIP]


If we use a list and a flag called 'emit_allowed' under a lock such 
that in amdgpu_fence_emit we lock the list, check the flag and if 
true add the new HW fence to the list and proceed to HW emission as 
normal, otherwise return with -ENODEV. In amdgpu_pci_remove we take 
the lock, set the flag to false, and then iterate the list and force 
signal it. Will this not prevent any new HW fence creation from now 
on from any place trying to do so ?


Way too much overhead. The fence processing is intentionally lock free 
to avoid cache line bouncing because the IRQ can move from CPU to CPU.


We need something which at least doesn't affect the processing of 
fences in the interrupt handler at all.



As far as I see in the code, amdgpu_fence_emit is only called from 
task context. Also, we can skip this list I proposed and just use 
amdgpu_fence_driver_force_completion for each ring to signal all 
created HW fences.


Ah, wait a second this gave me another idea.

See amdgpu_fence_driver_force_completion():

amdgpu_fence_write(ring, ring->fence_drv.sync_seq);

If we change that to something like:

amdgpu_fence_write(ring, ring->fence_drv.sync_seq + 0x3FFF);

Not only the currently submitted, but also the next 0x3FFF fences 
will be considered signaled.


This basically solves our problem of making sure that new fences are 
also signaled without any additional overhead whatsoever.






Alternatively grabbing the reset write side and stopping and then 
restarting the scheduler could work as well.


Christian.



I didn't get the above and I don't see why I need to reuse the GPU 
reset rw_lock. I rely on the SRCU unplug flag for unplug. Also, not 
clear to me why we are focusing on the scheduler threads; any code 
path to generate HW fences should be covered, so any code leading 
to amdgpu_fence_emit needs to be taken into account, such as direct 
IB submissions, VM flushes, etc.


You need to work together with the reset lock anyway, cause a hotplug 
could run at the same time as a reset.



For going my way, indeed I now see that I have to take the reset write 
side lock during HW fences signalling in order to protect against 
scheduler/HW fences detachment and reattachment during schedulers 
stop/restart. But if we go with your approach  then calling 
drm_dev_unplug and scoping amdgpu_job_timeout with drm_dev_enter/exit 
should be enough to prevent any concurrent GPU resets during unplug. 
In fact I already do it anyway - 
https://cgit.freedesktop.org/~agrodzov/linux/commit/?h=drm-misc-next=ef0ea4dd29ef44d2649c5eda16c8f4869acc36b1
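
For illustration, a hedged sketch of that scoping (the body is a 
placeholder, not the real amdgpu_job_timedout()):

static void job_timeout_sketch(struct drm_device *ddev)
{
        int idx;

        if (!drm_dev_enter(ddev, &idx))
                return;         /* device already unplugged, skip the reset */

        /* ... detect the hang and call amdgpu_device_gpu_recover() ... */

        drm_dev_exit(idx);
}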


Yes, good point as well.

Christian.



Andrey





Christian.



Andrey






Christian.



Andrey





Andrey














Re: [PATCH 0/4] Refine GPU recovery sequence to enhance its stability

2021-04-09 Thread Andrey Grodzovsky


On 2021-04-09 12:39 p.m., Christian König wrote:

Am 09.04.21 um 17:42 schrieb Andrey Grodzovsky:


On 2021-04-09 3:01 a.m., Christian König wrote:

Am 09.04.21 um 08:53 schrieb Christian König:

Am 08.04.21 um 22:39 schrieb Andrey Grodzovsky:

[SNIP]
But inserting drm_dev_enter/exit at the highest level in drm_ioctl 
is much less effort and leaves less room for error than going through 
each IOCTL and trying to identify at what point (possibly multiple 
points) they are about to access HW; some of this is hidden deep 
in HAL layers such as the DC layer in the display driver or the 
multiple layers of the powerplay/SMU libraries. Also, we can't limit 
ourselves to the back-end if by this you mean ASIC specific functions 
which access registers. We also need to take care of any MMIO 
kernel BO (VRAM BOs) where we may access MMIO space directly by 
pointer from the front end of the driver (HW agnostic) and the TTM/DRM 
layers.


Exactly, yes. The key point is we need to identify such places 
anyway for GPU reset to work properly. So we could just piggyback 
hotplug on top of that work and be done.



I see most of this was done by Denis in this patch 
https://cgit.freedesktop.org/~agrodzov/linux/commit/?h=drm-misc-next=df9c8d1aa278c435c30a69b8f2418b4a52fcb929, 
indeed this doesn't cover the direct by-pointer accesses of MMIO and 
will introduce many more of those; as people write new code, new 
places to cover will pop up, leading to regressions and extra work to 
fix. It would be really much better if we could blanket cover it at 
the very top, such as the root of all IOCTLs or, for any queued 
work/timer, at the very top function, to handle it once and for all.


And exactly that's what is not possible. At least for the reset case 
you need to look into each hardware access and handle that bit by bit 
and I think that for the hotplug case we should go down that route as 
well.






Our problem here is how to signal all the existing  fences on one 
hand and on the other prevent any new dma_fence waits after we 
finished signaling existing fences. Once we solved this then there 
is no problem using drm_dev_unplug in conjunction with 
drm_dev_enter/exit at the highest level of drm_ioctl to flush any 
IOCTLs in flight and block any new ones.


IMHO when we speak about signalling all fences we don't mean ALL 
the currently existing dma_fence structs (they are spread all over 
the place) but rather signal all the HW fences because HW is 
what's gone and we can't expect for those fences to be ever 
signaled. All the rest such as: scheduler fences, user fences, 
drm_gem reservation objects e.t.c. are either dependent on those 
HW fences and hence signaling the HW fences will in turn signal 
them or, are not impacted by the HW being gone and hence can still 
be waited on and will complete. If this assumption is correct then 
I think that we should use some flag to prevent any new submission 
to HW which creates HW fences (somewhere around 
amdgpu_fence_emit), then traverse all existing HW fences 
(currently they are spread in a few places so maybe we need to 
track them in a list) and signal them. After that it's safe to call 
drm_dev_unplug and be sure synchronize_srcu won't stall because 
of dma_fence_wait. After that we can proceed to canceling work 
items, stopping schedulers, etc.


That is problematic as well since you need to make sure that the 
scheduler is not creating a new hardware fence in the moment you 
try to signal all of them. It would require another SRCU or lock 
for this.



If we use a list and a flag called 'emit_allowed' under a lock such 
that in amdgpu_fence_emit we lock the list, check the flag and if 
true add the new HW fence to the list and proceed to HW emission as 
normal, otherwise return with -ENODEV. In amdgpu_pci_remove we take 
the lock, set the flag to false, and then iterate the list and force 
signal it. Will this not prevent any new HW fence creation from now 
on from any place trying to do so ?


Way too much overhead. The fence processing is intentionally lock free 
to avoid cache line bouncing because the IRQ can move from CPU to CPU.


We need something which at least doesn't affect the processing of 
fences in the interrupt handler at all.



As far as I see in the code, amdgpu_fence_emit is only called from task 
context. Also, we can skip this list I proposed and just use 
amdgpu_fence_driver_force_completion for each ring to signal all created 
HW fences.







Alternatively grabbing the reset write side and stopping and then 
restarting the scheduler could work as well.


Christian.



I didn't get the above and I don't see why I need to reuse the GPU 
reset rw_lock. I rely on the SRCU unplug flag for unplug. Also, not 
clear to me why we are focusing on the scheduler threads; any code 
path to generate HW fences should be covered, so any code leading to 
amdgpu_fence_emit needs to be taken into account, such as direct IB 
submissions, VM flushes, etc.


You need to work together with the 

Re: [PATCH 0/4] Refine GPU recovery sequence to enhance its stability

2021-04-09 Thread Christian König

Am 09.04.21 um 17:42 schrieb Andrey Grodzovsky:


On 2021-04-09 3:01 a.m., Christian König wrote:

Am 09.04.21 um 08:53 schrieb Christian König:

Am 08.04.21 um 22:39 schrieb Andrey Grodzovsky:

[SNIP]
But inserting drm_dev_enter/exit at the highest level in drm_ioctl 
is much less effort and leaves less room for error than going through 
each IOCTL and trying to identify at what point (possibly multiple 
points) they are about to access HW; some of this is hidden deep in 
HAL layers such as the DC layer in the display driver or the multiple 
layers of the powerplay/SMU libraries. Also, we can't limit ourselves 
to the back-end if by this you mean ASIC specific functions which 
access registers. We also need to take care of any MMIO kernel BO 
(VRAM BOs) where we may access MMIO space directly by pointer from the 
front end of the driver (HW agnostic) and the TTM/DRM layers.


Exactly, yes. The key point is we need to identify such places 
anyway for GPU reset to work properly. So we could just piggyback 
hotplug on top of that work and be done.



I see most of this was done by Denis in this patch 
https://cgit.freedesktop.org/~agrodzov/linux/commit/?h=drm-misc-next=df9c8d1aa278c435c30a69b8f2418b4a52fcb929, 
indeed this doesn't cover the direct by-pointer accesses of MMIO and 
will introduce many more of those; as people write new code, new 
places to cover will pop up, leading to regressions and extra work to 
fix. It would be really much better if we could blanket cover it at 
the very top, such as the root of all IOCTLs or, for any queued 
work/timer, at the very top function, to handle it once and for all.


And exactly that's what is not possible. At least for the reset case you 
need to look into each hardware access and handle that bit by bit and I 
think that for the hotplug case we should go down that route as well.






Our problem here is how to signal all the existing  fences on one 
hand and on the other prevent any new dma_fence waits after we 
finished signaling existing fences. Once we solved this then there 
is no problem using drm_dev_unplug in conjunction with 
drm_dev_enter/exit at the highest level of drm_ioctl to flush any 
IOCTLs in flight and block any new ones.


IMHO when we speak about signalling all fences we don't mean ALL 
the currently existing dma_fence structs (they are spread all over 
the place) but rather signal all the HW fences because HW is what's 
gone and we can't expect for those fences to be ever signaled. All 
the rest such as: scheduler fences, user fences, drm_gem 
reservation objects e.t.c. are either dependent on those HW fences 
and hence signaling the HW fences will in turn signal them or, are 
not impacted by the HW being gone and hence can still be waited on 
and will complete. If this assumption is correct then I think that 
we should use some flag to prevent any new submission to HW which 
creates HW fences (somewhere around amdgpu_fence_emit), then 
traverse all existing HW fences (currently they are spread in a few 
places so maybe we need to track them in a list) and signal them. 
After that it's safe to call drm_dev_unplug and be sure 
synchronize_srcu won't stall because of dma_fence_wait. After 
that we can proceed to canceling work items, stopping schedulers, 
etc.


That is problematic as well since you need to make sure that the 
scheduler is not creating a new hardware fence in the moment you try 
to signal all of them. It would require another SRCU or lock for this.



If we use a list and a flag called 'emit_allowed' under a lock such 
that in amdgpu_fence_emit we lock the list, check the flag and if true 
add the new HW fence to the list and proceed to HW emission as normal, 
otherwise return with -ENODEV. In amdgpu_pci_remove we take the lock, 
set the flag to false, and then iterate the list and force signal it. 
Will this not prevent any new HW fence creation from now on from any 
place trying to do so ?


Way too much overhead. The fence processing is intentionally lock free to 
avoid cache line bouncing because the IRQ can move from CPU to CPU.


We need something which at least doesn't affect the processing of 
fences in the interrupt handler at all.




Alternatively grabbing the reset write side and stopping and then 
restarting the scheduler could work as well.


Christian.



I didn't get the above and I don't see why I need to reuse the GPU 
reset rw_lock. I rely on the SRCU unplug flag for unplug. Also, not 
clear to me why we are focusing on the scheduler threads; any code 
path to generate HW fences should be covered, so any code leading to 
amdgpu_fence_emit needs to be taken into account, such as direct IB 
submissions, VM flushes, etc.


You need to work together with the reset lock anyway, cause a hotplug 
could run at the same time as a reset.



Christian.



Andrey






Christian.



Andrey





Andrey











Re: [PATCH 0/4] Refine GPU recovery sequence to enhance its stability

2021-04-09 Thread Andrey Grodzovsky


On 2021-04-09 3:01 a.m., Christian König wrote:

Am 09.04.21 um 08:53 schrieb Christian König:

Am 08.04.21 um 22:39 schrieb Andrey Grodzovsky:


On 2021-04-08 2:58 p.m., Christian König wrote:

Am 08.04.21 um 18:08 schrieb Andrey Grodzovsky:

On 2021-04-08 4:32 a.m., Christian König wrote:

Am 08.04.21 um 10:22 schrieb Christian König:

[SNIP]



Beyond blocking all delayed works and scheduler threads we 
also need to guarantee no  IOCTL can access MMIO post device 
unplug OR in flight IOCTLs are done before we finish 
pci_remove (amdgpu_pci_remove for us).
For this I suggest we do something like what we worked on 
with Takashi Iwai the ALSA maintainer recently when he helped 
implementing PCI BARs move support for snd_hda_intel. Take a 
look at
https://nam11.safelinks.protection.outlook.com/?url=https:%2F%2Fcgit.freedesktop.org%2F~agrodzov%2Flinux%2Fcommit%2F%3Fh%3Dyadro%2Fpcie_hotplug%2Fmovable_bars_v9.1%26id%3Dcbaa324799718e2b828a8c7b5b001dd896748497data=04%7C01%7Candrey.grodzovsky%40amd.com%7C1c5e440d332f46b7f86208d8fb25422c%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C637535484734581904%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000sdata=n%2FG3bLYUKdl9mitR9f1a8qLpkToLdKM3Iz4y23GFg60%3Dreserved=0 
and
https://nam11.safelinks.protection.outlook.com/?url=https:%2F%2Fcgit.freedesktop.org%2F~agrodzov%2Flinux%2Fcommit%2F%3Fh%3Dyadro%2Fpcie_hotplug%2Fmovable_bars_v9.1%26id%3De36365d9ab5bbc30bdc221ab4b3437de34492440data=04%7C01%7Candrey.grodzovsky%40amd.com%7C1c5e440d332f46b7f86208d8fb25422c%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C637535484734581904%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000sdata=xI88SgbdAK%2FUmCC3JOvAknFTdbDbfu4AIPL%2Bf8ol4ZI%3Dreserved=0 

We also had same issue there, how to prevent MMIO accesses 
while the BARs are migrating. What was done there is a 
refcount was added to count all IOCTLs in flight, for any in 
flight  IOCTL the BAR migration handler would
block for the refcount to drop to 0 before it would proceed, 
for any later IOCTL it stops and wait if device is in 
migration state. We even don't need the wait part, nothing to 
wait for, we just return with -ENODEV for this case.




This is essentially what the DRM SRCU is doing as well.

For the hotplug case we could do this in the toplevel since we 
can signal the fence and don't need to block memory management.



To make SRCU 'wait for' all IOCTLs in flight we would need to 
wrap every IOCTL ( practically - just drm_ioctl function) with 
drm_dev_enter/drm_dev_exit - can we do it ?




Sorry totally missed this question.

Yes, exactly that. As discussed for the hotplug case we can do this.
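
A hedged sketch of what that wrapping could look like at the dispatch 
level (illustrative only, not a proposal for the actual drm_ioctl() body):

static long ioctl_wrapped_sketch(struct file *filp, unsigned int cmd,
                                 unsigned long arg)
{
        struct drm_file *file_priv = filp->private_data;
        struct drm_device *dev = file_priv->minor->dev;
        long ret;
        int idx;

        if (!drm_dev_enter(dev, &idx))
                return -ENODEV; /* unplugged: reject new ioctls */

        ret = drm_ioctl(filp, cmd, arg);

        drm_dev_exit(idx);      /* synchronize_srcu() waits for this scope */
        return ret;
}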



Thinking more about it - assuming we are treating synchronize_srcu 
as a 'wait for completion' of any in-flight {drm_dev_enter, 
drm_dev_exit} scope, some of those scopes might do dma_fence_wait 
inside. Since we haven't force signaled the fences yet we will end 
up in a deadlock. We have to signal all the various fences before 
doing the 'wait for'. But we can't signal the fences before setting 
'dev->unplugged = true' to reject further CS and other stuff which 
might create more fences we were supposed to force signal and now 
missed them. Effectively, setting 'dev->unplugged = true' and doing 
synchronize_srcu in one call, like drm_dev_unplug does, without 
signalling all the fences in the device in between these two steps 
looks like a possible deadlock to me - what do you think?




Indeed, that is a really good argument to handle it the same way as 
the reset lock.


E.g. not taking it at the high level IOCTL, but rather when the 
frontend of the driver has acquired all the necessary locks (BO 
resv, VM lock etc...) before calling into the backend to actually 
do things with the hardware.


Christian.


From what you said I understand that you want to solve this problem 
by using drm_dev_enter/exit brackets low enough in the code such 
that it will not include and fence wait.


But inserting drm_dev_enter/exit at the highest level in drm_ioctl 
is much less effort and leaves less room for error than going through 
each IOCTL and trying to identify at what point (possibly multiple 
points) they are about to access HW; some of this is hidden deep in 
HAL layers such as the DC layer in the display driver or the multiple 
layers of the powerplay/SMU libraries. Also, we can't limit ourselves 
to the back-end if by this you mean ASIC specific functions which 
access registers. We also need to take care of any MMIO kernel BO 
(VRAM BOs) where we may access MMIO space directly by pointer from the 
front end of the driver (HW agnostic) and the TTM/DRM layers.


Exactly, yes. The key point is we need to identify such places anyway 
for GPU reset to work properly. So we could just piggyback hotplug 
on top of that work and be done.



I see most of this was done by Denis in this patch 

Re: [PATCH 0/4] Refine GPU recovery sequence to enhance its stability

2021-04-09 Thread Christian König

Am 09.04.21 um 08:53 schrieb Christian König:

Am 08.04.21 um 22:39 schrieb Andrey Grodzovsky:


On 2021-04-08 2:58 p.m., Christian König wrote:

Am 08.04.21 um 18:08 schrieb Andrey Grodzovsky:

On 2021-04-08 4:32 a.m., Christian König wrote:

Am 08.04.21 um 10:22 schrieb Christian König:

[SNIP]



Beyond blocking all delayed works and scheduler threads we 
also need to guarantee no  IOCTL can access MMIO post device 
unplug OR in flight IOCTLs are done before we finish 
pci_remove (amdgpu_pci_remove for us).
For this I suggest we do something like what we worked on with 
Takashi Iwai the ALSA maintainer recently when he helped 
implementing PCI BARs move support for snd_hda_intel. Take a 
look at
https://cgit.freedesktop.org/~agrodzov/linux/commit/?h=yadro/pcie_hotplug/movable_bars_v9.1=cbaa324799718e2b828a8c7b5b001dd896748497 
and
https://cgit.freedesktop.org/~agrodzov/linux/commit/?h=yadro/pcie_hotplug/movable_bars_v9.1=e36365d9ab5bbc30bdc221ab4b3437de34492440 

We also had same issue there, how to prevent MMIO accesses 
while the BARs are migrating. What was done there is a 
refcount was added to count all IOCTLs in flight, for any in 
flight  IOCTL the BAR migration handler would
block for the refcount to drop to 0 before it would proceed, 
for any later IOCTL it stops and wait if device is in 
migration state. We even don't need the wait part, nothing to 
wait for, we just return with -ENODEV for this case.




This is essentially what the DRM SRCU is doing as well.

For the hotplug case we could do this in the toplevel since we 
can signal the fence and don't need to block memory management.



To make SRCU 'wait for' all IOCTLs in flight we would need to 
wrap every IOCTL ( practically - just drm_ioctl function) with 
drm_dev_enter/drm_dev_exit - can we do it ?




Sorry totally missed this question.

Yes, exactly that. As discussed for the hotplug case we can do this.



Thinking more about it - assuming we are treating synchronize_srcu 
as a 'wait for completion' of any in-flight {drm_dev_enter, 
drm_dev_exit} scope, some of those scopes might do dma_fence_wait 
inside. Since we haven't force signaled the fences yet we will end 
up in a deadlock. We have to signal all the various fences before 
doing the 'wait for'. But we can't signal the fences before setting 
'dev->unplugged = true' to reject further CS and other stuff which 
might create more fences we were supposed to force signal and now 
missed them. Effectively, setting 'dev->unplugged = true' and doing 
synchronize_srcu in one call, like drm_dev_unplug does, without 
signalling all the fences in the device in between these two steps 
looks like a possible deadlock to me - what do you think?




Indeed, that is a really good argument to handle it the same way as 
the reset lock.


E.g. not taking it at the high level IOCTL, but rather when the 
frontend of the driver has acquired all the necessary locks (BO 
resv, VM lock etc...) before calling into the backend to actually do 
things with the hardware.


Christian.


From what you said I understand that you want to solve this problem 
by using drm_dev_enter/exit brackets low enough in the code such that 
it will not include and fence wait.


But inserting drm_dev_enter/exit at the highest level in drm_ioctl is 
much less effort and leaves less room for error than going through 
each IOCTL and trying to identify at what point (possibly multiple 
points) they are about to access HW; some of this is hidden deep in 
HAL layers such as the DC layer in the display driver or the multiple 
layers of the powerplay/SMU libraries. Also, we can't limit ourselves 
to the back-end if by this you mean ASIC specific functions which 
access registers. We also need to take care of any MMIO kernel BO 
(VRAM BOs) where we may access MMIO space directly by pointer from the 
front end of the driver (HW agnostic) and the TTM/DRM layers.


Exactly, yes. The key point is we need to identify such places anyway 
for GPU reset to work properly. So we could just piggyback hotplug on 
top of that work and be done.




Our problem here is how to signal all the existing  fences on one 
hand and on the other prevent any new dma_fence waits after we 
finished signaling existing fences. Once we solved this then there is 
no problem using drm_dev_unplug in conjunction with 
drm_dev_enter/exit at the highest level of drm_ioctl to flush any 
IOCTLs in flight and block any new ones.


IMHO when we speak about signalling all fences we don't mean ALL the 
currently existing dma_fence structs (they are spread all over the 
place) but rather signal all the HW fences because HW is what's gone 
and we can't expect for those fences to be ever signaled. All the 
rest such as: scheduler fences,  user fences, drm_gem reservation 
objects e.t.c. are either dependent on those HW fences and hence 
signaling the HW fences will in turn signal them or, are not impacted 
by the HW being gone and hence can still be waited on and will 
complete. If this assumption is 

Re: [PATCH 0/4] Refine GPU recovery sequence to enhance its stability

2021-04-09 Thread Christian König

Am 08.04.21 um 22:39 schrieb Andrey Grodzovsky:


On 2021-04-08 2:58 p.m., Christian König wrote:

Am 08.04.21 um 18:08 schrieb Andrey Grodzovsky:

On 2021-04-08 4:32 a.m., Christian König wrote:

Am 08.04.21 um 10:22 schrieb Christian König:

[SNIP]



Beyond blocking all delayed works and scheduler threads we also 
need to guarantee no  IOCTL can access MMIO post device unplug 
OR in flight IOCTLs are done before we finish pci_remove 
(amdgpu_pci_remove for us).
For this I suggest we do something like what we worked on with 
Takashi Iwai the ALSA maintainer recently when he helped 
implementing PCI BARs move support for snd_hda_intel. Take a 
look at
https://cgit.freedesktop.org/~agrodzov/linux/commit/?h=yadro/pcie_hotplug/movable_bars_v9.1=cbaa324799718e2b828a8c7b5b001dd896748497 
and
https://cgit.freedesktop.org/~agrodzov/linux/commit/?h=yadro/pcie_hotplug/movable_bars_v9.1=e36365d9ab5bbc30bdc221ab4b3437de34492440 

We also had same issue there, how to prevent MMIO accesses 
while the BARs are migrating. What was done there is a refcount 
was added to count all IOCTLs in flight, for any in flight  
IOCTL the BAR migration handler would
block for the refcount to drop to 0 before it would proceed, 
for any later IOCTL it stops and wait if device is in migration 
state. We even don't need the wait part, nothing to wait for, 
we just return with -ENODEV for this case.




This is essentially what the DRM SRCU is doing as well.

For the hotplug case we could do this in the toplevel since we 
can signal the fence and don't need to block memory management.



To make SRCU 'wait for' all IOCTLs in flight we would need to 
wrap every IOCTL ( practically - just drm_ioctl function) with 
drm_dev_enter/drm_dev_exit - can we do it ?




Sorry totally missed this question.

Yes, exactly that. As discussed for the hotplug case we can do this.



Thinking more about it - assuming we are treating synchronize_srcu 
as a 'wait for completion' of any in-flight {drm_dev_enter, 
drm_dev_exit} scope, some of those scopes might do dma_fence_wait 
inside. Since we haven't force signaled the fences yet we will end 
up in a deadlock. We have to signal all the various fences before 
doing the 'wait for'. But we can't signal the fences before setting 
'dev->unplugged = true' to reject further CS and other stuff which 
might create more fences we were supposed to force signal and now 
missed them. Effectively, setting 'dev->unplugged = true' and doing 
synchronize_srcu in one call, like drm_dev_unplug does, without 
signalling all the fences in the device in between these two steps 
looks like a possible deadlock to me - what do you think?




Indeed, that is a really good argument to handle it the same way as 
the reset lock.


E.g. not taking it at the high level IOCTL, but rather when the 
frontend of the driver has acquired all the necessary locks (BO resv, 
VM lock etc...) before calling into the backend to actually do things 
with the hardware.


Christian.


From what you said I understand that you want to solve this problem by 
using drm_dev_enter/exit brackets low enough in the code such that it 
will not include and fence wait.


But inserting drm_dev_enter/exit at the highest level in drm_ioctl is 
much less effort and leaves less room for error than going through 
each IOCTL and trying to identify at what point (possibly multiple 
points) they are about to access HW; some of this is hidden deep in 
HAL layers such as the DC layer in the display driver or the multiple 
layers of the powerplay/SMU libraries. Also, we can't limit ourselves 
to the back-end if by this you mean ASIC specific functions which 
access registers. We also need to take care of any MMIO kernel BO 
(VRAM BOs) where we may access MMIO space directly by pointer from the 
front end of the driver (HW agnostic) and the TTM/DRM layers.


Exactly, yes. The key point is we need to identify such places anyway 
for GPU reset to work properly. So we could just piggyback hotplug on 
top of that work and be done.




Our problem here is how to signal all the existing  fences on one hand 
and on the other prevent any new dma_fence waits after we finished 
signaling existing fences. Once we solved this then there is no 
problem using drm_dev_unplug in conjunction with drm_dev_enter/exit at 
the highest level of drm_ioctl to flush any IOCTLs in flight and block 
any new ones.


IMHO when we speak about signalling all fences we don't mean ALL the 
currently existing dma_fence structs (they are spread all over the 
place) but rather signal all the HW fences because HW is what's gone 
and we can't expect for those fences to be ever signaled. All the rest 
such as: scheduler fences,  user fences, drm_gem reservation objects 
e.t.c. are either dependent on those HW fences and hence signaling the 
HW fences will in turn signal them or, are not impacted by the HW 
being gone and hence can still be waited on and will complete. If this 
assumption is correct then I think that we should use some flag 

Re: [PATCH 0/4] Refine GPU recovery sequence to enhance its stability

2021-04-08 Thread Andrey Grodzovsky


On 2021-04-08 2:58 p.m., Christian König wrote:

Am 08.04.21 um 18:08 schrieb Andrey Grodzovsky:

On 2021-04-08 4:32 a.m., Christian König wrote:

Am 08.04.21 um 10:22 schrieb Christian König:

[SNIP]



Beyond blocking all delayed works and scheduler threads we also 
need to guarantee no  IOCTL can access MMIO post device unplug 
OR in flight IOCTLs are done before we finish pci_remove 
(amdgpu_pci_remove for us).
For this I suggest we do something like what we worked on with 
Takashi Iwai the ALSA maintainer recently when he helped 
implementing PCI BARs move support for snd_hda_intel. Take a 
look at
https://cgit.freedesktop.org/~agrodzov/linux/commit/?h=yadro/pcie_hotplug/movable_bars_v9.1=cbaa324799718e2b828a8c7b5b001dd896748497 
and
https://cgit.freedesktop.org/~agrodzov/linux/commit/?h=yadro/pcie_hotplug/movable_bars_v9.1=e36365d9ab5bbc30bdc221ab4b3437de34492440 

We also had same issue there, how to prevent MMIO accesses while 
the BARs are migrating. What was done there is a refcount was 
added to count all IOCTLs in flight, for any in flight  IOCTL 
the BAR migration handler would
block for the refcount to drop to 0 before it would proceed, for 
any later IOCTL it stops and wait if device is in migration 
state. We even don't need the wait part, nothing to wait for, we 
just return with -ENODEV for this case.




This is essentially what the DRM SRCU is doing as well.

For the hotplug case we could do this in the toplevel since we 
can signal the fence and don't need to block memory management.



To make SRCU 'wait for' all IOCTLs in flight we would need to wrap 
every IOCTL ( practically - just drm_ioctl function) with 
drm_dev_enter/drm_dev_exit - can we do it ?




Sorry totally missed this question.

Yes, exactly that. As discussed for the hotplug case we can do this.



Thinking more about it - assuming we are treating synchronize_srcu 
as a 'wait for completion' of any in-flight {drm_dev_enter, 
drm_dev_exit} scope, some of those scopes might do dma_fence_wait 
inside. Since we haven't force signaled the fences yet we will end 
up in a deadlock. We have to signal all the various fences before 
doing the 'wait for'. But we can't signal the fences before setting 
'dev->unplugged = true' to reject further CS and other stuff which 
might create more fences we were supposed to force signal and now 
missed them. Effectively, setting 'dev->unplugged = true' and doing 
synchronize_srcu in one call, like drm_dev_unplug does, without 
signalling all the fences in the device in between these two steps 
looks like a possible deadlock to me - what do you think?




Indeed, that is a really good argument to handle it the same way as 
the reset lock.


E.g. not taking it at the high level IOCTL, but rather when the 
frontend of the driver has acquired all the necessary locks (BO resv, 
VM lock etc...) before calling into the backend to actually do things 
with the hardware.


Christian.


From what you said I understand that you want to solve this problem by 
using drm_dev_enter/exit brackets low enough in the code such that it 
will not include and fence wait.


But inserting drm_dev_enter/exit at the highest level in drm_ioctl is 
much less effort and leaves less room for error than going through 
each IOCTL and trying to identify at what point (possibly multiple 
points) they are about to access HW; some of this is hidden deep in 
HAL layers such as the DC layer in the display driver or the multiple 
layers of the powerplay/SMU libraries. Also, we can't limit ourselves 
to the back-end if by this you mean ASIC specific functions which 
access registers. We also need to take care of any MMIO kernel BO 
(VRAM BOs) where we may access MMIO space directly by pointer from the 
front end of the driver (HW agnostic) and the TTM/DRM layers.


Our problem here is how to signal all the existing  fences on one hand 
and on the other prevent any new dma_fence waits after we finished 
signaling existing fences. Once we solved this then there is no problem 
using drm_dev_unplug in conjunction with drm_dev_enter/exit at the 
highest level of drm_ioctl to flush any IOCTLs in flight and block any 
new ones.


IMHO when we speak about signalling all fences we don't mean ALL the 
currently existing dma_fence structs (they are spread all over the 
place) but rather signal all the HW fences because HW is what's gone and 
we can't expect for those fences to be ever signaled. All the rest such 
as: scheduler fences,  user fences, drm_gem reservation objects e.t.c. 
are either dependent on those HW fences and hence signaling the HW 
fences will in turn signal them or, are not impacted by the HW being 
gone and hence can still be waited on and will complete. If this 
assumption is correct then I think that we should use some flag to 
prevent any new submission to HW which creates HW fences (somewhere 
around amdgpu_fence_emit), then traverse all existing HW fences 
(currently they are spread in a few places so maybe we need to track 
them in a list) and 

Re: [PATCH 0/4] Refine GPU recovery sequence to enhance its stability

2021-04-08 Thread Christian König

Am 08.04.21 um 18:08 schrieb Andrey Grodzovsky:

On 2021-04-08 4:32 a.m., Christian König wrote:

Am 08.04.21 um 10:22 schrieb Christian König:

[SNIP]



Beyond blocking all delayed works and scheduler threads we also 
need to guarantee no  IOCTL can access MMIO post device unplug OR 
in flight IOCTLs are done before we finish pci_remove 
(amdgpu_pci_remove for us).
For this I suggest we do something like what we worked on with 
Takashi Iwai the ALSA maintainer recently when he helped 
implementing PCI BARs move support for snd_hda_intel. Take a look at
https://cgit.freedesktop.org/~agrodzov/linux/commit/?h=yadro/pcie_hotplug/movable_bars_v9.1=cbaa324799718e2b828a8c7b5b001dd896748497 
and

https://cgit.freedesktop.org/~agrodzov/linux/commit/?h=yadro/pcie_hotplug/movable_bars_v9.1=e36365d9ab5bbc30bdc221ab4b3437de34492440
We also had same issue there, how to prevent MMIO accesses while 
the BARs are migrating. What was done there is a refcount was 
added to count all IOCTLs in flight, for any in flight  IOCTL the 
BAR migration handler would
block for the refcount to drop to 0 before it would proceed, for 
any later IOCTL it stops and wait if device is in migration 
state. We even don't need the wait part, nothing to wait for, we 
just return with -ENODEV for this case.
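
A hedged sketch of that refcount scheme, with all names invented for 
illustration:

struct inflight_gate {
        atomic_t          count;
        bool              removing;
        wait_queue_head_t drained;
};

static int gate_enter(struct inflight_gate *g)          /* ioctl entry */
{
        atomic_inc(&g->count);
        smp_mb__after_atomic();
        if (READ_ONCE(g->removing)) {
                if (atomic_dec_and_test(&g->count))
                        wake_up(&g->drained);
                return -ENODEV;
        }
        return 0;
}

static void gate_exit(struct inflight_gate *g)          /* ioctl exit */
{
        if (atomic_dec_and_test(&g->count))
                wake_up(&g->drained);
}

static void gate_drain(struct inflight_gate *g)         /* pci_remove side */
{
        WRITE_ONCE(g->removing, true);
        smp_mb();
        wait_event(g->drained, atomic_read(&g->count) == 0);
}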




This is essentially what the DRM SRCU is doing as well.

For the hotplug case we could do this in the toplevel since we can 
signal the fence and don't need to block memory management.



To make SRCU 'wait for' all IOCTLs in flight we would need to wrap 
every IOCTL ( practically - just drm_ioctl function) with 
drm_dev_enter/drm_dev_exit - can we do it ?




Sorry totally missed this question.

Yes, exactly that. As discussed for the hotplug case we can do this.



Thinking more about it - assuming we are treating synchronize_srcu as 
a 'wait for completion' of any in-flight {drm_dev_enter, drm_dev_exit} 
scope, some of those scopes might do dma_fence_wait inside. Since we 
haven't force signaled the fences yet we will end up in a deadlock. We 
have to signal all the various fences before doing the 'wait for'. But 
we can't signal the fences before setting 'dev->unplugged = true' to 
reject further CS and other stuff which might create more fences we 
were supposed to force signal and now missed them. Effectively, setting 
'dev->unplugged = true' and doing synchronize_srcu in one call, like 
drm_dev_unplug does, without signalling all the fences in the device in 
between these two steps, looks like a possible deadlock to me - what do 
you think?




Indeed, that is a really good argument to handle it the same way as the 
reset lock.


E.g. not taking it at the high level IOCTL, but rather when the frontend 
of the driver has acquired all the necessary locks (BO resv, VM lock 
etc...) before calling into the backend to actually do things with the 
hardware.


Christian.
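
As a hedged illustration of that backend-level bracketing (the helper is 
invented; WREG32 is amdgpu's existing register write macro):

static void guarded_wreg_sketch(struct amdgpu_device *adev, u32 reg, u32 v)
{
        int idx;

        /* frontend already holds BO resv / VM locks at this point */
        if (!drm_dev_enter(adev_to_drm(adev), &idx))
                return;         /* device gone, drop the MMIO access */

        WREG32(reg, v);

        drm_dev_exit(idx);
}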


Andrey






Re: [PATCH 0/4] Refine GPU recovery sequence to enhance its stability

2021-04-08 Thread Andrey Grodzovsky
only 
completes the hardware fences as part of the reset.


So waiting for a fence while holding the reset lock is illegal 
and needs to be avoided.


Lockdep also complains about this when it is used correctly. 
The only reason it doesn't complain here is because you use an 
atomic+wait_event instead of a locking primitive.


Regards,

Christian.



From: Li, Dennis <dennis...@amd.com>
Sent: Thursday, March 18, 2021 09:28
To: Koenig, Christian <christian.koe...@amd.com>; 
amd-gfx@lists.freedesktop.org; Deucher, Alexander 
<alexander.deuc...@amd.com>; Kuehling, Felix <felix.kuehl...@amd.com>; 
Zhang, Hawking <hawking.zh...@amd.com>
Subject: RE: [PATCH 0/4] Refine GPU recovery sequence to 
enhance its stability


>>> Those two steps need to be exchanged or otherwise it is 
possible that new delayed work items etc are started before 
the lock is taken.
What about adding a check for adev->in_gpu_reset in the work item? 
If we exchange the two steps, it may introduce a deadlock. 
For example, if a user thread holds the read lock and waits 
for a fence while the recovery thread tries to take the write 
lock and then complete the fences, the recovery thread will 
always be blocked.
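
As an illustration, a work item could bail out early like this (sketch 
only; 'some_work' is a hypothetical delayed_work field):

static void amdgpu_some_delayed_work_handler(struct work_struct *work)
{
        struct amdgpu_device *adev =
                container_of(work, struct amdgpu_device, some_work.work);

        /* bail out early instead of racing with the recovery thread */
        if (atomic_read(&adev->in_gpu_reset))
                return;

        /* ... touch registers, requeue the work, etc. ... */
}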



Best Regards
Dennis Li
-Original Message-
From: Koenig, Christian <christian.koe...@amd.com>
Sent: Thursday, March 18, 2021 3:54 PM
To: Li, Dennis <dennis...@amd.com>; 
amd-gfx@lists.freedesktop.org; Deucher, Alexander 
<alexander.deuc...@amd.com>; Kuehling, Felix 
<felix.kuehl...@amd.com>; Zhang, Hawking <hawking.zh...@amd.com>
Subject: Re: [PATCH 0/4] Refine GPU recovery sequence to 
enhance its stability


Am 18.03.21 um 08:23 schrieb Dennis Li:
> We have defined two variables in_gpu_reset and reset_sem in 
adev object. The atomic type variable in_gpu_reset is used to 
avoid recovery thread reenter and make lower functions return 
more earlier when recovery start, but couldn't block recovery 
thread when it access hardware. The r/w semaphore reset_sem is 
used to solve these synchronization issues between recovery 
thread and other threads.

>
> The original solution locked registers' access in lower 
functions, which will introduce following issues:

>
> 1) many lower functions are used in both recovery thread and 
others. Firstly we must harvest these functions, it is easy to 
miss someones. Secondly these functions need select which lock 
(read lock or write lock) will be used, according to the 
thread it is running in. If the thread context isn't 
considered, the added lock will easily introduce deadlock. 
Besides that, in most time, developer easily forget to add 
locks for new functions.

>
> 2) performance drop. More lower functions are more 
frequently called.

>
> 3) easily introduce false positive lockdep complaint, 
because write lock has big range in recovery thread, but low 
level functions will hold read lock may be protected by other 
locks in other threads.

>
> Therefore the new solution will try to add lock protection 
for ioctls of kfd. Its goal is that there are no threads 
except for recovery thread or its children (for xgmi) to 
access hardware when doing GPU reset and resume. So refine 
recovery thread as the following:

>
> Step 0: atomic_cmpxchg(&adev->in_gpu_reset, 0, 1)
> 1). if failed, it means system had a recovery thread 
running, current thread exit directly;

> 2). if success, enter recovery thread;
>
> Step 1: cancel all delay works, stop drm schedule, complete 
all unreceived fences and so on. It try to stop or pause other 
threads.

>
> Step 2: call down_write(&adev->reset_sem) to hold write 
lock, which will block recovery thread until other threads 
release read locks.


Those two steps need to be exchanged or otherwise it is 
possible that new delayed work items etc are started before 
the lock is taken.


Just to make it clear until this is fixed the whole patch set 
is a NAK.


Regards,
Christian.

>
> Step 3: normally, there is only recovery threads running to 
access hardware, it is safe to do gpu reset now.

>
> Step 4: do post gpu reset, such as call all ips' resume 
functions;

>
> Step 5: atomic set adev->in_gpu_reset as 0, wake up other 
threads and release write lock. Recovery thread exit normally.

>
> Other threads call the amdgpu_read_lock to synchronize with 
recovery thread. If it finds that in_gpu_reset is 1, it should 
release read lock if it has holden one, and then blocks itself 
to wait for recovery finished event. If thread successfully 
hold read lock and in_gpu_reset is 0, it continues. It will 
exit nor

Re: [PATCH 0/4] Refine GPU recovery sequence to enhance its stability

2021-04-08 Thread Christian König
h...@amd.com>>
*Betreff:* RE: [PATCH 0/4] Refine GPU recovery sequence to 
enhance its stability


>>> Those two steps need to be exchanged or otherwise it is 
possible that new delayed work items etc are started before the 
lock is taken.
What about adding check for adev->in_gpu_reset in work item? If 
exchange the two steps, it maybe introduce the deadlock.  For 
example, the user thread hold the read lock and waiting for the 
fence, if recovery thread try to hold write lock and then 
complete fences, in this case, recovery thread will always be 
blocked.



Best Regards
Dennis Li
-Original Message-
From: Koenig, Christian <mailto:christian.koe...@amd.com>>

Sent: Thursday, March 18, 2021 3:54 PM
To: Li, Dennis mailto:dennis...@amd.com>>; 
amd-gfx@lists.freedesktop.org 
<mailto:amd-gfx@lists.freedesktop.org>; Deucher, Alexander 
mailto:alexander.deuc...@amd.com>>; 
Kuehling, Felix <mailto:felix.kuehl...@amd.com>>; Zhang, Hawking 
mailto:hawking.zh...@amd.com>>
Subject: Re: [PATCH 0/4] Refine GPU recovery sequence to 
enhance its stability


Am 18.03.21 um 08:23 schrieb Dennis Li:
> We have defined two variables in_gpu_reset and reset_sem in 
adev object. The atomic type variable in_gpu_reset is used to 
avoid recovery thread reenter and make lower functions return 
more earlier when recovery start, but couldn't block recovery 
thread when it access hardware. The r/w semaphore reset_sem is 
used to solve these synchronization issues between recovery 
thread and other threads.

>
> The original solution locked registers' access in lower 
functions, which will introduce following issues:

>
> 1) many lower functions are used in both recovery thread and 
others. Firstly we must harvest these functions, it is easy to 
miss someones. Secondly these functions need select which lock 
(read lock or write lock) will be used, according to the thread 
it is running in. If the thread context isn't considered, the 
added lock will easily introduce deadlock. Besides that, in 
most time, developer easily forget to add locks for new functions.

>
> 2) performance drop. More lower functions are more frequently 
called.

>
> 3) easily introduce false positive lockdep complaint, because 
write lock has big range in recovery thread, but low level 
functions will hold read lock may be protected by other locks 
in other threads.

>
> Therefore the new solution will try to add lock protection 
for ioctls of kfd. Its goal is that there are no threads except 
for recovery thread or its children (for xgmi) to access 
hardware when doing GPU reset and resume. So refine recovery 
thread as the following:

>
> Step 0: atomic_cmpxchg(&adev->in_gpu_reset, 0, 1)
> 1). if failed, it means system had a recovery thread 
running, current thread exit directly;

> 2). if success, enter recovery thread;
>
> Step 1: cancel all delay works, stop drm schedule, complete 
all unreceived fences and so on. It try to stop or pause other 
threads.

>
> Step 2: call down_write(&adev->reset_sem) to hold write lock, 
which will block recovery thread until other threads release 
read locks.


Those two steps need to be exchanged or otherwise it is 
possible that new delayed work items etc are started before the 
lock is taken.


Just to make it clear until this is fixed the whole patch set 
is a NAK.


Regards,
Christian.

>
> Step 3: normally, there is only recovery threads running to 
access hardware, it is safe to do gpu reset now.

>
> Step 4: do post gpu reset, such as call all ips' resume 
functions;

>
> Step 5: atomic set adev->in_gpu_reset as 0, wake up other 
threads and release write lock. Recovery thread exit normally.

>
> Other threads call the amdgpu_read_lock to synchronize with 
recovery thread. If it finds that in_gpu_reset is 1, it should 
release read lock if it has holden one, and then blocks itself 
to wait for recovery finished event. If thread successfully 
hold read lock and in_gpu_reset is 0, it continues. It will 
exit normally or be stopped by recovery thread in step 1.

>
> Dennis Li (4):
>    drm/amdgpu: remove reset lock from low level functions
>    drm/amdgpu: refine the GPU recovery sequence
>    drm/amdgpu: instead of using down/up_read directly
>    drm/amdkfd: add reset lock protection for kfd entry functions
>
> drivers/gpu/drm/amd/amdgpu/amdgpu.h |   6 +
> drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c |  14 +-
> drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 173 
+-

> .../gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c |   8 -
> drivers/gpu/drm/amd/amdgpu/gmc_v10_0.c |   4 +-
> drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c |   9 +-
> drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c |   5 +-
> drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c |   5 +-
> drivers/gpu/drm/amd/amdkfd/kfd_chardev.c | 172 -
> drivers/

Re: [PATCH 0/4] Refine GPU recovery sequence to enhance its stability

2021-04-08 Thread Christian König
, maybe we should handle it the same way as reset or 
maybe we should have it at the top level.



If by top level you mean checking for device unplugged and bailing out 
at the entry to IOCTL or right at start of any work_item/timer 
function we have then seems to me it's better and more clear. Once we 
flushed all of them in flight there is no reason for them to execute 
any more when device is unplugged.


Andrey




Regards,
Christian.

The above approach should allow us to wait for all the IOCTLs in 
flight, together with stopping scheduler threads and cancelling and 
flushing all in flight work items and timers i think It should give 
as full solution for the hot-unplug case

of preventing any MMIO accesses post device pci_remove.

Let me know what you think guys.

Andrey




And then testing, testing, testing to see if we have missed something.

Christian.

Am 05.04.21 um 19:58 schrieb Andrey Grodzovsky:


Denis, Christian, are there any updates in the plan on how to move 
on with this ? As you know I need very similar code for my 
up-streaming of device hot-unplug. My latest solution 
(https://lists.freedesktop.org/archives/amd-gfx/2021-January/058606.html) 
was not acceptable because of low level guards on the register 
accessors level which was hurting performance. Basically I need a 
way to prevent any MMIO write accesses from kernel driver after 
device is removed (UMD accesses are taken care of by page faulting 
dummy page). We are using now hot-unplug code for Freemont program 
and so up-streaming became more of a priority then before. This 
MMIO access issue is currently my main blocker from up-streaming. 
Is there any way I can assist in pushing this on ?


Andrey

On 2021-03-18 5:51 a.m., Christian König wrote:

Am 18.03.21 um 10:30 schrieb Li, Dennis:


>>> The GPU reset doesn't complete the fences we wait for. It 
only completes the hardware fences as part of the reset.


>>> So waiting for a fence while holding the reset lock is 
illegal and needs to be avoided.


I understood your concern. It is more complex for DRM GFX, 
therefore I abandon adding lock protection for DRM ioctls now. 
Maybe we can try to add all kernel  dma_fence waiting in a list, 
and signal all in recovery threads. Do you have same concern for 
compute cases?




Yes, compute (KFD) is even harder to handle.

See you can't signal the dma_fence waiting. Waiting for a 
dma_fence also means you wait for the GPU reset to finish.


When we would signal the dma_fence during the GPU reset then we 
would run into memory corruption because the hardware jobs 
running after the GPU reset would access memory which is already 
freed.


>>> Lockdep also complains about this when it is used correctly. 
The only reason it doesn't complain here is because you use an 
atomic+wait_event instead of a locking primitive.


Agree. This approach will escape the monitor of lockdep.  Its 
goal is to block other threads when GPU recovery thread start. 
But I couldn’t find a better method to solve this problem. Do 
you have some suggestion?




Well, completely abandon those change here.

What we need to do is to identify where hardware access happens 
and then insert taking the read side of the GPU reset lock so 
that we don't wait for a dma_fence or allocate memory, but still 
protect the hardware from concurrent access and reset.
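
As an illustration of that pattern (sketch only; SOME_REG_LO/SOME_REG_HI 
are made-up register names):

static void amdgpu_backend_program_hw(struct amdgpu_device *adev, u64 addr)
{
        /* any dma_fence_wait() or memory allocation has to happen before
         * this point, outside of the reset lock */

        down_read(&adev->reset_sem);            /* exclude a concurrent reset */
        WREG32(SOME_REG_LO, lower_32_bits(addr));
        WREG32(SOME_REG_HI, upper_32_bits(addr));
        up_read(&adev->reset_sem);
}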


Regards,
Christian.


Best Regards

Dennis Li

*From:* Koenig, Christian 
*Sent:* Thursday, March 18, 2021 4:59 PM
*To:* Li, Dennis ; 
amd-gfx@lists.freedesktop.org; Deucher, Alexander 
; Kuehling, Felix 
; Zhang, Hawking 
*Subject:* AW: [PATCH 0/4] Refine GPU recovery sequence to 
enhance its stability


Exactly that's what you don't seem to understand.

The GPU reset doesn't complete the fences we wait for. It only 
completes the hardware fences as part of the reset.


So waiting for a fence while holding the reset lock is illegal 
and needs to be avoided.


Lockdep also complains about this when it is used correctly. The 
only reason it doesn't complain here is because you use an 
atomic+wait_event instead of a locking primitive.


Regards,

Christian.



*Von:*Li, Dennis mailto:dennis...@amd.com>>
*Gesendet:* Donnerstag, 18. März 2021 09:28
*An:* Koenig, Christian <mailto:christian.koe...@amd.com>>; 
amd-gfx@lists.freedesktop.org 
<mailto:amd-gfx@lists.freedesktop.org> 
<mailto:amd-gfx@lists.freedesktop.org>>; Deucher, Alexander 
mailto:alexander.deuc...@amd.com>>; 
Kuehling, Felix <mailto:felix.kuehl...@amd.com>>; Zhang, Hawking 
mailto:hawking.zh...@amd.com>>
*Betreff:* RE: [PATCH 0/4] Refine GPU recovery sequence to 
enhance its stability


>>> Those two steps need to be exchanged or otherwise it is 
possible that new delayed work items etc are started before the 
lock is taken.
What about adding check for adev->in_gpu_reset in work item? If 
exchan

Re: [PATCH 0/4] Refine GPU recovery sequence to enhance its stability

2021-04-07 Thread Andrey Grodzovsky
 in flight there is no reason for them to execute any more when 
device is unplugged.


Andrey




Regards,
Christian.

The above approach should allow us to wait for all the IOCTLs in 
flight, together with stopping scheduler threads and cancelling and 
flushing all in flight work items and timers i think It should give 
as full solution for the hot-unplug case

of preventing any MMIO accesses post device pci_remove.

Let me know what you think guys.

Andrey




And then testing, testing, testing to see if we have missed something.

Christian.

Am 05.04.21 um 19:58 schrieb Andrey Grodzovsky:


Denis, Christian, are there any updates in the plan on how to move 
on with this ? As you know I need very similar code for my 
up-streaming of device hot-unplug. My latest solution 
(https://lists.freedesktop.org/archives/amd-gfx/2021-January/058606.html) 
was not acceptable because of low level guards on the register 
accessors level which was hurting performance. Basically I need a 
way to prevent any MMIO write accesses from kernel driver after 
device is removed (UMD accesses are taken care of by page faulting 
dummy page). We are using now hot-unplug code for Freemont program 
and so up-streaming became more of a priority then before. This 
MMIO access issue is currently my main blocker from up-streaming. 
Is there any way I can assist in pushing this on ?


Andrey

On 2021-03-18 5:51 a.m., Christian König wrote:

Am 18.03.21 um 10:30 schrieb Li, Dennis:


>>> The GPU reset doesn't complete the fences we wait for. It 
only completes the hardware fences as part of the reset.


>>> So waiting for a fence while holding the reset lock is 
illegal and needs to be avoided.


I understood your concern. It is more complex for DRM GFX, 
therefore I abandon adding lock protection for DRM ioctls now. 
Maybe we can try to add all kernel  dma_fence waiting in a list, 
and signal all in recovery threads. Do you have same concern for 
compute cases?




Yes, compute (KFD) is even harder to handle.

See you can't signal the dma_fence waiting. Waiting for a 
dma_fence also means you wait for the GPU reset to finish.


When we would signal the dma_fence during the GPU reset then we 
would run into memory corruption because the hardware jobs running 
after the GPU reset would access memory which is already freed.


>>> Lockdep also complains about this when it is used correctly. 
The only reason it doesn't complain here is because you use an 
atomic+wait_event instead of a locking primitive.


Agree. This approach will escape the monitor of lockdep.  Its 
goal is to block other threads when GPU recovery thread start. 
But I couldn’t find a better method to solve this problem. Do you 
have some suggestion?




Well, completely abandon those change here.

What we need to do is to identify where hardware access happens 
and then insert taking the read side of the GPU reset lock so that 
we don't wait for a dma_fence or allocate memory, but still 
protect the hardware from concurrent access and reset.


Regards,
Christian.


Best Regards

Dennis Li

*From:* Koenig, Christian 
*Sent:* Thursday, March 18, 2021 4:59 PM
*To:* Li, Dennis ; 
amd-gfx@lists.freedesktop.org; Deucher, Alexander 
; Kuehling, Felix 
; Zhang, Hawking 
*Subject:* AW: [PATCH 0/4] Refine GPU recovery sequence to 
enhance its stability


Exactly that's what you don't seem to understand.

The GPU reset doesn't complete the fences we wait for. It only 
completes the hardware fences as part of the reset.


So waiting for a fence while holding the reset lock is illegal 
and needs to be avoided.


Lockdep also complains about this when it is used correctly. The 
only reason it doesn't complain here is because you use an 
atomic+wait_event instead of a locking primitive.


Regards,

Christian.



*Von:*Li, Dennis mailto:dennis...@amd.com>>
*Gesendet:* Donnerstag, 18. März 2021 09:28
*An:* Koenig, Christian <mailto:christian.koe...@amd.com>>; amd-gfx@lists.freedesktop.org 
<mailto:amd-gfx@lists.freedesktop.org> 
<mailto:amd-gfx@lists.freedesktop.org>>; Deucher, Alexander 
mailto:alexander.deuc...@amd.com>>; 
Kuehling, Felix <mailto:felix.kuehl...@amd.com>>; Zhang, Hawking 
mailto:hawking.zh...@amd.com>>
*Betreff:* RE: [PATCH 0/4] Refine GPU recovery sequence to 
enhance its stability


>>> Those two steps need to be exchanged or otherwise it is 
possible that new delayed work items etc are started before the 
lock is taken.
What about adding check for adev->in_gpu_reset in work item? If 
exchange the two steps, it maybe introduce the deadlock. For 
example, the user thread hold the read lock and waiting for the 
fence, if recovery thread try to hold write lock and then 
complete fences, in this case, recovery thread will always be 
blocked.



Best Regards
Dennis Li
-Original Message-
From: Koenig, Christian <

Re: [PATCH 0/4] Refine GPU recovery sequence to enhance its stability

2021-04-07 Thread Christian König
 Christian, are there any updates in the plan on how to move 
on with this ? As you know I need very similar code for my 
up-streaming of device hot-unplug. My latest solution 
(https://lists.freedesktop.org/archives/amd-gfx/2021-January/058606.html) 
was not acceptable because of low level guards on the register 
accessors level which was hurting performance. Basically I need a 
way to prevent any MMIO write accesses from kernel driver after 
device is removed (UMD accesses are taken care of by page faulting 
dummy page). We are using now hot-unplug code for Freemont program 
and so up-streaming became more of a priority then before. This MMIO 
access issue is currently my main blocker from up-streaming. Is 
there any way I can assist in pushing this on ?


Andrey

On 2021-03-18 5:51 a.m., Christian König wrote:

Am 18.03.21 um 10:30 schrieb Li, Dennis:


>>> The GPU reset doesn't complete the fences we wait for. It only 
completes the hardware fences as part of the reset.


>>> So waiting for a fence while holding the reset lock is illegal 
and needs to be avoided.


I understood your concern. It is more complex for DRM GFX, 
therefore I abandon adding lock protection for DRM ioctls now. 
Maybe we can try to add all kernel  dma_fence waiting in a list, 
and signal all in recovery threads. Do you have same concern for 
compute cases?




Yes, compute (KFD) is even harder to handle.

See you can't signal the dma_fence waiting. Waiting for a dma_fence 
also means you wait for the GPU reset to finish.


When we would signal the dma_fence during the GPU reset then we 
would run into memory corruption because the hardware jobs running 
after the GPU reset would access memory which is already freed.


>>> Lockdep also complains about this when it is used correctly. 
The only reason it doesn't complain here is because you use an 
atomic+wait_event instead of a locking primitive.


Agree. This approach will escape the monitor of lockdep.  Its goal 
is to block other threads when GPU recovery thread start. But I 
couldn’t find a better method to solve this problem. Do you have 
some suggestion?




Well, completely abandon those change here.

What we need to do is to identify where hardware access happens and 
then insert taking the read side of the GPU reset lock so that we 
don't wait for a dma_fence or allocate memory, but still protect 
the hardware from concurrent access and reset.


Regards,
Christian.


Best Regards

Dennis Li

*From:* Koenig, Christian 
*Sent:* Thursday, March 18, 2021 4:59 PM
*To:* Li, Dennis ; 
amd-gfx@lists.freedesktop.org; Deucher, Alexander 
; Kuehling, Felix 
; Zhang, Hawking 
*Subject:* AW: [PATCH 0/4] Refine GPU recovery sequence to enhance 
its stability


Exactly that's what you don't seem to understand.

The GPU reset doesn't complete the fences we wait for. It only 
completes the hardware fences as part of the reset.


So waiting for a fence while holding the reset lock is illegal and 
needs to be avoided.


Lockdep also complains about this when it is used correctly. The 
only reason it doesn't complain here is because you use an 
atomic+wait_event instead of a locking primitive.


Regards,

Christian.



*Von:*Li, Dennis mailto:dennis...@amd.com>>
*Gesendet:* Donnerstag, 18. März 2021 09:28
*An:* Koenig, Christian <mailto:christian.koe...@amd.com>>; amd-gfx@lists.freedesktop.org 
<mailto:amd-gfx@lists.freedesktop.org> 
<mailto:amd-gfx@lists.freedesktop.org>>; Deucher, Alexander 
mailto:alexander.deuc...@amd.com>>; 
Kuehling, Felix <mailto:felix.kuehl...@amd.com>>; Zhang, Hawking 
mailto:hawking.zh...@amd.com>>
*Betreff:* RE: [PATCH 0/4] Refine GPU recovery sequence to enhance 
its stability


>>> Those two steps need to be exchanged or otherwise it is 
possible that new delayed work items etc are started before the 
lock is taken.
What about adding check for adev->in_gpu_reset in work item? If 
exchange the two steps, it maybe introduce the deadlock.  For 
example, the user thread hold the read lock and waiting for the 
fence, if recovery thread try to hold write lock and then complete 
fences, in this case, recovery thread will always be blocked.



Best Regards
Dennis Li
-Original Message-
From: Koenig, Christian <mailto:christian.koe...@amd.com>>

Sent: Thursday, March 18, 2021 3:54 PM
To: Li, Dennis mailto:dennis...@amd.com>>; 
amd-gfx@lists.freedesktop.org 
<mailto:amd-gfx@lists.freedesktop.org>; Deucher, Alexander 
mailto:alexander.deuc...@amd.com>>; 
Kuehling, Felix <mailto:felix.kuehl...@amd.com>>; Zhang, Hawking 
mailto:hawking.zh...@amd.com>>
Subject: Re: [PATCH 0/4] Refine GPU recovery sequence to enhance 
its stability


Am 18.03.21 um 08:23 schrieb Dennis Li:
> We have defined two variables in_gpu_reset and reset_sem in adev 
object. The atomic type variable in_g

Re: [PATCH 0/4] Refine GPU recovery sequence to enhance its stability

2021-04-06 Thread Andrey Grodzovsky
e wait for. It only 
completes the hardware fences as part of the reset.


>>> So waiting for a fence while holding the reset lock is illegal 
and needs to be avoided.


I understood your concern. It is more complex for DRM GFX, 
therefore I abandon adding lock protection for DRM ioctls now. 
Maybe we can try to add all kernel  dma_fence waiting in a list, 
and signal all in recovery threads. Do you have same concern for 
compute cases?




Yes, compute (KFD) is even harder to handle.

See you can't signal the dma_fence waiting. Waiting for a dma_fence 
also means you wait for the GPU reset to finish.


When we would signal the dma_fence during the GPU reset then we 
would run into memory corruption because the hardware jobs running 
after the GPU reset would access memory which is already freed.


>>> Lockdep also complains about this when it is used correctly. 
The only reason it doesn't complain here is because you use an 
atomic+wait_event instead of a locking primitive.


Agree. This approach will escape the monitor of lockdep.  Its goal 
is to block other threads when GPU recovery thread start. But I 
couldn’t find a better method to solve this problem. Do you have 
some suggestion?




Well, completely abandon those change here.

What we need to do is to identify where hardware access happens and 
then insert taking the read side of the GPU reset lock so that we 
don't wait for a dma_fence or allocate memory, but still protect the 
hardware from concurrent access and reset.


Regards,
Christian.


Best Regards

Dennis Li

*From:* Koenig, Christian 
*Sent:* Thursday, March 18, 2021 4:59 PM
*To:* Li, Dennis ; 
amd-gfx@lists.freedesktop.org; Deucher, Alexander 
; Kuehling, Felix 
; Zhang, Hawking 
*Subject:* AW: [PATCH 0/4] Refine GPU recovery sequence to enhance 
its stability


Exactly that's what you don't seem to understand.

The GPU reset doesn't complete the fences we wait for. It only 
completes the hardware fences as part of the reset.


So waiting for a fence while holding the reset lock is illegal and 
needs to be avoided.


Lockdep also complains about this when it is used correctly. The 
only reason it doesn't complain here is because you use an 
atomic+wait_event instead of a locking primitive.


Regards,

Christian.



*Von:*Li, Dennis mailto:dennis...@amd.com>>
*Gesendet:* Donnerstag, 18. März 2021 09:28
*An:* Koenig, Christian <mailto:christian.koe...@amd.com>>; amd-gfx@lists.freedesktop.org 
<mailto:amd-gfx@lists.freedesktop.org> 
<mailto:amd-gfx@lists.freedesktop.org>>; Deucher, Alexander 
mailto:alexander.deuc...@amd.com>>; 
Kuehling, Felix <mailto:felix.kuehl...@amd.com>>; Zhang, Hawking 
mailto:hawking.zh...@amd.com>>
*Betreff:* RE: [PATCH 0/4] Refine GPU recovery sequence to enhance 
its stability


>>> Those two steps need to be exchanged or otherwise it is 
possible that new delayed work items etc are started before the 
lock is taken.
What about adding check for adev->in_gpu_reset in work item? If 
exchange the two steps, it maybe introduce the deadlock.  For 
example, the user thread hold the read lock and waiting for the 
fence, if recovery thread try to hold write lock and then complete 
fences, in this case, recovery thread will always be blocked.



Best Regards
Dennis Li
-Original Message-
From: Koenig, Christian <mailto:christian.koe...@amd.com>>

Sent: Thursday, March 18, 2021 3:54 PM
To: Li, Dennis mailto:dennis...@amd.com>>; 
amd-gfx@lists.freedesktop.org 
<mailto:amd-gfx@lists.freedesktop.org>; Deucher, Alexander 
mailto:alexander.deuc...@amd.com>>; 
Kuehling, Felix <mailto:felix.kuehl...@amd.com>>; Zhang, Hawking 
mailto:hawking.zh...@amd.com>>
Subject: Re: [PATCH 0/4] Refine GPU recovery sequence to enhance 
its stability


Am 18.03.21 um 08:23 schrieb Dennis Li:
> We have defined two variables in_gpu_reset and reset_sem in adev 
object. The atomic type variable in_gpu_reset is used to avoid 
recovery thread reenter and make lower functions return more 
earlier when recovery start, but couldn't block recovery thread 
when it access hardware. The r/w semaphore reset_sem is used to 
solve these synchronization issues between recovery thread and 
other threads.

>
> The original solution locked registers' access in lower 
functions, which will introduce following issues:

>
> 1) many lower functions are used in both recovery thread and 
others. Firstly we must harvest these functions, it is easy to miss 
someones. Secondly these functions need select which lock (read 
lock or write lock) will be used, according to the thread it is 
running in. If the thread context isn't considered, the added lock 
will easily introduce deadlock. Besides that, in most time, 
developer easily forget to add locks for new functions.

>
> 2) performance drop. More lower 

Re: [PATCH 0/4] Refine GPU recovery sequence to enhance its stability

2021-04-06 Thread Christian König



Am 06.04.21 um 12:34 schrieb Christian König:

Hi Andrey,

well good question. My job is to watch over the implementation and 
design and while I always help I can adjust anybodies schedule.


That should read "I can't adjust anybodies schedule".

Christian.



Is the patch to print a warning when the hardware is accessed without 
holding the locks merged yet? If not then that would probably be a 
good starting point.


Then we would need to unify this with the SRCU to make sure that we 
have both the reset lock as well as block the hotplug code from 
reusing the MMIO space.


And then testing, testing, testing to see if we have missed something.

Christian.

Am 05.04.21 um 19:58 schrieb Andrey Grodzovsky:


Denis, Christian, are there any updates in the plan on how to move on 
with this ? As you know I need very similar code for my up-streaming 
of device hot-unplug. My latest solution 
(https://lists.freedesktop.org/archives/amd-gfx/2021-January/058606.html) 
was not acceptable because of low level guards on the register 
accessors level which was hurting performance. Basically I need a way 
to prevent any MMIO write accesses from kernel driver after device is 
removed (UMD accesses are taken care of by page faulting dummy page). 
We are using now hot-unplug code for Freemont program and so 
up-streaming became more of a priority then before. This MMIO access 
issue is currently my main blocker from up-streaming. Is there any 
way I can assist in pushing this on ?


Andrey

On 2021-03-18 5:51 a.m., Christian König wrote:

Am 18.03.21 um 10:30 schrieb Li, Dennis:


>>> The GPU reset doesn't complete the fences we wait for. It only 
completes the hardware fences as part of the reset.


>>> So waiting for a fence while holding the reset lock is illegal 
and needs to be avoided.


I understood your concern. It is more complex for DRM GFX, 
therefore I abandon adding lock protection for DRM ioctls now. 
Maybe we can try to add all kernel  dma_fence waiting in a list, 
and signal all in recovery threads. Do you have same concern for 
compute cases?




Yes, compute (KFD) is even harder to handle.

See you can't signal the dma_fence waiting. Waiting for a dma_fence 
also means you wait for the GPU reset to finish.


When we would signal the dma_fence during the GPU reset then we 
would run into memory corruption because the hardware jobs running 
after the GPU reset would access memory which is already freed.


>>> Lockdep also complains about this when it is used correctly. 
The only reason it doesn't complain here is because you use an 
atomic+wait_event instead of a locking primitive.


Agree. This approach will escape the monitor of lockdep.  Its goal 
is to block other threads when GPU recovery thread start. But I 
couldn’t find a better method to solve this problem. Do you have 
some suggestion?




Well, completely abandon those change here.

What we need to do is to identify where hardware access happens and 
then insert taking the read side of the GPU reset lock so that we 
don't wait for a dma_fence or allocate memory, but still protect the 
hardware from concurrent access and reset.


Regards,
Christian.


Best Regards

Dennis Li

*From:* Koenig, Christian 
*Sent:* Thursday, March 18, 2021 4:59 PM
*To:* Li, Dennis ; 
amd-gfx@lists.freedesktop.org; Deucher, Alexander 
; Kuehling, Felix 
; Zhang, Hawking 
*Subject:* AW: [PATCH 0/4] Refine GPU recovery sequence to enhance 
its stability


Exactly that's what you don't seem to understand.

The GPU reset doesn't complete the fences we wait for. It only 
completes the hardware fences as part of the reset.


So waiting for a fence while holding the reset lock is illegal and 
needs to be avoided.


Lockdep also complains about this when it is used correctly. The 
only reason it doesn't complain here is because you use an 
atomic+wait_event instead of a locking primitive.


Regards,

Christian.



*Von:*Li, Dennis mailto:dennis...@amd.com>>
*Gesendet:* Donnerstag, 18. März 2021 09:28
*An:* Koenig, Christian <mailto:christian.koe...@amd.com>>; amd-gfx@lists.freedesktop.org 
<mailto:amd-gfx@lists.freedesktop.org> 
<mailto:amd-gfx@lists.freedesktop.org>>; Deucher, Alexander 
mailto:alexander.deuc...@amd.com>>; 
Kuehling, Felix <mailto:felix.kuehl...@amd.com>>; Zhang, Hawking 
mailto:hawking.zh...@amd.com>>
*Betreff:* RE: [PATCH 0/4] Refine GPU recovery sequence to enhance 
its stability


>>> Those two steps need to be exchanged or otherwise it is 
possible that new delayed work items etc are started before the 
lock is taken.
What about adding check for adev->in_gpu_reset in work item? If 
exchange the two steps, it maybe introduce the deadlock.  For 
example, the user thread hold the read lock and waiting for the 
fence, if recovery thread try to hold write lock and then complete 
f

Re: [PATCH 0/4] Refine GPU recovery sequence to enhance its stability

2021-04-06 Thread Christian König

Hi Andrey,

well good question. My job is to watch over the implementation and 
design and while I always help I can adjust anybodies schedule.


Is the patch to print a warning when the hardware is accessed without 
holding the locks merged yet? If not then that would probably be a good 
starting point.


Then we would need to unify this with the SRCU to make sure that we have 
both the reset lock as well as block the hotplug code from reusing the 
MMIO space.


And then testing, testing, testing to see if we have missed something.

Christian.

Am 05.04.21 um 19:58 schrieb Andrey Grodzovsky:


Denis, Christian, are there any updates in the plan on how to move on 
with this ? As you know I need very similar code for my up-streaming 
of device hot-unplug. My latest solution 
(https://lists.freedesktop.org/archives/amd-gfx/2021-January/058606.html) 
was not acceptable because of low level guards on the register 
accessors level which was hurting performance. Basically I need a way 
to prevent any MMIO write accesses from kernel driver after device is 
removed (UMD accesses are taken care of by page faulting dummy page). 
We are using now hot-unplug code for Freemont program and so 
up-streaming became more of a priority then before. This MMIO access 
issue is currently my main blocker from up-streaming. Is there any way 
I can assist in pushing this on ?


Andrey

On 2021-03-18 5:51 a.m., Christian König wrote:

Am 18.03.21 um 10:30 schrieb Li, Dennis:


>>> The GPU reset doesn't complete the fences we wait for. It only 
completes the hardware fences as part of the reset.


>>> So waiting for a fence while holding the reset lock is illegal 
and needs to be avoided.


I understood your concern. It is more complex for DRM GFX, therefore 
I abandon adding lock protection for DRM ioctls now. Maybe we can 
try to add all kernel  dma_fence waiting in a list, and signal all 
in recovery threads. Do you have same concern for compute cases?




Yes, compute (KFD) is even harder to handle.

See you can't signal the dma_fence waiting. Waiting for a dma_fence 
also means you wait for the GPU reset to finish.


When we would signal the dma_fence during the GPU reset then we would 
run into memory corruption because the hardware jobs running after 
the GPU reset would access memory which is already freed.


>>> Lockdep also complains about this when it is used correctly. The 
only reason it doesn't complain here is because you use an 
atomic+wait_event instead of a locking primitive.


Agree. This approach will escape the monitor of lockdep.  Its goal 
is to block other threads when GPU recovery thread start. But I 
couldn’t find a better method to solve this problem. Do you have 
some suggestion?




Well, completely abandon those change here.

What we need to do is to identify where hardware access happens and 
then insert taking the read side of the GPU reset lock so that we 
don't wait for a dma_fence or allocate memory, but still protect the 
hardware from concurrent access and reset.


Regards,
Christian.


Best Regards

Dennis Li

*From:* Koenig, Christian 
*Sent:* Thursday, March 18, 2021 4:59 PM
*To:* Li, Dennis ; amd-gfx@lists.freedesktop.org; 
Deucher, Alexander ; Kuehling, Felix 
; Zhang, Hawking 
*Subject:* AW: [PATCH 0/4] Refine GPU recovery sequence to enhance 
its stability


Exactly that's what you don't seem to understand.

The GPU reset doesn't complete the fences we wait for. It only 
completes the hardware fences as part of the reset.


So waiting for a fence while holding the reset lock is illegal and 
needs to be avoided.


Lockdep also complains about this when it is used correctly. The 
only reason it doesn't complain here is because you use an 
atomic+wait_event instead of a locking primitive.


Regards,

Christian.



*Von:*Li, Dennis mailto:dennis...@amd.com>>
*Gesendet:* Donnerstag, 18. März 2021 09:28
*An:* Koenig, Christian <mailto:christian.koe...@amd.com>>; amd-gfx@lists.freedesktop.org 
<mailto:amd-gfx@lists.freedesktop.org> 
<mailto:amd-gfx@lists.freedesktop.org>>; Deucher, Alexander 
mailto:alexander.deuc...@amd.com>>; 
Kuehling, Felix <mailto:felix.kuehl...@amd.com>>; Zhang, Hawking 
mailto:hawking.zh...@amd.com>>
*Betreff:* RE: [PATCH 0/4] Refine GPU recovery sequence to enhance 
its stability


>>> Those two steps need to be exchanged or otherwise it is possible 
that new delayed work items etc are started before the lock is taken.
What about adding check for adev->in_gpu_reset in work item? If 
exchange the two steps, it maybe introduce the deadlock.  For 
example, the user thread hold the read lock and waiting for the 
fence, if recovery thread try to hold write lock and then complete 
fences, in this case, recovery thread will always be blocked.



Best Regards
Dennis Li
-Original Message-
From: Koenig, C

Re: [PATCH 0/4] Refine GPU recovery sequence to enhance its stability

2021-04-05 Thread Andrey Grodzovsky
Denis, Christian, are there any updates in the plan on how to move on 
with this ? As you know I need very similar code for my up-streaming of 
device hot-unplug. My latest solution 
(https://lists.freedesktop.org/archives/amd-gfx/2021-January/058606.html) 
was not acceptable because of low level guards on the register accessors 
level which was hurting performance. Basically I need a way to prevent 
any MMIO write accesses from kernel driver after device is removed (UMD 
accesses are taken care of by page faulting dummy page). We are using 
now hot-unplug code for Freemont program and so up-streaming became more 
of a priority than before. This MMIO access issue is currently my main 
blocker from up-streaming. Is there any way I can assist in pushing this 
on ?


Andrey

On 2021-03-18 5:51 a.m., Christian König wrote:

Am 18.03.21 um 10:30 schrieb Li, Dennis:


>>> The GPU reset doesn't complete the fences we wait for. It only 
completes the hardware fences as part of the reset.


>>> So waiting for a fence while holding the reset lock is illegal 
and needs to be avoided.


I understood your concern. It is more complex for DRM GFX, therefore 
I abandon adding lock protection for DRM ioctls now. Maybe we can try 
to add all kernel  dma_fence waiting in a list, and signal all in 
recovery threads. Do you have same concern for compute cases?




Yes, compute (KFD) is even harder to handle.

See you can't signal the dma_fence waiting. Waiting for a dma_fence 
also means you wait for the GPU reset to finish.


When we would signal the dma_fence during the GPU reset then we would 
run into memory corruption because the hardware jobs running after the 
GPU reset would access memory which is already freed.


>>> Lockdep also complains about this when it is used correctly. The 
only reason it doesn't complain here is because you use an 
atomic+wait_event instead of a locking primitive.


Agree. This approach will escape the monitor of lockdep.  Its goal is 
to block other threads when GPU recovery thread start. But I couldn’t 
find a better method to solve this problem. Do you have some suggestion?




Well, completely abandon those change here.

What we need to do is to identify where hardware access happens and 
then insert taking the read side of the GPU reset lock so that we 
don't wait for a dma_fence or allocate memory, but still protect the 
hardware from concurrent access and reset.


Regards,
Christian.


Best Regards

Dennis Li

*From:* Koenig, Christian 
*Sent:* Thursday, March 18, 2021 4:59 PM
*To:* Li, Dennis ; amd-gfx@lists.freedesktop.org; 
Deucher, Alexander ; Kuehling, Felix 
; Zhang, Hawking 
*Subject:* AW: [PATCH 0/4] Refine GPU recovery sequence to enhance 
its stability


Exactly that's what you don't seem to understand.

The GPU reset doesn't complete the fences we wait for. It only 
completes the hardware fences as part of the reset.


So waiting for a fence while holding the reset lock is illegal and 
needs to be avoided.


Lockdep also complains about this when it is used correctly. The only 
reason it doesn't complain here is because you use an 
atomic+wait_event instead of a locking primitive.


Regards,

Christian.



*Von:*Li, Dennis mailto:dennis...@amd.com>>
*Gesendet:* Donnerstag, 18. März 2021 09:28
*An:* Koenig, Christian <mailto:christian.koe...@amd.com>>; amd-gfx@lists.freedesktop.org 
<mailto:amd-gfx@lists.freedesktop.org> <mailto:amd-gfx@lists.freedesktop.org>>; Deucher, Alexander 
mailto:alexander.deuc...@amd.com>>; 
Kuehling, Felix <mailto:felix.kuehl...@amd.com>>; Zhang, Hawking 
mailto:hawking.zh...@amd.com>>
*Betreff:* RE: [PATCH 0/4] Refine GPU recovery sequence to enhance 
its stability


>>> Those two steps need to be exchanged or otherwise it is possible 
that new delayed work items etc are started before the lock is taken.
What about adding check for adev->in_gpu_reset in work item? If 
exchange the two steps, it maybe introduce the deadlock.  For 
example, the user thread hold the read lock and waiting for the 
fence, if recovery thread try to hold write lock and then complete 
fences, in this case, recovery thread will always be blocked.



Best Regards
Dennis Li
-Original Message-
From: Koenig, Christian <mailto:christian.koe...@amd.com>>

Sent: Thursday, March 18, 2021 3:54 PM
To: Li, Dennis mailto:dennis...@amd.com>>; 
amd-gfx@lists.freedesktop.org <mailto:amd-gfx@lists.freedesktop.org>; 
Deucher, Alexander <mailto:alexander.deuc...@amd.com>>; Kuehling, Felix 
mailto:felix.kuehl...@amd.com>>; Zhang, 
Hawking mailto:hawking.zh...@amd.com>>
Subject: Re: [PATCH 0/4] Refine GPU recovery sequence to enhance its 
stability


Am 18.03.21 um 08:23 schrieb Dennis Li:
> We have defined two variables in_gpu_reset and reset_sem in adev 
object. The atomic

Re: [PATCH 0/4] Refine GPU recovery sequence to enhance its stability

2021-03-18 Thread Christian König

Am 18.03.21 um 10:30 schrieb Li, Dennis:


>>> The GPU reset doesn't complete the fences we wait for. It only 
completes the hardware fences as part of the reset.


>>> So waiting for a fence while holding the reset lock is illegal and 
needs to be avoided.


I understood your concern. It is more complex for DRM GFX, therefore I 
abandon adding lock protection for DRM ioctls now. Maybe we can try to 
add all kernel  dma_fence waiting in a list, and signal all in 
recovery threads. Do you have same concern for compute cases?




Yes, compute (KFD) is even harder to handle.

See you can't signal the dma_fence waiting. Waiting for a dma_fence also 
means you wait for the GPU reset to finish.


When we would signal the dma_fence during the GPU reset then we would 
run into memory corruption because the hardware jobs running after the 
GPU reset would access memory which is already freed.


>>> Lockdep also complains about this when it is used correctly. The 
only reason it doesn't complain here is because you use an 
atomic+wait_event instead of a locking primitive.


Agree. This approach will escape the monitor of lockdep.  Its goal is 
to block other threads when GPU recovery thread start. But I couldn’t 
find a better method to solve this problem. Do you have some suggestion?




Well, completely abandon those change here.

What we need to do is to identify where hardware access happens and then 
insert taking the read side of the GPU reset lock so that we don't wait 
for a dma_fence or allocate memory, but still protect the hardware from 
concurrent access and reset.


Regards,
Christian.


Best Regards

Dennis Li

*From:* Koenig, Christian 
*Sent:* Thursday, March 18, 2021 4:59 PM
*To:* Li, Dennis ; amd-gfx@lists.freedesktop.org; 
Deucher, Alexander ; Kuehling, Felix 
; Zhang, Hawking 
*Subject:* AW: [PATCH 0/4] Refine GPU recovery sequence to enhance its 
stability


Exactly that's what you don't seem to understand.

The GPU reset doesn't complete the fences we wait for. It only 
completes the hardware fences as part of the reset.


So waiting for a fence while holding the reset lock is illegal and 
needs to be avoided.


Lockdep also complains about this when it is used correctly. The only 
reason it doesn't complain here is because you use an 
atomic+wait_event instead of a locking primitive.


Regards,

Christian.



*Von:*Li, Dennis mailto:dennis...@amd.com>>
*Gesendet:* Donnerstag, 18. März 2021 09:28
*An:* Koenig, Christian <mailto:christian.koe...@amd.com>>; amd-gfx@lists.freedesktop.org 
<mailto:amd-gfx@lists.freedesktop.org> <mailto:amd-gfx@lists.freedesktop.org>>; Deucher, Alexander 
mailto:alexander.deuc...@amd.com>>; 
Kuehling, Felix <mailto:felix.kuehl...@amd.com>>; Zhang, Hawking 
mailto:hawking.zh...@amd.com>>
*Betreff:* RE: [PATCH 0/4] Refine GPU recovery sequence to enhance its 
stability


>>> Those two steps need to be exchanged or otherwise it is possible 
that new delayed work items etc are started before the lock is taken.
What about adding check for adev->in_gpu_reset in work item? If 
exchange the two steps, it maybe introduce the deadlock.  For example, 
the user thread hold the read lock and waiting for the fence, if 
recovery thread try to hold write lock and then complete fences, in 
this case, recovery thread will always be blocked.



Best Regards
Dennis Li
-Original Message-
From: Koenig, Christian <mailto:christian.koe...@amd.com>>

Sent: Thursday, March 18, 2021 3:54 PM
To: Li, Dennis mailto:dennis...@amd.com>>; 
amd-gfx@lists.freedesktop.org <mailto:amd-gfx@lists.freedesktop.org>; 
Deucher, Alexander <mailto:alexander.deuc...@amd.com>>; Kuehling, Felix 
mailto:felix.kuehl...@amd.com>>; Zhang, 
Hawking mailto:hawking.zh...@amd.com>>
Subject: Re: [PATCH 0/4] Refine GPU recovery sequence to enhance its 
stability


Am 18.03.21 um 08:23 schrieb Dennis Li:
> We have defined two variables in_gpu_reset and reset_sem in adev 
object. The atomic type variable in_gpu_reset is used to avoid 
recovery thread reenter and make lower functions return more earlier 
when recovery start, but couldn't block recovery thread when it access 
hardware. The r/w semaphore reset_sem is used to solve these 
synchronization issues between recovery thread and other threads.

>
> The original solution locked registers' access in lower functions, 
which will introduce following issues:

>
> 1) many lower functions are used in both recovery thread and others. 
Firstly we must harvest these functions, it is easy to miss someones. 
Secondly these functions need select which lock (read lock or write 
lock) will be used, according to the thread it is running in. If the 
thread context isn't considered, the added lock will easily introduce 
deadlock. Besides that, in most time

RE: [PATCH 0/4] Refine GPU recovery sequence to enhance its stability

2021-03-18 Thread Li, Dennis
>>> The GPU reset doesn't complete the fences we wait for. It only completes 
>>> the hardware fences as part of the reset.
>>> So waiting for a fence while holding the reset lock is illegal and needs to 
>>> be avoided.
I understood your concern. It is more complex for DRM GFX, therefore I will 
abandon adding lock protection for the DRM ioctls for now. Maybe we can try to 
add all kernel dma_fence waits to a list and signal them all in the recovery 
thread. Do you have the same concern for the compute cases?

>>> Lockdep also complains about this when it is used correctly. The only 
>>> reason it doesn't complain here is because you use an atomic+wait_event 
>>> instead of a locking primitive.
Agreed. This approach escapes lockdep monitoring. Its goal is to block 
other threads when the GPU recovery thread starts, but I couldn't find a better 
method to solve this problem. Do you have a suggestion?

Best Regards
Dennis Li

From: Koenig, Christian 
Sent: Thursday, March 18, 2021 4:59 PM
To: Li, Dennis ; amd-gfx@lists.freedesktop.org; Deucher, 
Alexander ; Kuehling, Felix 
; Zhang, Hawking 
Subject: AW: [PATCH 0/4] Refine GPU recovery sequence to enhance its stability

Exactly that's what you don't seem to understand.

The GPU reset doesn't complete the fences we wait for. It only completes the 
hardware fences as part of the reset.

So waiting for a fence while holding the reset lock is illegal and needs to be 
avoided.

Lockdep also complains about this when it is used correctly. The only reason it 
doesn't complain here is because you use an atomic+wait_event instead of a 
locking primitive.

Regards,
Christian.


Von: Li, Dennis mailto:dennis...@amd.com>>
Gesendet: Donnerstag, 18. März 2021 09:28
An: Koenig, Christian 
mailto:christian.koe...@amd.com>>; 
amd-gfx@lists.freedesktop.org<mailto:amd-gfx@lists.freedesktop.org> 
mailto:amd-gfx@lists.freedesktop.org>>; Deucher, 
Alexander mailto:alexander.deuc...@amd.com>>; 
Kuehling, Felix mailto:felix.kuehl...@amd.com>>; Zhang, 
Hawking mailto:hawking.zh...@amd.com>>
Betreff: RE: [PATCH 0/4] Refine GPU recovery sequence to enhance its stability

>>> Those two steps need to be exchanged or otherwise it is possible that new 
>>> delayed work items etc are started before the lock is taken.
What about adding check for adev->in_gpu_reset in work item? If exchange the 
two steps, it maybe introduce the deadlock.  For example, the user thread hold 
the read lock and waiting for the fence, if recovery thread try to hold write 
lock and then complete fences, in this case, recovery thread will always be 
blocked.

Best Regards
Dennis Li
-Original Message-
From: Koenig, Christian 
mailto:christian.koe...@amd.com>>
Sent: Thursday, March 18, 2021 3:54 PM
To: Li, Dennis mailto:dennis...@amd.com>>; 
amd-gfx@lists.freedesktop.org<mailto:amd-gfx@lists.freedesktop.org>; Deucher, 
Alexander mailto:alexander.deuc...@amd.com>>; 
Kuehling, Felix mailto:felix.kuehl...@amd.com>>; Zhang, 
Hawking mailto:hawking.zh...@amd.com>>
Subject: Re: [PATCH 0/4] Refine GPU recovery sequence to enhance its stability

Am 18.03.21 um 08:23 schrieb Dennis Li:
> We have defined two variables in_gpu_reset and reset_sem in adev object. The 
> atomic type variable in_gpu_reset is used to avoid recovery thread reenter 
> and make lower functions return more earlier when recovery start, but 
> couldn't block recovery thread when it access hardware. The r/w semaphore 
> reset_sem is used to solve these synchronization issues between recovery 
> thread and other threads.
>
> The original solution locked registers' access in lower functions, which will 
> introduce following issues:
>
> 1) many lower functions are used in both recovery thread and others. Firstly 
> we must harvest these functions, it is easy to miss someones. Secondly these 
> functions need select which lock (read lock or write lock) will be used, 
> according to the thread it is running in. If the thread context isn't 
> considered, the added lock will easily introduce deadlock. Besides that, in 
> most time, developer easily forget to add locks for new functions.
>
> 2) performance drop. More lower functions are more frequently called.
>
> 3) easily introduce false positive lockdep complaint, because write lock has 
> big range in recovery thread, but low level functions will hold read lock may 
> be protected by other locks in other threads.
>
> Therefore the new solution will try to add lock protection for ioctls of kfd. 
> Its goal is that there are no threads except for recovery thread or its 
> children (for xgmi) to access hardware when doing GPU reset and resume. So 
> refine recovery thread as the following:
>
> Step 0: atomic_cmpxchg(&adev->in_gpu_reset, 0, 1)

RE: [PATCH 0/4] Refine GPU recovery sequence to enhance its stability

2021-03-18 Thread Li, Dennis
>>> Those two steps need to be exchanged or otherwise it is possible that new 
>>> delayed work items etc are started before the lock is taken.
What about adding a check for adev->in_gpu_reset in the work item? If we exchange 
the two steps, it may introduce a deadlock. For example, if a user thread holds 
the read lock and waits for a fence while the recovery thread tries to take the 
write lock and then complete the fences, the recovery thread will always be 
blocked. 

Best Regards
Dennis Li
-Original Message-
From: Koenig, Christian  
Sent: Thursday, March 18, 2021 3:54 PM
To: Li, Dennis ; amd-gfx@lists.freedesktop.org; Deucher, 
Alexander ; Kuehling, Felix 
; Zhang, Hawking 
Subject: Re: [PATCH 0/4] Refine GPU recovery sequence to enhance its stability

Am 18.03.21 um 08:23 schrieb Dennis Li:
> We have defined two variables in_gpu_reset and reset_sem in adev object. The 
> atomic type variable in_gpu_reset is used to avoid recovery thread reenter 
> and make lower functions return more earlier when recovery start, but 
> couldn't block recovery thread when it access hardware. The r/w semaphore 
> reset_sem is used to solve these synchronization issues between recovery 
> thread and other threads.
>
> The original solution locked registers' access in lower functions, which will 
> introduce following issues:
>
> 1) many lower functions are used in both recovery thread and others. Firstly 
> we must harvest these functions, it is easy to miss someones. Secondly these 
> functions need select which lock (read lock or write lock) will be used, 
> according to the thread it is running in. If the thread context isn't 
> considered, the added lock will easily introduce deadlock. Besides that, in 
> most time, developer easily forget to add locks for new functions.
>
> 2) performance drop. More lower functions are more frequently called.
>
> 3) easily introduce false positive lockdep complaint, because write lock has 
> big range in recovery thread, but low level functions will hold read lock may 
> be protected by other locks in other threads.
>
> Therefore the new solution will try to add lock protection for ioctls of kfd. 
> Its goal is that there are no threads except for recovery thread or its 
> children (for xgmi) to access hardware when doing GPU reset and resume. So 
> refine recovery thread as the following:
>
> Step 0: atomic_cmpxchg(&adev->in_gpu_reset, 0, 1)
> 1). if failed, it means system had a recovery thread running, current 
> thread exit directly;
> 2). if success, enter recovery thread;
>
> Step 1: cancel all delay works, stop drm schedule, complete all unreceived 
> fences and so on. It try to stop or pause other threads.
>
> Step 2: call down_write(&adev->reset_sem) to hold write lock, which will 
> block recovery thread until other threads release read locks.

Those two steps need to be exchanged or otherwise it is possible that new 
delayed work items etc are started before the lock is taken.

Just to make it clear until this is fixed the whole patch set is a NAK.

Regards,
Christian.

>
> Step 3: normally, there is only recovery threads running to access hardware, 
> it is safe to do gpu reset now.
>
> Step 4: do post gpu reset, such as call all ips' resume functions;
>
> Step 5: atomic set adev->in_gpu_reset as 0, wake up other threads and release 
> write lock. Recovery thread exit normally.
>
> Other threads call the amdgpu_read_lock to synchronize with recovery thread. 
> If it finds that in_gpu_reset is 1, it should release read lock if it has 
> holden one, and then blocks itself to wait for recovery finished event. If 
> thread successfully hold read lock and in_gpu_reset is 0, it continues. It 
> will exit normally or be stopped by recovery thread in step 1.
>
> Dennis Li (4):
>drm/amdgpu: remove reset lock from low level functions
>drm/amdgpu: refine the GPU recovery sequence
>drm/amdgpu: instead of using down/up_read directly
>drm/amdkfd: add reset lock protection for kfd entry functions
>
>   drivers/gpu/drm/amd/amdgpu/amdgpu.h   |   6 +
>   drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c   |  14 +-
>   drivers/gpu/drm/amd/amdgpu/amdgpu_device.c| 173 +-
>   .../gpu/drm/amd/amdgpu/amdgpu_ras_eeprom.c|   8 -
>   drivers/gpu/drm/amd/amdgpu/gmc_v10_0.c|   4 +-
>   drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c |   9 +-
>   drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c |   5 +-
>   drivers/gpu/drm/amd/amdgpu/mxgpu_nv.c |   5 +-
>   drivers/gpu/drm/amd/amdkfd/kfd_chardev.c  | 172 -
>   drivers/gpu/drm/amd/amdkfd/kfd_priv.h |   3 +-
>   drivers/gpu/drm/amd/amdkfd/kfd_process.c  |   4 +
>   .../amd/amdkfd/kfd_process_queue_manager.c|  17 ++
>   12 files changed, 345 insertions(+), 75 deletions(-)
>
