Re: [PATCH 0/4] Refine GPU recovery sequence to enhance its stability

2021-04-15 Thread Christian König
[SNIP] Maybe just empirically - let's try it and see under different test scenarios what actually happens? Not a good idea in general; we have that approach way too often at AMD and are then surprised that everything works in QA but fails in production. But Daniel already noted in hi

Re: [PATCH 0/4] Refine GPU recovery sequence to enhance its stability

2021-04-15 Thread Andrey Grodzovsky
On 2021-04-15 3:02 a.m., Christian König wrote: On 15.04.21 at 08:27, Andrey Grodzovsky wrote: On 2021-04-14 10:58 a.m., Christian König wrote: On 14.04.21 at 16:36, Andrey Grodzovsky wrote: [SNIP] We are racing here once more and need to handle that. But why? I wrote above that we f

Re: [PATCH 0/4] Refine GPU recovery sequence to enhance its stability

2021-04-15 Thread Christian König
On 15.04.21 at 08:27, Andrey Grodzovsky wrote: On 2021-04-14 10:58 a.m., Christian König wrote: On 14.04.21 at 16:36, Andrey Grodzovsky wrote: [SNIP] We are racing here once more and need to handle that. But why? I wrote above that we first stop all the schedulers, and only then call drm_

Re: [PATCH 0/4] Refine GPU recovery sequence to enhance its stability

2021-04-14 Thread Andrey Grodzovsky
On 2021-04-14 10:58 a.m., Christian König wrote: On 14.04.21 at 16:36, Andrey Grodzovsky wrote: [SNIP] We are racing here once more and need to handle that. But why? I wrote above that we first stop all the schedulers, and only then call drm_sched_entity_kill_jobs. The schedulers consumin

Re: [PATCH 0/4] Refine GPU recovery sequence to enhance its stability

2021-04-14 Thread Christian König
On 14.04.21 at 16:36, Andrey Grodzovsky wrote: [SNIP] We are racing here once more and need to handle that. But why? I wrote above that we first stop all the schedulers, and only then call drm_sched_entity_kill_jobs. The schedulers consuming jobs is not the problem; we already handle that

Re: [PATCH 0/4] Refine GPU recovery sequence to enhance its stability

2021-04-14 Thread Andrey Grodzovsky
On 2021-04-14 3:01 a.m., Christian König wrote: On 13.04.21 at 20:30, Andrey Grodzovsky wrote: On 2021-04-13 2:25 p.m., Christian König wrote: On 13.04.21 at 20:18, Andrey Grodzovsky wrote: On 2021-04-13 2:03 p.m., Christian König wrote: On 13.04.21 at 17:12, Andrey Grodzovsky wrote:

Re: [PATCH 0/4] Refine GPU recovery sequence to enhance its stability

2021-04-14 Thread Christian König
On 13.04.21 at 20:30, Andrey Grodzovsky wrote: On 2021-04-13 2:25 p.m., Christian König wrote: On 13.04.21 at 20:18, Andrey Grodzovsky wrote: On 2021-04-13 2:03 p.m., Christian König wrote: On 13.04.21 at 17:12, Andrey Grodzovsky wrote: On 2021-04-13 3:10 a.m., Christian König wrote:

Re: [PATCH 0/4] Refine GPU recovery sequence to enhance its stability

2021-04-13 Thread Daniel Vetter
an; Li, Dennis; amd-gfx@lists.freedesktop.org; Deucher, Alexander; Kuehling, Felix; Zhang, Hawking; Daniel Vetter Subject: Re: [PATCH 0/4] Refine GPU recovery sequence to enhance its stability On 12.04.21 at 22:01, Andrey Grodzovsky wrote:

Re: [PATCH 0/4] Refine GPU recovery sequence to enhance its stability

2021-04-13 Thread Daniel Vetter
On Tue, Apr 13, 2021 at 9:10 AM Christian König wrote: On 12.04.21 at 22:01, Andrey Grodzovsky wrote: On 2021-04-12 3:18 p.m., Christian König wrote: On 12.04.21 at 21:12, Andrey Grodzovsky wrote: [SNIP] So what's the right approach? How do we guarantee that

Re: [PATCH 0/4] Refine GPU recovery sequence to enhance its stability

2021-04-13 Thread Andrey Grodzovsky
On 2021-04-13 2:25 p.m., Christian König wrote: On 13.04.21 at 20:18, Andrey Grodzovsky wrote: On 2021-04-13 2:03 p.m., Christian König wrote: On 13.04.21 at 17:12, Andrey Grodzovsky wrote: On 2021-04-13 3:10 a.m., Christian König wrote: On 12.04.21 at 22:01, Andrey Grodzovsky wrote:

Re: [PATCH 0/4] Refine GPU recovery sequence to enhance its stability

2021-04-13 Thread Christian König
On 13.04.21 at 20:18, Andrey Grodzovsky wrote: On 2021-04-13 2:03 p.m., Christian König wrote: On 13.04.21 at 17:12, Andrey Grodzovsky wrote: On 2021-04-13 3:10 a.m., Christian König wrote: On 12.04.21 at 22:01, Andrey Grodzovsky wrote: On 2021-04-12 3:18 p.m., Christian König wrote:

Re: [PATCH 0/4] Refine GPU recovery sequence to enhance its stability

2021-04-13 Thread Andrey Grodzovsky
On 2021-04-13 2:03 p.m., Christian König wrote: On 13.04.21 at 17:12, Andrey Grodzovsky wrote: On 2021-04-13 3:10 a.m., Christian König wrote: On 12.04.21 at 22:01, Andrey Grodzovsky wrote: On 2021-04-12 3:18 p.m., Christian König wrote: On 12.04.21 at 21:12, Andrey Grodzovsky wrote: [

Re: [PATCH 0/4] Refine GPU recovery sequence to enhance its stability

2021-04-13 Thread Christian König
On 13.04.21 at 17:12, Andrey Grodzovsky wrote: On 2021-04-13 3:10 a.m., Christian König wrote: On 12.04.21 at 22:01, Andrey Grodzovsky wrote: On 2021-04-12 3:18 p.m., Christian König wrote: On 12.04.21 at 21:12, Andrey Grodzovsky wrote: [SNIP] So what's the right approach? How do we guar

Re: [PATCH 0/4] Refine GPU recovery sequence to enhance its stability

2021-04-13 Thread Andrey Grodzovsky
On 2021-04-13 3:10 a.m., Christian König wrote: On 12.04.21 at 22:01, Andrey Grodzovsky wrote: On 2021-04-12 3:18 p.m., Christian König wrote: On 12.04.21 at 21:12, Andrey Grodzovsky wrote: [SNIP] So what's the right approach? How do we guarantee that when running amdgpu_fence_driver_forc

Re: [PATCH 0/4] Refine GPU recovery sequence to enhance its stability

2021-04-13 Thread Christian König
enig, Christian; Li, Dennis; amd-gfx@lists.freedesktop.org; Deucher, Alexander; Kuehling, Felix; Zhang, Hawking; Daniel Vetter Subject: Re: [PATCH 0/4] Refine GPU recovery sequence to enhance its stability On 12.04.21 at 22:01, Andrey Grodzovsky wrote: On 2021-04-12 3:18 p.m., Christian K

RE: [PATCH 0/4] Refine GPU recovery sequence to enhance its stability

2021-04-13 Thread Li, Dennis
tian König Sent: Tuesday, April 13, 2021 3:10 PM To: Grodzovsky, Andrey; Koenig, Christian; Li, Dennis; amd-gfx@lists.freedesktop.org; Deucher, Alexander; Kuehling, Felix; Zhang, Hawking; Daniel Vetter Subject: Re: [PATCH 0/4] Refine GPU recovery sequence to enhance its stability On 12.04.2

Re: [PATCH 0/4] Refine GPU recovery sequence to enhance its stability

2021-04-13 Thread Christian König
On 12.04.21 at 22:01, Andrey Grodzovsky wrote: On 2021-04-12 3:18 p.m., Christian König wrote: On 12.04.21 at 21:12, Andrey Grodzovsky wrote: [SNIP] So what's the right approach? How do we guarantee that when running amdgpu_fence_driver_force_completion we will signal all the HW fences and

Re: [PATCH 0/4] Refine GPU recovery sequence to enhance its stability

2021-04-13 Thread Christian König
On 13.04.21 at 07:36, Andrey Grodzovsky wrote: [SNIP] emit_fence(fence); /* We can't wait forever as the HW might be gone at any point */ dma_fence_wait_timeout(old_fence, 5S); You can pretty much ignore this wait here. It is only as a last resort so that we never overwrite th

Re: [PATCH 0/4] Refine GPU recovery sequence to enhance its stability

2021-04-12 Thread Andrey Grodzovsky
On 2021-04-12 2:23 p.m., Christian König wrote: On 12.04.21 at 20:18, Andrey Grodzovsky wrote: On 2021-04-12 2:05 p.m., Christian König wrote: On 12.04.21 at 20:01, Andrey Grodzovsky wrote: On 2021-04-12 1:44 p.m., Christian König wrote: On 12.04.21 at 19:27, Andrey Grodzovsky wrote:

Re: [PATCH 0/4] Refine GPU recovery sequence to enhance its stability

2021-04-12 Thread Andrey Grodzovsky
On 2021-04-12 3:18 p.m., Christian König wrote: On 12.04.21 at 21:12, Andrey Grodzovsky wrote: [SNIP] So what's the right approach? How do we guarantee that when running amdgpu_fence_driver_force_completion we will signal all the HW fences and are not racing against further fence insertion in

Re: [PATCH 0/4] Refine GPU recovery sequence to enhance its stability

2021-04-12 Thread Christian König
On 12.04.21 at 21:12, Andrey Grodzovsky wrote: [SNIP] So what's the right approach? How do we guarantee that when running amdgpu_fence_driver_force_completion we will signal all the HW fences and are not racing against further fence insertion into that array? Well, I would still say the be

Re: [PATCH 0/4] Refine GPU recovery sequence to enhance its stability

2021-04-12 Thread Andrey Grodzovsky
On 2021-04-12 2:23 p.m., Christian König wrote: On 12.04.21 at 20:18, Andrey Grodzovsky wrote: On 2021-04-12 2:05 p.m., Christian König wrote: On 12.04.21 at 20:01, Andrey Grodzovsky wrote: On 2021-04-12 1:44 p.m., Christian König wrote: On 12.04.21 at 19:27, Andrey Grodzovsky wrote:

Re: [PATCH 0/4] Refine GPU recovery sequence to enhance its stability

2021-04-12 Thread Christian König
On 12.04.21 at 20:18, Andrey Grodzovsky wrote: On 2021-04-12 2:05 p.m., Christian König wrote: On 12.04.21 at 20:01, Andrey Grodzovsky wrote: On 2021-04-12 1:44 p.m., Christian König wrote: On 12.04.21 at 19:27, Andrey Grodzovsky wrote: On 2021-04-10 1:34 p.m., Christian König wrote:

Re: [PATCH 0/4] Refine GPU recovery sequence to enhance its stability

2021-04-12 Thread Andrey Grodzovsky
On 2021-04-12 2:05 p.m., Christian König wrote: On 12.04.21 at 20:01, Andrey Grodzovsky wrote: On 2021-04-12 1:44 p.m., Christian König wrote: On 12.04.21 at 19:27, Andrey Grodzovsky wrote: On 2021-04-10 1:34 p.m., Christian König wrote: Hi Andrey, On 09.04.21 at 20:18, Andrey G

Re: [PATCH 0/4] Refine GPU recovery sequence to enhance its stability

2021-04-12 Thread Christian König
On 12.04.21 at 20:01, Andrey Grodzovsky wrote: On 2021-04-12 1:44 p.m., Christian König wrote: On 12.04.21 at 19:27, Andrey Grodzovsky wrote: On 2021-04-10 1:34 p.m., Christian König wrote: Hi Andrey, On 09.04.21 at 20:18, Andrey Grodzovsky wrote: [SNIP] If we use a list and a flag c

Re: [PATCH 0/4] Refine GPU recovery sequence to enhance its stability

2021-04-12 Thread Andrey Grodzovsky
On 2021-04-12 1:44 p.m., Christian König wrote: On 12.04.21 at 19:27, Andrey Grodzovsky wrote: On 2021-04-10 1:34 p.m., Christian König wrote: Hi Andrey, On 09.04.21 at 20:18, Andrey Grodzovsky wrote: [SNIP] If we use a list and a flag called 'emit_allowed' under a lock such that in am

Re: [PATCH 0/4] Refine GPU recovery sequence to enhance its stability

2021-04-12 Thread Christian König
On 12.04.21 at 19:27, Andrey Grodzovsky wrote: On 2021-04-10 1:34 p.m., Christian König wrote: Hi Andrey, On 09.04.21 at 20:18, Andrey Grodzovsky wrote: [SNIP] If we use a list and a flag called 'emit_allowed' under a lock such that in amdgpu_fence_emit we lock the list, check the flag

Re: [PATCH 0/4] Refine GPU recovery sequence to enhance its stability

2021-04-12 Thread Andrey Grodzovsky
On 2021-04-10 1:34 p.m., Christian König wrote: Hi Andrey, On 09.04.21 at 20:18, Andrey Grodzovsky wrote: [SNIP] If we use a list and a flag called 'emit_allowed' under a lock such that in amdgpu_fence_emit we lock the list, check the flag and, if true, add the new HW fence to the list and proc

Re: [PATCH 0/4] Refine GPU recovery sequence to enhance its stability

2021-04-10 Thread Christian König
Hi Andrey, On 09.04.21 at 20:18, Andrey Grodzovsky wrote: [SNIP] If we use a list and a flag called 'emit_allowed' under a lock such that in amdgpu_fence_emit we lock the list, check the flag and, if true, add the new HW fence to the list and proceed to HW emission as normal, otherwise return wit
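
The 'emit_allowed' idea quoted in this excerpt could look roughly like the userspace sketch below. It is only an illustration under stated assumptions: `struct fence_list`, `fence_emit`, `force_completion`, `MAX_FENCES` and the `-1` failure value are hypothetical, a pthread mutex stands in for whatever kernel lock would be used, and none of this is the amdgpu code. The error the proposal actually returns is cut off in the excerpt, so the sketch does not try to reproduce it.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>

#define MAX_FENCES 8            /* hypothetical capacity, for illustration only */

/* Hypothetical stand-in for the list of pending HW fences plus the flag. */
struct fence_list {
    pthread_mutex_t lock;
    bool emit_allowed;          /* cleared once forced completion has run */
    int pending[MAX_FENCES];    /* sequence numbers of fences not yet signaled */
    int count;
};

/* Emit path: take the lock, check the flag, and only then add the fence. */
static int fence_emit(struct fence_list *fl, int seq)
{
    int ret = -1;               /* assumed failure value, not from the thread */

    pthread_mutex_lock(&fl->lock);
    assert(fl->count <= MAX_FENCES);
    if (fl->emit_allowed && fl->count < MAX_FENCES) {
        fl->pending[fl->count++] = seq;   /* real code would emit to HW here */
        ret = 0;
    }
    pthread_mutex_unlock(&fl->lock);
    return ret;
}

/* Forced completion: forbid new emits and drain the list under the same lock. */
static int force_completion(struct fence_list *fl)
{
    int signaled;

    pthread_mutex_lock(&fl->lock);
    fl->emit_allowed = false;
    signaled = fl->count;       /* every pending fence counts as signaled */
    fl->count = 0;
    pthread_mutex_unlock(&fl->lock);
    return signaled;
}
```

Because the flag check and the list insertion happen under the same lock that `force_completion` holds, an emit that arrives after forced completion fails instead of adding a fence that nothing would signal.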

Re: [PATCH 0/4] Refine GPU recovery sequence to enhance its stability

2021-04-09 Thread Andrey Grodzovsky
On 2021-04-09 12:39 p.m., Christian König wrote: On 09.04.21 at 17:42, Andrey Grodzovsky wrote: On 2021-04-09 3:01 a.m., Christian König wrote: On 09.04.21 at 08:53, Christian König wrote: On 08.04.21 at 22:39, Andrey Grodzovsky wrote: [SNIP] But inserting drm_dev_enter/exit on the highe

Re: [PATCH 0/4] Refine GPU recovery sequence to enhance its stability

2021-04-09 Thread Christian König
On 09.04.21 at 17:42, Andrey Grodzovsky wrote: On 2021-04-09 3:01 a.m., Christian König wrote: On 09.04.21 at 08:53, Christian König wrote: On 08.04.21 at 22:39, Andrey Grodzovsky wrote: [SNIP] But inserting drm_dev_enter/exit on the highest level in drm_ioctl is much less effort and less

Re: [PATCH 0/4] Refine GPU recovery sequence to enhance its stability

2021-04-09 Thread Andrey Grodzovsky
On 2021-04-09 3:01 a.m., Christian König wrote: On 09.04.21 at 08:53, Christian König wrote: On 08.04.21 at 22:39, Andrey Grodzovsky wrote: On 2021-04-08 2:58 p.m., Christian König wrote: On 08.04.21 at 18:08, Andrey Grodzovsky wrote: On 2021-04-08 4:32 a.m., Christian König wrote: On 0

Re: [PATCH 0/4] Refine GPU recovery sequence to enhance its stability

2021-04-09 Thread Christian König
On 09.04.21 at 08:53, Christian König wrote: On 08.04.21 at 22:39, Andrey Grodzovsky wrote: On 2021-04-08 2:58 p.m., Christian König wrote: On 08.04.21 at 18:08, Andrey Grodzovsky wrote: On 2021-04-08 4:32 a.m., Christian König wrote: On 08.04.21 at 10:22, Christian König wrote: [SNIP]

Re: [PATCH 0/4] Refine GPU recovery sequence to enhance its stability

2021-04-08 Thread Christian König
On 08.04.21 at 22:39, Andrey Grodzovsky wrote: On 2021-04-08 2:58 p.m., Christian König wrote: On 08.04.21 at 18:08, Andrey Grodzovsky wrote: On 2021-04-08 4:32 a.m., Christian König wrote: On 08.04.21 at 10:22, Christian König wrote: [SNIP] Beyond blocking all delayed work items and schedu

Re: [PATCH 0/4] Refine GPU recovery sequence to enhance its stability

2021-04-08 Thread Andrey Grodzovsky
On 2021-04-08 2:58 p.m., Christian König wrote: On 08.04.21 at 18:08, Andrey Grodzovsky wrote: On 2021-04-08 4:32 a.m., Christian König wrote: On 08.04.21 at 10:22, Christian König wrote: [SNIP] Beyond blocking all delayed work items and scheduler threads, we also need to guarantee that no IOCTL

Re: [PATCH 0/4] Refine GPU recovery sequence to enhance its stability

2021-04-08 Thread Christian König
On 08.04.21 at 18:08, Andrey Grodzovsky wrote: On 2021-04-08 4:32 a.m., Christian König wrote: On 08.04.21 at 10:22, Christian König wrote: [SNIP] Beyond blocking all delayed work items and scheduler threads, we also need to guarantee that no IOCTL can access MMIO post device unplug OR in flight I

Re: [PATCH 0/4] Refine GPU recovery sequence to enhance its stability

2021-04-08 Thread Andrey Grodzovsky
t lock so that we don't wait for a dma_fence or allocate memory, but still protect the hardware from concurrent access and reset. Regards, Christian. Best Regards Dennis Li From: Koenig, Christian Sent: Thursday, March 18, 2021 4:59 PM To: Li, Dennis; amd-gfx@lists.freedesktop.org

Re: [PATCH 0/4] Refine GPU recovery sequence to enhance its stability

2021-04-08 Thread Christian König
't wait for a dma_fence or allocate memory, but still protect the hardware from concurrent access and reset. Regards, Christian. Best Regards Dennis Li From: Koenig, Christian Sent: Thursday, March 18, 2021 4:59 PM To: Li, Dennis; amd-gfx@lists.freedesktop.org; Deucher, Alex

Re: [PATCH 0/4] Refine GPU recovery sequence to enhance its stability

2021-04-08 Thread Christian König
allocate memory, but still protect the hardware from concurrent access and reset. Regards, Christian. Best Regards Dennis Li From: Koenig, Christian Sent: Thursday, March 18, 2021 4:59 PM To: Li, Dennis; amd-gfx@lists.freedesktop.org; Deucher, Alexander; Kuehling, Felix; Zhang,

Re: [PATCH 0/4] Refine GPU recovery sequence to enhance its stability

2021-04-07 Thread Andrey Grodzovsky
t the hardware from concurrent access and reset. Regards, Christian. Best Regards Dennis Li From: Koenig, Christian Sent: Thursday, March 18, 2021 4:59 PM To: Li, Dennis; amd-gfx@lists.freedesktop.org; Deucher, Alexander; Kuehling, Felix; Zhang, Hawking Subject: AW: [PATCH 0/4]

Re: [PATCH 0/4] Refine GPU recovery sequence to enhance its stability

2021-04-07 Thread Christian König
access and reset. Regards, Christian. Best Regards Dennis Li From: Koenig, Christian Sent: Thursday, March 18, 2021 4:59 PM To: Li, Dennis; amd-gfx@lists.freedesktop.org; Deucher, Alexander; Kuehling, Felix; Zhang, Hawking Subject: AW: [PATCH 0/4] Refine GPU recovery sequence to

Re: [PATCH 0/4] Refine GPU recovery sequence to enhance its stability

2021-04-06 Thread Andrey Grodzovsky
rds Dennis Li From: Koenig, Christian Sent: Thursday, March 18, 2021 4:59 PM To: Li, Dennis; amd-gfx@lists.freedesktop.org; Deucher, Alexander; Kuehling, Felix; Zhang, Hawking Subject: AW: [PATCH 0/4] Refine GPU recovery sequence to enhance its stability That's exactly what

Re: [PATCH 0/4] Refine GPU recovery sequence to enhance its stability

2021-04-06 Thread Christian König
Thursday, March 18, 2021 4:59 PM To: Li, Dennis; amd-gfx@lists.freedesktop.org; Deucher, Alexander; Kuehling, Felix; Zhang, Hawking Subject: AW: [PATCH 0/4] Refine GPU recovery sequence to enhance its stability That's exactly what you don't seem to understand. The GPU reset

Re: [PATCH 0/4] Refine GPU recovery sequence to enhance its stability

2021-04-06 Thread Christian König
@lists.freedesktop.org; Deucher, Alexander; Kuehling, Felix; Zhang, Hawking Subject: RE: [PATCH 0/4] Refine GPU recovery sequence to enhance its stability

Re: [PATCH 0/4] Refine GPU recovery sequence to enhance its stability

2021-04-05 Thread Andrey Grodzovsky
@lists.freedesktop.org; Deucher, Alexander; Kuehling, Felix; Zhang, Hawking Subject: RE: [PATCH 0/4] Refine GPU recovery sequence to enhance its stability T

Re: [PATCH 0/4] Refine GPU recovery sequence to enhance its stability

2021-03-18 Thread Christian König
freedesktop.org; Deucher, Alexander; Kuehling, Felix; Zhang, Hawking Subject: AW: [PATCH 0/4] Refine GPU recovery sequence to enhance its stability That's exactly what you don't seem to understand. The GPU reset doesn't complete the fences we wait for. It only completes the

RE: [PATCH 0/4] Refine GPU recovery sequence to enhance its stability

2021-03-18 Thread Li, Dennis
find a better method to solve this problem. Do you have any suggestions? Best Regards Dennis Li From: Koenig, Christian Sent: Thursday, March 18, 2021 4:59 PM To: Li, Dennis; amd-gfx@lists.freedesktop.org; Deucher, Alexander; Kuehling, Felix; Zhang, Hawking Subject: AW: [PATCH 0/4] Refine GPU

AW: [PATCH 0/4] Refine GPU recovery sequence to enhance its stability

2021-03-18 Thread Koenig, Christian
Christian; amd-gfx@lists.freedesktop.org; Deucher, Alexander; Kuehling, Felix; Zhang, Hawking Subject: RE: [PATCH 0/4] Refine GPU recovery sequence to enhance its stability Those two steps need to be exchanged, otherwise it is possible that new delayed work items etc. are starte

RE: [PATCH 0/4] Refine GPU recovery sequence to enhance its stability

2021-03-18 Thread Li, Dennis
:54 PM To: Li, Dennis; amd-gfx@lists.freedesktop.org; Deucher, Alexander; Kuehling, Felix; Zhang, Hawking Subject: Re: [PATCH 0/4] Refine GPU recovery sequence to enhance its stability On 18.03.21 at 08:23, Dennis Li wrote: We have defined two variables, in_gpu_reset and reset_sem, in ad

Re: [PATCH 0/4] Refine GPU recovery sequence to enhance its stability

2021-03-18 Thread Christian König
On 18.03.21 at 08:23, Dennis Li wrote: We have defined two variables, in_gpu_reset and reset_sem, in the adev object. The atomic variable in_gpu_reset is used to keep the recovery thread from being re-entered and to make lower-level functions return earlier when recovery starts, but it cannot block the recovery thread w

[PATCH 0/4] Refine GPU recovery sequence to enhance its stability

2021-03-18 Thread Dennis Li
We have defined two variables, in_gpu_reset and reset_sem, in the adev object. The atomic variable in_gpu_reset is used to keep the recovery thread from being re-entered and to make lower-level functions return earlier when recovery starts, but it cannot block the recovery thread when it accesses hardware. The r/w semaphore