[SNIP]
Maybe just empirically: let's try it and see what actually happens
under different test scenarios?
Not a good idea in general. We take that approach way too often at
AMD and are then surprised that everything works in QA but fails
in production.
But Daniel already noted in hi
On 2021-04-15 3:02 a.m., Christian König wrote:
Am 15.04.21 um 08:27 schrieb Andrey Grodzovsky:
On 2021-04-14 10:58 a.m., Christian König wrote:
Am 14.04.21 um 16:36 schrieb Andrey Grodzovsky:
[SNIP]
We are racing here once more and need to handle that.
But why, I wrote above that we f
Am 15.04.21 um 08:27 schrieb Andrey Grodzovsky:
On 2021-04-14 10:58 a.m., Christian König wrote:
Am 14.04.21 um 16:36 schrieb Andrey Grodzovsky:
[SNIP]
We are racing here once more and need to handle that.
But why, I wrote above that we first stop all the schedulers, then
only call drm_
On 2021-04-14 10:58 a.m., Christian König wrote:
Am 14.04.21 um 16:36 schrieb Andrey Grodzovsky:
[SNIP]
We are racing here once more and need to handle that.
But why, I wrote above that we first stop all the schedulers, then
only call drm_sched_entity_kill_jobs.
The schedulers consumin
Am 14.04.21 um 16:36 schrieb Andrey Grodzovsky:
[SNIP]
We are racing here once more and need to handle that.
But why, I wrote above that we first stop all the schedulers, then
only call drm_sched_entity_kill_jobs.
The schedulers consuming jobs is not the problem, we already handle
that
On 2021-04-14 3:01 a.m., Christian König wrote:
Am 13.04.21 um 20:30 schrieb Andrey Grodzovsky:
On 2021-04-13 2:25 p.m., Christian König wrote:
Am 13.04.21 um 20:18 schrieb Andrey Grodzovsky:
On 2021-04-13 2:03 p.m., Christian König wrote:
Am 13.04.21 um 17:12 schrieb Andrey Grodzovsky:
Am 13.04.21 um 20:30 schrieb Andrey Grodzovsky:
On 2021-04-13 2:25 p.m., Christian König wrote:
Am 13.04.21 um 20:18 schrieb Andrey Grodzovsky:
On 2021-04-13 2:03 p.m., Christian König wrote:
Am 13.04.21 um 17:12 schrieb Andrey Grodzovsky:
On 2021-04-13 3:10 a.m., Christian König wrote:
an
> ; Li, Dennis ;
> amd-gfx@lists.freedesktop.org; Deucher, Alexander
> ; Kuehling, Felix ; Zhang,
> Hawking ; Daniel Vetter
> Subject: Re: [PATCH 0/4] Refine GPU recovery sequence to enhance its stability
>
> Am 12.04.21 um 22:01 schrieb Andrey Grodzovsky:
> >
> &g
On Tue, Apr 13, 2021 at 9:10 AM Christian König
wrote:
>
> Am 12.04.21 um 22:01 schrieb Andrey Grodzovsky:
> >
> > On 2021-04-12 3:18 p.m., Christian König wrote:
> >> Am 12.04.21 um 21:12 schrieb Andrey Grodzovsky:
> >>> [SNIP]
> >
> > So what's the right approach? How do we guarantee that
On 2021-04-13 2:25 p.m., Christian König wrote:
Am 13.04.21 um 20:18 schrieb Andrey Grodzovsky:
On 2021-04-13 2:03 p.m., Christian König wrote:
Am 13.04.21 um 17:12 schrieb Andrey Grodzovsky:
On 2021-04-13 3:10 a.m., Christian König wrote:
Am 12.04.21 um 22:01 schrieb Andrey Grodzovsky:
Am 13.04.21 um 20:18 schrieb Andrey Grodzovsky:
On 2021-04-13 2:03 p.m., Christian König wrote:
Am 13.04.21 um 17:12 schrieb Andrey Grodzovsky:
On 2021-04-13 3:10 a.m., Christian König wrote:
Am 12.04.21 um 22:01 schrieb Andrey Grodzovsky:
On 2021-04-12 3:18 p.m., Christian König wrote:
On 2021-04-13 2:03 p.m., Christian König wrote:
Am 13.04.21 um 17:12 schrieb Andrey Grodzovsky:
On 2021-04-13 3:10 a.m., Christian König wrote:
Am 12.04.21 um 22:01 schrieb Andrey Grodzovsky:
On 2021-04-12 3:18 p.m., Christian König wrote:
Am 12.04.21 um 21:12 schrieb Andrey Grodzovsky:
[
Am 13.04.21 um 17:12 schrieb Andrey Grodzovsky:
On 2021-04-13 3:10 a.m., Christian König wrote:
Am 12.04.21 um 22:01 schrieb Andrey Grodzovsky:
On 2021-04-12 3:18 p.m., Christian König wrote:
Am 12.04.21 um 21:12 schrieb Andrey Grodzovsky:
[SNIP]
So what's the right approach? How do we guar
On 2021-04-13 3:10 a.m., Christian König wrote:
Am 12.04.21 um 22:01 schrieb Andrey Grodzovsky:
On 2021-04-12 3:18 p.m., Christian König wrote:
Am 12.04.21 um 21:12 schrieb Andrey Grodzovsky:
[SNIP]
So what's the right approach? How do we guarantee that when running
amdgpu_fence_driver_forc
enig, Christian ; Li, Dennis
; amd-gfx@lists.freedesktop.org; Deucher, Alexander ; Kuehling,
Felix ; Zhang, Hawking ; Daniel Vetter
Subject: Re: [PATCH 0/4] Refine GPU recovery sequence to enhance its stability
Am 12.04.21 um 22:01 schrieb Andrey Grodzovsky:
On 2021-04-12 3:18 p.m., Christian K
tian König
Sent: Tuesday, April 13, 2021 3:10 PM
To: Grodzovsky, Andrey ; Koenig, Christian
; Li, Dennis ;
amd-gfx@lists.freedesktop.org; Deucher, Alexander ;
Kuehling, Felix ; Zhang, Hawking
; Daniel Vetter
Subject: Re: [PATCH 0/4] Refine GPU recovery sequence to enhance its stability
Am 12.04.2
Am 12.04.21 um 22:01 schrieb Andrey Grodzovsky:
On 2021-04-12 3:18 p.m., Christian König wrote:
Am 12.04.21 um 21:12 schrieb Andrey Grodzovsky:
[SNIP]
So what's the right approach? How do we guarantee that when running
amdgpu_fence_driver_force_completion we will signal all the HW
fences and
Am 13.04.21 um 07:36 schrieb Andrey Grodzovsky:
[SNIP]
emit_fence(fence);
/* We can't wait forever as the HW might be gone at any point */
dma_fence_wait_timeout(old_fence, 5S);
You can pretty much ignore this wait here. It is only there as a last
resort so that we never overwrite th
On 2021-04-12 2:23 p.m., Christian König wrote:
Am 12.04.21 um 20:18 schrieb Andrey Grodzovsky:
On 2021-04-12 2:05 p.m., Christian König wrote:
Am 12.04.21 um 20:01 schrieb Andrey Grodzovsky:
On 2021-04-12 1:44 p.m., Christian König wrote:
Am 12.04.21 um 19:27 schrieb Andrey Grodzovsky:
On 2021-04-12 3:18 p.m., Christian König wrote:
Am 12.04.21 um 21:12 schrieb Andrey Grodzovsky:
[SNIP]
So what's the right approach? How do we guarantee that when running
amdgpu_fence_driver_force_completion we will signal all the HW
fences and not race against more fences being inserted in
Am 12.04.21 um 21:12 schrieb Andrey Grodzovsky:
[SNIP]
So what's the right approach? How do we guarantee that when running
amdgpu_fence_driver_force_completion we will signal all the HW
fences and not race against more fences being inserted into that
array?
Well I would still say the be
On 2021-04-12 2:23 p.m., Christian König wrote:
Am 12.04.21 um 20:18 schrieb Andrey Grodzovsky:
On 2021-04-12 2:05 p.m., Christian König wrote:
Am 12.04.21 um 20:01 schrieb Andrey Grodzovsky:
On 2021-04-12 1:44 p.m., Christian König wrote:
Am 12.04.21 um 19:27 schrieb Andrey Grodzovsky:
Am 12.04.21 um 20:18 schrieb Andrey Grodzovsky:
On 2021-04-12 2:05 p.m., Christian König wrote:
Am 12.04.21 um 20:01 schrieb Andrey Grodzovsky:
On 2021-04-12 1:44 p.m., Christian König wrote:
Am 12.04.21 um 19:27 schrieb Andrey Grodzovsky:
On 2021-04-10 1:34 p.m., Christian König wrote:
On 2021-04-12 2:05 p.m., Christian König wrote:
Am 12.04.21 um 20:01 schrieb Andrey Grodzovsky:
On 2021-04-12 1:44 p.m., Christian König wrote:
Am 12.04.21 um 19:27 schrieb Andrey Grodzovsky:
On 2021-04-10 1:34 p.m., Christian König wrote:
Hi Andrey,
Am 09.04.21 um 20:18 schrieb Andrey G
Am 12.04.21 um 20:01 schrieb Andrey Grodzovsky:
On 2021-04-12 1:44 p.m., Christian König wrote:
Am 12.04.21 um 19:27 schrieb Andrey Grodzovsky:
On 2021-04-10 1:34 p.m., Christian König wrote:
Hi Andrey,
Am 09.04.21 um 20:18 schrieb Andrey Grodzovsky:
[SNIP]
If we use a list and a flag c
On 2021-04-12 1:44 p.m., Christian König wrote:
Am 12.04.21 um 19:27 schrieb Andrey Grodzovsky:
On 2021-04-10 1:34 p.m., Christian König wrote:
Hi Andrey,
Am 09.04.21 um 20:18 schrieb Andrey Grodzovsky:
[SNIP]
If we use a list and a flag called 'emit_allowed' under a lock
such that in am
Am 12.04.21 um 19:27 schrieb Andrey Grodzovsky:
On 2021-04-10 1:34 p.m., Christian König wrote:
Hi Andrey,
Am 09.04.21 um 20:18 schrieb Andrey Grodzovsky:
[SNIP]
If we use a list and a flag called 'emit_allowed' under a lock
such that in amdgpu_fence_emit we lock the list, check the flag
On 2021-04-10 1:34 p.m., Christian König wrote:
Hi Andrey,
Am 09.04.21 um 20:18 schrieb Andrey Grodzovsky:
[SNIP]
If we use a list and a flag called 'emit_allowed' under a lock such
that in amdgpu_fence_emit we lock the list, check the flag and if
true add the new HW fence to the list and proc
Hi Andrey,
Am 09.04.21 um 20:18 schrieb Andrey Grodzovsky:
[SNIP]
If we use a list and a flag called 'emit_allowed' under a lock such
that in amdgpu_fence_emit we lock the list, check the flag and if
true add the new HW fence to the list and proceed to HW emission as
normal, otherwise return wit
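The list-plus-flag idea described above can be sketched in userspace C. This is only an illustration under stated assumptions, not the amdgpu implementation: the sketch_* names, the singly linked list, the pthread mutex standing in for a kernel lock, and the -1 return value are all hypothetical (the original sentence is cut off before it says what is returned when emission is blocked).

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a HW fence; not the amdgpu structure. */
struct sketch_fence {
    bool signaled;
    struct sketch_fence *next;
};

/* Hypothetical fence bookkeeping: one lock guards both list and flag. */
struct sketch_fence_drv {
    pthread_mutex_t lock;
    bool emit_allowed;
    struct sketch_fence *pending;
};

static void sketch_drv_init(struct sketch_fence_drv *drv)
{
    pthread_mutex_init(&drv->lock, NULL);
    drv->emit_allowed = true;
    drv->pending = NULL;
}

/*
 * Emit path: check the flag and add the fence to the list under the lock.
 * Returns 0 when the fence is tracked, -1 (assumed error value) when
 * emission has already been blocked.
 */
static int sketch_fence_emit(struct sketch_fence_drv *drv,
                             struct sketch_fence *f)
{
    int ret = -1;

    pthread_mutex_lock(&drv->lock);
    if (drv->emit_allowed) {
        f->signaled = false;
        f->next = drv->pending;
        drv->pending = f;
        /* A real driver would proceed to HW emission here. */
        ret = 0;
    }
    pthread_mutex_unlock(&drv->lock);
    return ret;
}

/*
 * Force-completion path: clear the flag and signal every tracked fence
 * while holding the same lock, so no fence can be inserted in between.
 * Returns the number of fences signaled.
 */
static int sketch_force_completion(struct sketch_fence_drv *drv)
{
    int count = 0;

    pthread_mutex_lock(&drv->lock);
    drv->emit_allowed = false;
    while (drv->pending != NULL) {
        struct sketch_fence *f = drv->pending;

        drv->pending = f->next;
        f->next = NULL;
        f->signaled = true;
        count++;
    }
    pthread_mutex_unlock(&drv->lock);
    return count;
}
```

Because the flag is cleared and the list drained under one lock, an emit call either lands on the list before the drain and gets signaled, or observes the cleared flag and is rejected.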
On 2021-04-09 12:39 p.m., Christian König wrote:
Am 09.04.21 um 17:42 schrieb Andrey Grodzovsky:
On 2021-04-09 3:01 a.m., Christian König wrote:
Am 09.04.21 um 08:53 schrieb Christian König:
Am 08.04.21 um 22:39 schrieb Andrey Grodzovsky:
[SNIP]
But inserting drm_dev_enter/exit on the highe
Am 09.04.21 um 17:42 schrieb Andrey Grodzovsky:
On 2021-04-09 3:01 a.m., Christian König wrote:
Am 09.04.21 um 08:53 schrieb Christian König:
Am 08.04.21 um 22:39 schrieb Andrey Grodzovsky:
[SNIP]
But inserting drm_dev_enter/exit on the highest level in drm_ioctl
is much less effort and less
On 2021-04-09 3:01 a.m., Christian König wrote:
Am 09.04.21 um 08:53 schrieb Christian König:
Am 08.04.21 um 22:39 schrieb Andrey Grodzovsky:
On 2021-04-08 2:58 p.m., Christian König wrote:
Am 08.04.21 um 18:08 schrieb Andrey Grodzovsky:
On 2021-04-08 4:32 a.m., Christian König wrote:
Am 0
Am 09.04.21 um 08:53 schrieb Christian König:
Am 08.04.21 um 22:39 schrieb Andrey Grodzovsky:
On 2021-04-08 2:58 p.m., Christian König wrote:
Am 08.04.21 um 18:08 schrieb Andrey Grodzovsky:
On 2021-04-08 4:32 a.m., Christian König wrote:
Am 08.04.21 um 10:22 schrieb Christian König:
[SNIP]
Am 08.04.21 um 22:39 schrieb Andrey Grodzovsky:
On 2021-04-08 2:58 p.m., Christian König wrote:
Am 08.04.21 um 18:08 schrieb Andrey Grodzovsky:
On 2021-04-08 4:32 a.m., Christian König wrote:
Am 08.04.21 um 10:22 schrieb Christian König:
[SNIP]
Beyond blocking all delayed works and schedu
On 2021-04-08 2:58 p.m., Christian König wrote:
Am 08.04.21 um 18:08 schrieb Andrey Grodzovsky:
On 2021-04-08 4:32 a.m., Christian König wrote:
Am 08.04.21 um 10:22 schrieb Christian König:
[SNIP]
Beyond blocking all delayed works and scheduler threads we also
need to guarantee no IOCTL
Am 08.04.21 um 18:08 schrieb Andrey Grodzovsky:
On 2021-04-08 4:32 a.m., Christian König wrote:
Am 08.04.21 um 10:22 schrieb Christian König:
[SNIP]
Beyond blocking all delayed works and scheduler threads we also
need to guarantee no IOCTL can access MMIO post device unplug OR
in flight I
t lock so
that we don't wait for a dma_fence or allocate memory, but
still protect the hardware from concurrent access and reset.
Regards,
Christian.
Best Regards
Dennis Li
*From:* Koenig, Christian
*Sent:* Thursday, March 18, 2021 4:59 PM
*To:* Li, Dennis ;
amd-gfx@lists.freedesktop.org
't wait for a dma_fence or allocate memory, but still
protect the hardware from concurrent access and reset.
Regards,
Christian.
Best Regards
Dennis Li
*From:* Koenig, Christian
*Sent:* Thursday, March 18, 2021 4:59 PM
*To:* Li, Dennis ;
amd-gfx@lists.freedesktop.org; Deucher, Alex
allocate memory, but still
protect the hardware from concurrent access and reset.
Regards,
Christian.
Best Regards
Dennis Li
*From:* Koenig, Christian
*Sent:* Thursday, March 18, 2021 4:59 PM
*To:* Li, Dennis ;
amd-gfx@lists.freedesktop.org; Deucher, Alexander
; Kuehling, Felix
; Zhang,
t the hardware from concurrent access and reset.
Regards,
Christian.
Best Regards
Dennis Li
*From:* Koenig, Christian
*Sent:* Thursday, March 18, 2021 4:59 PM
*To:* Li, Dennis ;
amd-gfx@lists.freedesktop.org; Deucher, Alexander
; Kuehling, Felix
; Zhang, Hawking
*Subject:* AW: [PATCH 0/4]
access and reset.
Regards,
Christian.
Best Regards
Dennis Li
*From:* Koenig, Christian
*Sent:* Thursday, March 18, 2021 4:59 PM
*To:* Li, Dennis ;
amd-gfx@lists.freedesktop.org; Deucher, Alexander
; Kuehling, Felix
; Zhang, Hawking
*Subject:* AW: [PATCH 0/4] Refine GPU recovery sequence to
rds
Dennis Li
*From:* Koenig, Christian
*Sent:* Thursday, March 18, 2021 4:59 PM
*To:* Li, Dennis ;
amd-gfx@lists.freedesktop.org; Deucher, Alexander
; Kuehling, Felix
; Zhang, Hawking
*Subject:* AW: [PATCH 0/4] Refine GPU recovery sequence to enhance
its stability
Exactly that's what
* Thursday, March 18, 2021 4:59 PM
*To:* Li, Dennis ;
amd-gfx@lists.freedesktop.org; Deucher, Alexander
; Kuehling, Felix
; Zhang, Hawking
*Subject:* AW: [PATCH 0/4] Refine GPU recovery sequence to enhance
its stability
Exactly that's what you don't seem to understand.
The GPU reset
@lists.freedesktop.org>; Deucher, Alexander
<alexander.deuc...@amd.com>;
Kuehling, Felix <felix.kuehl...@amd.com>; Zhang, Hawking
<hawking.zh...@amd.com>
*Subject:* RE: [PATCH 0/4] Refine GPU recovery sequence to enhance
its stability
>&g
@lists.freedesktop.org>; Deucher, Alexander
<alexander.deuc...@amd.com>;
Kuehling, Felix <felix.kuehl...@amd.com>; Zhang, Hawking
<hawking.zh...@amd.com>
*Subject:* RE: [PATCH 0/4] Refine GPU recovery sequence to enhance
its stability
>>> T
freedesktop.org;
Deucher, Alexander ; Kuehling, Felix
; Zhang, Hawking
*Subject:* AW: [PATCH 0/4] Refine GPU recovery sequence to enhance its
stability
Exactly that's what you don't seem to understand.
The GPU reset doesn't complete the fences we wait for. It only
completes the
find a better
method to solve this problem. Do you have any suggestions?
Best Regards
Dennis Li
From: Koenig, Christian
Sent: Thursday, March 18, 2021 4:59 PM
To: Li, Dennis ; amd-gfx@lists.freedesktop.org; Deucher,
Alexander ; Kuehling, Felix
; Zhang, Hawking
Subject: AW: [PATCH 0/4] Refine GPU
Christian ; amd-gfx@lists.freedesktop.org
; Deucher, Alexander
; Kuehling, Felix ; Zhang,
Hawking
Subject: RE: [PATCH 0/4] Refine GPU recovery sequence to enhance its stability
>>> Those two steps need to be exchanged or otherwise it is possible that new
>>> delayed work items etc are starte
:54 PM
To: Li, Dennis ; amd-gfx@lists.freedesktop.org; Deucher,
Alexander ; Kuehling, Felix
; Zhang, Hawking
Subject: Re: [PATCH 0/4] Refine GPU recovery sequence to enhance its stability
Am 18.03.21 um 08:23 schrieb Dennis Li:
> We have defined two variables in_gpu_reset and reset_sem in ad
Am 18.03.21 um 08:23 schrieb Dennis Li:
We have defined two variables, in_gpu_reset and reset_sem, in the adev object. The
atomic variable in_gpu_reset is used to keep the recovery thread from re-entering and
to make lower-level functions return earlier once recovery starts, but cannot
block the recovery thread w
We have defined two variables, in_gpu_reset and reset_sem, in the adev object. The
atomic variable in_gpu_reset is used to keep the recovery thread from re-entering and
to make lower-level functions return earlier once recovery starts, but cannot
block the recovery thread when it accesses hardware. The r/w semaphore
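The role described for in_gpu_reset can be sketched roughly as follows, using C11 atomics in userspace. The sketch_* names and the -1 return value are hypothetical and this is not the amdgpu code. The reset_sem side is deliberately left out because its description is cut off above.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for the adev object; only the flag is modeled. */
struct sketch_adev {
    atomic_int in_gpu_reset;
};

/*
 * Recovery entry: only the caller that flips the flag from 0 to 1 may
 * run recovery, so a second recovery attempt cannot re-enter.
 */
static bool sketch_reset_begin(struct sketch_adev *adev)
{
    int expected = 0;

    return atomic_compare_exchange_strong(&adev->in_gpu_reset,
                                          &expected, 1);
}

/* Recovery exit: allow hardware access and later recoveries again. */
static void sketch_reset_end(struct sketch_adev *adev)
{
    atomic_store(&adev->in_gpu_reset, 0);
}

/*
 * Lower-level function: return early (assumed error value -1) when a
 * recovery is in progress. The check is advisory only; it does not make
 * the recovery thread wait for accesses that already passed it.
 */
static int sketch_hw_access(struct sketch_adev *adev)
{
    if (atomic_load(&adev->in_gpu_reset))
        return -1;
    /* A real driver would touch the hardware here. */
    return 0;
}
```

The flag alone only lets callers bail out early; an access that passed the check just before recovery started can still overlap with it, which is the limitation the text attributes to in_gpu_reset.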