Hi Christian
>> Well exactly that's what I disagree on. A second timeout on the same job is
>> perfectly possible and desired, we just don't need to necessarily report it
>> once more
It is not a second timeout on the same job... if you disable gpu_recovery, it is
an endless stream of reports on the same job
Hi Dave, Daniel,
Fixes for 5.2:
- Fix a crash on gpu reset at driver load time
- ATPX hotplug fix for when the dGPU is powered off
- PLL fix for r5xx asics
- SR-IOV fixes
The following changes since commit 422449238e9853458283beffed77562d4b40a2fa:
Merge branch 'drm-next-5.2' of
This new drmcgrp resource limits the largest GEM buffer that can be
allocated in a cgroup.
Change-Id: I0830d56775568e1cf215b56cc892d5e7945e9f25
Signed-off-by: Kenny Ho
---
include/linux/cgroup_drm.h | 2 ++
kernel/cgroup/drm.c        | 59 ++
2 files
This is a follow-up to the RFC I made last November to introduce a cgroup
controller for the GPU/DRM subsystem [a]. The goal is to be able to provide
resource management for GPU resources using things like containers. The cover
letter from v1 is copied below for reference.
Usage examples:
//
Change-Id: I6830d3990f63f0c13abeba29b1d330cf28882831
Signed-off-by: Kenny Ho
---
include/linux/cgroup_drm.h    | 32 ++
include/linux/cgroup_subsys.h |  4
init/Kconfig                  |  5 +
kernel/cgroup/Makefile        |  1 +
kernel/cgroup/drm.c           |
Change-Id: I3750fc657b956b52750a36cb303c54fa6a265b44
Signed-off-by: Kenny Ho
---
drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c | 4
1 file changed, 4 insertions(+)
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c
b/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c
index da7b4fe8ade3..2568fd730161
The DRM resources being measured and limited here are GEM buffer
objects. User applications allocate and free these buffers. In
addition, a process can allocate a buffer and share it with another
process. The consumer of a shared buffer can also outlive the
allocator of the buffer.
For the
Change-Id: I908ee6975ea0585e4c30eafde4599f87094d8c65
Signed-off-by: Kenny Ho
---
include/drm/drm_cgroup.h | 24
include/linux/cgroup_drm.h | 10
kernel/cgroup/drm.c        | 118 -
3 files changed, 151 insertions(+), 1 deletion(-)
create
Firmware versions can be found as separate sysfs files at:
/sys/class/drm/cardX/device/fw_version (where X is the card number)
The firmware versions are displayed in hexadecimal.
v2: Moved sysfs files to subfolder
Change-Id: I10cae4c0ca6f1b6a9ced07da143426e1d011e203
Signed-off-by: Ori Messinger
Typo in the patch description. s/interafce/interface/
On 5/9/19 10:24 AM, Alex Deucher wrote:
> On Thu, May 9, 2019 at 6:31 AM Pan, Xinhui wrote:
>> add badpages node.
>> it will output badpages list in format
>> gpu pfn : gpu page size : flags
>>
>> example
>> 0x : 0x1000 : R
>>
On Wed, May 8, 2019 at 4:47 PM Zhu, James wrote:
>
> During the S3 test, when the system wakes up and resumes, the ras interface
> is already allocated. Move the workaround before ras jumps to the resume
> step in gfx_v9_0_ecc_late_init, and make sure the workaround is applied
> during resume. Also remove unused
On Thu, May 9, 2019 at 6:38 AM Michel Dänzer wrote:
>
> From: Michel Dänzer
>
> Even if glamor_gbm_bo_from_pixmap / glamor_fd_from_pixmap themselves
> don't trigger any drawing, there could already be unflushed drawing to
> the pixmap whose storage we share with a client.
>
> (Ported from amdgpu
On Thu, May 9, 2019 at 6:31 AM Pan, Xinhui wrote:
>
> add badpages node.
> it will output badpages list in format
> gpu pfn : gpu page size : flags
>
> example
> 0x : 0x1000 : R
> 0x0001 : 0x1000 : R
> 0x0002 : 0x1000 : R
> 0x0003 : 0x1000 : R
> 0x0004 :
Linux kernel 5.0, AMD RX 580 GPU card.
One or more graphics programs become uninterruptible (STAT D) when 64
sub-window graphics programs run concurrently.
We have to reboot the machine to release the uninterruptible (STAT D) programs.
We have this kernel log:
kernel: RenderThread D 0 393786 337242
On Wed, May 8, 2019 at 10:47 AM Wang Hai wrote:
>
> Fix the following sparse warnings:
>
> drivers/gpu/drm/amd/amdgpu/../display/dc/dce120/dce120_resource.c:483:21:
> warning: symbol 'dce120_clock_source_create' was not declared. Should it be
> static?
>
Am 09.05.19 um 16:23 schrieb Alex Deucher:
In case we need to use them for GPU reset prior to initializing the
asic. Fixes a crash if the driver attempts to reset the GPU at driver
load time.
Signed-off-by: Alex Deucher
Acked-by: Christian König
---
drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c
In case we need to use them for GPU reset prior to initializing the
asic. Fixes a crash if the driver attempts to reset the GPU at driver
load time.
Signed-off-by: Alex Deucher
---
drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c | 19 ++-
1 file changed, 10 insertions(+), 9 deletions(-)
Thank you Alex! It does fix the crash. (GPU post failed following that but at
least it exits gracefully.)
Regards,
Amber
On 2019-05-08 10:48 p.m., Deucher, Alexander wrote:
The attached patch should fix it.
Alex
From: amd-gfx
On 2019-05-07 5:52 p.m., Kuehling, Felix wrote:
> Use unsigned long for number of pages.
>
> Check that pfns are valid after hmm_vma_fault. If they are not,
> return an error instead of continuing with invalid page pointers and
> PTEs.
>
> Signed-off-by: Felix Kuehling
Reviewed-by: Philip Yang
Am 09.05.19 um 14:49 schrieb Liu, Monk:
Christian
I believe even you would agree that repeatedly reporting a TMO for the same IB is
ugly (one needs to pass "gpu_recovery=0" as an option); I can also argue that this
is a bad design ...
Well exactly that's what I disagree on. A second timeout on
Christian
I believe even you would agree that repeatedly reporting a TMO for the same IB is
ugly (one needs to pass "gpu_recovery=0" as an option); I can also argue that
this is a bad design ...
Besides, on the technical side you didn't persuade me to believe there will be
bad things upcoming with my
Well, putting the other drivers aside for a moment, I would also say that
it is bad design to not restart the timeout. So even then I would not
review that patch.
The question is rather: what are you actually trying to do, and why don't you
want to change your design?
Designs should be discussed on
Hi Christian
I saw your patch commit "19067e", but it actually explained nothing about why
this timer needs to be restarted, since the handler already restarts this timer
inside ...
and sorry that I don't have the historical background to understand your previous
reply, please check inline:
>>the
>> And not only our driver is relying on that but also the ARM drivers. See the
>> history of that change.
That's the issue I didn't see earlier: if the ARM drivers don't restart the
timer in their job_timeout(), it is a problem
But I don't want to change my plan on that feature, so can you give
Sorry for the confusion.
What I wanted to say is that we don't necessarily need to report the same
job twice in the logs.
But as long as the job is still running on the hardware we should also
keep the timeout running as well.
Christian.
Am 09.05.19 um 13:06 schrieb Liu, Monk:
> Christian
>
>
Hi Monk,
the timeout handler might only kill waves until we see some progress
again and then continue.
E.g. we don't necessarily kill the whole job, but maybe just one
drawing/computing operation. Or the same job is submitted another time.
etc...
As long as there is a job running on the
Hi Monk,
the timeout handler might only kill waves until we see some progress
again and then continue.
E.g. we don't necessarily kill the whole job, but maybe just one
drawing/computing operation. Or the same job is submitted another time.
As long as there is a job running on the hardware
Christian
I think your previous reply > " Well, NAK. We don't need multiple timeout
reports, but we really need to restart the timeout counter after handling it."
just looks paradoxical compared with what you say now > " ok you don't seem to understand:
It is intentional that the same job times out
Hi Monk,
ok you don't seem to understand: It is intentional that the same job
times out multiple times! So we can't really change anything here.
What we can do is instead of sending a signal (which is not a good idea
from the timeout handler anyway) we can start a background script to do
the
Hah ... I see, but my requirement cannot be satisfied with the current design:
What I need to do is arm a signal in job_timeout() to notify a USER SPACE
daemon app, which finally leverages "UMR" to dump/retrieve as much SW/HW
information related to the TMO/hang as possible. To make it
From: Michel Dänzer
Even if glamor_gbm_bo_from_pixmap / glamor_fd_from_pixmap themselves
don't trigger any drawing, there could already be unflushed drawing to
the pixmap whose storage we share with a client.
(Ported from amdgpu commit 4b17533fcb30842caf0035ba593b7d986520cc85)
Signed-off-by:
add badpages node.
it will output the badpages list in the format
gpu pfn : gpu page size : flags
example
0x : 0x1000 : R
0x0001 : 0x1000 : R
0x0002 : 0x1000 : R
0x0003 : 0x1000 : R
0x0004 : 0x1000 : R
0x0005 : 0x1000 : R
0x0006 : 0x1000 : R
drm_sched_start() is not necessarily called from the timeout handler.
If a soft recovery is sufficient, we just continue without a complete reset.
Christian.
Am 09.05.19 um 12:25 schrieb Liu, Monk:
Christian
Check "drm_sched_start" which is invoked from gpu_recover() , there is a
Christian
Check "drm_sched_start" which is invoked from gpu_recover() , there is a
"drm_sched_start_timeout()" in the tail
/Monk
-----Original Message-----
From: Christian König
Sent: Thursday, May 9, 2019 3:18 PM
To: Liu, Monk ; amd-gfx@lists.freedesktop.org
Subject: Re: [PATCH] drm/sched:
Am 09.05.19 um 06:31 schrieb Monk Liu:
we don't need duplicated timeout error messages for the same IB reported
endlessly; one report per timed-out IB is enough
Well, NAK. We don't need multiple timeout reports, but we really need to
restart the timeout counter after handling it.
Otherwise we will
Am 08.05.19 um 21:02 schrieb Liu, Leo:
On 5/8/19 1:45 PM, Alex Deucher wrote:
[CAUTION: External Email]
On Wed, May 8, 2019 at 11:51 AM Liu, Leo wrote:
Since the check is already done with the command submission check
Missing signed-off-by.
patches 1-5 are:
Reviewed-by: Alex Deucher
As for this
Hi Kent,
no strong opinion on that, and I agree that from a housekeeping point of
view we should probably create a separate directory for the files.
Christian.
Am 08.05.19 um 19:11 schrieb Russell, Kent:
> Hi Christian,
>
> Are you worried about him putting them in a fw_version subfolder like the