RE: [PATCH] drm/sched: fix the duplicated TMO message for one IB

2019-05-09 Thread Liu, Monk
Hi Christian >> Well exactly that's what I disagree on. A second timeout on the same job is >> perfectly possible and desired, we just don't need to necessarily report it >> once more It is not a second timeout on the same job...if you disable gpu_recovery it is endless report on the same job

[pull] amdgpu, radeon drm-next-5.2

2019-05-09 Thread Alex Deucher
Hi Dave, Daniel, Fixes for 5.2: - Fix a crash on gpu reset at driver load time - ATPX hotplug fix for when the dGPU is powered off - PLL fix for r5xx asics - SR-IOV fixes The following changes since commit 422449238e9853458283beffed77562d4b40a2fa: Merge branch 'drm-next-5.2' of

[RFC PATCH v2 5/5] drm, cgroup: Add peak GEM buffer allocation limit

2019-05-09 Thread Kenny Ho
This new drmcgrp resource limits the largest GEM buffer that can be allocated in a cgroup. Change-Id: I0830d56775568e1cf215b56cc892d5e7945e9f25 Signed-off-by: Kenny Ho --- include/linux/cgroup_drm.h | 2 ++ kernel/cgroup/drm.c| 59 ++ 2 files

[RFC PATCH v2 0/5] new cgroup controller for gpu/drm subsystem

2019-05-09 Thread Kenny Ho
This is a follow-up to the RFC I made last November to introduce a cgroup controller for the GPU/DRM subsystem [a]. The goal is to be able to provide resource management to GPU resources using things like containers. The cover letter from v1 is copied below for reference. Usage examples: //

[RFC PATCH v2 1/5] cgroup: Introduce cgroup for drm subsystem

2019-05-09 Thread Kenny Ho
Change-Id: I6830d3990f63f0c13abeba29b1d330cf28882831 Signed-off-by: Kenny Ho --- include/linux/cgroup_drm.h| 32 ++ include/linux/cgroup_subsys.h | 4 init/Kconfig | 5 + kernel/cgroup/Makefile| 1 + kernel/cgroup/drm.c |

[RFC PATCH v2 3/5] drm/amdgpu: Register AMD devices for DRM cgroup

2019-05-09 Thread Kenny Ho
Change-Id: I3750fc657b956b52750a36cb303c54fa6a265b44 Signed-off-by: Kenny Ho --- drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c | 4 1 file changed, 4 insertions(+) diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c index da7b4fe8ade3..2568fd730161

[RFC PATCH v2 4/5] drm, cgroup: Add total GEM buffer allocation limit

2019-05-09 Thread Kenny Ho
The drm resource being measured and limited here is the GEM buffer objects. User applications allocate and free these buffers. In addition, a process can allocate a buffer and share it with another process. The consumer of a shared buffer can also outlive the allocator of the buffer. For the
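The accounting idea in this snippet can be illustrated with a small user-space model. This is only a sketch: the class and method names are hypothetical, the kernel code is C, and it assumes (the snippet is cut off before saying so) that a buffer stays charged to the cgroup that allocated it until it is finally freed, even if it has been shared.

```python
class GemAccounting:
    """Toy model of a per-cgroup total GEM allocation limit (hypothetical names)."""

    def __init__(self, total_limit_bytes: int) -> None:
        self.total_limit_bytes = total_limit_bytes
        self.charged_bytes = 0   # bytes currently charged to this cgroup
        self._buffers = {}       # buffer handle -> size in bytes
        self._next_handle = 1

    def allocate(self, size_bytes: int) -> int:
        """Charge a new buffer, or raise if the total limit would be exceeded."""
        if size_bytes <= 0:
            raise ValueError("size_bytes must be positive")
        if self.charged_bytes + size_bytes > self.total_limit_bytes:
            raise MemoryError("total GEM buffer limit exceeded")
        handle = self._next_handle
        self._next_handle += 1
        self._buffers[handle] = size_bytes
        self.charged_bytes += size_bytes
        return handle

    def free(self, handle: int) -> None:
        """Uncharge a buffer once it is finally released."""
        self.charged_bytes -= self._buffers.pop(handle)
```

Because the charge is dropped only in `free()`, a consumer that outlives the allocator keeps the allocator's cgroup charged in this model; whether the patch behaves that way is not visible in the snippet.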

[RFC PATCH v2 2/5] cgroup: Add mechanism to register DRM devices

2019-05-09 Thread Kenny Ho
Change-Id: I908ee6975ea0585e4c30eafde4599f87094d8c65 Signed-off-by: Kenny Ho --- include/drm/drm_cgroup.h | 24 include/linux/cgroup_drm.h | 10 kernel/cgroup/drm.c| 118 - 3 files changed, 151 insertions(+), 1 deletion(-) create

[PATCH] drm/amdgpu: Report firmware versions with sysfs v2

2019-05-09 Thread Messinger, Ori
Firmware versions can be found as separate sysfs files at: /sys/class/drm/cardX/device/fw_version (where X is the card number) The firmware versions are displayed in hexadecimal. v2: Moved sysfs files to subfolder Change-Id: I10cae4c0ca6f1b6a9ced07da143426e1d011e203 Signed-off-by: Ori Messinger
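A user-space reader for these files might look like the sketch below. It assumes that `fw_version` is a directory (per the v2 note) holding one hexadecimal value per file; the helper name `read_fw_versions` and that per-file layout are assumptions, not taken from the patch.

```python
from pathlib import Path


def read_fw_versions(fw_dir: Path) -> dict:
    """Read every regular file in fw_dir and parse its content as a hex integer."""
    versions = {}
    for entry in sorted(fw_dir.iterdir()):
        if not entry.is_file():
            continue
        text = entry.read_text().strip()
        try:
            versions[entry.name] = int(text, 16)  # accepts "0x1a" and "1a"
        except ValueError:
            continue  # skip files that do not hold a hex number
    return versions
```

For example, `read_fw_versions(Path("/sys/class/drm/card0/device/fw_version"))` would read the directory for card number 0, if it exists on the system.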

Re: [PATCH v3] drm/amdgpu: add badpages sysfs interafce

2019-05-09 Thread William Lewis
Typo in the patch description.  s/interafce/interface/ On 5/9/19 10:24 AM, Alex Deucher wrote: > On Thu, May 9, 2019 at 6:31 AM Pan, Xinhui wrote: >> add badpages node. >> it will output badpages list in format >> gpu pfn : gpu page size : flags >> >> example >> 0x : 0x1000 : R >>

Re: [PATCH] drm/amdgpu: Fix S3 test issue

2019-05-09 Thread Alex Deucher
On Wed, May 8, 2019 at 4:47 PM Zhu, James wrote: > > During S3 test, when the system wakes up and resumes, the ras interface > is already allocated. Move the workaround before ras jumps to the resume > step in gfx_v9_0_ecc_late_init, and make sure the workaround is applied > during resume. Also remove unused

Re: [PATCH xf86-video-ati] dri3: Always flush glamor before sharing pixmap storage with clients

2019-05-09 Thread Alex Deucher
On Thu, May 9, 2019 at 6:38 AM Michel Dänzer wrote: > > From: Michel Dänzer > > Even if glamor_gbm_bo_from_pixmap / glamor_fd_from_pixmap themselves > don't trigger any drawing, there could already be unflushed drawing to > the pixmap whose storage we share with a client. > > (Ported from amdgpu

Re: [PATCH v3] drm/amdgpu: add badpages sysfs interafce

2019-05-09 Thread Alex Deucher
On Thu, May 9, 2019 at 6:31 AM Pan, Xinhui wrote: > > add badpages node. > it will output badpages list in format > gpu pfn : gpu page size : flags > > example > 0x : 0x1000 : R > 0x0001 : 0x1000 : R > 0x0002 : 0x1000 : R > 0x0003 : 0x1000 : R > 0x0004 :

program become uninterrupt(STAT D) when run 64 graphics program in sub window

2019-05-09 Thread wormwang
Linux kernel 5.0, AMD RX 580 GPU card. One or more graphics programs enter uninterruptible sleep (STAT D) when 64 sub-window graphics programs run concurrently. We have to reboot the machine to release the programs stuck in uninterruptible sleep (STAT D). We have the following kernel log:  kernel: RenderThread D 0 393786 337242

Re: [PATCH] drm/amd/display: Make some functions static

2019-05-09 Thread Alex Deucher
On Wed, May 8, 2019 at 10:47 AM Wang Hai wrote: > > Fix the following sparse warnings: > > drivers/gpu/drm/amd/amdgpu/../display/dc/dce120/dce120_resource.c:483:21: > warning: symbol 'dce120_clock_source_create' was not declared. Should it be > static? >

Re: [PATCH] drm/amdgpu/psp: move psp version specific function pointers to early_init

2019-05-09 Thread Christian König
On 09.05.19 at 16:23, Alex Deucher wrote: In case we need to use them for GPU reset prior to initializing the asic. Fixes a crash if the driver attempts to reset the GPU at driver load time. Signed-off-by: Alex Deucher Acked-by: Christian König --- drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c

[PATCH] drm/amdgpu/psp: move psp version specific function pointers to early_init

2019-05-09 Thread Alex Deucher
In case we need to use them for GPU reset prior to initializing the asic. Fixes a crash if the driver attempts to reset the GPU at driver load time. Signed-off-by: Alex Deucher --- drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c | 19 ++- 1 file changed, 10 insertions(+), 9 deletions(-)

Re: Kernel crash at reloading amdgpu

2019-05-09 Thread Lin, Amber
Thank you Alex! It does fix the crash. (GPU post failed following that but at least it exits gracefully.) Regards, Amber On 2019-05-08 10:48 p.m., Deucher, Alexander wrote: The attached patch should fix it. Alex From: amd-gfx

Re: [PATCH 1/1] drm/amdgpu: Improve error handling for HMM

2019-05-09 Thread Yang, Philip
On 2019-05-07 5:52 p.m., Kuehling, Felix wrote: > Use unsigned long for number of pages. > > Check that pfns are valid after hmm_vma_fault. If they are not, > return an error instead of continuing with invalid page pointers and > PTEs. > > Signed-off-by: Felix Kuehling Reviewed-by: Philip Yang

Re: [PATCH] drm/sched: fix the duplicated TMO message for one IB

2019-05-09 Thread Christian König
On 09.05.19 at 14:49, Liu, Monk wrote: Christian I believe even you would agree that repeatedly reporting a TMO for the same IB is ugly (one needs to set "gpu_recovery=0" as an option), I can also argue with you that this is a bad design ... Well exactly that's what I disagree on. A second timeout on

RE: [PATCH] drm/sched: fix the duplicated TMO message for one IB

2019-05-09 Thread Liu, Monk
Christian I believe even you would agree that repeatedly reporting a TMO for the same IB is ugly (one needs to set "gpu_recovery=0" as an option), I can also argue with you that this is a bad design ... Besides, you haven't technically persuaded me that bad things will come with my

Re: [PATCH] drm/sched: fix the duplicated TMO message for one IB

2019-05-09 Thread Koenig, Christian
Well putting the other drivers aside for a moment, I would also say that it is bad design to not restart the timeout. So even then I would not review that patch. Question is rather what are you actually trying to do and why don't you want to change your design? Designs should be discussed on

RE: [PATCH] drm/sched: fix the duplicated TMO message for one IB

2019-05-09 Thread Liu, Monk
Hi Christian I saw your patch commit "19067e", it actually explained nothing about why this timer needs to be restarted since the handler already restarts this timer inside ... and sorry that I don't have the history background to understand your previous reply, please check inline: >>the

RE: [PATCH] drm/sched: fix the duplicated TMO message for one IB

2019-05-09 Thread Liu, Monk
>> And not only our driver is relying on that but also the ARM drivers. See the >> history of that change. That's the issue I didn't see earlier; if the ARM drivers don't restart the timer in their job_timeout() it is a problem. But I don't want to change my plan on that feature, so can you give

Re: [PATCH] drm/sched: fix the duplicated TMO message for one IB

2019-05-09 Thread Koenig, Christian
Sorry for the confusion. What I wanted to say is that we don't necessarily need to report the same job twice in the logs. But as long as the job is still running on the hardware we should keep the timeout running as well. Christian. On 09.05.19 at 13:06, Liu, Monk wrote: > Christian > >
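The position in this reply (report a job once, but keep the timeout armed while it runs) can be modelled in a few lines. This is a toy sketch only, not the drm/sched code; `TimeoutTracker`, `on_timeout` and the counter are hypothetical names used for illustration.

```python
class TimeoutTracker:
    """Toy model: re-arm the timeout every time, but log each job only once."""

    def __init__(self) -> None:
        self.reported_jobs = set()  # ids of jobs already reported
        self.log = []               # messages that would go to the log
        self.rearm_count = 0        # how many times the timer was restarted

    def on_timeout(self, job_id: int, still_running: bool) -> None:
        # Report the job only the first time it times out.
        if job_id not in self.reported_jobs:
            self.reported_jobs.add(job_id)
            self.log.append(f"job {job_id} timed out")
        # Keep the timeout armed as long as the job is on the hardware.
        if still_running:
            self.rearm_count += 1
```

Calling `on_timeout(7, still_running=True)` twice leaves a single log entry for job 7 while the timer is restarted both times.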

Re: [PATCH] drm/sched: fix the duplicated TMO message for one IB

2019-05-09 Thread Christian König
Hi Monk, the timeout handler might only kill waves until we see some progress again and then continue. E.g. we don't necessarily kill the whole job, but maybe just one drawing/computing operation. Or the same job is submitted another time. etc... As long as there is a job running on the

Re: [PATCH] drm/sched: fix the duplicated TMO message for one IB

2019-05-09 Thread Christian König
Hi Monk, the timeout handler might only kill waves until we see some progress again and then continue. E.g. we don't necessarily kill the whole job, but maybe just one drawing/computing operation. Or the same job is submitted another time. As long as there is a job running on the hardware

RE: [PATCH] drm/sched: fix the duplicated TMO message for one IB

2019-05-09 Thread Liu, Monk
Christian I think your previous reply > " Well, NAK. We don't need multiple timeout reports, but we really need to restart the timeout counter after handling it." seems to contradict what you say now > " ok you don't seem to understand: It is intentional that the same job times out

Re: [PATCH] drm/sched: fix the duplicated TMO message for one IB

2019-05-09 Thread Koenig, Christian
Hi Monk, ok you don't seem to understand: It is intentional that the same job times out multiple times! So we can't really change anything here. What we can do is instead of sending a signal (which is not a good idea from the timeout handler anyway) we can start a background script to do the

RE: [PATCH] drm/sched: fix the duplicated TMO message for one IB

2019-05-09 Thread Liu, Monk
Hah ... I see, but my requirement cannot be satisfied with the current design: What I need to do is arm a signal in job_timeout() to notify a USER SPACE daemon app, which finally leverages "UMR" to DUMP/retrieve as much sw/hw information related to the TMO/hang as possible. To make it

[PATCH xf86-video-ati] dri3: Always flush glamor before sharing pixmap storage with clients

2019-05-09 Thread Michel Dänzer
From: Michel Dänzer Even if glamor_gbm_bo_from_pixmap / glamor_fd_from_pixmap themselves don't trigger any drawing, there could already be unflushed drawing to the pixmap whose storage we share with a client. (Ported from amdgpu commit 4b17533fcb30842caf0035ba593b7d986520cc85) Signed-off-by:

[PATCH v3] drm/amdgpu: add badpages sysfs interafce

2019-05-09 Thread Pan, Xinhui
add badpages node. it will output badpages list in format gpu pfn : gpu page size : flags example 0x : 0x1000 : R 0x0001 : 0x1000 : R 0x0002 : 0x1000 : R 0x0003 : 0x1000 : R 0x0004 : 0x1000 : R 0x0005 : 0x1000 : R 0x0006 : 0x1000 : R
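A consumer of this node could parse the `gpu pfn : gpu page size : flags` lines as sketched below. The function name `parse_badpages` is hypothetical, and the sketch assumes both numeric fields are hexadecimal, as the `0x` prefixes in the example suggest; the flags field is passed through as-is.

```python
def parse_badpages(text: str) -> list:
    """Parse 'gpu pfn : gpu page size : flags' lines into (pfn, size, flags) tuples."""
    pages = []
    for line in text.splitlines():
        fields = [f.strip() for f in line.split(":")]
        if len(fields) != 3:
            continue  # skip blank or malformed lines
        try:
            pfn = int(fields[0], 16)
            size = int(fields[1], 16)
        except ValueError:
            continue  # skip lines whose numbers are not valid hex
        pages.append((pfn, size, fields[2]))
    return pages
```

The function takes the file content as a string, so it does not depend on where the node ends up in sysfs.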

Re: [PATCH] drm/sched: fix the duplicated TMO message for one IB

2019-05-09 Thread Christian König
drm_sched_start() is not necessarily called from the timeout handler. If a soft recovery is sufficient, we just continue without a complete reset. Christian. On 09.05.19 at 12:25, Liu, Monk wrote: Christian Check "drm_sched_start" which is invoked from gpu_recover() , there is a

RE: [PATCH] drm/sched: fix the duplicated TMO message for one IB

2019-05-09 Thread Liu, Monk
Christian Check "drm_sched_start" which is invoked from gpu_recover() , there is a "drm_sched_start_timeout()" in the tail /Monk -Original Message- From: Christian König Sent: Thursday, May 9, 2019 3:18 PM To: Liu, Monk ; amd-gfx@lists.freedesktop.org Subject: Re: [PATCH] drm/sched:

Re: [PATCH] drm/sched: fix the duplicated TMO message for one IB

2019-05-09 Thread Christian König
On 09.05.19 at 06:31, Monk Liu wrote: we don't need duplicated IB's timeout error message reported endlessly, just one report per timedout IB is enough Well, NAK. We don't need multiple timeout reports, but we really need to restart the timeout counter after handling it. Otherwise we will

Re: [PATCH 6/6] drm/amdgpu: remove MM engine related WARN_ON for user fence

2019-05-09 Thread Christian König
On 08.05.19 at 21:02, Liu, Leo wrote: On 5/8/19 1:45 PM, Alex Deucher wrote: [CAUTION: External Email] On Wed, May 8, 2019 at 11:51 AM Liu, Leo wrote: Since the check is already done with the command submission check Missing signed-off-by. patches 1-5 are: Reviewed-by: Alex Deucher As for this

Re: [PATCH] drm/amdgpu: Report firmware versions with sysfs

2019-05-09 Thread Koenig, Christian
Hi Kent, no strong opinion on that and I agree that from a housekeeping point of view we should probably create a separate directory for the files. Christian. On 08.05.19 at 19:11, Russell, Kent wrote: > Hi Christian, > > Are you worried about him putting them in a fw_version subfolder like the