RE: TDR and VRAM lost handling in KMD (v2)

2017-10-12 Thread Liu, Monk
>That sounds sane, but unfortunately might not be possible with the existing 
>IOCTL. Keep in mind that we need to keep backward compatibility here.

unfortunately the current scheme of amdgpu_ctx_query() won't work with the TDR 
feature, which aims to support Vulkan/Mesa/closed-source OGL/RADV ...
Its enumeration is too limited for the Mesa implementation ...

Do you have a good idea that both keeps the compatibility and gives flexibility?
Looks like we need to add a new amdgpu_ctx_query_2() interface ...

· A new IOCTL added for context:

void amdgpu_ctx_reinit()
{
	ctx->vram_lost_counter = adev->vram_lost_counter;
	ctx->reset_counter = adev->reset_counter;
}

Mhm, is there any advantage over just creating a new context?

[ML] Sorry, this function is wrong; here is my original idea:

Mesa can create a new ctx based on an old one, like:

Create gl-ctx1,
Create BO-A under gl-ctx1 ...
VRAM LOST
Create gl-ctx2 from gl-ctx1 (share list; I'm not familiar with the UMD side, 
David Mao and Nicolai can give input)
Create BO-B under gl-ctx2 ...
Submit a job on gl-ctx2, but it can refer to BO-A.

With our design, the kernel won't block submits from context2 (backing gl-ctx2), 
since its vram_lost_counter equals the latest adev copy.
But gl-ctx2 is a clone of gl-ctx1; the only difference is that its 
vram_lost/gpu_reset counters are updated to the latest values.
So logically we should also block submits from gl-ctx2 (mapping to kernel 
context2), and we fail to do so ...

That's why I want to add a new "amdgpu_ctx_clone", which should work like:

int amdgpu_ctx_clone(struct context *original, struct context *new)
{
	new->vram_lost_counter = original->vram_lost_counter;
	new->gpu_reset_counter = original->gpu_reset_counter;
	return 0;
}


BR Monk


From: Koenig, Christian
Sent: October 12, 2017 19:50
To: Liu, Monk ; Haehnle, Nicolai ; 
Michel Dänzer ; Olsak, Marek ; 
Deucher, Alexander ; Zhou, David(ChunMing) 
; Mao, David 
Cc: Ramirez, Alejandro ; 
amd-gfx@lists.freedesktop.org; Filipas, Mario ; Ding, 
Pixel ; Li, Bingley ; Jiang, Jerry (SW) 

Subject: Re: TDR and VRAM lost handling in KMD (v2)

On 2017-10-12 13:37, Liu, Monk wrote:
Hi team

Very good, many policies and implementation details are agreed on; it looks 
like we only have some arguments about amdgpu_ctx_query(). Well, I'm also 
confused by its current implementation, ☹

First, I want to know if you guys agree that we don't update 
ctx->reset_counter in amdgpu_ctx_query(), because I want to make the query 
result always consistent for a given context.

That sounds like a good idea to me, but I'm not sure if it won't break 
userspace (I don't think so). Nicolai or Marek need to comment.



Second, I want to say that the KERNEL shouldn't use terms from Mesa or OGL or 
Vulkan; e.g. the kernel shouldn't use AMDGPU_INNOCENT_RESET to map to 
GL_INNOCENT_RESET_ARB, etc...
Because that way the kernel would be tied to certain UMDs, so I suggest we 
rename the context status entirely, and let each UMD have its own way to map 
the kernel context's result to gl-context/vk-context/etc...

Yes, completely agree.



The kernel should only provide the below *FLAG bits* on a given context:

· #define AMDGPU_CTX_STATE_GUILTY   0x1 // as soon as TDR detects a job hang, 
KMD marks the context behind that job as "guilty"
· #define AMDGPU_CTX_STATE_VRAMLOST 0x2 // as soon as a VRAM lost event hits 
after this context was created, we mark the context "VRAM_LOST", so the UMD 
can assume all BOs under this context may have lost their content; since the 
kernel has no relationship between contexts and BOs, it is the UMD's call to 
judge whether a BO is considered "VRAM lost" or not
· #define AMDGPU_CTX_STATE_RESET    0x4 // as soon as a GPU reset occurs 
after context creation, this flag shall be set

That sounds sane, but unfortunately might not be possible with the existing 
IOCTL. Keep in mind that we need to keep backward compatibility here.



Sample code:

int amdgpu_ctx_query(struct amdgpu_ctx_query_parm *out, ...)
{
	if (ctx->vram_lost_counter != adev->vram_lost_counter)
		out->status |= AMDGPU_CTX_STATE_VRAMLOST;

	if (ctx->reset_counter != adev->reset_counter) {
		out->status |= AMDGPU_CTX_STATE_RESET;

		if (ctx->guilty == TRUE)
			out->status |= AMDGPU_CTX_STATE_GUILTY;
	}

	return 0;
}

For the UMD, finding "out.status == 0" means no GPU reset has ever occurred 
and the context is completely regular.
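To make the division of labor concrete, here is a hypothetical UMD-side mapping sketch. The function name and the returned strings are invented for illustration (the real GL enums are GL_GUILTY_CONTEXT_RESET_ARB and friends), and the flag values are taken with RESET as a distinct bit (0x4) so GUILTY|VRAMLOST can never collide with it:

```c
#include <assert.h>
#include <string.h>

#define AMDGPU_CTX_STATE_GUILTY   0x1u
#define AMDGPU_CTX_STATE_VRAMLOST 0x2u
#define AMDGPU_CTX_STATE_RESET    0x4u

/* Hypothetical UMD policy: each UMD applies its own mapping from the
 * kernel flag bits to its API's reset-status enumeration. */
static const char *gl_reset_status(unsigned int status)
{
	if (status == 0)
		return "GL_NO_ERROR";               /* no reset ever occurred */
	if (status & AMDGPU_CTX_STATE_GUILTY)
		return "GL_GUILTY_CONTEXT_RESET";   /* this context hung the GPU */
	/* reset or VRAM loss happened, but this context was not at fault */
	return "GL_INNOCENT_CONTEXT_RESET";
}
```

This keeps the kernel API UMD-agnostic, as argued above: Vulkan would apply a different mapping from the same three bits.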

· A new IOCTL added for context:

void amdgpu_ctx_reinit()
{
	ctx->vram_lost_counter = adev->vram_lost_counter;
	ctx->reset_counter = adev->reset_counter;
}

Mhm, is there any advantage over just creating a new context?

Regards,
Christian.




if UMD decide *not* to 

[amd-staging-drm-next] compilation error with latest commit #1b006d838f78

2017-10-12 Thread Dieter Nützel

Hello,

next (regression) compilation error:

drivers/gpu/drm/amd/amdgpu/../display/dc/core/dc_resource.c: In function 
‘resource_map_pool_resources’:
drivers/gpu/drm/amd/amdgpu/../display/dc/core/dc_resource.c:1688:14: 
error: implicit declaration of function ‘acquire_first_split_pipe’; did 
you mean ‘acquire_first_free_pipe’? 
[-Werror=implicit-function-declaration]

   pipe_idx = acquire_first_split_pipe(>res_ctx, pool, stream);
  ^~~~
  acquire_first_free_pipe


It is wrongly (?) guarded behind:

#if defined(CONFIG_DRM_AMD_DC_DCN1_0)
static int acquire_first_split_pipe(
struct resource_context *res_ctx,
const struct resource_pool *pool,
struct dc_stream_state *stream)
[snip]
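One conventional fix for this class of build break, sketched below under the assumption that the call site only takes the split-pipe path when DCN 1.0 hardware is present, is to pair the guarded definition with a stub so the symbol always exists:

```c
#include <assert.h>

/* Forward declarations stand in for the real dc types. */
struct resource_context;
struct resource_pool;
struct dc_stream_state;

#if defined(CONFIG_DRM_AMD_DC_DCN1_0)
static int acquire_first_split_pipe(struct resource_context *res_ctx,
				    const struct resource_pool *pool,
				    struct dc_stream_state *stream)
{
	(void)res_ctx; (void)pool; (void)stream;
	/* ... real implementation, compiled only with DCN 1.0 support ... */
	return 0;
}
#else
static inline int acquire_first_split_pipe(struct resource_context *res_ctx,
					   const struct resource_pool *pool,
					   struct dc_stream_state *stream)
{
	(void)res_ctx; (void)pool; (void)stream;
	return -1; /* no split pipe available without DCN 1.0 */
}
#endif
```

This is only a sketch of the pattern; the alternative is to put the matching guard around the call site itself.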

Thanks,
Dieter
___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx


[PATCH 3/6] drm/amd/display: Unify DRM state variable namings.

2017-10-12 Thread sunpeng.li
From: "Leo (Sunpeng) Li" 

Use new_*_state and old_*_state for their respective new/old DRM object
states.

Signed-off-by: Leo (Sunpeng) Li 
---
 drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c | 80 +++
 1 file changed, 40 insertions(+), 40 deletions(-)

diff --git a/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c 
b/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
index d4426b3..fe0b658 100644
--- a/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
+++ b/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
@@ -573,12 +573,12 @@ struct amdgpu_dm_connector 
*amdgpu_dm_find_first_crct_matching_connector(
struct drm_crtc *crtc)
 {
uint32_t i;
-   struct drm_connector_state *conn_state;
+   struct drm_connector_state *new_con_state;
struct drm_connector *connector;
struct drm_crtc *crtc_from_state;
 
-   for_each_new_connector_in_state(state, connector, conn_state, i) {
-   crtc_from_state = conn_state->crtc;
+   for_each_new_connector_in_state(state, connector, new_con_state, i) {
+   crtc_from_state = new_con_state->crtc;
 
if (crtc_from_state == crtc)
return to_amdgpu_dm_connector(connector);
@@ -608,7 +608,7 @@ int amdgpu_dm_display_resume(struct amdgpu_device *adev)
struct amdgpu_dm_connector *aconnector;
struct drm_connector *connector;
struct drm_crtc *crtc;
-   struct drm_crtc_state *crtc_state;
+   struct drm_crtc_state *new_crtc_state;
int ret = 0;
int i;
 
@@ -644,8 +644,8 @@ int amdgpu_dm_display_resume(struct amdgpu_device *adev)
}
 
/* Force mode set in atomic comit */
-   for_each_new_crtc_in_state(adev->dm.cached_state, crtc, crtc_state, i)
-   crtc_state->active_changed = true;
+   for_each_new_crtc_in_state(adev->dm.cached_state, crtc, new_crtc_state, 
i)
+   new_crtc_state->active_changed = true;
 
ret = drm_atomic_helper_resume(ddev, adev->dm.cached_state);
 
@@ -3971,7 +3971,7 @@ int amdgpu_dm_atomic_commit(
bool nonblock)
 {
struct drm_crtc *crtc;
-   struct drm_crtc_state *old_crtc_state, *new_state;
+   struct drm_crtc_state *old_crtc_state, *new_crtc_state;
struct amdgpu_device *adev = dev->dev_private;
int i;
 
@@ -3982,11 +3982,11 @@ int amdgpu_dm_atomic_commit(
 * it will update crtc->dm_crtc_state->stream pointer which is used in
 * the ISRs.
 */
-   for_each_oldnew_crtc_in_state(state, crtc, old_crtc_state, new_state, 
i) {
+   for_each_oldnew_crtc_in_state(state, crtc, old_crtc_state, 
new_crtc_state, i) {
struct dm_crtc_state *old_acrtc_state = 
to_dm_crtc_state(old_crtc_state);
struct amdgpu_crtc *acrtc = to_amdgpu_crtc(crtc);
 
-   if (drm_atomic_crtc_needs_modeset(new_state) && 
old_acrtc_state->stream)
+   if (drm_atomic_crtc_needs_modeset(new_crtc_state) && 
old_acrtc_state->stream)
manage_dm_interrupts(adev, acrtc, false);
}
 
@@ -4011,7 +4011,7 @@ void amdgpu_dm_atomic_commit_tail(
unsigned long flags;
bool wait_for_vblank = true;
struct drm_connector *connector;
-   struct drm_connector_state *old_conn_state, *new_con_state;
+   struct drm_connector_state *old_con_state, *new_con_state;
struct dm_crtc_state *old_acrtc_state, *new_acrtc_state;
 
drm_atomic_helper_update_legacy_modeset_state(dev, state);
@@ -4145,9 +4145,9 @@ void amdgpu_dm_atomic_commit_tail(
}
 
/* Handle scaling and undersacn changes*/
-   for_each_oldnew_connector_in_state(state, connector, old_conn_state, 
new_con_state, i) {
+   for_each_oldnew_connector_in_state(state, connector, old_con_state, 
new_con_state, i) {
struct dm_connector_state *con_new_state = 
to_dm_connector_state(new_con_state);
-   struct dm_connector_state *con_old_state = 
to_dm_connector_state(old_conn_state);
+   struct dm_connector_state *con_old_state = 
to_dm_connector_state(old_con_state);
struct amdgpu_crtc *acrtc = 
to_amdgpu_crtc(con_new_state->base.crtc);
struct dc_stream_status *status = NULL;
 
@@ -4375,7 +4375,7 @@ static int dm_update_crtcs_state(
bool *lock_and_validation_needed)
 {
struct drm_crtc *crtc;
-   struct drm_crtc_state *old_crtc_state, *crtc_state;
+   struct drm_crtc_state *old_crtc_state, *new_crtc_state;
int i;
struct dm_crtc_state *old_acrtc_state, *new_acrtc_state;
struct dm_atomic_state *dm_state = to_dm_atomic_state(state);
@@ -4384,34 +4384,34 @@ static int dm_update_crtcs_state(
 
/*TODO Move this code into dm_crtc_atomic_check once we get rid of 
dc_validation_set */
/* update changed items */
-   

[PATCH 6/6] drm/amd/display: Remove useless pcrtc pointer

2017-10-12 Thread sunpeng.li
From: "Leo (Sunpeng) Li" 

in amdgpu_dm_atomic_commit_tail. Just use crtc instead.

Signed-off-by: Leo (Sunpeng) Li 
---
 drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c 
b/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
index 67222ff..f9b5769 100644
--- a/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
+++ b/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
@@ -4004,7 +4004,7 @@ void amdgpu_dm_atomic_commit_tail(
struct dm_atomic_state *dm_state;
uint32_t i, j;
uint32_t new_crtcs_count = 0;
-   struct drm_crtc *crtc, *pcrtc;
+   struct drm_crtc *crtc;
struct drm_crtc_state *old_crtc_state, *new_crtc_state;
struct amdgpu_crtc *new_crtcs[MAX_STREAMS];
struct dc_stream_state *new_stream = NULL;
@@ -4200,11 +4200,11 @@ void amdgpu_dm_atomic_commit_tail(
}
 
/* update planes when needed per crtc*/
-   for_each_new_crtc_in_state(state, pcrtc, new_crtc_state, j) {
+   for_each_new_crtc_in_state(state, crtc, new_crtc_state, j) {
dm_new_crtc_state = to_dm_crtc_state(new_crtc_state);
 
if (dm_new_crtc_state->stream)
-   amdgpu_dm_commit_planes(state, dev, dm, pcrtc, 
_for_vblank);
+   amdgpu_dm_commit_planes(state, dev, dm, crtc, 
_for_vblank);
}
 
 
-- 
2.7.4



[PATCH 0/6] Use new DRM API where possible, and cleanups.

2017-10-12 Thread sunpeng.li
From: "Leo (Sunpeng) Li" 

Hi Dave,

This series reworks the previous patch. Patch 1 is a v2 of the previous,
and additional patches are from the feedback received. They apply on top
of your drm-next-amd-dc-staging branch.

Thanks,
Leo

Leo (Sunpeng) Li (6):
  drm/amd/display: Use DRM new-style object iterators.
  drm/amd/display: Use new DRM API where possible
  drm/amd/display: Unify DRM state variable namings.
  drm/amd/display: Unify amdgpu_dm state variable namings.
  drm/amd/display: Fix typo
  drm/amd/display: Remove useless pcrtc pointer

 drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c | 320 +++---
 drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.h |   3 +-
 2 files changed, 156 insertions(+), 167 deletions(-)

-- 
2.7.4



[PATCH 5/6] drm/amd/display: Fix typo

2017-10-12 Thread sunpeng.li
From: "Leo (Sunpeng) Li" 

undersacn -> underscan

Signed-off-by: Leo (Sunpeng) Li 
---
 drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c 
b/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
index de88ee1..67222ff 100644
--- a/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
+++ b/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
@@ -4144,7 +4144,7 @@ void amdgpu_dm_atomic_commit_tail(
}
}
 
-   /* Handle scaling and undersacn changes*/
+   /* Handle scaling and underscan changes*/
for_each_oldnew_connector_in_state(state, connector, old_con_state, 
new_con_state, i) {
struct dm_connector_state *dm_new_con_state = 
to_dm_connector_state(new_con_state);
struct dm_connector_state *dm_old_con_state = 
to_dm_connector_state(old_con_state);
@@ -4707,7 +4707,7 @@ int amdgpu_dm_atomic_check(struct drm_device *dev,
 if (ret)
 goto fail;
 
-   /* Check scaling and undersacn changes*/
+   /* Check scaling and underscan changes*/
/*TODO Removed scaling changes validation due to inability to commit
 * new stream into context w\o causing full reset. Need to
 * decide how to handle.
-- 
2.7.4



[PATCH 2/6] drm/amd/display: Use new DRM API where possible

2017-10-12 Thread sunpeng.li
From: "Leo (Sunpeng) Li" 

To conform to DRM's new API, we should not be accessing a DRM object's
internal state directly. Rather, the DRM for_each_old/new_* iterators,
and drm_atomic_get_old/new_* interface should be used.

This is an ongoing process. For now, update the DRM-facing atomic
functions, where the atomic state object is given.
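The iterator contract this patch relies on can be shown with a minimal, self-contained model. Toy types and a toy macro are used here, not the real DRM implementation; the point is that the iterator hands the caller an explicit new (or old) state pointer instead of the ambiguous `obj->state`:

```c
#include <assert.h>
#include <stddef.h>

/* Toy model of the new-style iterator contract (not real DRM code). */
struct plane_state { int fb_id; };
struct plane { struct plane_state *state; /* which side this is depends on the swap */ };
struct atomic_state {
	struct plane *planes[2];
	struct plane_state *old_state[2];
	struct plane_state *new_state[2];
	int nplanes;
};

/* Mirrors the shape of DRM's for_each_new_plane_in_state(): the body
 * always sees the NEW state, regardless of what plane->state points at. */
#define for_each_new_plane_in_state(s, p, ps, i)                          \
	for ((i) = 0;                                                     \
	     (i) < (s)->nplanes &&                                        \
	     ((p) = (s)->planes[(i)], (ps) = (s)->new_state[(i)], 1);     \
	     (i)++)
```

Reading `plane->state` directly, as the old code did, silently picks old or new depending on whether the commit-time state swap has happened yet; the explicit iterators remove that ambiguity.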

Signed-off-by: Leo (Sunpeng) Li 
---
 drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c | 131 +++---
 1 file changed, 66 insertions(+), 65 deletions(-)

diff --git a/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c 
b/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
index cc024ab..d4426b3 100644
--- a/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
+++ b/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
@@ -3873,28 +3873,31 @@ static void amdgpu_dm_commit_planes(struct 
drm_atomic_state *state,
 {
uint32_t i;
struct drm_plane *plane;
-   struct drm_plane_state *old_plane_state;
+   struct drm_plane_state *old_plane_state, *new_plane_state;
struct dc_stream_state *dc_stream_attach;
struct dc_plane_state *plane_states_constructed[MAX_SURFACES];
struct amdgpu_crtc *acrtc_attach = to_amdgpu_crtc(pcrtc);
-   struct dm_crtc_state *acrtc_state = to_dm_crtc_state(pcrtc->state);
+   struct drm_crtc_state *new_pcrtc_state =
+   drm_atomic_get_new_crtc_state(state, pcrtc);
+   struct dm_crtc_state *acrtc_state = to_dm_crtc_state(new_pcrtc_state);
int planes_count = 0;
unsigned long flags;
 
/* update planes when needed */
-   for_each_old_plane_in_state(state, plane, old_plane_state, i) {
-   struct drm_plane_state *plane_state = plane->state;
-   struct drm_crtc *crtc = plane_state->crtc;
-   struct drm_framebuffer *fb = plane_state->fb;
+   for_each_oldnew_plane_in_state(state, plane, old_plane_state, 
new_plane_state, i) {
+   struct drm_crtc *crtc = new_plane_state->crtc;
+   struct drm_crtc_state *new_crtc_state =
+   drm_atomic_get_new_crtc_state(state, crtc);
+   struct drm_framebuffer *fb = new_plane_state->fb;
bool pflip_needed;
-   struct dm_plane_state *dm_plane_state = 
to_dm_plane_state(plane_state);
+   struct dm_plane_state *dm_plane_state = 
to_dm_plane_state(new_plane_state);
 
if (plane->type == DRM_PLANE_TYPE_CURSOR) {
handle_cursor_update(plane, old_plane_state);
continue;
}
 
-   if (!fb || !crtc || pcrtc != crtc || !crtc->state->active)
+   if (!fb || !crtc || pcrtc != crtc || !new_crtc_state->active)
continue;
 
pflip_needed = !state->allow_modeset;
@@ -3918,13 +3921,13 @@ static void amdgpu_dm_commit_planes(struct 
drm_atomic_state *state,
dc_stream_attach = acrtc_state->stream;
planes_count++;
 
-   } else if (crtc->state->planes_changed) {
+   } else if (new_crtc_state->planes_changed) {
/* Assume even ONE crtc with immediate flip means
 * entire can't wait for VBLANK
 * TODO Check if it's correct
 */
*wait_for_vblank =
-   pcrtc->state->pageflip_flags & 
DRM_MODE_PAGE_FLIP_ASYNC ?
+   new_pcrtc_state->pageflip_flags & 
DRM_MODE_PAGE_FLIP_ASYNC ?
false : true;
 
/* TODO: Needs rework for multiplane flip */
@@ -3942,7 +3945,7 @@ static void amdgpu_dm_commit_planes(struct 
drm_atomic_state *state,
if (planes_count) {
unsigned long flags;
 
-   if (pcrtc->state->event) {
+   if (new_pcrtc_state->event) {
 
drm_crtc_vblank_get(pcrtc);
 
@@ -3968,7 +3971,7 @@ int amdgpu_dm_atomic_commit(
bool nonblock)
 {
struct drm_crtc *crtc;
-   struct drm_crtc_state *new_state;
+   struct drm_crtc_state *old_crtc_state, *new_state;
struct amdgpu_device *adev = dev->dev_private;
int i;
 
@@ -3979,8 +3982,8 @@ int amdgpu_dm_atomic_commit(
 * it will update crtc->dm_crtc_state->stream pointer which is used in
 * the ISRs.
 */
-   for_each_new_crtc_in_state(state, crtc, new_state, i) {
-   struct dm_crtc_state *old_acrtc_state = 
to_dm_crtc_state(crtc->state);
+   for_each_oldnew_crtc_in_state(state, crtc, old_crtc_state, new_state, 
i) {
+   struct dm_crtc_state *old_acrtc_state = 
to_dm_crtc_state(old_crtc_state);
struct amdgpu_crtc *acrtc = to_amdgpu_crtc(crtc);
 
if (drm_atomic_crtc_needs_modeset(new_state) 

[PATCH 4/6] drm/amd/display: Unify amdgpu_dm state variable namings.

2017-10-12 Thread sunpeng.li
From: "Leo (Sunpeng) Li" 

Use dm_new_*_state and dm_old_*_state for their respective amdgpu_dm new
and old object states. Helps with readability, and enforces use of new
DRM api (choose either new, or old).

Signed-off-by: Leo (Sunpeng) Li 
---
 drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c | 137 +++---
 1 file changed, 68 insertions(+), 69 deletions(-)

diff --git a/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c 
b/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
index fe0b658..de88ee1 100644
--- a/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
+++ b/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
@@ -3890,7 +3890,7 @@ static void amdgpu_dm_commit_planes(struct 
drm_atomic_state *state,
drm_atomic_get_new_crtc_state(state, crtc);
struct drm_framebuffer *fb = new_plane_state->fb;
bool pflip_needed;
-   struct dm_plane_state *dm_plane_state = 
to_dm_plane_state(new_plane_state);
+   struct dm_plane_state *dm_new_plane_state = 
to_dm_plane_state(new_plane_state);
 
if (plane->type == DRM_PLANE_TYPE_CURSOR) {
handle_cursor_update(plane, old_plane_state);
@@ -3914,9 +3914,9 @@ static void amdgpu_dm_commit_planes(struct 
drm_atomic_state *state,
spin_unlock_irqrestore(>dev->event_lock, flags);
 
if (!pflip_needed) {
-   WARN_ON(!dm_plane_state->dc_state);
+   WARN_ON(!dm_new_plane_state->dc_state);
 
-   plane_states_constructed[planes_count] = 
dm_plane_state->dc_state;
+   plane_states_constructed[planes_count] = 
dm_new_plane_state->dc_state;
 
dc_stream_attach = acrtc_state->stream;
planes_count++;
@@ -3983,10 +3983,10 @@ int amdgpu_dm_atomic_commit(
 * the ISRs.
 */
for_each_oldnew_crtc_in_state(state, crtc, old_crtc_state, 
new_crtc_state, i) {
-   struct dm_crtc_state *old_acrtc_state = 
to_dm_crtc_state(old_crtc_state);
+   struct dm_crtc_state *dm_old_crtc_state = 
to_dm_crtc_state(old_crtc_state);
struct amdgpu_crtc *acrtc = to_amdgpu_crtc(crtc);
 
-   if (drm_atomic_crtc_needs_modeset(new_crtc_state) && 
old_acrtc_state->stream)
+   if (drm_atomic_crtc_needs_modeset(new_crtc_state) && 
dm_old_crtc_state->stream)
manage_dm_interrupts(adev, acrtc, false);
}
 
@@ -4012,7 +4012,7 @@ void amdgpu_dm_atomic_commit_tail(
bool wait_for_vblank = true;
struct drm_connector *connector;
struct drm_connector_state *old_con_state, *new_con_state;
-   struct dm_crtc_state *old_acrtc_state, *new_acrtc_state;
+   struct dm_crtc_state *dm_old_crtc_state, *dm_new_crtc_state;
 
drm_atomic_helper_update_legacy_modeset_state(dev, state);
 
@@ -4022,8 +4022,8 @@ void amdgpu_dm_atomic_commit_tail(
for_each_oldnew_crtc_in_state(state, crtc, old_crtc_state, 
new_crtc_state, i) {
struct amdgpu_crtc *acrtc = to_amdgpu_crtc(crtc);
 
-   new_acrtc_state = to_dm_crtc_state(new_crtc_state);
-   old_acrtc_state = to_dm_crtc_state(old_crtc_state);
+   dm_new_crtc_state = to_dm_crtc_state(new_crtc_state);
+   dm_old_crtc_state = to_dm_crtc_state(old_crtc_state);
 
DRM_DEBUG_DRIVER(
"amdgpu_crtc id:%d crtc_state_flags: enable:%d, 
active:%d, "
@@ -4041,11 +4041,11 @@ void amdgpu_dm_atomic_commit_tail(
 * aconnector as needed
 */
 
-   if (modeset_required(new_crtc_state, new_acrtc_state->stream, 
old_acrtc_state->stream)) {
+   if (modeset_required(new_crtc_state, dm_new_crtc_state->stream, 
dm_old_crtc_state->stream)) {
 
DRM_DEBUG_DRIVER("Atomic commit: SET crtc id %d: 
[%p]\n", acrtc->crtc_id, acrtc);
 
-   if (!new_acrtc_state->stream) {
+   if (!dm_new_crtc_state->stream) {
/*
 * this could happen because of issues with
 * userspace notifications delivery.
@@ -4067,8 +4067,8 @@ void amdgpu_dm_atomic_commit_tail(
}
 
 
-   if (old_acrtc_state->stream)
-   remove_stream(adev, acrtc, 
old_acrtc_state->stream);
+   if (dm_old_crtc_state->stream)
+   remove_stream(adev, acrtc, 
dm_old_crtc_state->stream);
 
 
/*
@@ -4092,8 +4092,8 @@ void amdgpu_dm_atomic_commit_tail(
DRM_DEBUG_DRIVER("Atomic commit: RESET. crtc id 
%d:[%p]\n", acrtc->crtc_id, acrtc);
 
/* i.e. reset mode */
-   if 

[PATCH v2 1/6] drm/amd/display: Use DRM new-style object iterators.

2017-10-12 Thread sunpeng.li
From: "Leo (Sunpeng) Li" 

Use the correct for_each_new/old_* iterators instead of for_each_*

List of affected functions:

amdgpu_dm_find_first_crtc_matching_connector: use for_each_new
- Old from_state_var flag was always choosing the new state

amdgpu_dm_display_resume: use for_each_new
- drm_atomic_helper_duplicate_state is called during suspend to
  cache the state
- It sets 'state' within the state triplet to 'new_state'

amdgpu_dm_commit_planes: use for_each_old
- Called after the state was swapped (via atomic commit tail)

amdgpu_dm_atomic_commit: use for_each_new
- Called before the state is swapped

amdgpu_dm_atomic_commit_tail: use for_each_old
- Called after the state was swapped

dm_update_crtcs_state: use for_each_new
- Called before the state is swapped (via atomic check)

amdgpu_dm_atomic_check: use for_each_new
- Called before the state is swapped
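The swap that drives the old-vs-new choice above can be modeled in a few lines. This is toy code, not the real drm_atomic_helper_swap_state(): before the swap, the object's `state` pointer is still the old state and the triplet's new_state is the one being checked; after the swap, the object carries the new state and the old one is only reachable through the triplet.

```c
#include <assert.h>
#include <stddef.h>

/* Toy model of the commit-time state swap (not real DRM code). */
struct crtc_state { int active; };
struct crtc { struct crtc_state *state; };
struct state_slot { struct crtc_state *old_state, *new_state; };

static void swap_state(struct crtc *crtc, struct state_slot *slot)
{
	slot->old_state = crtc->state;  /* previous state becomes "old" */
	crtc->state = slot->new_state;  /* object now carries "new" */
}
```

Hence the rule applied throughout this patch: functions that run before the swap (atomic check, atomic commit entry) use for_each_new_*, and functions that run after it (commit tail, commit planes) use for_each_old_* to reach the pre-swap state.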

v2: Split out typo fixes to a new patch.

Signed-off-by: Leo (Sunpeng) Li 
---
 drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c | 32 ---
 drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.h |  3 +--
 2 files changed, 12 insertions(+), 23 deletions(-)

diff --git a/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c 
b/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
index 9bfe1f9..cc024ab 100644
--- a/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
+++ b/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
@@ -570,23 +570,15 @@ static int dm_suspend(void *handle)
 
 struct amdgpu_dm_connector *amdgpu_dm_find_first_crct_matching_connector(
struct drm_atomic_state *state,
-   struct drm_crtc *crtc,
-   bool from_state_var)
+   struct drm_crtc *crtc)
 {
uint32_t i;
struct drm_connector_state *conn_state;
struct drm_connector *connector;
struct drm_crtc *crtc_from_state;
 
-   for_each_new_connector_in_state(
-   state,
-   connector,
-   conn_state,
-   i) {
-   crtc_from_state =
-   from_state_var ?
-   conn_state->crtc :
-   connector->state->crtc;
+   for_each_new_connector_in_state(state, connector, conn_state, i) {
+   crtc_from_state = conn_state->crtc;
 
if (crtc_from_state == crtc)
return to_amdgpu_dm_connector(connector);
@@ -3890,7 +3882,7 @@ static void amdgpu_dm_commit_planes(struct 
drm_atomic_state *state,
unsigned long flags;
 
/* update planes when needed */
-   for_each_new_plane_in_state(state, plane, old_plane_state, i) {
+   for_each_old_plane_in_state(state, plane, old_plane_state, i) {
struct drm_plane_state *plane_state = plane->state;
struct drm_crtc *crtc = plane_state->crtc;
struct drm_framebuffer *fb = plane_state->fb;
@@ -4024,7 +4016,7 @@ void amdgpu_dm_atomic_commit_tail(
dm_state = to_dm_atomic_state(state);
 
/* update changed items */
-   for_each_new_crtc_in_state(state, crtc, old_crtc_state, i) {
+   for_each_old_crtc_in_state(state, crtc, old_crtc_state, i) {
struct amdgpu_crtc *acrtc = to_amdgpu_crtc(crtc);
struct drm_crtc_state *new_state = crtc->state;
 
@@ -4113,11 +4105,9 @@ void amdgpu_dm_atomic_commit_tail(
new_acrtc_state = 
to_dm_crtc_state(new_crtcs[i]->base.state);
 
new_stream = new_acrtc_state->stream;
-   aconnector =
-   amdgpu_dm_find_first_crct_matching_connector(
+   aconnector = 
amdgpu_dm_find_first_crct_matching_connector(
state,
-   _crtcs[i]->base,
-   false);
+   _crtcs[i]->base);
if (!aconnector) {
DRM_DEBUG_DRIVER("Atomic commit: Failed to find 
connector for acrtc id:%d "
 "skipping freesync init\n",
@@ -4151,7 +4141,7 @@ void amdgpu_dm_atomic_commit_tail(
}
 
/* Handle scaling and undersacn changes*/
-   for_each_new_connector_in_state(state, connector, old_conn_state, i) {
+   for_each_old_connector_in_state(state, connector, old_conn_state, i) {
struct amdgpu_dm_connector *aconnector = 
to_amdgpu_dm_connector(connector);
struct dm_connector_state *con_new_state =
to_dm_connector_state(aconnector->base.state);
@@ -4205,7 +4195,7 @@ void amdgpu_dm_atomic_commit_tail(
}
 
/* update planes when needed per crtc*/
-   for_each_new_crtc_in_state(state, pcrtc, old_crtc_state, j) {
+   for_each_old_crtc_in_state(state, pcrtc, old_crtc_state, j) {

[PATCH] drm/amdgpu: bump version for vram lost counter query

2017-10-12 Thread Alex Deucher
Signed-off-by: Alex Deucher 
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
index b9a3258..38cd8bd 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
@@ -72,9 +72,10 @@
  * - 3.20.0 - Add support for local BOs
  * - 3.21.0 - Add DRM_AMDGPU_FENCE_TO_HANDLE ioctl
  * - 3.22.0 - Add DRM_AMDGPU_SCHED ioctl
+ * - 3.23.0 - Add query for vram lost counter
  */
 #define KMS_DRIVER_MAJOR   3
-#define KMS_DRIVER_MINOR   22
+#define KMS_DRIVER_MINOR   23
 #define KMS_DRIVER_PATCHLEVEL  0
 
 int amdgpu_vram_limit = 0;
-- 
2.5.5
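On the userspace side, a UMD would gate use of the new query on the driver version reported by libdrm's drmGetVersion(). The helper below is a hypothetical sketch of that check (only the name `supports_vram_lost_query` is invented; the 3.23 threshold comes from the patch above):

```c
#include <assert.h>

/* The vram-lost-counter query only exists from amdgpu DRM version
 * 3.23 onward, per the version bump in this patch. */
static int supports_vram_lost_query(int major, int minor)
{
	return major > 3 || (major == 3 && minor >= 23);
}
```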



Re: [PATCH 1/3] drm/syncobj: extract two helpers from drm_syncobj_create

2017-10-12 Thread Marek Olšák
Sorry, this codepath is not tested by radeonsi.

Marek

On Sun, Oct 1, 2017 at 1:20 AM, Zhou, David(ChunMing)
 wrote:
> Could you test and review it? I have no environment at hand.
>
> Regards,
> David Zhou
>
> Sent from my Smartisan Pro
>
> Marek Olšák wrote on 2017-09-30 23:56:
>
> The idea sounds good.
>
> Marek
>
> On Sat, Sep 30, 2017 at 3:55 AM, Chunming Zhou  wrote:
>> My mean is like the attached, I revert part of yours.
>>
>> Regards,
>>
>> David zhou
>>
>>
>>
>>> On 2017-09-29 22:15, Marek Olšák wrote:
>>>
>>> On Fri, Sep 29, 2017 at 4:13 PM, Marek Olšák  wrote:

 On Fri, Sep 29, 2017 at 4:44 AM, Chunming Zhou  wrote:
>
>
> On 2017-09-13 04:42, Marek Olšák wrote:
>>
>> From: Marek Olšák 
>>
>> For amdgpu.
>>
>> drm_syncobj_create is renamed to drm_syncobj_create_as_handle, and new
>> helpers drm_syncobj_create and drm_syncobj_get_handle are added.
>>
>> Signed-off-by: Marek Olšák 
>> ---
>>drivers/gpu/drm/drm_syncobj.c | 49
>> +++
>>include/drm/drm_syncobj.h |  4 
>>2 files changed, 49 insertions(+), 4 deletions(-)
>>
>> diff --git a/drivers/gpu/drm/drm_syncobj.c
>> b/drivers/gpu/drm/drm_syncobj.c
>> index 0422b8c..0bb1741 100644
>> --- a/drivers/gpu/drm/drm_syncobj.c
>> +++ b/drivers/gpu/drm/drm_syncobj.c
>> @@ -262,8 +262,14 @@ void drm_syncobj_free(struct kref *kref)
>>}
>>EXPORT_SYMBOL(drm_syncobj_free);
>>-static int drm_syncobj_create(struct drm_file *file_private,
>> - u32 *handle, uint32_t flags)
>
> You can add a new parameter for passing dma fence, then in patch3, you
> can
> directly use it for AMDGPU_FENCE_TO HANDLE_GET_SYNCOBJ.
>
> otherwise the set looks good to me.

 Sorry I just pushed this.
>>>
>>> Actually, you commented on a deleted line. The function already has
>>> dma_fence among the parameters.
>>>
>>> Marek
>>
>>


[ANNOUNCE] libdrm 2.4.84

2017-10-12 Thread Marek Olšák



libdrm 2.4.84 has been released.


Alex Deucher (1):
  tests/amdgpu: add missing header to SOURCES

Andrey Grodzovsky (1):
  amdgpu: Add deadlock detection test suit.

Anuj Phogat (1):
  intel: Change a KBL pci id to GT2 from GT1.5

Christian König (1):
  amdgpu: make userptr unit test more interesting

Dave Airlie (1):
  headers: sync syncobj ioctl defines.

Eric Engestrom (1):
  freedreno/kgsl: fix pointer-to-int cast

James Zhu (2):
  tests/amdgpu: add new uvd enc support check
  tests/amdgpu: fix uvd enc data corruption issue

Jan Vesely (1):
  amdgpu: Do not write beyond allocated memory when parsing ids

Marek Olšák (7):
  amdgpu: print error messages when amdgpu_device_initialize is failing
  include: sync drm.h and amdgpu_drm.h with airlied/drm-next
  amdgpu: add sync_file import and export functions
  drm: add drmSyncobjWait wrapper
  amdgpu: add amdgpu_cs_syncobj_wait
  amdgpu: add amdgpu_cs_fence_to_handle
  configure.ac: bump version to 2.4.84

Philipp Zabel (1):
  etnaviv: prevent deadlock in error path

Rob Herring (2):
  Android: move libraries to /vendor
  headers: sync DRM_MODE_ROTATE/REFLECT defines from kernel v4.14-rc1

git tag: libdrm-2.4.84

https://dri.freedesktop.org/libdrm/libdrm-2.4.84.tar.bz2
MD5:  35b9544bc2ad864acd1abaa1a2b99092  libdrm-2.4.84.tar.bz2
SHA1: 3a8835aaef89648757593f8de9eff95990dd  libdrm-2.4.84.tar.bz2
SHA256: 7ae9c24d91139ac9a2cdee06fe46dbe1c401a1eda1c0bd2a6d1ecf72f479e0aa  
libdrm-2.4.84.tar.bz2
SHA512: 
860ebc5fa934edee97e9e7e13aaa2f2e70a68b946f4f3893cd7f93b8296c10b3cd4ce4c23b1676eefe375286e6e2292b96e917d7976f65c61da3fa661e5e641a
  libdrm-2.4.84.tar.bz2
PGP:  https://dri.freedesktop.org/libdrm/libdrm-2.4.84.tar.bz2.sig

https://dri.freedesktop.org/libdrm/libdrm-2.4.84.tar.gz
MD5:  5e676f903bdb245878383334dca4cc33  libdrm-2.4.84.tar.gz
SHA1: e406522d41c2dc5f3ac9643f885a349e039ffeb6  libdrm-2.4.84.tar.gz
SHA256: ca4d3a4705be2ec289f9df7cfa871f5e02fa43d0f653622c9d9d428959143e78  
libdrm-2.4.84.tar.gz
SHA512: 
efbe352e1bbf3bb1962ad8e1c0a2774e5683db9cd0d91e42b844ddc74089d131f305bc6977c0734690c88af11f5d6777dbfd4bbaab9778fa550dc1b6f52e5cb6
  libdrm-2.4.84.tar.gz
PGP:  https://dri.freedesktop.org/libdrm/libdrm-2.4.84.tar.gz.sig



Re: [PATCH] drm/amdgpu: introduce AMDGPU_GEM_CREATE_EXPLICIT_SYNC v2

2017-10-12 Thread Christian König

On 2017-10-12 20:27, Marek Olšák wrote:

On Thu, Sep 21, 2017 at 4:38 PM, Andres Rodriguez  wrote:

Hi Christian,

The reference radv patches are on the list. The basic idea is to only set
the explicit sync flag for buffers allocated for dri usage.

Did you mean "only set the explicit sync flag for buffers NOT
allocated for dri usage"?


I really hope so, because otherwise that wouldn't be backward compatible.

But the name indicates that no implicit sync is requested any more, so I 
think he got that right.


Christian.



Marek





Re: [PATCH] drm/amdgpu: introduce AMDGPU_GEM_CREATE_EXPLICIT_SYNC v2

2017-10-12 Thread Marek Olšák
On Thu, Sep 21, 2017 at 4:38 PM, Andres Rodriguez  wrote:
> Hi Christian,
>
> The reference radv patches are on the list. The basic idea is to only set
> the explicit sync flag for buffers allocated for dri usage.

Did you mean "only set the explicit sync flag for buffers NOT
allocated for dri usage"?

Marek


[PATCH] drm/amd/display: Fix warning about overflow

2017-10-12 Thread Harry Wentland
We're overflowing the last bit. Cast it explicitly.

Signed-off-by: Harry Wentland 
---
 drivers/gpu/drm/amd/display/dc/bios/bios_parser2.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/amd/display/dc/bios/bios_parser2.c 
b/drivers/gpu/drm/amd/display/dc/bios/bios_parser2.c
index cb94e18cc455..715dc789bb24 100644
--- a/drivers/gpu/drm/amd/display/dc/bios/bios_parser2.c
+++ b/drivers/gpu/drm/amd/display/dc/bios/bios_parser2.c
@@ -1047,7 +1047,7 @@ static enum bp_result get_embedded_panel_info_v2_1(
lvds->lcd_timing.miscinfo & ATOM_V_REPLICATIONBY2;
info->lcd_timing.misc_info.COMPOSITE_SYNC =
lvds->lcd_timing.miscinfo & ATOM_COMPOSITESYNC;
-   info->lcd_timing.misc_info.INTERLACE =
+   info->lcd_timing.misc_info.INTERLACE = (uint32_t)
lvds->lcd_timing.miscinfo & ATOM_INTERLACE;
 
/* not provided by VBIOS*/
-- 
2.14.1



Re: [pull] amdgpu drm-fixes-4.14

2017-10-12 Thread Alex Deucher
On Thu, Oct 12, 2017 at 1:02 PM, Christian König
 wrote:
> Am 12.10.2017 um 18:20 schrieb Michel Dänzer:
>>
>> On 12/10/17 05:58 PM, Alex Deucher wrote:
>>>
>>> Hi Dave,
>>>
>>> One memory management regression fix.
>>>
>>> The following changes since commit
>>> 545036a9944e9d6e50fed4ca03117147c880ff71:
>>>
>>>Merge tag 'drm-misc-fixes-2017-10-11' of
>>> git://anongit.freedesktop.org/drm/drm-misc into drm-fixes (2017-10-12
>>> 10:38:09 +1000)
>>>
>>> are available in the git repository at:
>>>
>>>git://people.freedesktop.org/~agd5f/linux drm-fixes-4.14
>>>
>>> for you to fetch changes up to 27b94b4f1386c3a8181f5a0277434a32e24e7dd7:
>>>
>>>drm/amdgpu: fix placement flags in amdgpu_ttm_bind (2017-10-12
>>> 10:34:42 -0400)
>>>
>>> 
>>> Christian König (1):
>>>drm/amdgpu: fix placement flags in amdgpu_ttm_bind
>>
>> Thanks Alex, but there's another piglit hang regression in 4.14, caused
>> by commit 6af0883ed977 "drm/amdgpu: discard commands of killed
>> processes", fixed by five commits 6b37d03280a4..318d85de9c20 in
>> amd-staging-drm-next. Either the latter need to be backported to 4.14,
>> or the former needs to be reverted from it.
>
>
> The revert is probably easier to handle at this point.
>
> So to answer your question from the other thread I vote for that.

Nicolai's patches apply cleanly and I think they change about the same
amount of code and we don't have to worry about any problems down the
road when the revert gets merged into drm-next.

Alex


[PATCH] drm/ttm: fix the fix for huge compound pages

2017-10-12 Thread Christian König
From: Christian König 

We don't use compound pages at the moment. Take this into account when
freeing them.

Signed-off-by: Christian König 
---
 drivers/gpu/drm/ttm/ttm_page_alloc.c | 21 -
 1 file changed, 16 insertions(+), 5 deletions(-)

diff --git a/drivers/gpu/drm/ttm/ttm_page_alloc.c 
b/drivers/gpu/drm/ttm/ttm_page_alloc.c
index b6f16e73..c3be50f41461 100644
--- a/drivers/gpu/drm/ttm/ttm_page_alloc.c
+++ b/drivers/gpu/drm/ttm/ttm_page_alloc.c
@@ -731,22 +731,33 @@ static void ttm_put_pages(struct page **pages, unsigned 
npages, int flags,
/* No pool for this memory type so free the pages */
i = 0;
while (i < npages) {
-   unsigned order;
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+   struct page *p = pages[i];
+#endif
+   unsigned order = 0, j;
 
if (!pages[i]) {
++i;
continue;
}
 
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+   for (j = 0; j < HPAGE_PMD_NR; ++j)
+   if (p++ != pages[i + j])
+   break;
+
+   if (j == HPAGE_PMD_NR)
+   order = HPAGE_PMD_ORDER;
+#endif
+
if (page_count(pages[i]) != 1)
pr_err("Erroneous page count. Leaking pages.\n");
-   order = compound_order(pages[i]);
__free_pages(pages[i], order);
 
-   order = 1 << order;
-   while (order) {
+   j = 1 << order;
+   while (j) {
pages[i++] = NULL;
-   --order;
+   --j;
}
}
return;
-- 
2.11.0



Re: [PATCH 5/5] drm/amd/sched: signal and free remaining fences in amd_sched_entity_fini

2017-10-12 Thread Christian König

Am 12.10.2017 um 18:49 schrieb Michel Dänzer:

On 12/10/17 01:00 PM, Michel Dänzer wrote:

[0] I also got this, but I don't know yet if it's related:

No, that seems to be a separate issue; I can still reproduce it with the
huge page related changes reverted. Unfortunately, it doesn't seem to
happen reliably on every piglit run.


Can you enable KASAN in your kernel? And please look up which line 
amdgpu_vm_bo_invalidate+0x88 corresponds to.



Even before your changes this morning, there's another hang which
doesn't happen every time, without any corresponding dmesg output.

Lots of "fun" in amd-staging-drm-next...


Yeah, way too much stuff on my TODO list and not enough time/resources 
for extensive testing :(


Thanks for the reports,
Christian.





  BUG: unable to handle kernel NULL pointer dereference at 0220
  IP: amdgpu_vm_bo_invalidate+0x88/0x210 [amdgpu]
  PGD 0
  P4D 0
  
  Oops:  [#1] SMP

  Modules linked in: cpufreq_powersave cpufreq_userspace cpufreq_conservative 
amdkfd(O) edac_mce_amd kvm amdgpu(O) irqbypass crct10dif_pclmul crc32_pclmul 
chash snd_hda_codec_realtek ghash_clmulni_intel snd_hda_codec_generic 
snd_hda_codec_hdmi pcbc binfmt_misc ttm(O) efi_pstore snd_hda_intel 
drm_kms_helper(O) snd_hda_codec nls_ascii drm(O) snd_hda_core nls_cp437 
i2c_algo_bit aesni_intel snd_hwdep fb_sys_fops aes_x86_64 crypto_simd vfat 
syscopyarea glue_helper sysfillrect snd_pcm fat sysimgblt sp5100_tco wmi_bmof 
ppdev r8169 snd_timer cryptd pcspkr efivars mfd_core mii ccp i2c_piix4 snd 
soundcore rng_core sg wmi parport_pc parport i2c_designware_platform 
i2c_designware_core button acpi_cpufreq tcp_bbr sch_fq sunrpc nct6775 hwmon_vid 
efivarfs ip_tables x_tables autofs4 ext4 crc16 mbcache
   jbd2 fscrypto raid10 raid1 raid0 multipath linear md_mod dm_mod sd_mod evdev 
hid_generic usbhid hid crc32c_intel ahci libahci xhci_pci libata xhci_hcd 
scsi_mod usbcore shpchp gpio_amdpt gpio_generic
  CPU: 13 PID: 1075 Comm: max-texture-siz Tainted: GW  O4.13.0-rc5+ 
#28
  Hardware name: Micro-Star International Co., Ltd. MS-7A34/B350 TOMAHAWK 
(MS-7A34), BIOS 1.80 09/13/2017
  task: 9d2982c75a00 task.stack: b2744e9bc000
  RIP: 0010:amdgpu_vm_bo_invalidate+0x88/0x210 [amdgpu]
  RSP: 0018:b2744e9bf6e8 EFLAGS: 00010202
  RAX:  RBX: 9d2848642820 RCX: 9d28c77fdae0
  RDX: 0001 RSI: 9d28c77fd800 RDI: 9d288f286008
  RBP: b2744e9bf728 R08: 00ff R09: 
  R10: 0078 R11: 9d298ba170a0 R12: 9d28c77fd800
  R13: 0001 R14: 9d288f286000 R15: 9d2848642800
  FS:  7f809fc5c300() GS:9d298e94() knlGS:
  CS:  0010 DS:  ES:  CR0: 80050033
  CR2: 0220 CR3: 00030e05a000 CR4: 003406e0
  Call Trace:
   amdgpu_bo_move_notify+0x42/0xd0 [amdgpu]
   ttm_bo_unmap_virtual_locked+0x298/0xac0 [ttm]
   ? ttm_bo_mem_space+0x391/0x580 [ttm]
   ttm_bo_unmap_virtual_locked+0x737/0xac0 [ttm]
   ttm_bo_unmap_virtual_locked+0xa6f/0xac0 [ttm]
   ttm_bo_mem_space+0x306/0x580 [ttm]
   ttm_bo_validate+0xd4/0x150 [ttm]
   ttm_bo_init_reserved+0x22e/0x440 [ttm]
   amdgpu_ttm_placement_from_domain+0x33c/0x580 [amdgpu]
   ? amdgpu_fill_buffer+0x300/0x420 [amdgpu]
   amdgpu_bo_create+0x50/0x2b0 [amdgpu]
   amdgpu_gem_object_create+0x9f/0x110 [amdgpu]
   amdgpu_gem_create_ioctl+0x12f/0x270 [amdgpu]
   ? amdgpu_gem_object_close+0x210/0x210 [amdgpu]
   drm_ioctl_kernel+0x5d/0xf0 [drm]
   drm_ioctl+0x32a/0x630 [drm]
   ? amdgpu_gem_object_close+0x210/0x210 [amdgpu]
   ? lru_cache_add_active_or_unevictable+0x36/0xb0
   ? __handle_mm_fault+0x90d/0xff0
   amdgpu_drm_ioctl+0x4f/0x1c20 [amdgpu]
   do_vfs_ioctl+0xa5/0x600
   ? handle_mm_fault+0xd8/0x230
   ? __do_page_fault+0x267/0x4c0
   SyS_ioctl+0x79/0x90
   entry_SYSCALL_64_fastpath+0x1e/0xa9
  RIP: 0033:0x7f809c8f3dc7
  RSP: 002b:7ffcc8c485f8 EFLAGS: 0246 ORIG_RAX: 0010
  RAX: ffda RBX: 7f809cbaab00 RCX: 7f809c8f3dc7
  RDX: 7ffcc8c48640 RSI: c0206440 RDI: 0006
  RBP: 4010 R08: 7f809cbaabe8 R09: 0060
  R10: 0004 R11: 0246 R12: 40001000
  R13: 7f809cbaab58 R14: 1000 R15: 7f809cbaab00
  Code: 49 8b 47 10 48 39 45 d0 4c 8d 78 f0 0f 84 87 00 00 00 4d 8b 37 45 84 ed 41 c6 
47 30 01 49 8d 5f 20 49 8d 7e 08 74 19 49 8b 46 58 <48> 8b 80 20 02 00 00 49 39 
84 24 20 02 00 00 0f 84 ab 00 00 00
  RIP: amdgpu_vm_bo_invalidate+0x88/0x210 [amdgpu] RSP: b2744e9bf6e8
  CR2: 0220








Re: [pull] amdgpu drm-fixes-4.14

2017-10-12 Thread Christian König

Am 12.10.2017 um 18:20 schrieb Michel Dänzer:

On 12/10/17 05:58 PM, Alex Deucher wrote:

Hi Dave,

One memory management regression fix.

The following changes since commit 545036a9944e9d6e50fed4ca03117147c880ff71:

   Merge tag 'drm-misc-fixes-2017-10-11' of 
git://anongit.freedesktop.org/drm/drm-misc into drm-fixes (2017-10-12 10:38:09 
+1000)

are available in the git repository at:

   git://people.freedesktop.org/~agd5f/linux drm-fixes-4.14

for you to fetch changes up to 27b94b4f1386c3a8181f5a0277434a32e24e7dd7:

   drm/amdgpu: fix placement flags in amdgpu_ttm_bind (2017-10-12 10:34:42 
-0400)


Christian König (1):
   drm/amdgpu: fix placement flags in amdgpu_ttm_bind

Thanks Alex, but there's another piglit hang regression in 4.14, caused
by commit 6af0883ed977 "drm/amdgpu: discard commands of killed
processes", fixed by five commits 6b37d03280a4..318d85de9c20 in
amd-staging-drm-next. Either the latter need to be backported to 4.14,
or the former needs to be reverted from it.


The revert is probably easier to handle at this point.

So to answer your question from the other thread I vote for that.

Christian.



RE: [pull] amdgpu drm-fixes-4.14

2017-10-12 Thread Deucher, Alexander
> -Original Message-
> From: Michel Dänzer [mailto:mic...@daenzer.net]
> Sent: Thursday, October 12, 2017 12:21 PM
> To: Alex Deucher
> Cc: amd-gfx@lists.freedesktop.org; dri-de...@lists.freedesktop.org;
> airl...@gmail.com; Deucher, Alexander
> Subject: Re: [pull] amdgpu drm-fixes-4.14
> 
> On 12/10/17 05:58 PM, Alex Deucher wrote:
> > Hi Dave,
> >
> > One memory management regression fix.
> >
> > The following changes since commit
> 545036a9944e9d6e50fed4ca03117147c880ff71:
> >
> >   Merge tag 'drm-misc-fixes-2017-10-11' of
> git://anongit.freedesktop.org/drm/drm-misc into drm-fixes (2017-10-12
> 10:38:09 +1000)
> >
> > are available in the git repository at:
> >
> >   git://people.freedesktop.org/~agd5f/linux drm-fixes-4.14
> >
> > for you to fetch changes up to
> 27b94b4f1386c3a8181f5a0277434a32e24e7dd7:
> >
> >   drm/amdgpu: fix placement flags in amdgpu_ttm_bind (2017-10-12
> 10:34:42 -0400)
> >
> > 
> > Christian König (1):
> >   drm/amdgpu: fix placement flags in amdgpu_ttm_bind
> 
> Thanks Alex, but there's another piglit hang regression in 4.14, caused
> by commit 6af0883ed977 "drm/amdgpu: discard commands of killed
> processes", fixed by five commits 6b37d03280a4..318d85de9c20 in
> amd-staging-drm-next. Either the latter need to be backported to 4.14,
> or the former needs to be reverted from it.

Sorting that out seemed like more effort, so I opted to just send this out.  
Once we get that fixed up, I can send another fixes pull.

Alex



Re: [PATCH 5/5] drm/amd/sched: signal and free remaining fences in amd_sched_entity_fini

2017-10-12 Thread Michel Dänzer
On 12/10/17 01:00 PM, Michel Dänzer wrote:
> 
> [0] I also got this, but I don't know yet if it's related:

No, that seems to be a separate issue; I can still reproduce it with the
huge page related changes reverted. Unfortunately, it doesn't seem to
happen reliably on every piglit run.

Even before your changes this morning, there's another hang which
doesn't happen every time, without any corresponding dmesg output.

Lots of "fun" in amd-staging-drm-next...


>  BUG: unable to handle kernel NULL pointer dereference at 0220
>  IP: amdgpu_vm_bo_invalidate+0x88/0x210 [amdgpu]
>  PGD 0 
>  P4D 0 
>  
>  Oops:  [#1] SMP
>  Modules linked in: cpufreq_powersave cpufreq_userspace cpufreq_conservative 
> amdkfd(O) edac_mce_amd kvm amdgpu(O) irqbypass crct10dif_pclmul crc32_pclmul 
> chash snd_hda_codec_realtek ghash_clmulni_intel snd_hda_codec_generic 
> snd_hda_codec_hdmi pcbc binfmt_misc ttm(O) efi_pstore snd_hda_intel 
> drm_kms_helper(O) snd_hda_codec nls_ascii drm(O) snd_hda_core nls_cp437 
> i2c_algo_bit aesni_intel snd_hwdep fb_sys_fops aes_x86_64 crypto_simd vfat 
> syscopyarea glue_helper sysfillrect snd_pcm fat sysimgblt sp5100_tco wmi_bmof 
> ppdev r8169 snd_timer cryptd pcspkr efivars mfd_core mii ccp i2c_piix4 snd 
> soundcore rng_core sg wmi parport_pc parport i2c_designware_platform 
> i2c_designware_core button acpi_cpufreq tcp_bbr sch_fq sunrpc nct6775 
> hwmon_vid efivarfs ip_tables x_tables autofs4 ext4 crc16 mbcache
>   jbd2 fscrypto raid10 raid1 raid0 multipath linear md_mod dm_mod sd_mod 
> evdev hid_generic usbhid hid crc32c_intel ahci libahci xhci_pci libata 
> xhci_hcd scsi_mod usbcore shpchp gpio_amdpt gpio_generic
>  CPU: 13 PID: 1075 Comm: max-texture-siz Tainted: GW  O
> 4.13.0-rc5+ #28
>  Hardware name: Micro-Star International Co., Ltd. MS-7A34/B350 TOMAHAWK 
> (MS-7A34), BIOS 1.80 09/13/2017
>  task: 9d2982c75a00 task.stack: b2744e9bc000
>  RIP: 0010:amdgpu_vm_bo_invalidate+0x88/0x210 [amdgpu]
>  RSP: 0018:b2744e9bf6e8 EFLAGS: 00010202
>  RAX:  RBX: 9d2848642820 RCX: 9d28c77fdae0
>  RDX: 0001 RSI: 9d28c77fd800 RDI: 9d288f286008
>  RBP: b2744e9bf728 R08: 00ff R09: 
>  R10: 0078 R11: 9d298ba170a0 R12: 9d28c77fd800
>  R13: 0001 R14: 9d288f286000 R15: 9d2848642800
>  FS:  7f809fc5c300() GS:9d298e94() knlGS:
>  CS:  0010 DS:  ES:  CR0: 80050033
>  CR2: 0220 CR3: 00030e05a000 CR4: 003406e0
>  Call Trace:
>   amdgpu_bo_move_notify+0x42/0xd0 [amdgpu]
>   ttm_bo_unmap_virtual_locked+0x298/0xac0 [ttm]
>   ? ttm_bo_mem_space+0x391/0x580 [ttm]
>   ttm_bo_unmap_virtual_locked+0x737/0xac0 [ttm]
>   ttm_bo_unmap_virtual_locked+0xa6f/0xac0 [ttm]
>   ttm_bo_mem_space+0x306/0x580 [ttm]
>   ttm_bo_validate+0xd4/0x150 [ttm]
>   ttm_bo_init_reserved+0x22e/0x440 [ttm]
>   amdgpu_ttm_placement_from_domain+0x33c/0x580 [amdgpu]
>   ? amdgpu_fill_buffer+0x300/0x420 [amdgpu]
>   amdgpu_bo_create+0x50/0x2b0 [amdgpu]
>   amdgpu_gem_object_create+0x9f/0x110 [amdgpu]
>   amdgpu_gem_create_ioctl+0x12f/0x270 [amdgpu]
>   ? amdgpu_gem_object_close+0x210/0x210 [amdgpu]
>   drm_ioctl_kernel+0x5d/0xf0 [drm]
>   drm_ioctl+0x32a/0x630 [drm]
>   ? amdgpu_gem_object_close+0x210/0x210 [amdgpu]
>   ? lru_cache_add_active_or_unevictable+0x36/0xb0
>   ? __handle_mm_fault+0x90d/0xff0
>   amdgpu_drm_ioctl+0x4f/0x1c20 [amdgpu]
>   do_vfs_ioctl+0xa5/0x600
>   ? handle_mm_fault+0xd8/0x230
>   ? __do_page_fault+0x267/0x4c0
>   SyS_ioctl+0x79/0x90
>   entry_SYSCALL_64_fastpath+0x1e/0xa9
>  RIP: 0033:0x7f809c8f3dc7
>  RSP: 002b:7ffcc8c485f8 EFLAGS: 0246 ORIG_RAX: 0010
>  RAX: ffda RBX: 7f809cbaab00 RCX: 7f809c8f3dc7
>  RDX: 7ffcc8c48640 RSI: c0206440 RDI: 0006
>  RBP: 4010 R08: 7f809cbaabe8 R09: 0060
>  R10: 0004 R11: 0246 R12: 40001000
>  R13: 7f809cbaab58 R14: 1000 R15: 7f809cbaab00
>  Code: 49 8b 47 10 48 39 45 d0 4c 8d 78 f0 0f 84 87 00 00 00 4d 8b 37 45 84 
> ed 41 c6 47 30 01 49 8d 5f 20 49 8d 7e 08 74 19 49 8b 46 58 <48> 8b 80 20 02 
> 00 00 49 39 84 24 20 02 00 00 0f 84 ab 00 00 00 
>  RIP: amdgpu_vm_bo_invalidate+0x88/0x210 [amdgpu] RSP: b2744e9bf6e8
>  CR2: 0220
> 
> 


-- 
Earthling Michel Dänzer   |   http://www.amd.com
Libre software enthusiast | Mesa and X developer


Re: [pull] amdgpu drm-fixes-4.14

2017-10-12 Thread Michel Dänzer
On 12/10/17 05:58 PM, Alex Deucher wrote:
> Hi Dave,
> 
> One memory management regression fix.
> 
> The following changes since commit 545036a9944e9d6e50fed4ca03117147c880ff71:
> 
>   Merge tag 'drm-misc-fixes-2017-10-11' of 
> git://anongit.freedesktop.org/drm/drm-misc into drm-fixes (2017-10-12 
> 10:38:09 +1000)
> 
> are available in the git repository at:
> 
>   git://people.freedesktop.org/~agd5f/linux drm-fixes-4.14
> 
> for you to fetch changes up to 27b94b4f1386c3a8181f5a0277434a32e24e7dd7:
> 
>   drm/amdgpu: fix placement flags in amdgpu_ttm_bind (2017-10-12 10:34:42 
> -0400)
> 
> 
> Christian König (1):
>   drm/amdgpu: fix placement flags in amdgpu_ttm_bind

Thanks Alex, but there's another piglit hang regression in 4.14, caused
by commit 6af0883ed977 "drm/amdgpu: discard commands of killed
processes", fixed by five commits 6b37d03280a4..318d85de9c20 in
amd-staging-drm-next. Either the latter need to be backported to 4.14,
or the former needs to be reverted from it.


-- 
Earthling Michel Dänzer   |   http://www.amd.com
Libre software enthusiast | Mesa and X developer


[pull] amdgpu drm-fixes-4.14

2017-10-12 Thread Alex Deucher
Hi Dave,

One memory management regression fix.

The following changes since commit 545036a9944e9d6e50fed4ca03117147c880ff71:

  Merge tag 'drm-misc-fixes-2017-10-11' of 
git://anongit.freedesktop.org/drm/drm-misc into drm-fixes (2017-10-12 10:38:09 
+1000)

are available in the git repository at:

  git://people.freedesktop.org/~agd5f/linux drm-fixes-4.14

for you to fetch changes up to 27b94b4f1386c3a8181f5a0277434a32e24e7dd7:

  drm/amdgpu: fix placement flags in amdgpu_ttm_bind (2017-10-12 10:34:42 -0400)


Christian König (1):
  drm/amdgpu: fix placement flags in amdgpu_ttm_bind

 drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)


RE: [PATCH] drm/amdgpu: minor CS optimization

2017-10-12 Thread Deucher, Alexander
> -Original Message-
> From: amd-gfx [mailto:amd-gfx-boun...@lists.freedesktop.org] On Behalf
> Of Christian König
> Sent: Thursday, October 12, 2017 6:20 AM
> To: Grodzovsky, Andrey; amd-gfx@lists.freedesktop.org
> Subject: [PATCH] drm/amdgpu: minor CS optimization
> 
> From: Christian König 
> 
> We only need to loop over all IBs for old UVD/VCE command stream
> patching.
> 
> Signed-off-by: Christian König 

Reviewed-by: Alex Deucher 

> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c | 37 +++++++++++++++++++------------------
>  1 file changed, 19 insertions(+), 18 deletions(-)
> 
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
> index 2ae5d52..20e2a71 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
> @@ -855,36 +855,37 @@ static int amdgpu_cs_ib_vm_chunk(struct
> amdgpu_device *adev,
>   struct amdgpu_fpriv *fpriv = p->filp->driver_priv;
>   struct amdgpu_vm *vm = &fpriv->vm;
>   struct amdgpu_ring *ring = p->job->ring;
> - int i, j, r;
> -
> - for (i = 0, j = 0; i < p->nchunks && j < p->job->num_ibs; i++) {
> -
> - struct amdgpu_cs_chunk *chunk;
> - struct amdgpu_ib *ib;
> - struct drm_amdgpu_cs_chunk_ib *chunk_ib;
> -
> - chunk = &p->chunks[i];
> - ib = &p->job->ibs[j];
> - chunk_ib = (struct drm_amdgpu_cs_chunk_ib *)chunk->kdata;
> + int r;
> 
> - if (chunk->chunk_id != AMDGPU_CHUNK_ID_IB)
> - continue;
> + /* Only for UVD/VCE VM emulation */
> + if (p->job->ring->funcs->parse_cs) {
> + unsigned i, j;
> 
> - if (p->job->ring->funcs->parse_cs) {
> + for (i = 0, j = 0; i < p->nchunks && j < p->job->num_ibs; i++) {
> + struct drm_amdgpu_cs_chunk_ib *chunk_ib;
>   struct amdgpu_bo_va_mapping *m;
>   struct amdgpu_bo *aobj = NULL;
> + struct amdgpu_cs_chunk *chunk;
> + struct amdgpu_ib *ib;
>   uint64_t offset;
>   uint8_t *kptr;
> 
> + chunk = &p->chunks[i];
> + ib = &p->job->ibs[j];
> + chunk_ib = chunk->kdata;
> +
> + if (chunk->chunk_id != AMDGPU_CHUNK_ID_IB)
> + continue;
> +
>   r = amdgpu_cs_find_mapping(p, chunk_ib->va_start,
> - &aobj, &m);
> +&aobj, &m);
>   if (r) {
>   DRM_ERROR("IB va_start is invalid\n");
>   return r;
>   }
> 
>   if ((chunk_ib->va_start + chunk_ib->ib_bytes) >
> - (m->last + 1) * AMDGPU_GPU_PAGE_SIZE) {
> + (m->last + 1) * AMDGPU_GPU_PAGE_SIZE) {
>   DRM_ERROR("IB va_start+ib_bytes is
> invalid\n");
>   return -EINVAL;
>   }
> @@ -901,12 +902,12 @@ static int amdgpu_cs_ib_vm_chunk(struct
> amdgpu_device *adev,
>   memcpy(ib->ptr, kptr, chunk_ib->ib_bytes);
>   amdgpu_bo_kunmap(aobj);
> 
> - /* Only for UVD/VCE VM emulation */
>   r = amdgpu_ring_parse_cs(ring, p, j);
>   if (r)
>   return r;
> +
> + j++;
>   }
> - j++;
>   }
> 
>   if (p->job->vm) {
> --
> 2.7.4
> 


[PATCH] drm/amd/display: Explicit casting for grph object ids

2017-10-12 Thread Harry Wentland
C++ compilers don't like the implicit conversion.

Signed-off-by: Harry Wentland 
---
 drivers/gpu/drm/amd/display/include/grph_object_id.h | 12 ++--
 1 file changed, 6 insertions(+), 6 deletions(-)

diff --git a/drivers/gpu/drm/amd/display/include/grph_object_id.h 
b/drivers/gpu/drm/amd/display/include/grph_object_id.h
index 5eb2b4dc7b9c..836d99e9e0f5 100644
--- a/drivers/gpu/drm/amd/display/include/grph_object_id.h
+++ b/drivers/gpu/drm/amd/display/include/grph_object_id.h
@@ -248,7 +248,7 @@ static inline enum controller_id 
dal_graphics_object_id_get_controller_id(
struct graphics_object_id id)
 {
if (id.type == OBJECT_TYPE_CONTROLLER)
-   return id.id;
+   return (enum controller_id) id.id;
return CONTROLLER_ID_UNDEFINED;
 }
 
@@ -256,7 +256,7 @@ static inline enum clock_source_id 
dal_graphics_object_id_get_clock_source_id(
struct graphics_object_id id)
 {
if (id.type == OBJECT_TYPE_CLOCK_SOURCE)
-   return id.id;
+   return (enum clock_source_id) id.id;
return CLOCK_SOURCE_ID_UNDEFINED;
 }
 
@@ -264,7 +264,7 @@ static inline enum encoder_id 
dal_graphics_object_id_get_encoder_id(
struct graphics_object_id id)
 {
if (id.type == OBJECT_TYPE_ENCODER)
-   return id.id;
+   return (enum encoder_id) id.id;
return ENCODER_ID_UNKNOWN;
 }
 
@@ -272,7 +272,7 @@ static inline enum connector_id 
dal_graphics_object_id_get_connector_id(
struct graphics_object_id id)
 {
if (id.type == OBJECT_TYPE_CONNECTOR)
-   return id.id;
+   return (enum connector_id) id.id;
return CONNECTOR_ID_UNKNOWN;
 }
 
@@ -280,7 +280,7 @@ static inline enum audio_id 
dal_graphics_object_id_get_audio_id(
struct graphics_object_id id)
 {
if (id.type == OBJECT_TYPE_AUDIO)
-   return id.id;
+   return (enum audio_id) id.id;
return AUDIO_ID_UNKNOWN;
 }
 
@@ -288,7 +288,7 @@ static inline enum engine_id 
dal_graphics_object_id_get_engine_id(
struct graphics_object_id id)
 {
if (id.type == OBJECT_TYPE_ENGINE)
-   return id.id;
+   return (enum engine_id) id.id;
return ENGINE_ID_UNKNOWN;
 }
 #endif
-- 
2.14.1



Re: [PATCH 5/5] drm/amd/sched: signal and free remaining fences in amd_sched_entity_fini

2017-10-12 Thread Michel Dänzer
On 12/10/17 03:50 PM, Christian König wrote:
> Am 12.10.2017 um 15:42 schrieb Michel Dänzer:
>> On 12/10/17 01:44 PM, Christian König wrote:
>>> Am 12.10.2017 um 13:00 schrieb Michel Dänzer:
>>>
 Anyway, unless anyone knows which commits from amd-staging-drm-next are
 needed to make 1d00402b4da2 stable in 4.14, the safe course of action
 seems to be reverting it (and ac7afe6b3cf3, which depends on it)?
>>> The amdgpu_ttm_bind change should be fixed by "70a9c6b drm/amdgpu: fix
>>> placement flags in amdgpu_ttm_bind".
>> Indeed, that fixes it for me.
>>
>>> But I've assumed they went both into 4.14.
>> Unfortunately, it looks like only 1d00402b4da2 made it into 4.14. Alex,
>> please send a fixes pull for 4.14 with a backport of 70a9c6b.
>>
>> For the other issue, do we want to backport Nicolai's commits
>> 6b37d03280a4..318d85de9c20 or revert 6af0883ed977?
>>
>> Christian, can you check that there are no other fixes missing from 4.14?
> 
> Not that I know of, and manually checking is nearly impossible since I
> don't know what Alex pushed to 4.14 and what not.
> 
> Manually checking what went in and what not would take quite some time.

Looking at commits which are in drm-next-4.15 / amd-staging-drm-next but
not in Linus' tree should be a good start. Something like:

gitk v4.14-rc4..origin/amd-staging-drm-next drivers/gpu/drm/{ttm,amd}
(in an internal tree)

gitk v4.14-rc4../drm-next-4.15 drivers/gpu/drm/{ttm,amd}


-- 
Earthling Michel Dänzer   |   http://www.amd.com
Libre software enthusiast | Mesa and X developer


drm/amdgpu: LVDS-1 output only turns on with sleep 1s workaround

2017-10-12 Thread Peter Stuge
Hi,

I have a situation which might be a bug in drm/amd/amdgpu:

I've got a TFT screen unit with an integrated AMD G-T56N mainboard
used in a signage application.

I build kernel and xorg myself.

Kernels 4.9.29, 4.9.52 and 4.13.4 behave the same.

git.freedesktop.org/~airlied/linux.git drm-next commit 754270c7c (Sep 28)
had a regression where amdgpu would not take the console and mode 03h
remained indefinitely. No /dev/dri/card0 so X could not start.

That side issue was fixed though, and
git.freedesktop.org/~airlied/linux.git drm-next commit baf7c1f7e (Oct 6)
(I think 4.14.0-rc4) again behaves like 4.9 and 4.13.4, i.e. it also has this
problem that I am facing.


I use the xorg modeset video driver because one software image must
support different hardware and I think drm is a good way.

I want to lock FHD resolution so on the kernel cmdline I have:
drm_kms_helper.edid_firmware=edid/1920x1080.bin

This works well; X reports the LNX EDID.

systemd starts X with xinit and a script of mine which runs xrandr
to set rotation and to make all outputs same-as the primary output,
and finally exec the application.

The built-in TFT panel is on connector 0, which drm and X both call LVDS-1.

The unit also has an external HDMI connector, that's connector 1, which
drm calls HDMI-A-1 but X calls HDMI-1.

During boot, BIOS and mode 03h with MBR bootloader are visible on the
internal TFT.

When drm takes over and switches from mode 03h all connected panels
turn off and stay off, I guess because there's no kmscon. All good.


The problem:

When X has started (with or without my script) xrandr always reports
the internal panel (connector 0 AKA LVDS-1) as connected primary and
on with FHD resolution *but* the panel stays off.

Connector 0 seems to have a DP->HDMI encoder connected to it,
internally in the unit. I'm not sure - no docs, not very easy to
investigate. There is a BIOS setting to disable the encoder, that
leads to there never being any output on the internal panel.

My guess is that there is a timing issue between amdgpu and that
encoder, but that's pure speculation.


Connector 1 AKA HDMI-A-1 AKA HDMI-1 works reliably as expected. If a
screen is connected, it turns on and shows the correct contents as
set by xrandr. If not then not. Whether something is connected to
HDMI-1 has no effect on the internal panel.


Discovered workaround so far:

Adding the following to kiosk.sh turns the internal panel on:

xrandr --output LVDS-1 --off
sleep 1s
xrandr --output LVDS-1 --auto

Without the sleep, the internal panel does *not* turn on.


Questions:

What is the relationship between drm connector names and X output names?
More specifically: How could I find out that drm HDMI-A-1 == X HDMI-1?

Can I help find the cause of my problem and make the internal panel turn
on without my workaround? What information would be needed, and should I
send here or use bugzilla?

What is "ib test on ring 5" ? It currently needs 500 ms. Could I help
optimize that?


--8<-- dmesg drm snippet
[1.575518] Linux agpgart interface v0.103
[1.575888] [drm] radeon kernel modesetting enabled.
[1.578517] [drm] initializing kernel modesetting (PALM 0x1002:0x9806 
0x1022:0x1511 0x00).
[1.578806] ATOM BIOS: AMD
[1.578967] radeon :00:01.0: VRAM: 384M 0x - 
0x17FF (384M used)
[1.578978] radeon :00:01.0: GTT: 1024M 0x1800 - 
0x57FF
[1.578996] [drm] Detected VRAM RAM=384M, BAR=256M
[1.579085] [drm] RAM width 32bits DDR
[1.579600] [TTM] Zone  kernel: Available graphics memory: 432660 kiB
[1.579616] [TTM] Zone highmem: Available graphics memory: 1352552 kiB
[1.579624] [TTM] Initializing pool allocator
[1.58] [drm] radeon: 384M of VRAM memory ready
[1.580096] [drm] radeon: 1024M of GTT memory ready.
[1.580200] [drm] Loading PALM Microcode
[1.580219] [drm] Internal thermal controller without fan control
[1.580542] [drm] Found smc ucode version: 0x00010601
[1.580726] [drm] radeon: dpm initialized
[1.580830] [drm] GART: num cpu pages 262144, num gpu pages 262144
[1.604202] [drm] PCIE GART of 1024M enabled (table at 0x00162000).
[1.604648] radeon :00:01.0: WB enabled
[1.604662] radeon :00:01.0: fence driver on ring 0 use gpu addr 
0x18000c00 and cpu addr 0xff801c00
[1.604674] radeon :00:01.0: fence driver on ring 3 use gpu addr 
0x18000c0c and cpu addr 0xff801c0c
[1.605369] radeon :00:01.0: fence driver on ring 5 use gpu addr 
0x00072118 and cpu addr 0xf8232118
[1.605384] [drm] Supports vblank timestamp caching Rev 2 (21.10.2013).
[1.605392] [drm] Driver supports precise vblank timestamp query.
[1.605402] radeon :00:01.0: radeon: MSI limited to 32-bit
[1.605649] radeon :00:01.0: radeon: using MSI.
[1.605824] [drm] radeon: irq initialized.
[1.624412] [drm] ring test on 0 succeeded in 1 usecs
[1.624429] [drm] ring test on 3 

Re: [PATCH] amdgpu/dc: Use DRM new-style object iterators.

2017-10-12 Thread Leo



On 2017-10-12 02:00 AM, Maarten Lankhorst wrote:

Op 11-10-17 om 22:40 schreef Harry Wentland:

On 2017-10-11 03:46 PM, Maarten Lankhorst wrote:

Op 11-10-17 om 20:55 schreef Leo:


On 2017-10-11 10:30 AM, Maarten Lankhorst wrote:

Op 11-10-17 om 16:24 schreef sunpeng...@amd.com:

From: "Leo (Sunpeng) Li" 

Use the correct for_each_new/old_* iterators instead of for_each_*

List of affected functions:

amdgpu_dm_find_first_crtc_matching_connector: use for_each_new
  - Old from_state_var flag was always choosing the new state

amdgpu_dm_display_resume: use for_each_new
  - drm_atomic_helper_duplicate_state is called during suspend to
cache the state
  - It sets 'state' within the state triplet to 'new_state'

amdgpu_dm_commit_planes: use for_each_old
  - Called after the state was swapped (via atomic commit tail)

amdgpu_dm_atomic_commit: use for_each_new
  - Called before the state is swapped

amdgpu_dm_atomic_commit_tail: use for_each_old
  - Called after the state was swapped

dm_update_crtcs_state: use for_each_new
  - Called before the state is swapped (via atomic check)

amdgpu_dm_atomic_check: use for_each_new
  - Called before the state is swapped

Also fix some typos.

crct -> crtc
undersacn -> underscan

Signed-off-by: Leo (Sunpeng) Li 
---

Hi Dave,

This patch goes on top of your fixup for new api's patch. Please feel
free to squash them.

Thanks,
Leo

   drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c | 38 
+--
   drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.h |  5 ++-
   2 files changed, 16 insertions(+), 27 deletions(-)

diff --git a/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c 
b/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
index 9bfe1f9..9394558 100644
--- a/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
+++ b/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
@@ -568,25 +568,17 @@ static int dm_suspend(void *handle)
   return ret;
   }
   -struct amdgpu_dm_connector *amdgpu_dm_find_first_crct_matching_connector(
+struct amdgpu_dm_connector *amdgpu_dm_find_first_crtc_matching_connector(
   struct drm_atomic_state *state,
-struct drm_crtc *crtc,
-bool from_state_var)
+struct drm_crtc *crtc)
   {
   uint32_t i;
   struct drm_connector_state *conn_state;
   struct drm_connector *connector;
   struct drm_crtc *crtc_from_state;
   -for_each_new_connector_in_state(
-state,
-connector,
-conn_state,
-i) {
-crtc_from_state =
-from_state_var ?
-conn_state->crtc :
-connector->state->crtc;
+for_each_new_connector_in_state(state, connector, conn_state, i) {
+crtc_from_state = conn_state->crtc;

Please assign crtc_from_state here with drm_atomic_get_(new/old)_crtc_state.

We're trying to fetch a drm_crtc from a drm_connector_state; I don't think the 
state getters are applicable here.

Also, the function should really be named 'find_first_connector_matching_crtc'. 
I'll fix that.

Oh, I misunderstood. But in general, dereferencing $obj->state should be frowned 
upon. If you ever want to support atomic flip depth > 1,
all those references should be gone from your driver and replaced with 
get_old/new_state...

If I understand correctly this is the forward-looking way of how we get the
state now? I still see plenty of $obj->state in i915 and other drivers.

Any objections to doing this as a follow-up patch?

What's atomic flip depth > 1?


That we allow multiple uncompleted nonblocking atomic commits to the same crtc,
which requires that during commit, $obj->state is not the same as
new_$obj_state any more. It can be set to an even newer state, but the current
commit has to set the state that is in state->$obj[i].new_state. This can be
obtained with either the iterators, or drm_atomic_get_new_$obj_state.

This is why we got rid of the old iterators; get_existing_state and $obj->state
were no longer sufficient for this.

Using drm_atomic_get_old/new_obj_state should be a separate patch, but can be 
fixed
opportunistically.

But something like

for_each_old_crtc_in_state(...) {
new_crtc_state = crtc->state


}

should definitely be fixed in this patch. It's what the iterators are for. :)

I know i915 is still dereferencing $obj->state, but we're trying to slowly fix 
these
when we come across them. This usually involves passing the correct state 
object up
the callchain, which can be quite deep.

Cheers,
Maarten


Thanks for the clarification.

We're in a similar situation with some of the deeply nested functions, 
most of which don't take in the state object. That said, there are a few 
functions lower in the call stack that do take the correct atomic 
state (like the ones you've pointed out). We can target those for now, 
and more as the opportunity comes.


Leo




Harry




   if (crtc_from_state == crtc)

Re: [PATCH 5/5] drm/amd/sched: signal and free remaining fences in amd_sched_entity_fini

2017-10-12 Thread Michel Dänzer
On 12/10/17 01:44 PM, Christian König wrote:
> Am 12.10.2017 um 13:00 schrieb Michel Dänzer:
> 
>> Anyway, unless anyone knows which commits from amd-staging-drm-next are
>> needed to make 1d00402b4da2 stable in 4.14, the safe course of action
>> seems to be reverting it (and ac7afe6b3cf3, which depends on it)?
> 
> The amdgpu_ttm_bind change should be fixed by "70a9c6b drm/amdgpu: fix
> placement flags in amdgpu_ttm_bind".

Indeed, that fixes it for me.

> But I've assumed they went both into 4.14.

Unfortunately, it looks like only 1d00402b4da2 made it into 4.14. Alex,
please send a fixes pull for 4.14 with a backport of 70a9c6b.

For the other issue, do we want to backport Nicolai's commits
6b37d03280a4..318d85de9c20 or revert 6af0883ed977?

Christian, can you check that there are no other fixes missing from 4.14?


BTW, this raises an issue: Since we push both fixes and new development
work to the same internal branch, sometimes it isn't clear which changes
should go upstream via -fixes or -next. Any ideas for mitigating the
risk of missing an important fix?


-- 
Earthling Michel Dänzer   |   http://www.amd.com
Libre software enthusiast | Mesa and X developer
___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx


Re: TDR and VRAM lost handling in KMD (v2)

2017-10-12 Thread Nicolai Hähnle

On 12.10.2017 13:49, Christian König wrote:

Am 12.10.2017 um 13:37 schrieb Liu, Monk:

Hi team
Very good, most of the policy and implementation details are agreed; it 
looks like we only disagree on amdgpu_ctx_query(), and I'm also confused 
by its current implementation, ☹
First, I want to know if you guys agree that we *don't update 
ctx->reset_counter in amdgpu_ctx_query()*? Because I want to make the 
query result always consistent for a given context,


That sounds like a good idea to me, but I'm not sure if it won't break 
userspace (I don't think so). Nicolai or Marek need to comment.


That should be fine for Mesa.


Second, I want to say that the KERNEL shouldn't use terms from MESA, OGL, 
or VULKAN; e.g. the kernel shouldn't use AMDGPU_INNOCENT_RESET 
to map to GL_INNOCENT_RESET_ARB, etc...
Because that way the kernel becomes tied to certain UMDs, so I 
suggest we totally re-name the context status, and each UMD has its 
own way to map the kernel context's result to gl-context/vk-context/etc…


Yes, completely agree.


Kernel should only provide the below **FLAG bits** on a given context:

  * Define AMDGPU_CTX_STATE_GUILTY 0x1 // as soon as TDR
detects a job hang, KMD marks the context behind that job as
"guilty"
  * Define AMDGPU_CTX_STATE_VRAMLOST 0x2 // as soon as
a VRAM lost event occurs after this context was created, we mark
this context "VRAM_LOST", so UMD can assume that all BOs under this
context may have lost their content; since the kernel has no
relationship between contexts and BOs, it is UMD's call to judge
whether a BO is considered "VRAM lost" or not.
  * Define AMDGPU_CTX_STATE_RESET 0x3 // as soon as a
gpu reset occurs after context creation, this flag shall be set



That sounds sane, but unfortunately might not be possible with the 
existing IOCTL. Keep in mind that we need to keep backward compatibility 
here.


Agreed. FWIW, when Mesa sees an unknown (new) value being returned from 
amdgpu_ctx_query, it treats it the same as AMDGPU_CTX_NO_RESET.


Cheers,
Nicolai



Sample code:

int amdgpu_ctx_query(struct amdgpu_ctx_query_parm *out, ...) {
    if (ctx->vram_lost_counter != adev->vram_lost_counter)
        out->status |= AMDGPU_CTX_STATE_VRAMLOST;
    if (ctx->reset_counter != adev->reset_counter) {
        out->status |= AMDGPU_CTX_STATE_RESET;
        if (ctx->guilty == TRUE)
            out->status |= AMDGPU_CTX_STATE_GUILTY;
    }
    return 0;
}

For UMD, if it finds "out.status == 0", it means no gpu reset has ever 
occurred and the context is completely regular,


  * *A new IOCTL added for context:*

void amdgpu_ctx_reinit() {
    ctx->vram_lost_counter = adev->vram_lost_counter;
    ctx->reset_counter = adev->reset_counter;
}


Mhm, is there any advantage to just creating a new context?

Regards,
Christian.

*If UMD decides *not* to release the "guilty" context but to continue 
using it after UMD acknowledged the GPU hang on a certain job/context, 
I suggest UMD call "amdgpu_ctx_reinit()":*
That way, after you re-init() this context, you get an updated result 
from "amdgpu_ctx_query", which will give you "out.status == 0" as long 
as no gpu reset/vram lost hits after the re-init().

BR Monk
-Original Message-
From: Koenig, Christian
Sent: 2017年10月12日 18:13
To: Haehnle, Nicolai ; Michel Dänzer 
; Liu, Monk ; Olsak, Marek 
; Deucher, Alexander ; 
Zhou, David(ChunMing) ; Mao, David 

Cc: Ramirez, Alejandro ; 
amd-gfx@lists.freedesktop.org; Filipas, Mario ; 
Ding, Pixel ; Li, Bingley ; 
Jiang, Jerry (SW) 

Subject: Re: TDR and VRAM lost handling in KMD (v2)
Am 12.10.2017 um 11:44 schrieb Nicolai Hähnle:
> On 12.10.2017 11:35, Michel Dänzer wrote:
>> On 12/10/17 11:23 AM, Christian König wrote:
>>> Am 12.10.2017 um 11:10 schrieb Nicolai Hähnle:
 On 12.10.2017 10:49, Christian König wrote:
>> However, !guilty && ctx->reset_counter != adev->reset_counter
>> does not imply that the context was lost.
>>
>> The way I understand it, we should return
>> AMDGPU_CTX_INNOCENT_RESET if !guilty && ctx->vram_lost_counter != 
adev->vram_lost_counter.
>>
>> As far as I understand it, the case of !guilty &&
>> ctx->reset_counter != adev->reset_counter &&
>> ctx->vram_lost_counter ==
>> adev->vram_lost_counter should return AMDGPU_CTX_NO_RESET, because a
>> GPU reset occurred, but it didn't affect our context.
> I disagree on that.
>
> AMDGPU_CTX_INNOCENT_RESET just means what it does currently, there
> was a reset but we haven't been causing it.
>
> That the OpenGL extension is specified otherwise is unfortunate,
> but I think we shouldn't use that for the kernel interface here.

Re: [PATCH] drm/ttm: Fix unused variables with huge page support

2017-10-12 Thread Christian König

Am 12.10.2017 um 13:26 schrieb Tom St Denis:

Signed-off-by: Tom St Denis 


Reviewed-by: Christian König 


---
  drivers/gpu/drm/ttm/ttm_page_alloc.c | 9 -
  1 file changed, 8 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/ttm/ttm_page_alloc.c 
b/drivers/gpu/drm/ttm/ttm_page_alloc.c
index b6f16e73..95022473704b 100644
--- a/drivers/gpu/drm/ttm/ttm_page_alloc.c
+++ b/drivers/gpu/drm/ttm/ttm_page_alloc.c
@@ -723,7 +723,9 @@ static void ttm_put_pages(struct page **pages, unsigned 
npages, int flags,
  enum ttm_caching_state cstate)
  {
struct ttm_page_pool *pool = ttm_get_pool(flags, false, cstate);
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
struct ttm_page_pool *huge = ttm_get_pool(flags, true, cstate);
+#endif
unsigned long irq_flags;
unsigned i;
  
@@ -825,7 +827,9 @@ static int ttm_get_pages(struct page **pages, unsigned npages, int flags,

 enum ttm_caching_state cstate)
  {
struct ttm_page_pool *pool = ttm_get_pool(flags, false, cstate);
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
struct ttm_page_pool *huge = ttm_get_pool(flags, true, cstate);
+#endif
struct list_head plist;
struct page *p = NULL;
unsigned count;
@@ -834,7 +838,10 @@ static int ttm_get_pages(struct page **pages, unsigned 
npages, int flags,
/* No pool for cached pages */
if (pool == NULL) {
gfp_t gfp_flags = GFP_USER;
-   unsigned i, j;
+   unsigned i;
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+   unsigned j;
+#endif
  
  		/* set zero flag for page allocation if required */

if (flags & TTM_PAGE_FLAG_ZERO_ALLOC)





Re: TDR and VRAM lost handling in KMD (v2)

2017-10-12 Thread Christian König

Am 12.10.2017 um 13:37 schrieb Liu, Monk:

Hi team
Very good, most of the policy and implementation details are agreed; it 
looks like we only disagree on amdgpu_ctx_query(), and I'm also confused 
by its current implementation, ☹
First, I want to know if you guys agree that we *don't update 
ctx->reset_counter in amdgpu_ctx_query()*? Because I want to make the 
query result always consistent for a given context,


That sounds like a good idea to me, but I'm not sure if it won't break 
userspace (I don't think so). Nicolai or Marek need to comment.


Second, I want to say that the KERNEL shouldn't use terms from MESA, OGL, 
or VULKAN; e.g. the kernel shouldn't use AMDGPU_INNOCENT_RESET 
to map to GL_INNOCENT_RESET_ARB, etc...
Because that way the kernel becomes tied to certain UMDs, so I 
suggest we totally re-name the context status, and each UMD has its 
own way to map the kernel context's result to gl-context/vk-context/etc…


Yes, completely agree.


Kernel should only provide the below **FLAG bits** on a given context:

  * Define AMDGPU_CTX_STATE_GUILTY 0x1 // as soon as TDR
detects a job hang, KMD marks the context behind that job as
"guilty"
  * Define AMDGPU_CTX_STATE_VRAMLOST 0x2 // as soon as
a VRAM lost event occurs after this context was created, we mark
this context "VRAM_LOST", so UMD can assume that all BOs under this
context may have lost their content; since the kernel has no
relationship between contexts and BOs, it is UMD's call to judge
whether a BO is considered "VRAM lost" or not.
  * Define AMDGPU_CTX_STATE_RESET 0x3 // as soon as a
gpu reset occurs after context creation, this flag shall be set



That sounds sane, but unfortunately might not be possible with the 
existing IOCTL. Keep in mind that we need to keep backward compatibility 
here.



Sample code:

int amdgpu_ctx_query(struct amdgpu_ctx_query_parm *out, ...) {
    if (ctx->vram_lost_counter != adev->vram_lost_counter)
        out->status |= AMDGPU_CTX_STATE_VRAMLOST;
    if (ctx->reset_counter != adev->reset_counter) {
        out->status |= AMDGPU_CTX_STATE_RESET;
        if (ctx->guilty == TRUE)
            out->status |= AMDGPU_CTX_STATE_GUILTY;
    }
    return 0;
}

For UMD, if it finds "out.status == 0", it means no gpu reset has ever 
occurred and the context is completely regular,


  * *A new IOCTL added for context:*

void amdgpu_ctx_reinit() {
    ctx->vram_lost_counter = adev->vram_lost_counter;
    ctx->reset_counter = adev->reset_counter;
}


Mhm, is there any advantage to just creating a new context?

Regards,
Christian.

*If UMD decides *not* to release the "guilty" context but to continue 
using it after UMD acknowledged the GPU hang on a certain job/context, 
I suggest UMD call "amdgpu_ctx_reinit()":*
That way, after you re-init() this context, you get an updated result 
from "amdgpu_ctx_query", which will give you "out.status == 0" as long 
as no gpu reset/vram lost hits after the re-init().

BR Monk
-Original Message-
From: Koenig, Christian
Sent: 2017年10月12日 18:13
To: Haehnle, Nicolai ; Michel Dänzer 
; Liu, Monk ; Olsak, Marek 
; Deucher, Alexander ; 
Zhou, David(ChunMing) ; Mao, David 

Cc: Ramirez, Alejandro ; 
amd-gfx@lists.freedesktop.org; Filipas, Mario ; 
Ding, Pixel ; Li, Bingley ; 
Jiang, Jerry (SW) 

Subject: Re: TDR and VRAM lost handling in KMD (v2)
Am 12.10.2017 um 11:44 schrieb Nicolai Hähnle:
> On 12.10.2017 11:35, Michel Dänzer wrote:
>> On 12/10/17 11:23 AM, Christian König wrote:
>>> Am 12.10.2017 um 11:10 schrieb Nicolai Hähnle:
 On 12.10.2017 10:49, Christian König wrote:
>> However, !guilty && ctx->reset_counter != adev->reset_counter
>> does not imply that the context was lost.
>>
>> The way I understand it, we should return
>> AMDGPU_CTX_INNOCENT_RESET if !guilty && ctx->vram_lost_counter != 
adev->vram_lost_counter.
>>
>> As far as I understand it, the case of !guilty &&
>> ctx->reset_counter != adev->reset_counter &&
>> ctx->vram_lost_counter ==
>> adev->vram_lost_counter should return AMDGPU_CTX_NO_RESET, because a
>> GPU reset occurred, but it didn't affect our context.
> I disagree on that.
>
> AMDGPU_CTX_INNOCENT_RESET just means what it does currently, there
> was a reset but we haven't been causing it.
>
> That the OpenGL extension is specified otherwise is unfortunate,
> but I think we shouldn't use that for the kernel interface here.
 Two counterpoints:

 1. Why should any application care that there was a reset while it
 was idle? The OpenGL behavior is what makes sense.
>>>
>>> The application is certainly not interested in whether a reset happened or
>>> 

Re: [PATCH 5/5] drm/amd/sched: signal and free remaining fences in amd_sched_entity_fini

2017-10-12 Thread Christian König

Am 12.10.2017 um 13:00 schrieb Michel Dänzer:

On 12/10/17 10:05 AM, Christian König wrote:
[SNIP]

Is amd-staging-drm-next stable for you?

It seemed stable before the changes you pushed this morning. :) As of
cfb6dee86711 "drm/ttm: add transparent huge page support for cached
allocations v2", I get a flood of

  [TTM] Erroneous page count. Leaking pages.

in dmesg while running piglit, and it eventually hangs[0].


Great, going to take a look.


Anyway, unless anyone knows which commits from amd-staging-drm-next are
needed to make 1d00402b4da2 stable in 4.14, the safe course of action
seems to be reverting it (and ac7afe6b3cf3, which depends on it)?


The amdgpu_ttm_bind change should be fixed by "70a9c6b drm/amdgpu: fix 
placement flags in amdgpu_ttm_bind". But I've assumed they went both 
into 4.14.


Thanks for the bugreport,
Christian.




[0] I also got this, but I don't know yet if it's related:

  BUG: unable to handle kernel NULL pointer dereference at 0220
  IP: amdgpu_vm_bo_invalidate+0x88/0x210 [amdgpu]
  PGD 0
  P4D 0
  
  Oops:  [#1] SMP

  Modules linked in: cpufreq_powersave cpufreq_userspace cpufreq_conservative 
amdkfd(O) edac_mce_amd kvm amdgpu(O) irqbypass crct10dif_pclmul crc32_pclmul 
chash snd_hda_codec_realtek ghash_clmulni_intel snd_hda_codec_generic 
snd_hda_codec_hdmi pcbc binfmt_misc ttm(O) efi_pstore snd_hda_intel 
drm_kms_helper(O) snd_hda_codec nls_ascii drm(O) snd_hda_core nls_cp437 
i2c_algo_bit aesni_intel snd_hwdep fb_sys_fops aes_x86_64 crypto_simd vfat 
syscopyarea glue_helper sysfillrect snd_pcm fat sysimgblt sp5100_tco wmi_bmof 
ppdev r8169 snd_timer cryptd pcspkr efivars mfd_core mii ccp i2c_piix4 snd 
soundcore rng_core sg wmi parport_pc parport i2c_designware_platform 
i2c_designware_core button acpi_cpufreq tcp_bbr sch_fq sunrpc nct6775 hwmon_vid 
efivarfs ip_tables x_tables autofs4 ext4 crc16 mbcache
   jbd2 fscrypto raid10 raid1 raid0 multipath linear md_mod dm_mod sd_mod evdev 
hid_generic usbhid hid crc32c_intel ahci libahci xhci_pci libata xhci_hcd 
scsi_mod usbcore shpchp gpio_amdpt gpio_generic
  CPU: 13 PID: 1075 Comm: max-texture-siz Tainted: GW  O4.13.0-rc5+ 
#28
  Hardware name: Micro-Star International Co., Ltd. MS-7A34/B350 TOMAHAWK 
(MS-7A34), BIOS 1.80 09/13/2017
  task: 9d2982c75a00 task.stack: b2744e9bc000
  RIP: 0010:amdgpu_vm_bo_invalidate+0x88/0x210 [amdgpu]
  RSP: 0018:b2744e9bf6e8 EFLAGS: 00010202
  RAX:  RBX: 9d2848642820 RCX: 9d28c77fdae0
  RDX: 0001 RSI: 9d28c77fd800 RDI: 9d288f286008
  RBP: b2744e9bf728 R08: 00ff R09: 
  R10: 0078 R11: 9d298ba170a0 R12: 9d28c77fd800
  R13: 0001 R14: 9d288f286000 R15: 9d2848642800
  FS:  7f809fc5c300() GS:9d298e94() knlGS:
  CS:  0010 DS:  ES:  CR0: 80050033
  CR2: 0220 CR3: 00030e05a000 CR4: 003406e0
  Call Trace:
   amdgpu_bo_move_notify+0x42/0xd0 [amdgpu]
   ttm_bo_unmap_virtual_locked+0x298/0xac0 [ttm]
   ? ttm_bo_mem_space+0x391/0x580 [ttm]
   ttm_bo_unmap_virtual_locked+0x737/0xac0 [ttm]
   ttm_bo_unmap_virtual_locked+0xa6f/0xac0 [ttm]
   ttm_bo_mem_space+0x306/0x580 [ttm]
   ttm_bo_validate+0xd4/0x150 [ttm]
   ttm_bo_init_reserved+0x22e/0x440 [ttm]
   amdgpu_ttm_placement_from_domain+0x33c/0x580 [amdgpu]
   ? amdgpu_fill_buffer+0x300/0x420 [amdgpu]
   amdgpu_bo_create+0x50/0x2b0 [amdgpu]
   amdgpu_gem_object_create+0x9f/0x110 [amdgpu]
   amdgpu_gem_create_ioctl+0x12f/0x270 [amdgpu]
   ? amdgpu_gem_object_close+0x210/0x210 [amdgpu]
   drm_ioctl_kernel+0x5d/0xf0 [drm]
   drm_ioctl+0x32a/0x630 [drm]
   ? amdgpu_gem_object_close+0x210/0x210 [amdgpu]
   ? lru_cache_add_active_or_unevictable+0x36/0xb0
   ? __handle_mm_fault+0x90d/0xff0
   amdgpu_drm_ioctl+0x4f/0x1c20 [amdgpu]
   do_vfs_ioctl+0xa5/0x600
   ? handle_mm_fault+0xd8/0x230
   ? __do_page_fault+0x267/0x4c0
   SyS_ioctl+0x79/0x90
   entry_SYSCALL_64_fastpath+0x1e/0xa9
  RIP: 0033:0x7f809c8f3dc7
  RSP: 002b:7ffcc8c485f8 EFLAGS: 0246 ORIG_RAX: 0010
  RAX: ffda RBX: 7f809cbaab00 RCX: 7f809c8f3dc7
  RDX: 7ffcc8c48640 RSI: c0206440 RDI: 0006
  RBP: 4010 R08: 7f809cbaabe8 R09: 0060
  R10: 0004 R11: 0246 R12: 40001000
  R13: 7f809cbaab58 R14: 1000 R15: 7f809cbaab00
  Code: 49 8b 47 10 48 39 45 d0 4c 8d 78 f0 0f 84 87 00 00 00 4d 8b 37 45 84 ed 41 c6 
47 30 01 49 8d 5f 20 49 8d 7e 08 74 19 49 8b 46 58 <48> 8b 80 20 02 00 00 49 39 
84 24 20 02 00 00 0f 84 ab 00 00 00
  RIP: amdgpu_vm_bo_invalidate+0x88/0x210 [amdgpu] RSP: b2744e9bf6e8
  CR2: 0220






RE: TDR and VRAM lost handling in KMD (v2)

2017-10-12 Thread Liu, Monk
Hi team

Very good, most of the policy and implementation details are agreed; it looks 
like we only disagree on amdgpu_ctx_query(), and I'm also confused by its 
current implementation, ☹

First, I want to know if you guys agree that we *don't update ctx->reset_counter 
in amdgpu_ctx_query()*? Because I want to make the query result always 
consistent for a given context,

Second, I want to say that the KERNEL shouldn't use terms from MESA, OGL, or 
VULKAN; e.g. the kernel shouldn't use AMDGPU_INNOCENT_RESET to map to 
GL_INNOCENT_RESET_ARB, etc...
Because that way the kernel becomes tied to certain UMDs, so I suggest we 
totally re-name the context status, and each UMD has its own way to map the 
kernel context's result to gl-context/vk-context/etc…

Kernel should only provide the below *FLAG bits* on a given context:

•   Define AMDGPU_CTX_STATE_GUILTY  0x1 // as soon as TDR 
detects a job hang, KMD marks the context behind that job as "guilty"
•   Define AMDGPU_CTX_STATE_VRAMLOST    0x2 // as soon as a 
VRAM lost event occurs after this context was created, we mark this context 
"VRAM_LOST", so UMD can assume that all BOs under this context may have lost 
their content; since the kernel has no relationship between contexts and BOs, 
it is UMD's call to judge whether a BO is considered "VRAM lost" or not.
•   Define AMDGPU_CTX_STATE_RESET   0x3 // as soon as a gpu 
reset occurs after context creation, this flag shall be set

Sample code:

int amdgpu_ctx_query(struct amdgpu_ctx_query_parm *out, ...) {
    if (ctx->vram_lost_counter != adev->vram_lost_counter)
        out->status |= AMDGPU_CTX_STATE_VRAMLOST;

    if (ctx->reset_counter != adev->reset_counter) {
        out->status |= AMDGPU_CTX_STATE_RESET;

        if (ctx->guilty == TRUE)
            out->status |= AMDGPU_CTX_STATE_GUILTY;
    }

    return 0;
}

For UMD, if it finds "out.status == 0", it means no gpu reset has ever 
occurred and the context is completely regular,

•   A new IOCTL added for context:

void amdgpu_ctx_reinit() {
    ctx->vram_lost_counter = adev->vram_lost_counter;
    ctx->reset_counter = adev->reset_counter;
}


If UMD decides *not* to release the "guilty" context but to continue using it 
after UMD acknowledged the GPU hang on a certain job/context, I suggest UMD 
call "amdgpu_ctx_reinit()":

That way, after you re-init() this context, you get an updated result from 
"amdgpu_ctx_query", which will give you "out.status == 0" as long as no gpu 
reset/vram lost hits after the re-init().

BR Monk


-Original Message-
From: Koenig, Christian
Sent: 2017年10月12日 18:13
To: Haehnle, Nicolai ; Michel Dänzer 
; Liu, Monk ; Olsak, Marek 
; Deucher, Alexander ; Zhou, 
David(ChunMing) ; Mao, David 
Cc: Ramirez, Alejandro ; 
amd-gfx@lists.freedesktop.org; Filipas, Mario ; Ding, 
Pixel ; Li, Bingley ; Jiang, Jerry (SW) 

Subject: Re: TDR and VRAM lost handling in KMD (v2)

Am 12.10.2017 um 11:44 schrieb Nicolai Hähnle:
> On 12.10.2017 11:35, Michel Dänzer wrote:
>> On 12/10/17 11:23 AM, Christian König wrote:
>>> Am 12.10.2017 um 11:10 schrieb Nicolai Hähnle:
 On 12.10.2017 10:49, Christian König wrote:
>> However, !guilty && ctx->reset_counter != adev->reset_counter
>> does not imply that the context was lost.
>>
>> The way I understand it, we should return
>> AMDGPU_CTX_INNOCENT_RESET if !guilty && ctx->vram_lost_counter != 
>> adev->vram_lost_counter.
>>
>> As far as I understand it, the case of !guilty &&
>> ctx->reset_counter != adev->reset_counter &&
>> ctx->vram_lost_counter ==
>> adev->vram_lost_counter should return AMDGPU_CTX_NO_RESET,
>> adev->because a
>> GPU reset occurred, but it didn't affect our context.
> I disagree on that.
>
> AMDGPU_CTX_INNOCENT_RESET just means what it does currently, there
> was a reset but we haven't been causing it.
>
> That the OpenGL extension is specified otherwise is unfortunate,
> but I think we shouldn't use that for the kernel interface here.
 Two counterpoints:

 1. Why should any application care that there was a reset while it
 was idle? The OpenGL behavior is what makes sense.
>>>
>>> The application is certainly not interested in whether a reset happened or
>>> not, but I thought that the driver stack might be.
>>>

 2. AMDGPU_CTX_INNOCENT_RESET doesn't actually mean anything today
 because we never return it :)

>>>
>>> Good point.
>>>
 amdgpu_ctx_query only ever returns AMDGPU_CTX_UNKNOWN_RESET, which
 is in line with the OpenGL spec: we're conservatively returning
 that a reset happened because we don't know whether the context 

[PATCH] drm/ttm: Fix unused variables with huge page support

2017-10-12 Thread Tom St Denis
Signed-off-by: Tom St Denis 
---
 drivers/gpu/drm/ttm/ttm_page_alloc.c | 9 -
 1 file changed, 8 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/ttm/ttm_page_alloc.c 
b/drivers/gpu/drm/ttm/ttm_page_alloc.c
index b6f16e73..95022473704b 100644
--- a/drivers/gpu/drm/ttm/ttm_page_alloc.c
+++ b/drivers/gpu/drm/ttm/ttm_page_alloc.c
@@ -723,7 +723,9 @@ static void ttm_put_pages(struct page **pages, unsigned 
npages, int flags,
  enum ttm_caching_state cstate)
 {
struct ttm_page_pool *pool = ttm_get_pool(flags, false, cstate);
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
struct ttm_page_pool *huge = ttm_get_pool(flags, true, cstate);
+#endif
unsigned long irq_flags;
unsigned i;
 
@@ -825,7 +827,9 @@ static int ttm_get_pages(struct page **pages, unsigned 
npages, int flags,
 enum ttm_caching_state cstate)
 {
struct ttm_page_pool *pool = ttm_get_pool(flags, false, cstate);
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
struct ttm_page_pool *huge = ttm_get_pool(flags, true, cstate);
+#endif
struct list_head plist;
struct page *p = NULL;
unsigned count;
@@ -834,7 +838,10 @@ static int ttm_get_pages(struct page **pages, unsigned 
npages, int flags,
/* No pool for cached pages */
if (pool == NULL) {
gfp_t gfp_flags = GFP_USER;
-   unsigned i, j;
+   unsigned i;
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+   unsigned j;
+#endif
 
/* set zero flag for page allocation if required */
if (flags & TTM_PAGE_FLAG_ZERO_ALLOC)
-- 
2.12.0



Re: [PATCH 5/5] drm/amd/sched: signal and free remaining fences in amd_sched_entity_fini

2017-10-12 Thread Michel Dänzer
On 12/10/17 10:05 AM, Christian König wrote:
> Am 11.10.2017 um 18:30 schrieb Michel Dänzer:
>> On 28/09/17 04:55 PM, Nicolai Hähnle wrote:
>>> From: Nicolai Hähnle 
>>>
>>> Highly concurrent Piglit runs can trigger a race condition where a
>>> pending
>>> SDMA job on a buffer object is never executed because the corresponding
>>> process is killed (perhaps due to a crash). Since the job's fences were
>>> never signaled, the buffer object was effectively leaked. Worse, the
>>> buffer was stuck wherever it happened to be at the time, possibly in
>>> VRAM.
>>>
>>> The symptom was user space processes stuck in interruptible waits with
>>> kernel stacks like:
>>>
>>>  [] dma_fence_default_wait+0x112/0x250
>>>  [] dma_fence_wait_timeout+0x39/0xf0
>>>  []
>>> reservation_object_wait_timeout_rcu+0x1c2/0x300
>>>  [] ttm_bo_cleanup_refs_and_unlock+0xff/0x1a0
>>> [ttm]
>>>  [] ttm_mem_evict_first+0xba/0x1a0 [ttm]
>>>  [] ttm_bo_mem_space+0x341/0x4c0 [ttm]
>>>  [] ttm_bo_validate+0xd4/0x150 [ttm]
>>>  [] ttm_bo_init_reserved+0x2ed/0x420 [ttm]
>>>  [] amdgpu_bo_create_restricted+0x1f3/0x470
>>> [amdgpu]
>>>  [] amdgpu_bo_create+0xda/0x220 [amdgpu]
>>>  [] amdgpu_gem_object_create+0xaa/0x140 [amdgpu]
>>>  [] amdgpu_gem_create_ioctl+0x97/0x120 [amdgpu]
>>>  [] drm_ioctl+0x1fa/0x480 [drm]
>>>  [] amdgpu_drm_ioctl+0x4f/0x90 [amdgpu]
>>>  [] do_vfs_ioctl+0xa3/0x5f0
>>>  [] SyS_ioctl+0x79/0x90
>>>  [] entry_SYSCALL_64_fastpath+0x1e/0xad
>>>  [] 0x
>>>
>>> Signed-off-by: Nicolai Hähnle 
>>> Acked-by: Christian König 
>> Since Christian's commit which introduced the problem (6af0883ed977
>> "drm/amdgpu: discard commands of killed processes") is in 4.14, we need
>> a solution for that. Should we backport Nicolai's five commits fixing
>> the problem, or revert 6af0883ed977?

BTW, any preference for this Christian or Nicolai?


>> While looking into this, I noticed that the following commits by
>> Christian in 4.14 each also cause hangs for me when running the piglit
>> gpu profile on Tonga:
>>
>> 457e0fee04b0 "drm/amdgpu: remove the GART copy hack"
>> 1d00402b4da2 "drm/amdgpu: fix amdgpu_ttm_bind"
>>
>> Are there fixes for these that can be backported to 4.14, or do they
>> need to be reverted there?
> Well I'm not aware that any of those two can cause problems.
> 
> For "drm/amdgpu: remove the GART copy hack" I also don't have the
> slightest idea how that could be an issue. It just removes an unused
> code path.

I also thought it's weird, and indeed I can no longer reproduce a hang
with only 457e0fee04b0; but I still can with only 1d00402b4da2. I guess
one of my bisections went wrong and incorrectly identified 457e0fee04b0
instead of 1d00402b4da2.


> Is amd-staging-drm-next stable for you?

It seemed stable before the changes you pushed this morning. :) As of
cfb6dee86711 "drm/ttm: add transparent huge page support for cached
allocations v2", I get a flood of

 [TTM] Erroneous page count. Leaking pages.

in dmesg while running piglit, and it eventually hangs[0].


Anyway, unless anyone knows which commits from amd-staging-drm-next are
needed to make 1d00402b4da2 stable in 4.14, the safe course of action
seems to be reverting it (and ac7afe6b3cf3, which depends on it)?


[0] I also got this, but I don't know yet if it's related:

 BUG: unable to handle kernel NULL pointer dereference at 0220
 IP: amdgpu_vm_bo_invalidate+0x88/0x210 [amdgpu]
 PGD 0 
 P4D 0 
 
 Oops:  [#1] SMP
 Modules linked in: cpufreq_powersave cpufreq_userspace cpufreq_conservative 
amdkfd(O) edac_mce_amd kvm amdgpu(O) irqbypass crct10dif_pclmul crc32_pclmul 
chash snd_hda_codec_realtek ghash_clmulni_intel snd_hda_codec_generic 
snd_hda_codec_hdmi pcbc binfmt_misc ttm(O) efi_pstore snd_hda_intel 
drm_kms_helper(O) snd_hda_codec nls_ascii drm(O) snd_hda_core nls_cp437 
i2c_algo_bit aesni_intel snd_hwdep fb_sys_fops aes_x86_64 crypto_simd vfat 
syscopyarea glue_helper sysfillrect snd_pcm fat sysimgblt sp5100_tco wmi_bmof 
ppdev r8169 snd_timer cryptd pcspkr efivars mfd_core mii ccp i2c_piix4 snd 
soundcore rng_core sg wmi parport_pc parport i2c_designware_platform 
i2c_designware_core button acpi_cpufreq tcp_bbr sch_fq sunrpc nct6775 hwmon_vid 
efivarfs ip_tables x_tables autofs4 ext4 crc16 mbcache
  jbd2 fscrypto raid10 raid1 raid0 multipath linear md_mod dm_mod sd_mod evdev 
hid_generic usbhid hid crc32c_intel ahci libahci xhci_pci libata xhci_hcd 
scsi_mod usbcore shpchp gpio_amdpt gpio_generic
 CPU: 13 PID: 1075 Comm: max-texture-siz Tainted: GW  O4.13.0-rc5+ 
#28
 Hardware name: Micro-Star International Co., Ltd. MS-7A34/B350 TOMAHAWK 
(MS-7A34), BIOS 1.80 09/13/2017
 task: 9d2982c75a00 task.stack: b2744e9bc000
 RIP: 0010:amdgpu_vm_bo_invalidate+0x88/0x210 [amdgpu]
 RSP: 0018:b2744e9bf6e8 EFLAGS: 00010202
 RAX:  RBX: 9d2848642820 

[PATCH] drm/amdgpu: minor CS optimization

2017-10-12 Thread Christian König
From: Christian König 

We only need to loop over all IBs for old UVD/VCE command stream patching.

Signed-off-by: Christian König 
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c | 37 +++++++++++++++++++------------------
 1 file changed, 19 insertions(+), 18 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
index 2ae5d52..20e2a71 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
@@ -855,36 +855,37 @@ static int amdgpu_cs_ib_vm_chunk(struct amdgpu_device 
*adev,
struct amdgpu_fpriv *fpriv = p->filp->driver_priv;
struct amdgpu_vm *vm = &fpriv->vm;
struct amdgpu_ring *ring = p->job->ring;
-   int i, j, r;
-
-   for (i = 0, j = 0; i < p->nchunks && j < p->job->num_ibs; i++) {
-
-   struct amdgpu_cs_chunk *chunk;
-   struct amdgpu_ib *ib;
-   struct drm_amdgpu_cs_chunk_ib *chunk_ib;
-
-   chunk = &p->chunks[i];
-   ib = &p->job->ibs[j];
-   chunk_ib = (struct drm_amdgpu_cs_chunk_ib *)chunk->kdata;
+   int r;
 
-   if (chunk->chunk_id != AMDGPU_CHUNK_ID_IB)
-   continue;
+   /* Only for UVD/VCE VM emulation */
+   if (p->job->ring->funcs->parse_cs) {
+   unsigned i, j;
 
-   if (p->job->ring->funcs->parse_cs) {
+   for (i = 0, j = 0; i < p->nchunks && j < p->job->num_ibs; i++) {
+   struct drm_amdgpu_cs_chunk_ib *chunk_ib;
struct amdgpu_bo_va_mapping *m;
struct amdgpu_bo *aobj = NULL;
+   struct amdgpu_cs_chunk *chunk;
+   struct amdgpu_ib *ib;
uint64_t offset;
uint8_t *kptr;
 
+   chunk = &p->chunks[i];
+   ib = &p->job->ibs[j];
+   chunk_ib = chunk->kdata;
+
+   if (chunk->chunk_id != AMDGPU_CHUNK_ID_IB)
+   continue;
+
r = amdgpu_cs_find_mapping(p, chunk_ib->va_start,
-   &aobj, &m);
+  &aobj, &m);
if (r) {
DRM_ERROR("IB va_start is invalid\n");
return r;
}
 
if ((chunk_ib->va_start + chunk_ib->ib_bytes) >
-   (m->last + 1) * AMDGPU_GPU_PAGE_SIZE) {
+   (m->last + 1) * AMDGPU_GPU_PAGE_SIZE) {
DRM_ERROR("IB va_start+ib_bytes is invalid\n");
return -EINVAL;
}
@@ -901,12 +902,12 @@ static int amdgpu_cs_ib_vm_chunk(struct amdgpu_device 
*adev,
memcpy(ib->ptr, kptr, chunk_ib->ib_bytes);
amdgpu_bo_kunmap(aobj);
 
-   /* Only for UVD/VCE VM emulation */
r = amdgpu_ring_parse_cs(ring, p, j);
if (r)
return r;
+
+   j++;
}
-   j++;
}
 
if (p->job->vm) {
-- 
2.7.4

___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx


Re: TDR and VRAM lost handling in KMD (v2)

2017-10-12 Thread Mao, David
Thanks Monk for the summary!

Hi Nicolai,
In order to block a new context from referencing old allocations, I think we 
need to do something in UMD so that KMD doesn't need to monitor the resource 
list.
I want to make sure we are on the same page.
If you agree, then there might be two options to do that in UMD (you can do 
whatever you want, I just want to elaborate the idea a little bit to facilitate 
the discussion):
- If the share list is valid, the driver needs to compare the current 
vram_lost_count and the share list's vram_lost count; context creation fails if 
the share list was created before the reset.
- Or, we can copy the vram_lost count from the share list, and the kernel will 
fail the submission if that vram_lost count is smaller than the current one.
I personally want to go with the first option for OrcaGL.
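For illustration only, the first option could be sketched like this in 
UMD-side C. All names here (umd_ctx, umd_ctx_create_shared) are hypothetical 
stand-ins, not the real driver code:

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified UMD-side context; the real driver keeps more. */
struct umd_ctx {
    uint32_t vram_lost_count; /* snapshot taken at context creation */
};

/* Option 1: refuse to create a context whose share list predates the
 * most recent VRAM loss. */
static int umd_ctx_create_shared(struct umd_ctx *ctx,
                                 const struct umd_ctx *share_list,
                                 uint32_t current_vram_lost_count)
{
    if (share_list && share_list->vram_lost_count != current_vram_lost_count)
        return -ECANCELED; /* share list allocations are stale */

    ctx->vram_lost_count = current_vram_lost_count;
    return 0;
}
```

With this, a context created from a pre-reset share list simply fails to 
create, and KMD never has to walk the resource list.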

Thanks.
Best Regards,
David
-
On 12 Oct 2017, at 4:03 PM, Liu, Monk 
> wrote:

V2 summary

Hi team

please give your comments

•  When a job times out (set via the lockup_timeout kernel parameter), what KMD 
should do in the TDR routine:

1. Update adev->gpu_reset_counter, and stop the scheduler first
2. Set its fence error status to “ECANCELED”
3. Find the context behind this job, and mark this context as “guilty” 
(there will be a new member field in the context structure – bool guilty)
a)   There will be a “bool *guilty” in the entity structure, pointing to its 
parent context's “bool guilty” member once the context is initialized, so 
whether we have a context or an entity, we always know if it is “guilty”
b)   For the kernel entity used for VM updates, there is no context behind 
it, so the kernel entity's “bool *guilty” is always NULL.
c)   The idea of skipping the whole context is for consistency, because we'll 
fake-signal the hung job in job_run(), so all jobs in its context shall be 
dropped; otherwise we get either bad drawing/compute results or more GPU 
hangs.

4. Do the GPU reset, which can be a set of callbacks so that bare-metal and 
SR-IOV can each implement it in their preferred style
5. After reset, KMD needs to know whether VRAM loss happened or not; 
bare-metal can implement some function to judge, while for SR-IOV I prefer to 
read it from the GIM side (for the initial version we consider it always VRAM 
lost, until the GIM-side change is aligned)
6. If VRAM loss is hit, update adev->vram_lost_counter.
7. Do GTT recovery and shadow buffer recovery.
8. Re-schedule all jobs in the mirror list and restart the scheduler

•  For the GPU scheduler function --- job_run():
1. Before scheduling a job to the ring, check whether job->vram_lost_counter == 
adev->vram_lost_counter, and drop this job on mismatch
2. Before scheduling a job to the ring, check whether job->entity->guilty is 
NULL or not, and drop this job if (guilty != NULL && *guilty == TRUE)
3. If a job is dropped:
a)   set the job's sched_fence status to “ECANCELED”
b)   fake/force signal the job's hw fence (no need to set the hw fence's status)
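As a compilable sketch of the two job_run() checks above (simplified 
stand-in structures, not the actual amdgpu code):

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Simplified stand-ins for the structures in the proposal; field names
 * mirror the discussion but this is not the real amdgpu code. */
struct adev   { uint32_t vram_lost_counter; };
struct entity { bool *guilty; /* NULL for kernel entities (VM updates) */ };
struct job    { uint32_t vram_lost_counter; struct entity *entity; };

/* job_run() drop checks: drop jobs that predate a VRAM loss, and drop
 * jobs whose context has been marked guilty. */
static bool job_should_drop(const struct job *job, const struct adev *adev)
{
    if (job->vram_lost_counter != adev->vram_lost_counter)
        return true;
    return job->entity->guilty && *job->entity->guilty;
}
```

The NULL check on entity->guilty is what keeps kernel entities (which have no 
context behind them) from ever being dropped as guilty.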

•  For the cs_wait() IOCTL:
After it finds the fence signaled, it should check whether there is an error on 
this fence and return that fence's error status

•  For the cs_wait_fences() IOCTL:
Similar to the above approach

•  For the cs_submit() IOCTL:
1. Check whether the current ctx has been marked “guilty” and return 
“ECANCELED” if so.
2. Set job->vram_lost_counter from adev->vram_lost_counter, and return 
“ECANCELED” if ctx->vram_lost_counter != job->vram_lost_counter (Christian 
already submitted this patch)
a)   Discussion: can we return “ENODEV” on a vram_lost_counter mismatch? 
That way UMD knows this context is under “device lost”
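A minimal sketch of these two cs_submit() checks, with simplified placeholder 
structures (not the real kernel code):

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

/* Illustrative stand-ins; the real amdgpu structures differ. */
struct adev { uint32_t vram_lost_counter; };
struct ctx  { bool guilty; uint32_t vram_lost_counter; };

static int cs_submit_check(const struct ctx *ctx, const struct adev *adev)
{
    if (ctx->guilty)
        return -ECANCELED;  /* step 1: guilty context */
    if (ctx->vram_lost_counter != adev->vram_lost_counter)
        return -ECANCELED;  /* step 2; -ENODEV is the open discussion point */
    return 0;
}
```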

•  Introduce a new IOCTL to let UMD query latest adev->vram_lost_counter:

•  For amdgpu_ctx_query():
•  Don't update ctx->reset_counter when querying this function, otherwise the 
query result is not consistent
•  Set out->state.reset_status to “AMDGPU_CTX_GUILTY_RESET” if the ctx is 
“guilty”; no need to check “ctx->reset_counter”
•  Set out->state.reset_status to “AMDGPU_CTX_INNOCENT_RESET” if the ctx isn't 
“guilty” && ctx->reset_counter != adev->reset_counter
•  Set out->state.reset_status to “AMDGPU_CTX_NO_RESET” if ctx->reset_counter 
== adev->reset_counter
•  Set out->state.flags to “AMDGPU_CTX_FLAG_VRAM_LOST” if 
ctx->vram_lost_counter != adev->vram_lost_counter
•  Discussion: can we return “ENODEV” from amdgpu_ctx_query() if 
ctx->vram_lost_counter != adev->vram_lost_counter? That way UMD knows this 
context is under “device lost”
•  UMD shall release this context if its status is AMDGPU_CTX_GUILTY_RESET or 
its flags include “AMDGPU_CTX_FLAG_VRAM_LOST”
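For illustration, the query rules above can be mapped out like this. Enum 
values, flag, and struct layouts are simplified placeholders, not the real 
uapi:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Illustrative mapping of the amdgpu_ctx_query() rules listed above. */
enum reset_status { NO_RESET, GUILTY_RESET, INNOCENT_RESET };
#define FLAG_VRAM_LOST 0x1u

struct adev { uint32_t reset_counter, vram_lost_counter; };
struct ctx  { bool guilty; uint32_t reset_counter, vram_lost_counter; };

static enum reset_status ctx_query_status(const struct ctx *c,
                                          const struct adev *a,
                                          uint32_t *flags)
{
    *flags = (c->vram_lost_counter != a->vram_lost_counter) ?
             FLAG_VRAM_LOST : 0;
    if (c->guilty)
        return GUILTY_RESET;     /* no need to check reset_counter */
    if (c->reset_counter != a->reset_counter)
        return INNOCENT_RESET;
    return NO_RESET;
}
```

Note that the function never writes ctx->reset_counter, keeping repeated 
queries consistent as required above.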

For UMD behavior we still have something to consider:
If MESA creates a new context from an old context (share list? I'm not 
familiar with UMD; David Mao shall discuss it with Nicolai), the newly 
created context's vram_lost_counter and reset_counter shall both be ported 
from that old context; otherwise CS_SUBMIT will not block it, which isn't 
correct
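A minimal sketch of that clone behavior (a hypothetical helper, not an 
existing IOCTL; struct layout is a placeholder):

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical clone helper: the new context inherits the old context's
 * counters so CS_SUBMIT keeps blocking it after a reset/VRAM loss. */
struct ctx { uint32_t vram_lost_counter, reset_counter; };

static void ctx_clone_counters(const struct ctx *old_ctx, struct ctx *new_ctx)
{
    new_ctx->vram_lost_counter = old_ctx->vram_lost_counter;
    new_ctx->reset_counter     = old_ctx->reset_counter;
}
```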



Need your feedback, thx


From: amd-gfx 

Re: TDR and VRAM lost handling in KMD (v2)

2017-10-12 Thread Christian König

Am 12.10.2017 um 11:44 schrieb Nicolai Hähnle:

On 12.10.2017 11:35, Michel Dänzer wrote:

On 12/10/17 11:23 AM, Christian König wrote:

Am 12.10.2017 um 11:10 schrieb Nicolai Hähnle:

On 12.10.2017 10:49, Christian König wrote:

However, !guilty && ctx->reset_counter != adev->reset_counter does
not imply that the context was lost.

The way I understand it, we should return AMDGPU_CTX_INNOCENT_RESET
if !guilty && ctx->vram_lost_counter != adev->vram_lost_counter.

As far as I understand it, the case of !guilty && ctx->reset_counter
!= adev->reset_counter && ctx->vram_lost_counter ==
adev->vram_lost_counter should return AMDGPU_CTX_NO_RESET, because a
GPU reset occurred, but it didn't affect our context.

I disagree on that.

AMDGPU_CTX_INNOCENT_RESET just means what it does currently, there
was a reset but we haven't been causing it.

That the OpenGL extension is specified otherwise is unfortunate, but
I think we shouldn't use that for the kernel interface here.

Two counterpoints:

1. Why should any application care that there was a reset while it was
idle? The OpenGL behavior is what makes sense.


The application is certainly not interested in whether a reset happened or not,
but I thought that the driver stack might be.



2. AMDGPU_CTX_INNOCENT_RESET doesn't actually mean anything today
because we never return it :)



Good point.


amdgpu_ctx_query only ever returns AMDGPU_CTX_UNKNOWN_RESET, which is
in line with the OpenGL spec: we're conservatively returning that a
reset happened because we don't know whether the context was affected,
and we return UNKNOWN because we also don't know whether the context
was guilty or not.

Returning AMDGPU_CTX_NO_RESET in the case of !guilty &&
ctx->vram_lost_counter == adev->vram_lost_counter is simply a
refinement and improvement of the current, overly conservative 
behavior.


Ok let's reenumerate what I think the different return values should 
mean:


* AMDGPU_CTX_GUILTY_RESET

guilty is set to true for this context.

* AMDGPU_CTX_INNOCENT_RESET

guilty is not set and vram_lost_counter has changed.

* AMDGPU_CTX_UNKNOWN_RESET

guilty is not set and vram_lost_counter has not changed, but
gpu_reset_counter has changed.


I don't understand the distinction you're proposing between
AMDGPU_CTX_INNOCENT_RESET and AMDGPU_CTX_UNKNOWN_RESET. I think both
cases you're describing should return either AMDGPU_CTX_INNOCENT_RESET,
if the value of guilty is reliable, or AMDGPU_CTX_UNKNOWN_RESET if 
it's not.


I think it'd make more sense if it was called 
"AMDGPU_CTX_UNAFFECTED_RESET".


So:
- AMDGPU_CTX_GUILTY_RESET --> the context was affected by a reset, and 
we know that it's the context's fault
- AMDGPU_CTX_INNOCENT_RESET --> the context was affected by a reset, 
and we know that it *wasn't* the context's fault (no context job active)
- AMDGPU_CTX_UNKNOWN_RESET --> the context was affected by a reset, 
and we don't know who's responsible (this could be returned in the 
unlikely case where context A's gfx job has not yet finished, but 
context B's gfx job has already started; it could be the fault of A, 
it could be the fault of B -- which somehow manages to hang a part of 
the hardware that then prevents A's job from finishing -- or it could 
be both; but it's a bit academic)
- AMDGPU_CTX_UNAFFECTED_RESET --> there was a reset, but this context 
wasn't affected


This last value would currently just be discarded by Mesa (because we 
should only disturb applications when we have to), but perhaps 
somebody else could find it useful?


Yes, that's what I had in mind as well.

Cause otherwise we would return AMDGPU_CTX_NO_RESET while there actually 
was a reset and that certainly doesn't sound correct to me.


Regards,
Christian.



Cheers,
Nicolai





Re: TDR and VRAM lost handling in KMD (v2)

2017-10-12 Thread Nicolai Hähnle

On 12.10.2017 11:35, Michel Dänzer wrote:

On 12/10/17 11:23 AM, Christian König wrote:

Am 12.10.2017 um 11:10 schrieb Nicolai Hähnle:

On 12.10.2017 10:49, Christian König wrote:

However, !guilty && ctx->reset_counter != adev->reset_counter does
not imply that the context was lost.

The way I understand it, we should return AMDGPU_CTX_INNOCENT_RESET
if !guilty && ctx->vram_lost_counter != adev->vram_lost_counter.

As far as I understand it, the case of !guilty && ctx->reset_counter
!= adev->reset_counter && ctx->vram_lost_counter ==
adev->vram_lost_counter should return AMDGPU_CTX_NO_RESET, because a
GPU reset occurred, but it didn't affect our context.

I disagree on that.

AMDGPU_CTX_INNOCENT_RESET just means what it does currently, there
was a reset but we haven't been causing it.

That the OpenGL extension is specified otherwise is unfortunate, but
I think we shouldn't use that for the kernel interface here.

Two counterpoints:

1. Why should any application care that there was a reset while it was
idle? The OpenGL behavior is what makes sense.


The application is certainly not interested in whether a reset happened or not,
but I thought that the driver stack might be.



2. AMDGPU_CTX_INNOCENT_RESET doesn't actually mean anything today
because we never return it :)



Good point.


amdgpu_ctx_query only ever returns AMDGPU_CTX_UNKNOWN_RESET, which is
in line with the OpenGL spec: we're conservatively returning that a
reset happened because we don't know whether the context was affected,
and we return UNKNOWN because we also don't know whether the context
was guilty or not.

Returning AMDGPU_CTX_NO_RESET in the case of !guilty &&
ctx->vram_lost_counter == adev->vram_lost_counter is simply a
refinement and improvement of the current, overly conservative behavior.


Ok let's reenumerate what I think the different return values should mean:

* AMDGPU_CTX_GUILTY_RESET

guilty is set to true for this context.

* AMDGPU_CTX_INNOCENT_RESET

guilty is not set and vram_lost_counter has changed.

* AMDGPU_CTX_UNKNOWN_RESET

guilty is not set and vram_lost_counter has not changed, but
gpu_reset_counter has changed.


I don't understand the distinction you're proposing between
AMDGPU_CTX_INNOCENT_RESET and AMDGPU_CTX_UNKNOWN_RESET. I think both
cases you're describing should return either AMDGPU_CTX_INNOCENT_RESET,
if the value of guilty is reliable, or AMDGPU_CTX_UNKNOWN_RESET if it's not.


I think it'd make more sense if it was called "AMDGPU_CTX_UNAFFECTED_RESET".

So:
- AMDGPU_CTX_GUILTY_RESET --> the context was affected by a reset, and 
we know that it's the context's fault
- AMDGPU_CTX_INNOCENT_RESET --> the context was affected by a reset, and 
we know that it *wasn't* the context's fault (no context job active)
- AMDGPU_CTX_UNKNOWN_RESET --> the context was affected by a reset, and 
we don't know who's responsible (this could be returned in the unlikely 
case where context A's gfx job has not yet finished, but context B's gfx 
job has already started; it could be the fault of A, it could be the 
fault of B -- which somehow manages to hang a part of the hardware that 
then prevents A's job from finishing -- or it could be both; but it's a 
bit academic)
- AMDGPU_CTX_UNAFFECTED_RESET --> there was a reset, but this context 
wasn't affected


This last value would currently just be discarded by Mesa (because we 
should only disturb applications when we have to), but perhaps somebody 
else could find it useful?


Cheers,
Nicolai


Re: TDR and VRAM lost handling in KMD (v2)

2017-10-12 Thread Michel Dänzer
On 12/10/17 11:23 AM, Christian König wrote:
> Am 12.10.2017 um 11:10 schrieb Nicolai Hähnle:
>> On 12.10.2017 10:49, Christian König wrote:
 However, !guilty && ctx->reset_counter != adev->reset_counter does
 not imply that the context was lost.

 The way I understand it, we should return AMDGPU_CTX_INNOCENT_RESET
 if !guilty && ctx->vram_lost_counter != adev->vram_lost_counter.

 As far as I understand it, the case of !guilty && ctx->reset_counter
 != adev->reset_counter && ctx->vram_lost_counter ==
 adev->vram_lost_counter should return AMDGPU_CTX_NO_RESET, because a
 GPU reset occurred, but it didn't affect our context.
>>> I disagree on that.
>>>
>>> AMDGPU_CTX_INNOCENT_RESET just means what it does currently, there
>>> was a reset but we haven't been causing it.
>>>
>>> That the OpenGL extension is specified otherwise is unfortunate, but
>>> I think we shouldn't use that for the kernel interface here.
>> Two counterpoints:
>>
>> 1. Why should any application care that there was a reset while it was
>> idle? The OpenGL behavior is what makes sense.
> 
> The application is certainly not interested in whether a reset happened or not,
> but I thought that the driver stack might be.
> 
>>
>> 2. AMDGPU_CTX_INNOCENT_RESET doesn't actually mean anything today
>> because we never return it :)
>>
> 
> Good point.
> 
>> amdgpu_ctx_query only ever returns AMDGPU_CTX_UNKNOWN_RESET, which is
>> in line with the OpenGL spec: we're conservatively returning that a
>> reset happened because we don't know whether the context was affected,
>> and we return UNKNOWN because we also don't know whether the context
>> was guilty or not.
>>
>> Returning AMDGPU_CTX_NO_RESET in the case of !guilty &&
>> ctx->vram_lost_counter == adev->vram_lost_counter is simply a
>> refinement and improvement of the current, overly conservative behavior.
> 
> Ok let's reenumerate what I think the different return values should mean:
> 
> * AMDGPU_CTX_GUILTY_RESET
> 
> guilty is set to true for this context.
> 
> * AMDGPU_CTX_INNOCENT_RESET
> 
> guilty is not set and vram_lost_counter has changed.
> 
> * AMDGPU_CTX_UNKNOWN_RESET
> 
> guilty is not set and vram_lost_counter has not changed, but
> gpu_reset_counter has changed.

I don't understand the distinction you're proposing between
AMDGPU_CTX_INNOCENT_RESET and AMDGPU_CTX_UNKNOWN_RESET. I think both
cases you're describing should return either AMDGPU_CTX_INNOCENT_RESET,
if the value of guilty is reliable, or AMDGPU_CTX_UNKNOWN_RESET if it's not.


-- 
Earthling Michel Dänzer   |   http://www.amd.com
Libre software enthusiast | Mesa and X developer


Re: TDR and VRAM lost handling in KMD (v2)

2017-10-12 Thread Christian König

Am 12.10.2017 um 11:10 schrieb Nicolai Hähnle:

On 12.10.2017 10:49, Christian König wrote:
However, !guilty && ctx->reset_counter != adev->reset_counter does 
not imply that the context was lost.


The way I understand it, we should return AMDGPU_CTX_INNOCENT_RESET 
if !guilty && ctx->vram_lost_counter != adev->vram_lost_counter.


As far as I understand it, the case of !guilty && ctx->reset_counter 
!= adev->reset_counter && ctx->vram_lost_counter == 
adev->vram_lost_counter should return AMDGPU_CTX_NO_RESET, because a 
GPU reset occurred, but it didn't affect our context.

I disagree on that.

AMDGPU_CTX_INNOCENT_RESET just means what it does currently, there 
was a reset but we haven't been causing it.


That the OpenGL extension is specified otherwise is unfortunate, but 
I think we shouldn't use that for the kernel interface here.

Two counterpoints:

1. Why should any application care that there was a reset while it was 
idle? The OpenGL behavior is what makes sense.


The application is certainly not interested in whether a reset happened or not, 
but I thought that the driver stack might be.




2. AMDGPU_CTX_INNOCENT_RESET doesn't actually mean anything today 
because we never return it :)




Good point.

amdgpu_ctx_query only ever returns AMDGPU_CTX_UNKNOWN_RESET, which is 
in line with the OpenGL spec: we're conservatively returning that a 
reset happened because we don't know whether the context was affected, 
and we return UNKNOWN because we also don't know whether the context 
was guilty or not.


Returning AMDGPU_CTX_NO_RESET in the case of !guilty && 
ctx->vram_lost_counter == adev->vram_lost_counter is simply a 
refinement and improvement of the current, overly conservative behavior.


Ok let's reenumerate what I think the different return values should mean:

* AMDGPU_CTX_GUILTY_RESET

guilty is set to true for this context.

* AMDGPU_CTX_INNOCENT_RESET

guilty is not set and vram_lost_counter has changed.

* AMDGPU_CTX_UNKNOWN_RESET

guilty is not set and vram_lost_counter has not changed, but 
gpu_reset_counter has changed.


* AMDGPU_CTX_NO_RESET

There wasn't a reset. So neither guilty is set, nor gpu_reset_counter 
nor vram_lost_counter has changed.


If we get to a point where we are certain that a GPU reset without 
losing VRAM is harmless, we can drop using gpu_reset_counter and also 
return AMDGPU_CTX_NO_RESET if we are certain that an application didn't 
do anything nasty. But till then I would rather say we should keep this.


Regards,
Christian.



Cheers,
Nicolai




Regards,
Christian.

Am 12.10.2017 um 10:44 schrieb Nicolai Hähnle:
I think we should stick to the plan where kernel contexts stay 
"stuck" after a GPU reset. This is the most robust behavior for the 
kernel.


Even if the OpenGL spec says that an OpenGL context can be re-used 
without destroying and re-creating it, the UMD can take care of 
re-creating the kernel context.


This means amdgpu_ctx_query should *not* reset ctx->reset_counter.

Cheers,
Nicolai


On 12.10.2017 10:41, Nicolai Hähnle wrote:

Hi Monk,

Thanks for the summary. Most of it looks good to me, though I can't 
speak to all the kernel internals.


Just some comments:

On 12.10.2017 10:03, Liu, Monk wrote:

• For cs_submit() IOCTL:

1. Check whether the current ctx has been marked “*guilty*” and return 
“*ECANCELED*” if so.


2. Set job->*vram_lost_counter* from adev->*vram_lost_counter*, and 
return “*ECANCELED*” if ctx->*vram_lost_counter* != 
job->*vram_lost_counter* (Christian already submitted this patch)


a) Discussion: can we return “ENODEV” on a vram_lost_counter mismatch? 
That way UMD knows this context is under “device lost”


My plan for UMD is to always query the VRAM lost counter when any 
kind of context lost situation is detected. So cs_submit() should 
return an error in this situation, but it could just be ECANCELED. 
We don't need to distinguish between different types of errors here.



• Introduce a new IOCTL to let UMD query the latest 
adev->*vram_lost_counter*:


Christian already sent a patch for this.



• For amdgpu_ctx_query():

– Don't update ctx->reset_counter when querying this function, 
otherwise the query result is not consistent


Hmm. I misremembered part of the spec, see below.


– Set out->state.reset_status to “AMDGPU_CTX_GUILTY_RESET” if the 
ctx is “*guilty*”; no need to check “ctx->reset_counter”


Agreed.


– Set out->state.reset_status to “AMDGPU_CTX_INNOCENT_RESET” if 
the ctx isn't “guilty” && ctx->reset_counter != adev->reset_counter


I disagree. The meaning of AMDGPU_CTX_*_RESET should reflect the 
corresponding enums in user space APIs. I don't know how it works 
in Vulkan, but in OpenGL, returning GL_INNOCENT_CONTEXT_RESET_ARB 
means that the context was lost.


However, !guilty && ctx->reset_counter != adev->reset_counter does 
not imply that the context was lost.


The way I understand it, we should return AMDGPU_CTX_INNOCENT_RESET 
if !guilty && ctx->vram_lost_counter != adev->vram_lost_counter.


As far as I 

Re: TDR and VRAM lost handling in KMD (v2)

2017-10-12 Thread Nicolai Hähnle

On 12.10.2017 10:49, Christian König wrote:
However, !guilty && ctx->reset_counter != adev->reset_counter does not 
imply that the context was lost.


The way I understand it, we should return AMDGPU_CTX_INNOCENT_RESET if 
!guilty && ctx->vram_lost_counter != adev->vram_lost_counter.


As far as I understand it, the case of !guilty && ctx->reset_counter 
!= adev->reset_counter && ctx->vram_lost_counter == 
adev->vram_lost_counter should return AMDGPU_CTX_NO_RESET, because a 
GPU reset occurred, but it didn't affect our context.

I disagree on that.

AMDGPU_CTX_INNOCENT_RESET just means what it does currently, there was a 
reset but we haven't been causing it.


That the OpenGL extension is specified otherwise is unfortunate, but I 
think we shouldn't use that for the kernel interface here.

Two counterpoints:

1. Why should any application care that there was a reset while it was 
idle? The OpenGL behavior is what makes sense.


2. AMDGPU_CTX_INNOCENT_RESET doesn't actually mean anything today 
because we never return it :)


amdgpu_ctx_query only ever returns AMDGPU_CTX_UNKNOWN_RESET, which is in 
line with the OpenGL spec: we're conservatively returning that a reset 
happened because we don't know whether the context was affected, and we 
return UNKNOWN because we also don't know whether the context was guilty 
or not.


Returning AMDGPU_CTX_NO_RESET in the case of !guilty && 
ctx->vram_lost_counter == adev->vram_lost_counter is simply a refinement 
and improvement of the current, overly conservative behavior.


Cheers,
Nicolai




Regards,
Christian.

Am 12.10.2017 um 10:44 schrieb Nicolai Hähnle:
I think we should stick to the plan where kernel contexts stay "stuck" 
after a GPU reset. This is the most robust behavior for the kernel.


Even if the OpenGL spec says that an OpenGL context can be re-used 
without destroying and re-creating it, the UMD can take care of 
re-creating the kernel context.


This means amdgpu_ctx_query should *not* reset ctx->reset_counter.

Cheers,
Nicolai


On 12.10.2017 10:41, Nicolai Hähnle wrote:

Hi Monk,

Thanks for the summary. Most of it looks good to me, though I can't 
speak to all the kernel internals.


Just some comments:

On 12.10.2017 10:03, Liu, Monk wrote:

• For cs_submit() IOCTL:

1. Check whether the current ctx has been marked “*guilty*” and return 
“*ECANCELED*” if so.


2. Set job->*vram_lost_counter* from adev->*vram_lost_counter*, and 
return “*ECANCELED*” if ctx->*vram_lost_counter* != 
job->*vram_lost_counter* (Christian already submitted this patch)


a) Discussion: can we return “ENODEV” on a vram_lost_counter mismatch? 
That way UMD knows this context is under “device lost”


My plan for UMD is to always query the VRAM lost counter when any 
kind of context lost situation is detected. So cs_submit() should 
return an error in this situation, but it could just be ECANCELED. We 
don't need to distinguish between different types of errors here.



• Introduce a new IOCTL to let UMD query the latest 
adev->*vram_lost_counter*:


Christian already sent a patch for this.



• For amdgpu_ctx_query():

– Don't update ctx->reset_counter when querying this function, 
otherwise the query result is not consistent


Hmm. I misremembered part of the spec, see below.


– Set out->state.reset_status to “AMDGPU_CTX_GUILTY_RESET” if the ctx 
is “*guilty*”; no need to check “ctx->reset_counter”


Agreed.


– Set out->state.reset_status to “AMDGPU_CTX_INNOCENT_RESET” if the 
ctx isn't “guilty” && ctx->reset_counter != adev->reset_counter


I disagree. The meaning of AMDGPU_CTX_*_RESET should reflect the 
corresponding enums in user space APIs. I don't know how it works in 
Vulkan, but in OpenGL, returning GL_INNOCENT_CONTEXT_RESET_ARB means 
that the context was lost.


However, !guilty && ctx->reset_counter != adev->reset_counter does 
not imply that the context was lost.


The way I understand it, we should return AMDGPU_CTX_INNOCENT_RESET 
if !guilty && ctx->vram_lost_counter != adev->vram_lost_counter.


As far as I understand it, the case of !guilty && ctx->reset_counter 
!= adev->reset_counter && ctx->vram_lost_counter == 
adev->vram_lost_counter should return AMDGPU_CTX_NO_RESET, because a 
GPU reset occurred, but it didn't affect our context.


I unfortunately noticed another subtlety while re-reading the OpenGL 
spec. OpenGL says that the OpenGL context itself does *not* have to 
be re-created in order to recover from the reset. Re-creating all 
objects in the context is sufficient.


I believe this is the original motivation for why amdgpu_ctx_query() 
will reset the ctx->reset_counter.


For Mesa, it's still okay if the kernel keeps blocking submissions as 
we can just recreate the kernel context. But OrcaGL is also affected.


Does anybody know off-hand where the relevant parts of the Vulkan 
spec are? I didn't actually find anything in a quick search.



[snip]

For UMD behavior we still have something need to consider:

If MESA creates a new context from an old context 

Re: TDR and VRAM lost handling in KMD (v2)

2017-10-12 Thread Nicolai Hähnle

Hi Monk,

Thanks for the summary. Most of it looks good to me, though I can't 
speak to all the kernel internals.


Just some comments:

On 12.10.2017 10:03, Liu, Monk wrote:

• For cs_submit() IOCTL:

1. Check whether the current ctx has been marked “*guilty*” and return 
“*ECANCELED*” if so.


2. Set job->*vram_lost_counter* from adev->*vram_lost_counter*, and 
return “*ECANCELED*” if ctx->*vram_lost_counter* != 
job->*vram_lost_counter* (Christian already submitted this patch)


a) Discussion: can we return “ENODEV” on a vram_lost_counter mismatch? 
That way UMD knows this context is under “device lost”


My plan for UMD is to always query the VRAM lost counter when any kind 
of context lost situation is detected. So cs_submit() should return an 
error in this situation, but it could just be ECANCELED. We don't need 
to distinguish between different types of errors here.




• Introduce a new IOCTL to let UMD query the latest adev->*vram_lost_counter*:


Christian already sent a patch for this.



• For amdgpu_ctx_query():

– Don't update ctx->reset_counter when querying this function, otherwise 
the query result is not consistent


Hmm. I misremembered part of the spec, see below.


– Set out->state.reset_status to “AMDGPU_CTX_GUILTY_RESET” if the ctx is 
“*guilty*”; no need to check “ctx->reset_counter”


Agreed.


– Set out->state.reset_status to “AMDGPU_CTX_INNOCENT_RESET” if the ctx 
isn't “guilty” && ctx->reset_counter != adev->reset_counter


I disagree. The meaning of AMDGPU_CTX_*_RESET should reflect the 
corresponding enums in user space APIs. I don't know how it works in 
Vulkan, but in OpenGL, returning GL_INNOCENT_CONTEXT_RESET_ARB means 
that the context was lost.


However, !guilty && ctx->reset_counter != adev->reset_counter does not 
imply that the context was lost.


The way I understand it, we should return AMDGPU_CTX_INNOCENT_RESET if 
!guilty && ctx->vram_lost_counter != adev->vram_lost_counter.


As far as I understand it, the case of !guilty && ctx->reset_counter != 
adev->reset_counter && ctx->vram_lost_counter == adev->vram_lost_counter 
should return AMDGPU_CTX_NO_RESET, because a GPU reset occurred, but it 
didn't affect our context.


I unfortunately noticed another subtlety while re-reading the OpenGL 
spec. OpenGL says that the OpenGL context itself does *not* have to be 
re-created in order to recover from the reset. Re-creating all objects 
in the context is sufficient.


I believe this is the original motivation for why amdgpu_ctx_query() 
will reset the ctx->reset_counter.


For Mesa, it's still okay if the kernel keeps blocking submissions as we 
can just recreate the kernel context. But OrcaGL is also affected.


Does anybody know off-hand where the relevant parts of the Vulkan spec 
are? I didn't actually find anything in a quick search.



[snip]

For UMD behavior we still have something to consider:

If MESA creates a new context from an old context (share list? I'm not 
familiar with UMD; David Mao shall discuss it with Nicolai), the newly 
created context's vram_lost_counter


and reset_counter shall both be ported from that old context; otherwise 
CS_SUBMIT will not block it, which isn't correct


The kernel doesn't have to do anything for this, it is entirely the 
UMD's responsibility. All UMD needs from KMD is the function for 
querying the vram_lost_counter.
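A hypothetical sketch of that UMD-side recovery flow: on any lost-context 
error, query the current vram_lost_counter (via the new KMD IOCTL) and 
recreate the kernel context. All names below are illustrative, not the real 
libdrm API:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Stand-in for the value the new query IOCTL would return. */
static uint32_t kmd_vram_lost_counter = 2;
static uint32_t query_vram_lost_counter(void) { return kmd_vram_lost_counter; }

struct umd_ctx { uint32_t vram_lost_count; };

/* Returns true if the kernel context had to be recreated. */
static bool umd_recover_context(struct umd_ctx *ctx)
{
    uint32_t now = query_vram_lost_counter();

    if (ctx->vram_lost_count == now)
        return false;           /* no VRAM loss since creation */

    /* ...destroy and recreate the kernel context, re-upload objects... */
    ctx->vram_lost_count = now;
    return true;
}
```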


Cheers,
Nicolai




Need your feedback, thx

*From:*amd-gfx [mailto:amd-gfx-boun...@lists.freedesktop.org] *On Behalf 
Of *Liu, Monk

*Sent:* 2017年10月11日13:34
*To:* Koenig, Christian ; Haehnle, Nicolai 
; Olsak, Marek ; Deucher, 
Alexander 
*Cc:* Ramirez, Alejandro ; 
amd-gfx@lists.freedesktop.org; Filipas, Mario ; 
Ding, Pixel ; Li, Bingley ; 
Jiang, Jerry (SW) 

*Subject:* TDR and VRAM lost handling in KMD:

Hi Christian & Nicolai,

We need to reach agreement on what MESA/UMD should do and what KMD 
should do. *Please give your comments with **“okay” or “No” and your 
ideas on the items below.*


• When a job times out (set via the lockup_timeout kernel parameter), what 
KMD should do in the TDR routine:


1. Update adev->*gpu_reset_counter*, and stop the scheduler first 
(*gpu_reset_counter* is used to force a VM flush after GPU reset; out of 
this thread's scope, so no more discussion on it)


2. Set its fence error status to “*ETIME*”

3. Find the entity/ctx behind this job, and mark this ctx as “*guilty*”

4. Kick this job out of the scheduler's mirror list, so it won't get 
re-scheduled to the ring anymore.


5. Kick out all jobs in this “guilty” ctx's KFIFO queue, and set all their 
fence statuses to “*ECANCELED*”


*6.* Force-signal all fences that got kicked out by the above two 
steps, *otherwise UMD will block forever when waiting on those fences*


7. Do GPU reset, which can be some callbacks to let bare-metal 

Re: TDR and VRAM lost handling in KMD (v2)

2017-10-12 Thread Nicolai Hähnle

Hi David,

Agreed. I'd also like to use the first variant (UMD compares the current 
vram_lost count with the share list's vram_lost count) for Mesa.


Cheers,
Nicolai

On 12.10.2017 10:43, Mao, David wrote:

Thanks Monk for the summary!

Hi Nicolai,
In order to block a new context from referencing the old allocations, 
I think we need to do something in UMD so that KMD doesn't need to monitor 
the resource list.

I want to make sure we are on the same page.
If you agree, then there might be two options to do that in UMD (you 
can do whatever you want; I just want to elaborate the idea a little bit 
to facilitate the discussion):
-  If the share list is valid, the driver needs to compare the current 
vram_lost_count and the share list's vram_lost count; context creation will 
fail if the share list was created before the reset.
- Or, we can copy the vram_lost count from the share list, and the kernel will 
fail the submission if the vram_lost count is smaller than the current one.

I personally want to go with the first one for OrcaGL.

Thanks.
Best Regards,
David
-
On 12 Oct 2017, at 4:03 PM, Liu, Monk wrote:


V2 summary
Hi team
*please give your comments*
lWhen a job timed out (set from lockup_timeout kernel parameter), What 
KMD should do in TDR routine :

1.Update adev->gpu_reset_counter, and stop the scheduler first
2.Set its fence error status to "ECANCELED",
3.Find the context behind this job, and set this context as "guilty" (will 
have a new member field in the context structure – bool guilty)
a)There will be a "bool *guilty" in the entity structure, which points to 
its parent context's member – "bool guilty" – when the context is 
initialized, so no matter whether we have the context or the entity, we 
always know if it is "guilty"
b)For the kernel entity used for VM updates, there is no context behind 
it, so the kernel entity's "bool *guilty" is always NULL.
c)The idea of skipping the whole context is a consistency consideration: 
because we'll fake-signal the hanging job in job_run(), all jobs in its 
context shall be dropped, otherwise we get either bad drawing/computing 
results or more GPU hangs.

4.Do the GPU reset, which can be done via callbacks so that bare-metal and 
SR-IOV can implement it in their favored style
5.After reset, KMD needs to be aware whether VRAM loss happened or not; 
bare-metal can implement some function to judge, while for SR-IOV I 
prefer to read it from the GIM side (for the initial version we consider it's 
always VRAM lost, till the GIM side change is aligned)

6.If VRAM loss is hit, update adev->vram_lost_counter.
7.Do GTT recovery and shadow buffer recovery.
8.Re-schedule all jobs in the mirror list and restart the scheduler
lFor the GPU scheduler function --- job_run()
1.Before scheduling a job to the ring, check if job->vram_lost_counter == 
adev->vram_lost_counter, and drop this job on mismatch
2.Before scheduling a job to the ring, check if job->entity->guilty is 
NULL or not, and drop this job if (guilty != NULL && *guilty == TRUE)
3.if a job is dropped:
a)set the job's sched_fence status to "ECANCELED"
b)fake/force signal the job's hw fence (no need to set the hw fence's status)
lFor cs_wait() IOCTL:
After it finds the fence signaled, it should check if there is an error on 
this fence and return the error status of this fence

lFor cs_wait_fences() IOCTL:
Similar to the above approach
lFor cs_submit() IOCTL:
1.check if the current ctx has been marked "guilty" and return "ECANCELED" if so.
2.set job->vram_lost_counter from adev->vram_lost_counter, and return 
"ECANCELED" if ctx->vram_lost_counter != job->vram_lost_counter 
(Christian already submitted this patch)
a)discussion: can we return "ENODEV" if vram_lost_counter mismatches? 
that way UMD knows this context is under "device lost"
lIntroduce a new IOCTL to let UMD query the latest adev->vram_lost_counter:
lFor amdgpu_ctx_query():
nDon't update ctx->reset_counter when querying this function, 
otherwise the query result is not consistent
nSet out->state.reset_status to "AMDGPU_CTX_GUILTY_RESET" if the ctx 
is "guilty", no need to check "ctx->reset_counter"
nSet out->state.reset_status to "AMDGPU_CTX_INNOCENT_RESET" if the ctx 
isn't "guilty" && ctx->reset_counter != adev->reset_counter
nSet out->state.reset_status to "AMDGPU_CTX_NO_RESET" if 
ctx->reset_counter == adev->reset_counter
nSet out->state.flags to "AMDGPU_CTX_FLAG_VRAM_LOST" if 
ctx->vram_lost_counter != adev->vram_lost_counter
udiscussion: can we return "ENODEV" for amdgpu_ctx_query() if 
ctx->vram_lost_counter != adev->vram_lost_counter? that way UMD knows 
this context is under "device lost"
nUMD shall release this context if it is AMDGPU_CTX_GUILTY_RESET or 
its flags are "AMDGPU_CTX_FLAG_VRAM_LOST"

For UMD behavior we still have something to consider:
If MESA creates a new context from an old context (share list? I'm 
not familiar with UMD; David Mao shall have some discussion on it with 
Nicolai), the newly created context's vram_lost_counter
and reset_counter shall all be ported from that old context, 
otherwise CS_SUBMIT will 

Re: TDR and VRAM lost handling in KMD (v2)

2017-10-12 Thread Christian König
However, !guilty && ctx->reset_counter != adev->reset_counter does not 
imply that the context was lost.


The way I understand it, we should return AMDGPU_CTX_INNOCENT_RESET if 
!guilty && ctx->vram_lost_counter != adev->vram_lost_counter.


As far as I understand it, the case of !guilty && ctx->reset_counter 
!= adev->reset_counter && ctx->vram_lost_counter == 
adev->vram_lost_counter should return AMDGPU_CTX_NO_RESET, because a 
GPU reset occurred, but it didn't affect our context.

I disagree on that.

AMDGPU_CTX_INNOCENT_RESET just means what it does currently, there was a 
reset but we haven't been causing it.


That the OpenGL extension is specified otherwise is unfortunate, but I 
think we shouldn't use that for the kernel interface here.


Regards,
Christian.

Am 12.10.2017 um 10:44 schrieb Nicolai Hähnle:
I think we should stick to the plan where kernel contexts stay "stuck" 
after a GPU reset. This is the most robust behavior for the kernel.


Even if the OpenGL spec says that an OpenGL context can be re-used 
without destroying and re-creating it, the UMD can take care of 
re-creating the kernel context.


This means amdgpu_ctx_query should *not* reset ctx->reset_counter.

Cheers,
Nicolai


On 12.10.2017 10:41, Nicolai Hähnle wrote:

Hi Monk,

Thanks for the summary. Most of it looks good to me, though I can't 
speak to all the kernel internals.


Just some comments:

On 12.10.2017 10:03, Liu, Monk wrote:

lFor cs_submit() IOCTL:

1.check if the current ctx has been marked "guilty" and return 
"ECANCELED" if so.


2.set job->*vram_lost_counter* with adev->*vram_lost_counter*, and 
return “*ECANCELED*” if ctx->*vram_lost_counter* != 
job->*vram_lost_counter* (Christian already submitted this patch)


a)discussion: can we return “ENODEV” if vram_lost_counter mismatch ? 
that way UMD know this context is under “device lost”


My plan for UMD is to always query the VRAM lost counter when any 
kind of context lost situation is detected. So cs_submit() should 
return an error in this situation, but it could just be ECANCELED. We 
don't need to distinguish between different types of errors here.



lIntroduce a new IOCTL to let UMD query latest 
adev->*vram_lost_counter*:


Christian already sent a patch for this.



lFor amdgpu_ctx_query():

n*Don’t update ctx->reset_counter when querying this function, 
otherwise the query result is not consistent *


Hmm. I misremembered part of the spec, see below.


nSet out->state.reset_status to “AMDGPU_CTX_GUILTY_RESET” if the ctx 
is “*guilty*”, no need to check “ctx->reset_counter”


Agreed.


nSet out->state.reset_status to “AMDGPU_CTX_INNOCENT_RESET” *if the 
ctx isn’t “guilty” && ctx->reset_counter != adev->reset_counter *


I disagree. The meaning of AMDGPU_CTX_*_RESET should reflect the 
corresponding enums in user space APIs. I don't know how it works in 
Vulkan, but in OpenGL, returning GL_INNOCENT_CONTEXT_RESET_ARB means 
that the context was lost.


However, !guilty && ctx->reset_counter != adev->reset_counter does 
not imply that the context was lost.


The way I understand it, we should return AMDGPU_CTX_INNOCENT_RESET 
if !guilty && ctx->vram_lost_counter != adev->vram_lost_counter.


As far as I understand it, the case of !guilty && ctx->reset_counter 
!= adev->reset_counter && ctx->vram_lost_counter == 
adev->vram_lost_counter should return AMDGPU_CTX_NO_RESET, because a 
GPU reset occurred, but it didn't affect our context.


I unfortunately noticed another subtlety while re-reading the OpenGL 
spec. OpenGL says that the OpenGL context itself does *not* have to 
be re-created in order to recover from the reset. Re-creating all 
objects in the context is sufficient.


I believe this is the original motivation for why amdgpu_ctx_query() 
will reset the ctx->reset_counter.


For Mesa, it's still okay if the kernel keeps blocking submissions as 
we can just recreate the kernel context. But OrcaGL is also affected.


Does anybody know off-hand where the relevant parts of the Vulkan 
spec are? I didn't actually find anything in a quick search.



[snip]

For UMD behavior we still have something to consider:

If MESA creates a new context from an old context (share list? I'm 
not familiar with UMD; David Mao shall have some discussion on it with 
Nicolai), the newly created context's vram_lost_counter


and reset_counter shall all be ported from that old context, 
otherwise CS_SUBMIT will not block it, which isn't correct


The kernel doesn't have to do anything for this, it is entirely the 
UMD's responsibility. All UMD needs from KMD is the function for 
querying the vram_lost_counter.


Cheers,
Nicolai




Need your feedback, thx

*From:*amd-gfx [mailto:amd-gfx-boun...@lists.freedesktop.org] *On 
Behalf Of *Liu, Monk

*Sent:* 2017年10月11日13:34
*To:* Koenig, Christian ; Haehnle, Nicolai 
; Olsak, Marek ; 
Deucher, Alexander 
*Cc:* Ramirez, Alejandro 

Re: TDR and VRAM lost handling in KMD (v2)

2017-10-12 Thread Nicolai Hähnle
I think we should stick to the plan where kernel contexts stay "stuck" 
after a GPU reset. This is the most robust behavior for the kernel.


Even if the OpenGL spec says that an OpenGL context can be re-used 
without destroying and re-creating it, the UMD can take care of 
re-creating the kernel context.


This means amdgpu_ctx_query should *not* reset ctx->reset_counter.

Cheers,
Nicolai


On 12.10.2017 10:41, Nicolai Hähnle wrote:

Hi Monk,

Thanks for the summary. Most of it looks good to me, though I can't 
speak to all the kernel internals.


Just some comments:

On 12.10.2017 10:03, Liu, Monk wrote:

lFor cs_submit() IOCTL:

1.check if the current ctx has been marked "guilty" and return "ECANCELED" 
if so.


2.set job->*vram_lost_counter* with adev->*vram_lost_counter*, and 
return “*ECANCELED*” if ctx->*vram_lost_counter* != 
job->*vram_lost_counter* (Christian already submitted this patch)


a)discussion: can we return “ENODEV” if vram_lost_counter mismatch ? 
that way UMD know this context is under “device lost”


My plan for UMD is to always query the VRAM lost counter when any kind 
of context lost situation is detected. So cs_submit() should return an 
error in this situation, but it could just be ECANCELED. We don't need 
to distinguish between different types of errors here.




lIntroduce a new IOCTL to let UMD query latest adev->*vram_lost_counter*:


Christian already sent a patch for this.



lFor amdgpu_ctx_query():

n*Don’t update ctx->reset_counter when querying this function, 
otherwise the query result is not consistent *


Hmm. I misremembered part of the spec, see below.


nSet out->state.reset_status to “AMDGPU_CTX_GUILTY_RESET” if the ctx 
is “*guilty*”, no need to check “ctx->reset_counter”


Agreed.


nSet out->state.reset_status to “AMDGPU_CTX_INNOCENT_RESET” *if the 
ctx isn’t “guilty” && ctx->reset_counter != adev->reset_counter *


I disagree. The meaning of AMDGPU_CTX_*_RESET should reflect the 
corresponding enums in user space APIs. I don't know how it works in 
Vulkan, but in OpenGL, returning GL_INNOCENT_CONTEXT_RESET_ARB means 
that the context was lost.


However, !guilty && ctx->reset_counter != adev->reset_counter does not 
imply that the context was lost.


The way I understand it, we should return AMDGPU_CTX_INNOCENT_RESET if 
!guilty && ctx->vram_lost_counter != adev->vram_lost_counter.


As far as I understand it, the case of !guilty && ctx->reset_counter != 
adev->reset_counter && ctx->vram_lost_counter == adev->vram_lost_counter 
should return AMDGPU_CTX_NO_RESET, because a GPU reset occurred, but it 
didn't affect our context.


I unfortunately noticed another subtlety while re-reading the OpenGL 
spec. OpenGL says that the OpenGL context itself does *not* have to be 
re-created in order to recover from the reset. Re-creating all objects 
in the context is sufficient.


I believe this is the original motivation for why amdgpu_ctx_query() 
will reset the ctx->reset_counter.


For Mesa, it's still okay if the kernel keeps blocking submissions as we 
can just recreate the kernel context. But OrcaGL is also affected.


Does anybody know off-hand where the relevant parts of the Vulkan spec 
are? I didn't actually find anything in a quick search.



[snip]

For UMD behavior we still have something to consider:

If MESA creates a new context from an old context (share list? I'm 
not familiar with UMD; David Mao shall have some discussion on it with 
Nicolai), the newly created context's vram_lost_counter


and reset_counter shall all be ported from that old context, 
otherwise CS_SUBMIT will not block it, which isn't correct


The kernel doesn't have to do anything for this, it is entirely the 
UMD's responsibility. All UMD needs from KMD is the function for 
querying the vram_lost_counter.


Cheers,
Nicolai




Need your feedback, thx

*From:*amd-gfx [mailto:amd-gfx-boun...@lists.freedesktop.org] *On 
Behalf Of *Liu, Monk

*Sent:* 2017年10月11日13:34
*To:* Koenig, Christian ; Haehnle, Nicolai 
; Olsak, Marek ; 
Deucher, Alexander 
*Cc:* Ramirez, Alejandro ; 
amd-gfx@lists.freedesktop.org; Filipas, Mario ; 
Ding, Pixel ; Li, Bingley ; 
Jiang, Jerry (SW) 

*Subject:* TDR and VRAM lost handling in KMD:

Hi Christian & Nicolai,

We need to achieve some agreements on what should MESA/UMD do and what 
should KMD do, *please give your comments with **“okay”or “No”and your 
idea on below items,*


lWhen a job timed out (set from lockup_timeout kernel parameter), What 
KMD should do in TDR routine :


1.Update adev->*gpu_reset_counter*, and stop scheduler first, 
(*gpu_reset_counter* is used to force vm flush after GPU reset, out of 
this thread’s scope so no more discussion on it)


2.Set its fence error status to “*ETIME*”,

3.Find the entity/ctx behind this job, and 

Re: TDR and VRAM lost handling in KMD (v2)

2017-10-12 Thread Christian König

Yeah, that sounds like a plan to me.

Going to commit the patches I've already done in a minute, they still 
seem to implement quite a bit of what we want to do here.


Regards,
Christian.

Am 12.10.2017 um 10:03 schrieb Liu, Monk:


V2 summary

Hi team

*please give your comments*

lWhen a job timed out (set from lockup_timeout kernel parameter), What 
KMD should do in TDR routine :


1.Update adev->*gpu_reset_counter*, and stop scheduler first

2.Set its fence error status to “*ECANCELED*”,

3.Find the *context* behind this job, and set this *context* as 
“*guilty*”(will have a new member field in context structure – *bool 
guilty*)


a)There will be a "*bool *guilty*" in the entity structure, which points to 
its parent context's member – "*bool guilty*" – when the context is 
initialized, so no matter whether we have the context or the entity, we 
always know if it is "guilty"


b)For the kernel entity used for VM updates, there is no context behind 
it, so the kernel entity's "bool *guilty" is always NULL.


c)The idea of skipping the whole context is a consistency consideration: 
because we'll fake-signal the hanging job in job_run(), all jobs in 
its context shall be dropped, otherwise we get either bad drawing/computing 
results or more GPU hangs.



4.Do the GPU reset, which can be done via callbacks so that bare-metal and 
SR-IOV can implement it in their favored style


5.After reset, KMD needs to be aware whether VRAM loss happened or not; 
bare-metal can implement some function to judge, while for SR-IOV I 
prefer to read it from the GIM side (for the initial version we consider it's 
always VRAM lost, till the GIM side change is aligned)


6.If VRAM lost hit, update adev->*vram_lost_counter*.

7.Do GTT recovery and shadow buffer recovery.

8.Re-schedule all JOBs in mirror list and restart scheduler

lFor GPU scheduler function --- job_run()

1.Before scheduling a job to the ring, check if job->vram_lost_counter == 
adev->vram_lost_counter, and drop this job on mismatch


2.Before scheduling a job to the ring, check if job->entity->guilty is 
NULL or not, and drop this job if (guilty != NULL && *guilty == TRUE)


3.if a job is dropped:

a)set job’s sched_fence status to “*ECANCELED*”

b)fake/force signal job’s hw fence (no need to set hw fence’s status)

lFor cs_wait() IOCTL:

After it finds the fence signaled, it should check if there is an error on 
this fence and return the error status of this fence


lFor cs_wait_fences() IOCTL:

Similar to the above approach

lFor cs_submit() IOCTL:

1.check if the current ctx has been marked "guilty" and return "ECANCELED" 
if so.


2.set job->vram_lost_counter from adev->vram_lost_counter, and 
return "ECANCELED" if ctx->vram_lost_counter != 
job->vram_lost_counter (Christian already submitted this patch)


a)discussion: can we return "ENODEV" if vram_lost_counter mismatches? 
that way UMD knows this context is under "device lost"


lIntroduce a new IOCTL to let UMD query latest adev->*vram_lost_counter*:

lFor amdgpu_ctx_query():

n*Don’t update ctx->reset_counter when querying this function, 
otherwise the query result is not consistent *


nSet out->state.reset_status to “AMDGPU_CTX_GUILTY_RESET” if the ctx 
is “*guilty*”, no need to check “ctx->reset_counter”


nSet out->state.reset_status to “AMDGPU_CTX_INNOCENT_RESET” *if the 
ctx isn’t “guilty” && ctx->reset_counter != adev->reset_counter *


nSet out->state.reset_status to “AMDGPU_CTX_NO_RESET” if 
ctx->reset_counter == adev->reset_counter


nSet out->state.flags to “AMDGPU_CTX_FLAG_VRAM_LOST” if 
ctx->vram_lost_counter != adev->vram_lost_counter


udiscussion: can we return “ENODEV” for amdgpu_ctx_query() if 
ctx->vram_lost_counter != adev->vram_lost_counter ? that way UMD know 
this context is under “device lost”


nUMD shall release this context if it is AMDGPU_CTX_GUILTY_RESET or 
its flags are "AMDGPU_CTX_FLAG_VRAM_LOST"


For UMD behavior we still have something to consider:

If MESA creates a new context from an old context (share list? I'm 
not familiar with UMD; David Mao shall have some discussion on it with 
Nicolai), the newly created context's vram_lost_counter


and reset_counter shall all be ported from that old context, 
otherwise CS_SUBMIT will not block it, which isn't correct


Need your feedback, thx

*From:*amd-gfx [mailto:amd-gfx-boun...@lists.freedesktop.org] *On 
Behalf Of *Liu, Monk

*Sent:* 2017年10月11日13:34
*To:* Koenig, Christian ; Haehnle, Nicolai 
; Olsak, Marek ; 
Deucher, Alexander 
*Cc:* Ramirez, Alejandro ; 
amd-gfx@lists.freedesktop.org; Filipas, Mario ; 
Ding, Pixel ; Li, Bingley ; 
Jiang, Jerry (SW) 

*Subject:* TDR and VRAM lost handling in KMD:

Hi Christian & Nicolai,

We need to achieve some agreements on what should MESA/UMD do and what 
should KMD do, *please give your comments with **“okay”or “No”and your 
idea on below items,*

Re: [PATCH] drm/amd/pp: add new sysfs pp_alloc_mem_for_smu

2017-10-12 Thread Christian König

Am 12.10.2017 um 05:48 schrieb Alex Deucher:

On Wed, Oct 11, 2017 at 7:28 AM, Rex Zhu  wrote:

Change-Id: Ie06f87445e7d6945472d88ac976693c98d96cd43
Signed-off-by: Rex Zhu 

Please add a better patch description.  Something like:
Add a sysfs interface to allocate a smu logging buffer on the fly.

Additionally, wouldn't this be better in debugfs rather than sysfs?
It's not really a user configuration option per se.  It's really only
there for debugging power stuff.  That said, I'm not sure how tricky
it would be to add it to the existing amdgpu drm debugfs files since
those are read only at the moment and that would be the most logical
place for it.  I guess calling debugfs_create_file and using
adev->ddev->primary->debugfs_root should work.


We also already have writeable debugfs files. Just take a look at how 
the register or VRAM accessors work.


Should be trivial to move over if you have sysfs already working.

Regards,
Christian.





---
  drivers/gpu/drm/amd/amdgpu/amdgpu.h|   3 +
  drivers/gpu/drm/amd/amdgpu/amdgpu_device.c |   4 +
  drivers/gpu/drm/amd/amdgpu/amdgpu_pm.c | 134 +
  3 files changed, 141 insertions(+)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h 
b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
index da48f97..9d2f095 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
@@ -1636,6 +1636,9 @@ struct amdgpu_device {
 bool has_hw_reset;
 u8  reset_magic[AMDGPU_RESET_MAGIC_NUM];

+   struct amdgpu_bo*smu_prv_buffer;
+   u32 smu_prv_buffer_size;


Please put these in the amdgpu_pm or amd_powerplay structures instead
since they are power related.


+
 /* record last mm index being written through WREG32*/
 unsigned long last_mm_index;
 boolin_sriov_reset;
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index fdfe418..c9e3019 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -2430,6 +2430,10 @@ void amdgpu_device_fini(struct amdgpu_device *adev)
 release_firmware(adev->firmware.gpu_info_fw);
 adev->firmware.gpu_info_fw = NULL;
 }
+   if (adev->smu_prv_buffer) {
+   amdgpu_bo_free_kernel(&adev->smu_prv_buffer, NULL, NULL);
+   adev->smu_prv_buffer = NULL;
+   }
 adev->accel_working = false;
 cancel_delayed_work_sync(>late_init_work);
 /* free i2c buses */
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_pm.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_pm.c
index f3afa66..84e67fb 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_pm.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_pm.c
@@ -737,6 +737,129 @@ static ssize_t amdgpu_set_pp_compute_power_profile(struct 
device *dev,
 return amdgpu_set_pp_power_profile(dev, buf, count, );
  }

+static ssize_t amdgpu_get_pp_alloc_mem_for_smu(struct device *dev,
+   struct device_attribute *attr,
+   char *buf)
+{
+   struct drm_device *ddev = dev_get_drvdata(dev);
+   struct amdgpu_device *adev = ddev->dev_private;
+   struct sysinfo si;
+   bool is_os_64 = (sizeof(void *) == 8) ? true : false;
+   uint64_t total_memory;
+   uint64_t dram_size_seven_GB = 0x1B800;
+   uint64_t dram_size_three_GB = 0xB800;
+   u32 buf_len = 0;
+   u32 gart_size;
+
+   if (!is_os_64) {
+   adev->smu_prv_buffer_size = 0;
+   buf_len += snprintf(buf + buf_len, PAGE_SIZE,
+   "Feature only supported on 64-bit 
OS\n");
+   return buf_len;
+   }
+
+   si_meminfo(&si);
+   total_memory = (uint64_t)si.totalram * si.mem_unit;
+
+   if (total_memory < dram_size_three_GB)  {
+   adev->smu_prv_buffer_size = 0;
+   buf_len += snprintf(buf + buf_len, PAGE_SIZE,
+   "System memory not enough, Feature not 
enabled\n");
+   return buf_len;
+   }
+
+   if (total_memory < dram_size_seven_GB)
+   adev->smu_prv_buffer_size = 512;
+   else
+   adev->smu_prv_buffer_size = 2048;
+
+   buf_len += snprintf(buf + buf_len, PAGE_SIZE,
+   "Max support: %d Mb",
+adev->smu_prv_buffer_size);
+
+   gart_size = adev->mc.gart_size >> 20;
+   if (amdgpu_gart_size == -1)
+   buf_len += snprintf(buf + buf_len, PAGE_SIZE,
+   "  (Need to set gartsize more than %d Mb)\n",
+adev->smu_prv_buffer_size + gart_size);
+   else
+   buf_len += snprintf(buf + buf_len, PAGE_SIZE,
+   "\n");
+

Why do we need all this logic?  just return the 

Re: [PATCH v2 2/2] drm/amdgpu: Add amdgpu_find_mm_node()

2017-10-12 Thread Christian König

Am 11.10.2017 um 21:49 schrieb Harish Kasiviswanathan:

v2: Use amdgpu_find_mm_node() in amdgpu_ttm_io_mem_pfn()

Change-Id: I12231e18bb60152843cd0e0213ddd0d0e04e7497
Signed-off-by: Harish Kasiviswanathan 


Reviewed-by: Christian König 


---
  drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c | 49 ++---
  1 file changed, 27 insertions(+), 22 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
index c5e1e35..99b1971 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
@@ -290,7 +290,24 @@ static uint64_t amdgpu_mm_node_addr(struct 
ttm_buffer_object *bo,
  }
  
  /**

- * amdgpu_ttm_copy_mem_to_mem - Helper function for copy
+ * amdgpu_find_mm_node - Helper function finds the drm_mm_node
+ *  corresponding to @offset. It also modifies the offset to be
+ *  within the drm_mm_node returned
+ */
+static struct drm_mm_node *amdgpu_find_mm_node(struct ttm_mem_reg *mem,
+  unsigned long *offset)
+{
+   struct drm_mm_node *mm_node = mem->mm_node;
+
+   while (*offset >= (mm_node->size << PAGE_SHIFT)) {
+   *offset -= (mm_node->size << PAGE_SHIFT);
+   ++mm_node;
+   }
+   return mm_node;
+}
+
+/**
+ * amdgpu_copy_ttm_mem_to_mem - Helper function for copy
   *
   * The function copies @size bytes from {src->mem + src->offset} to
   * {dst->mem + dst->offset}. src->bo and dst->bo could be same BO for a
@@ -319,21 +336,13 @@ int amdgpu_ttm_copy_mem_to_mem(struct amdgpu_device *adev,
return -EINVAL;
}
  
-	src_mm = src->mem->mm_node;

-   while (src->offset >= (src_mm->size << PAGE_SHIFT)) {
-   src->offset -= (src_mm->size << PAGE_SHIFT);
-   ++src_mm;
-   }
+   src_mm = amdgpu_find_mm_node(src->mem, &src->offset);
src_node_start = amdgpu_mm_node_addr(src->bo, src_mm, src->mem) +
 src->offset;
src_node_size = (src_mm->size << PAGE_SHIFT) - src->offset;
src_page_offset = src_node_start & (PAGE_SIZE - 1);
  
-	dst_mm = dst->mem->mm_node;

-   while (dst->offset >= (dst_mm->size << PAGE_SHIFT)) {
-   dst->offset -= (dst_mm->size << PAGE_SHIFT);
-   ++dst_mm;
-   }
+   dst_mm = amdgpu_find_mm_node(dst->mem, &dst->offset);
dst_node_start = amdgpu_mm_node_addr(dst->bo, dst_mm, dst->mem) +
 dst->offset;
dst_node_size = (dst_mm->size << PAGE_SHIFT) - dst->offset;
@@ -653,13 +662,12 @@ static void amdgpu_ttm_io_mem_free(struct ttm_bo_device 
*bdev, struct ttm_mem_re
  static unsigned long amdgpu_ttm_io_mem_pfn(struct ttm_buffer_object *bo,
   unsigned long page_offset)
  {
-   struct drm_mm_node *mm = bo->mem.mm_node;
-   uint64_t size = mm->size;
-   uint64_t offset = page_offset;
+   struct drm_mm_node *mm;
+   unsigned long offset = (page_offset << PAGE_SHIFT);
  
-	page_offset = do_div(offset, size);

-   mm += offset;
-   return (bo->mem.bus.base >> PAGE_SHIFT) + mm->start + page_offset;
+   mm = amdgpu_find_mm_node(&bo->mem, &offset);
+   return (bo->mem.bus.base >> PAGE_SHIFT) + mm->start +
+   (offset >> PAGE_SHIFT);
  }
  
  /*

@@ -1216,7 +1224,7 @@ static int amdgpu_ttm_access_memory(struct 
ttm_buffer_object *bo,
  {
struct amdgpu_bo *abo = container_of(bo, struct amdgpu_bo, tbo);
struct amdgpu_device *adev = amdgpu_ttm_adev(abo->tbo.bdev);
-   struct drm_mm_node *nodes = abo->tbo.mem.mm_node;
+   struct drm_mm_node *nodes;
uint32_t value = 0;
int ret = 0;
uint64_t pos;
@@ -1225,10 +1233,7 @@ static int amdgpu_ttm_access_memory(struct 
ttm_buffer_object *bo,
if (bo->mem.mem_type != TTM_PL_VRAM)
return -EIO;
  
-	while (offset >= (nodes->size << PAGE_SHIFT)) {

-   offset -= nodes->size << PAGE_SHIFT;
-   ++nodes;
-   }
+   nodes = amdgpu_find_mm_node(&abo->tbo.mem, &offset);
pos = (nodes->start << PAGE_SHIFT) + offset;
  
  	while (len && pos < adev->mc.mc_vram_size) {



___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx


Re: [PATCH 5/5] drm/amd/sched: signal and free remaining fences in amd_sched_entity_fini

2017-10-12 Thread Christian König

Am 11.10.2017 um 18:30 schrieb Michel Dänzer:

On 28/09/17 04:55 PM, Nicolai Hähnle wrote:

From: Nicolai Hähnle 

Highly concurrent Piglit runs can trigger a race condition where a pending
SDMA job on a buffer object is never executed because the corresponding
process is killed (perhaps due to a crash). Since the job's fences were
never signaled, the buffer object was effectively leaked. Worse, the
buffer was stuck wherever it happened to be at the time, possibly in VRAM.

The symptom was user space processes stuck in interruptible waits with
kernel stacks like:

 [] dma_fence_default_wait+0x112/0x250
 [] dma_fence_wait_timeout+0x39/0xf0
 [] reservation_object_wait_timeout_rcu+0x1c2/0x300
 [] ttm_bo_cleanup_refs_and_unlock+0xff/0x1a0 [ttm]
 [] ttm_mem_evict_first+0xba/0x1a0 [ttm]
 [] ttm_bo_mem_space+0x341/0x4c0 [ttm]
 [] ttm_bo_validate+0xd4/0x150 [ttm]
 [] ttm_bo_init_reserved+0x2ed/0x420 [ttm]
 [] amdgpu_bo_create_restricted+0x1f3/0x470 [amdgpu]
 [] amdgpu_bo_create+0xda/0x220 [amdgpu]
 [] amdgpu_gem_object_create+0xaa/0x140 [amdgpu]
 [] amdgpu_gem_create_ioctl+0x97/0x120 [amdgpu]
 [] drm_ioctl+0x1fa/0x480 [drm]
 [] amdgpu_drm_ioctl+0x4f/0x90 [amdgpu]
 [] do_vfs_ioctl+0xa3/0x5f0
 [] SyS_ioctl+0x79/0x90
 [] entry_SYSCALL_64_fastpath+0x1e/0xad
 [] 0x

Signed-off-by: Nicolai Hähnle 
Acked-by: Christian König 

Since Christian's commit which introduced the problem (6af0883ed977
"drm/amdgpu: discard commands of killed processes") is in 4.14, we need
a solution for that. Should we backport Nicolai's five commits fixing
the problem, or revert 6af0883ed977?


While looking into this, I noticed that the following commits by
Christian in 4.14 each also cause hangs for me when running the piglit
gpu profile on Tonga:

457e0fee04b0 "drm/amdgpu: remove the GART copy hack"
1d00402b4da2 "drm/amdgpu: fix amdgpu_ttm_bind"

Are there fixes for these that can be backported to 4.14, or do they
need to be reverted there?

Well, I'm not aware that either of those two could cause problems.

For "drm/amdgpu: remove the GART copy hack" I also don't have the 
slightest idea how that could be an issue. It just removes an unused 
code path.


Is amd-staging-drm-next stable for you?

Thanks,
Christian.
___
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx


TDR and VRAM lost handling in KMD (v2)

2017-10-12 Thread Liu, Monk
V2 summary

Hi team

please give your comments


l  When a job timed out (set from lockup_timeout kernel parameter), What KMD 
should do in TDR routine :


1.Update adev->gpu_reset_counter, and stop scheduler first

2.Set its fence error status to “ECANCELED”,

3.Find the context behind this job, and set this context as "guilty" 
(will have a new member field in the context structure – bool guilty)

a)   There will be a "bool *guilty" in the entity structure, which points to its 
parent context's member – "bool guilty" – when the context is initialized, so no 
matter whether we have the context or the entity, we always know if it is "guilty"

b)   For the kernel entity used for VM updates, there is no context behind 
it, so the kernel entity's "bool *guilty" is always NULL.

c)   The idea of skipping the whole context is a consistency consideration: 
because we'll fake-signal the hanging job in job_run(), so all jobs in its context 
shall be dropped, otherwise we get either bad drawing/computing results or more GPU 
hangs.



4. Do the GPU reset; this can be a set of callbacks so bare-metal and SR-IOV can
each implement it in their preferred style

5. After the reset, KMD needs to know whether VRAM loss happened. Bare-metal can
implement a function to judge this, while for SR-IOV I prefer to read it from the
GIM side (for the initial version we assume VRAM is always lost, until the GIM
side change is aligned)

6. If VRAM loss is detected, update adev->vram_lost_counter.

7. Do GTT recovery and shadow-buffer recovery.

8. Re-schedule all jobs in the mirror list and restart the scheduler


* For the GPU scheduler function job_run():

1. Before scheduling a job to the ring, check whether job->vram_lost_counter ==
adev->vram_lost_counter, and drop the job on mismatch

2. Before scheduling a job to the ring, check whether job->entity->guilty is NULL,
and drop the job if (guilty != NULL && *guilty == true)

3. If a job is dropped:

a) set the job's sched_fence status to "ECANCELED"

b) fake/force-signal the job's HW fence (no need to set the HW fence's status)


* For the cs_wait() IOCTL:
After it finds the fence signaled, it should check whether there is an error on
this fence and return that fence's error status


* For the cs_wait_fences() IOCTL:
Similar to the approach above


* For the cs_submit() IOCTL:

1. Check whether the current ctx has been marked "guilty", and return "ECANCELED"
if so.

2. Set job->vram_lost_counter from adev->vram_lost_counter, and return
"ECANCELED" if ctx->vram_lost_counter != job->vram_lost_counter (Christian
already submitted this patch)

a) Discussion: can we return "ENODEV" if the vram_lost_counter mismatches?
That way UMD knows this context is under "device lost"


* Introduce a new IOCTL to let UMD query the latest adev->vram_lost_counter.


* For amdgpu_ctx_query():

- Don't update ctx->reset_counter inside this query, otherwise the query result
is not consistent

- Set out->state.reset_status to "AMDGPU_CTX_GUILTY_RESET" if the ctx is
"guilty"; no need to check "ctx->reset_counter"

- Set out->state.reset_status to "AMDGPU_CTX_INNOCENT_RESET" if the ctx isn't
"guilty" && ctx->reset_counter != adev->reset_counter

- Set out->state.reset_status to "AMDGPU_CTX_NO_RESET" if ctx->reset_counter
== adev->reset_counter

- Set out->state.flags to "AMDGPU_CTX_FLAG_VRAM_LOST" if
ctx->vram_lost_counter != adev->vram_lost_counter

  - Discussion: can we return "ENODEV" from amdgpu_ctx_query() if
ctx->vram_lost_counter != adev->vram_lost_counter? That way UMD knows this
context is under "device lost"

- UMD shall release this context if its status is AMDGPU_CTX_GUILTY_RESET or its
flags contain "AMDGPU_CTX_FLAG_VRAM_LOST"


For UMD behavior we still have something to consider:
If MESA creates a new context from an old context (share list? I'm not familiar
with UMD; David Mao should discuss it with Nicolai), the newly created context's
vram_lost_counter and reset_counter shall all be inherited from that old context,
otherwise CS_SUBMIT will not block it, which isn't correct



Need your feedback, thx


From: amd-gfx [mailto:amd-gfx-boun...@lists.freedesktop.org] On Behalf Of Liu, 
Monk
Sent: 2017年10月11日 13:34
To: Koenig, Christian ; Haehnle, Nicolai 
; Olsak, Marek ; Deucher, 
Alexander 
Cc: Ramirez, Alejandro ; 
amd-gfx@lists.freedesktop.org; Filipas, Mario ; Ding, 
Pixel ; Li, Bingley ; Jiang, Jerry (SW) 

Subject: TDR and VRAM lost handling in KMD:

Hi Christian & Nicolai,

We need to reach agreement on what MESA/UMD should do and what KMD should do.
Please give your comments with "okay" or "no" and your ideas on the items below.


* When a job times out (set via the lockup_timeout kernel parameter), what KMD
should do in the TDR routine:


1.Update 

Re: [PATCH] drm/amd/display: Add comment for NaN checks in DCN calcs

2017-10-12 Thread Nils Wallménius
Hi Harry, couldn't a simple isnan() macro be used? Also a question below.

Den 11 okt. 2017 17:01 skrev "Harry Wentland" :

This is confusing as-is and really needs a comment.

Signed-off-by: Harry Wentland 
---
 drivers/gpu/drm/amd/display/dc/calcs/dcn_calc_math.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/drivers/gpu/drm/amd/display/dc/calcs/dcn_calc_math.c
b/drivers/gpu/drm/amd/display/dc/calcs/dcn_calc_math.c
index b6abe0f3bb15..f95dc4ff9a23 100644
--- a/drivers/gpu/drm/amd/display/dc/calcs/dcn_calc_math.c
+++ b/drivers/gpu/drm/amd/display/dc/calcs/dcn_calc_math.c
@@ -27,6 +27,7 @@

 float dcn_bw_mod(const float arg1, const float arg2)
 {
+   /* NaN checks */
if (arg1 != arg1)
return arg2;
if (arg2 != arg2)
@@ -36,6 +37,7 @@ float dcn_bw_mod(const float arg1, const float arg2)

 float dcn_bw_min2(const float arg1, const float arg2)
 {
+   /* NaN checks */
if (arg1 != arg1)
return arg2;
if (arg2 != arg2)
@@ -45,6 +47,7 @@ float dcn_bw_min2(const float arg1, const float arg2)

 unsigned int dcn_bw_max(const unsigned int arg1, const unsigned int arg2)
 {
+   /* NaN checks */


How can the int arguments be NaN?

BR
Nils

if (arg1 != arg1)
return arg2;
if (arg2 != arg2)
@@ -53,6 +56,7 @@ unsigned int dcn_bw_max(const unsigned int arg1, const
unsigned int arg2)
 }
 float dcn_bw_max2(const float arg1, const float arg2)
 {
+   /* NaN checks */
if (arg1 != arg1)
return arg2;
if (arg2 != arg2)
--
2.14.1



Re: [PATCH] drm/amdgpu: Fix extra call to amdgpu_ctx_put.

2017-10-12 Thread Christian König

Am 11.10.2017 um 23:13 schrieb Andrey Grodzovsky:

In amdgpu_cs_parser_init() in case of error handling
amdgpu_ctx_put() is called without setting p->ctx to NULL after that,
later amdgpu_cs_parser_fini() also calls amdgpu_ctx_put() again and
mess up the reference count.

Signed-off-by: Andrey Grodzovsky 


Reviewed-by: Christian König 


---
  drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c | 6 ++----
  1 file changed, 2 insertions(+), 4 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
index 5de092e..8513e44 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
@@ -97,7 +97,7 @@ static int amdgpu_cs_parser_init(struct amdgpu_cs_parser *p, 
void *data)
if (copy_from_user(chunk_array, chunk_array_user,
   sizeof(uint64_t)*cs->in.num_chunks)) {
ret = -EFAULT;
-   goto put_ctx;
+   goto free_chunk;
}
  
  	p->nchunks = cs->in.num_chunks;

@@ -105,7 +105,7 @@ static int amdgpu_cs_parser_init(struct amdgpu_cs_parser 
*p, void *data)
GFP_KERNEL);
if (!p->chunks) {
ret = -ENOMEM;
-   goto put_ctx;
+   goto free_chunk;
}
  
  	for (i = 0; i < p->nchunks; i++) {

@@ -185,8 +185,6 @@ static int amdgpu_cs_parser_init(struct amdgpu_cs_parser 
*p, void *data)
kfree(p->chunks);
p->chunks = NULL;
p->nchunks = 0;
-put_ctx:
-   amdgpu_ctx_put(p->ctx);
  free_chunk:
kfree(chunk_array);
  





Re: [PATCH] amdgpu/dc: Use DRM new-style object iterators.

2017-10-12 Thread Maarten Lankhorst
Op 11-10-17 om 22:40 schreef Harry Wentland:
> On 2017-10-11 03:46 PM, Maarten Lankhorst wrote:
>> Op 11-10-17 om 20:55 schreef Leo:
>>>
>>> On 2017-10-11 10:30 AM, Maarten Lankhorst wrote:
 Op 11-10-17 om 16:24 schreef sunpeng...@amd.com:
> From: "Leo (Sunpeng) Li" 
>
> Use the correct for_each_new/old_* iterators instead of for_each_*
>
> List of affected functions:
>
> amdgpu_dm_find_first_crtc_matching_connector: use for_each_new
>  - Old from_state_var flag was always choosing the new state
>
> amdgpu_dm_display_resume: use for_each_new
>  - drm_atomic_helper_duplicate_state is called during suspend to
>cache the state
>  - It sets 'state' within the state triplet to 'new_state'
>
> amdgpu_dm_commit_planes: use for_each_old
>  - Called after the state was swapped (via atomic commit tail)
>
> amdgpu_dm_atomic_commit: use for_each_new
>  - Called before the state is swapped
>
> amdgpu_dm_atomic_commit_tail: use for_each_old
>  - Called after the state was swapped
>
> dm_update_crtcs_state: use for_each_new
>  - Called before the state is swapped (via atomic check)
>
> amdgpu_dm_atomic_check: use for_each_new
>  - Called before the state is swapped
>
> Also fix some typos.
>
> crct -> crtc
> undersacn -> underscan
>
> Signed-off-by: Leo (Sunpeng) Li 
> ---
>
> Hi Dave,
>
> This patch goes on top of your fixup for new api's patch. Please feel
> free to squash them.
>
> Thanks,
> Leo
>
>   drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c | 38 
> +--
>   drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.h |  5 ++-
>   2 files changed, 16 insertions(+), 27 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c 
> b/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
> index 9bfe1f9..9394558 100644
> --- a/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
> +++ b/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c
> @@ -568,25 +568,17 @@ static int dm_suspend(void *handle)
>   return ret;
>   }
>   -struct amdgpu_dm_connector 
> *amdgpu_dm_find_first_crct_matching_connector(
> +struct amdgpu_dm_connector *amdgpu_dm_find_first_crtc_matching_connector(
>   struct drm_atomic_state *state,
> -struct drm_crtc *crtc,
> -bool from_state_var)
> +struct drm_crtc *crtc)
>   {
>   uint32_t i;
>   struct drm_connector_state *conn_state;
>   struct drm_connector *connector;
>   struct drm_crtc *crtc_from_state;
>   -for_each_new_connector_in_state(
> -state,
> -connector,
> -conn_state,
> -i) {
> -crtc_from_state =
> -from_state_var ?
> -conn_state->crtc :
> -connector->state->crtc;
> +for_each_new_connector_in_state(state, connector, conn_state, i) {
> +crtc_from_state = conn_state->crtc;
 Please assign crtc_from_state here with 
 drm_atomic_get_(new/old)_crtc_state.
>>> We're trying to fetch a drm_crtc from a drm_connector_state, I don't think 
>>> the state getters are applicable here.
>>>
>>> Also, the function should really be named 
>>> 'find_first_connector_matching_crtc'. I'll fix that.
>> Oh I misunderstood. But in general dereferencing $obj->state should be frowned 
>> upon. If you ever want to support atomic flip depth > 1,
>> all those references should be gone from your driver, and replaced with 
>> get_old/new_state..
> If I understand correctly this is the forward-looking way of how we get the
> state now? I still see plenty of $obj->state in i915 and other drivers.
>
> Any objections to doing this as a follow-up patch?
>
> What's atomic flip depth > 1?

That we allow multiple uncompleted nonblocking atomic commits to the same crtc,
which requires that during commit, $obj->state is not the same as
new_$obj_state any more. It can be set to an even newer state, but the current
commit has to set the state that is in state->$obj[i].new_state. This can be
obtained with either the iterators, or drm_atomic_get_new_$obj_state.

This is why we got rid of the old iterators; get_existing_state and $obj->state
were no longer sufficient for this.

Using drm_atomic_get_old/new_obj_state should be a separate patch, but can be
fixed opportunistically.

But something like

for_each_old_crtc_in_state(state, crtc, old_crtc_state, i) {
	new_crtc_state = crtc->state;
	...
}

should definitely be fixed in this patch. It's what the iterators are for. :)

I know i915 is still dereferencing $obj->state, but we're trying to slowly fix 
these
when we come across them. This usually involves passing the correct state 
object up
the callchain, which can be quite