Re: [PATCH] drm/radeon: disable any GPU activity after unrecovered lockup v5

2012-06-28 Thread Michel Dänzer
On Mit, 2012-06-27 at 14:14 -0400, j.gli...@gmail.com wrote: 
 From: Jerome Glisse jgli...@redhat.com
 
 After unrecovered GPU lockup avoid any GPU activities to avoid
 things like kernel segfault and alike to happen in any of the
 path that assume hw is working.
 
 The segfault is due to PCIE vram gart table being unmapped after
 suspend in the GPU reset path. To avoid segault to happen and to
 avoid further GPU activity if unsuccessful at reseting GPU we
 use the accel_working boolean to transform ttm activities into
 noop. It does not impact the module load path because in that
 path ttm have an empty schedule queue and accel_working will be
 set to true as soon as the gart table is in valid state. Because
 ttm might have work queued it is better to use the accel working
 then disabling radeon_bo ioctl.
 
 To trigger the segfault launch a program that repeatly create bo
 in ttm and let it run in background, then trigger gpu lockup from
 another process.
 
 This patch also for video mode restoring on r1xx,r2xx,r3xx,r4xx,
 r5xx,rs4xx,rs6xx GPU even if GPU reset fail. When GPU reset fails
 it is very likely (so far i never had it not working) that the
 modesetting part of the GPU is still alive. So we can have a
 chance to get kernel backtrace or other debugging informations
 on the screen if we always restore the video mode.
 
 v2: fix spelling error and disable accel before suspend and reenable
 it after pcie gart initialization to be even more cautious about
 possible segfault. Improve commit message
 v3: Improve commit message to describe the video mode restoring no
 matter what.
 v4: Avoid issue after successfull GPU lockup recovery. Don't do noop
 ttm move, instead report error if move needs bind or unbind or
 fallback to memcpy. Don't restrict new bo domain instead refuse
 to create new bo if gpu reset failed. Disable accel working
 in gart vram table unpin thus we don't change the behavior of
 the suspend path.
 v5: Avoid set domain to also trigger noop bind/unbind, instead force
 it to wait for GPU reset to go through or return failure if
 gpu reset fails.
 
 cc: sta...@vger.kernel.org
 Signed-off-by: Jerome Glisse jgli...@redhat.com

[...]

 + /* try memcpy */
 + goto memcpy;

This comment is redundant. :)

Either way though,

Reviewed-by: Michel Dänzer michel.daen...@amd.com


-- 
Earthling Michel Dänzer   |   http://www.amd.com
Libre software enthusiast |  Debian, X and DRI developer
___
dri-devel mailing list
dri-devel@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/dri-devel


Re: [PATCH] drm/radeon: disable any GPU activity after unrecovered lockup

2012-06-27 Thread Michel Dänzer
On Die, 2012-06-26 at 17:04 -0400, j.gli...@gmail.com wrote:
 From: Jerome Glisse jgli...@redhat.com
 
 After unrecovered GPU lockup avoid any GPU activities to avoid
 things like kernel segfault and alike to happen in any of the
 path that assume hw is working.

Has the patch been tested and confirmed to actually fix such a problem?


   r = radeon_asic_reset(rdev);
   if (!r) {
   dev_info(rdev-dev, GPU reset succeed\n);
   radeon_resume(rdev);
 - radeon_restore_bios_scratch_regs(rdev);
 - drm_helper_resume_force_mode(rdev-ddev);
 - ttm_bo_unlock_delayed_workqueue(rdev-mman.bdev, resched);
   }
  
 + /* no matter what restore video mode */
 + radeon_restore_bios_scratch_regs(rdev);
 + drm_helper_resume_force_mode(rdev-ddev);
 + ttm_bo_unlock_delayed_workqueue(rdev-mman.bdev, resched);

Maybe this should be in a separate patch.


 @@ -399,6 +418,14 @@ static int radeon_bo_move(struct ttm_buffer_object *bo,
   radeon_move_null(bo, new_mem);
   return 0;
   }
 + if (!rdev-accel_working) {
 + /* when accel is not working GPU is in broken state just
 +  * do nothing for any ttm operation to avoid making the
 +  * situation worst than it's

'worse than it is', same in the following two hunks.


-- 
Earthling Michel Dänzer   |   http://www.amd.com
Libre software enthusiast |  Debian, X and DRI developer
___
dri-devel mailing list
dri-devel@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/dri-devel


Re: [PATCH] drm/radeon: disable any GPU activity after unrecovered lockup

2012-06-27 Thread Jerome Glisse
On Wed, Jun 27, 2012 at 5:19 AM, Michel Dänzer mic...@daenzer.net wrote:
 On Die, 2012-06-26 at 17:04 -0400, j.gli...@gmail.com wrote:
 From: Jerome Glisse jgli...@redhat.com

 After unrecovered GPU lockup avoid any GPU activities to avoid
 things like kernel segfault and alike to happen in any of the
 path that assume hw is working.

 Has the patch been tested and confirmed to actually fix such a problem?

Yes it has been tested i dont send untested patch to ml.



       r = radeon_asic_reset(rdev);
       if (!r) {
               dev_info(rdev-dev, GPU reset succeed\n);
               radeon_resume(rdev);
 -             radeon_restore_bios_scratch_regs(rdev);
 -             drm_helper_resume_force_mode(rdev-ddev);
 -             ttm_bo_unlock_delayed_workqueue(rdev-mman.bdev, resched);
       }

 +     /* no matter what restore video mode */
 +     radeon_restore_bios_scratch_regs(rdev);
 +     drm_helper_resume_force_mode(rdev-ddev);
 +     ttm_bo_unlock_delayed_workqueue(rdev-mman.bdev, resched);

 Maybe this should be in a separate patch.

Idea is to send this patch to stable thus having one patch that have it all.



 @@ -399,6 +418,14 @@ static int radeon_bo_move(struct ttm_buffer_object *bo,
               radeon_move_null(bo, new_mem);
               return 0;
       }
 +     if (!rdev-accel_working) {
 +             /* when accel is not working GPU is in broken state just
 +              * do nothing for any ttm operation to avoid making the
 +              * situation worst than it's

 'worse than it is', same in the following two hunks.


Cheers,
Jerome
___
dri-devel mailing list
dri-devel@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/dri-devel


Re: [PATCH] drm/radeon: disable any GPU activity after unrecovered lockup

2012-06-27 Thread Michel Dänzer
On Mit, 2012-06-27 at 10:49 -0400, Jerome Glisse wrote: 
 On Wed, Jun 27, 2012 at 5:19 AM, Michel Dänzer mic...@daenzer.net wrote:
  On Die, 2012-06-26 at 17:04 -0400, j.gli...@gmail.com wrote:
  From: Jerome Glisse jgli...@redhat.com
 
  After unrecovered GPU lockup avoid any GPU activities to avoid
  things like kernel segfault and alike to happen in any of the
  path that assume hw is working.
 
  Has the patch been tested and confirmed to actually fix such a problem?
 
 Yes it has been tested i dont send untested patch to ml.

I didn't expect (or mean to suggest) otherwise. I think I misread the
related IRC conversation from last night: I thought you basically
whipped up this patch in response to a report of such problems. But on
re-reading now, I guess you wrote this patch a while ago and are just
sending it now in response to the report on IRC.


r = radeon_asic_reset(rdev);
if (!r) {
dev_info(rdev-dev, GPU reset succeed\n);
radeon_resume(rdev);
  - radeon_restore_bios_scratch_regs(rdev);
  - drm_helper_resume_force_mode(rdev-ddev);
  - ttm_bo_unlock_delayed_workqueue(rdev-mman.bdev, resched);
}
 
  + /* no matter what restore video mode */
  + radeon_restore_bios_scratch_regs(rdev);
  + drm_helper_resume_force_mode(rdev-ddev);
  + ttm_bo_unlock_delayed_workqueue(rdev-mman.bdev, resched);
 
  Maybe this should be in a separate patch.
 
 Idea is to send this patch to stable thus having one patch that have it all.

That doesn't make sense. Either the changes belong into a single patch
(but then the commit log should describe all of them) or not. They can
be sent to stable[0] either way.

[0] Actually, patches with Cc: stable are picked up automagically once
they hit mainline, there's no point in sending them there directly.


  @@ -399,6 +418,14 @@ static int radeon_bo_move(struct ttm_buffer_object 
  *bo,
radeon_move_null(bo, new_mem);
return 0;
}
  + if (!rdev-accel_working) {
  + /* when accel is not working GPU is in broken state just
  +  * do nothing for any ttm operation to avoid making the
  +  * situation worst than it's
 
  'worse than it is', same in the following two hunks.

Are you gonna fix these typos?


-- 
Earthling Michel Dänzer   |   http://www.amd.com
Libre software enthusiast |  Debian, X and DRI developer
___
dri-devel mailing list
dri-devel@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/dri-devel