Re: [PATCH] drm/radeon: disable any GPU activity after unrecovered lockup v5
On Mit, 2012-06-27 at 14:14 -0400, j.gli...@gmail.com wrote: From: Jerome Glisse jgli...@redhat.com After unrecovered GPU lockup avoid any GPU activities to avoid things like kernel segfault and alike to happen in any of the path that assume hw is working. The segfault is due to PCIE vram gart table being unmapped after suspend in the GPU reset path. To avoid segault to happen and to avoid further GPU activity if unsuccessful at reseting GPU we use the accel_working boolean to transform ttm activities into noop. It does not impact the module load path because in that path ttm have an empty schedule queue and accel_working will be set to true as soon as the gart table is in valid state. Because ttm might have work queued it is better to use the accel working then disabling radeon_bo ioctl. To trigger the segfault launch a program that repeatly create bo in ttm and let it run in background, then trigger gpu lockup from another process. This patch also for video mode restoring on r1xx,r2xx,r3xx,r4xx, r5xx,rs4xx,rs6xx GPU even if GPU reset fail. When GPU reset fails it is very likely (so far i never had it not working) that the modesetting part of the GPU is still alive. So we can have a chance to get kernel backtrace or other debugging informations on the screen if we always restore the video mode. v2: fix spelling error and disable accel before suspend and reenable it after pcie gart initialization to be even more cautious about possible segfault. Improve commit message v3: Improve commit message to describe the video mode restoring no matter what. v4: Avoid issue after successfull GPU lockup recovery. Don't do noop ttm move, instead report error if move needs bind or unbind or fallback to memcpy. Don't restrict new bo domain instead refuse to create new bo if gpu reset failed. Disable accel working in gart vram table unpin thus we don't change the behavior of the suspend path. v5: Avoid set domain to also trigger noop bind/unbind, instead force it to wait for GPU reset to go through or return failure if gpu reset fails. cc: sta...@vger.kernel.org Signed-off-by: Jerome Glisse jgli...@redhat.com [...] + /* try memcpy */ + goto memcpy; This comment is redundant. :) Either way though, Reviewed-by: Michel Dänzer michel.daen...@amd.com -- Earthling Michel Dänzer | http://www.amd.com Libre software enthusiast | Debian, X and DRI developer ___ dri-devel mailing list dri-devel@lists.freedesktop.org http://lists.freedesktop.org/mailman/listinfo/dri-devel
Re: [PATCH] drm/radeon: disable any GPU activity after unrecovered lockup
On Die, 2012-06-26 at 17:04 -0400, j.gli...@gmail.com wrote: From: Jerome Glisse jgli...@redhat.com After unrecovered GPU lockup avoid any GPU activities to avoid things like kernel segfault and alike to happen in any of the path that assume hw is working. Has the patch been tested and confirmed to actually fix such a problem? r = radeon_asic_reset(rdev); if (!r) { dev_info(rdev-dev, GPU reset succeed\n); radeon_resume(rdev); - radeon_restore_bios_scratch_regs(rdev); - drm_helper_resume_force_mode(rdev-ddev); - ttm_bo_unlock_delayed_workqueue(rdev-mman.bdev, resched); } + /* no matter what restore video mode */ + radeon_restore_bios_scratch_regs(rdev); + drm_helper_resume_force_mode(rdev-ddev); + ttm_bo_unlock_delayed_workqueue(rdev-mman.bdev, resched); Maybe this should be in a separate patch. @@ -399,6 +418,14 @@ static int radeon_bo_move(struct ttm_buffer_object *bo, radeon_move_null(bo, new_mem); return 0; } + if (!rdev-accel_working) { + /* when accel is not working GPU is in broken state just + * do nothing for any ttm operation to avoid making the + * situation worst than it's 'worse than it is', same in the following two hunks. -- Earthling Michel Dänzer | http://www.amd.com Libre software enthusiast | Debian, X and DRI developer ___ dri-devel mailing list dri-devel@lists.freedesktop.org http://lists.freedesktop.org/mailman/listinfo/dri-devel
Re: [PATCH] drm/radeon: disable any GPU activity after unrecovered lockup
On Wed, Jun 27, 2012 at 5:19 AM, Michel Dänzer mic...@daenzer.net wrote: On Die, 2012-06-26 at 17:04 -0400, j.gli...@gmail.com wrote: From: Jerome Glisse jgli...@redhat.com After unrecovered GPU lockup avoid any GPU activities to avoid things like kernel segfault and alike to happen in any of the path that assume hw is working. Has the patch been tested and confirmed to actually fix such a problem? Yes it has been tested i dont send untested patch to ml. r = radeon_asic_reset(rdev); if (!r) { dev_info(rdev-dev, GPU reset succeed\n); radeon_resume(rdev); - radeon_restore_bios_scratch_regs(rdev); - drm_helper_resume_force_mode(rdev-ddev); - ttm_bo_unlock_delayed_workqueue(rdev-mman.bdev, resched); } + /* no matter what restore video mode */ + radeon_restore_bios_scratch_regs(rdev); + drm_helper_resume_force_mode(rdev-ddev); + ttm_bo_unlock_delayed_workqueue(rdev-mman.bdev, resched); Maybe this should be in a separate patch. Idea is to send this patch to stable thus having one patch that have it all. @@ -399,6 +418,14 @@ static int radeon_bo_move(struct ttm_buffer_object *bo, radeon_move_null(bo, new_mem); return 0; } + if (!rdev-accel_working) { + /* when accel is not working GPU is in broken state just + * do nothing for any ttm operation to avoid making the + * situation worst than it's 'worse than it is', same in the following two hunks. Cheers, Jerome ___ dri-devel mailing list dri-devel@lists.freedesktop.org http://lists.freedesktop.org/mailman/listinfo/dri-devel
Re: [PATCH] drm/radeon: disable any GPU activity after unrecovered lockup
On Mit, 2012-06-27 at 10:49 -0400, Jerome Glisse wrote: On Wed, Jun 27, 2012 at 5:19 AM, Michel Dänzer mic...@daenzer.net wrote: On Die, 2012-06-26 at 17:04 -0400, j.gli...@gmail.com wrote: From: Jerome Glisse jgli...@redhat.com After unrecovered GPU lockup avoid any GPU activities to avoid things like kernel segfault and alike to happen in any of the path that assume hw is working. Has the patch been tested and confirmed to actually fix such a problem? Yes it has been tested i dont send untested patch to ml. I didn't expect (or mean to suggest) otherwise. I think I misread the related IRC conversation from last night: I thought you basically whipped up this patch in response to a report of such problems. But on re-reading now, I guess you wrote this patch a while ago and are just sending it now in response to the report on IRC. r = radeon_asic_reset(rdev); if (!r) { dev_info(rdev-dev, GPU reset succeed\n); radeon_resume(rdev); - radeon_restore_bios_scratch_regs(rdev); - drm_helper_resume_force_mode(rdev-ddev); - ttm_bo_unlock_delayed_workqueue(rdev-mman.bdev, resched); } + /* no matter what restore video mode */ + radeon_restore_bios_scratch_regs(rdev); + drm_helper_resume_force_mode(rdev-ddev); + ttm_bo_unlock_delayed_workqueue(rdev-mman.bdev, resched); Maybe this should be in a separate patch. Idea is to send this patch to stable thus having one patch that have it all. That doesn't make sense. Either the changes belong into a single patch (but then the commit log should describe all of them) or not. They can be sent to stable[0] either way. [0] Actually, patches with Cc: stable are picked up automagically once they hit mainline, there's no point in sending them there directly. @@ -399,6 +418,14 @@ static int radeon_bo_move(struct ttm_buffer_object *bo, radeon_move_null(bo, new_mem); return 0; } + if (!rdev-accel_working) { + /* when accel is not working GPU is in broken state just + * do nothing for any ttm operation to avoid making the + * situation worst than it's 'worse than it is', same in the following two hunks. Are you gonna fix these typos? -- Earthling Michel Dänzer | http://www.amd.com Libre software enthusiast | Debian, X and DRI developer ___ dri-devel mailing list dri-devel@lists.freedesktop.org http://lists.freedesktop.org/mailman/listinfo/dri-devel