Re: [PATCH] drm/radeon/kms: fence cleanup + more reliable GPU lockup detection
On Mon, Mar 01, 2010 at 04:00:02PM +1000, Dave Airlie wrote: > On Mon, Mar 1, 2010 at 2:47 AM, Jerome Glisse wrote: > > On Sun, Feb 28, 2010 at 12:22:52PM +, Alan Swanson wrote: > >> On Fri, 2010-02-26 at 15:49 +0100, Jerome Glisse wrote: > >> > This patch cleanup the fence code, it drops the timeout field of > >> > fence as the time to complete each IB is unpredictable and shouldn't > >> > be bound. > >> > > >> > The fence cleanup lead to GPU lockup detection improvement, this > >> > patch introduce a callback, allowing to do asic specific test for > >> > lockup detection. In this patch the CP is use as a first indicator > >> > of GPU lockup. If CP doesn't make progress during 1second we assume > >> > we are facing a GPU lockup. > >> > > >> > To avoid overhead of testing GPU lockup frequently due to fence > >> > taking time to be signaled we query the lockup callback every > >> > 100msec. There is plenty code comment explaining the design & choise > >> > inside the code. > >> > >> Every 100msec? Is this running all the time? If so, that's not very good > >> for CPU power saving to lower C-states in an idle system. We could at > >> least use one of the round_jiffies. > >> > > > > This run only when userspace call bo wait thus it only happen when userspace > > is waiting for something. > > Why not just test when the old timeout code used to test? every second or so? > > I'm not sure why with the old code instead of assuming a fence timeout implied > a lockup you didn't just change it to test if it was a real lockup and > continue waiting > if the GPU was making progress. This seems simpler, though maybe the cleanups > are worth it. > > Dave. > The old code was misleading we didn't test every second or so but it was depending on fence timeout and it was quite small amount of time don't remember of hand. Here on the fast r7xx with 2 quake3 bunch of gears and compiz the average time for fence is 20ms. But i think i will switch back to half a second timeout some irq will likely wake up us. Note that if you remove the comment lines in my patch i am pretty sure my patch is removing code rather than adding some :) I will do a V3 today. Cheers, Jerome -- Download Intel® Parallel Studio Eval Try the new software tools for yourself. Speed compiling, find bugs proactively, and fine-tune applications for parallel performance. See why Intel Parallel Studio got high marks during beta. http://p.sf.net/sfu/intel-sw-dev -- ___ Dri-devel mailing list Dri-devel@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/dri-devel
Re: [PATCH] drm/radeon/kms: fence cleanup + more reliable GPU lockup detection
On Mon, Mar 1, 2010 at 2:47 AM, Jerome Glisse wrote: > On Sun, Feb 28, 2010 at 12:22:52PM +, Alan Swanson wrote: >> On Fri, 2010-02-26 at 15:49 +0100, Jerome Glisse wrote: >> > This patch cleanup the fence code, it drops the timeout field of >> > fence as the time to complete each IB is unpredictable and shouldn't >> > be bound. >> > >> > The fence cleanup lead to GPU lockup detection improvement, this >> > patch introduce a callback, allowing to do asic specific test for >> > lockup detection. In this patch the CP is use as a first indicator >> > of GPU lockup. If CP doesn't make progress during 1second we assume >> > we are facing a GPU lockup. >> > >> > To avoid overhead of testing GPU lockup frequently due to fence >> > taking time to be signaled we query the lockup callback every >> > 100msec. There is plenty code comment explaining the design & choise >> > inside the code. >> >> Every 100msec? Is this running all the time? If so, that's not very good >> for CPU power saving to lower C-states in an idle system. We could at >> least use one of the round_jiffies. >> > > This run only when userspace call bo wait thus it only happen when userspace > is waiting for something. Why not just test when the old timeout code used to test? every second or so? I'm not sure why with the old code instead of assuming a fence timeout implied a lockup you didn't just change it to test if it was a real lockup and continue waiting if the GPU was making progress. This seems simpler, though maybe the cleanups are worth it. Dave. -- Download Intel® Parallel Studio Eval Try the new software tools for yourself. Speed compiling, find bugs proactively, and fine-tune applications for parallel performance. See why Intel Parallel Studio got high marks during beta. http://p.sf.net/sfu/intel-sw-dev -- ___ Dri-devel mailing list Dri-devel@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/dri-devel
Re: [PATCH] drm/radeon/kms: fence cleanup + more reliable GPU lockup detection
On Sun, Feb 28, 2010 at 12:22:52PM +, Alan Swanson wrote: > On Fri, 2010-02-26 at 15:49 +0100, Jerome Glisse wrote: > > This patch cleanup the fence code, it drops the timeout field of > > fence as the time to complete each IB is unpredictable and shouldn't > > be bound. > > > > The fence cleanup lead to GPU lockup detection improvement, this > > patch introduce a callback, allowing to do asic specific test for > > lockup detection. In this patch the CP is use as a first indicator > > of GPU lockup. If CP doesn't make progress during 1second we assume > > we are facing a GPU lockup. > > > > To avoid overhead of testing GPU lockup frequently due to fence > > taking time to be signaled we query the lockup callback every > > 100msec. There is plenty code comment explaining the design & choise > > inside the code. > > Every 100msec? Is this running all the time? If so, that's not very good > for CPU power saving to lower C-states in an idle system. We could at > least use one of the round_jiffies. > This run only when userspace call bo wait thus it only happen when userspace is waiting for something. Cheers, Jerome -- Download Intel® Parallel Studio Eval Try the new software tools for yourself. Speed compiling, find bugs proactively, and fine-tune applications for parallel performance. See why Intel Parallel Studio got high marks during beta. http://p.sf.net/sfu/intel-sw-dev -- ___ Dri-devel mailing list Dri-devel@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/dri-devel
Re: [PATCH] drm/radeon/kms: fence cleanup + more reliable GPU lockup detection
On Fri, 2010-02-26 at 15:49 +0100, Jerome Glisse wrote: > This patch cleanup the fence code, it drops the timeout field of > fence as the time to complete each IB is unpredictable and shouldn't > be bound. > > The fence cleanup lead to GPU lockup detection improvement, this > patch introduce a callback, allowing to do asic specific test for > lockup detection. In this patch the CP is use as a first indicator > of GPU lockup. If CP doesn't make progress during 1second we assume > we are facing a GPU lockup. > > To avoid overhead of testing GPU lockup frequently due to fence > taking time to be signaled we query the lockup callback every > 100msec. There is plenty code comment explaining the design & choise > inside the code. Every 100msec? Is this running all the time? If so, that's not very good for CPU power saving to lower C-states in an idle system. We could at least use one of the round_jiffies. > This have been tested mostly on R3XX/R5XX hw, in normal running > destkop (compiz firefox, quake3 running) the lockup callback wasn't > call once (1 hour session). Also tested with forcing GPU lockup and > lockup was reported after the 1s CP activity timeout. > > Signed-off-by: Jerome Glisse -- Alan. "One must never be purposelessnessnesslessness." -- Download Intel® Parallel Studio Eval Try the new software tools for yourself. Speed compiling, find bugs proactively, and fine-tune applications for parallel performance. See why Intel Parallel Studio got high marks during beta. http://p.sf.net/sfu/intel-sw-dev -- ___ Dri-devel mailing list Dri-devel@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/dri-devel
Re: [PATCH] drm/radeon/kms: fence cleanup + more reliable GPU lockup detection
2010/2/26 Jerome Glisse : > The fence cleanup lead to GPU lockup detection improvement, this > patch introduce a callback, allowing to do asic specific test for > lockup detection. In this patch the CP is use as a first indicator > of GPU lockup. If CP doesn't make progress during 1second we assume > we are facing a GPU lockup. > > To avoid overhead of testing GPU lockup frequently due to fence > taking time to be signaled we query the lockup callback every > 100msec. There is plenty code comment explaining the design & choise > inside the code. Let me check this with my reproducible lock up :) Can I test this on top of any branch (like drm-next or drm-radeon-testing)? Or does it depend on some specific patches? -- Rafał -- Download Intel® Parallel Studio Eval Try the new software tools for yourself. Speed compiling, find bugs proactively, and fine-tune applications for parallel performance. See why Intel Parallel Studio got high marks during beta. http://p.sf.net/sfu/intel-sw-dev -- ___ Dri-devel mailing list Dri-devel@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/dri-devel
[PATCH] drm/radeon/kms: fence cleanup + more reliable GPU lockup detection
This patch cleanup the fence code, it drops the timeout field of fence as the time to complete each IB is unpredictable and shouldn't be bound. The fence cleanup lead to GPU lockup detection improvement, this patch introduce a callback, allowing to do asic specific test for lockup detection. In this patch the CP is use as a first indicator of GPU lockup. If CP doesn't make progress during 1second we assume we are facing a GPU lockup. To avoid overhead of testing GPU lockup frequently due to fence taking time to be signaled we query the lockup callback every 100msec. There is plenty code comment explaining the design & choise inside the code. This have been tested mostly on R3XX/R5XX hw, in normal running destkop (compiz firefox, quake3 running) the lockup callback wasn't call once (1 hour session). Also tested with forcing GPU lockup and lockup was reported after the 1s CP activity timeout. Signed-off-by: Jerome Glisse --- drivers/gpu/drm/radeon/evergreen.c|6 ++ drivers/gpu/drm/radeon/r100.c | 84 +++ drivers/gpu/drm/radeon/r300.c | 27 - drivers/gpu/drm/radeon/r600.c | 33 +- drivers/gpu/drm/radeon/radeon.h | 101 ++-- drivers/gpu/drm/radeon/radeon_asic.h | 20 ++- drivers/gpu/drm/radeon/radeon_fence.c | 87 ++-- drivers/gpu/drm/radeon/rv770.c|6 -- 8 files changed, 262 insertions(+), 102 deletions(-) diff --git a/drivers/gpu/drm/radeon/evergreen.c b/drivers/gpu/drm/radeon/evergreen.c index b8cd119..11688e2 100644 --- a/drivers/gpu/drm/radeon/evergreen.c +++ b/drivers/gpu/drm/radeon/evergreen.c @@ -485,6 +485,12 @@ int evergreen_mc_init(struct radeon_device *rdev) return 0; } +bool evergreen_gpu_is_lockup(struct radeon_device *rdev) +{ + /* FIXME: implement for evergreen */ + return false; +} + int evergreen_gpu_reset(struct radeon_device *rdev) { /* FIXME: implement for evergreen */ diff --git a/drivers/gpu/drm/radeon/r100.c b/drivers/gpu/drm/radeon/r100.c index 932ce24..a77e754 100644 --- a/drivers/gpu/drm/radeon/r100.c +++ b/drivers/gpu/drm/radeon/r100.c @@ -1780,6 +1780,90 @@ int r100_rb2d_reset(struct radeon_device *rdev) return -1; } +void r100_gpu_lockup_update(struct r100_gpu_lockup *lockup, struct radeon_cp *cp) +{ + lockup->last_cp_rptr = cp->rptr; + lockup->last_jiffies = jiffies; +} + +/** + * r100_gpu_cp_is_lockup() - check if CP is lockup by recording information + * @rdev: radeon device structure + * @lockup:r100_gpu_lockup structure holding CP lockup tracking informations + * @cp:radeon_cp structure holding CP information + * + * We don't need to initialize the lockup tracking information as we will either + * have CP rptr to a different value of jiffies wrap around which will force + * initialization of the lockup tracking informations. + * + * A possible false positivie is if we get call after while and last_cp_rptr == + * the current CP rptr, even if it's unlikely it might happen. To avoid this + * if the elapsed time since last call is bigger than 2 second than we return + * false and update the tracking information. Due to this the caller must call + * r100_gpu_cp_is_lockup several time in less than 2sec for lockup to be reported + * the fencing code should be cautious about that. + * + * Caller should write to the ring to force CP to do something so we don't get + * false positive when CP is just gived nothing to do. + * + **/ +bool r100_gpu_cp_is_lockup(struct radeon_device *rdev, struct r100_gpu_lockup *lockup, struct radeon_cp *cp) +{ + unsigned long cjiffies, elapsed; + + cjiffies = jiffies; + if (!time_after(cjiffies, lockup->last_jiffies)) { + /* likely a wrap around */ + lockup->last_jiffies = jiffies; + return false; + } + if (cp->rptr != lockup->last_cp_rptr) { + /* CP is still working no lockup */ + lockup->last_cp_rptr = cp->rptr; + lockup->last_jiffies = jiffies; + return false; + } + elapsed = jiffies_to_msecs(cjiffies - lockup->last_jiffies); + if (elapsed >= 2000) { + /* very likely the improbable case where current +* rptr is equal to last recorded, a while ago, rptr +* this is more likely a false positive update tracking +* information which should force us to be recall at +* latter point +*/ + lockup->last_cp_rptr = cp->rptr; + lockup->last_jiffies = jiffies; + return false; + } + if (elapsed >= 1000) { + dev_err(rdev->dev, "GPU lockup CP stall for more than %lumsec\n", elapsed); + return true; + } + /* give a chance to the GPU ... */ + return false; +} + +bool r100_gpu_is_lockup(str