Re: [PATCH] drm/amd/display: Fix pageflip event race condition for DCN. (v2)
Just to follow up on this, I decided not to file the report, since Nick's latest patch that "Fixes:" this has fixed my issue. Thanks for the good work on that one Nick. Might want to make sure that makes it in to the 5.7 fixes :) Thanks again guys. ~Matt On 5/5/20 11:59 AM, Kazlauskas, Nicholas wrote: > Can you file a full bug report on the gitlab tracker? > > FreeSync is still working on my Navi setups with this patch applied, and > this patch is essentially just a revert of another patch already (where > FreeSync worked before). > > I can understand the v2 of this series causing issues, but the v1 > shouldn't be - so I'd like to understand more about the setup where this > is causing issues - ASIC, OS, compositor, displays, dmesg log, X log, etc. > > Regards, > Nicholas Kazlauskas > > On 2020-05-05 1:03 p.m., Alex Deucher wrote: >> Mario or Nick any thoughts? >> >> Alex >> >> On Mon, May 4, 2020 at 1:35 PM Matt Coffin wrote: >>> >>> Hey guys, >>> >>> This is still an issue for me, and I'm still having to run a patch to >>> revert this as of 5.7-rc4. To avoid breaking a lot of people's Navi >>> setups in 5.7, is there any news on this? Has anyone else at the very >>> least been able to reproduce the problem? >>> >>> It happens for me in every single program that mesa allows to utilize >>> variable refresh rates, and reverting it "fixes" the issue. >>> >>> Cheers, and sorry for the extra email, just making sure this is still on >>> someone's radar, >>> Matt >>> >>> On 4/14/20 5:32 PM, Matt Coffin wrote: Hey everyone, This patch broke variable refresh rate in games (all that I've tried so far... Project CARS 2, DiRT Rally 2.0, Assetto Corsa Competizione) as well as a simple freesync tester application. FreeSync tester I've been using: https://github.com/Nixola/VRRTest I'm not at all familiar with the page flipping code, so it would take me a long time to find the *right* way to fix it, but does someone else see why it would do that? The symptom is that the refresh rate of the display constantly bounces between the two ends of the FreeSync range (for me 40 -> 144), and the game stutters like a madman. Any help on where to start, ideas on how to fix it (other than just revert this commit, which I've done in the interim), or alternative patches would be appreciated. Thanks in advance for the work/help, Matt On 3/13/20 8:42 AM, Michel Dänzer wrote: > On 2020-03-13 1:35 p.m., Kazlauskas, Nicholas wrote: >> On 2020-03-12 10:32 a.m., Alex Deucher wrote: >>> On Thu, Mar 5, 2020 at 4:21 PM Mario Kleiner >>> wrote: Commit '16f17eda8bad ("drm/amd/display: Send vblank and user events at vsartup for DCN")' introduces a new way of pageflip completion handling for DCN, and some trouble. The current implementation introduces a race condition, which can cause pageflip completion events to be sent out one vblank too early, thereby confusing userspace and causing flicker: prepare_flip_isr(): 1. Pageflip programming takes the ddev->event_lock. 2. Sets acrtc->pflip_status == AMDGPU_FLIP_SUBMITTED 3. Releases ddev->event_lock. --> Deadline for surface address regs double-buffering passes on target pipe. 4. dc_commit_updates_for_stream() MMIO programs the new pageflip into hw, but too late for current vblank. => pflip_status == AMDGPU_FLIP_SUBMITTED, but flip won't complete in current vblank due to missing the double-buffering deadline by a tiny bit. 5. VSTARTUP trigger point in vblank is reached, VSTARTUP irq fires, dm_dcn_crtc_high_irq() gets called. 6. Detects pflip_status == AMDGPU_FLIP_SUBMITTED and assumes the pageflip has been completed/will complete in this vblank and sends out pageflip completion event to userspace and resets pflip_status = AMDGPU_FLIP_NONE. => Flip completion event sent out one vblank too early. This behaviour has been observed during my testing with measurement hardware a couple of time. The commit message says that the extra flip event code was added to dm_dcn_crtc_high_irq() to prevent missing to send out pageflip events in case the pflip irq doesn't fire, because the "DCH HUBP" component is clock gated and doesn't fire pflip irqs in that state. Also that this clock gating may happen if no planes are active. According to Nicholas, the clock gating can also happen if psr is active, and the gating is controlled independently by the hardware, so difficult to detect if and when
Re: [PATCH] drm/amd/display: Fix pageflip event race condition for DCN. (v2)
Sure, I'll file one after work today. To clarify though, FreeSync still "works" as in the monitor refresh rate is updating, but it constantly bounces between the maximum and minimum freesync refresh rates, causing it to look VERY stuttery. Thanks for the attention, I'll file the but tonight. If you want another reference person, it got a little discussion in this bug report... https://gitlab.freedesktop.org/drm/amd/-/issues/1002#note_486494 Cheers, Matt On 5/5/20 11:59 AM, Kazlauskas, Nicholas wrote: > Can you file a full bug report on the gitlab tracker? > > FreeSync is still working on my Navi setups with this patch applied, and > this patch is essentially just a revert of another patch already (where > FreeSync worked before). > > I can understand the v2 of this series causing issues, but the v1 > shouldn't be - so I'd like to understand more about the setup where this > is causing issues - ASIC, OS, compositor, displays, dmesg log, X log, etc. > > Regards, > Nicholas Kazlauskas > > On 2020-05-05 1:03 p.m., Alex Deucher wrote: >> Mario or Nick any thoughts? >> >> Alex >> >> On Mon, May 4, 2020 at 1:35 PM Matt Coffin wrote: >>> >>> Hey guys, >>> >>> This is still an issue for me, and I'm still having to run a patch to >>> revert this as of 5.7-rc4. To avoid breaking a lot of people's Navi >>> setups in 5.7, is there any news on this? Has anyone else at the very >>> least been able to reproduce the problem? >>> >>> It happens for me in every single program that mesa allows to utilize >>> variable refresh rates, and reverting it "fixes" the issue. >>> >>> Cheers, and sorry for the extra email, just making sure this is still on >>> someone's radar, >>> Matt >>> >>> On 4/14/20 5:32 PM, Matt Coffin wrote: Hey everyone, This patch broke variable refresh rate in games (all that I've tried so far... Project CARS 2, DiRT Rally 2.0, Assetto Corsa Competizione) as well as a simple freesync tester application. FreeSync tester I've been using: https://github.com/Nixola/VRRTest I'm not at all familiar with the page flipping code, so it would take me a long time to find the *right* way to fix it, but does someone else see why it would do that? The symptom is that the refresh rate of the display constantly bounces between the two ends of the FreeSync range (for me 40 -> 144), and the game stutters like a madman. Any help on where to start, ideas on how to fix it (other than just revert this commit, which I've done in the interim), or alternative patches would be appreciated. Thanks in advance for the work/help, Matt On 3/13/20 8:42 AM, Michel Dänzer wrote: > On 2020-03-13 1:35 p.m., Kazlauskas, Nicholas wrote: >> On 2020-03-12 10:32 a.m., Alex Deucher wrote: >>> On Thu, Mar 5, 2020 at 4:21 PM Mario Kleiner >>> wrote: Commit '16f17eda8bad ("drm/amd/display: Send vblank and user events at vsartup for DCN")' introduces a new way of pageflip completion handling for DCN, and some trouble. The current implementation introduces a race condition, which can cause pageflip completion events to be sent out one vblank too early, thereby confusing userspace and causing flicker: prepare_flip_isr(): 1. Pageflip programming takes the ddev->event_lock. 2. Sets acrtc->pflip_status == AMDGPU_FLIP_SUBMITTED 3. Releases ddev->event_lock. --> Deadline for surface address regs double-buffering passes on target pipe. 4. dc_commit_updates_for_stream() MMIO programs the new pageflip into hw, but too late for current vblank. => pflip_status == AMDGPU_FLIP_SUBMITTED, but flip won't complete in current vblank due to missing the double-buffering deadline by a tiny bit. 5. VSTARTUP trigger point in vblank is reached, VSTARTUP irq fires, dm_dcn_crtc_high_irq() gets called. 6. Detects pflip_status == AMDGPU_FLIP_SUBMITTED and assumes the pageflip has been completed/will complete in this vblank and sends out pageflip completion event to userspace and resets pflip_status = AMDGPU_FLIP_NONE. => Flip completion event sent out one vblank too early. This behaviour has been observed during my testing with measurement hardware a couple of time. The commit message says that the extra flip event code was added to dm_dcn_crtc_high_irq() to prevent missing to send out pageflip events in case the pflip irq doesn't fire, because the "DCH HUBP" component is clock gated and doesn't fire pflip irqs in that state. Also that this clock gating may happen if no planes are active.
Re: [PATCH] drm/amd/display: Fix pageflip event race condition for DCN. (v2)
Can you file a full bug report on the gitlab tracker? FreeSync is still working on my Navi setups with this patch applied, and this patch is essentially just a revert of another patch already (where FreeSync worked before). I can understand the v2 of this series causing issues, but the v1 shouldn't be - so I'd like to understand more about the setup where this is causing issues - ASIC, OS, compositor, displays, dmesg log, X log, etc. Regards, Nicholas Kazlauskas On 2020-05-05 1:03 p.m., Alex Deucher wrote: Mario or Nick any thoughts? Alex On Mon, May 4, 2020 at 1:35 PM Matt Coffin wrote: Hey guys, This is still an issue for me, and I'm still having to run a patch to revert this as of 5.7-rc4. To avoid breaking a lot of people's Navi setups in 5.7, is there any news on this? Has anyone else at the very least been able to reproduce the problem? It happens for me in every single program that mesa allows to utilize variable refresh rates, and reverting it "fixes" the issue. Cheers, and sorry for the extra email, just making sure this is still on someone's radar, Matt On 4/14/20 5:32 PM, Matt Coffin wrote: Hey everyone, This patch broke variable refresh rate in games (all that I've tried so far... Project CARS 2, DiRT Rally 2.0, Assetto Corsa Competizione) as well as a simple freesync tester application. FreeSync tester I've been using: https://github.com/Nixola/VRRTest I'm not at all familiar with the page flipping code, so it would take me a long time to find the *right* way to fix it, but does someone else see why it would do that? The symptom is that the refresh rate of the display constantly bounces between the two ends of the FreeSync range (for me 40 -> 144), and the game stutters like a madman. Any help on where to start, ideas on how to fix it (other than just revert this commit, which I've done in the interim), or alternative patches would be appreciated. Thanks in advance for the work/help, Matt On 3/13/20 8:42 AM, Michel Dänzer wrote: On 2020-03-13 1:35 p.m., Kazlauskas, Nicholas wrote: On 2020-03-12 10:32 a.m., Alex Deucher wrote: On Thu, Mar 5, 2020 at 4:21 PM Mario Kleiner wrote: Commit '16f17eda8bad ("drm/amd/display: Send vblank and user events at vsartup for DCN")' introduces a new way of pageflip completion handling for DCN, and some trouble. The current implementation introduces a race condition, which can cause pageflip completion events to be sent out one vblank too early, thereby confusing userspace and causing flicker: prepare_flip_isr(): 1. Pageflip programming takes the ddev->event_lock. 2. Sets acrtc->pflip_status == AMDGPU_FLIP_SUBMITTED 3. Releases ddev->event_lock. --> Deadline for surface address regs double-buffering passes on target pipe. 4. dc_commit_updates_for_stream() MMIO programs the new pageflip into hw, but too late for current vblank. => pflip_status == AMDGPU_FLIP_SUBMITTED, but flip won't complete in current vblank due to missing the double-buffering deadline by a tiny bit. 5. VSTARTUP trigger point in vblank is reached, VSTARTUP irq fires, dm_dcn_crtc_high_irq() gets called. 6. Detects pflip_status == AMDGPU_FLIP_SUBMITTED and assumes the pageflip has been completed/will complete in this vblank and sends out pageflip completion event to userspace and resets pflip_status = AMDGPU_FLIP_NONE. => Flip completion event sent out one vblank too early. This behaviour has been observed during my testing with measurement hardware a couple of time. The commit message says that the extra flip event code was added to dm_dcn_crtc_high_irq() to prevent missing to send out pageflip events in case the pflip irq doesn't fire, because the "DCH HUBP" component is clock gated and doesn't fire pflip irqs in that state. Also that this clock gating may happen if no planes are active. According to Nicholas, the clock gating can also happen if psr is active, and the gating is controlled independently by the hardware, so difficult to detect if and when the completion code in above commit is needed. This patch tries the following solution: It only executes the extra pflip completion code in dm_dcn_crtc_high_irq() iff the hardware reports that there aren't any surface updated pending in the double-buffered surface scanout address registers. Otherwise it leaves pflip completion to the pflip irq handler, for a more race-free experience. This would only guard against the order of events mentioned above. If Step 5 (VSTARTUP trigger) happens before step 4 then this won't help at all, because 1-3 + 5 might happen even without the hw being programmed at all, ie. no surface update pending because none yet programmed into hw. Therefore this patch also changes locking in amdgpu_dm_commit_planes(), so that prepare_flip_isr() and dc_commit_updates_for_stream() are done under event_lock protection within the same critical section. v2: Take Nicholas comments into account, try a different solution. Lightly tested on Pol
Re: [PATCH] drm/amd/display: Fix pageflip event race condition for DCN. (v2)
Mario or Nick any thoughts? Alex On Mon, May 4, 2020 at 1:35 PM Matt Coffin wrote: > > Hey guys, > > This is still an issue for me, and I'm still having to run a patch to > revert this as of 5.7-rc4. To avoid breaking a lot of people's Navi > setups in 5.7, is there any news on this? Has anyone else at the very > least been able to reproduce the problem? > > It happens for me in every single program that mesa allows to utilize > variable refresh rates, and reverting it "fixes" the issue. > > Cheers, and sorry for the extra email, just making sure this is still on > someone's radar, > Matt > > On 4/14/20 5:32 PM, Matt Coffin wrote: > > Hey everyone, > > > > This patch broke variable refresh rate in games (all that I've tried so > > far... Project CARS 2, DiRT Rally 2.0, Assetto Corsa Competizione) as > > well as a simple freesync tester application. > > > > FreeSync tester I've been using: https://github.com/Nixola/VRRTest > > > > I'm not at all familiar with the page flipping code, so it would take me > > a long time to find the *right* way to fix it, but does someone else see > > why it would do that? > > > > The symptom is that the refresh rate of the display constantly bounces > > between the two ends of the FreeSync range (for me 40 -> 144), and the > > game stutters like a madman. > > > > Any help on where to start, ideas on how to fix it (other than just > > revert this commit, which I've done in the interim), or alternative > > patches would be appreciated. > > > > Thanks in advance for the work/help, > > Matt > > > > On 3/13/20 8:42 AM, Michel Dänzer wrote: > >> On 2020-03-13 1:35 p.m., Kazlauskas, Nicholas wrote: > >>> On 2020-03-12 10:32 a.m., Alex Deucher wrote: > On Thu, Mar 5, 2020 at 4:21 PM Mario Kleiner > wrote: > > > > Commit '16f17eda8bad ("drm/amd/display: Send vblank and user > > events at vsartup for DCN")' introduces a new way of pageflip > > completion handling for DCN, and some trouble. > > > > The current implementation introduces a race condition, which > > can cause pageflip completion events to be sent out one vblank > > too early, thereby confusing userspace and causing flicker: > > > > prepare_flip_isr(): > > > > 1. Pageflip programming takes the ddev->event_lock. > > 2. Sets acrtc->pflip_status == AMDGPU_FLIP_SUBMITTED > > 3. Releases ddev->event_lock. > > > > --> Deadline for surface address regs double-buffering passes on > > target pipe. > > > > 4. dc_commit_updates_for_stream() MMIO programs the new pageflip > > into hw, but too late for current vblank. > > > > => pflip_status == AMDGPU_FLIP_SUBMITTED, but flip won't complete > > in current vblank due to missing the double-buffering deadline > > by a tiny bit. > > > > 5. VSTARTUP trigger point in vblank is reached, VSTARTUP irq fires, > > dm_dcn_crtc_high_irq() gets called. > > > > 6. Detects pflip_status == AMDGPU_FLIP_SUBMITTED and assumes the > > pageflip has been completed/will complete in this vblank and > > sends out pageflip completion event to userspace and resets > > pflip_status = AMDGPU_FLIP_NONE. > > > > => Flip completion event sent out one vblank too early. > > > > This behaviour has been observed during my testing with measurement > > hardware a couple of time. > > > > The commit message says that the extra flip event code was added to > > dm_dcn_crtc_high_irq() to prevent missing to send out pageflip events > > in case the pflip irq doesn't fire, because the "DCH HUBP" component > > is clock gated and doesn't fire pflip irqs in that state. Also that > > this clock gating may happen if no planes are active. According to > > Nicholas, the clock gating can also happen if psr is active, and the > > gating is controlled independently by the hardware, so difficult to > > detect if and when the completion code in above commit is needed. > > > > This patch tries the following solution: It only executes the extra > > pflip > > completion code in dm_dcn_crtc_high_irq() iff the hardware reports > > that there aren't any surface updated pending in the double-buffered > > surface scanout address registers. Otherwise it leaves pflip completion > > to the pflip irq handler, for a more race-free experience. > > > > This would only guard against the order of events mentioned above. > > If Step 5 (VSTARTUP trigger) happens before step 4 then this won't help > > at all, because 1-3 + 5 might happen even without the hw being > > programmed > > at all, ie. no surface update pending because none yet programmed > > into hw. > > > > Therefore this patch also changes locking in amdgpu_dm_commit_planes(), > > so that prepare_flip_isr() and dc_commit_updates_for_stream() are done > > under event_lock protection within the same critical
Re: [PATCH] drm/amd/display: Fix pageflip event race condition for DCN. (v2)
Hey guys, This is still an issue for me, and I'm still having to run a patch to revert this as of 5.7-rc4. To avoid breaking a lot of people's Navi setups in 5.7, is there any news on this? Has anyone else at the very least been able to reproduce the problem? It happens for me in every single program that mesa allows to utilize variable refresh rates, and reverting it "fixes" the issue. Cheers, and sorry for the extra email, just making sure this is still on someone's radar, Matt On 4/14/20 5:32 PM, Matt Coffin wrote: > Hey everyone, > > This patch broke variable refresh rate in games (all that I've tried so > far... Project CARS 2, DiRT Rally 2.0, Assetto Corsa Competizione) as > well as a simple freesync tester application. > > FreeSync tester I've been using: https://github.com/Nixola/VRRTest > > I'm not at all familiar with the page flipping code, so it would take me > a long time to find the *right* way to fix it, but does someone else see > why it would do that? > > The symptom is that the refresh rate of the display constantly bounces > between the two ends of the FreeSync range (for me 40 -> 144), and the > game stutters like a madman. > > Any help on where to start, ideas on how to fix it (other than just > revert this commit, which I've done in the interim), or alternative > patches would be appreciated. > > Thanks in advance for the work/help, > Matt > > On 3/13/20 8:42 AM, Michel Dänzer wrote: >> On 2020-03-13 1:35 p.m., Kazlauskas, Nicholas wrote: >>> On 2020-03-12 10:32 a.m., Alex Deucher wrote: On Thu, Mar 5, 2020 at 4:21 PM Mario Kleiner wrote: > > Commit '16f17eda8bad ("drm/amd/display: Send vblank and user > events at vsartup for DCN")' introduces a new way of pageflip > completion handling for DCN, and some trouble. > > The current implementation introduces a race condition, which > can cause pageflip completion events to be sent out one vblank > too early, thereby confusing userspace and causing flicker: > > prepare_flip_isr(): > > 1. Pageflip programming takes the ddev->event_lock. > 2. Sets acrtc->pflip_status == AMDGPU_FLIP_SUBMITTED > 3. Releases ddev->event_lock. > > --> Deadline for surface address regs double-buffering passes on > target pipe. > > 4. dc_commit_updates_for_stream() MMIO programs the new pageflip > into hw, but too late for current vblank. > > => pflip_status == AMDGPU_FLIP_SUBMITTED, but flip won't complete > in current vblank due to missing the double-buffering deadline > by a tiny bit. > > 5. VSTARTUP trigger point in vblank is reached, VSTARTUP irq fires, > dm_dcn_crtc_high_irq() gets called. > > 6. Detects pflip_status == AMDGPU_FLIP_SUBMITTED and assumes the > pageflip has been completed/will complete in this vblank and > sends out pageflip completion event to userspace and resets > pflip_status = AMDGPU_FLIP_NONE. > > => Flip completion event sent out one vblank too early. > > This behaviour has been observed during my testing with measurement > hardware a couple of time. > > The commit message says that the extra flip event code was added to > dm_dcn_crtc_high_irq() to prevent missing to send out pageflip events > in case the pflip irq doesn't fire, because the "DCH HUBP" component > is clock gated and doesn't fire pflip irqs in that state. Also that > this clock gating may happen if no planes are active. According to > Nicholas, the clock gating can also happen if psr is active, and the > gating is controlled independently by the hardware, so difficult to > detect if and when the completion code in above commit is needed. > > This patch tries the following solution: It only executes the extra > pflip > completion code in dm_dcn_crtc_high_irq() iff the hardware reports > that there aren't any surface updated pending in the double-buffered > surface scanout address registers. Otherwise it leaves pflip completion > to the pflip irq handler, for a more race-free experience. > > This would only guard against the order of events mentioned above. > If Step 5 (VSTARTUP trigger) happens before step 4 then this won't help > at all, because 1-3 + 5 might happen even without the hw being > programmed > at all, ie. no surface update pending because none yet programmed > into hw. > > Therefore this patch also changes locking in amdgpu_dm_commit_planes(), > so that prepare_flip_isr() and dc_commit_updates_for_stream() are done > under event_lock protection within the same critical section. > > v2: Take Nicholas comments into account, try a different solution. > > Lightly tested on Polaris (locking) and Raven (the whole DCN stuff). > Seems to work without causing obvious new trouble. Nick, any comments on this? Can we get this committed
Re: [PATCH] drm/amd/display: Fix pageflip event race condition for DCN. (v2)
Hey everyone, This patch broke variable refresh rate in games (all that I've tried so far... Project CARS 2, DiRT Rally 2.0, Assetto Corsa Competizione) as well as a simple freesync tester application. FreeSync tester I've been using: https://github.com/Nixola/VRRTest I'm not at all familiar with the page flipping code, so it would take me a long time to find the *right* way to fix it, but does someone else see why it would do that? The symptom is that the refresh rate of the display constantly bounces between the two ends of the FreeSync range (for me 40 -> 144), and the game stutters like a madman. Any help on where to start, ideas on how to fix it (other than just revert this commit, which I've done in the interim), or alternative patches would be appreciated. Thanks in advance for the work/help, Matt On 3/13/20 8:42 AM, Michel Dänzer wrote: > On 2020-03-13 1:35 p.m., Kazlauskas, Nicholas wrote: >> On 2020-03-12 10:32 a.m., Alex Deucher wrote: >>> On Thu, Mar 5, 2020 at 4:21 PM Mario Kleiner >>> wrote: Commit '16f17eda8bad ("drm/amd/display: Send vblank and user events at vsartup for DCN")' introduces a new way of pageflip completion handling for DCN, and some trouble. The current implementation introduces a race condition, which can cause pageflip completion events to be sent out one vblank too early, thereby confusing userspace and causing flicker: prepare_flip_isr(): 1. Pageflip programming takes the ddev->event_lock. 2. Sets acrtc->pflip_status == AMDGPU_FLIP_SUBMITTED 3. Releases ddev->event_lock. --> Deadline for surface address regs double-buffering passes on target pipe. 4. dc_commit_updates_for_stream() MMIO programs the new pageflip into hw, but too late for current vblank. => pflip_status == AMDGPU_FLIP_SUBMITTED, but flip won't complete in current vblank due to missing the double-buffering deadline by a tiny bit. 5. VSTARTUP trigger point in vblank is reached, VSTARTUP irq fires, dm_dcn_crtc_high_irq() gets called. 6. Detects pflip_status == AMDGPU_FLIP_SUBMITTED and assumes the pageflip has been completed/will complete in this vblank and sends out pageflip completion event to userspace and resets pflip_status = AMDGPU_FLIP_NONE. => Flip completion event sent out one vblank too early. This behaviour has been observed during my testing with measurement hardware a couple of time. The commit message says that the extra flip event code was added to dm_dcn_crtc_high_irq() to prevent missing to send out pageflip events in case the pflip irq doesn't fire, because the "DCH HUBP" component is clock gated and doesn't fire pflip irqs in that state. Also that this clock gating may happen if no planes are active. According to Nicholas, the clock gating can also happen if psr is active, and the gating is controlled independently by the hardware, so difficult to detect if and when the completion code in above commit is needed. This patch tries the following solution: It only executes the extra pflip completion code in dm_dcn_crtc_high_irq() iff the hardware reports that there aren't any surface updated pending in the double-buffered surface scanout address registers. Otherwise it leaves pflip completion to the pflip irq handler, for a more race-free experience. This would only guard against the order of events mentioned above. If Step 5 (VSTARTUP trigger) happens before step 4 then this won't help at all, because 1-3 + 5 might happen even without the hw being programmed at all, ie. no surface update pending because none yet programmed into hw. Therefore this patch also changes locking in amdgpu_dm_commit_planes(), so that prepare_flip_isr() and dc_commit_updates_for_stream() are done under event_lock protection within the same critical section. v2: Take Nicholas comments into account, try a different solution. Lightly tested on Polaris (locking) and Raven (the whole DCN stuff). Seems to work without causing obvious new trouble. >>> >>> Nick, any comments on this? Can we get this committed or do you think >>> it needs additional rework? >>> >>> Thanks, >>> >>> Alex >> >> Hi Alex, Mario, >> >> This might be a little strange, but if we want to get this in as a fix >> for regressions caused by the original vblank and user events at >> vstartup patch then I'm actually going to give my reviewed by on the >> *v1* of this patch (but not this v2): >> >> Reviewed-by: Nicholas Kazlauskas >> >> You can feel free to apply that one. >> >> Reason 1: After having thought about it some more I don't think we >> enable anything today that has hubp powered down at the same time we >> expect to be waiting for a flip - eg. DMCU powering down HUBP
Re: [PATCH] drm/amd/display: Fix pageflip event race condition for DCN. (v2)
On Fri, Mar 13, 2020 at 5:02 PM Michel Dänzer wrote: > On 2020-03-13 1:35 p.m., Kazlauskas, Nicholas wrote: > > On 2020-03-12 10:32 a.m., Alex Deucher wrote: > >> On Thu, Mar 5, 2020 at 4:21 PM Mario Kleiner > >> wrote: > >>> > >>> Commit '16f17eda8bad ("drm/amd/display: Send vblank and user > >>> events at vsartup for DCN")' introduces a new way of pageflip > >>> completion handling for DCN, and some trouble. > >>> > >>> The current implementation introduces a race condition, which > >>> can cause pageflip completion events to be sent out one vblank > >>> too early, thereby confusing userspace and causing flicker: > >>> > >>> prepare_flip_isr(): > >>> > >>> 1. Pageflip programming takes the ddev->event_lock. > >>> 2. Sets acrtc->pflip_status == AMDGPU_FLIP_SUBMITTED > >>> 3. Releases ddev->event_lock. > >>> > >>> --> Deadline for surface address regs double-buffering passes on > >>> target pipe. > >>> > >>> 4. dc_commit_updates_for_stream() MMIO programs the new pageflip > >>> into hw, but too late for current vblank. > >>> > >>> => pflip_status == AMDGPU_FLIP_SUBMITTED, but flip won't complete > >>> in current vblank due to missing the double-buffering deadline > >>> by a tiny bit. > >>> > >>> 5. VSTARTUP trigger point in vblank is reached, VSTARTUP irq fires, > >>> dm_dcn_crtc_high_irq() gets called. > >>> > >>> 6. Detects pflip_status == AMDGPU_FLIP_SUBMITTED and assumes the > >>> pageflip has been completed/will complete in this vblank and > >>> sends out pageflip completion event to userspace and resets > >>> pflip_status = AMDGPU_FLIP_NONE. > >>> > >>> => Flip completion event sent out one vblank too early. > >>> > >>> This behaviour has been observed during my testing with measurement > >>> hardware a couple of time. > >>> > >>> The commit message says that the extra flip event code was added to > >>> dm_dcn_crtc_high_irq() to prevent missing to send out pageflip events > >>> in case the pflip irq doesn't fire, because the "DCH HUBP" component > >>> is clock gated and doesn't fire pflip irqs in that state. Also that > >>> this clock gating may happen if no planes are active. According to > >>> Nicholas, the clock gating can also happen if psr is active, and the > >>> gating is controlled independently by the hardware, so difficult to > >>> detect if and when the completion code in above commit is needed. > >>> > >>> This patch tries the following solution: It only executes the extra > >>> pflip > >>> completion code in dm_dcn_crtc_high_irq() iff the hardware reports > >>> that there aren't any surface updated pending in the double-buffered > >>> surface scanout address registers. Otherwise it leaves pflip completion > >>> to the pflip irq handler, for a more race-free experience. > >>> > >>> This would only guard against the order of events mentioned above. > >>> If Step 5 (VSTARTUP trigger) happens before step 4 then this won't help > >>> at all, because 1-3 + 5 might happen even without the hw being > >>> programmed > >>> at all, ie. no surface update pending because none yet programmed > >>> into hw. > >>> > >>> Therefore this patch also changes locking in amdgpu_dm_commit_planes(), > >>> so that prepare_flip_isr() and dc_commit_updates_for_stream() are done > >>> under event_lock protection within the same critical section. > >>> > >>> v2: Take Nicholas comments into account, try a different solution. > >>> > >>> Lightly tested on Polaris (locking) and Raven (the whole DCN stuff). > >>> Seems to work without causing obvious new trouble. > >> > >> Nick, any comments on this? Can we get this committed or do you think > >> it needs additional rework? > >> > >> Thanks, > >> > >> Alex > > > > Hi Alex, Mario, > > > > This might be a little strange, but if we want to get this in as a fix > > for regressions caused by the original vblank and user events at > > vstartup patch then I'm actually going to give my reviewed by on the > > *v1* of this patch (but not this v2): > > > > Reviewed-by: Nicholas Kazlauskas > > > > You can feel free to apply that one. > > > > Reason 1: After having thought about it some more I don't think we > > enable anything today that has hubp powered down at the same time we > > expect to be waiting for a flip - eg. DMCU powering down HUBP during PSR > > entry. Static screen interrupt should happen after that flip finishes I > > think. > > > > The CRTC can still be powered on with zero planes, and I don't think any > > userspace explicitly asks for vblank events in this case but it doesn't > > hurt to have the check. > > > Ok, so the original commit that causes the races currently solves a non-existing - but potential future - problem? I guess then my v1 patch makes sense for the moment and fixes the immediate problem for Linux 5.6-rc. Maybe there's a way to ask the hw if hubp is clock-gated and so if that code needs to run or not in the future? As mentioned before, it would be helpful at least for me to get a better overview about which
Re: [PATCH] drm/amd/display: Fix pageflip event race condition for DCN. (v2)
On 2020-03-13 1:35 p.m., Kazlauskas, Nicholas wrote: > On 2020-03-12 10:32 a.m., Alex Deucher wrote: >> On Thu, Mar 5, 2020 at 4:21 PM Mario Kleiner >> wrote: >>> >>> Commit '16f17eda8bad ("drm/amd/display: Send vblank and user >>> events at vsartup for DCN")' introduces a new way of pageflip >>> completion handling for DCN, and some trouble. >>> >>> The current implementation introduces a race condition, which >>> can cause pageflip completion events to be sent out one vblank >>> too early, thereby confusing userspace and causing flicker: >>> >>> prepare_flip_isr(): >>> >>> 1. Pageflip programming takes the ddev->event_lock. >>> 2. Sets acrtc->pflip_status == AMDGPU_FLIP_SUBMITTED >>> 3. Releases ddev->event_lock. >>> >>> --> Deadline for surface address regs double-buffering passes on >>> target pipe. >>> >>> 4. dc_commit_updates_for_stream() MMIO programs the new pageflip >>> into hw, but too late for current vblank. >>> >>> => pflip_status == AMDGPU_FLIP_SUBMITTED, but flip won't complete >>> in current vblank due to missing the double-buffering deadline >>> by a tiny bit. >>> >>> 5. VSTARTUP trigger point in vblank is reached, VSTARTUP irq fires, >>> dm_dcn_crtc_high_irq() gets called. >>> >>> 6. Detects pflip_status == AMDGPU_FLIP_SUBMITTED and assumes the >>> pageflip has been completed/will complete in this vblank and >>> sends out pageflip completion event to userspace and resets >>> pflip_status = AMDGPU_FLIP_NONE. >>> >>> => Flip completion event sent out one vblank too early. >>> >>> This behaviour has been observed during my testing with measurement >>> hardware a couple of time. >>> >>> The commit message says that the extra flip event code was added to >>> dm_dcn_crtc_high_irq() to prevent missing to send out pageflip events >>> in case the pflip irq doesn't fire, because the "DCH HUBP" component >>> is clock gated and doesn't fire pflip irqs in that state. Also that >>> this clock gating may happen if no planes are active. According to >>> Nicholas, the clock gating can also happen if psr is active, and the >>> gating is controlled independently by the hardware, so difficult to >>> detect if and when the completion code in above commit is needed. >>> >>> This patch tries the following solution: It only executes the extra >>> pflip >>> completion code in dm_dcn_crtc_high_irq() iff the hardware reports >>> that there aren't any surface updated pending in the double-buffered >>> surface scanout address registers. Otherwise it leaves pflip completion >>> to the pflip irq handler, for a more race-free experience. >>> >>> This would only guard against the order of events mentioned above. >>> If Step 5 (VSTARTUP trigger) happens before step 4 then this won't help >>> at all, because 1-3 + 5 might happen even without the hw being >>> programmed >>> at all, ie. no surface update pending because none yet programmed >>> into hw. >>> >>> Therefore this patch also changes locking in amdgpu_dm_commit_planes(), >>> so that prepare_flip_isr() and dc_commit_updates_for_stream() are done >>> under event_lock protection within the same critical section. >>> >>> v2: Take Nicholas comments into account, try a different solution. >>> >>> Lightly tested on Polaris (locking) and Raven (the whole DCN stuff). >>> Seems to work without causing obvious new trouble. >> >> Nick, any comments on this? Can we get this committed or do you think >> it needs additional rework? >> >> Thanks, >> >> Alex > > Hi Alex, Mario, > > This might be a little strange, but if we want to get this in as a fix > for regressions caused by the original vblank and user events at > vstartup patch then I'm actually going to give my reviewed by on the > *v1* of this patch (but not this v2): > > Reviewed-by: Nicholas Kazlauskas > > You can feel free to apply that one. > > Reason 1: After having thought about it some more I don't think we > enable anything today that has hubp powered down at the same time we > expect to be waiting for a flip - eg. DMCU powering down HUBP during PSR > entry. Static screen interrupt should happen after that flip finishes I > think. > > The CRTC can still be powered on with zero planes, and I don't think any > userspace explicitly asks for vblank events in this case but it doesn't > hurt to have the check. > > Reason 2: This new patch will need much more thorough testing from side > to fully understand the consequences of locking the entire DC commit > sequence. For just a page flip that sounds fine, but for anything more > than (eg. full updates, modesets, etc) I don't think we want to be > disabling interrupts for potentially many milliseconds. Ah! I was wondering where the attached splat comes from, but I think this explains it: With this patch amdgpu_dm_commit_planes keeps the pcrtc->dev->event_lock spinlock locked while calling dc_commit_updates_for_stream, which ends up calling smu_set_display_count, which tries to lock a mutex. -- Earthling Michel Dänzer
Re: [PATCH] drm/amd/display: Fix pageflip event race condition for DCN. (v2)
On 2020-03-12 10:32 a.m., Alex Deucher wrote: On Thu, Mar 5, 2020 at 4:21 PM Mario Kleiner wrote: Commit '16f17eda8bad ("drm/amd/display: Send vblank and user events at vsartup for DCN")' introduces a new way of pageflip completion handling for DCN, and some trouble. The current implementation introduces a race condition, which can cause pageflip completion events to be sent out one vblank too early, thereby confusing userspace and causing flicker: prepare_flip_isr(): 1. Pageflip programming takes the ddev->event_lock. 2. Sets acrtc->pflip_status == AMDGPU_FLIP_SUBMITTED 3. Releases ddev->event_lock. --> Deadline for surface address regs double-buffering passes on target pipe. 4. dc_commit_updates_for_stream() MMIO programs the new pageflip into hw, but too late for current vblank. => pflip_status == AMDGPU_FLIP_SUBMITTED, but flip won't complete in current vblank due to missing the double-buffering deadline by a tiny bit. 5. VSTARTUP trigger point in vblank is reached, VSTARTUP irq fires, dm_dcn_crtc_high_irq() gets called. 6. Detects pflip_status == AMDGPU_FLIP_SUBMITTED and assumes the pageflip has been completed/will complete in this vblank and sends out pageflip completion event to userspace and resets pflip_status = AMDGPU_FLIP_NONE. => Flip completion event sent out one vblank too early. This behaviour has been observed during my testing with measurement hardware a couple of time. The commit message says that the extra flip event code was added to dm_dcn_crtc_high_irq() to prevent missing to send out pageflip events in case the pflip irq doesn't fire, because the "DCH HUBP" component is clock gated and doesn't fire pflip irqs in that state. Also that this clock gating may happen if no planes are active. According to Nicholas, the clock gating can also happen if psr is active, and the gating is controlled independently by the hardware, so difficult to detect if and when the completion code in above commit is needed. This patch tries the following solution: It only executes the extra pflip completion code in dm_dcn_crtc_high_irq() iff the hardware reports that there aren't any surface updated pending in the double-buffered surface scanout address registers. Otherwise it leaves pflip completion to the pflip irq handler, for a more race-free experience. This would only guard against the order of events mentioned above. If Step 5 (VSTARTUP trigger) happens before step 4 then this won't help at all, because 1-3 + 5 might happen even without the hw being programmed at all, ie. no surface update pending because none yet programmed into hw. Therefore this patch also changes locking in amdgpu_dm_commit_planes(), so that prepare_flip_isr() and dc_commit_updates_for_stream() are done under event_lock protection within the same critical section. v2: Take Nicholas comments into account, try a different solution. Lightly tested on Polaris (locking) and Raven (the whole DCN stuff). Seems to work without causing obvious new trouble. Nick, any comments on this? Can we get this committed or do you think it needs additional rework? Thanks, Alex Hi Alex, Mario, This might be a little strange, but if we want to get this in as a fix for regressions caused by the original vblank and user events at vstartup patch then I'm actually going to give my reviewed by on the *v1* of this patch (but not this v2): Reviewed-by: Nicholas Kazlauskas You can feel free to apply that one. Reason 1: After having thought about it some more I don't think we enable anything today that has hubp powered down at the same time we expect to be waiting for a flip - eg. DMCU powering down HUBP during PSR entry. Static screen interrupt should happen after that flip finishes I think. The CRTC can still be powered on with zero planes, and I don't think any userspace explicitly asks for vblank events in this case but it doesn't hurt to have the check. Reason 2: This new patch will need much more thorough testing from side to fully understand the consequences of locking the entire DC commit sequence. For just a page flip that sounds fine, but for anything more than (eg. full updates, modesets, etc) I don't think we want to be disabling interrupts for potentially many milliseconds. Maybe we could just use a lock-free queue here for the events. It's single producer/single consumer per CRTC. Regards, Nicholas Kazlauskas Fixes: 16f17eda8bad ("drm/amd/display: Send vblank and user events at vsartup for DCN") Signed-off-by: Mario Kleiner --- .../gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c | 80 --- 1 file changed, 67 insertions(+), 13 deletions(-) diff --git a/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c b/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c index d7df1a85e72f..aa4e941b276f 100644 --- a/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c +++ b/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c @@ -287,6 +287,28 @@ static inline bool
Re: [PATCH] drm/amd/display: Fix pageflip event race condition for DCN. (v2)
On Thu, Mar 5, 2020 at 4:21 PM Mario Kleiner wrote: > > Commit '16f17eda8bad ("drm/amd/display: Send vblank and user > events at vsartup for DCN")' introduces a new way of pageflip > completion handling for DCN, and some trouble. > > The current implementation introduces a race condition, which > can cause pageflip completion events to be sent out one vblank > too early, thereby confusing userspace and causing flicker: > > prepare_flip_isr(): > > 1. Pageflip programming takes the ddev->event_lock. > 2. Sets acrtc->pflip_status == AMDGPU_FLIP_SUBMITTED > 3. Releases ddev->event_lock. > > --> Deadline for surface address regs double-buffering passes on > target pipe. > > 4. dc_commit_updates_for_stream() MMIO programs the new pageflip >into hw, but too late for current vblank. > > => pflip_status == AMDGPU_FLIP_SUBMITTED, but flip won't complete >in current vblank due to missing the double-buffering deadline >by a tiny bit. > > 5. VSTARTUP trigger point in vblank is reached, VSTARTUP irq fires, >dm_dcn_crtc_high_irq() gets called. > > 6. Detects pflip_status == AMDGPU_FLIP_SUBMITTED and assumes the >pageflip has been completed/will complete in this vblank and >sends out pageflip completion event to userspace and resets >pflip_status = AMDGPU_FLIP_NONE. > > => Flip completion event sent out one vblank too early. > > This behaviour has been observed during my testing with measurement > hardware a couple of time. > > The commit message says that the extra flip event code was added to > dm_dcn_crtc_high_irq() to prevent missing to send out pageflip events > in case the pflip irq doesn't fire, because the "DCH HUBP" component > is clock gated and doesn't fire pflip irqs in that state. Also that > this clock gating may happen if no planes are active. According to > Nicholas, the clock gating can also happen if psr is active, and the > gating is controlled independently by the hardware, so difficult to > detect if and when the completion code in above commit is needed. > > This patch tries the following solution: It only executes the extra pflip > completion code in dm_dcn_crtc_high_irq() iff the hardware reports > that there aren't any surface updated pending in the double-buffered > surface scanout address registers. Otherwise it leaves pflip completion > to the pflip irq handler, for a more race-free experience. > > This would only guard against the order of events mentioned above. > If Step 5 (VSTARTUP trigger) happens before step 4 then this won't help > at all, because 1-3 + 5 might happen even without the hw being programmed > at all, ie. no surface update pending because none yet programmed into hw. > > Therefore this patch also changes locking in amdgpu_dm_commit_planes(), > so that prepare_flip_isr() and dc_commit_updates_for_stream() are done > under event_lock protection within the same critical section. > > v2: Take Nicholas comments into account, try a different solution. > > Lightly tested on Polaris (locking) and Raven (the whole DCN stuff). > Seems to work without causing obvious new trouble. Nick, any comments on this? Can we get this committed or do you think it needs additional rework? Thanks, Alex > > Fixes: 16f17eda8bad ("drm/amd/display: Send vblank and user events at vsartup > for DCN") > Signed-off-by: Mario Kleiner > --- > .../gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c | 80 --- > 1 file changed, 67 insertions(+), 13 deletions(-) > > diff --git a/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c > b/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c > index d7df1a85e72f..aa4e941b276f 100644 > --- a/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c > +++ b/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c > @@ -287,6 +287,28 @@ static inline bool amdgpu_dm_vrr_active(struct > dm_crtc_state *dm_state) >dm_state->freesync_config.state == VRR_STATE_ACTIVE_FIXED; > } > > +/** > + * dm_crtc_is_flip_pending() - Is a pageflip pending on this crtc? > + * > + * Returns true if any plane on the crtc has a flip pending, false otherwise. > + */ > +static bool dm_crtc_is_flip_pending(struct dm_crtc_state *acrtc_state) > +{ > + struct dc_stream_status *status = > dc_stream_get_status(acrtc_state->stream); > + const struct dc_plane_status *plane_status; > + int i; > + bool pending = false; > + > + for (i = 0; i < status->plane_count; i++) { > + plane_status = dc_plane_get_status(status->plane_states[i]); > + pending |= plane_status->is_flip_pending; > + DRM_DEBUG_DRIVER("plane:%d, flip_pending=%d\n", > +i, plane_status->is_flip_pending); > + } > + > + return pending; > +} > + > /** > * dm_pflip_high_irq() - Handle pageflip interrupt > * @interrupt_params: ignored > @@ -435,6 +457,11 @@ static void dm_vupdate_high_irq(void *interrupt_params) > > spin_unlock_irqrestore(&adev->ddev->
Re: [PATCH] drm/amd/display: Fix pageflip event race condition for DCN.
On Mon, Mar 2, 2020 at 2:57 PM Kazlauskas, Nicholas wrote: > > On 2020-03-02 1:17 a.m., Mario Kleiner wrote: > > Commit '16f17eda8bad ("drm/amd/display: Send vblank and user > > events at vsartup for DCN")' introduces a new way of pageflip > > completion handling for DCN, and some trouble. > > > > The current implementation introduces a race condition, which > > can cause pageflip completion events to be sent out one vblank > > too early, thereby confusing userspace and causing flicker: > > > > prepare_flip_isr(): > > > > 1. Pageflip programming takes the ddev->event_lock. > > 2. Sets acrtc->pflip_status == AMDGPU_FLIP_SUBMITTED > > 3. Releases ddev->event_lock. > > > > --> Deadline for surface address regs double-buffering passes on > > target pipe. > > > > 4. dc_commit_updates_for_stream() MMIO programs the new pageflip > > into hw, but too late for current vblank. > > > > => pflip_status == AMDGPU_FLIP_SUBMITTED, but flip won't complete > > in current vblank due to missing the double-buffering deadline > > by a tiny bit. > > > > 5. VSTARTUP trigger point in vblank is reached, VSTARTUP irq fires, > > dm_dcn_crtc_high_irq() gets called. > > > > 6. Detects pflip_status == AMDGPU_FLIP_SUBMITTED and assumes the > > pageflip has been completed/will complete in this vblank and > > sends out pageflip completion event to userspace and resets > > pflip_status = AMDGPU_FLIP_NONE. > > > > => Flip completion event sent out one vblank too early. > > > > This behaviour has been observed during my testing with measurement > > hardware a couple of time. > > > > The commit message says that the extra flip event code was added to > > dm_dcn_crtc_high_irq() to prevent missing to send out pageflip events > > in case the pflip irq doesn't fire, because the "DCH HUBP" component > > is clock gated and doesn't fire pflip irqs in that state. Also that > > this clock gating may happen if no planes are active. This suggests > > that the problem addressed by that commit can't happen if planes > > are active. > > > > The proposed solution is therefore to only execute the extra pflip > > completion code iff the count of active planes is zero and otherwise > > leave pflip completion handling to the pflip irq handler, for a > > more race-free experience. > > > > Note that i don't know if this fixes the problem the original commit > > tried to address, as i don't know what the test scenario was. It > > does fix the observed too early pageflip events though and points > > out the problem introduced. > > This looks like a valid race condition that should be addressed. Indeed! And if possible in any way, before Linux 5.6 is released. For my use cases, neuroscience/medical research, this is a serious problem which would make DCN gpu's basically unusable for most work. Problems affecting flip timing / flip completion / timestamping are always worst case problems for my users. > > Unfortunately this also doesn't fix the problem the original commit was > trying to address. > > HUBP interrupts only trigger when it's not clock gated. But there are > cases (for example, PSR) where the HUBP can be clock gated but the > active plane count is greater than zero. > > The clock gating switch can typically happens outside of x86 control > flow so we're not really going to understand in advance whether or not > we'll be able to receive the pflip IRQ. > Oh dear! So how can that happen? Could explain to me in more detail, how this works? What's the job of HUBP, apart from (not) triggering pflip interrupts reliably? Is the scenario here that the desktop is detected idle for a while (how?) and PSR kicks in and HUBP gets clock gated, but somehow vblank interrupts are still active? I thought panel self refresh only enables on long idle display, so scanout from the gpu can be basically disabled while the panel refreshes itself? Is a programmed pageflip then automatically (no host cpu involvement) putting the panel out of self refresh and turning scanout on and the pageflip completion enables HUBP again, but HUBP doesn't trigger the pflip irq because it somehow missed that due to being clock-gated at time of flip completion? I'd really like to understand in more detail how this stuff works on your recent hw, and also which irqs / events / trigger points are associated with what actions of the hw. I have the feeling there will be more "fun" lingering in the future. I also wanted to experiment more with some VRR ideas, so any details about which hw events happen when and fire which irq's and which double-buffered registers double-buffer when are greatly appreciated. > While we do have plane status/flip pending bits available to check in > the VSTARTUP IRQ handler, this obviously doesn't work for resolving the > core of the issue - that we probably don't know whether or not the HUBP > will be on before sending out the event. > > Maybe we can guess based on what features are enabled.> Ok, that's your domain of knowledge. I jus
[PATCH] drm/amd/display: Fix pageflip event race condition for DCN. (v2)
Commit '16f17eda8bad ("drm/amd/display: Send vblank and user events at vsartup for DCN")' introduces a new way of pageflip completion handling for DCN, and some trouble. The current implementation introduces a race condition, which can cause pageflip completion events to be sent out one vblank too early, thereby confusing userspace and causing flicker: prepare_flip_isr(): 1. Pageflip programming takes the ddev->event_lock. 2. Sets acrtc->pflip_status == AMDGPU_FLIP_SUBMITTED 3. Releases ddev->event_lock. --> Deadline for surface address regs double-buffering passes on target pipe. 4. dc_commit_updates_for_stream() MMIO programs the new pageflip into hw, but too late for current vblank. => pflip_status == AMDGPU_FLIP_SUBMITTED, but flip won't complete in current vblank due to missing the double-buffering deadline by a tiny bit. 5. VSTARTUP trigger point in vblank is reached, VSTARTUP irq fires, dm_dcn_crtc_high_irq() gets called. 6. Detects pflip_status == AMDGPU_FLIP_SUBMITTED and assumes the pageflip has been completed/will complete in this vblank and sends out pageflip completion event to userspace and resets pflip_status = AMDGPU_FLIP_NONE. => Flip completion event sent out one vblank too early. This behaviour has been observed during my testing with measurement hardware a couple of time. The commit message says that the extra flip event code was added to dm_dcn_crtc_high_irq() to prevent missing to send out pageflip events in case the pflip irq doesn't fire, because the "DCH HUBP" component is clock gated and doesn't fire pflip irqs in that state. Also that this clock gating may happen if no planes are active. According to Nicholas, the clock gating can also happen if psr is active, and the gating is controlled independently by the hardware, so difficult to detect if and when the completion code in above commit is needed. This patch tries the following solution: It only executes the extra pflip completion code in dm_dcn_crtc_high_irq() iff the hardware reports that there aren't any surface updated pending in the double-buffered surface scanout address registers. Otherwise it leaves pflip completion to the pflip irq handler, for a more race-free experience. This would only guard against the order of events mentioned above. If Step 5 (VSTARTUP trigger) happens before step 4 then this won't help at all, because 1-3 + 5 might happen even without the hw being programmed at all, ie. no surface update pending because none yet programmed into hw. Therefore this patch also changes locking in amdgpu_dm_commit_planes(), so that prepare_flip_isr() and dc_commit_updates_for_stream() are done under event_lock protection within the same critical section. v2: Take Nicholas comments into account, try a different solution. Lightly tested on Polaris (locking) and Raven (the whole DCN stuff). Seems to work without causing obvious new trouble. Fixes: 16f17eda8bad ("drm/amd/display: Send vblank and user events at vsartup for DCN") Signed-off-by: Mario Kleiner --- .../gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c | 80 --- 1 file changed, 67 insertions(+), 13 deletions(-) diff --git a/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c b/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c index d7df1a85e72f..aa4e941b276f 100644 --- a/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c +++ b/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c @@ -287,6 +287,28 @@ static inline bool amdgpu_dm_vrr_active(struct dm_crtc_state *dm_state) dm_state->freesync_config.state == VRR_STATE_ACTIVE_FIXED; } +/** + * dm_crtc_is_flip_pending() - Is a pageflip pending on this crtc? + * + * Returns true if any plane on the crtc has a flip pending, false otherwise. + */ +static bool dm_crtc_is_flip_pending(struct dm_crtc_state *acrtc_state) +{ + struct dc_stream_status *status = dc_stream_get_status(acrtc_state->stream); + const struct dc_plane_status *plane_status; + int i; + bool pending = false; + + for (i = 0; i < status->plane_count; i++) { + plane_status = dc_plane_get_status(status->plane_states[i]); + pending |= plane_status->is_flip_pending; + DRM_DEBUG_DRIVER("plane:%d, flip_pending=%d\n", +i, plane_status->is_flip_pending); + } + + return pending; +} + /** * dm_pflip_high_irq() - Handle pageflip interrupt * @interrupt_params: ignored @@ -435,6 +457,11 @@ static void dm_vupdate_high_irq(void *interrupt_params) spin_unlock_irqrestore(&adev->ddev->event_lock, flags); } } + + if (acrtc->pflip_status == AMDGPU_FLIP_SUBMITTED) { + DRM_DEBUG_DRIVER("%s:crtc:%d, flip_pending=%d\n", __func__, + acrtc->crtc_id, dm_crtc_is_flip_pending(acrtc_state)); + } } } @@ -489,6 +516,11 @@ st
Re: [PATCH] drm/amd/display: Fix pageflip event race condition for DCN.
On 2020-03-02 1:17 a.m., Mario Kleiner wrote: Commit '16f17eda8bad ("drm/amd/display: Send vblank and user events at vsartup for DCN")' introduces a new way of pageflip completion handling for DCN, and some trouble. The current implementation introduces a race condition, which can cause pageflip completion events to be sent out one vblank too early, thereby confusing userspace and causing flicker: prepare_flip_isr(): 1. Pageflip programming takes the ddev->event_lock. 2. Sets acrtc->pflip_status == AMDGPU_FLIP_SUBMITTED 3. Releases ddev->event_lock. --> Deadline for surface address regs double-buffering passes on target pipe. 4. dc_commit_updates_for_stream() MMIO programs the new pageflip into hw, but too late for current vblank. => pflip_status == AMDGPU_FLIP_SUBMITTED, but flip won't complete in current vblank due to missing the double-buffering deadline by a tiny bit. 5. VSTARTUP trigger point in vblank is reached, VSTARTUP irq fires, dm_dcn_crtc_high_irq() gets called. 6. Detects pflip_status == AMDGPU_FLIP_SUBMITTED and assumes the pageflip has been completed/will complete in this vblank and sends out pageflip completion event to userspace and resets pflip_status = AMDGPU_FLIP_NONE. => Flip completion event sent out one vblank too early. This behaviour has been observed during my testing with measurement hardware a couple of time. The commit message says that the extra flip event code was added to dm_dcn_crtc_high_irq() to prevent missing to send out pageflip events in case the pflip irq doesn't fire, because the "DCH HUBP" component is clock gated and doesn't fire pflip irqs in that state. Also that this clock gating may happen if no planes are active. This suggests that the problem addressed by that commit can't happen if planes are active. The proposed solution is therefore to only execute the extra pflip completion code iff the count of active planes is zero and otherwise leave pflip completion handling to the pflip irq handler, for a more race-free experience. Note that i don't know if this fixes the problem the original commit tried to address, as i don't know what the test scenario was. It does fix the observed too early pageflip events though and points out the problem introduced. This looks like a valid race condition that should be addressed. Unfortunately this also doesn't fix the problem the original commit was trying to address. HUBP interrupts only trigger when it's not clock gated. But there are cases (for example, PSR) where the HUBP can be clock gated but the active plane count is greater than zero. The clock gating switch can typically happens outside of x86 control flow so we're not really going to understand in advance whether or not we'll be able to receive the pflip IRQ. While we do have plane status/flip pending bits available to check in the VSTARTUP IRQ handler, this obviously doesn't work for resolving the core of the issue - that we probably don't know whether or not the HUBP will be on before sending out the event. Maybe we can guess based on what features are enabled. Regards, Nicholas Kazlauskas Fixes: 16f17eda8bad ("drm/amd/display: Send vblank and user events at vsartup for DCN") Signed-off-by: Mario Kleiner --- .../gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c | 18 +++--- 1 file changed, 15 insertions(+), 3 deletions(-) diff --git a/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c b/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c index 63e8a12a74bc..3502d6d52160 100644 --- a/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c +++ b/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c @@ -522,8 +522,9 @@ static void dm_dcn_crtc_high_irq(void *interrupt_params) acrtc_state = to_dm_crtc_state(acrtc->base.state); - DRM_DEBUG_DRIVER("crtc:%d, vupdate-vrr:%d\n", acrtc->crtc_id, - amdgpu_dm_vrr_active(acrtc_state)); + DRM_DEBUG_DRIVER("crtc:%d, vupdate-vrr:%d, planes:%d\n", acrtc->crtc_id, +amdgpu_dm_vrr_active(acrtc_state), +acrtc_state->active_planes); amdgpu_dm_crtc_handle_crc_irq(&acrtc->base); drm_crtc_handle_vblank(&acrtc->base); @@ -543,7 +544,18 @@ static void dm_dcn_crtc_high_irq(void *interrupt_params) &acrtc_state->vrr_params.adjust); } - if (acrtc->pflip_status == AMDGPU_FLIP_SUBMITTED) { + /* +* If there aren't any active_planes then DCH HUBP may be clock-gated. +* In that case, pageflip completion interrupts won't fire and pageflip +* completion events won't get delivered. Prevent this by sending +* pending pageflip events from here if a flip is still pending. +* +* If any planes are enabled, use dm_pflip_high_irq() instead, to +* avoid race conditions between flip programming and completion, +* which could cause too early flip completion events. +*/ +
[PATCH] drm/amd/display: Fix pageflip event race condition for DCN.
Commit '16f17eda8bad ("drm/amd/display: Send vblank and user events at vsartup for DCN")' introduces a new way of pageflip completion handling for DCN, and some trouble. The current implementation introduces a race condition, which can cause pageflip completion events to be sent out one vblank too early, thereby confusing userspace and causing flicker: prepare_flip_isr(): 1. Pageflip programming takes the ddev->event_lock. 2. Sets acrtc->pflip_status == AMDGPU_FLIP_SUBMITTED 3. Releases ddev->event_lock. --> Deadline for surface address regs double-buffering passes on target pipe. 4. dc_commit_updates_for_stream() MMIO programs the new pageflip into hw, but too late for current vblank. => pflip_status == AMDGPU_FLIP_SUBMITTED, but flip won't complete in current vblank due to missing the double-buffering deadline by a tiny bit. 5. VSTARTUP trigger point in vblank is reached, VSTARTUP irq fires, dm_dcn_crtc_high_irq() gets called. 6. Detects pflip_status == AMDGPU_FLIP_SUBMITTED and assumes the pageflip has been completed/will complete in this vblank and sends out pageflip completion event to userspace and resets pflip_status = AMDGPU_FLIP_NONE. => Flip completion event sent out one vblank too early. This behaviour has been observed during my testing with measurement hardware a couple of time. The commit message says that the extra flip event code was added to dm_dcn_crtc_high_irq() to prevent missing to send out pageflip events in case the pflip irq doesn't fire, because the "DCH HUBP" component is clock gated and doesn't fire pflip irqs in that state. Also that this clock gating may happen if no planes are active. This suggests that the problem addressed by that commit can't happen if planes are active. The proposed solution is therefore to only execute the extra pflip completion code iff the count of active planes is zero and otherwise leave pflip completion handling to the pflip irq handler, for a more race-free experience. Note that i don't know if this fixes the problem the original commit tried to address, as i don't know what the test scenario was. It does fix the observed too early pageflip events though and points out the problem introduced. Fixes: 16f17eda8bad ("drm/amd/display: Send vblank and user events at vsartup for DCN") Signed-off-by: Mario Kleiner --- .../gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c | 18 +++--- 1 file changed, 15 insertions(+), 3 deletions(-) diff --git a/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c b/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c index 63e8a12a74bc..3502d6d52160 100644 --- a/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c +++ b/drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c @@ -522,8 +522,9 @@ static void dm_dcn_crtc_high_irq(void *interrupt_params) acrtc_state = to_dm_crtc_state(acrtc->base.state); - DRM_DEBUG_DRIVER("crtc:%d, vupdate-vrr:%d\n", acrtc->crtc_id, - amdgpu_dm_vrr_active(acrtc_state)); + DRM_DEBUG_DRIVER("crtc:%d, vupdate-vrr:%d, planes:%d\n", acrtc->crtc_id, +amdgpu_dm_vrr_active(acrtc_state), +acrtc_state->active_planes); amdgpu_dm_crtc_handle_crc_irq(&acrtc->base); drm_crtc_handle_vblank(&acrtc->base); @@ -543,7 +544,18 @@ static void dm_dcn_crtc_high_irq(void *interrupt_params) &acrtc_state->vrr_params.adjust); } - if (acrtc->pflip_status == AMDGPU_FLIP_SUBMITTED) { + /* +* If there aren't any active_planes then DCH HUBP may be clock-gated. +* In that case, pageflip completion interrupts won't fire and pageflip +* completion events won't get delivered. Prevent this by sending +* pending pageflip events from here if a flip is still pending. +* +* If any planes are enabled, use dm_pflip_high_irq() instead, to +* avoid race conditions between flip programming and completion, +* which could cause too early flip completion events. +*/ + if (acrtc->pflip_status == AMDGPU_FLIP_SUBMITTED && + acrtc_state->active_planes == 0) { if (acrtc->event) { drm_crtc_send_vblank_event(&acrtc->base, acrtc->event); acrtc->event = NULL; -- 2.20.1 ___ amd-gfx mailing list amd-gfx@lists.freedesktop.org https://lists.freedesktop.org/mailman/listinfo/amd-gfx