Re: [Intel-gfx] [Nouveau] [PATCH 0/5] Fix deadlock on runtime suspend in DRM drivers
On Mon, Feb 19, 2018 at 9:54 AM, Daniel Vetter wrote:
> On Mon, Feb 19, 2018 at 03:47:42PM +0100, Lukas Wunner wrote:
>> On Mon, Feb 19, 2018 at 03:05:53PM +0100, Daniel Vetter wrote:
>> > On Mon, Feb 19, 2018 at 12:58:17PM +0100, Lukas Wunner wrote:
>> > > On Mon, Feb 19, 2018 at 12:34:43PM +0100, Daniel Vetter wrote:
>> > > > Well, userspace expects hotplug events, even when we runtime suspend
>> > > > stuff. Hence waking shit up with polling. Imo ignoring hotplug events is a
>> > > > pretty serious policy decision which is ok in the context of
>> > > > vga_switcheroo, but not really as an automatic thing. E.g. usb also wakes
>> > > > up if you plug something in, even with all the runtime pm stuff enabled.
>> > > > Same for sata and everything else.
>> > >
>> > > On the MacBook Pro, the HPD pin of external DP and HDMI ports goes into
>> > > the gmux controller, which sends an interrupt on hotplug even if the GPU
>> > > is powered down.
>> > >
>> > > Apparently Optimus has a similar functionality, cf. 3a6536c51d5d.
>> >
>> > Yeah, for vga_switcheroo and gmux setups shutting down polling explicitly
>> > makes sense. I think ideally we'd stop polling in the gmux handler somehow
>> > (maybe by dropping the relevant DRM_CONNECTOR_POLL flags, or outright
>> > stopping it all). But not when runtime suspending the entire gpu (e.g.
>> > idle system that shuts down the screen and everything, before it decides
>> > a few minutes later to do a full system suspend).
>>
>> nouveau, radeon and amdgpu currently use runtime PM *only* on hybrid
>> graphics laptops.
>>
>> Should the drivers later be extended to also use runtime PM in other
>> scenarios (desktop machines, eGPUs), they can easily detect whether
>> to disable polling on runtime suspend by calling apple_gmux_present()
>> on Macs or the equivalent for Optimus/ATPX.
>
> Ah, then I think the current solution is ok (if not entirely clean imo,
> but that can be fixed up whenever it hurts).
> Implementing runtime pm for other cases is up to the driver authors
> really (probably more pressing when the gpu is on the same SoC).

On our APUs, we support fairly fine-grained powergating, so this mostly
happens auto-magically in hw; no need for runtime PM. We haven't
supported native analog encoders in the last 3 or 4 generations of
display hw, so polling is not much of an issue going forward. On most
integrated platforms (e.g., laptops and all-in-ones), digital hotplug is
handled by the platform (we get an ACPI ATIF notification), so we can
wake the dGPU.

Alex

___
dri-devel mailing list
dri-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/dri-devel
Re: [Intel-gfx] [Nouveau] [PATCH 0/5] Fix deadlock on runtime suspend in DRM drivers
On Mon, Feb 19, 2018 at 03:47:42PM +0100, Lukas Wunner wrote: > On Mon, Feb 19, 2018 at 03:05:53PM +0100, Daniel Vetter wrote: > > On Mon, Feb 19, 2018 at 12:58:17PM +0100, Lukas Wunner wrote: > > > On Mon, Feb 19, 2018 at 12:34:43PM +0100, Daniel Vetter wrote: > > > > Well, userspace expects hotplug events, even when we runtime suspend > > > > stuff. Hence waking shit up with polling. Imo ignoring hotplug events > > > > is a > > > > pretty serious policy decision which is ok in the context of > > > > vga_switcheroo, but not really as an automatic thing. E.g. usb also > > > > wakes > > > > up if you plug something in, even with all the runtime pm stuff enabled. > > > > Same for sata and everything else. > > > > > > On the MacBook Pro, the HPD pin of external DP and HDMI ports goes into > > > the gmux controller, which sends an interrupt on hotplug even if the GPU > > > is powered down. > > > > > > Apparently Optimus has a similar functionality, cf. 3a6536c51d5d. > > > > Yeah, for vga_switcheroo and gmux setups shutting down polling explicitly > > makes sense. I think ideally we'd stop polling in the gmux handler somehow > > (maybe by dropping the relevant DRM_CONNECTOR_POLL flags, or outright > > stopping it all). But not when runtime suspending the entire gpu (e.g. > > idle system that shuts down the screen and everything, before it decides > > a few minutes later to do a full system suspend). > > nouveau, radeon and amdgpu currently use runtime PM *only* on hybrid > graphics laptops. > > Should the drivers later be extended to also use runtime PM in other > scenarios (desktop machines, eGPUs), they can easily detect whether > to disable polling on runtime suspend by calling apple_gmux_present() > on Macs or the equivalent for Optimus/ATPX. Ah, then I think the current solution is ok (if not entirely clean imo, but that can be fixed up whenever it hurts). 
Implementing runtime pm for other cases is up to the driver authors
really (probably more pressing when the gpu is on the same SoC).
-Daniel

> Thanks,
>
> Lukas

--
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch
Re: [Intel-gfx] [Nouveau] [PATCH 0/5] Fix deadlock on runtime suspend in DRM drivers
On Mon, Feb 19, 2018 at 03:05:53PM +0100, Daniel Vetter wrote: > On Mon, Feb 19, 2018 at 12:58:17PM +0100, Lukas Wunner wrote: > > On Mon, Feb 19, 2018 at 12:34:43PM +0100, Daniel Vetter wrote: > > > Well, userspace expects hotplug events, even when we runtime suspend > > > stuff. Hence waking shit up with polling. Imo ignoring hotplug events is a > > > pretty serious policy decision which is ok in the context of > > > vga_switcheroo, but not really as an automatic thing. E.g. usb also wakes > > > up if you plug something in, even with all the runtime pm stuff enabled. > > > Same for sata and everything else. > > > > On the MacBook Pro, the HPD pin of external DP and HDMI ports goes into > > the gmux controller, which sends an interrupt on hotplug even if the GPU > > is powered down. > > > > Apparently Optimus has a similar functionality, cf. 3a6536c51d5d. > > Yeah, for vga_switcheroo and gmux setups shutting down polling explicitly > makes sense. I think ideally we'd stop polling in the gmux handler somehow > (maybe by dropping the relevant DRM_CONNECTOR_POLL flags, or outright > stopping it all). But not when runtime suspending the entire gpu (e.g. > idle system that shuts down the screen and everything, before it decides > a few minutes later to do a full system suspend). nouveau, radeon and amdgpu currently use runtime PM *only* on hybrid graphics laptops. Should the drivers later be extended to also use runtime PM in other scenarios (desktop machines, eGPUs), they can easily detect whether to disable polling on runtime suspend by calling apple_gmux_present() on Macs or the equivalent for Optimus/ATPX. Thanks, Lukas ___ dri-devel mailing list dri-devel@lists.freedesktop.org https://lists.freedesktop.org/mailman/listinfo/dri-devel
Re: [Intel-gfx] [Nouveau] [PATCH 0/5] Fix deadlock on runtime suspend in DRM drivers
On Mon, Feb 19, 2018 at 12:58:17PM +0100, Lukas Wunner wrote: > On Mon, Feb 19, 2018 at 12:34:43PM +0100, Daniel Vetter wrote: > > On Sun, Feb 11, 2018 at 10:38:28AM +0100, Lukas Wunner wrote: > > > Fix a deadlock on hybrid graphics laptops that's been present since 2013: > > > > > > DRM drivers poll connectors in 10 sec intervals. The poll worker is > > > stopped on ->runtime_suspend with cancel_delayed_work_sync(). However > > > the poll worker invokes the DRM drivers' ->detect callbacks, which call > > > pm_runtime_get_sync(). If the poll worker starts after runtime suspend > > > has begun, pm_runtime_get_sync() will wait for runtime suspend to finish > > > with the intention of runtime resuming the device afterwards. The result > > > is a circular wait between poll worker and autosuspend worker. > > > > Don't shut down the poll worker when runtime suspending, that' doesn't > > work. If you need the poll work, then that means waking up the gpu every > > few seconds. If you don't need it, then make sure the DRM_CONNECTOR_POLL > > flags are set correctly (you can update them at runtime, the poll worker > > will pick that up). > > > > That should fix the deadlock, and it's how we do it in i915 (where igt in > > CI totally hammers the runtime pm support, and it seems to hold up). > > It would fix the deadlock but it's not an option on dual GPU laptops. > Power consumption of the discrete GPU is massive (9 W on my machine). > > > > > i915, malidp and msm "solved" this issue by not stopping the poll worker > > > on runtime suspend. But this results in the GPU bouncing back and forth > > > between D0 and D3 continuously. That's a total no-go for GPUs which > > > runtime suspend to D3cold since every suspend/resume cycle costs a > > > significant amount of time and energy. (i915 and malidp do not seem > > > to acquire a runtime PM ref in the ->detect callbacks, which seems > > > questionable. 
msm however does and would also deadlock if it disabled > > > the poll worker on runtime suspend. cc += Archit, Liviu, intel-gfx) > > > > Well, userspace expects hotplug events, even when we runtime suspend > > stuff. Hence waking shit up with polling. Imo ignoring hotplug events is a > > pretty serious policy decision which is ok in the context of > > vga_switcheroo, but not really as an automatic thing. E.g. usb also wakes > > up if you plug something in, even with all the runtime pm stuff enabled. > > Same for sata and everything else. > > On the MacBook Pro, the HPD pin of external DP and HDMI ports goes into > the gmux controller, which sends an interrupt on hotplug even if the GPU > is powered down. > > Apparently Optimus has a similar functionality, cf. 3a6536c51d5d. Yeah, for vga_switcheroo and gmux setups shutting down polling explicitly makes sense. I think ideally we'd stop polling in the gmux handler somehow (maybe by dropping the relevant DRM_CONNECTOR_POLL flags, or outright stopping it all). But not when runtime suspending the entire gpu (e.g. idle system that shuts down the screen and everything, before it decides a few minutes later to do a full system suspend). I also think that this approach would lead to cleaner code, having explicit checks for the (locking) execution context all over the place tends to result in regrets eventually ime. > For the rare cases where an external VGA or DVI-A port is present, I guess > it's reasonable to have the user wake up the machine manually. > > I'm not sure why nouveau polls ports on my laptop, the GK107 only has an > LVDS and three DP ports, need to investigate. Yeah, that'd be good to figure out. The probe helpers should shut down the worker if there's no connector that needs probing. We use that to enable/disable the poll worker when there's a hotplug storm on the irq line. Once that's fixed we can perhaps also untangle the poll-vs-gmux story. 
-Daniel
--
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch
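Daniel's suggestion above of dropping the relevant DRM_CONNECTOR_POLL
flags instead of cancelling the worker could look roughly like this
(a hedged sketch against the drm_probe_helper API of that era, not a
tested patch; the foo_ name is hypothetical):

```c
/* Sketch: instead of cancel_delayed_work_sync() in ->runtime_suspend
 * (which is what deadlocks), clear the poll flags.  The output poll
 * worker re-checks the flags on each run and stops rescheduling itself
 * once no connector needs polling. */
static void foo_stop_polling(struct drm_device *dev)
{
	struct drm_connector_list_iter iter;
	struct drm_connector *connector;

	drm_connector_list_iter_begin(dev, &iter);
	drm_for_each_connector_iter(connector, &iter)
		connector->polled &= ~(DRM_CONNECTOR_POLL_CONNECT |
				       DRM_CONNECTOR_POLL_DISCONNECT);
	drm_connector_list_iter_end(&iter);
	/* No cancel_delayed_work_sync() here: the next poll worker run
	 * sees the cleared flags and simply does not reschedule. */
}
```

The flags can be restored in ->runtime_resume the same way, followed by
drm_kms_helper_poll_enable() to kick the worker again.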
Re: [Intel-gfx] [PATCH 0/5] Fix deadlock on runtime suspend in DRM drivers
On Mon, Feb 19, 2018 at 12:48:04PM +0100, Daniel Vetter wrote:
> On Thu, Feb 15, 2018 at 06:38:44AM +0100, Lukas Wunner wrote:
> > On Wed, Feb 14, 2018 at 09:58:43AM -0500, Sean Paul wrote:
> > > On Wed, Feb 14, 2018 at 03:43:56PM +0100, Michel Dänzer wrote:
> > > > On 2018-02-14 03:08 PM, Sean Paul wrote:
> > > > > On Wed, Feb 14, 2018 at 10:26:35AM +0100, Maarten Lankhorst wrote:
> > > > >> Op 14-02-18 om 09:46 schreef Lukas Wunner:
> > > > >>> On Sun, Feb 11, 2018 at 10:38:28AM +0100, Lukas Wunner wrote:
> > > > >>>> Fix a deadlock on hybrid graphics laptops that's been present
> > > > >>>> since 2013:
> > > > >>> This series has been reviewed, consent has been expressed by the most
> > > > >>> interested parties, patch [1/5] which touches files outside drivers/gpu
> > > > >>> has been acked and I've just sent out a v2 addressing the only objection
> > > > >>> raised. My plan is thus to wait another two days for comments and,
> > > > >>> barring further objections, push to drm-misc this weekend.
> > > > >>>
> > > > >>> However I'm struggling with the decision whether to push to next or
> > > > >>> fixes. The series is marked for stable, however the number of
> > > > >>> affected machines is limited and for an issue that's been present
> > > > >>> for 5 years it probably doesn't matter if it soaks another two months
> > > > >>> in linux-next before it gets backported. Hence I tend to err on the
> > > > >>> side of caution and push to next, however a case could be made that
> > > > >>> fixes is more appropriate.
> > > > >>>
> > > > >>> I'm lacking experience making such decisions and would be interested
> > > > >>> to learn how you'd handle this.
> > > > >>
> > > > >> I would say fixes, it doesn't look particularly scary. :)
> > > > >
> > > > > Agreed. If it's good enough for stable, it's good enough for -fixes!
> > > >
> > > > It's not that simple, is it?
Fast-tracking patches (some of which appear > > > > to be untested) to stable without an immediate cause for urgency seems > > > > risky to me. > > > > > > /me should be more careful what he says > > > > > > Given where we are in the release cycle, it's barely a fast track. > > > If these go in -fixes, they'll get in -rc2 and will have plenty of > > > time to bake. If we were at rc5, it might be a different story. > > > > The patches are marked for stable though, so if they go in through > > drm-misc-fixes, they may appear in stable kernels before 4.16-final > > is out. Greg picks up patches once they're in Linus' tree, though > > often with a delay of a few days or weeks. If they go in through > > drm-misc-next, they're guaranteed not to appear in *any* release > > before 4.16-final is out. > > > > This allows for differentiation between no-brainer stable fixes that > > can be sent immediately and scarier, but similarly important stable > > fixes that should soak for a while. I'm not sure which category > > this series belongs to, though it's true what Maarten says, it's > > not *that* grave a change. > > If you're this concerned about them, then pls do _not_ put cc: stable on > the patches. Instead get them merged through -fixes (or maybe even -next), > and once they're sufficiently tested, send a mail to stable@ asking for > ane explicit backport. I'm not concerned about them, but would have erred on the side of caution. However consensus seems to have been that they're sufficiently unscary to push to -fixes. Do you disagree with that decision, if so, why? Can we amend the dim docs to codify guidelines whether to push to -fixes or -next? I allowed 1 week for comments, now you're returning from vacation and seem to be unhappy, was 1 week too short? Thanks, Lukas ___ dri-devel mailing list dri-devel@lists.freedesktop.org https://lists.freedesktop.org/mailman/listinfo/dri-devel
Re: [Nouveau] [PATCH 0/5] Fix deadlock on runtime suspend in DRM drivers
On Mon, Feb 19, 2018 at 12:34:43PM +0100, Daniel Vetter wrote: > On Sun, Feb 11, 2018 at 10:38:28AM +0100, Lukas Wunner wrote: > > Fix a deadlock on hybrid graphics laptops that's been present since 2013: > > > > DRM drivers poll connectors in 10 sec intervals. The poll worker is > > stopped on ->runtime_suspend with cancel_delayed_work_sync(). However > > the poll worker invokes the DRM drivers' ->detect callbacks, which call > > pm_runtime_get_sync(). If the poll worker starts after runtime suspend > > has begun, pm_runtime_get_sync() will wait for runtime suspend to finish > > with the intention of runtime resuming the device afterwards. The result > > is a circular wait between poll worker and autosuspend worker. > > Don't shut down the poll worker when runtime suspending, that' doesn't > work. If you need the poll work, then that means waking up the gpu every > few seconds. If you don't need it, then make sure the DRM_CONNECTOR_POLL > flags are set correctly (you can update them at runtime, the poll worker > will pick that up). > > That should fix the deadlock, and it's how we do it in i915 (where igt in > CI totally hammers the runtime pm support, and it seems to hold up). It would fix the deadlock but it's not an option on dual GPU laptops. Power consumption of the discrete GPU is massive (9 W on my machine). > > i915, malidp and msm "solved" this issue by not stopping the poll worker > > on runtime suspend. But this results in the GPU bouncing back and forth > > between D0 and D3 continuously. That's a total no-go for GPUs which > > runtime suspend to D3cold since every suspend/resume cycle costs a > > significant amount of time and energy. (i915 and malidp do not seem > > to acquire a runtime PM ref in the ->detect callbacks, which seems > > questionable. msm however does and would also deadlock if it disabled > > the poll worker on runtime suspend. 
cc += Archit, Liviu, intel-gfx) > > Well, userspace expects hotplug events, even when we runtime suspend > stuff. Hence waking shit up with polling. Imo ignoring hotplug events is a > pretty serious policy decision which is ok in the context of > vga_switcheroo, but not really as an automatic thing. E.g. usb also wakes > up if you plug something in, even with all the runtime pm stuff enabled. > Same for sata and everything else. On the MacBook Pro, the HPD pin of external DP and HDMI ports goes into the gmux controller, which sends an interrupt on hotplug even if the GPU is powered down. Apparently Optimus has a similar functionality, cf. 3a6536c51d5d. For the rare cases where an external VGA or DVI-A port is present, I guess it's reasonable to have the user wake up the machine manually. I'm not sure why nouveau polls ports on my laptop, the GK107 only has an LVDS and three DP ports, need to investigate. Thanks, Lukas ___ dri-devel mailing list dri-devel@lists.freedesktop.org https://lists.freedesktop.org/mailman/listinfo/dri-devel
Re: [Intel-gfx] [PATCH 0/5] Fix deadlock on runtime suspend in DRM drivers
On Thu, Feb 15, 2018 at 06:38:44AM +0100, Lukas Wunner wrote: > On Wed, Feb 14, 2018 at 09:58:43AM -0500, Sean Paul wrote: > > On Wed, Feb 14, 2018 at 03:43:56PM +0100, Michel Dänzer wrote: > > > On 2018-02-14 03:08 PM, Sean Paul wrote: > > > > On Wed, Feb 14, 2018 at 10:26:35AM +0100, Maarten Lankhorst wrote: > > > >> Op 14-02-18 om 09:46 schreef Lukas Wunner: > > > >>> On Sun, Feb 11, 2018 at 10:38:28AM +0100, Lukas Wunner wrote: > > > Fix a deadlock on hybrid graphics laptops that's been present since > > > 2013: > > > >>> This series has been reviewed, consent has been expressed by the most > > > >>> interested parties, patch [1/5] which touches files outside > > > >>> drivers/gpu > > > >>> has been acked and I've just out a v2 addressing the only objection > > > >>> raised. My plan is thus to wait another two days for comments and, > > > >>> barring further objections, push to drm-misc this weekend. > > > >>> > > > >>> However I'm struggling with the decision whether to push to next or > > > >>> fixes. The series is marked for stable, however the number of > > > >>> affected machines is limited and for an issue that's been present > > > >>> for 5 years it probably doesn't matter if it soaks another two months > > > >>> in linux-next befor it gets backported. Hence I tend to err on the > > > >>> side of caution and push to next, however a case could be made that > > > >>> fixes is more appropriate. > > > >>> > > > >>> I'm lacking experience making such decisions and would be interested > > > >>> to learn how you'd handle this. > > > >> > > > >> I would say fixes, it doesn't look particularly scary. :) > > > > > > > > Agreed. If it's good enough for stable, it's good enough for -fixes! > > > > > > It's not that simple, is it? Fast-tracking patches (some of which appear > > > to be untested) to stable without an immediate cause for urgency seems > > > risky to me. 
> >
> > /me should be more careful what he says
> >
> > Given where we are in the release cycle, it's barely a fast track.
> > If these go in -fixes, they'll get in -rc2 and will have plenty of
> > time to bake. If we were at rc5, it might be a different story.
>
> The patches are marked for stable though, so if they go in through
> drm-misc-fixes, they may appear in stable kernels before 4.16-final
> is out. Greg picks up patches once they're in Linus' tree, though
> often with a delay of a few days or weeks. If they go in through
> drm-misc-next, they're guaranteed not to appear in *any* release
> before 4.16-final is out.
>
> This allows for differentiation between no-brainer stable fixes that
> can be sent immediately and scarier, but similarly important stable
> fixes that should soak for a while. I'm not sure which category
> this series belongs to, though it's true what Maarten says, it's
> not *that* grave a change.

If you're this concerned about them, then pls do _not_ put cc: stable on
the patches. Instead get them merged through -fixes (or maybe even -next),
and once they're sufficiently tested, send a mail to stable@ asking for
an explicit backport. Stuff that's marked for stable must be obvious and
tested enough for backporting right away (which doesn't seem to be the
case here).
-Daniel
--
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch
Re: [Nouveau] [PATCH 0/5] Fix deadlock on runtime suspend in DRM drivers
On Sun, Feb 11, 2018 at 10:38:28AM +0100, Lukas Wunner wrote:
> Fix a deadlock on hybrid graphics laptops that's been present since 2013:
>
> DRM drivers poll connectors in 10 sec intervals. The poll worker is
> stopped on ->runtime_suspend with cancel_delayed_work_sync(). However
> the poll worker invokes the DRM drivers' ->detect callbacks, which call
> pm_runtime_get_sync(). If the poll worker starts after runtime suspend
> has begun, pm_runtime_get_sync() will wait for runtime suspend to finish
> with the intention of runtime resuming the device afterwards. The result
> is a circular wait between poll worker and autosuspend worker.

Don't shut down the poll worker when runtime suspending, that doesn't
work. If you need the poll work, then that means waking up the gpu every
few seconds. If you don't need it, then make sure the DRM_CONNECTOR_POLL
flags are set correctly (you can update them at runtime, the poll worker
will pick that up).

That should fix the deadlock, and it's how we do it in i915 (where igt in
CI totally hammers the runtime pm support, and it seems to hold up).

And I guess we need to improve the poll worker docs about this.

> I'm seeing this deadlock so often it's not funny anymore. I've also found
> 3 nouveau bugzillas about it and 1 radeon bugzilla. (See patch [3/5] and
> [4/5].)
>
> One theoretical solution would be to add a flag to the ->detect callback
> to tell it that it's running in the poll worker's context. In that case
> it doesn't need to call pm_runtime_get_sync() because the poll worker is
> only enabled while runtime active and we know that ->runtime_suspend
> waits for it to finish. But this approach would require touching every
> single DRM driver's ->detect hook. Moreover the ->detect hook is called
> from numerous other places, both in the DRM library as well as in each
> driver, making it necessary to trace every possible call chain and check
> if it's coming from the poll worker or not.
> I've tried to do that for nouveau (see the call sites listed in the
> commit message of patch [3/5]) and concluded that this approach is a
> nightmare to implement.
>
> Another reason for the infeasibility of this approach is that ->detect
> is called from within the DRM library without driver involvement, e.g.
> via DRM's sysfs interface. In those cases, pm_runtime_get_sync() would
> have to be called by the DRM library, but the DRM library is supposed to
> stay generic, whereas runtime PM is driver-specific.
>
> So instead, I've come up with this much simpler solution which gleans
> from the current task's flags if it's a workqueue worker, retrieves the
> work_struct and checks if it's the poll worker. All that's needed is
> one small helper in the workqueue code and another small helper in the
> DRM library. This solution lends itself nicely to stable-backporting.
>
> The patches for radeon and amdgpu are compile-tested only, I only have a
> MacBook Pro with an Nvidia GK107 to test. To test the patches, add an
> "msleep(12*1000);" at the top of the driver's ->runtime_suspend hook.
> This ensures that the poll worker runs after ->runtime_suspend has begun.
> Wait 12 sec after the GPU has begun runtime suspend, then check
> /sys/bus/pci/devices/:01:00.0/power/runtime_status. Without this
> series, the status will be stuck at "suspending" and you'll get hung task
> errors in dmesg after a few minutes.
>
> i915, malidp and msm "solved" this issue by not stopping the poll worker
> on runtime suspend. But this results in the GPU bouncing back and forth
> between D0 and D3 continuously. That's a total no-go for GPUs which
> runtime suspend to D3cold since every suspend/resume cycle costs a
> significant amount of time and energy. (i915 and malidp do not seem
> to acquire a runtime PM ref in the ->detect callbacks, which seems
> questionable. msm however does and would also deadlock if it disabled
> the poll worker on runtime suspend.
cc += Archit, Liviu, intel-gfx) Well, userspace expects hotplug events, even when we runtime suspend stuff. Hence waking shit up with polling. Imo ignoring hotplug events is a pretty serious policy decision which is ok in the context of vga_switcheroo, but not really as an automatic thing. E.g. usb also wakes up if you plug something in, even with all the runtime pm stuff enabled. Same for sata and everything else. -Daniel > Please review. Thanks, > > Lukas > > Lukas Wunner (5): > workqueue: Allow retrieval of current task's work struct > drm: Allow determining if current task is output poll worker > drm/nouveau: Fix deadlock on runtime suspend > drm/radeon: Fix deadlock on runtime suspend > drm/amdgpu: Fix deadlock on runtime suspend > > drivers/gpu/drm/amd/amdgpu/amdgpu_connectors.c | 58 +--- > drivers/gpu/drm/drm_probe_helper.c | 14 + > drivers/gpu/drm/nouveau/nouveau_connector.c| 18 +-- > drivers/gpu/drm/radeon/radeon_connectors.c | 74 >
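The "much simpler solution" the cover letter describes — one small
helper in the workqueue code plus one in the DRM library — ends up
looking roughly like this in a driver's ->detect hook (a hedged sketch
based on the description above; the helper name matches patch [2/5],
the foo_ names are hypothetical and not any driver's actual code):

```c
/* Sketch: skip the runtime PM ref when invoked from the output poll
 * worker.  The poll worker only runs while the device is runtime
 * active, and ->runtime_suspend waits for it to finish, so taking a
 * ref there is unnecessary — and taking one is what deadlocks. */
static enum drm_connector_status
foo_connector_detect(struct drm_connector *connector, bool force)
{
	struct device *dev = connector->dev->dev;
	enum drm_connector_status status;
	int ret;

	if (!drm_kms_helper_is_poll_worker()) {
		ret = pm_runtime_get_sync(dev);
		if (ret < 0)
			return connector_status_disconnected;
	}

	status = foo_probe_output(connector, force);	/* hw access */

	if (!drm_kms_helper_is_poll_worker()) {
		pm_runtime_mark_last_busy(dev);
		pm_runtime_put_autosuspend(dev);
	}

	return status;
}
```

Every other call path into ->detect still takes the ref as before, so
hotplug from sysfs, ioctls etc. keeps runtime-resuming the GPU.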
Re: [PATCH 0/5] Fix deadlock on runtime suspend in DRM drivers
On Sun, Feb 11, 2018 at 10:38:28AM +0100, Lukas Wunner wrote:
> workqueue: Allow retrieval of current task's work struct
> drm: Allow determining if current task is output poll worker
> drm/nouveau: Fix deadlock on runtime suspend
> drm/radeon: Fix deadlock on runtime suspend
> drm/amdgpu: Fix deadlock on runtime suspend

Pushed to drm-misc-fixes, thanks a lot everyone for the acks, reviews,
testing and comments.

drm-misc maintainers, heads-up: drm-misc-fixes is still based on
4.15-rc8. The present series applies cleanly to both 4.15 and 4.16, so
I had no need to have 4.16-rc1 backmerged, but that may be necessary
sooner or later. I did a local test pull into drm-fixes, the shortlog
looked sane and it merged and compiled cleanly.

Please note two other commits are still pending in drm-misc-fixes:

commit 745fd50f3b044db6a3922e1718306555613164b0
Author: Daniel Vetter
Date:   Wed Jan 31 12:04:50 2018 +0100

    drm/cirrus: Load lut in crtc_commit

Gustavo sent a pull request for this one on Jan 31 but erroneously
based it on the wrong commit and it ended up not being pulled by Dave.

commit 54f809cfbd6b4a43959039f5d33596ed3297ce16
Author: Leo (Sunpeng) Li
Date:   Wed Jan 17 12:51:08 2018 +0100

    drm/atomic: Fix memleak on ERESTARTSYS during non-blocking commits

This one has already been pulled by Dave into drm-next for 4.17 as
commit 1c6c6ebb but Maarten subsequently cherry-picked it onto
drm-misc-fixes.

Let me know if I've made any mistakes.

Thanks,

Lukas
Re: i915 runtime PM (was: Re: [PATCH 0/5] Fix deadlock on runtime suspend in DRM drivers)
On Fri, Feb 16, 2018 at 09:49:28AM +0100, Lukas Wunner wrote: > [trimming cc: a little to avoid spamming folks not directly involved with > i915] > > On Mon, Feb 12, 2018 at 03:04:06PM +0200, Imre Deak wrote: > > On Sun, Feb 11, 2018 at 10:38:28AM +0100, Lukas Wunner wrote: > > > i915, malidp and msm "solved" this issue by not stopping the poll worker > > > on runtime suspend. But this results in the GPU bouncing back and forth > > > between D0 and D3 continuously. That's a total no-go for GPUs which > > > runtime suspend to D3cold since every suspend/resume cycle costs a > > > significant amount of time and energy. (i915 and malidp do not seem > > > to acquire a runtime PM ref in the ->detect callbacks, which seems > > > questionable. msm however does and would also deadlock if it disabled > > > the poll worker on runtime suspend. cc += Archit, Liviu, intel-gfx) > > > > In i915 polling is on during runtime suspend only if there are outputs > > without hotplug interrupt support. A special case is when an output has > > working HPD interrupts when in D0, but no interrupts when runtime > > suspended. For these we start polling (from a scheduled work) in the > > runtime suspend hook and stop it in the runtime resume hook (again from > > a scheduled work). > > I'm assuming to poll outputs you need to access mmio, is that correct? Correct. > Since mmio is inaccessible in D3hot, you seem to leave the PCI device > in D0 during runtime suspend, right? No, it's put into D3 (by the runtime PM core). > Aren't you leaving power saving > potential on the table that way? Or are you able to achieve the same > low power consumption in D0 as in D3hot by turning off power wells etc? > When powering off the GPU via vga_switcheroo manual power control > (a legacy feature we'll hopefully drop once we have runtime PM > support on muxed laptops and in i915) the PCI device *is* put into > D3hot by i915_suspend_switcheroo(). 
> If leaving the device in D0
> in the runtime suspend code path results in less power reduction,
> runtime PM wouldn't be a full replacement for vga_switcheroo manual
> power control, which would be kind of a bummer. :-/

It's possible that in the future the device won't be put into D3, given
that HPD interrupts are gone in that state, but this will be done only
if D3 doesn't provide any power saving over turning off display/GPU
power wells (still need to figure out the details of this).

> Another question, you're not calling pm_runtime_allow() when binding
> the driver to the device, so users have to either manually allow it
> via sysfs or use "powertop --auto-tune" or whatever. What's the
> rationale for that?

We decided to have more testing/fixing known issues before enabling it
by default.

--Imre

> nouveau, radeon and amdgpu all allow it by default.
>
> Thanks,
>
> Lukas
i915 runtime PM (was: Re: [PATCH 0/5] Fix deadlock on runtime suspend in DRM drivers)
[trimming cc: a little to avoid spamming folks not directly involved with i915]

On Mon, Feb 12, 2018 at 03:04:06PM +0200, Imre Deak wrote:
> On Sun, Feb 11, 2018 at 10:38:28AM +0100, Lukas Wunner wrote:
> > i915, malidp and msm "solved" this issue by not stopping the poll worker
> > on runtime suspend. But this results in the GPU bouncing back and forth
> > between D0 and D3 continuously. That's a total no-go for GPUs which
> > runtime suspend to D3cold since every suspend/resume cycle costs a
> > significant amount of time and energy. (i915 and malidp do not seem
> > to acquire a runtime PM ref in the ->detect callbacks, which seems
> > questionable. msm however does and would also deadlock if it disabled
> > the poll worker on runtime suspend. cc += Archit, Liviu, intel-gfx)
>
> In i915 polling is on during runtime suspend only if there are outputs
> without hotplug interrupt support. A special case is when an output has
> working HPD interrupts when in D0, but no interrupts when runtime
> suspended. For these we start polling (from a scheduled work) in the
> runtime suspend hook and stop it in the runtime resume hook (again from
> a scheduled work).

I'm assuming to poll outputs you need to access mmio, is that correct? Since mmio is inaccessible in D3hot, you seem to leave the PCI device in D0 during runtime suspend, right? Aren't you leaving power saving potential on the table that way? Or are you able to achieve the same low power consumption in D0 as in D3hot by turning off power wells etc?

When powering off the GPU via vga_switcheroo manual power control (a legacy feature we'll hopefully drop once we have runtime PM support on muxed laptops and in i915) the PCI device *is* put into D3hot by i915_suspend_switcheroo(). If leaving the device in D0 in the runtime suspend code path results in less power reduction, runtime PM wouldn't be a full replacement for vga_switcheroo manual power control, which would be kind of a bummer. :-/

Another question: you're not calling pm_runtime_allow() when binding the driver to the device, so users have to either manually allow it via sysfs or use "powertop --auto-tune" or whatever. What's the rationale for that? nouveau, radeon and amdgpu all allow it by default.

Thanks,

Lukas
___
dri-devel mailing list
dri-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/dri-devel
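Since i915 skips pm_runtime_allow(), enabling runtime PM is left to userspace via the standard sysfs `power/control` attribute, which is what "powertop --auto-tune" toggles. A minimal sketch of that interaction follows; the helper name and the device path are illustrative, not anything from the thread:

```shell
# Allow runtime PM for a device from userspace through sysfs -- the step a
# driver skips when it doesn't call pm_runtime_allow() itself.
enable_runtime_pm() {
    dev="$1"                             # e.g. /sys/bus/pci/devices/0000:01:00.0
    echo auto > "$dev/power/control"     # permit autosuspend
    cat "$dev/power/runtime_status"      # "active", "suspended", "suspending", ...
}
```

Writing "on" instead of "auto" forbids runtime PM again, which is the sysfs override Lukas mentions later in the thread for users who don't want autosuspend behavior.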
Re: [PATCH 0/5] Fix deadlock on runtime suspend in DRM drivers
On Wed, Feb 14, 2018 at 09:58:43AM -0500, Sean Paul wrote:
> On Wed, Feb 14, 2018 at 03:43:56PM +0100, Michel Dänzer wrote:
> > On 2018-02-14 03:08 PM, Sean Paul wrote:
> > > On Wed, Feb 14, 2018 at 10:26:35AM +0100, Maarten Lankhorst wrote:
> > >> Op 14-02-18 om 09:46 schreef Lukas Wunner:
> > >>> On Sun, Feb 11, 2018 at 10:38:28AM +0100, Lukas Wunner wrote:
> > >>>> Fix a deadlock on hybrid graphics laptops that's been present since
> > >>>> 2013:
> > >>> This series has been reviewed, consent has been expressed by the most
> > >>> interested parties, patch [1/5] which touches files outside drivers/gpu
> > >>> has been acked and I've just sent out a v2 addressing the only objection
> > >>> raised. My plan is thus to wait another two days for comments and,
> > >>> barring further objections, push to drm-misc this weekend.
> > >>>
> > >>> However I'm struggling with the decision whether to push to next or
> > >>> fixes. The series is marked for stable, however the number of
> > >>> affected machines is limited and for an issue that's been present
> > >>> for 5 years it probably doesn't matter if it soaks another two months
> > >>> in linux-next before it gets backported. Hence I tend to err on the
> > >>> side of caution and push to next, however a case could be made that
> > >>> fixes is more appropriate.
> > >>>
> > >>> I'm lacking experience making such decisions and would be interested
> > >>> to learn how you'd handle this.
> > >>
> > >> I would say fixes, it doesn't look particularly scary. :)
> > >
> > > Agreed. If it's good enough for stable, it's good enough for -fixes!
> >
> > It's not that simple, is it? Fast-tracking patches (some of which appear
> > to be untested) to stable without an immediate cause for urgency seems
> > risky to me.
>
> /me should be more careful what he says
>
> Given where we are in the release cycle, it's barely a fast track.
> If these go in -fixes, they'll get in -rc2 and will have plenty of
> time to bake.
If we were at rc5, it might be a different story. The patches are marked for stable though, so if they go in through drm-misc-fixes, they may appear in stable kernels before 4.16-final is out. Greg picks up patches once they're in Linus' tree, though often with a delay of a few days or weeks. If they go in through drm-misc-next, they're guaranteed not to appear in *any* release before 4.16-final is out. This allows for differentiation between no-brainer stable fixes that can be sent immediately and scarier, but similarly important stable fixes that should soak for a while. I'm not sure which category this series belongs to, though it's true what Maarten says, it's not *that* grave a change.

Thanks,

Lukas
Re: [PATCH 0/5] Fix deadlock on runtime suspend in DRM drivers
On Wed, Feb 14, 2018 at 03:43:56PM +0100, Michel Dänzer wrote:
> On 2018-02-14 03:08 PM, Sean Paul wrote:
> > On Wed, Feb 14, 2018 at 10:26:35AM +0100, Maarten Lankhorst wrote:
> >> Op 14-02-18 om 09:46 schreef Lukas Wunner:
> >>> Dear drm-misc maintainers,
> >>>
> >>> On Sun, Feb 11, 2018 at 10:38:28AM +0100, Lukas Wunner wrote:
> >>>> Fix a deadlock on hybrid graphics laptops that's been present since 2013:
> >>> This series has been reviewed, consent has been expressed by the most
> >>> interested parties, patch [1/5] which touches files outside drivers/gpu
> >>> has been acked and I've just sent out a v2 addressing the only objection
> >>> raised. My plan is thus to wait another two days for comments and,
> >>> barring further objections, push to drm-misc this weekend.
> >>>
> >>> However I'm struggling with the decision whether to push to next or
> >>> fixes. The series is marked for stable, however the number of
> >>> affected machines is limited and for an issue that's been present
> >>> for 5 years it probably doesn't matter if it soaks another two months
> >>> in linux-next before it gets backported. Hence I tend to err on the
> >>> side of caution and push to next, however a case could be made that
> >>> fixes is more appropriate.
> >>>
> >>> I'm lacking experience making such decisions and would be interested
> >>> to learn how you'd handle this.
> >>>
> >>> Thanks,
> >>>
> >>> Lukas
> >>
> >> I would say fixes, it doesn't look particularly scary. :)
> >
> > Agreed. If it's good enough for stable, it's good enough for -fixes!
>
> It's not that simple, is it? Fast-tracking patches (some of which appear
> to be untested) to stable without an immediate cause for urgency seems
> risky to me.

/me should be more careful what he says

Given where we are in the release cycle, it's barely a fast track.
If these go in -fixes, they'll get in -rc2 and will have plenty of
time to bake. If we were at rc5, it might be a different story.
Sean

> --
> Earthling Michel Dänzer | http://www.amd.com
> Libre software enthusiast | Mesa and X developer

--
Sean Paul, Software Engineer, Google / Chromium OS
Re: [PATCH 0/5] Fix deadlock on runtime suspend in DRM drivers
On 2018-02-14 03:08 PM, Sean Paul wrote:
> On Wed, Feb 14, 2018 at 10:26:35AM +0100, Maarten Lankhorst wrote:
>> Op 14-02-18 om 09:46 schreef Lukas Wunner:
>>> Dear drm-misc maintainers,
>>>
>>> On Sun, Feb 11, 2018 at 10:38:28AM +0100, Lukas Wunner wrote:
>>>> Fix a deadlock on hybrid graphics laptops that's been present since 2013:
>>> This series has been reviewed, consent has been expressed by the most
>>> interested parties, patch [1/5] which touches files outside drivers/gpu
>>> has been acked and I've just sent out a v2 addressing the only objection
>>> raised. My plan is thus to wait another two days for comments and,
>>> barring further objections, push to drm-misc this weekend.
>>>
>>> However I'm struggling with the decision whether to push to next or
>>> fixes. The series is marked for stable, however the number of
>>> affected machines is limited and for an issue that's been present
>>> for 5 years it probably doesn't matter if it soaks another two months
>>> in linux-next before it gets backported. Hence I tend to err on the
>>> side of caution and push to next, however a case could be made that
>>> fixes is more appropriate.
>>>
>>> I'm lacking experience making such decisions and would be interested
>>> to learn how you'd handle this.
>>>
>>> Thanks,
>>>
>>> Lukas
>>
>> I would say fixes, it doesn't look particularly scary. :)
>
> Agreed. If it's good enough for stable, it's good enough for -fixes!

It's not that simple, is it? Fast-tracking patches (some of which appear to be untested) to stable without an immediate cause for urgency seems risky to me.

--
Earthling Michel Dänzer | http://www.amd.com
Libre software enthusiast | Mesa and X developer
Re: [PATCH 0/5] Fix deadlock on runtime suspend in DRM drivers
Op 14-02-18 om 09:46 schreef Lukas Wunner:
> Dear drm-misc maintainers,
>
> On Sun, Feb 11, 2018 at 10:38:28AM +0100, Lukas Wunner wrote:
>> Fix a deadlock on hybrid graphics laptops that's been present since 2013:
> This series has been reviewed, consent has been expressed by the most
> interested parties, patch [1/5] which touches files outside drivers/gpu
> has been acked and I've just sent out a v2 addressing the only objection
> raised. My plan is thus to wait another two days for comments and,
> barring further objections, push to drm-misc this weekend.
>
> However I'm struggling with the decision whether to push to next or
> fixes. The series is marked for stable, however the number of
> affected machines is limited and for an issue that's been present
> for 5 years it probably doesn't matter if it soaks another two months
> in linux-next before it gets backported. Hence I tend to err on the
> side of caution and push to next, however a case could be made that
> fixes is more appropriate.
>
> I'm lacking experience making such decisions and would be interested
> to learn how you'd handle this.
>
> Thanks,
>
> Lukas

I would say fixes, it doesn't look particularly scary. :)

~Maarten
Re: [PATCH 0/5] Fix deadlock on runtime suspend in DRM drivers
Dear drm-misc maintainers,

On Sun, Feb 11, 2018 at 10:38:28AM +0100, Lukas Wunner wrote:
> Fix a deadlock on hybrid graphics laptops that's been present since 2013:

This series has been reviewed, consent has been expressed by the most interested parties, patch [1/5] which touches files outside drivers/gpu has been acked and I've just sent out a v2 addressing the only objection raised. My plan is thus to wait another two days for comments and, barring further objections, push to drm-misc this weekend.

However I'm struggling with the decision whether to push to next or fixes. The series is marked for stable, however the number of affected machines is limited and for an issue that's been present for 5 years it probably doesn't matter if it soaks another two months in linux-next before it gets backported. Hence I tend to err on the side of caution and push to next, however a case could be made that fixes is more appropriate.

I'm lacking experience making such decisions and would be interested to learn how you'd handle this.

Thanks,

Lukas
Re: [PATCH 0/5] Fix deadlock on runtime suspend in DRM drivers
On Tue, Feb 13, 2018 at 12:52:06PM +0100, Lukas Wunner wrote: > On Tue, Feb 13, 2018 at 10:55:06AM +, Liviu Dudau wrote: > > On Sun, Feb 11, 2018 at 10:38:28AM +0100, Lukas Wunner wrote: > > > DRM drivers poll connectors in 10 sec intervals. The poll worker is > > > stopped on ->runtime_suspend with cancel_delayed_work_sync(). However > > > the poll worker invokes the DRM drivers' ->detect callbacks, which call > > > pm_runtime_get_sync(). If the poll worker starts after runtime suspend > > > has begun, pm_runtime_get_sync() will wait for runtime suspend to finish > > > with the intention of runtime resuming the device afterwards. The result > > > is a circular wait between poll worker and autosuspend worker. > > > > I think I understand the problem you are trying to solve, but I'm > > struggling to understand where malidp makes any specific mistakes. First > > of all, malidp is only a display engine, so there is no GPU attached to > > it, but that is only a small clarification. Second, malidp doesn't use > > directly any of the callbacks that you are referring to, it uses the > > drm_cma_() API plus the generic drm_() call. So if there are any > > issues there (as they might well be) I think they would apply to a lot > > more drivers and the fix will involve more than just malidp, i915 and > > msm. > > I don't know if malidp makes any specific mistakes and didn't mean to > cast it in a negative light, sorry. I did not take what you've said as a negative thing, only wanted to understand how you came to your conclusions. > > So a lot of DRM drivers acquire a runtime PM ref in the connector ->detect > hook because they can't probe connectors while runtime suspended. > E.g. for a PCI device, probing might require mmio access, which isn't > possible outside of power state D0. There are no ->detect hooks declared > in drivers/gpu/drm/arm/, so it's unclear to me whether you're able to probe > during runtime suspend. 
That's because the drivers in drivers/gpu/drm/arm do not have connectors, they are only the CRTC part of the driver. Both hdlcd and mali-dp use the component framework to locate an encoder in device tree that will then provide the connectors.

> hdlcd_drv.c and malidp_drv.c both enable output polling. Output polling
> is only necessary if you don't get HPD interrupts.

That's right, hdlcd and mali-dp don't receive HPD interrupts because they don't have any. And because we don't know ahead of time which encoder/connector will be paired with the driver, we enable polling as a safe fallback.

> You're not disabling polling upon runtime suspend. Thus, if a runtime PM
> ref is acquired during polling (such as in a ->detect hook), the GPU will
> be runtime resumed every 10 secs. You may want to verify that's not the
> case. If you decide that you do want to stop polling during runtime
> suspend because it runtime resumes the GPU continuously, you'll need the
> helper introduced in this series. So by cc'ing you I just wanted to keep
> you in the loop about an issue that may potentially affect your driver.

Again, we have no GPU linked to us and the polling during runtime suspend should be handled by the driver for the paired encoder, not by us. I understand the potential issue but I'm struggling to understand if it really applies to the drivers/gpu/drm/arm drivers other than in an abstract way.

Best regards,
Liviu

> Let me know if there are any questions.
>
> Thanks,
>
> Lukas

--
 | I would like to |
 | fix the world,  |
 | but they're not |
 | giving me the   |
  \ source code!  /
       ---
    ¯\_(ツ)_/¯
Re: [PATCH 0/5] Fix deadlock on runtime suspend in DRM drivers
On Tue, Feb 13, 2018 at 3:17 AM, Lukas Wunner wrote:
> On Mon, Feb 12, 2018 at 01:58:32PM -0500, Alex Deucher wrote:
>> On Mon, Feb 12, 2018 at 4:45 AM, Lukas Wunner wrote:
>> > On Mon, Feb 12, 2018 at 09:03:26AM +, Mike Lothian wrote:
>> >> On 12 February 2018 at 03:39, Lukas Wunner wrote:
>> >> > On Mon, Feb 12, 2018 at 12:35:51AM +, Mike Lothian wrote:
>> >> > > I've not been able to reproduce the original problem you're trying to
>> >> > > solve on amdgpu, that's with or without your patch set and the above
>> >> > > "trigger" too
>> >
>> > Okay, the reason you're not seeing deadlocks is because the output poll
>> > worker is not enabled. And the output poll worker is not enabled
>> > because your discrete GPU doesn't have any outputs:
>> >
>> > [0.265568] [drm:dc_create] *ERROR* DC: Number of connectors is zero!
>> >
>> > The outputs are only polled if there are connectors which have the
>> > DRM_CONNECTOR_POLL_CONNECT or DRM_CONNECTOR_POLL_DISCONNECT flag set.
>> > And that only ever seems to be the case for VGA and DVI.
>> >
>> > We know based on bugzilla reports that hybrid graphics laptops do exist
>> > which poll outputs with radeon and nouveau. If there are no laptops
>> > supported by amdgpu whose discrete GPU has polled connectors, then
>> > patch [5/5] would be unnecessary. That is for Alex to decide.
>>
>> Most hybrid laptops don't have display connectors on the dGPU and we
>> only use polling on analog connectors, so you are not likely to run
>> into this on recent laptops. That said, I don't know if there is some
>> OEM system out there with a VGA port on the dGPU in a hybrid laptop.
>> I guess another option is to just disable polling on hybrid laptops.
>
> If we don't know for sure, applying patch [5/5] would seem to be the
> safest approach. (Assuming it doesn't break anything else.)

I don't have any objections. I see no reason to leave out the amdgpu changes.
Alex
Re: [PATCH 0/5] Fix deadlock on runtime suspend in DRM drivers
On Tue, Feb 13, 2018 at 10:55:06AM +, Liviu Dudau wrote: > On Sun, Feb 11, 2018 at 10:38:28AM +0100, Lukas Wunner wrote: > > DRM drivers poll connectors in 10 sec intervals. The poll worker is > > stopped on ->runtime_suspend with cancel_delayed_work_sync(). However > > the poll worker invokes the DRM drivers' ->detect callbacks, which call > > pm_runtime_get_sync(). If the poll worker starts after runtime suspend > > has begun, pm_runtime_get_sync() will wait for runtime suspend to finish > > with the intention of runtime resuming the device afterwards. The result > > is a circular wait between poll worker and autosuspend worker. > > I think I understand the problem you are trying to solve, but I'm > struggling to understand where malidp makes any specific mistakes. First > of all, malidp is only a display engine, so there is no GPU attached to > it, but that is only a small clarification. Second, malidp doesn't use > directly any of the callbacks that you are referring to, it uses the > drm_cma_() API plus the generic drm_() call. So if there are any > issues there (as they might well be) I think they would apply to a lot > more drivers and the fix will involve more than just malidp, i915 and > msm. I don't know if malidp makes any specific mistakes and didn't mean to cast it in a negative light, sorry. So a lot of DRM drivers acquire a runtime PM ref in the connector ->detect hook because they can't probe connectors while runtime suspended. E.g. for a PCI device, probing might require mmio access, which isn't possible outside of power state D0. There are no ->detect hooks declared in drivers/gpu/drm/arm/, so it's unclear to me whether you're able to probe during runtime suspend. hdlcd_drv.c and malidp_drv.c both enable output polling. Output polling is only necessary if you don't get HPD interrupts. You're not disabling polling upon runtime suspend. 
Thus, if a runtime PM ref is acquired during polling (such as in a ->detect hook), the GPU will be runtime resumed every 10 secs. You may want to verify that's not the case. If you decide that you do want to stop polling during runtime suspend because it runtime resumes the GPU continuously, you'll need the helper introduced in this series. So by cc'ing you I just wanted to keep you in the loop about an issue that may potentially affect your driver.

Let me know if there are any questions.

Thanks,

Lukas
Re: [PATCH 0/5] Fix deadlock on runtime suspend in DRM drivers
Hi Lukas,

On Sun, Feb 11, 2018 at 10:38:28AM +0100, Lukas Wunner wrote:
> Fix a deadlock on hybrid graphics laptops that's been present since 2013:
>
> DRM drivers poll connectors in 10 sec intervals. The poll worker is
> stopped on ->runtime_suspend with cancel_delayed_work_sync(). However
> the poll worker invokes the DRM drivers' ->detect callbacks, which call
> pm_runtime_get_sync(). If the poll worker starts after runtime suspend
> has begun, pm_runtime_get_sync() will wait for runtime suspend to finish
> with the intention of runtime resuming the device afterwards. The result
> is a circular wait between poll worker and autosuspend worker.
>
> I'm seeing this deadlock so often it's not funny anymore. I've also found
> 3 nouveau bugzillas about it and 1 radeon bugzilla. (See patch [3/5] and
> [4/5].)
>
> One theoretical solution would be to add a flag to the ->detect callback
> to tell it that it's running in the poll worker's context. In that case
> it doesn't need to call pm_runtime_get_sync() because the poll worker is
> only enabled while runtime active and we know that ->runtime_suspend
> waits for it to finish. But this approach would require touching every
> single DRM driver's ->detect hook. Moreover the ->detect hook is called
> from numerous other places, both in the DRM library as well as in each
> driver, making it necessary to trace every possible call chain and check
> if it's coming from the poll worker or not. I've tried to do that for
> nouveau (see the call sites listed in the commit message of patch [3/5])
> and concluded that this approach is a nightmare to implement.
>
> Another reason for the infeasibility of this approach is that ->detect
> is called from within the DRM library without driver involvement, e.g.
> via DRM's sysfs interface. In those cases, pm_runtime_get_sync() would
> have to be called by the DRM library, but the DRM library is supposed to
> stay generic, whereas runtime PM is driver-specific.
> > So instead, I've come up with this much simpler solution which gleans > from the current task's flags if it's a workqueue worker, retrieves the > work_struct and checks if it's the poll worker. All that's needed is > one small helper in the workqueue code and another small helper in the > DRM library. This solution lends itself nicely to stable-backporting. > > The patches for radeon and amdgpu are compile-tested only, I only have a > MacBook Pro with an Nvidia GK107 to test. To test the patches, add an > "msleep(12*1000);" at the top of the driver's ->runtime_suspend hook. > This ensures that the poll worker runs after ->runtime_suspend has begun. > Wait 12 sec after the GPU has begun runtime suspend, then check > /sys/bus/pci/devices/:01:00.0/power/runtime_status. Without this > series, the status will be stuck at "suspending" and you'll get hung task > errors in dmesg after a few minutes. > > i915, malidp and msm "solved" this issue by not stopping the poll worker > on runtime suspend. But this results in the GPU bouncing back and forth > between D0 and D3 continuously. That's a total no-go for GPUs which > runtime suspend to D3cold since every suspend/resume cycle costs a > significant amount of time and energy. (i915 and malidp do not seem > to acquire a runtime PM ref in the ->detect callbacks, which seems > questionable. msm however does and would also deadlock if it disabled > the poll worker on runtime suspend. cc += Archit, Liviu, intel-gfx) I think I understand the problem you are trying to solve, but I'm struggling to understand where malidp makes any specific mistakes. First of all, malidp is only a display engine, so there is no GPU attached to it, but that is only a small clarification. Second, malidp doesn't use directly any of the callbacks that you are referring to, it uses the drm_cma_() API plus the generic drm_() call. 
So if there are any issues there (as they might well be) I think they would apply to a lot more drivers and the fix will involve more than just malidp, i915 and msm.

Would love to get any clarifications on what I might have misunderstood.

Best regards,
Liviu

> Please review. Thanks,
>
> Lukas
>
> Lukas Wunner (5):
>   workqueue: Allow retrieval of current task's work struct
>   drm: Allow determining if current task is output poll worker
>   drm/nouveau: Fix deadlock on runtime suspend
>   drm/radeon: Fix deadlock on runtime suspend
>   drm/amdgpu: Fix deadlock on runtime suspend
>
>  drivers/gpu/drm/amd/amdgpu/amdgpu_connectors.c | 58 +---
>  drivers/gpu/drm/drm_probe_helper.c             | 14 +
>  drivers/gpu/drm/nouveau/nouveau_connector.c    | 18 +--
>  drivers/gpu/drm/radeon/radeon_connectors.c     | 74 +-
>  include/drm/drm_crtc_helper.h                  |  1 +
>  include/linux/workqueue.h                      |  1 +
>  kernel/workqueue.c                             | 16 ++
>  7 files changed, 132 insertions(+), 50 deletions(-)
Re: [PATCH 0/5] Fix deadlock on runtime suspend in DRM drivers
On Mon, Feb 12, 2018 at 01:58:32PM -0500, Alex Deucher wrote:
> On Mon, Feb 12, 2018 at 4:45 AM, Lukas Wunner wrote:
> > On Mon, Feb 12, 2018 at 09:03:26AM +, Mike Lothian wrote:
> >> On 12 February 2018 at 03:39, Lukas Wunner wrote:
> >> > On Mon, Feb 12, 2018 at 12:35:51AM +, Mike Lothian wrote:
> >> > > I've not been able to reproduce the original problem you're trying to
> >> > > solve on amdgpu, that's with or without your patch set and the above
> >> > > "trigger" too
> >
> > Okay, the reason you're not seeing deadlocks is because the output poll
> > worker is not enabled. And the output poll worker is not enabled
> > because your discrete GPU doesn't have any outputs:
> >
> > [0.265568] [drm:dc_create] *ERROR* DC: Number of connectors is zero!
> >
> > The outputs are only polled if there are connectors which have the
> > DRM_CONNECTOR_POLL_CONNECT or DRM_CONNECTOR_POLL_DISCONNECT flag set.
> > And that only ever seems to be the case for VGA and DVI.
> >
> > We know based on bugzilla reports that hybrid graphics laptops do exist
> > which poll outputs with radeon and nouveau. If there are no laptops
> > supported by amdgpu whose discrete GPU has polled connectors, then
> > patch [5/5] would be unnecessary. That is for Alex to decide.
>
> Most hybrid laptops don't have display connectors on the dGPU and we
> only use polling on analog connectors, so you are not likely to run
> into this on recent laptops. That said, I don't know if there is some
> OEM system out there with a VGA port on the dGPU in a hybrid laptop.
> I guess another option is to just disable polling on hybrid laptops.

If we don't know for sure, applying patch [5/5] would seem to be the safest approach. (Assuming it doesn't break anything else.)

Right now runtime PM is only used on hybrid graphics dGPUs by nouveau, radeon and amdgpu. Would it be conceivable that its use is expanded beyond that in the future? E.g. on a desktop machine, if DPMS is off on all screens, why keep the GPU in D0?
If that is conceivable, chances that analog connectors are present are higher, and then the patch would be necessary again. (Of course this would mean that analog screens wouldn't light up automatically if they're attached while the GPU is in D3hot, but the user may forbid runtime PM via sysfs if that is unwanted.)

Thanks,

Lukas
Re: [PATCH 0/5] Fix deadlock on runtime suspend in DRM drivers
On Mon, Feb 12, 2018 at 4:45 AM, Lukas Wunner wrote:
> On Mon, Feb 12, 2018 at 09:03:26AM +, Mike Lothian wrote:
>> On 12 February 2018 at 03:39, Lukas Wunner wrote:
>> > On Mon, Feb 12, 2018 at 12:35:51AM +, Mike Lothian wrote:
>> > > I've not been able to reproduce the original problem you're trying to
>> > > solve on amdgpu, that's with or without your patch set and the above
>> > > "trigger" too
>> > >
>> > > Is anything else required to trigger it, I started multiple DRI_PRIME
>> > > glxgears, in parallel, serial waiting the 12 seconds and serial within
>> > > the 12 seconds and I couldn't reproduce it
>> >
>> > The discrete GPU needs to runtime suspend, that's the trigger,
>> > so no DRI_PRIME executables should be running. Just let it
>> > autosuspend after boot. Do you see "waiting 12 sec" messages
>> > in dmesg? If not it's not autosuspending.
>>
>> Yes I'm seeing those messages, I'm just not seeing the hangs
>>
>> I've attached the dmesg in case you're interested
>
> Okay, the reason you're not seeing deadlocks is because the output poll
> worker is not enabled. And the output poll worker is not enabled
> because your discrete GPU doesn't have any outputs:
>
> [0.265568] [drm:dc_create] *ERROR* DC: Number of connectors is zero!
>
> The outputs are only polled if there are connectors which have the
> DRM_CONNECTOR_POLL_CONNECT or DRM_CONNECTOR_POLL_DISCONNECT flag set.
> And that only ever seems to be the case for VGA and DVI.
>
> We know based on bugzilla reports that hybrid graphics laptops do exist
> which poll outputs with radeon and nouveau. If there are no laptops
> supported by amdgpu whose discrete GPU has polled connectors, then
> patch [5/5] would be unnecessary. That is for Alex to decide.

Most hybrid laptops don't have display connectors on the dGPU and we only use polling on analog connectors, so you are not likely to run into this on recent laptops.
That said, I don't know if there is some OEM system out there with a VGA port on the dGPU in a hybrid laptop. I guess another option is to just disable polling on hybrid laptops.

Alex

> However that is very good to know, so thanks a lot for your testing
> efforts, much appreciated!
>
> Kind regards,
>
> Lukas
Re: [PATCH 0/5] Fix deadlock on runtime suspend in DRM drivers
This patchset is a no-brainer for me. I'm ashamed to say I think I might have seen this issue before, but polling gets used so rarely nowadays it slipped under the rug by accident. Anyway, for the whole series (except patch #2, which I will respond to with a short comment in just a moment):

Reviewed-by: Lyude Paul

On Sun, 2018-02-11 at 10:38 +0100, Lukas Wunner wrote:
> Fix a deadlock on hybrid graphics laptops that's been present since 2013:
>
> DRM drivers poll connectors in 10 sec intervals. The poll worker is
> stopped on ->runtime_suspend with cancel_delayed_work_sync(). However
> the poll worker invokes the DRM drivers' ->detect callbacks, which call
> pm_runtime_get_sync(). If the poll worker starts after runtime suspend
> has begun, pm_runtime_get_sync() will wait for runtime suspend to finish
> with the intention of runtime resuming the device afterwards. The result
> is a circular wait between poll worker and autosuspend worker.
>
> I'm seeing this deadlock so often it's not funny anymore. I've also found
> 3 nouveau bugzillas about it and 1 radeon bugzilla. (See patch [3/5] and
> [4/5].)
>
> One theoretical solution would be to add a flag to the ->detect callback
> to tell it that it's running in the poll worker's context. In that case
> it doesn't need to call pm_runtime_get_sync() because the poll worker is
> only enabled while runtime active and we know that ->runtime_suspend
> waits for it to finish. But this approach would require touching every
> single DRM driver's ->detect hook. Moreover the ->detect hook is called
> from numerous other places, both in the DRM library as well as in each
> driver, making it necessary to trace every possible call chain and check
> if it's coming from the poll worker or not. I've tried to do that for
> nouveau (see the call sites listed in the commit message of patch [3/5])
> and concluded that this approach is a nightmare to implement.
>
> Another reason for the infeasibility of this approach is that ->detect
> is called from within the DRM library without driver involvement, e.g.
> via DRM's sysfs interface. In those cases, pm_runtime_get_sync() would
> have to be called by the DRM library, but the DRM library is supposed to
> stay generic, whereas runtime PM is driver-specific.
>
> So instead, I've come up with this much simpler solution which gleans
> from the current task's flags if it's a workqueue worker, retrieves the
> work_struct and checks if it's the poll worker. All that's needed is
> one small helper in the workqueue code and another small helper in the
> DRM library. This solution lends itself nicely to stable-backporting.
>
> The patches for radeon and amdgpu are compile-tested only, I only have a
> MacBook Pro with an Nvidia GK107 to test. To test the patches, add an
> "msleep(12*1000);" at the top of the driver's ->runtime_suspend hook.
> This ensures that the poll worker runs after ->runtime_suspend has begun.
> Wait 12 sec after the GPU has begun runtime suspend, then check
> /sys/bus/pci/devices/:01:00.0/power/runtime_status. Without this
> series, the status will be stuck at "suspending" and you'll get hung task
> errors in dmesg after a few minutes.
>
> i915, malidp and msm "solved" this issue by not stopping the poll worker
> on runtime suspend. But this results in the GPU bouncing back and forth
> between D0 and D3 continuously. That's a total no-go for GPUs which
> runtime suspend to D3cold since every suspend/resume cycle costs a
> significant amount of time and energy. (i915 and malidp do not seem
> to acquire a runtime PM ref in the ->detect callbacks, which seems
> questionable. msm however does and would also deadlock if it disabled
> the poll worker on runtime suspend. cc += Archit, Liviu, intel-gfx)
>
> Please review.
Thanks, > > Lukas > > Lukas Wunner (5): > workqueue: Allow retrieval of current task's work struct > drm: Allow determining if current task is output poll worker > drm/nouveau: Fix deadlock on runtime suspend > drm/radeon: Fix deadlock on runtime suspend > drm/amdgpu: Fix deadlock on runtime suspend > > drivers/gpu/drm/amd/amdgpu/amdgpu_connectors.c | 58 +--- > drivers/gpu/drm/drm_probe_helper.c | 14 + > drivers/gpu/drm/nouveau/nouveau_connector.c | 18 +-- > drivers/gpu/drm/radeon/radeon_connectors.c | 74 +- > include/drm/drm_crtc_helper.h | 1 + > include/linux/workqueue.h | 1 + > kernel/workqueue.c | 16 ++ > 7 files changed, 132 insertions(+), 50 deletions(-) > -- Cheers, Lyude Paul ___ dri-devel mailing list dri-devel@lists.freedesktop.org https://lists.freedesktop.org/mailman/listinfo/dri-devel
Re: [PATCH 0/5] Fix deadlock on runtime suspend in DRM drivers
On Sun, Feb 11, 2018 at 10:38:28AM +0100, Lukas Wunner wrote: > [...] > i915, malidp and msm "solved" this issue by not stopping the poll worker > on runtime suspend. But this results in the GPU bouncing back and forth > between D0 and D3 continuously. That's a total no-go for GPUs which > runtime suspend to D3cold since every suspend/resume cycle costs a > significant amount of time and energy. (i915 and malidp do not seem > to acquire a runtime PM ref in the ->detect callbacks, which seems > questionable. msm however does and would also deadlock if it disabled > the poll worker on runtime suspend. cc += Archit, Liviu, intel-gfx) In i915 polling is on during runtime suspend only if there are outputs without hotplug interrupt support. A special case is when an output has working HPD interrupts when in D0, but no interrupts when runtime suspended. For these we start polling (from a scheduled work) in the runtime suspend hook and stop it in the runtime resume hook (again from a scheduled work). Imo whether to leave polling on or not for the above purpose is a policy question (configurable with the drm_kms_helper.poll param). --Imre > > Please review. 
Thanks, > > Lukas > > Lukas Wunner (5): > workqueue: Allow retrieval of current task's work struct > drm: Allow determining if current task is output poll worker > drm/nouveau: Fix deadlock on runtime suspend > drm/radeon: Fix deadlock on runtime suspend > drm/amdgpu: Fix deadlock on runtime suspend > > drivers/gpu/drm/amd/amdgpu/amdgpu_connectors.c | 58 +--- > drivers/gpu/drm/drm_probe_helper.c | 14 + > drivers/gpu/drm/nouveau/nouveau_connector.c | 18 +-- > drivers/gpu/drm/radeon/radeon_connectors.c | 74 +- > include/drm/drm_crtc_helper.h | 1 + > include/linux/workqueue.h | 1 + > kernel/workqueue.c | 16 ++ > 7 files changed, 132 insertions(+), 50 deletions(-) > > -- > 2.15.1 > > ___ > dri-devel mailing list > dri-devel@lists.freedesktop.org > https://lists.freedesktop.org/mailman/listinfo/dri-devel
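[Editor's note: for readers unfamiliar with the drm_kms_helper.poll parameter Imre mentions, here is a config sketch of how it is set. The module parameter itself is real; the exact sysfs path assumes the module is loaded, and on built-in configurations only the command-line form applies.]

```shell
# Disable connector output polling globally at module load time:
modprobe drm_kms_helper poll=0

# Equivalent kernel command-line option when drm_kms_helper is built in:
#   drm_kms_helper.poll=0

# Inspect the current value at runtime:
cat /sys/module/drm_kms_helper/parameters/poll
```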
Re: [PATCH 0/5] Fix deadlock on runtime suspend in DRM drivers
On Mon, Feb 12, 2018 at 09:03:26AM +, Mike Lothian wrote: > On 12 February 2018 at 03:39, Lukas Wunner wrote: > > On Mon, Feb 12, 2018 at 12:35:51AM +, Mike Lothian wrote: > > > I've not been able to reproduce the original problem you're trying to > > > solve on amdgpu thats with or without your patch set and the above > > > "trigger" too > > > > > > Is anything else required to trigger it, I started multiple DRI_PRIME > > > glxgears, in parallel, serial waiting the 12 seconds and serial within > > > the 12 seconds and I couldn't reproduce it > > > > The discrete GPU needs to runtime suspend, that's the trigger, > > so no DRI_PRIME executables should be running. Just let it > > autosuspend after boot. Do you see "waiting 12 sec" messages > > in dmesg? If not it's not autosuspending. > > Yes I'm seeing those messages, I'm just not seeing the hangs > > I've attached the dmesg in case you're interested Okay the reason you're not seeing deadlocks is because the output poll worker is not enabled. And the output poll worker is not enabled because your discrete GPU doesn't have any outputs: [0.265568] [drm:dc_create] *ERROR* DC: Number of connectors is zero! The outputs are only polled if there are connectors which have the DRM_CONNECTOR_POLL_CONNECT or DRM_CONNECTOR_POLL_DISCONNECT flag set. And that only ever seems to be the case for VGA and DVI. We know based on bugzilla reports that hybrid graphics laptops do exist which poll outputs with radeon and nouveau. If there are no laptops supported by amdgpu whose discrete GPU has polled connectors, then patch [5/5] would be unnecessary. That is for Alex to decide. However that is very good to know, so thanks a lot for your testing efforts, much appreciated! Kind regards, Lukas
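[Editor's note: the flag check Lukas describes can be illustrated with a minimal userspace C sketch. The flag values mirror include/drm/drm_connector.h, but `connector_needs_polling()` is a made-up name for illustration, not a DRM API.]

```c
#include <stdbool.h>

/* Connector polling flags, mirroring include/drm/drm_connector.h. */
#define DRM_CONNECTOR_POLL_HPD        (1 << 0) /* hotplug IRQ capable  */
#define DRM_CONNECTOR_POLL_CONNECT    (1 << 1) /* poll for connection  */
#define DRM_CONNECTOR_POLL_DISCONNECT (1 << 2) /* poll for disconnect  */

/* A connector participates in the 10-second poll loop only if it asks
 * to be polled for connect or disconnect; connectors that rely solely
 * on HPD interrupts are never polled, which is why a GPU with zero
 * connectors (or HPD-only ones) keeps the poll worker disabled. */
static bool connector_needs_polling(unsigned int polled)
{
	return (polled & (DRM_CONNECTOR_POLL_CONNECT |
			  DRM_CONNECTOR_POLL_DISCONNECT)) != 0;
}
```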
Re: [PATCH 0/5] Fix deadlock on runtime suspend in DRM drivers
On 12 February 2018 at 03:39, Lukas Wunner wrote: > On Mon, Feb 12, 2018 at 12:35:51AM +, Mike Lothian wrote: >> I've not been able to reproduce the original problem you're trying to >> solve on amdgpu thats with or without your patch set and the above >> "trigger" too >> >> Is anything else required to trigger it, I started multiple DRI_PRIME >> glxgears, in parallel, serial waiting the 12 seconds and serial within >> the 12 seconds and I couldn't reproduce it > > The discrete GPU needs to runtime suspend, that's the trigger, > so no DRI_PRIME executables should be running. Just let it > autosuspend after boot. Do you see "waiting 12 sec" messages > in dmesg? If not it's not autosuspending. > > Thanks, > > Lukas Hi Yes I'm seeing those messages, I'm just not seeing the hangs I've attached the dmesg in case you're interested Regards Mike dmesg.nohangs Description: Binary data
Re: [PATCH 0/5] Fix deadlock on runtime suspend in DRM drivers
On Mon, Feb 12, 2018 at 12:35:51AM +, Mike Lothian wrote: > I've not been able to reproduce the original problem you're trying to > solve on amdgpu thats with or without your patch set and the above > "trigger" too > > Is anything else required to trigger it, I started multiple DRI_PRIME > glxgears, in parallel, serial waiting the 12 seconds and serial within > the 12 seconds and I couldn't reproduce it The discrete GPU needs to runtime suspend, that's the trigger, so no DRI_PRIME executables should be running. Just let it autosuspend after boot. Do you see "waiting 12 sec" messages in dmesg? If not it's not autosuspending. Thanks, Lukas
Re: [PATCH 0/5] Fix deadlock on runtime suspend in DRM drivers
Hi I've not been able to reproduce the original problem you're trying to solve on amdgpu thats with or without your patch set and the above "trigger" too Is anything else required to trigger it, I started multiple DRI_PRIME glxgears, in parallel, serial waiting the 12 seconds and serial within the 12 seconds and I couldn't reproduce it Regards Mike
Re: [PATCH 0/5] Fix deadlock on runtime suspend in DRM drivers
On Sun, Feb 11, 2018 at 08:23:14PM +0100, Lukas Wunner wrote: > On Sun, Feb 11, 2018 at 06:58:11PM +, Mike Lothian wrote: > > On 11 February 2018 at 09:38, Lukas Wunner wrote: > > > The patches for radeon and amdgpu are compile-tested only, I only have a > > > MacBook Pro with an Nvidia GK107 to test. To test the patches, add an > > > "msleep(12*1000);" at the top of the driver's ->runtime_suspend hook. > > > This ensures that the poll worker runs after ->runtime_suspend has begun. > > > Wait 12 sec after the GPU has begun runtime suspend, then check > > > /sys/bus/pci/devices/:01:00.0/power/runtime_status. Without this > > > series, the status will be stuck at "suspending" and you'll get hung task > > > errors in dmesg after a few minutes. > > > > I wasn't quite sure where to add that msleep. I've tested the patches > > as is on top of agd5f's wip branch without ill effects > > > > I've had a radeon and now a amdgpu PRIME setup and don't believe I've > > ever seen this issue > > > > If you could pop a patch together for the msleep I'll give it a test on > > amdgpu > > Here you go, this is for all 3 drivers. > Should deadlock without the series. > Thanks! 
Sorry, I missed that amdgpu_drv.c and radeon_drv.c don't include delay.h, rectified testing patch below:

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
index 50afcf6..beaaf2c 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
@@ -36,6 +36,7 @@
 #include
 #include
+#include <linux/delay.h>
 #include
 #include
 #include
@@ -718,6 +719,9 @@ static int amdgpu_pmops_runtime_suspend(struct device *dev)
 		return -EBUSY;
 	}
 
+	printk("waiting 12 sec\n");
+	msleep(12*1000);
+	printk("done waiting 12 sec\n");
 	drm_dev->switch_power_state = DRM_SWITCH_POWER_CHANGING;
 	drm_kms_helper_poll_disable(drm_dev);
 	vga_switcheroo_set_dynamic_switch(pdev, VGA_SWITCHEROO_OFF);
diff --git a/drivers/gpu/drm/drm_probe_helper.c b/drivers/gpu/drm/drm_probe_helper.c
index 555fbe5..ee7cf0d 100644
--- a/drivers/gpu/drm/drm_probe_helper.c
+++ b/drivers/gpu/drm/drm_probe_helper.c
@@ -586,6 +586,7 @@ static void output_poll_execute(struct work_struct *work)
 			repoll = true;
 		goto out;
 	}
+	dev_info(&dev->pdev->dev, "begin poll\n");
 
 	drm_connector_list_iter_begin(dev, &conn_iter);
 	drm_for_each_connector_iter(connector, &conn_iter) {
@@ -651,6 +652,7 @@ static void output_poll_execute(struct work_struct *work)
 	if (repoll)
 		schedule_delayed_work(delayed_work, DRM_OUTPUT_POLL_PERIOD);
+	dev_info(&dev->pdev->dev, "end poll\n");
 }
 
 /**
diff --git a/drivers/gpu/drm/nouveau/nouveau_drm.c b/drivers/gpu/drm/nouveau/nouveau_drm.c
index 3e29302..f9da5bc 100644
--- a/drivers/gpu/drm/nouveau/nouveau_drm.c
+++ b/drivers/gpu/drm/nouveau/nouveau_drm.c
@@ -855,6 +855,9 @@ static int nouveau_drm_probe(struct pci_dev *pdev,
 		return -EBUSY;
 	}
 
+	printk("waiting 12 sec\n");
+	msleep(12*1000);
+	printk("done waiting 12 sec\n");
 	drm_kms_helper_poll_disable(drm_dev);
 	vga_switcheroo_set_dynamic_switch(pdev, VGA_SWITCHEROO_OFF);
 	nouveau_switcheroo_optimus_dsm();
diff --git a/drivers/gpu/drm/radeon/radeon_drv.c b/drivers/gpu/drm/radeon/radeon_drv.c
index 31dd04f..2b4e7e0 100644
--- a/drivers/gpu/drm/radeon/radeon_drv.c
+++ b/drivers/gpu/drm/radeon/radeon_drv.c
@@ -35,6 +35,7 @@
 #include
 #include
+#include <linux/delay.h>
 #include
 #include
 #include
@@ -413,6 +414,9 @@ static int radeon_pmops_runtime_suspend(struct device *dev)
 		return -EBUSY;
 	}
 
+	printk("waiting 12 sec\n");
+	msleep(12*1000);
+	printk("done waiting 12 sec\n");
 	drm_dev->switch_power_state = DRM_SWITCH_POWER_CHANGING;
 	drm_kms_helper_poll_disable(drm_dev);
 	vga_switcheroo_set_dynamic_switch(pdev, VGA_SWITCHEROO_OFF);
Re: [PATCH 0/5] Fix deadlock on runtime suspend in DRM drivers
On Sun, Feb 11, 2018 at 06:58:11PM +, Mike Lothian wrote: > On 11 February 2018 at 09:38, Lukas Wunner wrote: > > The patches for radeon and amdgpu are compile-tested only, I only have a > > MacBook Pro with an Nvidia GK107 to test. To test the patches, add an > > "msleep(12*1000);" at the top of the driver's ->runtime_suspend hook. > > This ensures that the poll worker runs after ->runtime_suspend has begun. > > Wait 12 sec after the GPU has begun runtime suspend, then check > > /sys/bus/pci/devices/:01:00.0/power/runtime_status. Without this > > series, the status will be stuck at "suspending" and you'll get hung task > > errors in dmesg after a few minutes. > > I wasn't quite sure where to add that msleep. I've tested the patches > as is on top of agd5f's wip branch without ill effects > > I've had a radeon and now a amdgpu PRIME setup and don't believe I've > ever seen this issue > > If you could pop a patch together for the msleep I'll give it a test on > amdgpu Here you go, this is for all 3 drivers. Should deadlock without the series. Thanks! 
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
index 50afcf6..eefa0d4 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
@@ -718,6 +718,9 @@ static int amdgpu_pmops_runtime_suspend(struct device *dev)
 		return -EBUSY;
 	}
 
+	printk("waiting 12 sec\n");
+	msleep(12*1000);
+	printk("done waiting 12 sec\n");
 	drm_dev->switch_power_state = DRM_SWITCH_POWER_CHANGING;
 	drm_kms_helper_poll_disable(drm_dev);
 	vga_switcheroo_set_dynamic_switch(pdev, VGA_SWITCHEROO_OFF);
diff --git a/drivers/gpu/drm/drm_probe_helper.c b/drivers/gpu/drm/drm_probe_helper.c
index 555fbe5..ee7cf0d 100644
--- a/drivers/gpu/drm/drm_probe_helper.c
+++ b/drivers/gpu/drm/drm_probe_helper.c
@@ -586,6 +586,7 @@ static void output_poll_execute(struct work_struct *work)
 			repoll = true;
 		goto out;
 	}
+	dev_info(&dev->pdev->dev, "begin poll\n");
 
 	drm_connector_list_iter_begin(dev, &conn_iter);
 	drm_for_each_connector_iter(connector, &conn_iter) {
@@ -651,6 +652,7 @@ static void output_poll_execute(struct work_struct *work)
 	if (repoll)
 		schedule_delayed_work(delayed_work, DRM_OUTPUT_POLL_PERIOD);
+	dev_info(&dev->pdev->dev, "end poll\n");
 }
 
 /**
diff --git a/drivers/gpu/drm/nouveau/nouveau_drm.c b/drivers/gpu/drm/nouveau/nouveau_drm.c
index 3e29302..f9da5bc 100644
--- a/drivers/gpu/drm/nouveau/nouveau_drm.c
+++ b/drivers/gpu/drm/nouveau/nouveau_drm.c
@@ -855,6 +855,9 @@ static int nouveau_drm_probe(struct pci_dev *pdev,
 		return -EBUSY;
 	}
 
+	printk("waiting 12 sec\n");
+	msleep(12*1000);
+	printk("done waiting 12 sec\n");
 	drm_kms_helper_poll_disable(drm_dev);
 	vga_switcheroo_set_dynamic_switch(pdev, VGA_SWITCHEROO_OFF);
 	nouveau_switcheroo_optimus_dsm();
diff --git a/drivers/gpu/drm/radeon/radeon_drv.c b/drivers/gpu/drm/radeon/radeon_drv.c
index 31dd04f..707b8aa 100644
--- a/drivers/gpu/drm/radeon/radeon_drv.c
+++ b/drivers/gpu/drm/radeon/radeon_drv.c
@@ -413,6 +413,9 @@ static int radeon_pmops_runtime_suspend(struct device *dev)
 		return -EBUSY;
 	}
 
+	printk("waiting 12 sec\n");
+	msleep(12*1000);
+	printk("done waiting 12 sec\n");
 	drm_dev->switch_power_state = DRM_SWITCH_POWER_CHANGING;
 	drm_kms_helper_poll_disable(drm_dev);
 	vga_switcheroo_set_dynamic_switch(pdev, VGA_SWITCHEROO_OFF);
Re: [PATCH 0/5] Fix deadlock on runtime suspend in DRM drivers
On 11 February 2018 at 09:38, Lukas Wunner wrote: > Fix a deadlock on hybrid graphics laptops that's been present since 2013: > > DRM drivers poll connectors in 10 sec intervals. The poll worker is > stopped on ->runtime_suspend with cancel_delayed_work_sync(). However > the poll worker invokes the DRM drivers' ->detect callbacks, which call > pm_runtime_get_sync(). If the poll worker starts after runtime suspend > has begun, pm_runtime_get_sync() will wait for runtime suspend to finish > with the intention of runtime resuming the device afterwards. The result > is a circular wait between poll worker and autosuspend worker. > > I'm seeing this deadlock so often it's not funny anymore. I've also found > 3 nouveau bugzillas about it and 1 radeon bugzilla. (See patch [3/5] and > [4/5].) > > One theoretical solution would be to add a flag to the ->detect callback > to tell it that it's running in the poll worker's context. In that case > it doesn't need to call pm_runtime_get_sync() because the poll worker is > only enabled while runtime active and we know that ->runtime_suspend > waits for it to finish. But this approach would require touching every > single DRM driver's ->detect hook. Moreover the ->detect hook is called > from numerous other places, both in the DRM library as well as in each > driver, making it necessary to trace every possible call chain and check > if it's coming from the poll worker or not. I've tried to do that for > nouveau (see the call sites listed in the commit message of patch [3/5]) > and concluded that this approach is a nightmare to implement. > > Another reason for the infeasibility of this approach is that ->detect > is called from within the DRM library without driver involvement, e.g. > via DRM's sysfs interface. In those cases, pm_runtime_get_sync() would > have to be called by the DRM library, but the DRM library is supposed to > stay generic, whereas runtime PM is driver-specific. 
> > So instead, I've come up with this much simpler solution which gleans > from the current task's flags if it's a workqueue worker, retrieves the > work_struct and checks if it's the poll worker. All that's needed is > one small helper in the workqueue code and another small helper in the > DRM library. This solution lends itself nicely to stable-backporting. > > The patches for radeon and amdgpu are compile-tested only, I only have a > MacBook Pro with an Nvidia GK107 to test. To test the patches, add an > "msleep(12*1000);" at the top of the driver's ->runtime_suspend hook. > This ensures that the poll worker runs after ->runtime_suspend has begun. > Wait 12 sec after the GPU has begun runtime suspend, then check > /sys/bus/pci/devices/:01:00.0/power/runtime_status. Without this > series, the status will be stuck at "suspending" and you'll get hung task > errors in dmesg after a few minutes. > > i915, malidp and msm "solved" this issue by not stopping the poll worker > on runtime suspend. But this results in the GPU bouncing back and forth > between D0 and D3 continuously. That's a total no-go for GPUs which > runtime suspend to D3cold since every suspend/resume cycle costs a > significant amount of time and energy. (i915 and malidp do not seem > to acquire a runtime PM ref in the ->detect callbacks, which seems > questionable. msm however does and would also deadlock if it disabled > the poll worker on runtime suspend. cc += Archit, Liviu, intel-gfx) > > Please review. 
Thanks, > > Lukas > > Lukas Wunner (5): > workqueue: Allow retrieval of current task's work struct > drm: Allow determining if current task is output poll worker > drm/nouveau: Fix deadlock on runtime suspend > drm/radeon: Fix deadlock on runtime suspend > drm/amdgpu: Fix deadlock on runtime suspend > > drivers/gpu/drm/amd/amdgpu/amdgpu_connectors.c | 58 +--- > drivers/gpu/drm/drm_probe_helper.c | 14 + > drivers/gpu/drm/nouveau/nouveau_connector.c | 18 +-- > drivers/gpu/drm/radeon/radeon_connectors.c | 74 +- > include/drm/drm_crtc_helper.h | 1 + > include/linux/workqueue.h | 1 + > kernel/workqueue.c | 16 ++ > 7 files changed, 132 insertions(+), 50 deletions(-) > > -- > 2.15.1 > > ___ > dri-devel mailing list > dri-devel@lists.freedesktop.org > https://lists.freedesktop.org/mailman/listinfo/dri-devel I wasn't quite sure where to add that msleep. I've tested the patches as is on top of agd5f's wip branch without ill effects I've had a radeon and now a amdgpu PRIME setup and don't believe I've ever seen this issue If you could pop a patch together for the msleep I'll give it a test on amdgpu Cheers Mike
[PATCH 0/5] Fix deadlock on runtime suspend in DRM drivers
Fix a deadlock on hybrid graphics laptops that's been present since 2013:

DRM drivers poll connectors in 10 sec intervals. The poll worker is stopped on ->runtime_suspend with cancel_delayed_work_sync(). However the poll worker invokes the DRM drivers' ->detect callbacks, which call pm_runtime_get_sync(). If the poll worker starts after runtime suspend has begun, pm_runtime_get_sync() will wait for runtime suspend to finish with the intention of runtime resuming the device afterwards. The result is a circular wait between poll worker and autosuspend worker.

I'm seeing this deadlock so often it's not funny anymore. I've also found 3 nouveau bugzillas about it and 1 radeon bugzilla. (See patch [3/5] and [4/5].)

One theoretical solution would be to add a flag to the ->detect callback to tell it that it's running in the poll worker's context. In that case it doesn't need to call pm_runtime_get_sync() because the poll worker is only enabled while runtime active and we know that ->runtime_suspend waits for it to finish. But this approach would require touching every single DRM driver's ->detect hook. Moreover the ->detect hook is called from numerous other places, both in the DRM library as well as in each driver, making it necessary to trace every possible call chain and check if it's coming from the poll worker or not. I've tried to do that for nouveau (see the call sites listed in the commit message of patch [3/5]) and concluded that this approach is a nightmare to implement.

Another reason for the infeasibility of this approach is that ->detect is called from within the DRM library without driver involvement, e.g. via DRM's sysfs interface. In those cases, pm_runtime_get_sync() would have to be called by the DRM library, but the DRM library is supposed to stay generic, whereas runtime PM is driver-specific. 
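[Editor's note: the circular wait described above can be sketched as two call chains — simplified pseudocode of the sequence, not exact kernel signatures.]

```
autosuspend worker                       output poll worker
------------------                       ------------------
rpm_suspend()                            output_poll_execute()
  driver ->runtime_suspend()               connector ->detect()
    drm_kms_helper_poll_disable()            pm_runtime_get_sync()
      cancel_delayed_work_sync()               waits for the suspend
        waits for the poll worker              in progress to finish
        to finish                  <== circular wait ==>
```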
So instead, I've come up with this much simpler solution which gleans from the current task's flags if it's a workqueue worker, retrieves the work_struct and checks if it's the poll worker. All that's needed is one small helper in the workqueue code and another small helper in the DRM library. This solution lends itself nicely to stable-backporting.

The patches for radeon and amdgpu are compile-tested only, I only have a MacBook Pro with an Nvidia GK107 to test. To test the patches, add an "msleep(12*1000);" at the top of the driver's ->runtime_suspend hook. This ensures that the poll worker runs after ->runtime_suspend has begun. Wait 12 sec after the GPU has begun runtime suspend, then check /sys/bus/pci/devices/:01:00.0/power/runtime_status. Without this series, the status will be stuck at "suspending" and you'll get hung task errors in dmesg after a few minutes.

i915, malidp and msm "solved" this issue by not stopping the poll worker on runtime suspend. But this results in the GPU bouncing back and forth between D0 and D3 continuously. That's a total no-go for GPUs which runtime suspend to D3cold since every suspend/resume cycle costs a significant amount of time and energy. (i915 and malidp do not seem to acquire a runtime PM ref in the ->detect callbacks, which seems questionable. msm however does and would also deadlock if it disabled the poll worker on runtime suspend. cc += Archit, Liviu, intel-gfx)

Please review. 
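[Editor's note: the approach above — one small helper in the workqueue code, one in the DRM library — can be modeled in plain userspace C. This is a hedged sketch, not the actual kernel patch: the struct layout, the `current_work_item` stand-in, the `other_fn` work item and the `is_poll_worker()` name are illustrative stubs; only the idea (compare the running work item's callback against the poll function) mirrors the series.]

```c
#include <stdbool.h>
#include <stddef.h>

/* Userspace model of the series' approach. In the kernel, the workqueue
 * core knows which work_struct a worker task is currently executing; the
 * new workqueue helper exposes that, and the new DRM helper compares the
 * callback against the output poll function. Everything here is a stub. */

struct work_struct {
	void (*func)(struct work_struct *work);
};

/* Stand-in for per-task workqueue state: the work item the calling
 * "task" is executing, or NULL if it is not a workqueue worker. */
static struct work_struct *current_work_item;

/* Models the new workqueue helper: which work is this task running? */
static struct work_struct *current_work(void)
{
	return current_work_item;
}

static void output_poll_execute(struct work_struct *work)
{
	/* would poll connectors and call ->detect */
}

static void other_fn(struct work_struct *work)
{
	/* some unrelated work item, for contrast */
}

/* Models the new DRM helper: are we being called from the poll worker?
 * If so, ->detect must not call pm_runtime_get_sync(), which avoids the
 * circular wait with the autosuspend worker. */
static bool is_poll_worker(void)
{
	struct work_struct *work = current_work();

	return work && work->func == output_poll_execute;
}
```

A driver's ->detect hook would then take a runtime PM reference only when this check is false, which is the shape of the nouveau/radeon/amdgpu patches in the series.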
Thanks,

Lukas

Lukas Wunner (5):
  workqueue: Allow retrieval of current task's work struct
  drm: Allow determining if current task is output poll worker
  drm/nouveau: Fix deadlock on runtime suspend
  drm/radeon: Fix deadlock on runtime suspend
  drm/amdgpu: Fix deadlock on runtime suspend

 drivers/gpu/drm/amd/amdgpu/amdgpu_connectors.c | 58 +---
 drivers/gpu/drm/drm_probe_helper.c             | 14 +
 drivers/gpu/drm/nouveau/nouveau_connector.c    | 18 +--
 drivers/gpu/drm/radeon/radeon_connectors.c     | 74 +-
 include/drm/drm_crtc_helper.h                  |  1 +
 include/linux/workqueue.h                      |  1 +
 kernel/workqueue.c                             | 16 ++
 7 files changed, 132 insertions(+), 50 deletions(-)

--
2.15.1