nouveau: Don't leak runtime pm ref on driver unload

Peter Wu Mon, 30 May 2016 19:03:46 +0200

On Sun, May 29, 2016 at 05:50:06PM +0200, Lukas Wunner wrote:
> Hi Peter,
> 
> On Fri, May 27, 2016 at 03:07:33AM +0200, Peter Wu wrote:
> > On Tue, May 24, 2016 at 06:03:27PM +0200, Lukas Wunner wrote:
> > > nouveau_drm_load() calls pm_runtime_put() if nouveau_runtime_pm != 0,
> > > but nouveau_drm_unload() calls pm_runtime_get_sync() unconditionally.
> > > We therefore leak a runtime pm ref whenever nouveau is loaded with
> > > runpm=0 and then unloaded. The GPU will subsequently never runtime
> > > suspend even if nouveau is loaded again with runpm=1.
> > > 
> > > Fix by taking the runtime pm ref under the same condition that it was
> > > released on driver load.
> > > 
> > > Fixes: 5addcf0a5f0f ("nouveau: add runtime PM support (v0.9)")
> > > Cc: Dave Airlie <airlied at redhat.com>
> > > Reported-by: Karol Herbst <nouveau at karolherbst.de>
> > > Tested-by: Karol Herbst <nouveau at karolherbst.de>
> > > Signed-off-by: Lukas Wunner <lukas at wunner.de>
> > 
> > Looks good, I tested this scenario:
> > 
> >     ru(){ cat /sys/bus/pci/devices/0000\:01:00.0/power/runtime_usage;}
> >     ru # reports 1
> >     modprobe nouveau runpm=0
> >     ru # reports 2
> >     rmmod nouveau
> >     ru # reports 1
> > 
> > Without runpm=0 the count drops to 0 in the second step and stays 0 in
> > the third step. After applying patch 2/9, this correctly reports 1 as
> > expected (this is the same as manually setting power/control to on).
> 
> How exactly did you reach the situation where the root port didn't wake
> up when you tried to load nouveau again? (IRC conversation this week.)


Ensure that the pci/pm patches are applied, then:

 0. Unload nouveau (I have blacklisted it for testing).
 1. Enable rpm for the root port and children (control = auto).
 2. Verify in the kernel logs that the devices are sleeping:
        pcieport 0000:00:01.0: power state changed by ACPI to D3cold
 3. (Optional, to rule out issues with delays:) Disable rpm for the
    Nvidia device (control = on).
 4. modprobe nouveau.
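
For anyone repeating steps 1-3, a rough shell sketch follows. The function
names and the SYSFS override are made up for illustration and are not part of
any existing tool; writing to power/control needs root.

```shell
# Sketch only: function names and the SYSFS override are illustrative.

# Path of a PCI device directory in sysfs, e.g. pci_dev 0000:00:01.0
pci_dev() {
    printf '%s/bus/pci/devices/%s\n' "${SYSFS:-/sys}" "$1"
}

# Step 1: control = auto for a port and the PCI devices directly below it.
rpm_auto() {
    port=$(pci_dev "$1")
    echo auto > "$port/power/control" || return 1
    for child in "$port"/????:??:??.?; do
        if [ -e "$child/power/control" ]; then
            echo auto > "$child/power/control"
        fi
    done
}

# Step 2: runtime PM status as reported by the kernel (active, suspended, ...).
rpm_status() {
    cat "$(pci_dev "$1")/power/runtime_status"
}

# Step 3: control = on, which keeps the device from runtime suspending.
rpm_on() {
    echo on > "$(pci_dev "$1")/power/control"
}
```

Typical use would be "rpm_auto 0000:00:01.0", then "rpm_status 0000:00:01.0"
after a few seconds, then "rpm_on 0000:01:00.0" before the modprobe. Note that
runtime_status only shows what the runtime PM core believes; the D3cold
transition itself still has to be confirmed in the kernel log.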

The above test with v4.6 + 4 pci/pm patches (8b71f565) gives:

    50.245795 MXM: GUID detected in BIOS
    50.245948    nseval-0227 ns_evaluate           : **** Execute method [\_SB.PCI0.GFX0._DSM] at AML address ffffc90000013b11 length 492
    50.246016 ACPI Warning: \_SB.PCI0.GFX0._DSM: Argument #4 type mismatch - Found [Buffer], ACPI requires [Package] (20160108/nsarguments-95)
    50.246044    nseval-0227 ns_evaluate           : **** Execute method [\_SB.PCI0.GFX0._DSM] at AML address ffffc90000013b11 length 492
    50.246110    nseval-0227 ns_evaluate           : **** Execute method [\_SB.PCI0.PEG0.PEGP._DSM] at AML address ffffc90000018297 length 1F
    50.246256 ACPI Warning: \_SB.PCI0.PEG0.PEGP._DSM: Argument #4 type mismatch - Found [Buffer], ACPI requires [Package] (20160108/nsarguments-95)
    50.246289    nseval-0227 ns_evaluate           : **** Execute method [\_SB.PCI0.PEG0.PEGP._DSM] at AML address ffffc90000018297 length 1F
    50.246443 ACPI Warning: \_SB.PCI0.PEG0.PEGP._DSM: Argument #4 type mismatch - Found [Buffer], ACPI requires [Package] (20160108/nsarguments-95)
    50.246457    nseval-0227 ns_evaluate           : **** Execute method [\_SB.PCI0.PEG0.PEGP._DSM] at AML address ffffc90000018297 length 1F
    50.246932 pci 0000:01:00.0: optimus capabilities: enabled, status dynamic power, hda bios codec supported
    50.247005 VGA switcheroo: detected Optimus DSM method \_SB_.PCI0.PEG0.PEGP handle
    50.247084    nseval-0227 ns_evaluate           : **** Execute method [\_SB.PCI0.PEG0.PG00._ON] at AML address ffffc9000001086e length 11D
    50.390140 pcieport 0000:00:01.0: power state changed by ACPI to D0
    50.491893    nseval-0227 ns_evaluate           : **** Execute method [\_SB.PCI0.PEG0._DSW] at AML address ffffc90000010a2d length 1D
    50.492285 pcieport 0000:00:01.0: PME# disabled
    50.492583 nouveau 0000:01:00.0: unknown chipset (ffffffff)
    50.492687 nouveau: probe of 0000:01:00.0 failed with error -12
    50.501990    nseval-0227 ns_evaluate           : **** Execute method [\_SB.PCI0.PEG0._S0W] at AML address ffffc90000010a8e length 2
    50.502403 pcieport 0000:00:01.0: PME# enabled
    50.502601    nseval-0227 ns_evaluate           : **** Execute method [\_SB.PCI0.PEG0._DSW] at AML address ffffc90000010a2d length 1D
    50.513005    nseval-0227 ns_evaluate           : **** Execute method [\_SB.PCI0.PEG0.PG00._OFF] at AML address ffffc90000010994 length 6D
    50.533258 pcieport 0000:00:01.0: power state changed by ACPI to D3cold

(Note that this patch is not included.) When nouveau is operating
normally, I see that _PS0 is also called (which does not happen above).
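
To check which methods were evaluated without reading the trace by eye,
something like the following could be used. It is only a sketch: the function
names are invented here, and it assumes the kernel log contains "Execute
method [...]" lines from ACPICA debug output like the ones in the log above.

```shell
# Sketch: list the distinct ACPI methods that a kernel log shows being
# executed. Reads log text on stdin; function names are illustrative.
acpi_methods() {
    sed -n 's/.*Execute method \[\([^]]*\)\].*/\1/p' | sort -u
}

# Succeeds if a method with the given final name segment (e.g. _PS0)
# appears among the executed methods in the log on stdin.
acpi_method_called() {
    acpi_methods | grep -q "\.$1\$"
}
```

For example, "dmesg | acpi_method_called _PS0 || echo no _PS0 seen" would
flag a run like the one above.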

If you think that mixing power resources with DSM causes this issue: I
also tried applying my power resources work for nouveau, but it gives the
same problem:

    20.183306 MXM: GUID detected in BIOS
    20.183606 ACPI Warning: \_SB.PCI0.GFX0._DSM: Argument #4 type mismatch - Found [Buffer], ACPI requires [Package] (20160108/nsarguments-95)
    20.184158 ACPI Warning: \_SB.PCI0.PEG0.PEGP._DSM: Argument #4 type mismatch - Found [Buffer], ACPI requires [Package] (20160108/nsarguments-95)
    20.184547 ACPI Warning: \_SB.PCI0.PEG0.PEGP._DSM: Argument #4 type mismatch - Found [Buffer], ACPI requires [Package] (20160108/nsarguments-95)
    20.185152 pci 0000:01:00.0: optimus capabilities: enabled, status dynamic power, hda bios codec supported
    20.185351 VGA switcheroo: detected Optimus DSM method \_SB_.PCI0.PEG0.PEGP handle
    20.185384 nouveau: detected PR support, will not use DSM
    20.185552 nouveau 0000:01:00.0: enabling device (0000 -> 0003)
    20.185873 nouveau 0000:01:00.0: unknown chipset (ffffffff)
    20.185946 nouveau: probe of 0000:01:00.0 failed with error -12

> What's happening is, the PCI core will keep unbound devices (i.e.,
> without driver) in D0 but the runtime status is allowed to change
> to "suspended". So it'll appear to the kernel as if it was suspended
> but in reality it stays in D0.
> 
> Once runtime pm for PCIe ports gets merged, the root port above the
> GPU will indeed go to D3 in such a situation because the check
> pm_children_suspended() (called from rpm_check_suspend_allowed())
> returns true.
> 
> I'm not sure if this is desirable or not. If we keep unbound devices
> in D0, should we allow ports above them to go to D3?

Maybe Rafael (linux-pm / linux-pci) can answer this question better?
The comments in local_pci_probe, pci_pm_runtime_suspend and
pci_pm_runtime_resume suggest that unbound devices are assumed to be in D0,
which is apparently not the case when runtime PM is enabled.
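
To spot devices in that situation, a sketch like the one below lists PCI
devices that have no driver bound, together with their runtime PM status. The
function name and the SYSFS override are invented for illustration. The status
is only the runtime PM core's view and says nothing about the real power state
of the hardware.

```shell
# Sketch: print "<device> <runtime_status>" for every PCI device that has
# no driver bound. Function name and SYSFS override are illustrative.
unbound_rpm_status() {
    for dev in "${SYSFS:-/sys}"/bus/pci/devices/*; do
        # Skip entries without runtime PM attributes (or an unmatched glob).
        [ -e "$dev/power/runtime_status" ] || continue
        # A bound device has a "driver" symlink in its sysfs directory.
        [ -e "$dev/driver" ] && continue
        printf '%s %s\n' "${dev##*/}" "$(cat "$dev/power/runtime_status")"
    done
}
```

An unbound device that shows up here as "suspended" matches the case Lukas
describes above.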

> In any case, when nouveau is loaded again, local_pci_probe() will
> call pm_runtime_get_sync(), which will implicitly set the runtime
> status to "active" and which should also wake parents. So how did
> you ever reach a point where you loaded nouveau and the root port
> stayed asleep? Clearly we have a bug there, question is where.
> This shouldn't work only if pm_runtime_forbid() was called on
> driver unload.
> 
> Thanks for the extensive testing!
> Lukas

Both devices (root port and Nvidia) were resumed, but somehow the Nvidia
card was not fully initialized/ready (as you can see in the above logs).

Peter

> > 
> > Peter
> > 
> > > ---
> > >  drivers/gpu/drm/nouveau/nouveau_drm.c | 5 ++++-
> > >  1 file changed, 4 insertions(+), 1 deletion(-)
> > > 
> > > diff --git a/drivers/gpu/drm/nouveau/nouveau_drm.c b/drivers/gpu/drm/nouveau/nouveau_drm.c
> > > index 11f8dd9..faf7438 100644
> > > --- a/drivers/gpu/drm/nouveau/nouveau_drm.c
> > > +++ b/drivers/gpu/drm/nouveau/nouveau_drm.c
> > > @@ -498,7 +498,10 @@ nouveau_drm_unload(struct drm_device *dev)
> > >  {
> > >   struct nouveau_drm *drm = nouveau_drm(dev);
> > >  
> > > - pm_runtime_get_sync(dev->dev);
> > > + if (nouveau_runtime_pm != 0) {
> > > +         pm_runtime_get_sync(dev->dev);
> > > + }
> > > +
> > >   nouveau_fbcon_fini(dev);
> > >   nouveau_accel_fini(drm);
> > >   nouveau_hwmon_fini(dev);
> > > -- 
> > > 2.8.1
> > > 
> > > _______________________________________________
> > > Nouveau mailing list
> > > Nouveau at lists.freedesktop.org
> > > https://lists.freedesktop.org/mailman/listinfo/nouveau

[Nouveau] [PATCH 1/9] drm/nouveau: Don't leak runtime pm ref on driver unload
