Note: I did read your response lower down in the thread, but I wanted to make
sure I addressed one of the comments here (see below).

On Thu, 2019-03-21 at 17:48 -0500, Bjorn Helgaas wrote:
> [+cc Rafael]
>
> On Wed, Mar 13, 2019 at 06:25:02PM -0400, Lyude Paul wrote:
> > On Fri, 2019-02-15 at 16:17 -0500, Lyude Paul wrote:
> > > On Thu, 2019-02-14 at 18:43 -0600, Bjorn Helgaas wrote:
> > > > On Tue, Feb 12, 2019 at 05:02:30PM -0500, Lyude Paul wrote:
> > > > > On a very specific subset of ThinkPad P50 SKUs, particularly
> > > > > ones that come with a Quadro M1000M chip instead of the M2000M
> > > > > variant, the BIOS seems to have a very nasty habit of not
> > > > > always resetting the secondary Nvidia GPU between full reboots
> > > > > if the laptop is configured in Hybrid Graphics mode. The
> > > > > reason for this happening is unknown, but the following steps
> > > > > and possibly a good bit of patience will reproduce the issue:
> > > > >
> > > > > > 1. Boot up the laptop normally in Hybrid graphics mode
> > > > > > 2. Make sure nouveau is loaded and that the GPU is awake
> > > > > > 3. Allow the nvidia GPU to runtime suspend itself after being idle
> > > > > > 4. Reboot the machine, the more sudden the better (e.g. sysrq-b may
> > > > > >    help)
> > > > > > 5. If nouveau loads up properly, reboot the machine again and go
> > > > > >    back to step 2 until you reproduce the issue
> > > > >
> > > > > This results in some very strange behavior: the GPU will quite
> > > > > literally be left in exactly the same state it was in when the
> > > > > previously booted kernel started the reboot. This has all
> > > > > > sorts of bad side effects: for starters, this completely breaks
> > > > > nouveau starting with a mysterious EVO channel failure that
> > > > > happens well before we've actually used the EVO channel for
> > > > > anything:
>
> Thanks for the hybrid tutorial (snipped from this response). IIUC,
> what you said was that in hybrid mode, the Intel GPU drives the
> built-in display and the Nvidia GPU drives any external displays and
> may be used for DRI PRIME rendering (whatever that is). But since you
> say the Nvidia device gets runtime suspended, I assume there's no
> external display here and you're not using DRI PRIME.
>
> I wonder if it's related to the fact that the Nvidia GPU has been
> runtime suspended before you do the reboot. Can you try turning off
> runtime power management for the GPU by setting the runpm module
> parameter to 0? I *think* this would be booting with
> "nouveau.runpm=0".
>
> > > > Is there a bug report for this? Bugzilla.kernel.org would be ideal,
> > > > including "lspci -vvxxx" and dmidecode for the system.
> > > >
> > > Not yet, but there has been discussion about this between nouveau
> > > developers on our IRC channel.
> >
> > I lied: yes there actually is a bug report for this, but it's
> > currently on the Red Hat bugzilla. I can get more information from
> > it if you need (with Lenovo's approval of course).
>
> Can you please make a bugzilla.kernel.org entry with as much
> information (dmesg, "lspci -vvxxx", dmidecode, etc) as you can get
> approval for? You can include the Red Hat bugzilla URL in the commit
> log, too, but that's not quite as good because we have no control over
> whether it's public.
>
> > And additionally: I've been working with Lenovo on this issue for a
> > couple of months now, and we've gone through dozens of different
> > trial BIOSes with no success thus far. However, Lenovo is currently
> > working on trying to add this workaround into their BIOS but I've
> > been told that this change is going to take a decent amount of time
> > since they need to test it across multiple operating systems. I'd be
> > happy to come back and add a conditional later to turn this
> > workaround off for later BIOS versions once Lenovo has released a
> > proper fix.
>
> Sounds like Lenovo is going to a lot of trouble for this. The ideal
> thing from my point of view would be if they could figure out why this
> works on Windows but not on Linux. I doubt Windows has a quirk like
> this, so if we could figure out why it works on Windows, we could
> likely do something similar in Linux.

I did actually try this route after first finding this bug, but unfortunately,
from what I understand, there isn't really much more Lenovo can do other than
give us a patched BIOS or look at their own BIOS to see if it's the cause.

Anyway, I went ahead and filed a bug with as much information as I could get my
hands on here (from a different email than the one I'm talking to you from):
https://bugzilla.kernel.org/show_bug.cgi?id=203003

>
> > > > > > So to do this, we add a new pci quirk using
> > > > > > DECLARE_PCI_FIXUP_CLASS_FINAL that will be invoked before the PCI
> > > > > > probe at boot finishes. From there, we check to make sure that this
> > > > > > is indeed the specific P50 variant of this GPU. We also make sure
> > > > > > that the GPU PCI
> > > >