Re: [Nouveau] [PATCH] pci/quirks: Add quirk to reset nvgpu at boot for the Lenovo ThinkPad P50

2019-04-24 Thread Lyude Paul
On Wed, 2019-04-24 at 17:36 -0500, Bjorn Helgaas wrote:
> On Wed, Apr 24, 2019 at 03:16:37PM -0400, Lyude Paul wrote:
> > On Wed, 2019-04-24 at 13:59 -0500, Bjorn Helgaas wrote:
> > > Not being a scheduled work expert, I was unsure if this experiment was
> > > equivalent to what I proposed.
> > > 
> > > I'm always suspicious of singleton solutions like this (using
> > > schedule_work() in runtime_resume()) because usually they seem to be
> > > solving a generic problem that should happen on many kinds of
> > > hardware.  The 0b2fe6594fa2 ("drm/nouveau: Queue hpd_work on (runtime)
> > > resume") commit log says:
> > > 
> > >   We need to call drm_helper_hpd_irq_event() on resume to properly
> > >   detect monitor connection / disconnection on some laptops, use
> > >   hpd_work for this to avoid deadlocks.
> > > 
> > > The situation of a monitor being connected or disconnected during
> > > suspend can happen to *any* GPU, but the commit only changes nouveau,
> > > which of course raises the question of how we deal with that in other
> > > drivers.  If the Nvidia GPU has some unique behavior related to
> > > monitor connection, that would explain special-case code there, but
> > > the commit doesn't mention anything like that.
> > > 
> > > It should be simple to revert 0b2fe6594fa2 and see whether it changes
> > > the behavior at all (well, simple except for the fact that this
> > > problem isn't 100% reproducible in the first place).
> > 
> > It's not 100% reproducible, but it's at least 90% so it's not
> > difficult for me to test at all.
> > 
> > Also, reverting this commit makes no difference either. 
> 
> OK, great, that makes it crystal clear.  I didn't know you had
> specifically tested that revert.  Thanks for doing that.
> 
> > Note that while that commit only changed nouveau, scheduled_work()
> > is exactly how a number of other drivers (i915 for instance) handle
> > reprobing like this as well.
> 
> OK.  The GPU code would be a lot more approachable if similar things
> were done in similar ways.  I spent an hour or so looking for this
> similar code in i915, but gave up.

We try
> 
> > The reason being that we can't do full connector reprobing in our
> > runtime resume thread because we could deadlock if someone else is
> > holding a modesetting lock we need and waiting on us to resume at
> > the same time (there's a number of other bug fixes in nouveau for
> > other issues caused by the same deadlock scenario). 
> 
> You mention nouveau specifically here, but I assume this is a generic
> deadlock scenario that applies to any GPU, and they all avoid the
> deadlock in the same way.  Right?

Yes-there is only one exception in the tree that I know of (amdgpu/radeon),
but fixing that is on my todo if someone else hasn't already gotten to it yet.

> 
> > I'm confused here though, it sounds like you're running under the
> > assumption that PCI devices like this aren't reset into a clean
> > state during a system reboot, is that correct?
> 
> No, I wasn't trying to say anything about that.  My point here is
> that:
> 
>   - you're reporting a problem that only happens with nouveau and
> only happens during shutdown/reboot
>   - the behavior is similar to a race (not 100% reproducible, seems
> to happen more if shutdown is faster)
>   - shutdown involves resuming the device (see pci_device_shutdown())
>   - nouveau_pmops_runtime_resume() schedules asynchronous work, which
> (to my untrained eye) looks unusual
>   - asynchronous work is inherently subject to races
> 
> So I think that's all somewhat suspicious.  But if the same problem
> happens without the asynchronous work, obviously the issue is
> elsewhere.
> 
> But you *are* right that if the device were actually reset, none of
> this should matter.  It certainly seems that the BIOS neglects to
> reset it in some cases.
> 
> I can sort of imagine a BIOS engineer thinking that if the device
> looks like it's in use, we shouldn't reset it, and it's still
> conceivable that some sort of Linux shutdown race could leave it
> looking like it's in use.  But you've been working with Lenovo on
> this, and it seems like that would be pretty obvious to somebody with
> the BIOS source (though I just demonstrated above that even with the
> source it's easy to miss things).

Yeah, that's what I thought as well! This experience has taught me that
there's a surprising amount of things about BIOSes that BIOS engineers don't
seem to know about…

> 
> I'm out of ideas, so I think your quirk is the best way forward.  I
> trimmed out some of the commit log backtraces and such, added the
> bugzilla, and tweaked the patch to use pci_iomap() instead of
> ioremap().  Would the patch below work for you?

Works for me, tested on the problematic P50 as well.

> 
> 
> commit 18dc5b3c7ddc
> Author: Lyude Paul 
> Date:   Tue Feb 12 17:02:30 2019 -0500
> 
> PCI: Reset Lenovo ThinkPad P50 nvgpu at boot if necessary
> 
> On ThinkPad P50 SKUs with an Nvidia 

Re: [Nouveau] [PATCH] pci/quirks: Add quirk to reset nvgpu at boot for the Lenovo ThinkPad P50

2019-04-24 Thread Bjorn Helgaas
On Wed, Apr 24, 2019 at 03:16:37PM -0400, Lyude Paul wrote:
> On Wed, 2019-04-24 at 13:59 -0500, Bjorn Helgaas wrote:
> > Not being a scheduled work expert, I was unsure if this experiment was
> > equivalent to what I proposed.
> > 
> > I'm always suspicious of singleton solutions like this (using
> > schedule_work() in runtime_resume()) because usually they seem to be
> > solving a generic problem that should happen on many kinds of
> > hardware.  The 0b2fe6594fa2 ("drm/nouveau: Queue hpd_work on (runtime)
> > resume") commit log says:
> > 
> >   We need to call drm_helper_hpd_irq_event() on resume to properly
> >   detect monitor connection / disconnection on some laptops, use
> >   hpd_work for this to avoid deadlocks.
> > 
> > The situation of a monitor being connected or disconnected during
> > suspend can happen to *any* GPU, but the commit only changes nouveau,
> > which of course raises the question of how we deal with that in other
> > drivers.  If the Nvidia GPU has some unique behavior related to
> > monitor connection, that would explain special-case code there, but
> > the commit doesn't mention anything like that.
> > 
> > It should be simple to revert 0b2fe6594fa2 and see whether it changes
> > the behavior at all (well, simple except for the fact that this
> > problem isn't 100% reproducible in the first place).
> 
> It's not 100% reproducible, but it's at least 90% so it's not
> difficult for me to test at all.
> 
> Also, reverting this commit makes no difference either. 

OK, great, that makes it crystal clear.  I didn't know you had
specifically tested that revert.  Thanks for doing that.

> Note that while that commit only changed nouveau, scheduled_work()
> is exactly how a number of other drivers (i915 for instance) handle
> reprobing like this as well.

OK.  The GPU code would be a lot more approachable if similar things
were done in similar ways.  I spent an hour or so looking for this
similar code in i915, but gave up.

> The reason being that we can't do full connector reprobing in our
> runtime resume thread because we could deadlock if someone else is
> holding a modesetting lock we need and waiting on us to resume at
> the same time (there's a number of other bug fixes in nouveau for
> other issues caused by the same deadlock scenario). 

You mention nouveau specifically here, but I assume this is a generic
deadlock scenario that applies to any GPU, and they all avoid the
deadlock in the same way.  Right?

> I'm confused here though, it sounds like you're running under the
> assumption that PCI devices like this aren't reset into a clean
> state during a system reboot, is that correct?

No, I wasn't trying to say anything about that.  My point here is
that:

  - you're reporting a problem that only happens with nouveau and
only happens during shutdown/reboot
  - the behavior is similar to a race (not 100% reproducible, seems
to happen more if shutdown is faster)
  - shutdown involves resuming the device (see pci_device_shutdown())
  - nouveau_pmops_runtime_resume() schedules asynchronous work, which
(to my untrained eye) looks unusual
  - asynchronous work is inherently subject to races

So I think that's all somewhat suspicious.  But if the same problem
happens without the asynchronous work, obviously the issue is
elsewhere.

But you *are* right that if the device were actually reset, none of
this should matter.  It certainly seems that the BIOS neglects to
reset it in some cases.

I can sort of imagine a BIOS engineer thinking that if the device
looks like it's in use, we shouldn't reset it, and it's still
conceivable that some sort of Linux shutdown race could leave it
looking like it's in use.  But you've been working with Lenovo on
this, and it seems like that would be pretty obvious to somebody with
the BIOS source (though I just demonstrated above that even with the
source it's easy to miss things).

I'm out of ideas, so I think your quirk is the best way forward.  I
trimmed out some of the commit log backtraces and such, added the
bugzilla, and tweaked the patch to use pci_iomap() instead of
ioremap().  Would the patch below work for you?


commit 18dc5b3c7ddc
Author: Lyude Paul 
Date:   Tue Feb 12 17:02:30 2019 -0500

PCI: Reset Lenovo ThinkPad P50 nvgpu at boot if necessary

On ThinkPad P50 SKUs with an Nvidia Quadro M1000M instead of the M2000M
variant, the BIOS does not always reset the secondary Nvidia GPU during
reboot if the laptop is configured in Hybrid Graphics mode.  The reason is
unknown, but the following steps and possibly a good bit of patience will
reproduce the issue:

  1. Boot up the laptop normally in Hybrid Graphics mode
  2. Make sure nouveau is loaded and that the GPU is awake
  2. Allow the Nvidia GPU to runtime suspend itself after being idle
  3. Reboot the machine, the more sudden the better (e.g sysrq-b may help)
  4. If nouveau loads up properly, reboot the machine again and go 

Re: [Nouveau] [PATCH] pci/quirks: Add quirk to reset nvgpu at boot for the Lenovo ThinkPad P50

2019-04-24 Thread Lyude Paul
On Wed, 2019-04-24 at 13:59 -0500, Bjorn Helgaas wrote:
> Not being a scheduled work expert, I was unsure if this experiment was
> equivalent to what I proposed.
> 
> I'm always suspicious of singleton solutions like this (using
> schedule_work() in runtime_resume()) because usually they seem to be
> solving a generic problem that should happen on many kinds of
> hardware.  The 0b2fe6594fa2 ("drm/nouveau: Queue hpd_work on (runtime)
> resume") commit log says:
> 
>   We need to call drm_helper_hpd_irq_event() on resume to properly
>   detect monitor connection / disconnection on some laptops, use
>   hpd_work for this to avoid deadlocks.
> 
> The situation of a monitor being connected or disconnected during
> suspend can happen to *any* GPU, but the commit only changes nouveau,
> which of course raises the question of how we deal with that in other
> drivers.  If the Nvidia GPU has some unique behavior related to
> monitor connection, that would explain special-case code there, but
> the commit doesn't mention anything like that.
> 
> It should be simple to revert 0b2fe6594fa2 and see whether it changes
> the behavior at all (well, simple except for the fact that this
> problem isn't 100% reproducible in the first place).

It's not 100% reproducible, but it's at least 90% so it's not difficult for me
to test at all.

Also, reverting this commit makes no difference either. Note that while that
commit only changed nouveau, scheduled_work() is exactly how a number of other
drivers (i915 for instance) handle reprobing like this as well. The reason
being that we can't do full connector reprobing in our runtime resume thread
because we could deadlock if someone else is holding a modesetting lock we
need and waiting on us to resume at the same time (there's a number of other
bug fixes in nouveau for other issues caused by the same deadlock scenario). 

I'm confused here though, it sounds like you're running under the assumption
that PCI devices like this aren't reset into a clean state during a system
reboot, is that correct?

> 
> > Do we want to have this discussion on the bz btw, or is this email
> > thread fine?
> 
> Email is fine.
-- 
Cheers,
Lyude Paul

___
Nouveau mailing list
Nouveau@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/nouveau

Re: [Nouveau] [PATCH] pci/quirks: Add quirk to reset nvgpu at boot for the Lenovo ThinkPad P50

2019-04-24 Thread Bjorn Helgaas
On Mon, Apr 15, 2019 at 02:07:18PM -0400, Lyude Paul wrote:
> On Thu, 2019-04-04 at 09:17 -0500, Bjorn Helgaas wrote:
> > [+cc Hans, author of 0b2fe6594fa2 ("drm/nouveau: Queue hpd_work on (runtime)
> > resume")]
> > 
> > On Fri, Mar 22, 2019 at 06:30:15AM -0500, Bjorn Helgaas wrote:
> > > On Thu, Mar 21, 2019 at 05:48:19PM -0500, Bjorn Helgaas wrote:
> > > > On Wed, Mar 13, 2019 at 06:25:02PM -0400, Lyude Paul wrote:
> > > > > On Fri, 2019-02-15 at 16:17 -0500, Lyude Paul wrote:
> > > > > > On Thu, 2019-02-14 at 18:43 -0600, Bjorn Helgaas wrote:
> > > > > > > On Tue, Feb 12, 2019 at 05:02:30PM -0500, Lyude Paul wrote:
> > > > > > > > On a very specific subset of ThinkPad P50 SKUs, particularly
> > > > > > > > ones that come with a Quadro M1000M chip instead of the M2000M
> > > > > > > > variant, the BIOS seems to have a very nasty habit of not
> > > > > > > > always resetting the secondary Nvidia GPU between full reboots
> > > > > > > > if the laptop is configured in Hybrid Graphics mode. The
> > > > > > > > reason for this happening is unknown, but the following steps
> > > > > > > > and possibly a good bit of patience will reproduce the issue:
> > > > > > > > 
> > > > > > > > 1. Boot up the laptop normally in Hybrid graphics mode
> > > > > > > > 2. Make sure nouveau is loaded and that the GPU is awake
> > > > > > > > 2. Allow the nvidia GPU to runtime suspend itself after being
> > > > > > > > idle
> > > > > > > > 3. Reboot the machine, the more sudden the better (e.g sysrq-b
> > > > > > > > may help)
> > > > > > > > 4. If nouveau loads up properly, reboot the machine again and go
> > > > > > > > back to
> > > > > > > > step 2 until you reproduce the issue
> > > > > > > > 
> > > > > > > > This results in some very strange behavior: the GPU will quite
> > > > > > > > literally be left in exactly the same state it was in when the
> > > > > > > > previously booted kernel started the reboot. This has all
> > > > > > > > sorts of bad sideaffects: for starters, this completely breaks
> > > > > > > > nouveau starting with a mysterious EVO channel failure that
> > > > > > > > happens well before we've actually used the EVO channel for
> > > > > > > > anything:
> > > > 
> > > > Thanks for the hybrid tutorial (snipped from this response).  IIUC,
> > > > what you said was that in hybrid mode, the Intel GPU drives the
> > > > built-in display and the Nvidia GPU drives any external displays and
> > > > may be used for DRI PRIME rendering (whatever that is).  But since you
> > > > say the Nvidia device gets runtime suspended, I assume there's no
> > > > external display here and you're not using DRI PRIME.
> > > > 
> > > > I wonder if it's related to the fact that the Nvidia GPU has been
> > > > runtime suspended before you do the reboot.  Can you try turning of
> > > > runtime power management for the GPU by setting the runpm module
> > > > parameter to 0?  I *think* this would be booting with
> > > > "nouveau.runpm=0".
> > > 
> > > Sorry, I wasn't really thinking here.  You already *said* this is
> > > related to runtime suspend.  It only happens when the Nvidia GPU has
> > > been suspended.
> > > 
> > > I don't know that much about suspend, but ISTR seeing comments about
> > > resuming devices before we shutdown.  If we do that, maybe there's
> > > some kind of race between that resume and the reboot?
> > 
> > I think we do in fact resume PCI devices before shutdown.  Here's the
> > path I'm looking at:
> > 
> >   device_shutdown
> > pm_runtime_get_noresume
> > pm_runtime_barrier
> > dev->bus->shutdown
> >   pci_device_shutdown
> > pm_runtime_resume
> >   __pm_runtime_resume(dev, 0)
> > rpm_resume(dev, 0)
> >   __update_runtime_status(dev, RPM_RESUMING)
> >   callback = RPM_GET_CALLBACK(dev, runtime_resume)
> >   rpm_callback(callback, dev)
> > __rpm_callback
> >   pci_pm_runtime_resume
> > drv->pm->runtime_resume
> >   nouveau_pmops_runtime_resume
> > nouveau_do_resume
> > schedule_work(hpd_work)   # <---
> > ...
> > nouveau_display_hpd_work
> >   pm_runtime_get_sync
> >   drm_helper_hpd_irq_event
> >   pm_runtime_mark_last_busy
> >   pm_runtime_put_sync
> > 
> > I'm curious about that "schedule_work(hpd_work)" near the end because
> > no other drivers seem to use schedule_work() in the runtime_resume
> > path, and I don't know how that synchronizes with the shutdown
> > process.  I don't see anything that waits for
> > nouveau_display_hpd_work() to complete, so it seems like something
> > that could be a race.
> > 
> > I wonder this problem would be easier to reproduce if you added a
> > sleep in nouveau_display_hpd_work() as in the first hunk below, and I
> > wonder if the 

Re: [Nouveau] [PATCH] pci/quirks: Add quirk to reset nvgpu at boot for the Lenovo ThinkPad P50

2019-04-24 Thread Bjorn Helgaas
On Wed, Apr 24, 2019 at 01:31:09PM -0400, Lyude Paul wrote:
> Any update on this? This has been waiting for a while now

Oh, sorry, I guess we were both waiting for the other.  Let me respond to
the email with context.

> On Thu, 2019-04-04 at 09:17 -0500, Bjorn Helgaas wrote:
> > [+cc Hans, author of 0b2fe6594fa2 ("drm/nouveau: Queue hpd_work on (runtime)
> > resume")]
> -- 
> Cheers,
>   Lyude Paul
> 
___
Nouveau mailing list
Nouveau@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/nouveau

Re: [Nouveau] [PATCH] pci/quirks: Add quirk to reset nvgpu at boot for the Lenovo ThinkPad P50

2019-04-24 Thread Lyude Paul
Any update on this? This has been waiting for a while now

On Thu, 2019-04-04 at 09:17 -0500, Bjorn Helgaas wrote:
> [+cc Hans, author of 0b2fe6594fa2 ("drm/nouveau: Queue hpd_work on (runtime)
> resume")]
-- 
Cheers,
Lyude Paul

___
Nouveau mailing list
Nouveau@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/nouveau

[Nouveau] [Bug 110500] X-Server crashes - GL error: GL_OUT_OF_MEMORY in glTexSubImage

2019-04-24 Thread bugzilla-daemon
https://bugs.freedesktop.org/show_bug.cgi?id=110500

--- Comment #8 from Jordan Belfort  ---
The mistake messages originate from _glamor_create_tex() which is called by
glamor_create_fbo(), and when this happens, pixmap_priv->fbo is set to NULL
since, well, the FBO couldn't be distributed.
http://www.domyassignmentforme.com/help-homework

-- 
You are receiving this mail because:
You are the assignee for the bug.
You are the QA Contact for the bug.___
Nouveau mailing list
Nouveau@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/nouveau

[Nouveau] [Bug 110500] X-Server crashes - GL error: GL_OUT_OF_MEMORY in glTexSubImage

2019-04-24 Thread bugzilla-daemon
https://bugs.freedesktop.org/show_bug.cgi?id=110500

--- Comment #7 from wixbi...@guerrillamail.com  ---
I talked with a developer. I get told, that when a driver provide a
functionality, that functionality should be working. They should be able to
rely on a normal, expected functionality.

For the note: Karol told me he would take a look into the memory issue because
he know that this is a driver issue


In the meantime i had the issue that suddenly the picture start flickering. It
then flicker whole time. The OS was unusable. The computer had to be restarted
to stop this flickering.
I looked around in the internet and this seems to be also a driver issue. 
This issue is also now the second time there. It appear when i start scrolling
up and down the https://bugzilla.freedesktop.org website with the Mozilla
Firefox browser. Should i open a second bugreport for this issue or is it
related? I cant find anything in dmesg or any other logfiles. The os is
flickering when opening windows, scrolling in the webbrowser, or any other
movement.

-- 
You are receiving this mail because:
You are the QA Contact for the bug.
You are the assignee for the bug.___
Nouveau mailing list
Nouveau@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/nouveau