Re: [Nouveau] 5.9.11 still hanging 2mn at each boot and looping on nvidia-gpu 0000:01:00.3: PME# enabled (Quadro RTX 4000 Mobile)

2021-01-29 Thread Marc MERLIN
On Fri, Jan 29, 2021 at 03:20:32PM -0600, Bjorn Helgaas wrote:
> > For comparison the intel iwlwifi driver is very clear about firmware
> > it's trying to load, if it can't and what exact firmware you need to
> > find on the internet (filename)
> 
> I guess you're referring to this in iwl_request_firmware()?
> 
>   IWL_ERR(drv, "check git://git.kernel.org/pub/scm/linux/kernel/git/firmware/linux-firmware.git\n");
>  
 
Yes :)

> How can we fix this in nouveau so we don't have to debug this again?
> I don't really know how firmware loading works, but "git grep -A5
> request_firmware drivers/gpu/drm/nouveau/" shows that we generally
> print something when request_firmware() fails.

Well, have a look at https://pastebin.com/dX19aCpj
do you see any warning whatsoever?

> But I didn't notice those messages in your logs, so I'm probably
> barking up the wrong tree.

You're not. It seems that newer kernels are a bit better:
[  189.304662] nouveau :01:00.0: pmu: firmware unavailable
[  189.312455] nouveau :01:00.0: disp: destroy running...
[  189.316552] nouveau :01:00.0: disp: destroy completed in 1us
[  189.320326] nouveau :01:00.0: disp ctor failed, -12
[  189.324214] nouveau: probe of :01:00.0 failed with error -12

So, it probably got better, but that message only got displayed after the
2mn hang, which having the firmware prevents.

Any developer with the right hardware can probably reproduce this
easily by removing the firmware and looking at the boot messages.
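
For anyone trying that, here is a minimal sketch of a filter for the boot messages. The function name is made up for illustration, and the patterns are only the ones visible in the log excerpts in this thread, so treat it as a starting point rather than a complete list. Note that "pmu: firmware unavailable" also shows up in the boot log with firmware present, so a match is a hint, not proof.

```shell
# Print nouveau lines that hint at missing firmware or a failed probe.
# Reads kernel log text on stdin. The patterns come from the excerpts
# quoted in this thread and are assumptions, not an exhaustive list.
nouveau_fw_hints() {
    grep -E 'nouveau.*(firmware unavailable|ctor failed|probe of .* failed)'
}
```

Typical use would be something like "dmesg | nouveau_fw_hints" (dmesg may need root on some systems).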

At the very least, it should print something clearer, like "driver will
not function properly"; a URL to where one can get the firmware would be
awesome.

> So maybe the wakeups are related to having vs not having the nouveau
> firmware?  I'm still curious about that, and it smells like a bug to
> me, but probably something to do with nouveau where I have no hope of
> debugging it.
 
Right. Honestly, given the time I've lost with this, and now that it
seems gone with the firmware, I'm happy to leave well enough alone :)

I'm not sure how you are involved with the driver, but are you able to
help improve the dmesg output?

Thanks,
Marc
-- 
"A mouse is a device used to point at the xterm you want to type in" - A.S.R.
 
Home page: http://marc.merlins.org/   | PGP 7F55D5F27AAF9D08
___
Nouveau mailing list
Nouveau@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/nouveau


Re: [Nouveau] 5.9.11 still hanging 2mn at each boot and looping on nvidia-gpu 0000:01:00.3: PME# enabled (Quadro RTX 4000 Mobile)

2021-01-29 Thread Bjorn Helgaas
On Thu, Jan 28, 2021 at 04:56:26PM -0800, Marc MERLIN wrote:
> On Wed, Jan 27, 2021 at 03:33:00PM -0600, Bjorn Helgaas wrote:
> > Hi Marc, I appreciate your persistence on this.  I am frankly
> > surprised that you've put up with this so long.
>  
> Well, been using linux for 27 years, but also it's not like I have much
> of a choice outside of switching to windows, as tempting as it's getting
> sometimes ;)
> 
> > > after boot, when it gets the right trigger (not sure which ones), it
> > > loops on this every 2 seconds, mostly forever.
> > > 
> > > I'm not sure if it's nouveau's fault or the kernel's PCI PME's fault, or 
> > > something else.
> > 
> > IIUC there are basically two problems:
> > 
> >   1) A 2 minute delay during boot
> > Another random thought: is there any chance the boot delay could be
> > related to crypto waiting for entropy?
> 
> So, the 2mn hang went away after I added the nouveau firmware in initrd.
> The only problem is that the nouveau driver does not give a very good
> clue as to what's going on and what to do.
>
> For comparison the intel iwlwifi driver is very clear about firmware
> it's trying to load, if it can't and what exact firmware you need to
> find on the internet (filename)

I guess you're referring to this in iwl_request_firmware()?

  IWL_ERR(drv, "check git://git.kernel.org/pub/scm/linux/kernel/git/firmware/linux-firmware.git\n");

How can we fix this in nouveau so we don't have to debug this again?
I don't really know how firmware loading works, but "git grep -A5
request_firmware drivers/gpu/drm/nouveau/" shows that we generally
print something when request_firmware() fails.

But I didn't notice those messages in your logs, so I'm probably
barking up the wrong tree.

> >   2) Some sort of event every 2 seconds that kills your battery life
> > Your machine doesn't sound unusual, and I haven't seen a flood of
> > similar reports, so maybe there's something unusual about your config.
> > But I really don't have any guesses for either one.
> 
> Honestly, there are not too many Thinkpad P73 running linux out there. I
> wouldn't be surprised if it's only a handful or two.
> 
> > It sounds like v5.5 worked fine and you first noticed the slow boot
> > problem in v5.8.  We *could* try to bisect it, but I know that's a lot
> > of work on your part.
> 
> I've done that in the past. To be honest, now that it works after I added
> the firmware that nouveau started needing (and didn't need before), the
> hang at boot is gone for sure.
> The PCI PM wakeup issues on batteries happen sometimes still, but they
> are much more rare now.

So maybe the wakeups are related to having vs not having the nouveau
firmware?  I'm still curious about that, and it smells like a bug to
me, but probably something to do with nouveau where I have no hope of
debugging it.

> > Grasping for any ideas for the boot delay; could you boot with
> > "initcall_debug" and collect your "lsmod" output?  I notice async_tx
> > in some of your logs, but I have no idea what it is.  It's from
> > crypto, so possibly somewhat unusual?
> 
> Is this still needed? I think if nouveau does a better job of helping
> the user correct the issue if firmware is missing (I think intel even
> gives a URL in printk), that would probably be what's needed for the
> most part.

Nope, don't bother with this, thanks.

Bjorn


Re: [Nouveau] 5.9.11 still hanging 2mn at each boot and looping on nvidia-gpu 0000:01:00.3: PME# enabled (Quadro RTX 4000 Mobile)

2021-01-28 Thread Marc MERLIN
On Wed, Jan 27, 2021 at 03:33:00PM -0600, Bjorn Helgaas wrote:
> Hi Marc, I appreciate your persistence on this.  I am frankly
> surprised that you've put up with this so long.
 
Well, been using linux for 27 years, but also it's not like I have much
of a choice outside of switching to windows, as tempting as it's getting
sometimes ;)

> > after boot, when it gets the right trigger (not sure which ones), it
> > loops on this every 2 seconds, mostly forever.
> > 
> > I'm not sure if it's nouveau's fault or the kernel's PCI PME's fault, or 
> > something else.
> 
> IIUC there are basically two problems:
> 
>   1) A 2 minute delay during boot
> Another random thought: is there any chance the boot delay could be
> related to crypto waiting for entropy?

So, the 2mn hang went away after I added the nouveau firmware in initrd.
The only problem is that the nouveau driver does not give a very good
clue as to what's going on and what to do.
For comparison the intel iwlwifi driver is very clear about firmware
it's trying to load, if it can't and what exact firmware you need to
find on the internet (filename)

>   2) Some sort of event every 2 seconds that kills your battery life
> Your machine doesn't sound unusual, and I haven't seen a flood of
> similar reports, so maybe there's something unusual about your config.
> But I really don't have any guesses for either one.

Honestly, there are not too many Thinkpad P73 running linux out there. I
wouldn't be surprised if it's only a handful or two.

> It sounds like v5.5 worked fine and you first noticed the slow boot
> problem in v5.8.  We *could* try to bisect it, but I know that's a lot
> of work on your part.

I've done that in the past. To be honest, now that it works after I added
the firmware that nouveau started needing (and didn't need before), the
hang at boot is gone for sure.
The PCI PM wakeup issues on batteries happen sometimes still, but they
are much more rare now.

> Grasping for any ideas for the boot delay; could you boot with
> "initcall_debug" and collect your "lsmod" output?  I notice async_tx
> in some of your logs, but I have no idea what it is.  It's from
> crypto, so possibly somewhat unusual?

Is this still needed? I think if nouveau does a better job of helping
the user correct the issue if firmware is missing (I think intel even
gives a URL in printk), that would probably be what's needed for the
most part.

[   12.832547] async_tx: api initialized (async) comes from 
./crypto/async_tx/async_tx.c

Thanks for your answer, let me know if there is anything else useful I
can give, I think I'm otherwise mostly ok now.

Marc


Re: [Nouveau] 5.9.11 still hanging 2mn at each boot and looping on nvidia-gpu 0000:01:00.3: PME# enabled (Quadro RTX 4000 Mobile)

2021-01-28 Thread Bjorn Helgaas
On Wed, Jan 27, 2021 at 03:33:02PM -0600, Bjorn Helgaas wrote:
> On Sat, Dec 26, 2020 at 03:12:09AM -0800, Marc MERLIN wrote:
> > This started with 5.5 and hasn't gotten better since then, despite
> > some reports I tried to send.
> > 
> > As per my previous message:
> > I have a Thinkpad P70 with hybrid graphics.
> > 01:00.0 VGA compatible controller: NVIDIA Corporation GM107GLM [Quadro 
> > M600M] (rev a2)
> > that one works fine, I can use i915 for the main screen, and nouveau to
> > display on the external ports (external ports are only wired to nvidia
> > chip, so it's impossible to use them without turning the nvidia chip
> > on).
> >  
> > I now got a newer P73 also with the same hybrid graphics (setup as such
> > in the bios). It runs fine with i915, and I don't need to use external
> > display with nouveau for now (it almost works, but I only see the mouse
> > cursor on the external screen, no window or anything else can get
> > displayed, very weird).
> > 01:00.0 VGA compatible controller: NVIDIA Corporation TU104GLM [Quadro RTX 
> > 4000 Mobile / Max-Q] (rev a1)
> >  
> > 
> > after boot, when it gets the right trigger (not sure which ones), it
> > loops on this every 2 seconds, mostly forever.
> > 
> > I'm not sure if it's nouveau's fault or the kernel's PCI PME's fault, or 
> > something else.
> 
> IIUC there are basically two problems:
> 
>   1) A 2 minute delay during boot
>   2) Some sort of event every 2 seconds that kills your battery life
> 
> Your machine doesn't sound unusual, and I haven't seen a flood of
> similar reports, so maybe there's something unusual about your config.
> But I really don't have any guesses for either one.
> 
> It sounds like v5.5 worked fine and you first noticed the slow boot
> problem in v5.8.  We *could* try to bisect it, but I know that's a lot
> of work on your part.
> 
> Grasping for any ideas for the boot delay; could you boot with
> "initcall_debug" and collect your "lsmod" output?  I notice async_tx
> in some of your logs, but I have no idea what it is.  It's from
> crypto, so possibly somewhat unusual?

Another random thought: is there any chance the boot delay could be
related to crypto waiting for entropy?


Re: [Nouveau] 5.9.11 still hanging 2mn at each boot and looping on nvidia-gpu 0000:01:00.3: PME# enabled (Quadro RTX 4000 Mobile)

2021-01-27 Thread Bjorn Helgaas
Hi Marc, I appreciate your persistence on this.  I am frankly
surprised that you've put up with this so long.

On Sat, Dec 26, 2020 at 03:12:09AM -0800, Marc MERLIN wrote:
> This started with 5.5 and hasn't gotten better since then, despite
> some reports I tried to send.
> 
> As per my previous message:
> I have a Thinkpad P70 with hybrid graphics.
> 01:00.0 VGA compatible controller: NVIDIA Corporation GM107GLM [Quadro M600M] 
> (rev a2)
> that one works fine, I can use i915 for the main screen, and nouveau to
> display on the external ports (external ports are only wired to nvidia
> chip, so it's impossible to use them without turning the nvidia chip
> on).
>  
> I now got a newer P73 also with the same hybrid graphics (setup as such
> in the bios). It runs fine with i915, and I don't need to use external
> display with nouveau for now (it almost works, but I only see the mouse
> cursor on the external screen, no window or anything else can get
> displayed, very weird).
> 01:00.0 VGA compatible controller: NVIDIA Corporation TU104GLM [Quadro RTX 
> 4000 Mobile / Max-Q] (rev a1)
>  
> 
> after boot, when it gets the right trigger (not sure which ones), it
> loops on this every 2 seconds, mostly forever.
> 
> I'm not sure if it's nouveau's fault or the kernel's PCI PME's fault, or 
> something else.

IIUC there are basically two problems:

  1) A 2 minute delay during boot
  2) Some sort of event every 2 seconds that kills your battery life

Your machine doesn't sound unusual, and I haven't seen a flood of
similar reports, so maybe there's something unusual about your config.
But I really don't have any guesses for either one.

It sounds like v5.5 worked fine and you first noticed the slow boot
problem in v5.8.  We *could* try to bisect it, but I know that's a lot
of work on your part.

Grasping for any ideas for the boot delay; could you boot with
"initcall_debug" and collect your "lsmod" output?  I notice async_tx
in some of your logs, but I have no idea what it is.  It's from
crypto, so possibly somewhat unusual?
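
In case it helps when reading the "initcall_debug" output, below is a rough sketch of a filter that lists only the slow initcalls. It assumes the usual "initcall NAME returned N after M usecs" log lines; the function name and the default threshold are made up for illustration.

```shell
# List initcalls that took at least $1 microseconds (default: 1 second).
# Reads kernel log text on stdin; prints "<usecs> <initcall>" per match.
# Assumes lines of the form "initcall NAME returned N after M usecs".
slow_initcalls() {
    awk -v min="${1:-1000000}" '
        /initcall .* returned .* after [0-9]+ usecs/ {
            name = ""; usecs = -1
            for (i = 1; i < NF; i++) {
                if ($i == "initcall") name = $(i + 1)
                if ($i == "after")    usecs = $(i + 1) + 0
            }
            if (usecs >= min) print usecs, name
        }'
}
```

Something like "dmesg | slow_initcalls 1000000" would then show any initcall that took a second or more, which should make a 2 minute stall easy to spot.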

Bjorn


Re: [Nouveau] 5.9.11 still hanging 2mn at each boot and looping on nvidia-gpu 0000:01:00.3: PME# enabled (Quadro RTX 4000 Mobile)

2021-01-07 Thread Marc MERLIN
On Mon, Jan 04, 2021 at 02:28:37PM +0100, Karol Herbst wrote:
> mhh, that PCI config stuff should really not happen all the time, but
> it also doesn't appear to. The other thing I really don't know is, how
> well the runpm works with tools like TLP if there isn't only an audio
> device, but also the USB stuff and all the subdevices have to be
> turned off all the time in order for the GPU to stay powered down.
> 
> The firmware stuff is also just a functional problem, so you won't get
> display offloading, but it shouldn't drain your battery as long as
> nothing is connected. I'd check with "grep .
> /sys/bus/pci/devices/*/power/runtime_status" if all subdevices of the
> GPU are powered down, and check which one gets enabled regularly or
> something.

Well, all I can say is that without the firmware, my boot hung 2mn every
single time (I sent details in the logs upthread).

The battery draw issue was inconsistent. I haven't quite found what
triggers it yet.

Marc


Re: [Nouveau] 5.9.11 still hanging 2mn at each boot and looping on nvidia-gpu 0000:01:00.3: PME# enabled (Quadro RTX 4000 Mobile)

2021-01-04 Thread Karol Herbst
mhh, that PCI config stuff should really not happen all the time, but
it also doesn't appear to. The other thing I really don't know is, how
well the runpm works with tools like TLP if there isn't only an audio
device, but also the USB stuff and all the subdevices have to be
turned off all the time in order for the GPU to stay powered down.

The firmware stuff is also just a functional problem, so you won't get
display offloading, but it shouldn't drain your battery as long as
nothing is connected. I'd check with "grep .
/sys/bus/pci/devices/*/power/runtime_status" if all subdevices of the
GPU are powered down, and check which one gets enabled regularly or
something.
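
To narrow that grep down to just the GPU's functions, something along these lines might work. It is only a sketch: the function name is made up, and the slot 0000:01:00 is assumed from the device addresses mentioned in this thread.

```shell
# Print "<device> <runtime_status>" for every function of one PCI slot.
# $1: slot prefix such as 0000:01:00 (domain:bus:device), an assumption
#     based on the addresses seen in this thread
# $2: sysfs PCI device directory (defaults to /sys/bus/pci/devices)
gpu_runtime_status() {
    slot=$1
    root=${2:-/sys/bus/pci/devices}
    for f in "$root/$slot".*/power/runtime_status; do
        [ -r "$f" ] || continue
        dev=${f%/power/runtime_status}
        printf '%s %s\n' "${dev##*/}" "$(cat "$f")"
    done
}
```

Running "while sleep 2; do gpu_runtime_status 0000:01:00; done" should then show which function keeps flipping back to "active".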

On Mon, Jan 4, 2021 at 12:50 PM Marc MERLIN  wrote:
>
> On Tue, Dec 29, 2020 at 09:47:50AM -0800, Marc MERLIN wrote:
> > > Of course now that I read your email a bit more carefully, it seems
> > > your issue is with the "saving config space" messages. I'm not sure
> > > I've seen those before. Perhaps you have some sort of debug enabled.
> > > I'd find where in the kernel they are being produced, and what the
> > > conditions for it are. But the failure to load firmware isn't great --
> > > not 100% sure if it impacts runpm or not.
> >
> > Yes, I have 'nouveau.debug=disp=trace'
> > Someone on this list asked me to add this a few months back.
> >
> > > I just double-checked, TU10x accel came in via
> > > afa3b96b058d87c2c44d1c83dadb2ba6998d03ce, which was first in v5.6.
> > > Initial TU10x support came in v5.0. So that doesn't line up with your
> > > timeline.
> >
> > You know, I said 5.5, maybe it was 5.6 now, it's been a little while
> > since those issues started.
> >
> > Now we know I was missing the required firmware, it's a good place to
> > start, so I'll start there, thank you very much for the pointers.
>
> Sorry for the delay. I rebooted and everything worked great.
> No hang at boot.
> As for the PME loop I've been seeing, it hasn't happened so far.
>
> I can't comment on whether firmware should be required for the kernel to
> boot properly, but if it's at all possible, please try to make the
> driver fall back or shut down if the firmware is absent as opposed to
> hanging the boot 2mn.
>
> Also some drivers give a better clue that their firmware is missing
> and where to get it from. Adding a printk to help users could be a good
> idea.
>
> Below is the boot with firmware present.
>
> Thanks for your help
> Marc
>
> sauron:~$ grep nouveau /var/log/dmesg
> [   11.016605] nouveau: detected PR support, will not use DSM
> [   11.025191] nouveau :01:00.0: runtime IRQ mapping not provided by arch
> [   11.071823] nouveau :01:00.0: enabling device ( -> 0003)
> [   11.111588] nouveau :01:00.0: NVIDIA TU104 (164000a1)
> [   11.203598] nouveau :01:00.0: bios: version 90.04.4d.00.2c
> [   11.203921] nouveau :01:00.0: pmu: firmware unavailable
> [   11.204229] nouveau :01:00.0: enabling bus mastering
> [   11.204543] nouveau :01:00.0: fb: 8192 MiB GDDR6
> [   11.215524] nouveau :01:00.0: DRM: VRAM: 8192 MiB
> [   11.215525] nouveau :01:00.0: DRM: GART: 536870912 MiB
> [   11.215527] nouveau :01:00.0: DRM: BIT table 'A' not found
> [   11.215527] nouveau :01:00.0: DRM: BIT table 'L' not found
> [   11.215528] nouveau :01:00.0: DRM: TMDS table version 2.0
> [   11.215529] nouveau :01:00.0: DRM: DCB version 4.1
> [   11.215530] nouveau :01:00.0: DRM: DCB outp 00: 02800f66 04600020
> [   11.215531] nouveau :01:00.0: DRM: DCB outp 01: 02011f52 00020010
> [   11.215532] nouveau :01:00.0: DRM: DCB outp 02: 01022f36 04600010
> [   11.215532] nouveau :01:00.0: DRM: DCB outp 03: 04033f76 04600010
> [   11.215533] nouveau :01:00.0: DRM: DCB outp 04: 04044f86 04600020
> [   11.215533] nouveau :01:00.0: DRM: DCB conn 00: 00020047
> [   11.215534] nouveau :01:00.0: DRM: DCB conn 01: 00010161
> [   11.215534] nouveau :01:00.0: DRM: DCB conn 02: 1248
> [   11.215535] nouveau :01:00.0: DRM: DCB conn 03: 01000348
> [   11.215535] nouveau :01:00.0: DRM: DCB conn 04: 02000471
> [   11.216166] nouveau :01:00.0: DRM: MM: using COPY for buffer copies
> [   11.526753] nouveau :01:00.0: DRM: unknown connector type 48
> [   11.527077] nouveau :01:00.0: DRM: unknown connector type 48
> [   11.552051] nouveau :01:00.0: [drm] Cannot find any crtc or sizes
> [   11.554239] nouveau :01:00.0: [drm] Cannot find any crtc or sizes
> [   11.555822] nouveau :01:00.0: [drm] Cannot find any crtc or sizes
> [   11.556054] [drm] Initialized nouveau 1.3.1 20120801 for :01:00.0 on 
> minor 1
> [   11.556060] nouveau :01:00.0: DRM: Disabling PCI power management to 
> avoid bug
> [   18.887229] nouveau :01:00.0: saving config space at offset 0x0 
> (reading 0x1eb610de)
> [   18.887231] nouveau :01:00.0: saving config space at offset 0x4 
> (reading 0x100407)
> [   18.887233] nouveau :01:00.0: saving config space at offset 0x8 
> (reading 

Re: [Nouveau] 5.9.11 still hanging 2mn at each boot and looping on nvidia-gpu 0000:01:00.3: PME# enabled (Quadro RTX 4000 Mobile)

2021-01-04 Thread Marc MERLIN
On Tue, Dec 29, 2020 at 09:47:50AM -0800, Marc MERLIN wrote:
> > Of course now that I read your email a bit more carefully, it seems
> > your issue is with the "saving config space" messages. I'm not sure
> > I've seen those before. Perhaps you have some sort of debug enabled.
> > I'd find where in the kernel they are being produced, and what the
> > conditions for it are. But the failure to load firmware isn't great --
> > not 100% sure if it impacts runpm or not.
>  
> Yes, I have 'nouveau.debug=disp=trace'
> Someone on this list asked me to add this a few months back.
> 
> > I just double-checked, TU10x accel came in via
> > afa3b96b058d87c2c44d1c83dadb2ba6998d03ce, which was first in v5.6.
> > Initial TU10x support came in v5.0. So that doesn't line up with your
> > timeline.
> 
> You know, I said 5.5, maybe it was 5.6 now, it's been a little while
> since those issues started.
> 
> Now we know I was missing the required firmware, it's a good place to
> start, so I'll start there, thank you very much for the pointers.

Sorry for the delay. I rebooted and everything worked great.
No hang at boot.
As for the PME loop I've been seeing, it hasn't happened so far.

I can't comment on whether firmware should be required for the kernel to
boot properly, but if it's at all possible, please try to make the
driver fall back or shut down if the firmware is absent as opposed to
hanging the boot 2mn.

Also some drivers give a better clue that their firmware is missing
and where to get it from. Adding a printk to help users could be a good
idea.

Below is the boot with firmware present.

Thanks for your help
Marc

sauron:~$ grep nouveau /var/log/dmesg 
[   11.016605] nouveau: detected PR support, will not use DSM
[   11.025191] nouveau :01:00.0: runtime IRQ mapping not provided by arch
[   11.071823] nouveau :01:00.0: enabling device ( -> 0003)
[   11.111588] nouveau :01:00.0: NVIDIA TU104 (164000a1)
[   11.203598] nouveau :01:00.0: bios: version 90.04.4d.00.2c
[   11.203921] nouveau :01:00.0: pmu: firmware unavailable
[   11.204229] nouveau :01:00.0: enabling bus mastering
[   11.204543] nouveau :01:00.0: fb: 8192 MiB GDDR6
[   11.215524] nouveau :01:00.0: DRM: VRAM: 8192 MiB
[   11.215525] nouveau :01:00.0: DRM: GART: 536870912 MiB
[   11.215527] nouveau :01:00.0: DRM: BIT table 'A' not found
[   11.215527] nouveau :01:00.0: DRM: BIT table 'L' not found
[   11.215528] nouveau :01:00.0: DRM: TMDS table version 2.0
[   11.215529] nouveau :01:00.0: DRM: DCB version 4.1
[   11.215530] nouveau :01:00.0: DRM: DCB outp 00: 02800f66 04600020
[   11.215531] nouveau :01:00.0: DRM: DCB outp 01: 02011f52 00020010
[   11.215532] nouveau :01:00.0: DRM: DCB outp 02: 01022f36 04600010
[   11.215532] nouveau :01:00.0: DRM: DCB outp 03: 04033f76 04600010
[   11.215533] nouveau :01:00.0: DRM: DCB outp 04: 04044f86 04600020
[   11.215533] nouveau :01:00.0: DRM: DCB conn 00: 00020047
[   11.215534] nouveau :01:00.0: DRM: DCB conn 01: 00010161
[   11.215534] nouveau :01:00.0: DRM: DCB conn 02: 1248
[   11.215535] nouveau :01:00.0: DRM: DCB conn 03: 01000348
[   11.215535] nouveau :01:00.0: DRM: DCB conn 04: 02000471
[   11.216166] nouveau :01:00.0: DRM: MM: using COPY for buffer copies
[   11.526753] nouveau :01:00.0: DRM: unknown connector type 48
[   11.527077] nouveau :01:00.0: DRM: unknown connector type 48
[   11.552051] nouveau :01:00.0: [drm] Cannot find any crtc or sizes
[   11.554239] nouveau :01:00.0: [drm] Cannot find any crtc or sizes
[   11.555822] nouveau :01:00.0: [drm] Cannot find any crtc or sizes
[   11.556054] [drm] Initialized nouveau 1.3.1 20120801 for :01:00.0 on minor 1
[   11.556060] nouveau :01:00.0: DRM: Disabling PCI power management to avoid bug
[   18.887229] nouveau :01:00.0: saving config space at offset 0x0 (reading 0x1eb610de)
[   18.887231] nouveau :01:00.0: saving config space at offset 0x4 (reading 0x100407)
[   18.887233] nouveau :01:00.0: saving config space at offset 0x8 (reading 0x3a1)
[   18.887235] nouveau :01:00.0: saving config space at offset 0xc (reading 0x80)
[   18.887237] nouveau :01:00.0: saving config space at offset 0x10 (reading 0xcd00)
[   18.887239] nouveau :01:00.0: saving config space at offset 0x14 (reading 0xa00c)
[   18.887241] nouveau :01:00.0: saving config space at offset 0x18 (reading 0x0)
[   18.887243] nouveau :01:00.0: saving config space at offset 0x1c (reading 0xb00c)
[   18.887245] nouveau :01:00.0: saving config space at offset 0x20 (reading 0x0)
[   18.887247] nouveau :01:00.0: saving config space at offset 0x24 (reading 0x2001)
[   18.887249] nouveau :01:00.0: saving config space at offset 0x28 (reading 0x0)
[   18.887251] nouveau :01:00.0: saving config space at offset 0x2c (reading 0x229b17aa)
[   18.887253] nouveau :01:00.0: saving config space 

Re: [Nouveau] 5.9.11 still hanging 2mn at each boot and looping on nvidia-gpu 0000:01:00.3: PME# enabled (Quadro RTX 4000 Mobile)

2020-12-30 Thread ael
On Tue, Dec 29, 2020 at 11:33:16AM -0500, Ilia Mirkin wrote:
> On Tue, Dec 29, 2020 at 10:52 AM Marc MERLIN  wrote:
> 
> I'm not extremely familiar with debian packaging, but the firmware is
> provided by NVIDIA and shipped as part of linux-firmware:
> 
> https://git.kernel.org/pub/scm/linux/kernel/git/firmware/linux-firmware.git/tree/nvidia

I think it may be  firmware-misc-nonfree.

ael


Re: [Nouveau] 5.9.11 still hanging 2mn at each boot and looping on nvidia-gpu 0000:01:00.3: PME# enabled (Quadro RTX 4000 Mobile)

2020-12-29 Thread Marc MERLIN
(removed other lists, since it's likely not a linux-PCI problem)

On Tue, Dec 29, 2020 at 11:33:16AM -0500, Ilia Mirkin wrote:
> > Sounds like this would be a problem with all chips if userspace is able
> > to wake them up every second or two with a probe. Now I wonder what
> > broken userspace I have that could be doing this.
> 
> Well, it's a theory. Some userspace tool helpfully prevents the GPU from
> suspending entirely by messing with the attached audio device;
> unfortunately I don't remember its name. It's very common and meant to
> help... oh well.

Are you thinking about tlp maybe?  https://linrunner.de/tlp/
I submitted a blacklist patch so that it works ok-ish on my laptop now.
(when the nvidia chip is unhappy, it happily uses 70W on batteries with
1.3h of runtime. When everything is ok, I can go down to about 12W/9H)

> > Do you think that could be a reason why the boot would hang for 2 full 
> > minutes at every
> > boot ever since I upgraded to 5.5?
> 
> I'd have to check, but I'm guessing TU104 acceleration became a thing
> in 5.5. I would also not be very surprised if the code didn't handle
> failure extremely gracefully - there definitely have been problems
> with that in the past.

Ah, then the timing checks out. That's exciting, at least now I have a
lead as to why I'm having problems. This was the same time a PCI PM
change went in, and I mistakenly thought it was to blame.

> > The kernel module is in my initrd:
> > sauron:/usr/local/bin# dd 
> > if=/boot/initrd.img-5.9.11-amd64-preempt-sysrq-20190817 bs=2966528  skip=1 
> > | gunzip | cpio -tdv | grep nouveau
> > drwxr-xr-x   1 root root0 Nov 30 15:40 
> > usr/lib/modules/5.9.11-amd64-preempt-sysrq-20190817/kernel/drivers/gpu/drm/nouveau
> > -rw-r--r--   1 root root  3691385 Nov 30 15:35 
> > usr/lib/modules/5.9.11-amd64-preempt-sysrq-20190817/kernel/drivers/gpu/drm/nouveau/nouveau.ko
> > 17+1 records in
> > 17+1 records out
> > 52566778 bytes (53 MB, 50 MiB) copied, 1.69708 s, 31.0 MB/s
> 
> I think that gets you out of "full newbie" land...

:)  (ok, I have been using linux since 1993, but stuff changes so much
all the time, that sometimes I feel like a newbie all over again)
In my days, we didn't complain about systemd vs sysvinit, we had rc.local
and it was good enough :-D

> > Note that ultimately I only need nouveau not to hang my boot 2mn and do
> > PM so that the nvidia chip goes to sleep since I don't use it.
> 
> I'm not extremely familiar with debian packaging, but the firmware is
> provided by NVIDIA and shipped as part of linux-firmware:
> https://git.kernel.org/pub/scm/linux/kernel/git/firmware/linux-firmware.git/tree/nvidia
 
Ah, it comes from outside just like intel firmware, thanks.
Also, I was looking for nouveau, not nvidia:
sauron:/usr/local/bin# dd if=/boot/initrd.img-5.9.11-amd64-preempt-sysrq-20190817 bs=2966528 skip=1 | gunzip | cpio -tdv | grep tu104
shows no match

Good news is that debian did package it (they have multiple firmware
packages)
sauron:~# dpkggrep firmware | awk '{print $1}' | xargs apt-get install -y
sauron:~# dpkg -S /lib/firmware/nvidia/tu104
firmware-misc-nonfree: /lib/firmware/nvidia/tu104

update-initramfs -v -c -k 5.9.11-amd64-preempt-sysrq-20190817

Ok, I should be in business after next reboot, thank you.
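
For checking the rebuilt initrd before rebooting, a small helper like the one below could be used. This is just a sketch: the function name is made up, and it assumes the firmware lands under lib/firmware/nvidia/<chip> inside the image, as the dpkg -S output above suggests for the installed system.

```shell
# Succeeds if the file listing on stdin contains NVIDIA firmware for
# chip $1 (for example tu104) under lib/firmware/nvidia/.
# The path layout is an assumption based on the dpkg -S output above.
listing_has_nvidia_fw() {
    grep -Eq "lib/firmware/nvidia/$1([/ ]|\$)"
}
```

The listing can come from the same dd | gunzip | cpio -tdv pipeline as above, or from lsinitramfs on Debian, for example "lsinitramfs /boot/initrd.img-5.9.11-amd64-preempt-sysrq-20190817 | listing_has_nvidia_fw tu104 && echo present".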

> Of course now that I read your email a bit more carefully, it seems
> your issue is with the "saving config space" messages. I'm not sure
> I've seen those before. Perhaps you have some sort of debug enabled.
> I'd find where in the kernel they are being produced, and what the
> conditions for it are. But the failure to load firmware isn't great --
> not 100% sure if it impacts runpm or not.
 
Yes, I have 'nouveau.debug=disp=trace'
Someone on this list asked me to add this a few months back.

> I just double-checked, TU10x accel came in via
> afa3b96b058d87c2c44d1c83dadb2ba6998d03ce, which was first in v5.6.
> Initial TU10x support came in v5.0. So that doesn't line up with your
> timeline.

You know, I said 5.5, maybe it was 5.6 now, it's been a little while
since those issues started.

Now we know I was missing the required firmware, it's a good place to
start, so I'll start there, thank you very much for the pointers.

Marc


Re: [Nouveau] 5.9.11 still hanging 2mn at each boot and looping on nvidia-gpu 0000:01:00.3: PME# enabled (Quadro RTX 4000 Mobile)

2020-12-29 Thread Ilia Mirkin
On Tue, Dec 29, 2020 at 10:52 AM Marc MERLIN  wrote:
>
> On Sat, Dec 26, 2020 at 03:12:09AM -0800, Ilia Mirkin wrote:
> > > after boot, when it gets the right trigger (not sure which ones), it
> > > loops on this every 2 seconds, mostly forever.
> >
> > The gpu suspends with runtime pm. And then gets woken up for some
> > reason (could be something quite silly, like lspci, or could be
> > something explicitly checking connectors, etc). Repeat.
>
> Ah, fair point.  Could it be powertop even?
> How would I go towards tracing that?
> Sounds like this would be a problem with all chips if userspace is able
> to wake them up every second or two with a probe. Now I wonder what
> broken userspace I have that could be doing this.

Well, it's a theory. Some userspace tool helpfully prevents the GPU from
suspending entirely by messing with the attached audio device;
unfortunately I don't remember its name. It's very common and meant to
help... oh well.

>
> > Display offload usually requires acceleration -- the copies are done
> > using the DMA engine. Please make sure that you have firmware
> > available (and a new enough mesa). The errors suggest that you don't
> > have firmware available at the time that nouveau loads. Depending on
> > your setup, that might mean the firmware has to be built into the
> > kernel, or available in initramfs. (Or just regular filesystem if you
> > don't use a complicated boot sequence. But many people go with distro
> > defaults, which do have this complexity.)
>
> Hi Ilia, thanks for your answer.
>
> Do you think that could be a reason why the boot would hang for 2 full
> minutes at every boot ever since I upgraded to 5.5?

I'd have to check, but I'm guessing TU104 acceleration became a thing
in 5.5. I would also not be very surprised if the code didn't handle
failure extremely gracefully - there definitely have been problems
with that in the past.

>
> Also, without wanting to sound like a full newbie, where is that
> firmware you're talking about? In my kernel source?
>
> Here's what I do have:
> sauron:/usr/local/bin# dpkggrep nouveau
> libdrm-nouveau2:amd64   install
> xserver-xorg-video-nouveau  install
>
> no nouveau-firmware package in debian:
> sauron:/usr/local/bin# apt-cache search nouveau
> bumblebee - NVIDIA Optimus support for Linux
> libdrm-nouveau2 - Userspace interface to nouveau-specific kernel DRM services -- runtime
> xfonts-jmk - Jim Knoble's character-cell fonts for X
> xserver-xorg-video-nouveau - X.Org X server -- Nouveau display driver
>
> No firmware file on my disk:
> sauron:/usr/local/bin# find /lib/modules/5.9.11-amd64-preempt-sysrq-20190817/ /lib/firmware/ |grep nouveau
> /lib/modules/5.9.11-amd64-preempt-sysrq-20190817/kernel/drivers/gpu/drm/nouveau
> /lib/modules/5.9.11-amd64-preempt-sysrq-20190817/kernel/drivers/gpu/drm/nouveau/nouveau.ko
> sauron:/usr/local/bin#
>
> The kernel module is in my initrd:
> sauron:/usr/local/bin# dd if=/boot/initrd.img-5.9.11-amd64-preempt-sysrq-20190817 bs=2966528 skip=1 | gunzip | cpio -tdv | grep nouveau
> drwxr-xr-x   1 root root        0 Nov 30 15:40 usr/lib/modules/5.9.11-amd64-preempt-sysrq-20190817/kernel/drivers/gpu/drm/nouveau
> -rw-r--r--   1 root root  3691385 Nov 30 15:35 usr/lib/modules/5.9.11-amd64-preempt-sysrq-20190817/kernel/drivers/gpu/drm/nouveau/nouveau.ko
> 17+1 records in
> 17+1 records out
> 52566778 bytes (53 MB, 50 MiB) copied, 1.69708 s, 31.0 MB/s

I think that gets you out of "full newbie" land...

>
> What am I supposed to do/check next?
>
> Note that ultimately I only need nouveau not to hang my boot 2mn and do
> PM so that the nvidia chip goes to sleep since I don't use it.

I'm not extremely familiar with debian packaging, but the firmware is
provided by NVIDIA and shipped as part of linux-firmware:

https://git.kernel.org/pub/scm/linux/kernel/git/firmware/linux-firmware.git/tree/nvidia

This needs to be available at /lib/firmware/nvidia when nouveau loads.
Based on your email above, it's most likely that it would load from
the initrd - so make sure it's in there.
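
As a rough way to check this, here is a minimal Python sketch that scans an
initrd file listing (one path per line, e.g. from "cpio -t" or Debian's
lsinitramfs) for NVIDIA firmware. The function name and the "tu104"
directory default are assumptions made for illustration; adjust the chip
directory to whatever linux-firmware ships for your GPU.

```python
def find_nvidia_firmware(paths, chip="tu104"):
    """Return the listing entries that sit under lib/firmware/nvidia/<chip>/.

    `paths` is any iterable of archive paths, one per entry. A substring
    match is used because initrd listings may carry a "usr/" prefix.
    The default chip directory is an assumption, not taken from the thread.
    """
    needle = f"lib/firmware/nvidia/{chip}/"
    return [p.strip() for p in paths if needle in p]


# Hypothetical listing: the module is present, one firmware file is present.
listing = [
    "usr/lib/modules/5.9.11/kernel/drivers/gpu/drm/nouveau/nouveau.ko",
    "usr/lib/firmware/nvidia/tu104/example.bin",  # placeholder file name
]
hits = find_nvidia_firmware(listing)
```

An empty result would suggest the firmware is not in the initrd, in which
case nouveau cannot find it if it is loaded from there.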

Of course now that I read your email a bit more carefully, it seems
your issue is with the "saving config space" messages. I'm not sure
I've seen those before. Perhaps you have some sort of debug enabled.
I'd find where in the kernel they are being produced, and what the
conditions for it are. But the failure to load firmware isn't great --
not 100% sure if it impacts runpm or not.
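
For reading those messages, a small Python sketch may help. It assumes each
message is on a single line in the form "saving config space at offset 0x0
(reading 0x1ad910de)", and it relies on the standard PCI layout where the
dword at offset 0x0 holds the vendor ID in the low 16 bits and the device ID
in the high 16 bits. The function names are made up for this example.

```python
import re

# Matches the offset and the value in one "saving config space" message.
LINE_RE = re.compile(
    r"saving config space at offset (0x[0-9a-fA-F]+) \(reading (0x[0-9a-fA-F]+)\)"
)


def parse_line(line):
    """Return (offset, value) as integers, or None if the line does not match."""
    m = LINE_RE.search(line)
    if m is None:
        return None
    return int(m.group(1), 16), int(m.group(2), 16)


def decode_id_dword(value):
    """Split the dword at config offset 0x0 into (vendor_id, device_id)."""
    return value & 0xFFFF, (value >> 16) & 0xFFFF


sample = ("[   12.101203] nvidia-gpu 0000:01:00.3: saving config space "
          "at offset 0x0 (reading 0x1ad910de)")
offset, value = parse_line(sample)
vendor, device = decode_id_dword(value)
```

For the sample line this gives vendor 0x10de (NVIDIA) and device 0x1ad9,
which at least confirms which function on the card is being suspended.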

I just double-checked, TU10x accel came in via
afa3b96b058d87c2c44d1c83dadb2ba6998d03ce, which was first in v5.6.
Initial TU10x support came in v5.0. So that doesn't line up with your
timeline.

Anyways, I'd definitely sort the firmware situation out, but it may
not be the cause of your problem.

Cheers,

  -ilia


Re: [Nouveau] 5.9.11 still hanging 2mn at each boot and looping on nvidia-gpu 0000:01:00.3: PME# enabled (Quadro RTX 4000 Mobile)

2020-12-29 Thread Marc MERLIN
On Sat, Dec 26, 2020 at 03:12:09AM -0800, Ilia Mirkin wrote:
> > after boot, when it gets the right trigger (not sure which ones), it
> > loops on this every 2 seconds, mostly forever.
> 
> The gpu suspends with runtime pm. And then gets woken up for some
> reason (could be something quite silly, like lspci, or could be
> something explicitly checking connectors, etc). Repeat.

Ah, fair point.  Could it be powertop even?
How would I go towards tracing that?
Sounds like this would be a problem with all chips if userspace is able
to wake them up every second or two with a probe. Now I wonder what
broken userspace I have that could be doing this.
 
> Display offload usually requires acceleration -- the copies are done
> using the DMA engine. Please make sure that you have firmware
> available (and a new enough mesa). The errors suggest that you don't
> have firmware available at the time that nouveau loads. Depending on
> your setup, that might mean the firmware has to be built into the
> kernel, or available in initramfs. (Or just regular filesystem if you
> don't use a complicated boot sequence. But many people go with distro
> defaults, which do have this complexity.)

Hi Ilia, thanks for your answer.

Do you think that could be a reason why the boot would hang for 2 full
minutes at every boot ever since I upgraded to 5.5?

Also, without wanting to sound like a full newbie, where is that
firmware you're talking about? In my kernel source?

Here's what I do have:
sauron:/usr/local/bin# dpkggrep nouveau
libdrm-nouveau2:amd64   install
xserver-xorg-video-nouveau  install

no nouveau-firmware package in debian:
sauron:/usr/local/bin# apt-cache search nouveau
bumblebee - NVIDIA Optimus support for Linux
libdrm-nouveau2 - Userspace interface to nouveau-specific kernel DRM services -- runtime
xfonts-jmk - Jim Knoble's character-cell fonts for X
xserver-xorg-video-nouveau - X.Org X server -- Nouveau display driver

No firmware file on my disk:
sauron:/usr/local/bin# find /lib/modules/5.9.11-amd64-preempt-sysrq-20190817/ /lib/firmware/ |grep nouveau
/lib/modules/5.9.11-amd64-preempt-sysrq-20190817/kernel/drivers/gpu/drm/nouveau
/lib/modules/5.9.11-amd64-preempt-sysrq-20190817/kernel/drivers/gpu/drm/nouveau/nouveau.ko
sauron:/usr/local/bin# 

The kernel module is in my initrd:
sauron:/usr/local/bin# dd if=/boot/initrd.img-5.9.11-amd64-preempt-sysrq-20190817 bs=2966528 skip=1 | gunzip | cpio -tdv | grep nouveau
drwxr-xr-x   1 root root        0 Nov 30 15:40 usr/lib/modules/5.9.11-amd64-preempt-sysrq-20190817/kernel/drivers/gpu/drm/nouveau
-rw-r--r--   1 root root  3691385 Nov 30 15:35 usr/lib/modules/5.9.11-amd64-preempt-sysrq-20190817/kernel/drivers/gpu/drm/nouveau/nouveau.ko
17+1 records in
17+1 records out
52566778 bytes (53 MB, 50 MiB) copied, 1.69708 s, 31.0 MB/s

What am I supposed to do/check next?

Note that ultimately I only need nouveau not to hang my boot 2mn and do
PM so that the nvidia chip goes to sleep since I don't use it.

Thanks,
Marc
-- 
"A mouse is a device used to point at the xterm you want to type in" - A.S.R.
 
Home page: http://marc.merlins.org/   | PGP 7F55D5F27AAF9D08


Re: [Nouveau] 5.9.11 still hanging 2mn at each boot and looping on nvidia-gpu 0000:01:00.3: PME# enabled (Quadro RTX 4000 Mobile)

2020-12-27 Thread Ilia Mirkin
On Sun, Dec 27, 2020 at 12:03 PM Marc MERLIN wrote:
>
> This started with 5.5 and hasn't gotten better since then, despite some
> reports I tried to send.
>
> As per my previous message:
> I have a Thinkpad P70 with hybrid graphics.
> 01:00.0 VGA compatible controller: NVIDIA Corporation GM107GLM [Quadro M600M] (rev a2)
> that one works fine, I can use i915 for the main screen, and nouveau to
> display on the external ports (external ports are only wired to nvidia
> chip, so it's impossible to use them without turning the nvidia chip
> on).
>
> I now got a newer P73 also with the same hybrid graphics (set up as such
> in the BIOS). It runs fine with i915, and I don't need to use external
> display with nouveau for now (it almost works, but I only see the mouse
> cursor on the external screen, no window or anything else can get
> displayed, very weird).
> 01:00.0 VGA compatible controller: NVIDIA Corporation TU104GLM [Quadro RTX 4000 Mobile / Max-Q] (rev a1)

Display offload usually requires acceleration -- the copies are done
using the DMA engine. Please make sure that you have firmware
available (and a new enough mesa). The errors suggest that you don't
have firmware available at the time that nouveau loads. Depending on
your setup, that might mean the firmware has to be built into the
kernel, or available in initramfs. (Or just regular filesystem if you
don't use a complicated boot sequence. But many people go with distro
defaults, which do have this complexity.)

>
>
> after boot, when it gets the right trigger (not sure which ones), it
> loops on this every 2 seconds, mostly forever.

The gpu suspends with runtime pm. And then gets woken up for some
reason (could be something quite silly, like lspci, or could be
something explicitly checking connectors, etc). Repeat.
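
To watch that suspend/resume cycle from userspace, one option is to read the
runtime PM attributes that the kernel exposes in sysfs for each PCI device
(power/control and power/runtime_status). A minimal Python sketch follows;
the function name is invented for this example, and the device address
0000:01:00.0 is assumed from the lspci output in the report.

```python
from pathlib import Path


def runtime_pm_state(device, sysfs_root="/sys/bus/pci/devices"):
    """Read the runtime PM attributes of one PCI device from sysfs.

    Returns a dict with "control" (usually "auto" or "on") and
    "runtime_status" (e.g. "active" or "suspended"). A value is None
    when the attribute file does not exist.
    """
    power = Path(sysfs_root) / device / "power"
    state = {}
    for name in ("control", "runtime_status"):
        attr = power / name
        state[name] = attr.read_text().strip() if attr.exists() else None
    return state


# Address assumed from the report; both values are None on a machine
# without such a device.
gpu_state = runtime_pm_state("0000:01:00.0")
```

Calling this in a loop while starting and stopping suspect programs (lspci,
powertop, and so on) is one way to narrow down what keeps waking the GPU.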

Cheers,

  -ilia


[Nouveau] 5.9.11 still hanging 2mn at each boot and looping on nvidia-gpu 0000:01:00.3: PME# enabled (Quadro RTX 4000 Mobile)

2020-12-27 Thread Marc MERLIN
This started with 5.5 and hasn't gotten better since then, despite some reports
I tried to send.

As per my previous message:
I have a Thinkpad P70 with hybrid graphics.
01:00.0 VGA compatible controller: NVIDIA Corporation GM107GLM [Quadro M600M] (rev a2)
that one works fine, I can use i915 for the main screen, and nouveau to
display on the external ports (external ports are only wired to nvidia
chip, so it's impossible to use them without turning the nvidia chip
on).
 
I now got a newer P73 also with the same hybrid graphics (set up as such
in the BIOS). It runs fine with i915, and I don't need to use external
display with nouveau for now (it almost works, but I only see the mouse
cursor on the external screen, no window or anything else can get
displayed, very weird).
01:00.0 VGA compatible controller: NVIDIA Corporation TU104GLM [Quadro RTX 4000 Mobile / Max-Q] (rev a1)
 

after boot, when it gets the right trigger (not sure which ones), it
loops on this every 2 seconds, mostly forever.

I'm not sure if it's nouveau's fault or the kernel's PCI PME's fault, or
something else.

Boot hangs look like this:
[   10.659209] Console: switching to colour frame buffer device 240x67
[   10.732353] i915 0000:00:02.0: [drm] fb0: i915drmfb frame buffer device
[   12.101203] nvidia-gpu 0000:01:00.3: saving config space at offset 0x0 (reading 0x1ad910de)
[   12.101212] nvidia-gpu 0000:01:00.3: saving config space at offset 0x4 (reading 0x100406)
[   12.101217] nvidia-gpu 0000:01:00.3: saving config space at offset 0x8 (reading 0xc8000a1)
[   12.101223] nvidia-gpu 0000:01:00.3: saving config space at offset 0xc (reading 0x80)
[   12.101228] nvidia-gpu 0000:01:00.3: saving config space at offset 0x10 (reading 0xce054000)
[   12.101234] nvidia-gpu 0000:01:00.3: saving config space at offset 0x14 (reading 0x0)
[   12.101239] nvidia-gpu 0000:01:00.3: saving config space at offset 0x18 (reading 0x0)
[   12.101244] nvidia-gpu 0000:01:00.3: saving config space at offset 0x1c (reading 0x0)
[   12.101249] nvidia-gpu 0000:01:00.3: saving config space at offset 0x20 (reading 0x0)
[   12.101254] nvidia-gpu 0000:01:00.3: saving config space at offset 0x24 (reading 0x0)
[   12.101259] nvidia-gpu 0000:01:00.3: saving config space at offset 0x28 (reading 0x0)
[   12.101265] nvidia-gpu 0000:01:00.3: saving config space at offset 0x2c (reading 0x229b17aa)
[   12.101270] nvidia-gpu 0000:01:00.3: saving config space at offset 0x30 (reading 0x0)
[   12.101275] nvidia-gpu 0000:01:00.3: saving config space at offset 0x34 (reading 0x68)
[   12.101280] nvidia-gpu 0000:01:00.3: saving config space at offset 0x38 (reading 0x0)
[   12.101285] nvidia-gpu 0000:01:00.3: saving config space at offset 0x3c (reading 0x4ff)
[   12.101333] nvidia-gpu 0000:01:00.3: PME# enabled
[   25.151246] thunderbolt 0000:06:00.0: saving config space at offset 0x0 (reading 0x15eb8086)
[   25.151260] thunderbolt 0000:06:00.0: saving config space at offset 0x4 (reading 0x100406)
[   25.151265] thunderbolt 0000:06:00.0: saving config space at offset 0x8 (reading 0x886)
[   25.151270] thunderbolt 0000:06:00.0: saving config space at offset 0xc (reading 0x20)
[   25.151276] thunderbolt 0000:06:00.0: saving config space at offset 0x10 (reading 0xcc10)
[   25.151281] thunderbolt 0000:06:00.0: saving config space at offset 0x14 (reading 0xcc14)
[   25.151286] thunderbolt 0000:06:00.0: saving config space at offset 0x18 (reading 0x0)
[   25.151291] thunderbolt 0000:06:00.0: saving config space at offset 0x1c (reading 0x0)
[   25.151296] thunderbolt 0000:06:00.0: saving config space at offset 0x20 (reading 0x0)
[   25.151301] thunderbolt 0000:06:00.0: saving config space at offset 0x24 (reading 0x0)
[   25.151306] thunderbolt 0000:06:00.0: saving config space at offset 0x28 (reading 0x0)
[   25.151311] thunderbolt 0000:06:00.0: saving config space at offset 0x2c (reading 0x229b17aa)
[   25.151316] thunderbolt 0000:06:00.0: saving config space at offset 0x30 (reading 0x0)
[   25.151322] thunderbolt 0000:06:00.0: saving config space at offset 0x34 (reading 0x80)
[   25.151327] thunderbolt 0000:06:00.0: saving config space at offset 0x38 (reading 0x0)
[   25.151332] thunderbolt 0000:06:00.0: saving config space at offset 0x3c (reading 0x1ff)
[   25.151416] thunderbolt 0000:06:00.0: PME# enabled
[   25.169204] pcieport 0000:05:00.0: saving config space at offset 0x0 (reading 0x15ea8086)
[   25.169214] pcieport 0000:05:00.0: saving config space at offset 0x4 (reading 0x100407)
[   25.169219] pcieport 0000:05:00.0: saving config space at offset 0x8 (reading 0x6040006)
[   25.169224] pcieport 0000:05:00.0: saving config space at offset 0xc (reading 0x10020)
[   25.169229] pcieport 0000:05:00.0: saving config space at offset 0x10 (reading 0x0)
[   25.169233] pcieport 0000:05:00.0: saving config space at offset 0x14 (reading 0x0)
[   25.169238] pcieport 0000:05:00.0: saving config space at offset 0x18 (reading 0x60605)
[