Re: Regression causes a hang on boot with a Comtrol PCI card

2019-04-24 Thread Alan Stern
On Tue, 23 Apr 2019, Jesse Hathaway wrote:

> On Tue, Apr 16, 2019 at 10:00 AM Alan Stern  wrote:
> > Whatever the source of the problem, I don't think you're going to find
> > it by looking at the USB code.  Perhaps the early initialization of the
> > functions that _are_ present on the Comtrol card somehow messes up
> > other parts of the system.
> 
> Thanks for all you help Alan,
> 
> I think at this point my Linux kernel debugging knowledge has run
> aground. I would
> love to crack this nut, but I doubt that will be possible with my
> current knowledge base.
> For now I am going to run the patched kernel even if I don't
> understand precisely why
> they prevent the box from hanging on boot when the Comtrol card is present.

Okay, that's understandable.  Sorry I couldn't do more to help.

Alan Stern



Re: Regression causes a hang on boot with a Comtrol PCI card

2019-04-16 Thread Alan Stern
On Mon, 15 Apr 2019, Jesse Hathaway wrote:

> On Sat, Apr 6, 2019 at 10:32 AM Alan Stern  wrote:
> > Well, at least that's forward progress.  I don't know what pstore is or
> > what connection it has to the USB subsystem.  Does the machine hang
> > similarly if you boot without the Comtrol PCI card present?
> 
> Yes the box boots fine when the Comtrol PCI card is *not* present.
> 
> > For that matter, what happens if you remove EHCI from the kernel
> > configuration completely?
> 
> If I remove USB support, the box still hangs after registering the pstore, but
> if I remove pstore support and APEI support from the kernel then the box boots
> without issue.
> 
> > As for how the PCI card affects the USB handoff, it depends on how the
> > BIOS behaves.  Normally the BIOS will take control of all the available
> > EHCI controllers during bootup (so that it can use them to communicate
> > with a USB keyboard or mouse), including controllers on add-on PCI
> > cards as well as those on the motherboard.  When the kernel starts up,
> > it tries to take ownership of the controllers away from the BIOS
> > (that's the handoff) so that Linux can use them.  However, if the BIOS
> > was never tested for handoff of USB controllers on add-on PCI cards, it
> > could easily have a bug that would crash the machine.
> 
> The Comtrol card provides 32 serial ports, via a breakout box, but it has
> no USB functionality, which was why I was surprised that its presence
> somehow breaks the USB hand off.

Well, I am completely mystified.  Nor do I understand how the commits
you identified could be related, although maybe the relationship is
very indirect.

Whatever the source of the problem, I don't think you're going to find 
it by looking at the USB code.  Perhaps the early initialization of the 
functions that _are_ present on the Comtrol card somehow messes up 
other parts of the system.

Alan Stern



Re: Regression causes a hang on boot with a Comtrol PCI card

2019-04-04 Thread Alan Stern
On Thu, 4 Apr 2019, Jesse Hathaway wrote:

> On Tue, Apr 2, 2019 at 9:29 AM Alan Stern  wrote:
> > Most likely the problem occurs somewhere inside
> > quirk_usb_handoff_xhci().  Can Jesse add debugging statements to that
> > routine in order to pin down exactly where the problem lies?
> 
> Alan,
> 
> I added debug statements to quirk_usb_early_handoff, quirk_usb_disable_ehci &
> ehci_bios_handoff. The box hangs right before calling:
> 
> pci_write_config_byte(pdev, offset + 3, 1);

Right _before_ that line?  Not _after_ it?

That's surprising because the two preceding lines of code are the 
condition of an "if" statement and a dev_dbg() call.  I don't see how 
either of them could cause a hang.

Maybe the hang is a delayed reaction to something happening somewhere
else.  But on the assumption that it isn't, you could try commenting 
out various parts of ehci_bios_handoff to see which ones make a 
difference.

> which is in ehci_bios_handoff:
> 
> [   10.698240] DEBUG: Passed quirk_usb_early_handoff 1300
> [   10.704271] DEBUG: Passed quirk_usb_early_handoff 1308
> [   10.710206] DEBUG: Passed quirk_usb_disable_ehci 939
> [   10.715949] DEBUG: Passed quirk_usb_disable_ehci 945
> [   10.721685] DEBUG: Passed quirk_usb_disable_ehci 950
> [   10.727423] DEBUG: Passed quirk_usb_disable_ehci 958
> [   10.733160] DEBUG: Passed quirk_usb_disable_ehci 964
> [   10.738897] DEBUG: Passed quirk_usb_disable_ehci 968
> [   10.744633] DEBUG: Passed ehci_bios_handoff 849
> [   10.749884] DEBUG: Passed ehci_bios_handoff 884
> 
> I have attached the debug output, and my modified pci-quirks.c file
> to the bug report, let me know what else I can do to help.

Nothing was attached.

Alan Stern



Re: Regression causes a hang on boot with a Comtrol PCI card

2019-04-01 Thread Jesse Hathaway
On Fri, Mar 22, 2019 at 3:02 PM Jesse Hathaway  wrote:
> > Can you boot v5.0 vanilla with "initcall_debug"?  Maybe we can narrow
> > it down to a specific quirk.
>
> yup, added the "initcall_debug" output to the ticket:
> https://bugzilla.kernel.org/show_bug.cgi?id=202927, here is the tail end
>
> [   14.896337] NET: Registered protocol family 1
> [   14.901314] initcall af_unix_init+0x0/0x4e returned 0 after 4866 usecs
> [   14.908694] calling  ipv6_offload_init+0x0/0x7f @ 1
> [   14.914238] initcall ipv6_offload_init+0x0/0x7f returned 0 after 1 usecs
> [   14.921821] calling  vlan_offload_init+0x0/0x20 @ 1
> [   14.927365] initcall vlan_offload_init+0x0/0x20 returned 0 after 0 usecs
> [   14.934948] calling  pci_apply_final_quirks+0x0/0x126 @ 1
> [   14.941106] pci :00:1a.0: calling  quirk_usb_early_handoff+0x0/0x6a0 @ 
> 1

Bjorn, did you get a chance to look at the initcall_debug output for anything
obvious to you on what might be the cause of the problem?

Thanks, Jesse Hathaway


Re: Regression causes a hang on boot with a Comtrol PCI card

2019-03-22 Thread Jesse Hathaway
> So apparently the hang happens while we're running the "final" PCI
> fixups.  This happens after all the rest of PCI is initialized.
>
> Can you boot v5.0 vanilla with "initcall_debug"?  Maybe we can narrow
> it down to a specific quirk.

yup, added the "initcall_debug" output to the ticket:
https://bugzilla.kernel.org/show_bug.cgi?id=202927, here is the tail end

[   14.896337] NET: Registered protocol family 1
[   14.901314] initcall af_unix_init+0x0/0x4e returned 0 after 4866 usecs
[   14.908694] calling  ipv6_offload_init+0x0/0x7f @ 1
[   14.914238] initcall ipv6_offload_init+0x0/0x7f returned 0 after 1 usecs
[   14.921821] calling  vlan_offload_init+0x0/0x20 @ 1
[   14.927365] initcall vlan_offload_init+0x0/0x20 returned 0 after 0 usecs
[   14.934948] calling  pci_apply_final_quirks+0x0/0x126 @ 1
[   14.941106] pci :00:1a.0: calling  quirk_usb_early_handoff+0x0/0x6a0 @ 1

thanks, Jesse Hathaway


Re: Regression causes a hang on boot with a Comtrol PCI card

2019-03-21 Thread Bjorn Helgaas
On Thu, Mar 14, 2019 at 03:57:07PM -0500, Jesse Hathaway wrote:
> > > 1302fcf0d03e (refs/bisect/bad) PCI: Configure *all* devices, not just
> > > hot-added ones
> > > 1c3c5eab1715 sched/core: Enable might_sleep() and smp_processor_id()
> > > checks early
> >
> > How did you narrow it down to *two* commits, and do you have to revert
> > both of them to avoid the hang?  Usually a bisection identifies a
> > single commit, and the two you mention aren't related.
> 
> Sorry I should have been more verbose in what the bisection process was, I
> found the problem after attempting to upgrade from linux v3.16 to v4.9. When
> v4.9 hung I tried the latest kernel, v5.0, which also hanged. I began a git
> bisect, but found there was more than one bad commit. Here is my current
> understanding:
> 
> - [x] v3.18 vanilla, 1302fcf0d03e committed, hangs
> - [x] v3.18 with revert of 1302fcf0d03e, works
> .
> .
> .
> - [x] v4.12 vanilla, hangs
> - [x] v4.12 with revert of 1302fcf0d03e, works
> 
> - [x] v4.13 vanilla, 1c3c5eab1715 committed, hangs
> - [x] v4.13 with revert of 1302fcf0d03e, hangs
> - [x] v4.13 with revert of 1c3c5eab1715, hangs
> - [x] v4.13 with revert of 1302fcf0d03e & 1c3c5eab1715, works
> 
> - [x] v5.0 vanilla, hangs
> - [x] v5.0 with revert of 1302fcf0d03e & 1c3c5eab1715, works

Thanks!  I doubt either of those commits is the real problem, but
they're both related to system_state, so it's conceivable they're both
involved in exposing the problem.

> > Can you collect a complete dmesg log (with a working kernel) and
> > output of "sudo lspci -vvxxx"?  You can open a bug report at
> > https://bugzilla.kernel.org, attach the logs there, and respond here
> > with the URL.
> 
> Bug submitted along with the requested logs,
> https://bugzilla.kernel.org/show_bug.cgi?id=202927

Thanks for that.

> > Where does the hang happen?  Is it when we configure the Comtrol card?
> 
> Hang occurs after PCI is initialized, snippet below, I have included the full
> output in the bug report:
> 
> [   10.561971] pci :81:00.0:   bridge window [mem 0xc800-0xc80f]
> [   10.569661] pci :80:01.0: PCI bridge to [bus 81-82]
> [   10.575594] pci :80:01.0:   bridge window [mem 0xc800-0xc80f]
> [   10.583278] pci :80:03.0: PCI bridge to [bus 83]
> [   10.589008] NET: Registered protocol family 2
> [   10.594254] tcp_listen_portaddr_hash hash table entries: 65536
> (order: 8, 1048576 bytes)
> [   10.603671] TCP established hash table entries: 524288 (order: 10,
> 4194304 bytes)
> [   10.612729] TCP bind hash table entries: 65536 (order: 8, 1048576 bytes)
> [   10.620446] TCP: Hash tables configured (established 524288 bind 65536)
> [   10.628124] UDP hash table entries: 65536 (order: 9, 2097152 bytes)
> [   10.635541] UDP-Lite hash table entries: 65536 (order: 9, 2097152 bytes)
> [   10.643669] NET: Registered protocol family 1

The successful boot continues on with this:

  [   10.675996] pci :00:1a.0: quirk_usb_early_handoff+0x0/0x6a0 took 22519 
usecs
  [   10.684519] pci :03:00.0: [Firmware Bug]: disabling VPD access (can't 
determine size of non-standard VPD for)
  [   10.696404] pci :03:00.0: quirk_blacklist_vpd+0x0/0x30 took 11605 usecs
  [   10.704515] pci :0b:00.0: Video device with shadowed ROM at [mem 
0x000c-0x000d]

So apparently the hang happens while we're running the "final" PCI
fixups.  This happens after all the rest of PCI is initialized.

Can you boot v5.0 vanilla with "initcall_debug"?  Maybe we can narrow
it down to a specific quirk.

Bjorn


Re: Regression causes a hang on boot with a Comtrol PCI card

2019-03-14 Thread Jesse Hathaway
> > 1302fcf0d03e (refs/bisect/bad) PCI: Configure *all* devices, not just
> > hot-added ones
> > 1c3c5eab1715 sched/core: Enable might_sleep() and smp_processor_id()
> > checks early
>
> How did you narrow it down to *two* commits, and do you have to revert
> both of them to avoid the hang?  Usually a bisection identifies a
> single commit, and the two you mention aren't related.

Sorry I should have been more verbose in what the bisection process was, I
found the problem after attempting to upgrade from linux v3.16 to v4.9. When
v4.9 hung I tried the latest kernel, v5.0, which also hanged. I began a git
bisect, but found there was more than one bad commit. Here is my current
understanding:

- [x] v3.18 vanilla, 1302fcf0d03e committed, hangs
- [x] v3.18 with revert of 1302fcf0d03e, works
.
.
.
- [x] v4.12 vanilla, hangs
- [x] v4.12 with revert of 1302fcf0d03e, works

- [x] v4.13 vanilla, 1c3c5eab1715 committed, hangs
- [x] v4.13 with revert of 1302fcf0d03e, hangs
- [x] v4.13 with revert of 1c3c5eab1715, hangs
- [x] v4.13 with revert of 1302fcf0d03e & 1c3c5eab1715, works

- [x] v5.0 vanilla, hangs
- [x] v5.0 with revert of 1302fcf0d03e & 1c3c5eab1715, works

> Can you collect a complete dmesg log (with a working kernel) and
> output of "sudo lspci -vvxxx"?  You can open a bug report at
> https://bugzilla.kernel.org, attach the logs there, and respond here
> with the URL.

Bug submitted along with the requested logs,
https://bugzilla.kernel.org/show_bug.cgi?id=202927

> Where does the hang happen?  Is it when we configure the Comtrol card?

Hang occurs after PCI is initialized, snippet below, I have included the full
output in the bug report:

[   10.561971] pci :81:00.0:   bridge window [mem 0xc800-0xc80f]
[   10.569661] pci :80:01.0: PCI bridge to [bus 81-82]
[   10.575594] pci :80:01.0:   bridge window [mem 0xc800-0xc80f]
[   10.583278] pci :80:03.0: PCI bridge to [bus 83]
[   10.589008] NET: Registered protocol family 2
[   10.594254] tcp_listen_portaddr_hash hash table entries: 65536
(order: 8, 1048576 bytes)
[   10.603671] TCP established hash table entries: 524288 (order: 10,
4194304 bytes)
[   10.612729] TCP bind hash table entries: 65536 (order: 8, 1048576 bytes)
[   10.620446] TCP: Hash tables configured (established 524288 bind 65536)
[   10.628124] UDP hash table entries: 65536 (order: 9, 2097152 bytes)
[   10.635541] UDP-Lite hash table entries: 65536 (order: 9, 2097152 bytes)
[   10.643669] NET: Registered protocol family 1

Please let me know if there is anything else I can provide, I am also happy to
test any patches, Jesse Hathaway


Re: Regression causes a hang on boot with a Comtrol PCI card

2019-03-13 Thread Bjorn Helgaas
Hi Jesse,

On Wed, Mar 13, 2019 at 11:50:07AM -0500, Jesse Hathaway wrote:
> Two regressions cause Linux to hang on boot when a Comtrol PCI card
> is present.
> 
> If I revert the following two commits, I can boot again and the card
> operates without issue:
> 
> 1302fcf0d03e (refs/bisect/bad) PCI: Configure *all* devices, not just
> hot-added ones
> 1c3c5eab1715 sched/core: Enable might_sleep() and smp_processor_id()
> checks early

I'm very sorry about the regression, but thank you very much for
narrowing it down and reporting it!

How did you narrow it down to *two* commits, and do you have to revert
both of them to avoid the hang?  Usually a bisection identifies a
single commit, and the two you mention aren't related.

> ; lspci -vs 82:00.0
> 82:00.0 Multiport serial controller: Comtrol Corporation Device 0061
> Subsystem: Comtrol Corporation Device 0061
> Flags: 66MHz, medium devsel, IRQ 35, NUMA node 1
> Memory at c8004000 (32-bit, non-prefetchable) [size=4K]
> Memory at c800 (32-bit, non-prefetchable) [size=16K]
> Capabilities: [40] Hot-plug capable
> Capabilities: [48] Power Management version 2
> Kernel driver in use: rp2
> Kernel modules: rp2
> 
> Is it possible that the problem is that the card claims to support
> Hot-plug, but does not?
> 
> I would love to help fix this issue, please let me know what other
> information would be helpful to provide.

Can you collect a complete dmesg log (with a working kernel) and
output of "sudo lspci -vvxxx"?  You can open a bug report at
https://bugzilla.kernel.org, attach the logs there, and respond here
with the URL.

Where does the hang happen?  Is it when we configure the Comtrol card?

Bjorn


Regression causes a hang on boot with a Comtrol PCI card

2019-03-13 Thread Jesse Hathaway
Two regressions cause Linux to hang on boot when a Comtrol PCI card is present.

If I revert the following two commits, I can boot again and the card operates
without issue:

1302fcf0d03e (refs/bisect/bad) PCI: Configure *all* devices, not just
hot-added ones
1c3c5eab1715 sched/core: Enable might_sleep() and smp_processor_id()
checks early

; lspci -vs 82:00.0
82:00.0 Multiport serial controller: Comtrol Corporation Device 0061
Subsystem: Comtrol Corporation Device 0061
Flags: 66MHz, medium devsel, IRQ 35, NUMA node 1
Memory at c8004000 (32-bit, non-prefetchable) [size=4K]
Memory at c800 (32-bit, non-prefetchable) [size=16K]
Capabilities: [40] Hot-plug capable
Capabilities: [48] Power Management version 2
Kernel driver in use: rp2
Kernel modules: rp2

Is it possible that the problem is that the card claims to support Hot-plug,
but does not?

I would love to help fix this issue, please let me know what other information
would be helpful to provide.

; awk -f scripts/ver_linux
If some fields are empty or look unusual you may have an old version.
Compare to the current minimal requirements in Documentation/Changes.

Linux tty01 5.0.1-amd64 #1 SMP Wed Mar 13 15:43:44 UTC 2019 x86_64 GNU/Linux

GNU C   6.3.0
GNU Make4.1
Binutils2.28
Util-linux  2.29.2
Mount   2.29.2
Linux C Library 2.24
Dynamic linker (ldd)2.24
Procps  3.3.12
Sh-utils8.26
Udev232
Modules Loaded  8021q acpi_power_meter aesni_intel aes_x86_64
ahci autofs4 bonding button coretemp crc16 crc32c_generic crc32c_intel
crc32_pclmul crct10dif_pclmul cryptd crypto_simd dca dcdbas dm_mod drm
drm_kms_helper ehci_hcd ehci_pci evdev ext4 fscrypto garp
ghash_clmulni_intel glue_helper i2c_algo_bit igb intel_cstate
intel_powerclamp intel_rapl intel_rapl_perf intel_uncore ioatdma
ipmi_devintf ipmi_msghandler ipmi_si iptable_filter ip_tables
irqbypass iTCO_vendor_support iTCO_wdt ixgbe jbd2 kvm kvm_intel
libahci libata libcrc32c libphy llc lpc_ich mbcache mdio megaraid_sas
mei mei_me mgag200 mrp mxm_wmi nf_conntrack nf_defrag_ipv4
nf_defrag_ipv6 pcc_cpufreq pcspkr rp2 sb_edac scsi_mod sd_mod sg snd
snd_pcm snd_timer soundcore stp ttm usbcore wmi x86_pkg_temp_thermal
xfrm_algo x_tables xt_conntrack xt_tcpudp