Re: Regression causes a hang on boot with a Comtrol PCI card
> > 1302fcf0d03e (refs/bisect/bad) PCI: Configure *all* devices, not just
> > hot-added ones
> > 1c3c5eab1715 sched/core: Enable might_sleep() and smp_processor_id()
> > checks early
>
> How did you narrow it down to *two* commits, and do you have to revert
> both of them to avoid the hang? Usually a bisection identifies a
> single commit, and the two you mention aren't related.

Sorry, I should have been more verbose about what the bisection process
was. I found the problem after attempting to upgrade from Linux v3.16 to
v4.9. When v4.9 hung I tried the latest kernel, v5.0, which also hung. I
began a git bisect, but found there was more than one bad commit. Here is
my current understanding:

- [x] v3.18 vanilla, 1302fcf0d03e committed, hangs
- [x] v3.18 with revert of 1302fcf0d03e, works
.
.
.
- [x] v4.12 vanilla, hangs
- [x] v4.12 with revert of 1302fcf0d03e, works

- [x] v4.13 vanilla, 1c3c5eab1715 committed, hangs
- [x] v4.13 with revert of 1302fcf0d03e, hangs
- [x] v4.13 with revert of 1c3c5eab1715, hangs
- [x] v4.13 with revert of 1302fcf0d03e & 1c3c5eab1715, works

- [x] v5.0 vanilla, hangs
- [x] v5.0 with revert of 1302fcf0d03e & 1c3c5eab1715, works

> Can you collect a complete dmesg log (with a working kernel) and
> output of "sudo lspci -vvxxx"? You can open a bug report at
> https://bugzilla.kernel.org, attach the logs there, and respond here
> with the URL.

Bug submitted along with the requested logs:
https://bugzilla.kernel.org/show_bug.cgi?id=202927

> Where does the hang happen? Is it when we configure the Comtrol card?

The hang occurs after PCI is initialized. Snippet below; I have included
the full output in the bug report:

[ 10.561971] pci :81:00.0: bridge window [mem 0xc800-0xc80f]
[ 10.569661] pci :80:01.0: PCI bridge to [bus 81-82]
[ 10.575594] pci :80:01.0: bridge window [mem 0xc800-0xc80f]
[ 10.583278] pci :80:03.0: PCI bridge to [bus 83]
[ 10.589008] NET: Registered protocol family 2
[ 10.594254] tcp_listen_portaddr_hash hash table entries: 65536 (order: 8, 1048576 bytes)
[ 10.603671] TCP established hash table entries: 524288 (order: 10, 4194304 bytes)
[ 10.612729] TCP bind hash table entries: 65536 (order: 8, 1048576 bytes)
[ 10.620446] TCP: Hash tables configured (established 524288 bind 65536)
[ 10.628124] UDP hash table entries: 65536 (order: 9, 2097152 bytes)
[ 10.635541] UDP-Lite hash table entries: 65536 (order: 9, 2097152 bytes)
[ 10.643669] NET: Registered protocol family 1

Please let me know if there is anything else I can provide. I am also
happy to test any patches.

Jesse Hathaway
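
One way to read the test matrix above is to ask, per kernel version, which is the smallest tested set of reverts that boots. The following is only a sketch of that reasoning in Python: the results are transcribed from the list in this message, while the names `RESULTS` and `minimal_working_reverts` are hypothetical and not part of any kernel tooling.

```python
PCI = "1302fcf0d03e"    # PCI: Configure *all* devices, not just hot-added ones
SCHED = "1c3c5eab1715"  # sched/core: Enable might_sleep() and smp_processor_id() checks early

# (kernel version, set of reverted commits) -> True if the box booted.
# Transcribed from the results reported in this message; the untested
# versions between v3.18 and v4.12 are deliberately left out.
RESULTS = {
    ("v3.18", frozenset()): False,
    ("v3.18", frozenset({PCI})): True,
    ("v4.12", frozenset()): False,
    ("v4.12", frozenset({PCI})): True,
    ("v4.13", frozenset()): False,
    ("v4.13", frozenset({PCI})): False,
    ("v4.13", frozenset({SCHED})): False,
    ("v4.13", frozenset({PCI, SCHED})): True,
    ("v5.0", frozenset()): False,
    ("v5.0", frozenset({PCI, SCHED})): True,
}


def minimal_working_reverts(version):
    """Return the smallest tested revert set that boots, or None if none did."""
    working = [reverts for (ver, reverts), booted in RESULTS.items()
               if ver == version and booted]
    return min(working, key=len) if working else None


if __name__ == "__main__":
    for version in ("v3.18", "v4.12", "v4.13", "v5.0"):
        print(version, sorted(minimal_working_reverts(version)))
```

Read this way, one revert is enough up to v4.12 and both are needed from v4.13 on, which is why a plain bisection did not converge on a single commit.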
Re: Regression causes a hang on boot with a Comtrol PCI card
On Mon, 15 Apr 2019, Jesse Hathaway wrote:

> On Sat, Apr 6, 2019 at 10:32 AM Alan Stern wrote:
> > Well, at least that's forward progress. I don't know what pstore is or
> > what connection it has to the USB subsystem. Does the machine hang
> > similarly if you boot without the Comtrol PCI card present?
>
> Yes the box boots fine when the Comtrol PCI card is *not* present.
>
> > For that matter, what happens if you remove EHCI from the kernel
> > configuration completely?
>
> If I remove USB support, the box still hangs after registering the pstore, but
> if I remove pstore support and APEI support from the kernel then the box boots
> without issue.
>
> > As for how the PCI card affects the USB handoff, it depends on how the
> > BIOS behaves. Normally the BIOS will take control of all the available
> > EHCI controllers during bootup (so that it can use them to communicate
> > with a USB keyboard or mouse), including controllers on add-on PCI
> > cards as well as those on the motherboard. When the kernel starts up,
> > it tries to take ownership of the controllers away from the BIOS
> > (that's the handoff) so that Linux can use them. However, if the BIOS
> > was never tested for handoff of USB controllers on add-on PCI cards, it
> > could easily have a bug that would crash the machine.
>
> The Comtrol card provides 32 serial ports, via a breakout box, but it has
> no USB functionality, which was why I was surprised that its presence
> somehow breaks the USB hand off.

Well, I am completely mystified. Nor do I understand how the commits you
identified could be related, although maybe the relationship is very
indirect.

Whatever the source of the problem, I don't think you're going to find
it by looking at the USB code. Perhaps the early initialization of the
functions that _are_ present on the Comtrol card somehow messes up
other parts of the system.

Alan Stern
Re: Regression causes a hang on boot with a Comtrol PCI card
Hi Jesse,

On Wed, Mar 13, 2019 at 11:50:07AM -0500, Jesse Hathaway wrote:
> Two regressions cause Linux to hang on boot when a Comtrol PCI card
> is present.
>
> If I revert the following two commits, I can boot again and the card
> operates without issue:
>
> 1302fcf0d03e (refs/bisect/bad) PCI: Configure *all* devices, not just
> hot-added ones
> 1c3c5eab1715 sched/core: Enable might_sleep() and smp_processor_id()
> checks early

I'm very sorry about the regression, but thank you very much for
narrowing it down and reporting it!

How did you narrow it down to *two* commits, and do you have to revert
both of them to avoid the hang? Usually a bisection identifies a
single commit, and the two you mention aren't related.

> ; lspci -vs 82:00.0
> 82:00.0 Multiport serial controller: Comtrol Corporation Device 0061
>         Subsystem: Comtrol Corporation Device 0061
>         Flags: 66MHz, medium devsel, IRQ 35, NUMA node 1
>         Memory at c8004000 (32-bit, non-prefetchable) [size=4K]
>         Memory at c800 (32-bit, non-prefetchable) [size=16K]
>         Capabilities: [40] Hot-plug capable
>         Capabilities: [48] Power Management version 2
>         Kernel driver in use: rp2
>         Kernel modules: rp2
>
> Is it possible that the problem is that the card claims to support
> Hot-plug, but does not?
>
> I would love to help fix this issue, please let me know what other
> information would be helpful to provide.

Can you collect a complete dmesg log (with a working kernel) and
output of "sudo lspci -vvxxx"? You can open a bug report at
https://bugzilla.kernel.org, attach the logs there, and respond here
with the URL.

Where does the hang happen? Is it when we configure the Comtrol card?

Bjorn
Re: Regression causes a hang on boot with a Comtrol PCI card
On Fri, Mar 22, 2019 at 3:02 PM Jesse Hathaway wrote:
> > Can you boot v5.0 vanilla with "initcall_debug"? Maybe we can narrow
> > it down to a specific quirk.
>
> yup, added the "initcall_debug" output to the ticket:
> https://bugzilla.kernel.org/show_bug.cgi?id=202927, here is the tail end
>
> [ 14.896337] NET: Registered protocol family 1
> [ 14.901314] initcall af_unix_init+0x0/0x4e returned 0 after 4866 usecs
> [ 14.908694] calling ipv6_offload_init+0x0/0x7f @ 1
> [ 14.914238] initcall ipv6_offload_init+0x0/0x7f returned 0 after 1 usecs
> [ 14.921821] calling vlan_offload_init+0x0/0x20 @ 1
> [ 14.927365] initcall vlan_offload_init+0x0/0x20 returned 0 after 0 usecs
> [ 14.934948] calling pci_apply_final_quirks+0x0/0x126 @ 1
> [ 14.941106] pci :00:1a.0: calling quirk_usb_early_handoff+0x0/0x6a0 @ 1

Bjorn, did you get a chance to look at the initcall_debug output for
anything obvious to you on what might be the cause of the problem?

Thanks, Jesse Hathaway
Re: Regression causes a hang on boot with a Comtrol PCI card
On Thu, 4 Apr 2019, Jesse Hathaway wrote:

> On Tue, Apr 2, 2019 at 9:29 AM Alan Stern wrote:
> > Most likely the problem occurs somewhere inside
> > quirk_usb_handoff_xhci(). Can Jesse add debugging statements to that
> > routine in order to pin down exactly where the problem lies?
>
> Alan,
>
> I added debug statements to quirk_usb_early_handoff, quirk_usb_disable_ehci &
> ehci_bios_handoff. The box hangs right before calling:
>
> pci_write_config_byte(pdev, offset + 3, 1);

Right _before_ that line? Not _after_ it? That's surprising because
the two preceding lines of code are the condition of an "if" statement
and a dev_dbg() call. I don't see how either of them could cause a
hang.

Maybe the hang is a delayed reaction to something happening somewhere
else. But on the assumption that it isn't, you could try commenting
out various parts of ehci_bios_handoff to see which ones make a
difference.

> which is in ehci_bios_handoff:
>
> [ 10.698240] DEBUG: Passed quirk_usb_early_handoff 1300
> [ 10.704271] DEBUG: Passed quirk_usb_early_handoff 1308
> [ 10.710206] DEBUG: Passed quirk_usb_disable_ehci 939
> [ 10.715949] DEBUG: Passed quirk_usb_disable_ehci 945
> [ 10.721685] DEBUG: Passed quirk_usb_disable_ehci 950
> [ 10.727423] DEBUG: Passed quirk_usb_disable_ehci 958
> [ 10.733160] DEBUG: Passed quirk_usb_disable_ehci 964
> [ 10.738897] DEBUG: Passed quirk_usb_disable_ehci 968
> [ 10.744633] DEBUG: Passed ehci_bios_handoff 849
> [ 10.749884] DEBUG: Passed ehci_bios_handoff 884
>
> I have attached the debug output, and my modified pci-quirks.c file
> to the bug report, let me know what else I can do to help.

Nothing was attached.

Alan Stern
Re: Regression causes a hang on boot with a Comtrol PCI card
On Tue, 23 Apr 2019, Jesse Hathaway wrote:

> On Tue, Apr 16, 2019 at 10:00 AM Alan Stern wrote:
> > Whatever the source of the problem, I don't think you're going to find
> > it by looking at the USB code. Perhaps the early initialization of the
> > functions that _are_ present on the Comtrol card somehow messes up
> > other parts of the system.
>
> Thanks for all your help, Alan,
>
> I think at this point my Linux kernel debugging knowledge has run
> aground. I would love to crack this nut, but I doubt that will be
> possible with my current knowledge base. For now I am going to run the
> patched kernel even if I don't understand precisely why the reverts
> prevent the box from hanging on boot when the Comtrol card is present.

Okay, that's understandable. Sorry I couldn't do more to help.

Alan Stern
Re: Regression causes a hang on boot with a Comtrol PCI card
On Thu, Mar 14, 2019 at 03:57:07PM -0500, Jesse Hathaway wrote:
> > > 1302fcf0d03e (refs/bisect/bad) PCI: Configure *all* devices, not just
> > > hot-added ones
> > > 1c3c5eab1715 sched/core: Enable might_sleep() and smp_processor_id()
> > > checks early
> >
> > How did you narrow it down to *two* commits, and do you have to revert
> > both of them to avoid the hang? Usually a bisection identifies a
> > single commit, and the two you mention aren't related.
>
> Sorry, I should have been more verbose about what the bisection process
> was. I found the problem after attempting to upgrade from Linux v3.16 to
> v4.9. When v4.9 hung I tried the latest kernel, v5.0, which also hung. I
> began a git bisect, but found there was more than one bad commit. Here is
> my current understanding:
>
> - [x] v3.18 vanilla, 1302fcf0d03e committed, hangs
> - [x] v3.18 with revert of 1302fcf0d03e, works
> .
> .
> .
> - [x] v4.12 vanilla, hangs
> - [x] v4.12 with revert of 1302fcf0d03e, works
>
> - [x] v4.13 vanilla, 1c3c5eab1715 committed, hangs
> - [x] v4.13 with revert of 1302fcf0d03e, hangs
> - [x] v4.13 with revert of 1c3c5eab1715, hangs
> - [x] v4.13 with revert of 1302fcf0d03e & 1c3c5eab1715, works
>
> - [x] v5.0 vanilla, hangs
> - [x] v5.0 with revert of 1302fcf0d03e & 1c3c5eab1715, works

Thanks! I doubt either of those commits is the real problem, but
they're both related to system_state, so it's conceivable they're both
involved in exposing the problem.

> > Can you collect a complete dmesg log (with a working kernel) and
> > output of "sudo lspci -vvxxx"? You can open a bug report at
> > https://bugzilla.kernel.org, attach the logs there, and respond here
> > with the URL.
>
> Bug submitted along with the requested logs:
> https://bugzilla.kernel.org/show_bug.cgi?id=202927

Thanks for that.

> > Where does the hang happen? Is it when we configure the Comtrol card?
>
> The hang occurs after PCI is initialized. Snippet below; I have included
> the full output in the bug report:
>
> [ 10.561971] pci :81:00.0: bridge window [mem 0xc800-0xc80f]
> [ 10.569661] pci :80:01.0: PCI bridge to [bus 81-82]
> [ 10.575594] pci :80:01.0: bridge window [mem 0xc800-0xc80f]
> [ 10.583278] pci :80:03.0: PCI bridge to [bus 83]
> [ 10.589008] NET: Registered protocol family 2
> [ 10.594254] tcp_listen_portaddr_hash hash table entries: 65536 (order: 8, 1048576 bytes)
> [ 10.603671] TCP established hash table entries: 524288 (order: 10, 4194304 bytes)
> [ 10.612729] TCP bind hash table entries: 65536 (order: 8, 1048576 bytes)
> [ 10.620446] TCP: Hash tables configured (established 524288 bind 65536)
> [ 10.628124] UDP hash table entries: 65536 (order: 9, 2097152 bytes)
> [ 10.635541] UDP-Lite hash table entries: 65536 (order: 9, 2097152 bytes)
> [ 10.643669] NET: Registered protocol family 1

The successful boot continues on with this:

[ 10.675996] pci :00:1a.0: quirk_usb_early_handoff+0x0/0x6a0 took 22519 usecs
[ 10.684519] pci :03:00.0: [Firmware Bug]: disabling VPD access (can't determine size of non-standard VPD format)
[ 10.696404] pci :03:00.0: quirk_blacklist_vpd+0x0/0x30 took 11605 usecs
[ 10.704515] pci :0b:00.0: Video device with shadowed ROM at [mem 0x000c-0x000d]

So apparently the hang happens while we're running the "final" PCI
fixups. This happens after all the rest of PCI is initialized.

Can you boot v5.0 vanilla with "initcall_debug"? Maybe we can narrow
it down to a specific quirk.

Bjorn
Re: Regression causes a hang on boot with a Comtrol PCI card
> So apparently the hang happens while we're running the "final" PCI
> fixups. This happens after all the rest of PCI is initialized.
>
> Can you boot v5.0 vanilla with "initcall_debug"? Maybe we can narrow
> it down to a specific quirk.

yup, added the "initcall_debug" output to the ticket:
https://bugzilla.kernel.org/show_bug.cgi?id=202927, here is the tail end

[ 14.896337] NET: Registered protocol family 1
[ 14.901314] initcall af_unix_init+0x0/0x4e returned 0 after 4866 usecs
[ 14.908694] calling ipv6_offload_init+0x0/0x7f @ 1
[ 14.914238] initcall ipv6_offload_init+0x0/0x7f returned 0 after 1 usecs
[ 14.921821] calling vlan_offload_init+0x0/0x20 @ 1
[ 14.927365] initcall vlan_offload_init+0x0/0x20 returned 0 after 0 usecs
[ 14.934948] calling pci_apply_final_quirks+0x0/0x126 @ 1
[ 14.941106] pci :00:1a.0: calling quirk_usb_early_handoff+0x0/0x6a0 @ 1

thanks, Jesse Hathaway
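
For a longer "initcall_debug" log, the calls that were entered but never completed can be picked out mechanically. The following Python sketch assumes only the three line shapes that appear in this thread ("calling X @ N", "initcall X returned ...", and "X took N usecs" for PCI quirks); the function name `unfinished_calls` is made up for illustration, and other kernel versions may format these lines differently.

```python
import re

# "calling <symbol> @ <pid>" marks entry into an initcall or PCI quirk.
CALLING = re.compile(r"calling\s+(\S+)\s+@\s+\d+")
# Completion is reported either as "initcall <symbol> returned ..."
# or, for PCI quirks, as "<symbol> took <n> usecs".
FINISHED = re.compile(r"initcall\s+(\S+)\s+returned|(\S+)\s+took\s+\d+\s+usecs")


def unfinished_calls(log_lines):
    """Return symbols that were entered but never reported as finished."""
    pending = []
    for line in log_lines:
        entered = CALLING.search(line)
        if entered:
            pending.append(entered.group(1))
            continue
        done = FINISHED.search(line)
        if done:
            name = done.group(1) or done.group(2)
            if name in pending:
                # Drop the most recent matching entry.
                del pending[len(pending) - 1 - pending[::-1].index(name)]
    return pending


# Tail of the hanging boot, copied from the message above.
TAIL = """\
[ 14.896337] NET: Registered protocol family 1
[ 14.901314] initcall af_unix_init+0x0/0x4e returned 0 after 4866 usecs
[ 14.908694] calling ipv6_offload_init+0x0/0x7f @ 1
[ 14.914238] initcall ipv6_offload_init+0x0/0x7f returned 0 after 1 usecs
[ 14.921821] calling vlan_offload_init+0x0/0x20 @ 1
[ 14.927365] initcall vlan_offload_init+0x0/0x20 returned 0 after 0 usecs
[ 14.934948] calling pci_apply_final_quirks+0x0/0x126 @ 1
[ 14.941106] pci :00:1a.0: calling quirk_usb_early_handoff+0x0/0x6a0 @ 1
""".splitlines()

if __name__ == "__main__":
    for symbol in unfinished_calls(TAIL):
        print("never finished:", symbol)
```

On the tail above this leaves pci_apply_final_quirks and, nested inside it, quirk_usb_early_handoff for device 00:1a.0 as the calls still in flight when the box hung.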