[Bug 1946149] Re: Bionic/linux-aws Boot failure downgrading from Bionic/linux-aws-5.4 on r5.metal
** Tags removed: verification-needed-bionic
** Tags added: verification-done-bionic

--
You received this bug notification because you are a member of Ubuntu Bugs, which is subscribed to Ubuntu.
https://bugs.launchpad.net/bugs/1946149

Title:
  Bionic/linux-aws Boot failure downgrading from Bionic/linux-aws-5.4 on r5.metal

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1946149/+subscriptions

--
ubuntu-bugs mailing list
ubuntu-bugs@lists.ubuntu.com
https://lists.ubuntu.com/mailman/listinfo/ubuntu-bugs
[Bug 1946149] Re: Bionic/linux-aws Boot failure downgrading from Bionic/linux-aws-5.4 on r5.metal
This bug is awaiting verification that the linux-kvm/4.15.0-1102.104 kernel in -proposed solves the problem. Please test the kernel and update this bug with the results. If the problem is solved, change the tag 'verification-needed-bionic' to 'verification-done-bionic'. If the problem still exists, change the tag 'verification-needed-bionic' to 'verification-failed-bionic'. If verification is not done within 5 working days from today, this fix will be dropped from the source code and this bug will be closed.

See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how to enable and use -proposed. Thank you!

** Tags added: verification-needed-bionic
[Bug 1946149] Re: Bionic/linux-aws Boot failure downgrading from Bionic/linux-aws-5.4 on r5.metal
This bug was fixed in the package linux - 4.15.0-161.169

---
linux (4.15.0-161.169) bionic; urgency=medium

  * bionic/linux: 4.15.0-161.169 -proposed tracker (LP: #1947358)

  * Bionic/linux-aws Boot failure downgrading from Bionic/linux-aws-5.4 on
    r5.metal (LP: #1946149)
    - SAUCE: Revert "PCI/MSI: Enforce MSI[X] entry updates to be visible"
    - SAUCE: Revert "PCI/MSI: Enforce that MSI-X table entry is masked for update"

linux (4.15.0-160.168) bionic; urgency=medium

  * bionic/linux: 4.15.0-160.168 -proposed tracker (LP: #1944933)

  * Packaging resync (LP: #1786013)
    - debian/dkms-versions -- update from kernel-versions (main/2021.09.27)

  * ext4 journal recovery fails w/ data=journal + mmap (LP: #1847340)
    - jbd2: introduce/export functions jbd2_journal_submit|finish_inode_data_buffers()
    - jbd2, ext4, ocfs2: introduce/use journal callbacks j_submit|finish_inode_data_buffers()
    - ext4: data=journal: fixes for ext4_page_mkwrite()
    - ext4: data=journal: write-protect pages on j_submit_inode_data_buffers()
    - ext4: fix mmap write protection for data=journal mode

  * CVE-2021-40490
    - ext4: fix race writing to an inline_data file while its xattrs are changing

  * Bionic update: upstream stable patchset 2021-09-22 (LP: #1944600)
    - iio: humidity: hdc100x: Add margin to the conversion time
    - iio: adc: Fix incorrect exit of for-loop
    - ASoC: intel: atom: Fix reference to PCM buffer address
    - i2c: dev: zero out array used for i2c reads from userspace
    - ACPI: NFIT: Fix support for virtual SPA ranges
    - ASoC: cs42l42: Correct definition of ADC Volume control
    - ASoC: cs42l42: Don't allow SND_SOC_DAIFMT_LEFT_J
    - ASoC: cs42l42: Fix inversion of ADC Notch Switch control
    - ASoC: cs42l42: Remove duplicate control for WNF filter frequency
    - net: dsa: mt7530: add the missing RxUnicast MIB counter
    - ppp: Fix generating ifname when empty IFLA_IFNAME is specified
    - psample: Add a fwd declaration for skbuff
    - net: Fix memory leak in ieee802154_raw_deliver
    - net: bridge: fix memleak in br_add_if()
    - tcp_bbr: fix u32 wrap bug in round logic if bbr_init() called after 2B packets
    - xen/events: Fix race in set_evtchn_to_irq
    - vsock/virtio: avoid potential deadlock when vsock device remove
    - powerpc/kprobes: Fix kprobe Oops happens in booke
    - x86/tools: Fix objdump version check again
    - x86/resctrl: Fix default monitoring groups reporting
    - PCI/MSI: Enable and mask MSI-X early
    - PCI/MSI: Do not set invalid bits in MSI mask
    - PCI/MSI: Correct misleading comments
    - PCI/MSI: Use msi_mask_irq() in pci_msi_shutdown()
    - PCI/MSI: Protect msi_desc::masked for multi-MSI
    - PCI/MSI: Mask all unused MSI-X entries
    - PCI/MSI: Enforce that MSI-X table entry is masked for update
    - PCI/MSI: Enforce MSI[X] entry updates to be visible
    - vmlinux.lds.h: Handle clang's module.{c,d}tor sections
    - mac80211: drop data frames without key on encrypted links
    - x86/fpu: Make init_fpstate correct with optimized XSAVE
    - ath: Use safer key clearing with key cache entries
    - ath9k: Clear key cache explicitly on disabling hardware
    - ath: Export ath_hw_keysetmac()
    - ath: Modify ath_key_delete() to not need full key entry
    - ath9k: Postpone key cache entry deletion for TXQ frames reference it
    - dmaengine: usb-dmac: Fix PM reference leak in usb_dmac_probe()
    - ARM: dts: am43x-epos-evm: Reduce i2c0 bus speed for tps65218
    - dmaengine: of-dma: router_xlate to return -EPROBE_DEFER if controller is not yet available
    - scsi: megaraid_mm: Fix end of loop tests for list_for_each_entry()
    - scsi: scsi_dh_rdac: Avoid crash during rdac_bus_attach()
    - scsi: core: Avoid printing an error if target_alloc() returns -ENXIO
    - ARM: dts: nomadik: Fix up interrupt controller node names
    - net: usb: lan78xx: don't modify phy_device state concurrently
    - Bluetooth: hidp: use correct wait queue when removing ctrl_wait
    - dccp: add do-while-0 stubs for dccp_pr_debug macros
    - vhost: Fix the calculation in vhost_overflow()
    - bnxt: don't lock the tx queue from napi poll
    - net: 6pack: fix slab-out-of-bounds in decode_data
    - ptp_pch: Restore dependency on PCI
    - net: qlcnic: add missed unlock in qlcnic_83xx_flash_read32
    - net: mdio-mux: Don't ignore memory allocation errors
    - net: mdio-mux: Handle -EPROBE_DEFER correctly
    - mmc: dw_mmc: Fix hang on data CRC error
    - ALSA: hda - fix the 'Capture Switch' value change notifications
    - ipack: tpci200: fix many double free issues in tpci200_pci_probe
    - btrfs: prevent rename2 from exchanging a subvol with a directory from different parents
    - ASoC: intel: atom: Fix breakage for PCM buffer address setup
    - locks: print a warning when mount fails due to lack of "mand" support
    - fs: warn about impending deprecation of mandatory locks
    - netfilter:
[Bug 1946149] Re: Bionic/linux-aws Boot failure downgrading from Bionic/linux-aws-5.4 on r5.metal
I confirm that bionic/linux 4.15.0-161.169 and bionic/linux-aws 4.15.0-1114.121 are not experiencing the reported boot issues on AWS r5.metal or on any other platform/instance.
[Bug 1946149] Re: Bionic/linux-aws Boot failure downgrading from Bionic/linux-aws-5.4 on r5.metal
** Changed in: linux (Ubuntu Bionic)
       Status: In Progress => Fix Committed
[Bug 1946149] Re: Bionic/linux-aws Boot failure downgrading from Bionic/linux-aws-5.4 on r5.metal
** Also affects: linux-aws (Ubuntu Bionic)
   Importance: Undecided
       Status: New

** Changed in: linux-aws (Ubuntu Bionic)
   Importance: Undecided => High

** Changed in: linux-aws (Ubuntu Bionic)
       Status: New => In Progress

** Package changed: linux-aws (Ubuntu) => linux (Ubuntu)

** Changed in: linux (Ubuntu)
       Status: New => Invalid
[Bug 1946149] Re: Bionic/linux-aws Boot failure downgrading from Bionic/linux-aws-5.4 on r5.metal
** Description changed:

- When creating an r5.metal instance on AWS, the default kernel is
- bionic/linux-aws-5.4 (5.4.0-1056-aws); when changing to
- bionic/linux-aws (4.15.0-1113-aws) the machine fails to boot the 4.15 kernel.
+ [ Impact ]
+ The bionic 4.15 kernels are failing to boot on r5.metal instances on AWS. The default kernel is bionic/linux-aws-5.4 (5.4.0-1056-aws); when changing to bionic/linux-aws (4.15.0-1113-aws) or bionic/linux (4.15.0-160.168) the machine fails to boot the 4.15 kernel.
+
+ This problem only appears on metal instances, which use NVMe instead of xvda devices.
+
+ [ Fix ]
+ It was discovered that after reverting the following two commits from upstream stable, the 4.15 kernels can be booted again on the affected AWS metal instance:
+
+ PCI/MSI: Enforce that MSI-X table entry is masked for update
+ PCI/MSI: Enforce MSI[X] entry updates to be visible
+
+ [ Test Case ]
+ Deploy an r5.metal instance on AWS with a bionic image, which should boot initially with bionic/linux-aws-5.4. Install bionic/linux or bionic/linux-aws (4.15 based) and reboot the system.
+
+ [ Where problems could occur ]
+ These two commits are part of a larger patchset fixing PCI/MSI issues which was backported to some upstream stable releases. By reverting only part of the set we might end up with MSI issues that were not present before the whole set was applied. Regression potential can be minimized by testing the kernels with these two patches reverted on all available platforms.
+
+ [ Original Description ]
+ When creating an r5.metal instance on AWS, the default kernel is bionic/linux-aws-5.4 (5.4.0-1056-aws); when changing to bionic/linux-aws (4.15.0-1113-aws) the machine fails to boot the 4.15 kernel.

If I remove these patches, the instance correctly boots the 4.15 kernel:
https://lists.ubuntu.com/archives/kernel-team/2021-September/123963.html

With that being said, after successfully updating to 4.15 without those patches applied, I can then upgrade to a 4.15 kernel with the above patches included, and the instance will boot properly.

This problem only appears on metal instances, which use NVMe instead of xvda devices. AWS instances also use the 'discard' mount option with ext4, so I thought maybe there could be a race condition between ext4 discard and journal flush. I removed 'discard' from the mount options and rebooted the 5.4 kernel prior to the 4.15 kernel installation, but the instance still wouldn't boot after installing the 4.15 kernel.

I have been unable to capture a stack trace using 'aws get-console-output'. After enabling kdump I was unable to replicate the failure. So there must be some sort of race with either ext4 and/or nvme.

** Description changed:

  [ Impact ]
- The bionic 4.15 kernels are failing to boot on r5.metal instances on AWS
- . The default kernel is bionic/linux-aws-5.4(5.4.0-1056-aws), when changing to bionic/linux-aws(4.15.0-1113-aws) or bionic/linux (4.15.0-160.168) the machine fails to boot the 4.15 kernel.
+ The bionic 4.15 kernels are failing to boot on r5.metal instances on AWS. The default kernel is bionic/linux-aws-5.4 (5.4.0-1056-aws); when changing to bionic/linux-aws (4.15.0-1113-aws) or bionic/linux (4.15.0-160.168) the machine fails to boot the 4.15 kernel.

  This problem only appears on metal instances, which use NVMe instead of xvda devices.

  [ Fix ]
  It was discovered that after reverting the following two commits from upstream stable, the 4.15 kernels can be booted again on the affected AWS metal instance:

  PCI/MSI: Enforce that MSI-X table entry is masked for update
  PCI/MSI: Enforce MSI[X] entry updates to be visible

  [ Test Case ]
  Deploy an r5.metal instance on AWS with a bionic image, which should boot initially with bionic/linux-aws-5.4. Install bionic/linux or bionic/linux-aws (4.15 based) and reboot the system.

  [ Where problems could occur ]
  These two commits are part of a larger patchset fixing PCI/MSI issues which was backported to some upstream stable releases. By reverting only part of the set we might end up with MSI issues that were not present before the whole set was applied. Regression potential can be minimized by testing the kernels with these two patches reverted on all available platforms.

  [ Original Description ]
  When creating an r5.metal instance on AWS, the default kernel is bionic/linux-aws-5.4 (5.4.0-1056-aws); when changing to bionic/linux-aws (4.15.0-1113-aws) the machine fails to boot the 4.15 kernel.

  If I remove these patches, the instance correctly boots the 4.15 kernel:
  https://lists.ubuntu.com/archives/kernel-team/2021-September/123963.html

  With that being said, after successfully updating to the 4.15 without those patches applied, I can then upgrade to a 4.15 kernel with the above patches included, and the instance will boot properly.

  This problem only appears on metal instances, which uses NVME
[Bug 1946149] Re: Bionic/linux-aws Boot failure downgrading from Bionic/linux-aws-5.4 on r5.metal
As I was bisecting the commits, I was attempting to take advantage of parallelism: while my test kernel was building I would deploy a clean AWS r5.metal instance. I started seeing test kernels boot that I wouldn't expect to boot. So as a sanity test, I deployed an r5.metal instance, let it sit idle for 20 minutes, and then installed the known problematic 4.15.0-1113-aws kernel. Sure enough, it booted fine. I tried the same thing again, letting it sit idle for 20 minutes, and it worked again. So this does appear to be a race condition. I think this also explains some of the erratic test results I've seen while looking at this bug. Fortunately the console output gave us some definitive proof as to where the problem was occurring.

With that being said, it appears I have found the offending commits:

PCI/MSI: Enforce that MSI-X table entry is masked for update
PCI/MSI: Enforce MSI[X] entry updates to be visible

https://git.launchpad.net/~canonical-kernel/ubuntu/+source/linux-aws/+git/bionic/commit/?id=27571f5ea1dd074924b41a455c50dc2278e8c2b7
https://git.launchpad.net/~canonical-kernel/ubuntu/+source/linux-aws/+git/bionic/commit/?id=2478f358c2b35fea04e005447ce99ad8dc53fd5d

More specifically, the hang is introduced by 'PCI/MSI: Enforce that MSI-X table entry is masked for update', but it isn't a clean revert without also reverting the other commit, so for a quick test confirmation I reverted both. I have not had a chance to determine why these commits are causing the problem, but with these reverted in a test build on top of 4.15.0-1113-aws, I can migrate from 5.4 to 4.15 as soon as the instance is available. I've done at least 6 attempts now and all have passed, while doing the same steps without the reverts all have hung (unless I wait 20 minutes).
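The mechanics of reverting a dependent pair of commits together, as described above, can be sketched in a throwaway repository. This is purely illustrative: dummy file contents stand in for the real linux-aws tree, and in practice the two commit IDs linked above would take the place of HEAD and HEAD~1.

```shell
# Illustrative sketch only: reverting a dependent pair of commits together.
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.email demo@example.com
git config user.name demo
echo base > file.c
git add file.c && git commit -qm base
echo "masked for update" >> file.c
git commit -qam "PCI/MSI: Enforce that MSI-X table entry is masked for update"
echo "updates visible" >> file.c
git commit -qam "PCI/MSI: Enforce MSI[X] entry updates to be visible"
# Revert both in one command, newest first, so each revert applies cleanly:
git revert --no-edit HEAD HEAD~1
cat file.c   # back to "base"
```

In the actual kernel tree the older commit could not be reverted cleanly on its own, so both were reverted, newest first, as noted in the comment above.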
[Bug 1946149] Re: Bionic/linux-aws Boot failure downgrading from Bionic/linux-aws-5.4 on r5.metal
Hi Mauricio,

Thanks for getting this info. This is very helpful! I see a few potential patches between 4.15.0-159.167 and 4.15.0-160.168 that could be related to the hang. This will help greatly with the bisect.

Ian
[Bug 1946149] Re: Bionic/linux-aws Boot failure downgrading from Bionic/linux-aws-5.4 on r5.metal
Steps to reproduce:
---
Ubuntu 18.04 image in AWS r5.metal instance type.

$ lsb_release -cs
bionic

$ dmesg | grep DMI:
[0.00] DMI: Amazon EC2 r5.metal/Not Specified, BIOS 1.0 10/16/2017

$ uname -rv
5.4.0-1045-aws #47~18.04.1-Ubuntu SMP Tue Apr 13 15:58:14 UTC 2021

$ sudo add-apt-repository ppa:canonical-kernel-team/ppa
$ sudo apt install linux-image-4.15.0-1113-aws
$ sudo grub-reboot 'Advanced options for Ubuntu>Ubuntu, with Linux 4.15.0-1113-aws'
$ sudo grub-editenv /boot/grub/grubenv list
next_entry=Advanced options for Ubuntu>Ubuntu, with Linux 4.15.0-1113-aws
$ sudo reboot

The system should not come back from reboot (which usually takes 10-15 minutes on r5.metal).
[Bug 1946149] Re: Bionic/linux-aws Boot failure downgrading from Bionic/linux-aws-5.4 on r5.metal
** Attachment added: "serial-console-output.txt"
   https://bugs.launchpad.net/ubuntu/+source/linux-aws/+bug/1946149/+attachment/5532619/+files/serial-console-output.txt
[Bug 1946149] Re: Bionic/linux-aws Boot failure downgrading from Bionic/linux-aws-5.4 on r5.metal
We've got a serial console log from AWS Support through our Support team (special thanks to Pedro Principeza and our former colleague Mark Thomas). The problem is definitely not the ext4/jbd2 patchset as suspected (although it's unclear how reverting it caused the kernel to boot; maybe build environment differences?).

Early in the kernel boot, before even trying to mount the rootfs, there are blocked swapper tasks, and they just continue to happen (full log attached).

```
Starting Reboot...
...
[0.00] Kernel command line: BOOT_IMAGE=/boot/vmlinuz-4.15.0-1113-aws root=UUID=db937f23-4ed7-4c4b-8058-b23a860fae08 ro console=tty1 console=ttyS0 nvme_core.io_timeout=4294967295
...
[0.00] gran_size: 64K chunk_size: 256M num_reg: 10 lose cover RAM: 737G
...
[    2.742455] clocksource: Switched to clocksource tsc
[  242.656089] INFO: task swapper/0:1 blocked for more than 120 seconds.
...
[  363.488083] INFO: task swapper/0:1 blocked for more than 120 seconds.
...
[  484.320066] INFO: task swapper/0:1 blocked for more than 120 seconds.
...
[  605.152061] INFO: task swapper/0:1 blocked for more than 120 seconds.
...
[  725.984054] INFO: task swapper/0:1 blocked for more than 120 seconds.
...
[  846.816051] INFO: task swapper/0:1 blocked for more than 120 seconds.
...
[  967.648055] INFO: task swapper/0:1 blocked for more than 120 seconds.
...
[ 1088.480033] INFO: task swapper/0:1 blocked for more than 120 seconds.
...
[ 1209.312036] INFO: task swapper/0:1 blocked for more than 120 seconds.
...
```
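The regular spacing of those warnings can be checked mechanically. A small sketch follows; the file name console.log is an assumption standing in for a saved copy of the attached serial-console-output.txt, with two sample lines from the log above inlined so the snippet is self-contained.

```shell
# Two sample lines stand in for a saved copy of serial-console-output.txt:
cat > console.log <<'EOF'
[  242.656089] INFO: task swapper/0:1 blocked for more than 120 seconds.
[  363.488083] INFO: task swapper/0:1 blocked for more than 120 seconds.
EOF
# Print the interval between successive hung-task warnings; they land
# roughly one hung-task check period (~120 s) apart:
awk -F'[][]' '/blocked for more than 120 seconds/ {
    t = $2 + 0
    if (p) printf "%.2f\n", t - p
    p = t
}' console.log   # prints 120.83
```

Run against the full attachment, this shows the same ~120 s cadence for every warning, consistent with swapper/0 staying blocked from early boot onward.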
[Bug 1946149] Re: Bionic/linux-aws Boot failure downgrading from Bionic/linux-aws-5.4 on r5.metal
Hey Kleber,

Thanks for confirming. I guess there might be something wrong with the boot process on r5.metal, specifically:
- there's no issue with kexec boot, just with normal boot (same code and from/to versions)
- there's no issue with normal boot on similar instance types (r5d.metal, r5.24xlarge)
- rarely, but sometimes, it works with normal boot.

I'm looking forward to hearing back from Ian on whether he could get additional information from AWS, or how to reach out to them, about getting the serial console log from the r5.metal instances internally, via AWS Support, so we know what the actual issue is.
[Bug 1946149] Re: Bionic/linux-aws Boot failure downgrading from Bionic/linux-aws-5.4 on r5.metal
Hi Mauricio,

We are seeing the issue only on r5.metal.
[Bug 1946149] Re: Bionic/linux-aws Boot failure downgrading from Bionic/linux-aws-5.4 on r5.metal
Hi Kleber,

Thanks for the info. Is the impact on bionic/generic also exclusively on AWS r5.metal, or broader?
[Bug 1946149] Re: Bionic/linux-aws Boot failure downgrading from Bionic/linux-aws-5.4 on r5.metal
This issue is also affecting the bionic/linux generic kernel.
[Bug 1946149] Re: Bionic/linux-aws Boot failure downgrading from Bionic/linux-aws-5.4 on r5.metal
Mauricio,

Interesting update. I agree that we need more info as to what the state is when the instance won't boot switching to the new 4.15 kernel. I'll check with my team in the morning and see if we can get additional info from AWS.

I was trying a few more scenarios this evening, the first being the most interesting.

Scenario 1:
- start with 5.4.0-1056-aws
- install 5.4.0-1058-aws
- reboot
- confirm 5.4.0-1058-aws booted
- reboot AGAIN
- install 4.15.0-1113-aws
- reboot
- machine booted 4.15.0-1113-aws successfully

Scenario 2:
- start with 5.4.0-1056-aws
- install 4.15.0-1112-aws
- reboot
- install 4.15.0-1113-aws
- reboot
- confirmed 4.15.0-1113-aws booted
- then booted back into 5.4.0-1056-aws
- removed 4.15.0-1112-aws and 4.15.0-1113-aws
- rebooted again for good measure
- confirmed still running 5.4.0-1056-aws
- installed 4.15.0-1113-aws
- rebooted
- 4.15.0-1113-aws successfully loaded
[Bug 1946149] Re: Bionic/linux-aws Boot failure downgrading from Bionic/linux-aws-5.4 on r5.metal
Ian,

Do you or your team have contacts here or at AWS that could help with that? I think that other lines of investigation now, after our findings and apparent inconsistencies, would be based on speculation, and we're better off trying to get real information/logs from the system with AWS Support.

cheers, Mauricio
[Bug 1946149] Re: Bionic/linux-aws Boot failure downgrading from Bionic/linux-aws-5.4 on r5.metal
Today I wanted to try and instrument the boot process a bit, since we have no serial console on the nitro metal instances. I was looking for pstore_blk (hoping we could use panic_on_warn or panic_on_oops), but it seems to be available only in 5.8+.

So I decided to start with grub: keep a progress variable in grubenv, use grub-reboot to boot 4.15.0-1113-aws _once_ (as it's expected to fail), then (force) stop and start again, and check grubenv from 5.4.0-*-aws (which works).

Interestingly, in one of those attempts 4.15.0-1113-aws WORKED. In another attempt, I could see the progress variable for both the 4.15 _and_ 5.4 kernels, so it seems that grub booted 4.15 but it didn't make it to the fully booted system (i.e., grub seems to be working correctly). In the other attempts I noticed that once we try to boot 4.15, the system seems to become weird and not react quickly even to the 'Force stop' method (after 'Stop' doesn't work).

So, since 4.15 worked/booted once, and the systems seem weird, and Ian just posted that he had a different result/questioned a previous result (i.e., it might well be a _different_ result), I wonder if somehow this particular instance type is acting up. Given that 4.15 worked/booted ~20 times under kexec, it's not unreasonable to consider there might be something going on in normal boot.

I think we should probably engage AWS Support to try and ask for a console log using an internally available method (I've seen it elsewhere, IIRC), and also to clarify differences in the boot disk among instance types r5.metal (fails), r5d.metal (works), and r5d.24xlarge (works) -- they all have EBS/nvme as '/'.
[Bug 1946149] Re: Bionic/linux-aws Boot failure downgrading from Bionic/linux-aws-5.4 on r5.metal
Just want to add an update. I haven't been able to replicate successfully booting 4.15.0-1113-aws from 5.4.0-1058-aws, so I'm questioning whether I made a mistake the time I thought it was successful.
[Bug 1946149] Re: Bionic/linux-aws Boot failure downgrading from Bionic/linux-aws-5.4 on r5.metal
Thanks for the in-depth update Mauricio! Is there any investigation you'd like me to specifically target?
[Bug 1946149] Re: Bionic/linux-aws Boot failure downgrading from Bionic/linux-aws-5.4 on r5.metal
For the record, 4.15.0-1113-aws works in r5.metal w/ kexec. Booted it 10 times successfully from both 5.4.0-1058-aws and 4.15.0-1113-aws (itself.) (not that it was expected to make a difference as the issue happens on normal boot, which doesn't have previous kernel.) Right after that, in the same instance, trying a normal boot fails. And it had kdump installed/enabled (ie, crashkernel in cmdline), w/ which Ian mentioned that he couldn't reproduce the problem. --- It also works on normal boot w/ r5d.metal (note 'd'), which should be the same as r5.metal but w/ four local nvme disks. (still boots from EBS/nvme disk in the same way as r5.metal) --- Similarly, it works on r4.24xlarge (this is not metal) but does boot from EBS/nvme disk too. --- So it seems like there's no problem with the patchset as in 4.15.0-1113-aws as it boots fine in several types w/ approx the same hardware config, just differing on normal/kexec in the r5.metal type (problem report.) - r5.metal: normal boot fails / kexec boot works - r5d.metal: normal boot works. - r5.24xlarge: normal boot works. The kexec boot worked ~20 times, so it wouldn't seem like a race condition is in place, as that should be enough runs, considering it failed every time on normal boot. Also, Ian mentioned that he couldn't reproduce w/ crashdump installed. Well, I think the only difference it would cause _before_ mounting the rootfs (assuming that's what doesn't work/allow machine to boot, as we have no serial console) is the crashkernel reservation? --- So, all this is a bit confusing, but seem to indicate again that there's no problem w/ the patchset per se, but perhaps something in booting this particular kernel on a particular instance type (r5.metal) which _might_ be related to normal/ kexec/crashkernel boot differences. More tomorrow. -- You received this bug notification because you are a member of Ubuntu Bugs, which is subscribed to Ubuntu. 
https://bugs.launchpad.net/bugs/1946149 Title: Bionic/linux-aws Boot failure downgrading from Bionic/linux-aws-5.4 on r5.metal To manage notifications about this bug go to: https://bugs.launchpad.net/ubuntu/+source/linux-aws/+bug/1946149/+subscriptions -- ubuntu-bugs mailing list ubuntu-bugs@lists.ubuntu.com https://lists.ubuntu.com/mailman/listinfo/ubuntu-bugs
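The kexec runs described in the comment above can be driven with something like the following. This is only a sketch: the kernel version string is taken from the report, but the use of `--reuse-cmdline` is an assumption, and the privileged steps are shown as comments because they reboot the machine.

```shell
#!/bin/sh
# Kernel version under test, from the bug report; adjust to the
# installed linux-aws kernel on the instance.
KVER=4.15.0-1113-aws
KERNEL=/boot/vmlinuz-$KVER
INITRD=/boot/initrd.img-$KVER
echo "would kexec into: $KERNEL"
# Privileged steps, run as root on the instance (not executed here):
#   kexec -l "$KERNEL" --initrd="$INITRD" --reuse-cmdline
#   systemctl kexec        # or: kexec -e
```

Note that a kexec boot skips firmware/early platform init, which is one plausible reason it behaves differently from a normal boot on bare metal.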
BTW, do you know the differences between r5.metal and r5.24xlarge? Per the specs they seem to be the same in CPU/RAM/NIC/_NVMe_ storage, but differ in bare metal vs. the Nitro hypervisor? The reason I ask is that downgrading from 5.4.0-1056-aws to 4.15.0-1113-aws worked/booted fine on r5.24xlarge, differently from r5.metal. I got to test it while looking for a similar instance type that had serial console support, but it didn't actually repro the problem.
It looks like it's not a problem with the patchset in general; maybe it's specific to aws 4.15? The patchset is in 5.4.0-1058-aws and it booted fine here too. I'll check the patchset in 4.15.0-1113-aws. A difference from your comment is that I could _not_ boot it after 5.4.0-1058-aws, which worked for you. (r5.metal usually boots in ~15 min? It's been ~20 min already with no response; serial console/screenshots are sadly not available on bare metal.)
** Description changed:

  When creating an r5.metal instance on AWS, the default kernel is
  bionic/linux-aws-5.4(5.4.0-1056-aws), when changing to bionic/linux-
  aws(4.15.0-1113-aws) the machine fails to boot the 4.15 kernel.

  If I remove these patches the instance correctly boots the 4.15 kernel

  https://lists.ubuntu.com/archives/kernel-team/2021-September/123963.html

  With that being said, after successfully updating to the 4.15 without
  those patches applied, I can then upgrade to a 4.15 kernel with the
  above patches included, and the instance will boot properly.

  This problem only appears on metal instances, which uses NVME instead of
  XVDA devices.

  AWS instances also use the 'discard' mount option with ext4, thought
  maybe there could be a race condition between ext4 discard and journal
- flush. Removed 'discard' mount and rebooted 5.4 kernel prior to 4.15
- kernel installation, but still wouldn't boot after installing the 4.15
- kernel.
+ flush. Removed 'discard' from mount options and rebooted 5.4 kernel
+ prior to 4.15 kernel installation, but still wouldn't boot after
+ installing the 4.15 kernel.

  I have been unable to capture a stack trace using 'aws get-console-
  output'. After enabling kdump I was unable to replicate the failure. So
  there must be some sort of race with either ext4 and/or nvme.
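The 'discard' experiment in the description amounts to dropping that option from the root filesystem's entry in /etc/fstab before rebooting. A minimal sketch of the edit; the UUID and the exact fstab line are made up for illustration:

```shell
#!/bin/sh
# Sample ext4 root entry with the 'discard' mount option (UUID is made up);
# the real edit would apply the same substitution to /etc/fstab.
line='UUID=abcd-1234 / ext4 defaults,discard 0 1'
# Strip the ',discard' option, as tried in the report.
fixed=$(printf '%s\n' "$line" | sed 's/,discard//')
echo "$fixed"
```

After editing the real /etc/fstab, the change takes effect on the next mount (i.e., after the reboot into the 5.4 kernel described above).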
Confirmed it does work to first upgrade bionic/linux-aws-5.4 from 5.4.0-1056-aws to 5.4.0-1058-aws and then update to 4.15.0-1113-aws.
Hey Ian, thanks for the bug report! I'm checking this on AWS.
** Description changed:

  When creating an r5.metal instance on AWS, the default kernel is
  bionic/linux-aws-5.4(5.4.0-1056-aws), when changing to bionic/linux-
- aws(4.15.0-1113-aws) the machine fails to boot 4.15 kernel.
+ aws(4.15.0-1113-aws) the machine fails to boot the 4.15 kernel.

  If I remove these patches the instance correctly boots the 4.15 kernel

  https://lists.ubuntu.com/archives/kernel-team/2021-September/123963.html

- But after successfully updating to the 4.15 without those patches
- applied, I can then upgrade to a 4.15 kernel with the above patches
- included, and the instance will boot properly.
+ With that being said, after successfully updating to the 4.15 without
+ those patches applied, I can then upgrade to a 4.15 kernel with the
+ above patches included, and the instance will boot properly.

  This problem only appears on metal instances, which uses NVME instead of
  XVDA devices.

  AWS instances also use the 'discard' mount option with ext4, thought
  maybe there could be a race condition between ext4 discard and journal
  flush. Removed 'discard' mount and rebooted 5.4 kernel prior to 4.15
  kernel installation, but still wouldn't boot after installing the 4.15
  kernel.

  I have been unable to capture a stack trace using 'aws get-console-
  output'. After enabling kdump I was unable to replicate the failure. So
  there must be some sort of race with either ext4 and/or nvme.
** Description changed:

  When creating an r5.metal instance on AWS, the default kernel is
  bionic/linux-aws-5.4(5.4.0-1056-aws), when changing to bionic/linux-
  aws(4.15.0-1113-aws) the machine fails to boot 4.15 kernel.

  If I remove these patches the instance correctly boots the 4.15 kernel

  https://lists.ubuntu.com/archives/kernel-team/2021-September/123963.html

  But after successfully updating to the 4.15 without those patches
  applied, I can then upgrade to a test kernel with the above patches
  included, and the instance will boot properly.

  This problem only appears on metal instances, which uses NVME instead of
  XVDA devices.

  AWS instances also use the 'discard' mount option with ext4, thought
  maybe there could be a race condition between ext4 discard and journal
  flush. Removed 'discard' mount and rebooted 5.4 kernel prior to 4.15
  kernel installation, but still wouldn't boot.
+
+ I have been unable to capture a stack trace using 'aws get-console-
+ output'. I enabled kdump and was unable to replicate the failure. So
+ there must be some sort of race with either ext4 and/or nvme.

** Description changed:

  When creating an r5.metal instance on AWS, the default kernel is
  bionic/linux-aws-5.4(5.4.0-1056-aws), when changing to bionic/linux-
  aws(4.15.0-1113-aws) the machine fails to boot 4.15 kernel.

  If I remove these patches the instance correctly boots the 4.15 kernel

  https://lists.ubuntu.com/archives/kernel-team/2021-September/123963.html

  But after successfully updating to the 4.15 without those patches
- applied, I can then upgrade to a test kernel with the above patches
+ applied, I can then upgrade to a 4.15 kernel with the above patches
  included, and the instance will boot properly.

  This problem only appears on metal instances, which uses NVME instead of
  XVDA devices.

  AWS instances also use the 'discard' mount option with ext4, thought
  maybe there could be a race condition between ext4 discard and journal
  flush. Removed 'discard' mount and rebooted 5.4 kernel prior to 4.15
  kernel installation, but still wouldn't boot.

  I have been unable to capture a stack trace using 'aws get-console-
  output'. I enabled kdump and was unable to replicate the failure. So
  there must be some sort of race with either ext4 and/or nvme.

** Description changed:

  When creating an r5.metal instance on AWS, the default kernel is
  bionic/linux-aws-5.4(5.4.0-1056-aws), when changing to bionic/linux-
  aws(4.15.0-1113-aws) the machine fails to boot 4.15 kernel.

  If I remove these patches the instance correctly boots the 4.15 kernel

  https://lists.ubuntu.com/archives/kernel-team/2021-September/123963.html

  But after successfully updating to the 4.15 without those patches
  applied, I can then upgrade to a 4.15 kernel with the above patches
  included, and the instance will boot properly.

  This problem only appears on metal instances, which uses NVME instead of
  XVDA devices.

  AWS instances also use the 'discard' mount option with ext4, thought
  maybe there could be a race condition between ext4 discard and journal
  flush. Removed 'discard' mount and rebooted 5.4 kernel prior to 4.15
- kernel installation, but still wouldn't boot.
+ kernel installation, but still wouldn't boot after installing the 4.15
+ kernel.

  I have been unable to capture a stack trace using 'aws get-console-
- output'. I enabled kdump and was unable to replicate the failure. So
+ output'. After enabling kdump I was unable to replicate the failure. So
  there must be some sort of race with either ext4 and/or nvme.
Have been unable to capture a stack trace using 'aws get-console-output'. Enabled kdump and was unable to replicate the failed boot, which makes this feel like a race condition with NVMe.
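Since enabling kdump masks the failure, one pre-rootfs difference worth confirming is the crashkernel reservation on the kernel command line. A sketch of the check; the sample cmdline string here is made up, and on a live system you would read /proc/cmdline instead:

```shell
#!/bin/sh
# Made-up command line standing in for /proc/cmdline on a kdump-enabled
# instance; the crashkernel= value is illustrative only.
cmdline='BOOT_IMAGE=/boot/vmlinuz-4.15.0-1113-aws root=UUID=abcd-1234 ro crashkernel=512M-:192M'
case "$cmdline" in
  *crashkernel=*) result='crashkernel reservation present' ;;
  *)              result='no crashkernel reservation' ;;
esac
echo "$result"
```

If the reservation is the only difference, comparing a boot with crashkernel= manually added (but kdump otherwise disabled) could narrow down whether the memory carve-out itself avoids the failure.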
** Description changed:

  When creating an r5.metal instance on AWS, the default kernel is
  bionic/linux-aws-5.4(5.4.0-1056-aws), when changing to bionic/linux-
  aws(4.15.0-1113-aws) the machine fails to boot 4.15 kernel.
+
+ If I remove these patches the instance correctly boots the 4.15 kernel
+
+ https://lists.ubuntu.com/archives/kernel-team/2021-September/123963.html
+
+ But after successfully updating to the 4.15 without those patches
+ applied, I can then upgrade to a test kernel with the above patches
+ included, and the instance will boot properly.
+
+ This problem only appears on metal instances, which uses NVME instead of
+ XVDA devices.
+
+ AWS instances also use the 'discard' mount option with ext4, thought
+ maybe there could be a race condition between ext4 discard and journal
+ flush. Removed 'discard' mount and rebooted 5.4 kernel prior to 4.15
+ kernel installation, but still wouldn't boot.