For the record, 4.15.0-1113-aws works on r5.metal w/ kexec. I booted it 10 times successfully from both 5.4.0-1058-aws and 4.15.0-1113-aws (itself).
(Not that it was expected to make a difference, as the issue happens on normal boot, which doesn't have a previous kernel.) Right after that, on the same instance, a normal boot fails. And it had kdump installed/enabled (i.e., crashkernel in the cmdline), w/ which Ian mentioned that he couldn't reproduce the problem.

---

It also works on normal boot w/ r5d.metal (note the 'd'), which should be the same as r5.metal but w/ four local NVMe disks (it still boots from the EBS/NVMe disk in the same way as r5.metal).

---

Similarly, it works on r5.24xlarge (this is not metal), which also boots from an EBS/NVMe disk.

---

So it seems like there's no problem with the patchset as in 4.15.0-1113-aws, as it boots fine on several instance types w/ approximately the same hardware config, differing only on normal/kexec boot on the r5.metal type (the problem report):

- r5.metal: normal boot fails / kexec boot works
- r5d.metal: normal boot works
- r5.24xlarge: normal boot works

The kexec boot worked ~20 times, so it wouldn't seem like a race condition is in place, as that should be enough runs, considering it failed every time on normal boot.

Also, Ian mentioned that he couldn't reproduce w/ kdump installed. Well, I think the only difference that would cause _before_ mounting the rootfs (assuming that's what doesn't work/allow the machine to boot, as we have no serial console) is the crashkernel reservation?

---

So, all this is a bit confusing, but it seems to indicate again that there's no problem w/ the patchset per se, but perhaps something in booting this particular kernel on a particular instance type (r5.metal) which _might_ be related to normal/kexec/crashkernel boot differences.

More tomorrow.

--
You received this bug notification because you are a member of Kernel Packages, which is subscribed to linux-aws in Ubuntu.
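For reference, the kexec boot test described above can be reproduced roughly as follows. This is a sketch under assumptions: the kernel/initrd paths are the standard Ubuntu locations for the versions mentioned, not taken verbatim from the report, and --reuse-cmdline carries over the running kernel's cmdline (including any crashkernel= reservation).

```shell
# Load the 4.15 kernel for kexec, reusing the current kernel's cmdline.
# Paths below are assumed from the package versions named in this thread.
sudo kexec -l /boot/vmlinuz-4.15.0-1113-aws \
     --initrd=/boot/initrd.img-4.15.0-1113-aws \
     --reuse-cmdline

# Jump into the loaded kernel immediately (no firmware/BIOS pass,
# which is exactly the difference from a normal boot on r5.metal).
sudo kexec -e

# To confirm whether a crashkernel reservation is present on a boot:
grep -o 'crashkernel=[^ ]*' /proc/cmdline
```

Note that `kexec -e` skips the firmware and early platform init that a normal boot goes through, which is why the normal/kexec split on r5.metal is suggestive.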
https://bugs.launchpad.net/bugs/1946149

Title:
  Bionic/linux-aws Boot failure downgrading from Bionic/linux-aws-5.4 on r5.metal

Status in linux-aws package in Ubuntu:
  New

Bug description:
  When creating an r5.metal instance on AWS, the default kernel is bionic/linux-aws-5.4 (5.4.0-1056-aws); when changing to bionic/linux-aws (4.15.0-1113-aws), the machine fails to boot the 4.15 kernel.

  If I remove these patches, the instance correctly boots the 4.15 kernel:
  https://lists.ubuntu.com/archives/kernel-team/2021-September/123963.html

  With that being said, after successfully updating to 4.15 without those patches applied, I can then upgrade to a 4.15 kernel with the above patches included, and the instance will boot properly.

  This problem only appears on metal instances, which use NVMe instead of XVDA devices. AWS instances also use the 'discard' mount option with ext4, so I thought maybe there could be a race condition between ext4 discard and journal flush. I removed 'discard' from the mount options and rebooted the 5.4 kernel prior to the 4.15 kernel installation, but it still wouldn't boot after installing the 4.15 kernel.

  I have been unable to capture a stack trace using 'aws get-console-output'. After enabling kdump I was unable to replicate the failure. So there must be some sort of race with either ext4 and/or nvme.

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/linux-aws/+bug/1946149/+subscriptions

--
Mailing list: https://launchpad.net/~kernel-packages
Post to     : kernel-packages@lists.launchpad.net
Unsubscribe : https://launchpad.net/~kernel-packages
More help   : https://help.launchpad.net/ListHelp
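As an addendum, the 'discard' check mentioned in the bug description can be verified from a running system; a minimal sketch, assuming an ext4 root (ext4 accepts both `discard` and `nodiscard` as mount options):

```shell
# Show the root filesystem's mount options; if 'discard' appears here,
# ext4 issues TRIM as blocks are freed, which is the suspected race.
grep ' / ' /proc/mounts

# To test without it, drop 'discard' from the root entry in /etc/fstab
# and reboot (what the reporter did), or remount in place:
#   sudo mount -o remount,nodiscard /
```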