For the record, 4.15.0-1113-aws works in r5.metal w/ kexec.

Booted it 10 times each, successfully, from both 5.4.0-1058-aws
and 4.15.0-1113-aws (itself).

(Not that the source kernel was expected to make a difference,
as the issue happens on a normal boot, which doesn't involve a
previous kernel.)
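
For reference, a kexec boot like these can be done roughly along
the lines below (a sketch; the paths are assumed from the
installed kernel packages):

  # load the target kernel, reusing the running kernel's cmdline
  # (which carries over the crashkernel= parameter as well)
  sudo kexec -l /boot/vmlinuz-4.15.0-1113-aws \
       --initrd=/boot/initrd.img-4.15.0-1113-aws --reuse-cmdline
  # then jump straight into it, skipping firmware/POST
  sudo kexec -e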

Right after that, on the same instance, a normal boot fails.

And the instance had kdump installed/enabled (i.e., crashkernel=
in the cmdline), which is the configuration Ian mentioned he
couldn't reproduce the problem with.
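
One quick way to confirm that state on the instance (a sketch,
assuming Ubuntu's kdump-tools):

  grep -o 'crashkernel=[^ ]*' /proc/cmdline  # reservation requested on the cmdline
  cat /sys/kernel/kexec_crash_size           # bytes actually reserved for the crash kernel
  kdump-config show                          # kdump-tools' view of the loaded crash kernel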

---

It also works on a normal boot w/ r5d.metal (note the 'd'), which
should be the same as r5.metal but w/ four local NVMe disks.
(It still boots from an EBS/NVMe disk in the same way as r5.metal.)
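
In case it helps, a minimal way to tell the EBS root volume apart
from the local instance-store disks (device names will vary per
instance):

  lsblk -o NAME,MODEL,SIZE,MOUNTPOINT  # EBS volumes report 'Amazon Elastic Block Store'
  findmnt /                            # which nvme namespace backs the rootfs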

---

Similarly, it works on r5.24xlarge (not a metal type), which
also boots from an EBS/NVMe disk.

---

So it seems there's no problem with the patchset as applied in
4.15.0-1113-aws, since it boots fine on several instance types w/
approximately the same hardware config; the only difference is
normal vs. kexec boot on the r5.metal type from the problem report:

- r5.metal: normal boot fails / kexec boot works
- r5d.metal: normal boot works.
- r5.24xlarge: normal boot works.

The kexec boot worked ~20 times, so a race condition doesn't seem
to be in play; that should be enough runs, considering normal boot
failed every single time.

Also, Ian mentioned that he couldn't reproduce it w/ kdump
installed. Well, I think the only difference that would make
_before_ mounting the rootfs (assuming that's the step that
prevents the machine from booting, as we have no serial console)
is the crashkernel memory reservation?
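
On a boot that does come up (e.g. via kexec), the reservation
itself can be inspected roughly like this:

  dmesg | grep -i crashkernel          # where/how much memory got reserved early in boot
  grep -i 'crash kernel' /proc/iomem   # the reserved physical range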

---

So, all this is a bit confusing, but it seems to indicate again
that there's no problem w/ the patchset per se; rather, there is
perhaps something in booting this particular kernel on this
particular instance type (r5.metal) which _might_ be related to
normal/kexec/crashkernel boot differences.

More tomorrow.

https://bugs.launchpad.net/bugs/1946149

Title:
  Bionic/linux-aws Boot failure downgrading from Bionic/linux-aws-5.4 on
  r5.metal

Status in linux-aws package in Ubuntu:
  New

Bug description:
  When creating an r5.metal instance on AWS, the default kernel is
  bionic/linux-aws-5.4 (5.4.0-1056-aws); when changing to bionic/linux-
  aws (4.15.0-1113-aws), the machine fails to boot the 4.15 kernel.

  If I remove these patches, the instance correctly boots the 4.15
  kernel:

  https://lists.ubuntu.com/archives/kernel-team/2021-September/123963.html

  With that being said, after successfully updating to the 4.15 kernel
  without those patches applied, I can then upgrade to a 4.15 kernel with
  the above patches included, and the instance will boot properly.

  This problem only appears on metal instances, which use NVMe instead
  of xvda devices.

  AWS instances also use the 'discard' mount option with ext4, so I
  thought maybe there could be a race condition between ext4 discard and
  journal flush. I removed 'discard' from the mount options and rebooted
  the 5.4 kernel prior to the 4.15 kernel installation, but it still
  wouldn't boot after installing the 4.15 kernel.
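
  For reference, dropping the option would look roughly like this (a
  sketch, to be adjusted to the actual fstab entry), followed by a
  reboot so the change applies from early boot:

    sudo sed -i 's/,discard//' /etc/fstab  # remove discard from the ext4 options
    grep ' / ' /etc/fstab                  # confirm the root entry no longer lists it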

  I have been unable to capture a stack trace using 'aws get-console-
  output'. After enabling kdump I was unable to replicate the failure,
  so there must be some sort of race with ext4 and/or nvme.
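
  For reference, the capture command would be something like the
  following (the instance id is a placeholder):

    aws ec2 get-console-output --instance-id i-0123456789abcdef0 --output text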
