This bug is missing log files that will aid in diagnosing the problem.
From a terminal window please run:
apport-collect 1679208
and then change the status of the bug to 'Confirmed'.
If, due to the nature of the issue you have encountered, you are unable
to run this command, please add a comment stating that fact and change
the bug status to 'Confirmed'.
This change has been made by an automated script, maintained by the
Ubuntu Kernel Team.
** Changed in: linux (Ubuntu)
Status: New => Incomplete
--
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/1679208
Title:
Zesty (4.10.0-14) won't boot on HP ProLiant DL360 Gen9 with
intel_iommu=on
Status in linux package in Ubuntu:
Incomplete
Bug description:
TL;DR
- one of our HP ProLiant DL360 Gen9 fails to boot with intel_iommu=on
- the Disk controller fails
- Xenial seems to work for a while but then fails
- Zesty 100% crashes on boot
- An identical system seems to work, so a HW replacement is needed to finally
confirm
After a failed boot the HW reports this on the next power-up:
Embedded RAID : Smart HBA H240ar Controller - Operation Failed
- 1719-Slot 0 Drive Array - A controller failure event occurred prior
to this power-up. (Previous lock up code = 0x13)
I tried several things (always redeploying Zesty with MAAS in between).
I think my debugging steps might be helpful, so I wanted to keep them
documented in the bug in case you'd go another route or others find useful
information in here.
0. I retried what I did twice, fully reproducible
That is:
0.1 install zesty
0.2 change grub default cmdline in /etc/default/grub.d/50- to add
intel_iommu=on
0.3 sudo update-grub
0.4 reboot
1. I tried a Recovery boot from the boot options in grub.
=> Failed as well
2. Rebooted through iLO via "request reboot" and also via "full system reset"
=> both Failed
3. Reboot the system as deployed by MAAS
# /proc/cmdline before that
BOOT_IMAGE=/boot/vmlinuz-4.10.0-14-generic root=UUID=2137c19a-d441-43fa-82e2-f2b7e3b2727b ro
The orig grub.cfg is like http://paste.ubuntu.com/24305945/
It reboots as-is.
=> Reboot worked
4. Run update-grub without changing anything in /etc
$ sudo update-grub
Generating grub configuration file ...
Warning: Setting GRUB_TIMEOUT to a non-zero value when GRUB_HIDDEN_TIMEOUT
is set is no longer supported.
Found linux image: /boot/vmlinuz-4.10.0-14-generic
Found initrd image: /boot/initrd.img-4.10.0-14-generic
Adding boot menu entry for EFI firmware configuration
done
There was no diff between the new grub.cfg and the one I saved.
=> Reboot worked
5. add the intel_iommu=on arg
$ sudo sed -i \
  's/GRUB_CMDLINE_LINUX_DEFAULT=""/GRUB_CMDLINE_LINUX_DEFAULT="intel_iommu=on"/' \
  /etc/default/grub.d/50-curtin-settings.cfg
$ sudo update-grub
# Diff in grub.cfg really only is the iommu setting
=> Reboot Failed
So this doesn't look so much like a cloud-init/curtin/maas bug to me anymore
- maybe intel_iommu behaves differently?
- Checked grub cfg pre/post - no change but the expected one?
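To double-check that the parameter really reached the kernel and not only
grub.cfg, something like the following could be used on a system that comes
up. This is only a minimal sketch; has_kernel_param is a hypothetical helper
name, not part of any package.

```shell
# Hypothetical helper: succeeds if a kernel command line contains a parameter.
# $1 = parameter to look for (e.g. intel_iommu=on)
# $2 = the command line string to search
has_kernel_param() {
    # Pad with spaces so only whole, space-separated words match.
    case " $2 " in
        *" $1 "*) return 0 ;;
        *) return 1 ;;
    esac
}
```

On a running system this would be called as
has_kernel_param intel_iommu=on "$(cat /proc/cmdline)" && echo "iommu requested"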
6. Install Xenial and do the same
=> Reboot working
7. Upgrade to Z
Since the Xenial system just worked, and one can assume that almost only the
kernel is active so early in the boot process, I upgraded the working system
with intel_iommu=on to Zesty.
That would be 4.4.0-71-generic to 4.10.0-1
On this upgrade I finally saw my I/O errors again :-/
Note: these issues are hard to miss as they cause root to be remounted
read-only.
I wonder if they only ever appear with intel_iommu=on, as this is the only
combo I ever saw them in.
8. Redeploy and upgrade to Z without intel_iommu=on enabled
Then enable intel_iommu=on and reboot
=> Reboot Fail
From here I rebooted into the Xenial kernel (which was still there since this
is an upgrade)
Here I saw:
Loading Linux 4.4.0-71-generic ...
Loading initial ramdisk ...
error: invalid video mode specification `text'.
Booting in blind mode
Hrm, as outlined above the "blind mode" might be a red herring, but since
this kernel worked before it might still be a red herring that swims in the
initrd that got regenerated on the upgrade.
=> Xenial Kernel Reboot - works !!
So "blind mode" is a red herring of some sort.
But this might allow me to find some logs
=> No
It appears that the failing boot never got far enough to actually write
anything.
In syslog I see:
1. the original xenial
2. the upgraded zesty
3. NOT THE zesty+iommu
4. the xenial+iommu
$ egrep 'kernel:.*(Linux version|Command line)' /var/log/syslog
Apr 3 12:15:20 node-horsea kernel: [ 0.000000] Linux version 4.4.0-71-generic (buildd@lcy01-05) (gcc version 5.4.0 20160609 (Ubuntu 5.4.0-6ubuntu1~16.04.4) ) #92-Ubuntu SMP Fri Mar 24 12:59:01 UTC 2017 (Ubuntu 4.4.0-71.92-generic 4.4.49)
Apr 3 12:15:20 node-horsea kernel: [ 0.000000] Command line: BOOT_IMAGE=/boot/vmlinuz-4.4.0-71-generic root=UUID=2137c19a-d441-43fa-82e2-f2b7e3b2727b ro
Apr 3 12:47:45 node-horsea kernel: [ 0.000000] Linux version 4.10.0-14-generic (buildd@lcy01-01) (gcc version 6.3.0 20170221 (Ubuntu 6.3.0-8ubuntu1) ) #16-Ubuntu SMP Fri Mar 17 15:19:26 UTC 2017 (Ubuntu 4.10.0-14.16-generic 4.10.3)
Apr 3 12:47:45 node-horsea kernel: [ 0.000000] Command line: BOOT_IMAGE=/boot/vmlinuz-4.10.0-14-generic root=UUID=2137c19a-d441-43fa-82e2-f2b7e3b2727b ro
Apr 3 13:15:49 node-horsea kernel: [ 0.000000] Linux version 4.4.0-71-generic (buildd@lcy01-05) (gcc version 5.4.0 20160609 (Ubuntu 5.4.0-6ubuntu1~16.04.4) ) #92-Ubuntu SMP Fri Mar 24 12:59:01 UTC 2017 (Ubuntu 4.4.0-71.92-generic 4.4.49)
Apr 3 13:15:49 node-horsea kernel: [ 0.000000] Command line: BOOT_IMAGE=/boot/vmlinuz-4.4.0-71-generic root=UUID=2137c19a-d441-43fa-82e2-f2b7e3b2727b ro intel_iommu=on
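To make it easier to see which kernel/cmdline combos ever got far enough to
log, the two egrep matches can be paired up per boot. A rough sketch follows.
list_boots is a made-up name, and it assumes each syslog entry is on one
line, as it is in the real /var/log/syslog (unlike the wrapped copy in this
mail).

```shell
# Hypothetical helper: reads syslog text on stdin and prints one line per
# boot in the form "<kernel version><TAB><kernel command line>".
list_boots() {
    awk '
        # Remember the version from the most recent "Linux version" entry.
        /kernel:.*Linux version/ { sub(/.*Linux version /, ""); ver = $1 }
        # Emit it together with the command line that follows it.
        /kernel:.*Command line:/ { sub(/.*Command line: /, ""); print ver "\t" $0 }
    '
}
```

Usage would be: list_boots < /var/log/syslog
A boot that never shows up in the output (like zesty+iommu here) did not get
to the point of writing to syslog.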
9. Trying to avoid HW replacement if not needed
I was afraid I might need the HW to be replaced to be 100% sure, but this
very much smells broken in SW to me already.
To avoid an RT ticket for a replacement without real need, I asked for another
system to be freed up.
So I finally got an identical machine.
In particular I compared the HP Smart Array that fails: it has the same
Product Version and FW revision on both machines.
There things seem to work, so I might be down to replacing the HW :-/
10. Get some messages from the failure:
With the following grub cmdline I got to see the failure:
GRUB_CMDLINE_LINUX_DEFAULT="intel_iommu=on --- console=ttyS1,115200"
It looks just like the failure I found on the running system with the Xenial
kernel when intel_iommu=on is set. There it happens later (sometimes after
minutes, sometimes after days, but never without intel_iommu).
On Zesty it seems to trigger 100% on boot, so the system does not even come up.
I'll attach a few logs of the crashes, but the first lines are
[ 33.426069] hpsa 0000:03:00.0: Acknowledging event: 0x80000000 (HP SSD Smart Path configuration change)
[ 618.567636] DMAR: DRHD: handling fault status reg 2
[ 618.567922] DMAR: DMAR:[DMA Read] Request device [03:00.0] fault addr ffafc000
DMAR:[fault reason 06] PTE Read access is not set
Or
[ 159.779566] hpsa 0000:03:00.0: Command timed out.
[ 159.801113] hpsa 0000:03:00.0: hpsa_send_abort_ioaccel2: Tag:0x00000000:000000d0: unknown abort service response 0x00
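For comparing the attached crash logs it may help to reduce the DMAR faults
to device, access type and fault address. The following is only a sketch.
dmar_faults is a hypothetical name, and the pattern is derived from the one
fault line quoted above (unwrapped, as dmesg prints it), so other kernels may
format this differently.

```shell
# Hypothetical helper: reads kernel log text on stdin and prints
# "<device> <access type> <fault address>" for each DMAR fault line.
dmar_faults() {
    sed -n 's/.*DMAR:\[\(DMA [A-Za-z]*\)\] Request device \[\([0-9a-f:.]*\)\] fault addr \([0-9a-f]*\).*/\2 \1 \3/p'
}
```

For example: dmesg | dmar_faults | sort | uniq -c
In the log above every fault would be expected to point at 03:00.0, the same
PCI address the hpsa driver reports for the controller.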
While it might be a HW issue, I am still filing this so it is "findable" for
anyone else in case it turns out not to be HW after all.
For now I am assigning it to myself, to close/confirm once the HW has been
replaced.
To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1679208/+subscriptions