Public bug reported:

TL;DR
- One of our HP ProLiant DL360 Gen9 systems fails to boot with intel_iommu=on
- The disk controller fails
- Xenial seems to work for a while but then fails
- Zesty crashes on boot 100% of the time
- An identical system seems to work, so a HW replacement is needed to confirm

After the reboot the HW reports this on boot:
Embedded RAID : Smart HBA H240ar Controller - Operation Failed
 - 1719-Slot 0 Drive Array  - A controller failure event occurred prior
   to this power-up. (Previous lock up code = 0x13)


I tried several things (always redeploying Zesty with MAAS in between).
I think my debugging notes might be helpful, so I am keeping them in the bug
in case you'd go another route or others find useful information in here.

0. I retried what I did twice, fully reproducible
   That is:
   0.1 install zesty 
   0.2 change grub default cmdline in /etc/default/grub.d/50- to add
       intel_iommu=on
   0.3 sudo update-grub
   0.4 reboot
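
On a boot that does come up (the Xenial one, for example), it can help to confirm that the argument really reached the kernel. A minimal sketch, assuming the usual space-separated /proc/cmdline format; the function name cmdline_has_arg is made up for illustration:

```shell
#!/bin/sh
# Hypothetical helper: succeed if the kernel command line contains the
# given argument as a whole word. Reads /proc/cmdline unless a file is given.
cmdline_has_arg() {
    arg=$1
    file=${2:-/proc/cmdline}
    # Put every argument on its own line, then match one line exactly.
    tr -s ' ' '\n' < "$file" | grep -qxF -- "$arg"
}

# Usage:
#   cmdline_has_arg intel_iommu=on && echo "booted with intel_iommu=on"
```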


1. I tried a Recovery boot from the boot options in grub.
   => Failed as well


2. Rebooted via iLO using "request reboot" as well as "full system reset"
   => both Failed


3. Reboot the system as deployed by MAAS
   # /proc/cmdline before that
   BOOT_IMAGE=/boot/vmlinuz-4.10.0-14-generic root=UUID=2137c19a-d441-43fa-82e2-f2b7e3b2727b ro
   The orig grub.cfg is like http://paste.ubuntu.com/24305945/
   It reboots as-is.
   => Reboot worked


4. Run update-grub without changing anything in /etc
   $ sudo update-grub
   Generating grub configuration file ...
   Warning: Setting GRUB_TIMEOUT to a non-zero value when GRUB_HIDDEN_TIMEOUT is set is no longer supported.
   Found linux image: /boot/vmlinuz-4.10.0-14-generic
   Found initrd image: /boot/initrd.img-4.10.0-14-generic
   Adding boot menu entry for EFI firmware configuration
   done

   There was no diff between the new grub.cfg and the one I saved.
   => Reboot worked


5. add the intel_iommu=on arg
  $ sudo sed -i \
      's/GRUB_CMDLINE_LINUX_DEFAULT=""/GRUB_CMDLINE_LINUX_DEFAULT="intel_iommu=on"/' \
      /etc/default/grub.d/50-curtin-settings.cfg
  $ sudo update-grub
  # The diff in grub.cfg really is only the iommu setting
  => Reboot Failed
  So this doesn't look much like a cloud-init/curtin/MAAS bug to me anymore;
  maybe intel_iommu behaves differently?
  - Check grub.cfg pre/post: no change other than the expected one?
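
The pre/post comparison of grub.cfg can be sketched roughly as follows. This is an illustration only: the function name only_arg_differs and the saved-copy path in the usage note are assumptions, not part of the original procedure.

```shell
#!/bin/sh
# Hypothetical helper: succeed only if two grub.cfg files are identical
# once the given kernel argument is removed from both (whitespace-only
# differences are ignored). The argument is used as a sed pattern, so it
# should not contain regex metacharacters or slashes.
only_arg_differs() {
    old_cfg=$1
    new_cfg=$2
    arg=$3
    old_tmp=$(mktemp) || return 2
    new_tmp=$(mktemp) || return 2
    sed "s/$arg//g" "$old_cfg" > "$old_tmp"
    sed "s/$arg//g" "$new_cfg" > "$new_tmp"
    # -b ignores changes in the amount of whitespace, including trailing.
    diff -b "$old_tmp" "$new_tmp" > /dev/null
    rc=$?
    rm -f "$old_tmp" "$new_tmp"
    return $rc
}

# Usage (paths are examples):
#   cp /boot/grub/grub.cfg /tmp/grub.cfg.saved    # before update-grub
#   only_arg_differs /tmp/grub.cfg.saved /boot/grub/grub.cfg intel_iommu=on \
#       && echo "only the iommu setting changed"
```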


6. Install Xenial and do the same
   => Reboot working


7. Upgrade to Zesty
   Since the Xenial system just worked, and one can assume that almost only the
   kernel is active so early in the boot process, I upgraded the working system
   with intel_iommu=on to Zesty.
   That would be 4.4.0-71-generic to 4.10.0-1
   On this upgrade I finally saw my I/O errors again :-/
   Note: these issues are hard to miss as they leave root mounted read-only.
   I wonder if they only ever appear with intel_iommu=on, as this is the only
   combination in which I ever saw them.


8. Redeploy and upgrade to Z without intel_iommu=on enabled
   Then enable intel_iommu=on and reboot
   => Reboot Fail
   From here I rebooted into the Xenial kernel (which was still there, since
   this is an upgrade)
   Here I saw:
    Loading Linux 4.4.0-71-generic ...
    Loading initial ramdisk ...
    error: invalid video mode specification `text'.
    Booting in blind mode
   Hrm, the "blind mode" message might be a red herring, but since this kernel
   worked before, the cause could still sit in the initrd that got regenerated
   on the upgrade.
   => Xenial Kernel Reboot - works !!
   So "blind mode" is a red herring of some sort.
   
   But this might allow me to find some logs
   => No
   It appears the failing boot never got far enough to actually write anything.
   In the logs I see:
    1. the original xenial
    2. the upgraded zesty
    3. NOT THE zesty+iommu
    4. the xenial+iommu

$ egrep 'kernel:.*(Linux version|Command line)' /var/log/syslog 
Apr  3 12:15:20 node-horsea kernel: [    0.000000] Linux version 4.4.0-71-generic (buildd@lcy01-05) (gcc version 5.4.0 20160609 (Ubuntu 5.4.0-6ubuntu1~16.04.4) ) #92-Ubuntu SMP Fri Mar 24 12:59:01 UTC 2017 (Ubuntu 4.4.0-71.92-generic 4.4.49)
Apr  3 12:15:20 node-horsea kernel: [    0.000000] Command line: BOOT_IMAGE=/boot/vmlinuz-4.4.0-71-generic root=UUID=2137c19a-d441-43fa-82e2-f2b7e3b2727b ro
Apr  3 12:47:45 node-horsea kernel: [    0.000000] Linux version 4.10.0-14-generic (buildd@lcy01-01) (gcc version 6.3.0 20170221 (Ubuntu 6.3.0-8ubuntu1) ) #16-Ubuntu SMP Fri Mar 17 15:19:26 UTC 2017 (Ubuntu 4.10.0-14.16-generic 4.10.3)
Apr  3 12:47:45 node-horsea kernel: [    0.000000] Command line: BOOT_IMAGE=/boot/vmlinuz-4.10.0-14-generic root=UUID=2137c19a-d441-43fa-82e2-f2b7e3b2727b ro
Apr  3 13:15:49 node-horsea kernel: [    0.000000] Linux version 4.4.0-71-generic (buildd@lcy01-05) (gcc version 5.4.0 20160609 (Ubuntu 5.4.0-6ubuntu1~16.04.4) ) #92-Ubuntu SMP Fri Mar 24 12:59:01 UTC 2017 (Ubuntu 4.4.0-71.92-generic 4.4.49)
Apr  3 13:15:49 node-horsea kernel: [    0.000000] Command line: BOOT_IMAGE=/boot/vmlinuz-4.4.0-71-generic root=UUID=2137c19a-d441-43fa-82e2-f2b7e3b2727b ro intel_iommu=on
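
To see at a glance which boots made it into the log and whether each one had intel_iommu=on, the "Command line" entries can be condensed. A rough sketch, assuming one unwrapped syslog entry per line in the usual "Mon DD HH:MM:SS host kernel: ..." layout; the function name list_boots is made up for illustration:

```shell
#!/bin/sh
# Hypothetical helper: print one line per logged boot with its timestamp,
# the booted kernel image and whether intel_iommu=on was on the cmdline.
list_boots() {
    grep -E 'kernel:.*Command line:' "$1" | awk '{
        image = "unknown"; iommu = "no"
        for (i = 1; i <= NF; i++) {
            # "BOOT_IMAGE=" is 11 characters, the path starts at 12.
            if ($i ~ /^BOOT_IMAGE=/) image = substr($i, 12)
            if ($i == "intel_iommu=on") iommu = "yes"
        }
        print $1, $2, $3, image, "intel_iommu=" iommu
    }'
}

# Usage:
#   list_boots /var/log/syslog
```

A boot that never got far enough to write to syslog, like the failing Zesty one here, simply does not show up in the output.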


9. Trying to avoid HW replacement if not needed
I was afraid I might need the HW to be replaced to be 100% sure, but this very
much smells broken in SW to me already.
To avoid an RT ticket for a replacement without real need, I asked for another
system to be freed up.

So I finally could free up an identical machine.
In particular I checked the HP Smart Array that is failing; it has the same
Product Version and FW revision.

On that machine things seem to work, so I might be down to replacing the HW :-/


10. Get some messages of the failure:
With the following grub cmdline I got to see the failure:
GRUB_CMDLINE_LINUX_DEFAULT="intel_iommu=on --- console=ttyS1,115200"

It looks just like the failure I found on the running system when
intel_iommu=on is set on the Xenial kernel, where it happens later (sometimes
after minutes, sometimes after days, but never without intel_iommu).
But on Zesty it seems to trigger 100% of the time on boot, so the system does
not even come up.

I'll attach a few logs of the crashes, but the key messages are
[   33.426069] hpsa 0000:03:00.0: Acknowledging event: 0x80000000 (HP SSD Smart Path configuration change)
[  618.567636] DMAR: DRHD: handling fault status reg 2
[  618.567922] DMAR: DMAR:[DMA Read] Request device [03:00.0] fault addr ffafc000 
               DMAR:[fault reason 06] PTE Read access is not set

Or
[  159.779566] hpsa 0000:03:00.0: Command timed out.
[  159.801113] hpsa 0000:03:00.0: hpsa_send_abort_ioaccel2: Tag:0x00000000:000000d0: unknown abort service response 0x00
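
To tell the two failure signatures apart across several captured console logs, something like the following could be used. It is only a sketch: the function name summarize_hpsa_faults is made up, and the patterns are taken from the messages quoted above, so other kernel versions may word them differently.

```shell
#!/bin/sh
# Hypothetical helper: count DMAR fault reasons and hpsa command timeouts
# in a captured console or kernel log.
summarize_hpsa_faults() {
    log=$1
    # grep -c prints 0 (and exits non-zero) when nothing matches.
    dmar=$(grep -c 'DMAR:.*fault reason' "$log")
    timeouts=$(grep -c 'hpsa .*Command timed out' "$log")
    echo "dmar_faults=$dmar hpsa_timeouts=$timeouts"
}

# Usage (file name is an example):
#   summarize_hpsa_faults console-ttyS1.log
```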


While it might be a HW issue, I am still filing this so that it is "findable"
for anyone else in case it turns out not to be HW after all.
But I am assigning it to myself for now, to close or confirm once I have
replaced the HW.

** Affects: linux (Ubuntu)
     Importance: Undecided
     Assignee: ChristianEhrhardt (paelzer)
         Status: New

** Changed in: linux (Ubuntu)
     Assignee: (unassigned) => ChristianEhrhardt (paelzer)

-- 
You received this bug notification because you are a member of Ubuntu
Bugs, which is subscribed to Ubuntu.
https://bugs.launchpad.net/bugs/1679208

Title:
  Zesty (4.10.0-14) won't boot on HP ProLiant DL360 Gen9 with
  intel_iommu=on

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1679208/+subscriptions

-- 
ubuntu-bugs mailing list
ubuntu-bugs@lists.ubuntu.com
https://lists.ubuntu.com/mailman/listinfo/ubuntu-bugs
