Public bug reported:

We have a user reporting the following issue:

After an upgrade, grub couldn't boot any kernel. The system is not
running in UEFI mode, so "grub-pc" is the package used - also it is a HW
RAID5 setup (Dell machine). The bootloader itself was able to get
loaded, including all its base modules (hence the bootloader could
read/write from disk) - also, grub packages were up-to-date and seemed
properly installed. The following kernels were present/installed there:
4.4.0-148, 4.4.0-189, 4.4.0-190, 4.4.0-193, 4.4.0-194.

Attempting to boot the most recent version (-194), we got the following
grub error: "error: attempt read-write outside of disk `hd0`". Even
after dropping to the grub shell and manually trying to load the file
vmlinuz-4.4.0-194-generic (which grub's "ext4 module" could see and
access), we got the same error. When we tried to boot *all* the other
kernels, we managed to load the vmlinuz image for each of them, but not
the initrd - in this case we still get the message "attempt read-write
outside of disk", but grub allows the boot to continue and, as expected,
Linux fails due to the lack of the ramdisk image.

After booting from a virtual ISO (Ubuntu installer), we managed to run
"update-grub", "update-initramfs" and "grub-install", not forgetting to
"sync" after all these commands. Beforehand, we had duplicated all
initrds, saving each one as initrd.img-<kernel_version>.bk for backup.
Even after all that, the exact same symptom was observed in grub. A
complete manual test of all vmlinuz/initrd pairs from the grub shell
then showed that the pair vmlinuz-4.4.0-148/initrd-4.4.0-148.bk was
readable from grub, so we could load and boot it. The boot did not
complete, though (we later found that this kernel is not properly
installed: it is missing a bunch of modules in /lib/modules, like the
tg3 network driver). Even with this impaired kernel, we took the
following actions from the initramfs shell, which brought the system
back to a bootable state:
(A) We apt-get removed kernels -189 and -190 (and their initrd backups)
(B) We moved all the remaining vmlinuz/initrd pairs (and their backups) to "/"
(C) We *copied* all of them back to /boot, with the goal of duplicating the 
files in the filesystem
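
Steps (B) and (C) can be sketched in shell as below. This is only an
illustration: the function name duplicate_boot_files and its "prefix"
argument are hypothetical, added so the logic can be tried on a scratch
directory (on the affected machine the prefix would be empty). It
assumes "/" and "/boot" are on the same filesystem, in which case the
move is a plain rename and the file in "/" keeps its original data
blocks, while the copy in "/boot" gets newly allocated ones.

```shell
#!/bin/sh
# Sketch of steps (B) and (C). "prefix" is a hypothetical argument so the
# function can be exercised on a scratch tree instead of the real root.
duplicate_boot_files() {
    prefix="$1"
    for f in "$prefix"/boot/vmlinuz-* "$prefix"/boot/initrd.img-*; do
        [ -e "$f" ] || continue        # skip patterns that matched nothing
        name=$(basename "$f")
        # (B) move the original to "/": on the same filesystem this is a
        #     rename, so the file keeps its existing data blocks
        mv "$f" "$prefix/$name"
        # (C) copy it back: the new file in /boot gets fresh blocks
        cp -p "$prefix/$name" "$prefix/boot/$name"
    done
}
```

Trying it on a throwaway directory first (for example one created with
"mktemp -d" and a "boot" subdirectory) avoids touching the real /boot.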

We double-checked the md5 hashes of all the vmlinuz/initrd pairs and
they matched, so the *same files* are present in "/" and "/boot". We
also checked vmlinuz-4.4.0-{193,194} md5 hashes against the package
version, and they matched, so the images are good/healthy. After all of
that (we re-executed "update-grub" and "sync"), we got repeated/doubled
entries in grub: one entry for the vmlinuz/initrd pair in "/boot" and
one for the pair in "/" (the original files). The original files *still
cannot boot*; grub complains with the same error message. The duplicate
files in /boot boot normally: we tried kernels -194 and -193 twice, and
both booted.
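
The comparison between the two sets of files can be reproduced with
something like the sketch below. The function name compare_boot_copies
is made up for illustration; it only relies on md5sum, cut and basename,
and prints one line per file.

```shell
#!/bin/sh
# Compare every vmlinuz/initrd file in <dir_a> with the file of the same
# name in <dir_b>. Prints "<name> same", "<name> DIFFERENT" or
# "<name> missing" for each file found in <dir_a>.
compare_boot_copies() {
    dir_a="$1"
    dir_b="$2"
    for f in "$dir_a"/vmlinuz-* "$dir_a"/initrd.img-*; do
        [ -f "$f" ] || continue        # skip patterns that matched nothing
        name=$(basename "$f")
        if [ ! -f "$dir_b/$name" ]; then
            echo "$name missing"
            continue
        fi
        sum_a=$(md5sum < "$f" | cut -d' ' -f1)
        sum_b=$(md5sum < "$dir_b/$name" | cut -d' ' -f1)
        if [ "$sum_a" = "$sum_b" ]; then
            echo "$name same"
        else
            echo "$name DIFFERENT"
        fi
    done
}
```

On the system described here it would be called as
"compare_boot_copies / /boot".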

So the (very odd and interesting) problem is: grub can read some files
but not others, even though we know that *all the duplicate files are
the same* and have proven integrity (i.e., the filesystem and the
storage controller/disks seem to be healthy). Why? Very similar problems
were reported in [0] and [1], with no really good/definitive answer.


HYPOTHESIS:

I think this has to do with the fact that grub *cannot* read some
sectors of the underlying disks, not because of disk corruption but
because of logical sector accounting/math. Since it's a hardware RAID, I
understand that from the Linux perspective it is "seen" as a single
device. From grub's perspective it is also a single disk (called 'hd0'
in grub terminology). But maybe grub is doing some low-level queries to
gather physical device information on the underlying disks, and when it
does the sector math, it concludes that the "section" to be read is
outside of the "available" area of the device, giving us this error.
The mentions of "BIOS restrictions" in [0] and [1] could also be
considered: the BIOS or even grub could be unable to deal with files
outside some "range" of the disk, e.g. for security reasons - although I
doubt that; I lean toward the first theory.
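
The first theory boils down to a bounds check of roughly the following
shape. This is a sketch of the arithmetic only, not grub's actual code;
the function name read_fits_on_disk is hypothetical. If the total
sector count that grub obtains for hd0 were smaller than the real size
of the RAID volume, any file whose blocks lie beyond that count would
fail such a check even though the data on disk is intact.

```shell
#!/bin/sh
# Succeeds (exit status 0) if reading <count> sectors starting at sector
# <start> stays inside a disk of <total> sectors, and fails otherwise.
# Illustration of the hypothesis only; not taken from grub's sources.
read_fits_on_disk() {
    start="$1"
    count="$2"
    total="$3"
    # The start must be on the disk, and the request must not run past
    # the last sector. Subtracting first avoids overflowing start+count.
    [ "$start" -lt "$total" ] && [ "$count" -le $((total - start)) ]
}
```

For example, with a made-up total of 100 sectors,
"read_fits_on_disk 92 8 100" succeeds while "read_fits_on_disk 93 8 100"
fails, because the last requested sector would be number 100 on a disk
whose sectors are numbered 0 to 99.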

In both theories, it ends up being a restriction on loading a file
*depending* on its logical position on the disk. If that is true, it's a
very awkward limitation. We suggested that the user collect the
following data, to understand the topology of the disk and the logical
position (LBA) of the files:

debugfs -R "stat /boot/vmlinuz-4.4.0-194-generic" /dev/sda2 > debugfs-vmlinuz194-b.out
hdparm --fibmap /boot/vmlinuz-4.4.0-194-generic > hdparm-vmlinuz194-b.out

debugfs -R "stat /vmlinuz-4.4.0-194-generic" /dev/sda2 > debugfs-vmlinuz194-r.out
hdparm --fibmap /vmlinuz-4.4.0-194-generic > hdparm-vmlinuz194-r.out
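
Once the hdparm outputs are available, the highest LBA touched by each
file can be extracted with a small helper like the one below. It is a
sketch that assumes the usual four-column extent table printed by
"hdparm --fibmap" (byte_offset, begin_LBA, end_LBA, sectors); the
function name max_end_lba is made up for illustration.

```shell
#!/bin/sh
# Print the largest end_LBA found in "hdparm --fibmap" output, read from
# the given file(s) or from stdin. Assumes extent rows consist of four
# numeric columns: byte_offset begin_LBA end_LBA sectors.
max_end_lba() {
    awk '
        NF == 4 && $1 ~ /^[0-9]+$/ && $3 ~ /^[0-9]+$/ {
            if ($3 + 0 > max) { max = $3 + 0; text = $3 }
        }
        # Print the original field text to avoid number formatting issues
        END { print (text == "" ? 0 : text) }
    ' "$@"
}
```

Running "max_end_lba hdparm-vmlinuz194-b.out" and
"max_end_lba hdparm-vmlinuz194-r.out" and comparing the two numbers
would show whether the file that fails to load sits at a higher LBA
than the copy that loads fine.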


[0] https://askubuntu.com/q/867047
[1] https://askubuntu.com/q/416418

** Affects: grub2 (Ubuntu)
     Importance: Medium
     Assignee: Guilherme G. Piccoli (gpiccoli)
         Status: Confirmed


** Tags: sts

https://bugs.launchpad.net/bugs/1918948

Title:
  Issue in Extended Disk Data retrieval (biosdisk: int 13h/service 48h)
