> On 23. Apr 2021, at 01:57, Reginald Beardsley via openindiana-discuss 
> <openindiana-discuss@openindiana.org> wrote:
> 
> 
> What do those mean? I have seen them numerous times even though the system 
> booted.
> 
> 
> 
> On Thursday, April 22, 2021, 05:46:30 PM CDT, Nelson H. F. Beebe 
> <be...@math.utah.edu> wrote:
> 
> 
> ZFS: i/o error - all block copies unavailable


An I/O error means just what it says: we encountered an error while doing I/O, in this case while attempting to read the disk. The I/O error may come from BIOS INT13:

usr/src/boot/sys/boot/i386/libi386/biosdisk.c:

the message template in function bd_io() is:

printf("%s%d: Read %d sector(s) from %lld to %p “ … );

You should see the disk name as disk0: etc.

Or, in the UEFI case, in usr/src/boot/sys/boot/efi/libefi/efipart.c:

the message template in function efipart_readwrite() is:

printf("%s: rw=%d, blk=%ju size=%ju status=%lu\n” …)


With the ZFS reader, we can also get an I/O error because of a failing checksum check. That is, we issued a read command to the disk and did not receive an error from that read command, but the data checksum does not match.
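
Schematically, it looks like this (a sketch only, not the actual loader ZFS code; byte_sum() is a toy stand-in for the real fletcher/SHA-256 checksums ZFS uses):

/*
 * Sketch: a disk read that succeeds can still surface as EIO in the
 * ZFS reader when the checksum does not match.
 */
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

static uint64_t
byte_sum(const uint8_t *buf, size_t size)
{
    uint64_t sum = 0;

    while (size-- > 0)
        sum += *buf++;
    return (sum);
}

static int
verify_block(const uint8_t *buf, size_t size, uint64_t expected)
{
    /* The disk-level read already returned success before this point. */
    if (byte_sum(buf, size) != expected)
        return (EIO);   /* data arrived, but not the data we expect */
    return (0);
}

int
main(void)
{
    uint8_t buf[4] = { 1, 2, 3, 4 };    /* pretend on-disk data */

    /* Stored checksum says 11, data sums to 10: report EIO. */
    return (verify_block(buf, sizeof (buf), 11) == EIO ? 0 : 1);
}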

A checksum error may happen in these scenarios:

1. The data actually is corrupt (bitrot, partial write, …).
2. The disk read did not return an error, but it read the wrong data or did not 
actually return any data. This is often the case when we hit the 2TB barrier and 
the BIOS INT13 implementation is buggy.

We see this with large disks: the setup used to boot fine, but as the rpool 
fills up, at some point (after a pkg update) the vital boot files (kernel or 
boot_archive) get written past the 2TB "line". The next boot then attempts to 
read the kernel or boot_archive and gets an error (the arithmetic behind that 
line is sketched after this list).

3. A bug in the loader's ZFS reader code. If for some reason the zfs reader 
code instructs the disk layer to read the wrong blocks, the disk I/O is most 
likely OK, but the logical data is not. I cannot exclude this cause, but it is 
a very unlikely case.
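
To put a number on the 2TB line: a common INT13 bug is truncating the 64-bit LBA from the disk address packet to 32 bits, and 2^32 sectors of 512 bytes is exactly 2 TiB. A quick sketch of the arithmetic:

/*
 * Sketch: where the 2TB "line" comes from.  If firmware truncates the
 * LBA to 32 bits, only the first 2^32 sectors (2 TiB with 512-byte
 * sectors) are reachable; reads beyond that silently go to the wrong
 * place or return nothing.
 */
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

int
main(void)
{
    uint64_t sector_size = 512;             /* bytes per sector */
    uint64_t max_sectors = 1ULL << 32;      /* 32-bit LBA limit */
    uint64_t limit = max_sectors * sector_size;

    /* 4294967296 sectors * 512 bytes = 2199023255552 bytes = 2 TiB */
    printf("addressable: %" PRIu64 " bytes (%" PRIu64 " GiB)\n",
        limit, limit >> 30);
    return (0);
}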

When we do get an I/O error and the pool has redundancy (mirror, raidz, or zfs 
set copies=..), we attempt to read an alternate copy of the file system block; 
if all reads fail, we get the second half of this message (all block copies 
unavailable).
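
The retry looks roughly like this (a sketch; the real code walks the block pointer's DVAs, and try_copy() here is a hypothetical read-and-verify helper):

#include <errno.h>
#include <stdio.h>

#define MAX_COPIES  3   /* a block pointer carries up to 3 DVAs */

static int
try_copy(int copy)
{
    (void)copy;
    return (EIO);       /* simulate every copy failing */
}

static int
read_block(void)
{
    for (int copy = 0; copy < MAX_COPIES; copy++) {
        if (try_copy(copy) == 0)
            return (0);     /* first good copy wins */
    }
    printf("ZFS: i/o error - all block copies unavailable\n");
    return (EIO);
}

int
main(void)
{
    return (read_block() == 0 ? 0 : 1);
}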

Unfortunately, the current ZFS reader code reports a generic EIO from its 
internal stack, and therefore this error message is a very generic one. I do 
plan to fix this along with the decryption code addition.


> ZFS: can't read MOS of pool rpool
> 

This means we got I/O errors while opening the pool for reading: we found the 
pool label (there are 4 copies of the pool label on disk), but we got an error 
while attempting to read the MOS (which is sort of like the superblock in UFS). 
Without the MOS we cannot continue reading this pool.
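
For orientation, the four label copies sit at fixed offsets in the standard ZFS on-disk layout: each label is 256 KiB, two at the front of the device and two at the end. A sketch of those offsets:

#include <stdint.h>
#include <stdio.h>

#define VDEV_LABEL_SIZE (256 * 1024ULL) /* each label copy is 256 KiB */

int
main(void)
{
    uint64_t devsize = 100 * 1024 * 1024ULL;    /* example: 100 MiB vdev */
    uint64_t off[4] = {
        0,                              /* label 0, front */
        VDEV_LABEL_SIZE,                /* label 1, front */
        devsize - 2 * VDEV_LABEL_SIZE,  /* label 2, end */
        devsize - VDEV_LABEL_SIZE,      /* label 3, end */
    };

    for (int i = 0; i < 4; i++)
        printf("label %d at offset %llu\n", i,
            (unsigned long long)off[i]);
    return (0);
}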

If this is a virtualized system, the error appears after a reboot, and the disk 
is not over 2TB, one possible reason is the VM manager failing to write down 
virtual disk blocks. This is the case when the VM manager does not implement 
the cache flush for disks and, for example, crashes. I have seen this myself 
with VMware Fusion.

rgds,
toomas
