On 5/15/26 18:52, Vjaceslavs Klimovs wrote:
> Summary
> -------
> On v6.18, starting a libvirt/QEMU guest with virtio-blk backed by an
> LVM "--type raid1" LV (drivers/md/dm-raid.c stacked on
> drivers/md/raid1.c) makes md/raid1 register read failures at LV
> sector 0 within seconds of "virsh start" and mark rimage_0 Faulty
> once max_corrected_read_errors (default 20) is exceeded. Reads
> succeed via the redirect path so guests boot, but every guest disk
> ends up degraded on every VM start. Same workload on legacy
> "--type mirror" (drivers/md/dm-raid1.c) crashes the host: a
> zero-length READ reaches the NVMe controller, is rejected with
> "Invalid Field in Command", and the dm-mirror recovery path oopses.

That sounds somewhat like
https://lore.kernel.org/all/2982107.4sosBPzcNG@electra/

Have you tried the latest 7.1-rc? It contains a fix for the problem
mentioned in that thread: f7b24c7b41f23b ("md/raid1,raid10: don't fail
devices for invalid IO errors") [v7.1-rc2]

Ciao, Thorsten

> Symptom on dm-raid raid1 (post --type raid1)
> --------------------------------------------
> Per LV, at virsh start, in host dmesg:
> 
>   kernel: raid1_end_read_request: 95 callbacks suppressed
>   kernel: raid1_read_request: 95 callbacks suppressed
>   kernel: md/raid1:mdX: dm-58: rescheduling sector 0
>   kernel: md/raid1:mdX: redirecting sector 0 to other mirror: dm-58
>   kernel: md/raid1:mdX: dm-58: rescheduling sector 0
>   kernel: md/raid1:mdX: redirecting sector 0 to other mirror: dm-58
>   [... 10 rescheduling/redirecting pairs ...]
>   kernel: md/raid1:mdX: dm-58: Raid device exceeded read_error threshold [cur 21:max 20]
>   kernel: md/raid1:mdX: dm-58: Failing raid device
>   kernel: md/raid1:mdX: Disk failure on dm-58, disabling device.
>   kernel: md/raid1:mdX: Operation continuing on 1 devices.
> 
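
The "[cur 21:max 20]" line in the log above can be read as a simple
counter-versus-limit check. Below is a minimal userspace sketch of that
reading, not the md code itself: struct leg_state and note_read_error()
are hypothetical names, the strict "greater than" comparison is inferred
from the log line, and any time-based decay of the counter that the
kernel may apply is ignored.

```c
#include <stdbool.h>

/* Hypothetical per-leg state; illustration only, not the kernel's struct. */
struct leg_state {
    int read_errors;   /* read errors seen so far on this leg */
    bool faulty;       /* set once the leg has been failed */
};

/*
 * Record one failed read on a leg.
 * Returns true once the leg is considered Faulty, i.e. when the counter
 * exceeds max_corrected (assumed comparison: cur > max).
 */
static bool note_read_error(struct leg_state *leg, int max_corrected)
{
    leg->read_errors++;
    if (leg->read_errors > max_corrected)
        leg->faulty = true;
    return leg->faulty;
}
```

Under this model, with the default limit of 20, the first 20 failed
reads leave the leg in place and the 21st fails it, which matches the
counts in the log.
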
>   dmeventd: WARNING: Device #0 of raid1 array, vg0-iris_boot, has failed.
>   dmeventd: WARNING: Waiting for resynchronization to finish before initiating repair on RAID device vg0-iris_boot.
>   dmeventd: Use 'lvconvert --repair vg0/iris_boot' to replace failed device.
> 
> Subsequent "lvs -a":
> 
>   WARNING: RaidLV vg0/iris_boot needs to be refreshed!
>   See character 'r' at position 9 in the RaidLV's attributes and its SubLV(s).
> 
> dmesg | grep nvme is EMPTY on this path. The NVMe driver is not
> involved in producing the error; the failure originates between the
> virtio-blk bio submission and raid1_end_read_request().
> 
> Symptom on legacy dm-mirror (pre-conversion --type mirror)
> ----------------------------------------------------------
> Same workload on drivers/md/dm-raid1.c reaches the NVMe controller
> as a zero-length READ and panics the host through dm-mirror's
> recovery path:
> 
>   kernel: operation not supported error, dev nvme1n1, sector 935446535 op 0x0:(READ) flags 0x0 phys_seg 0 prio class 2
>   kernel: nvme1n1: I/O Cmd(0x2) @ LBA 935446535, 0 blocks, I/O Error (sct 0x0 / sc 0x2)
>   [... 10+ identical bursts at same timestamp ...]
>   dmeventd: Primary mirror device 252:58 read failed.
>   dmeventd: vg0-iris_boot is now in-sync.
>   [kernel oops in dm_mirror recovery path, full trace lost to console flash]
> 
> The "phys_seg 0", "0 blocks", "sct 0x0/sc 0x2" trio (NVMe Generic,
> Invalid Field in Command, NVMe spec 4.1.1.2) is unambiguous: a bio
> with bi_iter.bi_size == 0 and bi_vcnt == 0 left the block layer and
> hit the controller. dm-raid raid1 hides this by retrying on the
> surviving leg, but the upstream-of-md trigger is identical.
> 
> Bisect
> ------
> git bisect, v6.12..v6.18, 16 deterministic GOOD/BAD steps, no skips,
> ~104 minutes:
> 
>   5ff3f74e145adc79b49668adb8de276446acf6be is the first bad commit
>   block: simplify direct io validity check
> 
>   --- a/block/fops.c
>   +++ b/block/fops.c
>   @@ -38,8 +38,8 @@ static blk_opf_t dio_bio_write_op(struct kiocb *iocb)
>    static bool blkdev_dio_invalid(struct block_device *bdev, struct kiocb *iocb,
>                                   struct iov_iter *iter)
>    {
>   -        return iocb->ki_pos & (bdev_logical_block_size(bdev) - 1) ||
>   -                !bdev_iter_is_aligned(bdev, iter);
>   +        return (iocb->ki_pos | iov_iter_count(iter)) &
>   +                        (bdev_logical_block_size(bdev) - 1);
>    }
> 
> The dropped bdev_iter_is_aligned() used to walk the iov_iter and
> reject per-segment misaligned/degenerate vectors at the blkdev fops
> entry point. The replacement only validates ki_pos and total length
> against the logical block size. Inputs that now pass the check and
> are no longer rejected:
> 
>   - iter with iov_iter_count(iter) == 0  (degenerate; total length is
>     "sector-aligned" since 0 % 512 == 0)
>   - iter where total length is sector-aligned but a segment isn't
> 
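
To make the two cases listed above concrete, here is a rough userspace
model of the two predicates from the diff. It is a sketch under stated
assumptions, not kernel code: old_check_invalid(), new_check_invalid()
and the flat seg_len array are hypothetical stand-ins for the iov_iter
handling, and the address-alignment part that the real
bdev_iter_is_aligned() also covered is left out.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Sum of all segment lengths, standing in for iov_iter_count(). */
static size_t total_len(const size_t *seg_len, size_t nr)
{
    size_t count = 0;
    for (size_t i = 0; i < nr; i++)
        count += seg_len[i];
    return count;
}

/*
 * Model of the removed check: the file position and every individual
 * segment length must be a multiple of the logical block size (lbs,
 * assumed to be a power of two).
 */
static bool old_check_invalid(uint64_t pos, const size_t *seg_len,
                              size_t nr, unsigned int lbs)
{
    uint64_t mask = lbs - 1;

    if (pos & mask)
        return true;
    for (size_t i = 0; i < nr; i++)
        if (seg_len[i] & mask)
            return true;
    return false;
}

/* Model of the replacement: only position and total length are checked. */
static bool new_check_invalid(uint64_t pos, const size_t *seg_len,
                              size_t nr, unsigned int lbs)
{
    uint64_t mask = lbs - 1;

    return ((pos | (uint64_t)total_len(seg_len, nr)) & mask) != 0;
}
```

With two segments of 256 and 768 bytes at offset 0 and a 512-byte
logical block size, the first function reports the vector as invalid
while the second accepts it, because only the 1024-byte total is
inspected. A vector with a total length of 0 also passes the second
function.
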
> The commit message justifies the removal with "The block layer
> checks all the segments for validity later". This is true for the
> io_uring submit path (which enters __blkdev_direct_IO directly and
> does its own validation) but not for the libaio aio_read/write_iter
> or the worker-pool sync read/write_iter paths that enter via
> blkdev_{read,write}_iter() -> blkdev_dio_invalid(). For those paths,
> the segment check has no replacement.
> 
> Reproducing
> -----------
> 
> The trigger requires QEMU virtio-blk's specific submission shape AND
> a non-io_uring submit. Userspace libaio alone, userspace
> preadv-in-a-thread alone, and QEMU's raw-driver open probes (which
> qemu-img info exercises identically) are all insufficient. The
> combination that hits the bug is "guest-driven I/O through
> virtio-blk-pci with cache.direct=on and aio in {native, threads}".
> 
> #regzbot introduced: 5ff3f74e145adc79b49668adb8de276446acf6be
> 
> Thanks,
> Vjaceslavs Klimovs
> 

