Hi Thimo,

Firstly, thank you for your bug report, we really, really appreciate it.
You are correct, the recent raid10 patches appear to cause filesystem corruption on raid10 arrays. I have spent the day reproducing, and I can confirm that the 4.15.0-126-generic, 5.4.0-56-generic and 5.8.0-31-generic kernels are affected.

The kernel team is aware of the situation, and we have begun an emergency revert of the patches; new kernels should be available within the next few hours to a day or so.

The current mainline kernel is also affected, so I have written to the raid subsystem maintainer and to the original author of the raid10 block discard patches to aid with debugging and fixing the problem. You can follow the upstream thread here:

https://www.spinics.net/lists/kernel/msg3765302.html

As for the data corruption on your servers, I am deeply sorry for causing this regression. When I was testing the raid10 block discard patches on the Ubuntu stable kernels, I did not think to fsck each of the disks in the array; instead, I was content with the speed of creating new arrays, writing a basic dataset to the disks, and rebooting the server to ensure the array came up again with those same files.

Since the first disk seems to be okay, there is at least a small window of opportunity for you to restore any data that you have not backed up.

I will keep you informed about the progress of reverting the patches and getting the root cause fixed upstream. If you have any questions, feel free to ask, and if you have any more details from your own debugging, feel free to share them in this bug or on the upstream mailing list discussion.

Thanks,
Matthew

--
https://bugs.launchpad.net/bugs/1907262

Title:
  raid10: discard leads to corrupted file system

Status in linux package in Ubuntu: Confirmed
Status in linux source package in Bionic: In Progress
Status in linux source package in Focal: In Progress
Status in linux source package in Groovy: In Progress

Bug description:

  Seems to be closely related to
  https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1896578

  After updating the Ubuntu 18.04 kernel from 4.15.0-124 to 4.15.0-126, the fstrim command triggered by fstrim.timer causes a large number of mismatches between the two RAID10 component devices. This bug affects several machines in our company with different HW configurations (all using ECC RAM). Both NVMe and SATA SSDs are affected.

  How to reproduce (a scripted version of these steps is sketched below):

  - Create a RAID10 array, an LVM volume and a filesystem on two SSDs:

    mdadm -C -v -l10 -n2 -N "lv-raid" -R /dev/md0 /dev/nvme0n1p2 /dev/nvme1n1p2
    pvcreate -ff -y /dev/md0
    vgcreate -f -y VolGroup /dev/md0
    lvcreate -n root -L 100G -ay -y VolGroup
    mkfs.ext4 /dev/VolGroup/root
    mount /dev/VolGroup/root /mnt

  - Write some data, sync, and delete it:

    dd if=/dev/zero of=/mnt/data.raw bs=4K count=1M
    sync
    rm /mnt/data.raw

  - Check the RAID device:

    echo check >/sys/block/md0/md/sync_action

  - After the check finishes (see /proc/mdstat), read the mismatch_cnt (should be 0):

    cat /sys/block/md0/md/mismatch_cnt

  - Trigger the bug:

    fstrim /mnt

  - Re-check the RAID device:

    echo check >/sys/block/md0/md/sync_action

  - After the check finishes (see /proc/mdstat), read the mismatch_cnt again (probably in the range of N*10000):

    cat /sys/block/md0/md/mismatch_cnt

  After investigating this issue on several machines, it *seems* that the first drive handles the trim correctly while the second one goes wild; at least, the number and severity of errors found by fsck.ext4 from a USB stick live session suggest this.
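  For convenience, the reproduction steps above can be strung together into a single script that also polls sync_action, so each check has finished before mismatch_cnt is read. This is only a sketch and not part of the original report: the script name, the device names and the wait_idle helper are assumptions to adapt, and the script DESTROYS all data on the two named partitions:

    #!/bin/bash
    # repro-raid10-trim.sh -- hypothetical name; adjust DEV1/DEV2 first.
    # WARNING: destroys all data on DEV1 and DEV2.
    set -eu

    DEV1=/dev/nvme0n1p2    # assumption: first SSD partition
    DEV2=/dev/nvme1n1p2    # assumption: second SSD partition
    MDNAME=md0

    # Wait until the array is no longer resyncing/checking.
    wait_idle() {
        while [ "$(cat /sys/block/$MDNAME/md/sync_action)" != "idle" ]; do
            sleep 5
        done
    }

    mdadm -C -v -l10 -n2 -N "lv-raid" -R /dev/$MDNAME "$DEV1" "$DEV2"
    wait_idle    # let the initial resync finish first
    pvcreate -ff -y /dev/$MDNAME
    vgcreate -f -y VolGroup /dev/$MDNAME
    lvcreate -n root -L 100G -ay -y VolGroup
    mkfs.ext4 /dev/VolGroup/root
    mount /dev/VolGroup/root /mnt

    dd if=/dev/zero of=/mnt/data.raw bs=4K count=1M
    sync
    rm /mnt/data.raw

    echo check >/sys/block/$MDNAME/md/sync_action
    wait_idle
    echo "mismatch_cnt before fstrim: $(cat /sys/block/$MDNAME/md/mismatch_cnt)"

    fstrim /mnt

    echo check >/sys/block/$MDNAME/md/sync_action
    wait_idle
    echo "mismatch_cnt after fstrim:  $(cat /sys/block/$MDNAME/md/mismatch_cnt)"

  On an affected kernel, the second mismatch_cnt line is the one that should show the large count described above.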
  To perform the single-drive evaluation, the RAID10 was started with one drive at a time (a scripted per-drive variant is sketched at the end of this message):

    mdadm --assemble /dev/md127 /dev/nvme0n1p2
    mdadm --run /dev/md127
    fsck.ext4 -n -f /dev/VolGroup/root

    vgchange -a n /dev/VolGroup
    mdadm --stop /dev/md127

    mdadm --assemble /dev/md127 /dev/nvme1n1p2
    mdadm --run /dev/md127
    fsck.ext4 -n -f /dev/VolGroup/root

  When running these fscks without -n, the directory structure on the first device appears to be OK, while on the second device only the lost+found folder is left.

  Side note: another machine using the HWE kernel 5.4.0-56 (after running -53 before) seems to have a very similar issue.

  Unfortunately, the risk/regression assessment in the aforementioned bug is not complete: the workaround only mitigates the issue during filesystem creation. This bug, on the other hand, is triggered by a weekly service (fstrim) and causes severe file system corruption.
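  For anyone who wants to check their own array components the same way, the per-drive evaluation above can be wrapped in a loop. This is a sketch, not part of the original report: the script name, component list and volume group name are assumptions matching the setup described here, and fsck is run with -n so nothing is ever written:

    #!/bin/bash
    # check-raid10-components.sh -- hypothetical name, based on the steps above.
    # Runs a read-only fsck against each RAID10 component individually.
    set -eu

    COMPONENTS="/dev/nvme0n1p2 /dev/nvme1n1p2"   # assumption: your component devices

    for dev in $COMPONENTS; do
        echo "=== Checking component $dev ==="
        mdadm --assemble --run /dev/md127 "$dev"     # start the array degraded, one member
        vgchange -a y VolGroup                       # activate the LVM volumes on top
        fsck.ext4 -n -f /dev/VolGroup/root || true   # -n: report only, never repair
        vgchange -a n VolGroup
        mdadm --stop /dev/md127
    done

  Diverging fsck results between the two components would match the pattern described in this report.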