On Fri, Jul 20, 2018 at 11:28:42PM +0200, Alexander Wetzel wrote:
> Hello,
>
> I'm running my normal workstation with git kernels from
> git://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless-testing.git
> and just got the second file system corruption in three weeks. I do
> not have issues with stable kernels, and just want to give you a
> heads up that there might be something seriously broken in current
> development kernels.
>
> The first corruption was with a kernel based on 4.18.0-rc1
> (wt-2018-06-20) and the second one today based on 4.18.0-rc4
> (wt-2018-07-09).
> The first corruption definitely destroyed data, the second one has
> not been looked at all, yet.
>
> After the reinstall I did run some scrubs, the last working one one
> week ago.
>
> Of course this could be unrelated to the development kernels or even
> btrfs, but two corruptions within weeks after years without problems
> is very suspect.
> And since btrfs also allowed to read corrupted data (with a stable
> ubuntu kernel, see below for more details) it looks like this is
> indeed an issue in btrfs, correct?
>
> A btrfs subvolume is used as the rootfs on a "Samsung SSD 850 EVO
> mSATA 1TB" and I'm running Gentoo ~amd64 on a Thinkpad W530. Discard
> is enabled as mount option and there were roughly 5 other
> subvolumes.
>
> I'm currently backing up the full btrfs partition after the second
> corruption which announced itself with the following log entries:
>
> [ 979.223767] BTRFS critical (device sdc2): corrupt leaf: root=2
> block=1029783552 slot=1, unexpected item end, have 16161 expect
> 16250

   This means that the metadata block matches the checksum in its
header, but is internally inconsistent. In other words, the error in
the block was made before the csum was computed -- i.e., it was that
way in RAM. This can happen in a couple of different ways, but the
most likely cause is bad RAM.

   In this case, it's not a single bitflip in the metadata page
itself, so it's more likely to be something writing spurious data on
the page in RAM that was holding this metadata block. This is either a
bug in the kernel, or a hardware problem.

   I would strongly recommend checking your RAM (memtest86 for a
minimum of 8 hours, preferably 24).

> [ 979.223808] BTRFS: error (device sdc2) in __btrfs_cow_block:1080:
> errno=-5 IO failure
> [ 979.223810] BTRFS info (device sdc2): forced readonly
> [ 979.224599] BTRFS warning (device sdc2): Skipping commit of
> aborted transaction.
> [ 979.224603] BTRFS: error (device sdc2) in
> cleanup_transaction:1847: errno=-5 IO failure
>
> I'll restore the system from a backup - and stick to stable kernels
> for now - after that, but if needed I can of course also restore the
> partition backup to another disk for testing.

   It may be a kernel issue, but it's not necessarily in btrfs. It
could be a bug in some other kernel component where it does some
pointer arithmetic wrong, or uses some uninitialised data as a
pointer. My money's on bad RAM, though (by a small margin).

   Hugo.

-- 
Hugo Mills             | Stick them with the pointy end.
hugo@... carfax.org.uk |
http://carfax.org.uk/  |
PGP: E2AB1DE4          |                                          Jon Snow
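
   For anyone who wants to see the checksum point concretely, here is
a minimal Python sketch. It is a toy model, not btrfs code: zlib.crc32
stands in for the crc32c that btrfs uses by default, the block layout
and the helper names (seal, csum_ok, first_bad_slot) are invented for
illustration, and LEAF_DATA_SIZE and the item offsets are assumed
values picked only so that the result lines up with the "have 16161
expect 16250" numbers in the log. The consistency rule used here (each
item's end must equal the previous item's offset) is my reading of
what the "unexpected item end" message checks.

```python
import struct
import zlib

LEAF_DATA_SIZE = 16283  # assumed toy value, not read from the real filesystem


def pack_items(items):
    """Serialise (offset, size) pairs as little-endian u32 pairs."""
    return b"".join(struct.pack("<II", off, size) for off, size in items)


def unpack_items(payload):
    """Inverse of pack_items: return a list of (offset, size) tuples."""
    return [struct.unpack_from("<II", payload, i)
            for i in range(0, len(payload), 8)]


def seal(payload):
    """Prepend a checksum of the payload exactly as it is in RAM right now."""
    return struct.pack("<I", zlib.crc32(payload)) + payload


def csum_ok(block):
    """True if the stored checksum matches the payload that follows it."""
    (stored,) = struct.unpack_from("<I", block, 0)
    return stored == zlib.crc32(block[4:])


def first_bad_slot(block):
    """Return (slot, have, expect) for the first inconsistent item, else None."""
    expect = LEAF_DATA_SIZE
    for slot, (off, size) in enumerate(unpack_items(block[4:])):
        have = off + size
        if have != expect:
            return slot, have, expect
        expect = off
    return None


good = [(16250, 33), (16100, 150)]
# Hypothetical damage: something scribbles on the page in RAM *before*
# the checksum is computed, so the checksum covers the bad data.
bad = [(16250, 33), (16100, 61)]

scribbled = seal(pack_items(bad))
print(csum_ok(scribbled))         # True
print(first_bad_slot(scribbled))  # (1, 16161, 16250)

# Contrast: damage *after* the checksum was computed (e.g. on disk).
on_disk = bytearray(seal(pack_items(good)))
on_disk[-1] ^= 0x01
print(csum_ok(bytes(on_disk)))    # False
```

   The first case is the one the log shows: the checksum is fine, so
the damage was already there when the csum was calculated. The second
case would show up as a csum failure instead.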