I've gotten similar errors when running a balance and a send/receive at the same time. In those cases the send/receive would abort because of the errors, but everything worked normally again after waiting for the balance to finish, with no further errors. These days I try to remember to disable any scheduled send/receive jobs before starting a balance.
Jorge

On Tue, Mar 12, 2019 at 4:03 AM Zygo Blaxell <ce3g8...@umail.furryterror.org> wrote:
>
> On 4.14.x and 4.20.14 kernels (probably all the ones in between too,
> but I haven't tested those), I get what I call "ghost parent transid
> verify failed" errors. Here's an unedited recent example from dmesg:
>
> [16180.649285] BTRFS error (device dm-3): parent transid verify
> failed on 1218181971968 wanted 9698 found 9744
> [16864.457119] BTRFS error (device dm-3): parent transid verify
> failed on 981488238592 wanted 9758 found 9763
> [16864.459153] BTRFS error (device dm-3): parent transid verify
> failed on 981488238592 wanted 9758 found 9763
> [16865.409955] BTRFS error (device dm-3): parent transid verify
> failed on 981488238592 wanted 9758 found 9763
> [16865.413091] BTRFS error (device dm-3): parent transid verify
> failed on 981488238592 wanted 9758 found 9763
> [16865.583441] BTRFS error (device dm-3): parent transid verify
> failed on 981488238592 wanted 9758 found 9763
> [16865.586602] BTRFS error (device dm-3): parent transid verify
> failed on 981488238592 wanted 9758 found 9763
> [17364.870504] BTRFS error (device dm-3): parent transid verify
> failed on 564765081600 wanted 9774 found 9779
> [17364.879696] BTRFS error (device dm-3): parent transid verify
> failed on 564765081600 wanted 9774 found 9779
> [17364.901713] BTRFS error (device dm-3): parent transid verify
> failed on 564765081600 wanted 9774 found 9779
> [17376.127722] BTRFS error (device dm-3): parent transid verify
> failed on 564765081600 wanted 9774 found 9779
> [17376.129965] BTRFS error (device dm-3): parent transid verify
> failed on 564765081600 wanted 9774 found 9779
> [17376.132919] BTRFS error (device dm-3): parent transid verify
> failed on 564765081600 wanted 9774 found 9779
> [17376.250605] BTRFS error (device dm-3): parent transid verify
> failed on 564765081600 wanted 9774 found 9779
> [17376.253633] BTRFS error (device dm-3): parent transid verify
> failed on 564765081600 wanted 9774 found 9779
> [17376.577192] BTRFS error (device dm-3): parent transid verify
> failed on 564765081600 wanted 9774 found 9779
> [17376.585174] BTRFS error (device dm-3): parent transid verify
> failed on 564765081600 wanted 9774 found 9779
> [17380.720874] BTRFS error (device dm-3): parent transid verify
> failed on 564765081600 wanted 9774 found 9779
> [17380.735643] BTRFS error (device dm-3): parent transid verify
> failed on 564765081600 wanted 9774 found 9779
> [17383.282250] BTRFS error (device dm-3): parent transid verify
> failed on 564765081600 wanted 9774 found 9779
> [17383.286598] BTRFS error (device dm-3): parent transid verify
> failed on 564765081600 wanted 9774 found 9779
> [17433.417529] BTRFS error (device dm-3): parent transid verify
> failed on 564765081600 wanted 9774 found 9779
> [17433.420685] BTRFS error (device dm-3): parent transid verify
> failed on 564765081600 wanted 9774 found 9779
> [17881.816489] BTRFS error (device dm-3): parent transid verify
> failed on 785410277376 wanted 9786 found 9794
> [17881.820072] BTRFS error (device dm-3): parent transid verify
> failed on 785410277376 wanted 9786 found 9794
> [17881.824307] BTRFS error (device dm-3): parent transid verify
> failed on 785410277376 wanted 9786 found 9794
> [19525.602623] BTRFS error (device dm-3): parent transid verify
> failed on 984570740736 wanted 9841 found 9846
> [19525.605395] BTRFS error (device dm-3): parent transid verify
> failed on 984570740736 wanted 9841 found 9846
> [19525.829326] BTRFS error (device dm-3): parent transid verify
> failed on 984570740736 wanted 9841 found 9846
> [19525.834123] BTRFS error (device dm-3): parent transid verify
> failed on 984570740736 wanted 9841 found 9846
> [19525.905553] BTRFS error (device dm-3): parent transid verify
> failed on 984570740736 wanted 9841 found 9846
> [19525.909302] BTRFS error (device dm-3): parent transid verify
> failed on 984570740736 wanted 9841 found 9846
> [19525.914374] BTRFS error (device dm-3): parent transid verify
> failed on 984570740736 wanted 9841 found 9846
> [19525.917829] BTRFS error (device dm-3): parent transid verify
> failed on 984570740736 wanted 9841 found 9846
> [19525.980067] BTRFS error (device dm-3): parent transid verify
> failed on 984570740736 wanted 9841 found 9846
> [19525.983479] BTRFS error (device dm-3): parent transid verify
> failed on 984570740736 wanted 9841 found 9846
>
> They seem to occur while running these simultaneously:
>
> * rsync receiving a lot of data
>
> * snapshot delete to make space for more data
>
> * bees deduping said data--possibly on the snapshot being
> deleted.
>
> All three seem to be needed--these messages do not appear if one of the
> above is not running at the time. I can get the messages on a test VM
> that is running nothing but the above three things on the test filesystem.
>
> Unlike the usual disaster cases when "parent transid verify failed"
> messages appear, these incidents seem to be entirely harmless. They've
> been appearing for years (since at least the first v4.14 kernels)
> and...nothing else happens. The filesystem doesn't go read-only.
> It doesn't become irreparably damaged. Nothing locks up or crashes
> (at the same time). The system just keeps running normally, often for
> months, emitting a few of these messages during longer snapshot deletes.
>
> This happens on several different host systems. There is no sign of
> hardware failure. It looks like a real bug, but I don't have a simpler
> reproducer, and sometimes weeks or months pass between incidents on a
> given machine.
>
> Scrubs are always clean. There are never errors recorded in btrfs
> dev stat:
>
> # btrfs dev stat /testfs/
> [/dev/mapper/vgtest-tvdb].write_io_errs 0
> [/dev/mapper/vgtest-tvdb].read_io_errs 0
> [/dev/mapper/vgtest-tvdb].flush_io_errs 0
> [/dev/mapper/vgtest-tvdb].corruption_errs 0
> [/dev/mapper/vgtest-tvdb].generation_errs 0
> [/dev/mapper/vgtest-tvdc].write_io_errs 0
> [/dev/mapper/vgtest-tvdc].read_io_errs 0
> [/dev/mapper/vgtest-tvdc].flush_io_errs 0
> [/dev/mapper/vgtest-tvdc].corruption_errs 0
> [/dev/mapper/vgtest-tvdc].generation_errs 0
>
> dump-tree on the offending block doesn't find a metadata page in the
> block right after the error occurs. Sometimes it finds a newer block
> (higher transid) some hours later:
>
> # btrfs ins dump-tree -b 984570740736 /dev/mapper/vgtest-tvdc
> btrfs-progs v4.20.1
> leaf 984570740736 items 151 free space 7525 generation 9882 owner
> EXTENT_TREE
> leaf 984570740736 flags 0x1(WRITTEN) backref revision 1
> [...etc]
>
> Mount options common to all systems exhibiting this behavior:
>
> rw,noatime,compress,space_cache=v2
>
> both ssd and nossd
>
> any of: compress, compress-force, compress-force=zstd, compress=zstd
>
> My guess/theory: someone is maybe holding a reference to metadata
> they don't hold a lock for, and the page is being updated or replaced
> underneath due to the snapshot delete, triggering the verification
> failure. The error doesn't get persisted to disk because by the time the
> error is discovered, the page was already deleted by the snapshot delete.
> No errors are getting persisted to disk, so it's probably something
> that happens in a read operation rather than a write operation. Every
> application reads and writes data, so if it were in plain read or write
> code someone else would have noticed by now. It's probably not a common
> code path. bees seems to be required to trigger the bug (or triggers it
> much more often) so it may be a problem somewhere in one of the ioctls
> bees uses. bees does a lot of FILE_EXTENT_SAME, TREE_SEARCH_V2, and
> LOGICAL_INO ioctls. It could be any of the three, but FILE_EXTENT_SAME
> does writes, and TREE_SEARCH_V2 isn't used by bees to look at items in
> EXTENT_TREE, which leaves LOGICAL_INO which starts from EXTENT_TREE and
> works outwards to subvol trees. So my guess is it's a bug in LOGICAL_INO.
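For anyone who wants to poke at the LOGICAL_INO path described above, here is a rough, untested sketch of mine (not code from bees or from Zygo's setup) of what such a call looks like: it resolves a logical bytenr like the ones in the dmesg errors back to (inode, offset, root) references. The mount point path, buffer size, and logical address are placeholders.

/* logical_ino.c: minimal sketch of the BTRFS_IOC_LOGICAL_INO ioctl. */
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <linux/btrfs.h>

int main(void)
{
        const char *path = "/testfs";         /* any path on the filesystem (placeholder) */
        __u64 logical = 984570740736ULL;      /* logical bytenr to resolve (placeholder) */

        int fd = open(path, O_RDONLY);
        if (fd < 0) { perror("open"); return 1; }

        /* Output buffer: a struct btrfs_data_container followed by val[] */
        size_t bufsize = 65536;
        struct btrfs_data_container *inodes = calloc(1, bufsize);
        if (!inodes) { perror("calloc"); return 1; }

        struct btrfs_ioctl_logical_ino_args args;
        memset(&args, 0, sizeof(args));
        args.logical = logical;
        args.size = bufsize;
        args.inodes = (__u64)(uintptr_t)inodes;

        if (ioctl(fd, BTRFS_IOC_LOGICAL_INO, &args) < 0) {
                perror("BTRFS_IOC_LOGICAL_INO");
                return 1;
        }

        /* val[] holds (inum, offset, root) triples, one per extent reference */
        for (__u32 i = 0; i < inodes->elem_cnt; i += 3)
                printf("inum %llu offset %llu root %llu\n",
                       (unsigned long long)inodes->val[i],
                       (unsigned long long)inodes->val[i + 1],
                       (unsigned long long)inodes->val[i + 2]);

        free(inodes);
        close(fd);
        return 0;
}

btrfs-progs exposes the same ioctl as "btrfs inspect-internal logical-resolve", which might be an easier way to hit that code path from a shell while the messages are being emitted.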