I've seen similar errors when running a balance and a send/receive at
the same time.  In those cases the send/receive would abort because of
the errors, but everything worked normally again after waiting for the
balance to finish, with no further errors.  These days I try to
remember to disable any scheduled send/receive jobs before starting a
balance.

Jorge

On Tue, Mar 12, 2019 at 4:03 AM Zygo Blaxell
<ce3g8...@umail.furryterror.org> wrote:
>
> On 4.14.x and 4.20.14 kernels (probably all the ones in between too,
> but I haven't tested those), I get what I call "ghost parent transid
> verify failed" errors.  Here's an unedited recent example from dmesg:
>
>         [16180.649285] BTRFS error (device dm-3): parent transid verify failed on 1218181971968 wanted 9698 found 9744
>         [16864.457119] BTRFS error (device dm-3): parent transid verify failed on 981488238592 wanted 9758 found 9763
>         [16864.459153] BTRFS error (device dm-3): parent transid verify failed on 981488238592 wanted 9758 found 9763
>         [16865.409955] BTRFS error (device dm-3): parent transid verify failed on 981488238592 wanted 9758 found 9763
>         [16865.413091] BTRFS error (device dm-3): parent transid verify failed on 981488238592 wanted 9758 found 9763
>         [16865.583441] BTRFS error (device dm-3): parent transid verify failed on 981488238592 wanted 9758 found 9763
>         [16865.586602] BTRFS error (device dm-3): parent transid verify failed on 981488238592 wanted 9758 found 9763
>         [17364.870504] BTRFS error (device dm-3): parent transid verify failed on 564765081600 wanted 9774 found 9779
>         [17364.879696] BTRFS error (device dm-3): parent transid verify failed on 564765081600 wanted 9774 found 9779
>         [17364.901713] BTRFS error (device dm-3): parent transid verify failed on 564765081600 wanted 9774 found 9779
>         [17376.127722] BTRFS error (device dm-3): parent transid verify failed on 564765081600 wanted 9774 found 9779
>         [17376.129965] BTRFS error (device dm-3): parent transid verify failed on 564765081600 wanted 9774 found 9779
>         [17376.132919] BTRFS error (device dm-3): parent transid verify failed on 564765081600 wanted 9774 found 9779
>         [17376.250605] BTRFS error (device dm-3): parent transid verify failed on 564765081600 wanted 9774 found 9779
>         [17376.253633] BTRFS error (device dm-3): parent transid verify failed on 564765081600 wanted 9774 found 9779
>         [17376.577192] BTRFS error (device dm-3): parent transid verify failed on 564765081600 wanted 9774 found 9779
>         [17376.585174] BTRFS error (device dm-3): parent transid verify failed on 564765081600 wanted 9774 found 9779
>         [17380.720874] BTRFS error (device dm-3): parent transid verify failed on 564765081600 wanted 9774 found 9779
>         [17380.735643] BTRFS error (device dm-3): parent transid verify failed on 564765081600 wanted 9774 found 9779
>         [17383.282250] BTRFS error (device dm-3): parent transid verify failed on 564765081600 wanted 9774 found 9779
>         [17383.286598] BTRFS error (device dm-3): parent transid verify failed on 564765081600 wanted 9774 found 9779
>         [17433.417529] BTRFS error (device dm-3): parent transid verify failed on 564765081600 wanted 9774 found 9779
>         [17433.420685] BTRFS error (device dm-3): parent transid verify failed on 564765081600 wanted 9774 found 9779
>         [17881.816489] BTRFS error (device dm-3): parent transid verify failed on 785410277376 wanted 9786 found 9794
>         [17881.820072] BTRFS error (device dm-3): parent transid verify failed on 785410277376 wanted 9786 found 9794
>         [17881.824307] BTRFS error (device dm-3): parent transid verify failed on 785410277376 wanted 9786 found 9794
>         [19525.602623] BTRFS error (device dm-3): parent transid verify failed on 984570740736 wanted 9841 found 9846
>         [19525.605395] BTRFS error (device dm-3): parent transid verify failed on 984570740736 wanted 9841 found 9846
>         [19525.829326] BTRFS error (device dm-3): parent transid verify failed on 984570740736 wanted 9841 found 9846
>         [19525.834123] BTRFS error (device dm-3): parent transid verify failed on 984570740736 wanted 9841 found 9846
>         [19525.905553] BTRFS error (device dm-3): parent transid verify failed on 984570740736 wanted 9841 found 9846
>         [19525.909302] BTRFS error (device dm-3): parent transid verify failed on 984570740736 wanted 9841 found 9846
>         [19525.914374] BTRFS error (device dm-3): parent transid verify failed on 984570740736 wanted 9841 found 9846
>         [19525.917829] BTRFS error (device dm-3): parent transid verify failed on 984570740736 wanted 9841 found 9846
>         [19525.980067] BTRFS error (device dm-3): parent transid verify failed on 984570740736 wanted 9841 found 9846
>         [19525.983479] BTRFS error (device dm-3): parent transid verify failed on 984570740736 wanted 9841 found 9846
>
> They seem to occur while running these simultaneously:
>
>         * rsync receiving a lot of data
>
>         * snapshot delete to make space for more data
>
>         * bees deduping said data--possibly on the snapshot being
>         deleted.
>
> All three seem to be needed--these messages do not appear if one of the
> above is not running at the time.  I can get the messages on a test VM
> that is running nothing but the above three things on the test filesystem.
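>
> Roughly, that VM workload looks like this (the paths, subvolume name,
> and bees invocation below are only illustrative placeholders, not the
> exact commands):
>
>         # rsync -a remote:/lots/of/data/ /testfs/incoming/ &
>         # btrfs subvolume delete /testfs/snapshots/stale-snap &
>         # beesd <filesystem-uuid> &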
>
> Unlike the usual disaster cases when "parent transid verify failed"
> messages appear, these incidents seem to be entirely harmless.  They've
> been appearing for years (since at least the first v4.14 kernels)
> and...nothing else happens.  The filesystem doesn't go read-only.
> It doesn't become irreparably damaged.  Nothing locks up or crashes
> (at the same time).  The system just keeps running normally, often for
> months, emitting a few of these messages during longer snapshot deletes.
>
> This happens on several different host systems.  There is no sign of
> hardware failure.  It looks like a real bug, but I don't have a simpler
> reproducer, and sometimes weeks or months pass between incidents on a
> given machine.
>
> Scrubs are always clean.  There are never errors recorded in btrfs
> dev stat:
>
>         # btrfs dev stat /testfs/
>         [/dev/mapper/vgtest-tvdb].write_io_errs   0
>         [/dev/mapper/vgtest-tvdb].read_io_errs    0
>         [/dev/mapper/vgtest-tvdb].flush_io_errs   0
>         [/dev/mapper/vgtest-tvdb].corruption_errs 0
>         [/dev/mapper/vgtest-tvdb].generation_errs 0
>         [/dev/mapper/vgtest-tvdc].write_io_errs   0
>         [/dev/mapper/vgtest-tvdc].read_io_errs    0
>         [/dev/mapper/vgtest-tvdc].flush_io_errs   0
>         [/dev/mapper/vgtest-tvdc].corruption_errs 0
>         [/dev/mapper/vgtest-tvdc].generation_errs 0
>
> Running dump-tree on the offending block right after the error occurs
> doesn't find a metadata page there.  Sometimes, some hours later, it
> finds a newer block (higher transid):
>
>         # btrfs ins dump-tree -b 984570740736 /dev/mapper/vgtest-tvdc
>         btrfs-progs v4.20.1
>         leaf 984570740736 items 151 free space 7525 generation 9882 owner EXTENT_TREE
>         leaf 984570740736 flags 0x1(WRITTEN) backref revision 1
>         [...etc]
>
> Mount options common to all systems exhibiting this behavior:
>
>         rw,noatime,compress,space_cache=v2
>
>         both ssd and nossd
>
>         any of: compress, compress-force, compress-force=zstd, compress=zstd
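>
> For concreteness, a mount line combining those options (device and
> mountpoint taken from the dev stat output above, compression flavour
> varying as noted) would look roughly like:
>
>         # mount -o rw,noatime,compress=zstd,space_cache=v2 \
>                 /dev/mapper/vgtest-tvdb /testfs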
>
> My guess/theory:  someone may be holding a reference to metadata they
> don't hold a lock for, and the page is being updated or replaced
> underneath them due to the snapshot delete, triggering the
> verification failure.  The error doesn't get persisted to disk
> because, by the time it is discovered, the page has already been
> freed by the snapshot delete.  Since no errors get persisted to disk,
> it's probably something that happens in a read operation rather than
> a write operation.  Every application reads and writes data, so if it
> were in plain read or write code someone else would have noticed by
> now; it's probably not a common code path.  bees seems to be required
> to trigger the bug (or triggers it much more often), so it may be a
> problem somewhere in one of the ioctls bees uses.  bees does a lot of
> FILE_EXTENT_SAME, TREE_SEARCH_V2, and LOGICAL_INO ioctls.  It could
> be any of the three, but FILE_EXTENT_SAME does writes, and
> TREE_SEARCH_V2 isn't used by bees to look at items in EXTENT_TREE,
> which leaves LOGICAL_INO, which starts from EXTENT_TREE and works
> outwards to subvol trees.  So my guess is it's a bug in LOGICAL_INO.
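>
> If LOGICAL_INO is the culprit, a narrower reproducer might be to
> hammer that ioctl on its own while a snapshot delete is running.
> btrfs-progs wraps LOGICAL_INO in "inspect-internal logical-resolve",
> so something like the loop below might do it (the logical address and
> mountpoint are placeholders; bees resolves every data extent it finds
> rather than one fixed address):
>
>         # while :; do btrfs inspect-internal logical-resolve <data-extent-logical> /testfs; done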
