Hmmm btrfs send is _also_ prone to use-after-free issues when running with
balance (or anything that modifies readonly snapshots).  That and the
lockdep warning on 5.0.x points to...well, maybe not the same bug as
the snapshot delete case, but either a single bug in common code,
or two similar bugs in two different places.

On Tue, Mar 12, 2019 at 05:34:36PM +0000, Jorge Bastos wrote:
> I've got similar errors when doing a balance and send/receive at the
> same time, in these cases send/receive would abort because of the
> errors, but it all worked normally after waiting for the balance to
> finish, no more errors, though now I try to remember to disable any
> scheduled send/receive jobs before a balance.
> 
> Jorge
> 
> On Tue, Mar 12, 2019 at 4:03 AM Zygo Blaxell
> <ce3g8...@umail.furryterror.org> wrote:
> >
> > On 4.14.x and 4.20.14 kernels (probably all the ones in between too,
> > but I haven't tested those), I get what I call "ghost parent transid
> > verify failed" errors.  Here's an unedited recent example from dmesg:
> >
> >         [16180.649285] BTRFS error (device dm-3): parent transid verify 
> > failed on 1218181971968 wanted 9698 found 9744
> >         [16864.457119] BTRFS error (device dm-3): parent transid verify 
> > failed on 981488238592 wanted 9758 found 9763
> >         [16864.459153] BTRFS error (device dm-3): parent transid verify 
> > failed on 981488238592 wanted 9758 found 9763
> >         [16865.409955] BTRFS error (device dm-3): parent transid verify 
> > failed on 981488238592 wanted 9758 found 9763
> >         [16865.413091] BTRFS error (device dm-3): parent transid verify 
> > failed on 981488238592 wanted 9758 found 9763
> >         [16865.583441] BTRFS error (device dm-3): parent transid verify 
> > failed on 981488238592 wanted 9758 found 9763
> >         [16865.586602] BTRFS error (device dm-3): parent transid verify 
> > failed on 981488238592 wanted 9758 found 9763
> >         [17364.870504] BTRFS error (device dm-3): parent transid verify 
> > failed on 564765081600 wanted 9774 found 9779
> >         [17364.879696] BTRFS error (device dm-3): parent transid verify 
> > failed on 564765081600 wanted 9774 found 9779
> >         [17364.901713] BTRFS error (device dm-3): parent transid verify 
> > failed on 564765081600 wanted 9774 found 9779
> >         [17376.127722] BTRFS error (device dm-3): parent transid verify 
> > failed on 564765081600 wanted 9774 found 9779
> >         [17376.129965] BTRFS error (device dm-3): parent transid verify 
> > failed on 564765081600 wanted 9774 found 9779
> >         [17376.132919] BTRFS error (device dm-3): parent transid verify 
> > failed on 564765081600 wanted 9774 found 9779
> >         [17376.250605] BTRFS error (device dm-3): parent transid verify 
> > failed on 564765081600 wanted 9774 found 9779
> >         [17376.253633] BTRFS error (device dm-3): parent transid verify 
> > failed on 564765081600 wanted 9774 found 9779
> >         [17376.577192] BTRFS error (device dm-3): parent transid verify 
> > failed on 564765081600 wanted 9774 found 9779
> >         [17376.585174] BTRFS error (device dm-3): parent transid verify 
> > failed on 564765081600 wanted 9774 found 9779
> >         [17380.720874] BTRFS error (device dm-3): parent transid verify 
> > failed on 564765081600 wanted 9774 found 9779
> >         [17380.735643] BTRFS error (device dm-3): parent transid verify 
> > failed on 564765081600 wanted 9774 found 9779
> >         [17383.282250] BTRFS error (device dm-3): parent transid verify 
> > failed on 564765081600 wanted 9774 found 9779
> >         [17383.286598] BTRFS error (device dm-3): parent transid verify 
> > failed on 564765081600 wanted 9774 found 9779
> >         [17433.417529] BTRFS error (device dm-3): parent transid verify 
> > failed on 564765081600 wanted 9774 found 9779
> >         [17433.420685] BTRFS error (device dm-3): parent transid verify 
> > failed on 564765081600 wanted 9774 found 9779
> >         [17881.816489] BTRFS error (device dm-3): parent transid verify 
> > failed on 785410277376 wanted 9786 found 9794
> >         [17881.820072] BTRFS error (device dm-3): parent transid verify 
> > failed on 785410277376 wanted 9786 found 9794
> >         [17881.824307] BTRFS error (device dm-3): parent transid verify 
> > failed on 785410277376 wanted 9786 found 9794
> >         [19525.602623] BTRFS error (device dm-3): parent transid verify 
> > failed on 984570740736 wanted 9841 found 9846
> >         [19525.605395] BTRFS error (device dm-3): parent transid verify 
> > failed on 984570740736 wanted 9841 found 9846
> >         [19525.829326] BTRFS error (device dm-3): parent transid verify 
> > failed on 984570740736 wanted 9841 found 9846
> >         [19525.834123] BTRFS error (device dm-3): parent transid verify 
> > failed on 984570740736 wanted 9841 found 9846
> >         [19525.905553] BTRFS error (device dm-3): parent transid verify 
> > failed on 984570740736 wanted 9841 found 9846
> >         [19525.909302] BTRFS error (device dm-3): parent transid verify 
> > failed on 984570740736 wanted 9841 found 9846
> >         [19525.914374] BTRFS error (device dm-3): parent transid verify 
> > failed on 984570740736 wanted 9841 found 9846
> >         [19525.917829] BTRFS error (device dm-3): parent transid verify 
> > failed on 984570740736 wanted 9841 found 9846
> >         [19525.980067] BTRFS error (device dm-3): parent transid verify 
> > failed on 984570740736 wanted 9841 found 9846
> >         [19525.983479] BTRFS error (device dm-3): parent transid verify 
> > failed on 984570740736 wanted 9841 found 9846
> >
> > They seem to occur while running these simultaneously:
> >
> >         * rsync receiving a lot of data
> >
> >         * snapshot delete to make space for more data
> >
> >         * bees deduping said data--possibly on the snapshot being
> >         deleted.
> >
> > All three seem to be needed--these messages do not appear if one of the
> > above is not running at the time.  I can get the messages on a test VM
> > that is running nothing but the above three things on the test filesystem.
> >
> > Unlike the usual disaster cases when "parent transid verify failed"
> > messages appear, these incidents seem to be entirely harmless.  They've
> > been appearing for years (since at least the first v4.14 kernels)
> > and...nothing else happens.  The filesystem doesn't go read-only.
> > It doesn't become irreparably damaged.  Nothing locks up or crashes
> > (at the same time).  The system just keeps running normally, often for
> > months, emitting a few of these messages during longer snapshot deletes.
> >
> > This happens on several different host systems.  There is no sign of
> > hardware failure.  It looks like a real bug, but I don't have a simpler
> > reproducer, and sometimes weeks or months pass between incidents on a
> > given machine.
> >
> > Scrubs are always clean.  There are never errors recorded in btrfs
> > dev stat:
> >
> >         # btrfs dev stat /testfs/
> >         [/dev/mapper/vgtest-tvdb].write_io_errs   0
> >         [/dev/mapper/vgtest-tvdb].read_io_errs    0
> >         [/dev/mapper/vgtest-tvdb].flush_io_errs   0
> >         [/dev/mapper/vgtest-tvdb].corruption_errs 0
> >         [/dev/mapper/vgtest-tvdb].generation_errs 0
> >         [/dev/mapper/vgtest-tvdc].write_io_errs   0
> >         [/dev/mapper/vgtest-tvdc].read_io_errs    0
> >         [/dev/mapper/vgtest-tvdc].flush_io_errs   0
> >         [/dev/mapper/vgtest-tvdc].corruption_errs 0
> >         [/dev/mapper/vgtest-tvdc].generation_errs 0
> >
> > dump-tree on the offending block doesn't find a metadata page in the
> > block right after the error occurs.  Sometimes it finds a newer block
> > (higher transid) some hours later:
> >
> >         # btrfs ins dump-tree -b 984570740736 /dev/mapper/vgtest-tvdc
> >         btrfs-progs v4.20.1
> >         leaf 984570740736 items 151 free space 7525 generation 9882 owner 
> > EXTENT_TREE
> >         leaf 984570740736 flags 0x1(WRITTEN) backref revision 1
> >         [...etc]
> >
> > Mount options common to all systems exhibiting this behavior:
> >
> >         rw,noatime,compress,space_cache=v2
> >
> >         both ssd and nossd
> >
> >         any of: compress, compress-force, compress-force=zstd, compress=zstd
> >
> > My guess/theory:  someone is maybe holding a reference to metadata
> > they don't hold a lock for, and the page is being updated or replaced
> > underneath due to the snapshot delete, triggering the verification
> > failure.  The error doesn't get persisted to disk because by the time the
> > error is discovered, the page was already deleted by the snapshot delete.
> > No errors are getting persisted to disk, so it's probably something
> > that happens in a read operation rather than a write operation.  Every
> > application reads and writes data, so if it were in plain read or write
> > code someone else would have noticed by now.  it's probably not a common
> > code path.  bees seems to be required to trigger the bug (or triggers it
> > much more often) so it may be a problem somewhere in one of the ioctls
> > bees uses.  bees does a lot of FILE_EXTENT_SAME, TREE_SEARCH_V2, and
> > LOGICAL_INO ioctls.  It could be any of the three, but FILE_EXTENT_SAME
> > does writes, and TREE_SEARCH_V2 isn't used by bees to look at items in
> > EXTENT_TREE, which leaves LOGICAL_INO which starts from EXTENT_TREE and
> > works outwards to subvol trees.  So my guess is it's a bug in LOGICAL_INO.

Reply via email to