Re: performance implications of btrfs filesystem sync vs btrfs subvolume snaphot

Zygo Blaxell Sat, 23 Jan 2021 09:16:03 -0800

On Sat, Jan 23, 2021 at 12:22:11PM +0000, alex...@cobios.de wrote:
> Hello everybody :)
> 
> first time participant on linux-btrfs@vger.kernel.org mailinglist, 
> hence please excuse (yet tell me about) any problems. thank you.
> 
> My question/topic is:
> Wanting to generate backups of a btrfs filesystems on a running system
> it seems that using `btrfs subvolume snapshot` would be a possible way
> to make certain that data kept in RAM (i.e. buffer/cache) would be 
> synced to the disk.  
> 
> Reading this mailing list I stumpled upon this:
> 
> >> Subject:    Re: freezes during snapshot creation/deletion -- to be 
> >> expected? (Was: Re: btrfs based backup?)
> >> From:       Zygo Blaxell <ce3g8jdj () umail ! furryterror ! org>
> >>
> >> [..]
> >>
> >> Snapshot create has unbounded running time on 5.0 kernels.  The creation
> >> process has to flush dirty buffers to the filesystem to get a clean
> >> snapshot state.  Any process that is writing data while the flush is
> >> running gets its data included in the snapshot flush, so in the worst
> >> possible case, the snapshot flush never ends (unless you run out of disk
> >> space, or whatever was writing new data stops, whichever comes first).
> >> [..]
> 
> Now I wonder that if `btrfs filesystem sync` would be a viable alternative 
> to `btrfs subvolume snapshot`, with respect of not having to risk a 
> "snapshot flush never ends" situation? 
> 
> My layman perception is that.
> 
> 1) "btrfs on-disk-persistet data is ideally alway non-corrupted". Since
> changes are commited via COW and hence in a atomic fashion, meaning that
> at worst data on disk is outdated, but never corrupt. (unless hardware or 
> blockdevice issues )
> 
> 2) btrfs filesystem sync or sync(1) should flush data out from memory
> to disk - which would once finished - lead to a "more recent" consistent
> data on disk.


> 3) btrfs subvolume snapshot implies a sync
>
> Are those perceptions roughly correct?

1 and 2 are flipped--data is written first, then metadata pointing to
the data, then superblocks pointing to the metadata; however, delalloc
can delay data writes that occurred before the current commit so they
don't reach the disk until after the current commit.  In that case,
the following commit will reference the delayed data.

So with noflushonocommit, by default, assuming nobody ever calls
fsync(), we get:

        1.  process does some writes 1, 2, 3, 4, 5...

        2.  delalloc puts writes 1, 2, 3 on disk, updates in-memory
        metadata

        3.  kernel commit starts, flushes in-memory metadata to disk for
        writes 1, 2, 3.  4 and 5 are not complete yet so no metadata
        points to them.  Inodes are updated (especially file size),
        so after a crash we will have data for 1, 2, 3, and holes for
        4 and 5 (this is why noflushoncommit is bad).

        4.  metadata written to disk, disk write barrier prevents and
        writes after this line from being reordered before this line.

        5.  superblock updated to point to root of new metadata trees,
        disk write barrier again.

        6.  meanwhile delalloc was still writing out 4 and 5 to disk.
        No metadata points to them during the write, so incomplete writes
        don't matter; however, once the writes are complete, in-memory
        metadata is updated to point to them.

        7.  also meanwhile the process wrote more data writes 6, 7, 8
        which delalloc hasn't processed yet.

        8.  kernel commit starts, flushes in-memory metadata to disk for
        writes 4, 5 which are now done, but not 6, 7, 8 because delalloc
        hasn't started them yet.  Inodes are updated, now after a crash
        we have data for 1-5 but holes for 6, 7, or 8.

With 'flushoncommit', 'btrfs sub snap', and 'sync', the commit at step
3 waits for all delalloc writes to finish first, which fills in all
the holes that 'noflushoncommit' would normally leave behind and gives
snapshots their atomic behavior; however, delalloc at step 6 and the
process at step 7 may not be blocked, and they can keep adding more
writes to the transaction while the transaction commits.

> If so I am unsure if the issue with a "neverending flush" is related to
> the btrfs filesystem sync and consequently relying on btrfs filesystem 
> sync as alternative to btrfs snapshot to prevent "a neverending flush"
> is not a possibility. 

The issue is that processes can queue up more work for delalloc writes
while a sync is running.  If they can do that faster than the sync can
flush the data to disk, and if the disk still has remaining space for
metadata and data writes, then the transaction never gets to the point
where it has finished flushing out new data, so the sync runs until the
filesystem runs out of space.

Note the filesystem can't delete anything while a commit is running
(deletes are implied by the change in free space maps after a transaction
commit, so nothing gets deleted until a transaction commit is completed).
We _are_ guaranteed to eventually run out of space and complete or abort
this transaction, even if the process is overwriting data in the same
logical locations or deleting files as it goes.

There was a patch which corrects this for the one subvol that is directly
involved in the snapshot.  So

        cat /dev/zero > /mnt/sub1/file &
        btrfs sub snap /mnt/sub1 /mnt/snap1

no longer runs forever, because all of the writes before the sub snap
starts are tagged differently from all of the writes after the sub snap
starts, and the sub snap only waits for writes with the "before" tag.
Thus, the sub snap is guaranteed to finish in bounded time (proportional
to the size of vm.dirty_bytes).  Unfortunately, the patch that does this
(v5.0-rc1: 3cd24c698004 "btrfs: use tagged writepage to mitigate livelock
of snapshot") doesn't handle this case:

        cat /dev/zero > /mnt/sub1/file &
        btrfs sub snap /mnt/sub2 /mnt/snap2

because it only tags the subvolume that is being snapshotted--it still
has the unbounded running time for writes to any _other_ subvol.
A similar problem happens for sync, except sync doesn't even have the
above mitigation.

> Tahnk yo and best regards,
> 
> Alexander Mahr

Re: performance implications of btrfs filesystem sync vs btrfs subvolume snaphot

Reply via email to