Quoting General Zed <general-...@zedlx.com>:

Quoting General Zed <general-...@zedlx.com>:

Quoting Zygo Blaxell <ce3g8...@umail.furryterror.org>:

On Thu, Sep 12, 2019 at 08:26:04PM -0400, General Zed wrote:

Quoting Zygo Blaxell <ce3g8...@umail.furryterror.org>:

On Thu, Sep 12, 2019 at 06:57:26PM -0400, General Zed wrote:

At worst, it just has to completely write out "all metadata", all the way up
to the super. It needs to be done just once, because what's the point of
writing it 10 times over? Then, the super is updated as the final commit.

This is kind of a silly discussion.  The biggest extent possible on
btrfs is 128MB, and the incremental gains of forcing 128MB extents to
be consecutive are negligible.  If you're defragging a 10GB file, you're
just going to end up doing 80 separate defrag operations.

Ok, then the max extent is 128 MB, that's fine. Someone here previously said
that it is 2 GB, so he misinformed me (in order to further his false
argument).

If the 128MB limit is removed, you then hit the block group size limit,
which is some number of GB from 1 to 10 depending on number of disks
available and raid profile selection (the striping raid profiles cap
block group sizes at 10 disks, and single/raid1 profiles always use 1GB
block groups regardless of disk count).  So 2GB is _also_ a valid extent
size limit, just not the first limit that is relevant for defrag.
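
(For illustration, using the numbers above: a striping profile such as
raid0 across four disks gives 4 x 1GB = 4GB block groups, capping at
10GB once ten or more disks are present, while single/raid1 stays at
1GB block groups no matter how many disks there are.)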

A lot of people get confused by 'filefrag -v' output, which coalesces
physically adjacent but distinct extents.  So if you use that tool,
it can _seem_ like there is a 2.5GB extent in a file, but it is really
20 distinct 128MB extents that start and end at adjacent addresses.
You can see the true structure in 'btrfs ins dump-tree' output.

That also brings up another reason why 10GB defrags are absurd on btrfs:
extent addresses are virtual.  There's no guarantee that a pair of extents
that meet at a block group boundary are physically adjacent, and after
operations like RAID array reorganization or free space defragmentation,
they are typically quite far apart physically.

I never said that I would force extents larger than 128 MB.

If you are defragging a 10 GB file, you'll likely have to do it in 10 steps,
because the defrag is usually only allowed to use a limited amount of disk
space while in operation. That has nothing to do with the extent size.

Defrag is literally manipulating the extent size.  Fragments and extents
are the same thing in btrfs.

Currently a 10GB defragment will work in 80 steps, but doesn't necessarily
commit metadata updates after each step, so more than 128MB of temporary
space may be used (especially if your disks are fast and empty,
and you start just after the end of the previous commit interval).
There are some opportunities to coalesce metadata updates, occupying up
to an (arbitrary) limit of 512MB of RAM (or until memory pressure forces
a flush, whichever comes first), but exploiting those opportunities
requires more space for uncommitted data.

If the filesystem starts to get low on space during a defrag, it can
inject commits to force metadata updates to happen more often, which
reduces the amount of temporary space needed (we can't delete the original
fragmented extents until their replacement extent is committed); however,
if the filesystem is so low on space that you're worried about running
out during a defrag, then you probably don't have big enough contiguous
free areas to relocate data into anyway, i.e. the defrag is just going to
push data from one fragmented location to a different fragmented location,
or bail out with "sorry, can't defrag that."
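
To make that batching concrete, here is a rough sketch in Python; every
name in it is hypothetical (this is not btrfs code), and it only
illustrates the policy described above:

    UNCOMMITTED_CAP = 512 * 1024 * 1024      # the arbitrary 512MB RAM limit

    def relocate(extents, fs):
        uncommitted = 0
        for ext in extents:                  # e.g. 80 x 128MB for a 10GB file
            fs.copy_to_new_location(ext)     # write the replacement extent
            uncommitted += fs.metadata_size(ext)
            # The original fragmented extents can't be freed until their
            # replacements are committed, so uncommitted work costs
            # temporary space.
            if uncommitted >= UNCOMMITTED_CAP or fs.free_space_low():
                fs.commit()                  # injected commit frees originals
                uncommitted = 0
        fs.commit()                          # final commit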

Nope.

Each defrag "cycle" consists of two parts:
    1) move-out part
    2) move-in part

The move-out part selects one contiguous area of the disk. Almost any area will do, but some smart choices are better. It then moves out all data from that contiguous area into whatever holes are left free elsewhere on the disk. The biggest problem is actually updating the metadata, since the updates are not localized.
Anyway, this part can even be skipped.

The move-in part now populates the completely free contiguous area with defragmented data.
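
A minimal sketch of one such cycle, in the same hypothetical Python
style as above (none of these helpers exist anywhere; they just name
the steps):

    def defrag_cycle(fs):
        # 1) move-out: evacuate one contiguous area into free holes
        #    elsewhere on the disk (this part may be skipped entirely).
        area = pick_contiguous_area(fs)          # a smart choice helps
        for frag in fragments_in(fs, area):
            move_to_free_hole(fs, frag)          # metadata updates dominate
        # 2) move-in: repopulate the now-free area with defragmented data.
        for f in most_fragmented_files(fs):
            if not fits_in(area, f):
                break
            write_contiguously(fs, area, f)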

In the case that the move-out part needs to be skipped, because the defrag estimates that the metadata update would be too big (as in the pathological case of a disk with 156 GB of metadata), it can still successfully defrag by performing only the move-in part. In that case, the move-in area is not free of data, so the "defragmented" data won't be fully defragmented. Also, there should be at least 20% free disk space in this case in order to avoid the defrag turning pathological.

But these are all pathological cases; they should be considered in some other discussion.

I know how to do this pathological case. Figured it out!

Yeah, always ask General Zed, he knows best!!!

The move-in phase is not a problem, because this phase generally affects only a small number of files.

So, let's consider the move-out phase. The main concern here is that the move-out area may contain so many different files and fragments that the move-out forces a practically undoable metadata update.

So, the way to do it is to select files for move-out one by one (or at even finer granularity, by file fragments), while keeping track of the size of the necessary metadata update. When the metadata update exceeds a certain amount (let's say 128 MB, an amount that can easily fit into RAM), the move-out is performed with only the currently selected files (file fragments). (The move-out often doesn't affect a whole file, since only a part of the file may lie within the move-out area.)
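
A rough sketch of that selection loop, again in hypothetical Python
(metadata_cost() is an assumed estimator of the metadata update that
relocating one fragment would require):

    METADATA_BUDGET = 128 * 1024 * 1024          # fits comfortably in RAM

    def select_moveout_batch(area_fragments, metadata_cost):
        # Accumulate fragments until relocating one more would push the
        # estimated metadata update over the budget; flush the batch,
        # then start another round on whatever is left in the area.
        batch, cost = [], 0
        for frag in area_fragments:
            c = metadata_cost(frag)
            if cost + c > METADATA_BUDGET and batch:
                break
            batch.append(frag)
            cost += c
        return batch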

Now the defrag has to decide whether to continue with another round of move-out to get a cleaner move-in area (by repeating the same procedure above), or to continue with a move-in into a partially dirty area. I can't tell you which is better right now, as this can be determined only by experiment.

Lastly, the move-in phase is performed (it can be done whether the move-in area is dirty or completely clean). Again, the same trick can be used: files are selected one by one until the calculated metadata update exceeds 128 MB. However, it is more likely that the move-in area will be exhausted before this happens.
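
(The same budgeted loop sketched above can drive the move-in phase as
well: feed it the defragmented-file candidates instead of the fragments
inside the move-out area, and stop early once the move-in area itself
runs out of room.)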

This algorithm will work even if you have only 3% free disk space left.

This algorithm will also work if the metadata is huge, but in that case it is better to have much more free disk space (20%) to avoid significantly slowing down the defrag operation.

I have just thought of an even better algorithm than this one, which gets to the fully-defragged state faster and in a smaller number of disk writes. But I won't write it down unless someone says: thanks for your effort so far, General Zed, and can you please tell us about your great new defrag algorithm for low free-space conditions.

