Re: Feature requests: online backup - defrag - change RAID level

General Zed Sun, 15 Sep 2019 11:06:07 -0700


Quoting Zygo Blaxell <ce3g8...@umail.furryterror.org>:

On Fri, Sep 13, 2019 at 09:28:49PM -0400, General Zed wrote:


Quoting Zygo Blaxell <ce3g8...@umail.furryterror.org>:

> On Fri, Sep 13, 2019 at 05:25:20AM -0400, General Zed wrote:
> >
> > Quoting General Zed <general-...@zedlx.com>:
> >
> > > Quoting Zygo Blaxell <ce3g8...@umail.furryterror.org>:
> > >
> > > > On Thu, Sep 12, 2019 at 08:26:04PM -0400, General Zed wrote:
> > > > >
> > > > > Quoting Zygo Blaxell <ce3g8...@umail.furryterror.org>:
> > > > >
> > > > > > On Thu, Sep 12, 2019 at 06:57:26PM -0400, General Zed wrote:
> > > > > > >
> > > > > > > At worst, it just has to completely write-out "all
> > > > > > metadata", all the way up
> > > > > > > to the super. It needs to be done just once, because
> > what's the point of
> > > > > > > writing it 10 times over? Then, the super is updated as
> > > > > > the final commit.
> > > > > >

> > > > > > This is kind of a silly discussion. The biggest extentpossible on

> > > > > > btrfs is 128MB, and the incremental gains of forcing 128MB
> > extents to
> > > > > > be consecutive are negligible.  If you're defragging a 10GB
> > file, you're
> > > > > > just going to end up doing 80 separate defrag operations.
> > > > >
> > > > > Ok, then the max extent is 128 MB, that's fine. Someone here
> > > > > previously said
> > > > > that it is 2 GB, so he has disinformed me (in order to further
> > his false
> > > > > argument).
> > > >

> > > > If the 128MB limit is removed, you then hit the block groupsize limit,

> > > > which is some number of GB from 1 to 10 depending on number of disks
> > > > available and raid profile selection (the striping raid profiles cap

> > > > block group sizes at 10 disks, and single/raid1 profilesalways use 1GB> > > > block groups regardless of disk count). So 2GB is _also_ avalid extent

> > > > size limit, just not the first limit that is relevant for defrag.
> > > >

> > > > A lot of people get confused by 'filefrag -v' output, whichcoalesces

> > > > physically adjacent but distinct extents.  So if you use that tool,

> > > > it can _seem_ like there is a 2.5GB extent in a file, butit is really

> > > > 20 distinct 128MB extents that start and end at adjacent addresses.
> > > > You can see the true structure in 'btrfs ins dump-tree' output.
> > > >

> > > > That also brings up another reason why 10GB defrags areabsurd on btrfs:

> > > > extent addresses are virtual.  There's no guarantee that a pair
> > of extents

> > > > that meet at a block group boundary are physicallyadjacent, and after> > > > operations like RAID array reorganization or free spacedefragmentation,

> > > > they are typically quite far apart physically.
> > > >
> > > > > I didn't ever said that I would force extents larger than 128 MB.
> > > > >
> > > > > If you are defragging a 10 GB file, you'll likely have to do it
> > > > > in 10 steps,
> > > > > because the defrag is usually allowed to only use a limited
> > amount of disk

> > > > > space while in operation. That has nothing to do with theextent size.

> > > >

> > > > Defrag is literally manipulating the extent size.Fragments and extents

> > > > are the same thing in btrfs.
> > > >
> > > > Currently a 10GB defragment will work in 80 steps, but doesn't
> > necessarily

> > > > commit metadata updates after each step, so more than 128MBof temporary

> > > > space may be used (especially if your disks are fast and empty,
> > > > and you start just after the end of the previous commit interval).

> > > > There are some opportunities to coalsce metadata updates,occupying up> > > > to a (arbitrary) limit of 512MB of RAM (or when memorypressure forces

> > > > a flush, whichever comes first), but exploiting those opportunities
> > > > requires more space for uncommitted data.
> > > >
> > > > If the filesystem starts to get low on space during a defrag, it can
> > > > inject commits to force metadata updates to happen more often, which
> > > > reduces the amount of temporary space needed (we can't delete
> > the original
> > > > fragmented extents until their replacement extent is committed);
> > however,

> > > > if the filesystem is so low on space that you're worriedabout running> > > > out during a defrag, then you probably don't have bigenough contiguous

> > > > free areas to relocate data into anyway, i.e. the defrag is just
> > going to
> > > > push data from one fragmented location to a different fragmented
> > location,
> > > > or bail out with "sorry, can't defrag that."
> > >
> > > Nope.
> > >
> > > Each defrag "cycle" consists of two parts:
> > >      1) move-out part
> > >      2) move-in part
> > >
> > > The move-out part select one contiguous area of the disk. Almost any
> > > area will do, but some smart choices are better. It then moves-out all

> > > data from that contiguous area into whatever holes there areleft empty

> > > on the disk. The biggest problem is actually updating the metadata,
> > > since the updates are not localized.
> > > Anyway, this part can even be skipped.
> > >

> > > The move-in part now populates the completely free contiguousarea with

> > > defragmented data.
> > >
> > > In the case that the move-out part needs to be skipped because the
> > > defrag estimates that the update to metatada will be too big (like in
> > > the pathological case of a disk with 156 GB of metadata), it can
> > > sucessfully defrag by performing only the move-in part. In that case,
> > > the move-in area is not free of data and "defragmented" data won't be
> > > fully defragmented. Also, there should be at least 20% free disk space
> > > in this case in order to avoid defrag turning pathological.
> > >

> > > But, these are all some pathological cases. They should beconsidered in

> > > some other discussion.
> >
> > I know how to do this pathological case. Figured it out!
> >
> > Yeah, always ask General Zed, he knows the best!!!
> >

> > The move-in phase is not a problem, because this phasegenerally affects a

> > low number of files.
> >
> > So, let's consider the move-out phase. The main concern here is that the
> > move-out area may contain so many different files and fragments that the
> > move-out forces a practically undoable metadata update.
> >

> > So, the way to do it is to select files for move-out, one byone (or even> > more granular, by fragments of files), while keeping track ofthe size of> > the necessary metadata update. When the metadata update exceedsa certain

> > amount (let's say 128 MB, an amount that can easily fit into RAM), the

> > move-out is performed with only currently selected files (filefragments).> > (The move-out often doesn't affect a whole file since only apart of each

> > file lies within the move-out area).
>
> This move-out phase sounds like a reinvention of btrfs balance.  Balance
> already does something similar, and python-btrfs gives you a script to
> target block groups with high free space fragmentation for balancing.
> It moves extents (and their references) away from their block group.
> You get GB-sized (or multi-GB-sized) contiguous free space areas into
> which you can then allocate big extents.

Perhaps btrfs balance needs to perform something similar, but I can assure
you that a balance cannot replace the defrag.


Correct, balance is only half of the solution.

The balance is required for two things on btrfs:  "move-out" phase of
free space defragmentation, and to ensure at least one unallocated block
group exists on the filesystem in case metadata expansion is required.

A btrfs can operate without defrag for...well, forever, defrag is not
necessary at all.  I have dozens of multi-year-old btrfs filesystems of
assorted sizes that have never run defrag even once.

By contrast, running out of unallocated space is a significant problem
that should be corrected with the same urgency as RAID entering degraded
mode.  I generally recommend running 'btrfs balance start -dlimit=1' about
once per day to force one block group to always be empty.

Filesystems that don't maintain unallocated space can run into problems
if metadata runs out of space.  These problems can be inconvenient to
recover from.

The point and the purpose of "move out" is to create a clean contiguous free
space area, so that defragmented files can be written into it.

> > Now the defrag has to decide: whether to continue with anotherround of the
> > move-out to get a cleaner move-in area (by repeating the same procedure
> > above), or should it continue with a move-in into a partialydirty area. I> > can't tell you what's better right now, as this can bedetermined only by
> > experiments.
> >
> > Lastly, the move-in phase is performed (can be done whether themove-in area> > is dirty or completely clean). Again, the same trick can beused: files can> > be selected one by one until the calculated metadata updateexceeds 128 MB.> > However, it is more likely that the size of move-in area willbe exhausted
> > before this happens.
> >
> > This algorithm will work even if you have only 3% free disk space left.
>
> I was thinking more like "you have less than 1GB free on a 1TB filesystem
> and you want to defrag 128MB things", i.e. <0.1% free space.  If you don't
> have all the metadata block group free space you need allocated already
> by that point, you can run out of metadata space and the filesystem goes
> read-only.  Happens quite often to people.  They don't like it very much.

The defrag should abort whenever it detects such adverse conditions as 0.1%
free disk space. In fact, it should probably abort as soon as it detects
less than 3% free disk space. This is normal and expected. If the user has a
partition with less than 3% free disk space, he/she should not defrag it
until he/she frees some space, perhaps by deleting unnecessary data or by
moving out some data to other partitions.


3% of 45TB is 1.35TB...seems a little harsh.  Recall no extent can be
larger than 128MB, so we're talking about enough space for ten thousand
of defrag's worst-case output extents.  A limit based on absolute numbers
might make more sense, though the only way to really know what the limit is
on any given filesystem is to try to reach it.


Nah.

The free space minimum limit must, unfortunately, be based on absolutepercentages. There is no better way. The problem is that, in order fordefrag to work, it has to (partially) consolidate some of the freespace, in order to produce a contiguous free area which will be thedestination for defrag data.

In order to be able to produce this contiguous free space area, it isof utmost importance that there is sufficient free space left on thepartition. Otherwise, this free space consolidation operation willtake too much time (too much disk I/O). There is no good way around itthe common cases of free space fragmentation.

If you reduce the free space minimum limit below 3%, you are likely tospend 2x more I/O in consolidating free space than what is needed toactually defrag the data. I mean, the defrag will still work, but Ithink that the slowdown is unacceptable.

I mean, the user should just free some space! The filesystems shouldnot be left with less than 10% free space, that's simply badmanagement from the user's part, and the user should accept theconsequences.

Re: Feature requests: online backup - defrag - change RAID level

Reply via email to