Quoting Zygo Blaxell <ce3g8...@umail.furryterror.org>:
> On Fri, Sep 13, 2019 at 05:25:20AM -0400, General Zed wrote:
> >
> > Quoting General Zed <general-...@zedlx.com>:
> >
> > > Quoting Zygo Blaxell <ce3g8...@umail.furryterror.org>:
> > >
> > > > On Thu, Sep 12, 2019 at 08:26:04PM -0400, General Zed wrote:
> > > > >
> > > > > Quoting Zygo Blaxell <ce3g8...@umail.furryterror.org>:
> > > > >
> > > > > > On Thu, Sep 12, 2019 at 06:57:26PM -0400, General Zed wrote:
> > > > > > >
> > > > > > > At worst, it just has to completely write out "all metadata",
> > > > > > > all the way up to the super. It needs to be done just once,
> > > > > > > because what's the point of writing it 10 times over? Then,
> > > > > > > the super is updated as the final commit.
> > > > > >
> > > > > > This is kind of a silly discussion.  The biggest extent possible
> > > > > > on btrfs is 128MB, and the incremental gains of forcing 128MB
> > > > > > extents to be consecutive are negligible.  If you're defragging
> > > > > > a 10GB file, you're just going to end up doing 80 separate
> > > > > > defrag operations.
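
For scale, that 80 falls straight out of the extent size cap; a quick
sanity check in Python:

    # Number of maximum-size extents needed to cover a 10 GiB file at the
    # 128 MiB extent size cap.
    MAX_EXTENT = 128 * 1024 ** 2            # 128 MiB
    FILE_SIZE = 10 * 1024 ** 3              # 10 GiB
    print(-(-FILE_SIZE // MAX_EXTENT))      # ceiling division -> 80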
> > > > >
> > > > > Ok, then the max extent is 128 MB, that's fine. Someone here
> > > > > previously said that it is 2 GB, so he misinformed me (in order
> > > > > to further his false argument).
> > > >
> > > > If the 128MB limit is removed, you then hit the block group size
> > > > limit, which is some number of GB from 1 to 10 depending on number
> > > > of disks available and raid profile selection (the striping raid
> > > > profiles cap block group sizes at 10 disks, and single/raid1
> > > > profiles always use 1GB block groups regardless of disk count).
> > > > So 2GB is _also_ a valid extent size limit, just not the first
> > > > limit that is relevant for defrag.
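
A sketch of how I read those limits (assuming 1 GiB per device stripe and
the 10-disk striping cap described above; the function and the numbers are
illustrative, not kernel code):

    # Rough sketch of btrfs block group size caps as described above.
    GiB = 1024 ** 3

    def block_group_cap(profile, num_disks):
        if profile in ("single", "raid1"):
            return 1 * GiB                  # always 1 GiB, any disk count
        stripes = min(num_disks, 10)        # striping capped at 10 disks
        return stripes * GiB                # 1 GiB per stripe, up to 10 GiB

    print(block_group_cap("raid0", 2) // GiB)   # -> 2 (hence a "2 GB" limit)
    print(block_group_cap("raid0", 16) // GiB)  # -> 10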
> > > >
> > > > A lot of people get confused by 'filefrag -v' output, which
> > > > coalesces physically adjacent but distinct extents.  So if you use
> > > > that tool, it can _seem_ like there is a 2.5GB extent in a file,
> > > > but it is really 20 distinct 128MB extents that start and end at
> > > > adjacent addresses.  You can see the true structure in 'btrfs ins
> > > > dump-tree' output.
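
To make the coalescing concrete, here is a toy illustration of that
display-level merging (my own model, not filefrag's actual code):

    # 20 distinct 128 MiB extents that happen to be physically adjacent
    # collapse into one displayed run, the way 'filefrag -v' shows them.
    MiB = 1024 ** 2
    extents = [(i * 128 * MiB, 128 * MiB) for i in range(20)]  # (start, len)

    merged = []
    for start, length in extents:
        if merged and merged[-1][0] + merged[-1][1] == start:
            merged[-1][1] += length         # adjacent: extend previous run
        else:
            merged.append([start, length])

    print(len(extents), "on-disk extents shown as", len(merged),
          "run of", merged[0][1] // MiB, "MiB")   # 20 ... 1 ... 2560 MiB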
> > > >
> > > > That also brings up another reason why 10GB defrags are absurd on
> > > > btrfs:  extent addresses are virtual.  There's no guarantee that a
> > > > pair of extents that meet at a block group boundary are physically
> > > > adjacent, and after operations like RAID array reorganization or
> > > > free space defragmentation, they are typically quite far apart
> > > > physically.
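
The virtual addressing can be pictured with a toy chunk map (all numbers
made up; the point is only that logically adjacent block groups need not
be anywhere near each other on disk):

    # Toy model of btrfs logical -> physical extent address translation.
    GiB = 1024 ** 3
    chunk_map = [                           # (logical, length, physical)
        (0 * GiB, 1 * GiB, 57 * GiB),       # block group A
        (1 * GiB, 1 * GiB, 3 * GiB),        # block group B, logically next
    ]

    def physical(logical):
        for lstart, length, pstart in chunk_map:
            if lstart <= logical < lstart + length:
                return pstart + (logical - lstart)
        raise ValueError("unmapped logical address")

    # Last block of A and first block of B: adjacent logically,
    # tens of GiB apart physically.
    print(physical(1 * GiB - 4096), physical(1 * GiB))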
> > > >
> > > > > I never said that I would force extents larger than 128 MB.
> > > > >
> > > > > If you are defragging a 10 GB file, you'll likely have to do it
> > > > > in 10 steps, because the defrag is usually allowed to use only a
> > > > > limited amount of disk space while in operation. That has nothing
> > > > > to do with the extent size.
> > > >
> > > > Defrag is literally manipulating the extent size.  Fragments and
> > > > extents are the same thing in btrfs.
> > > >
> > > > Currently a 10GB defragment will work in 80 steps, but doesn't
> > > > necessarily commit metadata updates after each step, so more than
> > > > 128MB of temporary space may be used (especially if your disks are
> > > > fast and empty, and you start just after the end of the previous
> > > > commit interval).  There are some opportunities to coalesce metadata
> > > > updates, occupying up to an (arbitrary) limit of 512MB of RAM (or
> > > > when memory pressure forces a flush, whichever comes first), but
> > > > exploiting those opportunities requires more space for uncommitted
> > > > data.
> > > >
> > > > If the filesystem starts to get low on space during a defrag, it
> > > > can inject commits to force metadata updates to happen more often,
> > > > which reduces the amount of temporary space needed (we can't delete
> > > > the original fragmented extents until their replacement extent is
> > > > committed); however, if the filesystem is so low on space that
> > > > you're worried about running out during a defrag, then you probably
> > > > don't have big enough contiguous free areas to relocate data into
> > > > anyway, i.e. the defrag is just going to push data from one
> > > > fragmented location to a different fragmented location, or bail
> > > > out with "sorry, can't defrag that."
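
Schematically, the commit policy described there looks something like this
(a runnable toy with made-up thresholds and a crude space model, not
kernel code):

    # Batch metadata updates up to a RAM budget, but inject commits sooner
    # when free space runs low.
    MiB = 1024 ** 2
    RAM_BUDGET = 512 * MiB                  # coalesce up to 512 MiB
    STEP = 128 * MiB                        # one step = one 128 MiB extent

    def defrag(file_bytes, free_bytes, low_water):
        commits = pending = 0
        for _ in range(-(-file_bytes // STEP)):
            pending += STEP                 # new extent written; old extent
            free_bytes -= STEP              # still pinned until next commit
            if pending >= RAM_BUDGET or free_bytes < low_water:
                free_bytes += pending       # commit releases replaced extents
                pending, commits = 0, commits + 1
        return commits + 1                  # final commit

    print(defrag(10 * 1024 ** 3, 2 * 1024 ** 3, 1 * 1024 ** 3))  # -> 21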
> > >
> > > Nope.
> > >
> > > Each defrag "cycle" consists of two parts:
> > > 1) move-out part
> > > 2) move-in part
> > >
> > > The move-out part selects one contiguous area of the disk. Almost any
> > > area will do, but some smart choices are better. It then moves all
> > > data out of that contiguous area into whatever holes are left empty
> > > on the disk. The biggest problem is actually updating the metadata,
> > > since the updates are not localized. Anyway, this part can even be
> > > skipped.
> > >
> > > The move-in part then populates the completely free contiguous area
> > > with defragmented data.
> > >
> > > In the case that the move-out part needs to be skipped because the
> > > defrag estimates that the metadata update will be too big (as in the
> > > pathological case of a disk with 156 GB of metadata), it can still
> > > defrag successfully by performing only the move-in part. In that
> > > case, the move-in area is not free of data, so the "defragmented"
> > > data won't be fully defragmented. Also, there should be at least 20%
> > > free disk space in this case in order to avoid the defrag turning
> > > pathological.
> > >
> > > But these are all pathological cases. They should be considered in
> > > some other discussion.
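
In pseudocode, the cycle I described is something like the following
runnable toy (the disk is a list of block owners, None meaning free; every
policy choice is a placeholder, and it assumes enough holes exist outside
the chosen area; the optional skip of the move-out part is left out):

    def defrag_cycle(disk, area):
        lo, hi = area
        # Part 1: move-out -- evacuate the area into holes elsewhere.
        for i in range(lo, hi):
            if disk[i] is not None:
                hole = next(j for j in range(len(disk))
                            if disk[j] is None and not lo <= j < hi)
                disk[hole], disk[i] = disk[i], None
        # Part 2: move-in -- pack each file's blocks contiguously into
        # the now-free area.
        pos = lo
        for f in sorted({b for b in disk if b is not None}):
            for i in [i for i, b in enumerate(disk) if b == f]:
                if pos < hi:
                    disk[i], disk[pos] = None, f
                    pos += 1
        return disk

    disk = ["A", None, "B", "A", None, "B", None, None]
    print(defrag_cycle(disk, (0, 4)))
    # -> ['A', 'A', 'B', 'B', None, None, None, None]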
> >
> > I know how to do this pathological case. Figured it out!
> >
> > Yeah, always ask General Zed, he knows best!!!
> >
> > The move-in phase is not a problem, because this phase generally
> > affects a low number of files.
> >
> > So, let's consider the move-out phase. The main concern here is that the
> > move-out area may contain so many different files and fragments that the
> > move-out forces a practically undoable metadata update.
> >
> > So, the way to do it is to select files for move-out, one by one (or
> > even more granularly, by fragments of files), while keeping track of
> > the size of the necessary metadata update. When the metadata update
> > exceeds a certain amount (let's say 128 MB, an amount that can easily
> > fit into RAM), the move-out is performed with only the currently
> > selected files (file fragments). (The move-out often doesn't affect a
> > whole file, since only a part of each file lies within the move-out
> > area.)
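
As a sketch of that selection loop (the callbacks and the cost model are
placeholders; only the budgeting logic is the point):

    META_BUDGET = 128 * 1024 ** 2           # flush at ~128 MiB of updates

    def move_out_batched(fragments, metadata_cost, relocate_and_commit):
        batch, cost = [], 0
        for frag in fragments:              # fragments in the move-out area
            c = metadata_cost(frag)         # metadata this move would dirty
            if batch and cost + c > META_BUDGET:
                relocate_and_commit(batch)  # move out just this batch
                batch, cost = [], 0
            batch.append(frag)
            cost += c
        if batch:
            relocate_and_commit(batch)      # final partial batch

    # Toy usage: 1000 fragments, each dirtying ~1 MiB of metadata,
    # get moved out in batches of at most 128.
    move_out_batched(range(1000), lambda f: 1024 ** 2,
                     lambda b: print("moved out", len(b), "fragments"))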
>
> This move-out phase sounds like a reinvention of btrfs balance. Balance
> already does something similar, and python-btrfs gives you a script to
> target block groups with high free space fragmentation for balancing.
> It moves extents (and their references) away from their block group.
> You get GB-sized (or multi-GB-sized) contiguous free space areas into
> which you can then allocate big extents.
Perhaps btrfs balance needs to perform something similar, but I can assure
you that a balance cannot replace a defrag.
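
To spell out why: the balance-based approach boils down to relocating
chosen block groups, which can be driven with the stock vrange balance
filter, roughly like this (the block group selection heuristic, which is
what python-btrfs supplies, is the hypothetical part here, and the numbers
in the usage comment are made up):

    import subprocess

    def balance_one_block_group(mountpoint, vaddr, length):
        # Relocate exactly the block group at virtual address vaddr.
        vrange = "%d..%d" % (vaddr, vaddr + length)
        subprocess.run(["btrfs", "balance", "start",
                        "-dvrange=" + vrange, mountpoint], check=True)

    # balance_one_block_group("/mnt", 298844160000, 1 * 1024 ** 3)

That frees up large contiguous areas, but it doesn't make any file's
extents larger or adjacent, which is why it is not a defrag.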